I started using OpenMP several months ago, and in my tests I found that it does NOT scale linearly on my computer, even for a simple matrix-vector multiplication (MVX). I can neither find a mistake in my parallelization (code, compilation, or run settings) nor get efficient results, i.e., linear speedup, where OpenMP with 2 threads runs twice as fast as OpenMP with 1 thread, and so on.

So I kindly ask all of you to share the following information with me (for either FORTRAN or C programs), if possible:

1) Do you get linear speedups (or better)?

-- If the answer is YES, would you please share the results? What is wrong with my implementation?

-- If the answer is NO, do you have a logical explanation for that?

2) Would you please run the following code on your own platforms and send me feedback about the results? (I want to check whether this behavior is system dependent.)

So that our results are comparable, please run the MVX for the following sizes, and DO NOT FORGET TO REPORT YOUR SYSTEM SPECS:

i) Square Matrix of Size 1000*1000

ii) Square Matrix of Size 10,000*10,000

iii) Matrix of Size 5000*15000

iv) Matrix of Size 15000*5000

v) Matrix of Size 300*2000

I am looking forward to hearing from you. I really need your kind help and these results, as my research topic is closely related to OpenMP implementations!

The following two links (images) show the output of my runs:

Image 1: MVX Run Time vs Matrix Elements

link: http://www.flickr.com/photos/93385967@N ... hotostream

Image 2: MVX Run Time vs Matrix Elements

link: http://www.flickr.com/photos/93385967@N ... hotostream

You can find my code below; the matrix-vector multiplication part is the one that matters to me:

```fortran
      program main
      INCLUDE "omp_lib.h"
      integer i,j,k, h, m, n, q, proc_num, thread_num
      real*8 t1, t2
      real*8 time
      real, allocatable :: a(:,:)
      real, allocatable :: b(:,:)
      real, allocatable :: x(:)
      real, allocatable :: y(:)

      n = 10000
      m = 10000
      DO WHILE (m.LT.100001)
        DO WHILE (n.LT.100001)
          allocate(a(m,n))
          allocate(b(n,m))
          allocate(x(n))
          allocate(y(m))
          h = 1
          DO WHILE (h.LT.9)
            call omp_set_num_threads(h)
            print*,'The Thread numbers is set to :',h
            proc_num = omp_get_num_procs ( )
            thread_num = omp_get_max_threads ( )
            print*, ' Compute matrix vector multiplications y = A*x.'
            print*, ' The number of processors available = ', proc_num
            print*, ' The number of threads available = ', thread_num
C
C Set the matrix A.
C i and j are loop indices, so both must be private.
C
!$omp parallel
!$omp& shared (a)
!$omp& private (i,j)
!$omp do
            do i = 1, m
              do j = 1, n
                a(i,j) = (10*i+j)
              end do
            end do
!$omp end do
!$omp end parallel
C TRANSPOSE A
!$omp parallel
!$omp& shared (b,a)
!$omp& private (i,j)
!$omp do
            do i = 1, n
              do j = 1, m
                b(i,j) = a(j,i)
              end do
            end do
!$omp end do
!$omp end parallel
C
C Set the Vector x
C
!$omp parallel
!$omp& shared (x)
!$omp& private (i)
!$omp do
            do i = 1, n
              x(i) = i
            end do
!$omp end do
!$omp end parallel
C
C Initialization
C
!$omp parallel
!$omp& shared (y)
!$omp& private (i)
!$omp do
            do i = 1, m
              y(i) = 0.0
            end do
!$omp end do
!$omp end parallel
C #######################################
C
C Matrix Vector Multiplication Part
C
C #######################################
            t1 = OMP_GET_WTIME()
!$omp parallel
!$omp& shared (b,x)
!$omp& private (i,j)
!$omp do reduction(+:y)
            do i = 1, m
              y(i) = 0.0
              do j = 1, n
                y(i) = y(i) + b(j,i) * x(j)
              end do
            end do
!$omp end do
!$omp end parallel
            t2 = OMP_GET_WTIME()
C Elapsed wall-clock time in milliseconds
            time = (t2-t1)*1000
            print*, m, n, time
            h = h*2
          END DO
          deallocate(a)
          deallocate(b)
          deallocate(x)
          deallocate(y)
          n = n + 4000
        END DO
        n = 10000
        m = m + 4000
      END DO
      stop
      end
```

To compile and run this FORTRAN code, I use the following commands; the output for one of the steps is shown below:

```
[mahdi@hpcn00 MVX]$ ifort -openmp -o MVX-fortran mahdi-MVX.f
[mahdi@hpcn00 MVX]$ ./MVX-fortran
The Thread numbers is set to : 1
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 1
10000 10000 121.527910232544
The Thread numbers is set to : 2
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 2
10000 10000 72.3700523376465
The Thread numbers is set to : 4
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 4
10000 10000 64.0790462493896
The Thread numbers is set to : 8
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 8
10000 10000 63.7650489807129
The Thread numbers is set to : 1
```
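
To make the shortfall concrete, the speedup and parallel efficiency implied by the 10000*10000 timings above can be worked out as below. This is only a sketch using the usual definitions; the symbols T(p) (run time in ms with p threads), S(p) (speedup), and E(p) (efficiency) are notation introduced here, not program output, and all values are rounded.

```latex
\begin{align*}
S(p) &= \frac{T(1)}{T(p)}, & E(p) &= \frac{S(p)}{p}, \\
S(2) &= \frac{121.53}{72.37} \approx 1.68, & E(2) &\approx 0.84, \\
S(4) &= \frac{121.53}{64.08} \approx 1.90, & E(4) &\approx 0.47, \\
S(8) &= \frac{121.53}{63.77} \approx 1.91, & E(8) &\approx 0.24.
\end{align*}
```

Linear scaling would mean S(p) = p and E(p) = 1. Here the run time barely changes between 4 and 8 threads.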

The platform I run my tests on has the following specs:

OS: Scientific Linux 5.7 - Kernel: 2.6.18

Compiler : Intel parallel studio XE 2013

CPU : 2 quad-core Intel Xeon 5345 (8 cores)

RAM : 32 GB

You are also welcome to contact me via the following email address:

kazempour[at]ee[dot]bilkent[dot]edu[dot]tr

Please don't hesitate to share any information with me.

Best Wishes and Regards,

Mahdi