Matrix Vector Multiplication Contest - Please Participate!


Dear All,

I started using OpenMP several months ago, and during my tests I found that OpenMP does NOT scale linearly on my computer, even for a simple matrix-vector multiplication (MVX). I can neither find a mistake in my parallelization (i.e., code/compile/run...) nor get efficient results (i.e., linear speed-up: OpenMP with 2 threads should run twice as fast as OpenMP with 1 thread, and so on...).

So I kindly ask all of you to share the following information with me (for either Fortran or C programs), if possible:
1) Do you get linear speed-ups (or better)?
-- If the answer is YES, would you please share your results? What is wrong with my implementation?
-- If the answer is NO, do you have a logical explanation for it?

2) Would you please run the following code on your own platforms and send me feedback about the results? (I want to check whether the behaviour is system dependent.)
So that our results are comparable, please run the MVX for the following sizes, and do NOT FORGET TO REPORT YOUR SYSTEM SPECS:
i) Square matrix of size 1000*1000
ii) Square matrix of size 10,000*10,000
iii) Matrix of size 5000*15000
iv) Matrix of size 15000*5000
v) Matrix of size 300*2000
I am looking forward to hearing from you. I really need your kind help, and these results, as my research topic is closely related to OpenMP implementations!

The following two links (images) show the output of my runs:
Image 1: MVX Run Time vs Matrix Elements

Image 2: MVX Run Time vs Matrix Elements

You can find my code below; the matrix-vector multiplication part is the one that matters to me:
Code:
      program main
      INCLUDE "omp_lib.h"

      integer  i, j, k, h, m, n, q, proc_num, thread_num
      real*8 t1, t2
      real*8 time
      real, allocatable :: a(:,:)
      real, allocatable :: b(:,:)
      real, allocatable :: x(:)
      real, allocatable :: y(:)

      n = 10000
      m = 10000

      DO WHILE (m.LT.100001)
      DO WHILE (n.LT.100001)

      allocate(a(m,n))
      allocate(b(n,m))
      allocate(x(n))
      allocate(y(m))

      h = 1
      DO WHILE (h.LT.9)

      call omp_set_num_threads(h)
      print*,'The Thread numbers is set to :',h
      proc_num = omp_get_num_procs ( )
      thread_num = omp_get_max_threads ( )

      print*, ' Compute matrix vector multiplications y = A*x.'
      print*, '  The number of processors available = ', proc_num
      print*, '  The number of threads available    = ', thread_num
C
C     Set the matrix A.
C
!$omp parallel
!$omp& shared (a,j)
!$omp& private (i)
!$omp do
      do i = 1, m
      do j = 1, n
       a(i,j) = (10*i+j)
      end do
      end do
!$omp end do
!$omp end parallel
C
C     TRANSPOSE A
!$omp parallel
!$omp& shared (b,a,j)
!$omp& private (i)
!$omp do
      do i = 1, n
      do j = 1, m
       b(i,j) = a(j,i)
      end do
      end do
!$omp end do
!$omp end parallel
C
C     Set the vector x
C
!$omp parallel
!$omp& shared (x)
!$omp& private (i)
!$omp do
      do i = 1, n
       x(i) = i
      end do
!$omp end do
!$omp end parallel
C
C     Initialization
C
!$omp parallel
!$omp& shared (y)
!$omp& private (i)
!$omp do
      do i = 1, m
       y(i) = 0.0
      end do
!$omp end do
!$omp end parallel
C #######################################
C
C     Matrix Vector Multiplication Part
C
C #######################################
      t1 = OMP_GET_WTIME()
!$omp parallel
!$omp& shared (b,x)
!$omp& private (i,j)
!$omp do reduction(+:y)
      do i = 1, m
      y(i) = 0.0
      do j = 1, n
       y(i) = y(i) + b(j,i) * x(j)
      end do
      end do
!$omp end do
!$omp end parallel
      t2 = OMP_GET_WTIME()
      time = (t2-t1)*1000
      print*, m, n, time
      h = h*2
      END DO
      deallocate(a)
      deallocate(b)
      deallocate(x)
      deallocate(y)
      n = n + 4000
      END DO
      n = 10000
      m = m + 4000
      END DO
      stop
      end

To run this Fortran code I use the following commands; the output for one of the steps is shown below:
[mahdi@hpcn00 MVX]$ ifort -openmp -o MVX-fortran mahdi-MVX.f
[mahdi@hpcn00 MVX]$ ./MVX-fortran
The Thread numbers is set to : 1
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 1
10000 10000 121.527910232544
The Thread numbers is set to : 2
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 2
10000 10000 72.3700523376465
The Thread numbers is set to : 4
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 4
10000 10000 64.0790462493896
The Thread numbers is set to : 8
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 8
10000 10000 63.7650489807129
The Thread numbers is set to : 1

The platform I am testing on has the following specs:
OS: Scientific Linux 5.7 - Kernel: 2.6.18
Compiler: Intel Parallel Studio XE 2013

CPU : 2 quad-core Intel Xeon 5345 (8 cores)
RAM : 32 GB

You are also welcome to contact me via the following email address:
kazempour[at]ee[dot]bilkent[dot]edu[dot]tr

Please don't hesitate to share any information with me.

Best Wishes and Regards,
Mahdi
kazempour

Posts: 13
Joined: Wed Jul 25, 2012 4:11 am

Re: Matrix Vector Multiplication Contest - Please Participate!

Hi Mahdi,

I'm not going to participate, but here are a few observations:

Matrix-vector multiplication is very memory bandwidth intensive, so scalability is typically limited by hardware contention, regardless of how well the code is written.

NUMA effects (i.e. the location of data in main memory) can be important. The way your code is written, the initialisation of the arrays, although inside a parallel region, is always executed on one thread the first time through. This means (assuming Linux is using its default first-touch allocation policy) that all the data will be allocated in the memory of one socket, which makes the memory bandwidth bottleneck worse. You should try removing the h loop and running a separate instance of the executable for each thread count. You should also invert the loop order in the loop which initialises b, so that the code traverses memory in linear order.

In the matrix-vector multiplication loop, y does not need to be a reduction variable as different threads are accessing different elements.

Declaring sequential loop indices as shared is actually a bug (though it may not matter if the compiler never stores these values in memory).

Hope that helps,
Mark.
MarkB

Posts: 559
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Matrix Vector Multiplication Contest - Please Participate!

Dear Mark,

Thank you very much for your kind suggestions, but there are some issues I would like to point out:

1) The code you see here is a somewhat simplified version of my past attempts at MVX.
-- About the h loop: the plots in the links above were generated without the h loop (i.e., using export OMP_NUM_THREADS=n); there is only a slight difference between the two methods.
-- About memory access: please note that in Fortran, column-wise memory access is more efficient. That is why I transpose the matrix first and then compute the MVX.
2) I used a reduction for "y" even though I know it is not strictly needed; I was worried that during the run the "y" array might be contended between threads and cause some inefficiency, so I tried to stay away from that problem. (Please note that adding/removing this clause does NOT change the slope.)
3) Would you please explain a little more about the following remark:
Declaring sequential loop indices as shared is actually a bug (though it may not matter if the compiler never stores these values in memory).

Finally, I thank you very much again for spending time on this problem.

Best Wishes,
Mahdi
kazempour


Re: Matrix Vector Multiplication Contest - Please Participate!

Hello Mahdi,

I have been facing the same problem. My matrix-vector multiplication C code is below; the only difference is that I used pointers, and still the result is not what I expected. Doubling the number of threads should roughly halve the run time.

Currently I am using the Intel C compiler and ran my tests on 1, 2, 4, and 8 threads. The timing results are:

bariscan@LILA:~/Desktop/openmp_test$ icc -openmp MVX-pointer-C-1.c -o MVM
bariscan@LILA:~/Desktop/openmp_test$ ./MVM
n=10000, m=10000, t=6006.625175
bariscan@LILA:~/Desktop/openmp_test$ ./MVM
n=10000, m=10000, t=3150.537968
bariscan@LILA:~/Desktop/openmp_test$ ./MVM
n=10000, m=10000, t=2252.232075
bariscan@LILA:~/Desktop/openmp_test$ ./MVM
n=10000, m=10000, t=2339.959145

Compared with 1 thread, the 2-thread timing is almost half, but when I increase to 4 threads the scaling loses its linearity.

Sincerely,

Bariscan

Code:
//**************************************
// Name: Matrix Vector (MATRIX) Multiplication Using Pointers
// BY: BARISCAN KARAOSMANOGLU
//**************************************
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    //int num_threads = 4;
    void omp_set_num_threads(int num_threads);
    int omp_get_num_threads(void);
    double start;
    double end;
    int **a, **b, **c;
    int n, m, i, j, p, q, k;

    n = 10000;
    m = 10000;

    //for(n=10;n<1001;n*=10)
    //{
    //for(m=10;m<1001;m*=10)
    //{
    a = (int **)malloc( n * sizeof(int *) );
    for ( i = 0 ; i < n ; ++i ) {
        a[i] = (int *)malloc( m * sizeof(int) );
    }

    b = (int **)malloc( m * sizeof(int *) );
    for ( i = 0 ; i < m ; ++i ) {
        b[i] = (int *)malloc( 1 * sizeof(int) );
    }

    c = (int **)malloc( n * sizeof(int *) );
    for ( i = 0 ; i < n ; ++i ) {
        c[i] = (int *)malloc( 1 * sizeof(int) );
    }

    //int a[n][m],b[m][1],c[n][1];
    int *pt, *pt1, *pt2;

    for(i=0;i<n;i++)
    {
        for(j=0;j<m;j++)
        {
            a[i][j] = i+2*j+5;
        }
    }
    //printf("%d   ",a[1][1]);

    for(i=0;i<n;i++)
    {
        b[i][0] = 3;
    }
    //printf("%d   ",b[3][0]);

    pt  = &a[0][0];
    pt1 = &b[0][0];
    pt2 = &c[0][0];

    start = omp_get_wtime();
    for (k=1; k<101; k++) {

        #pragma omp parallel default(shared) private(i,j)
        {
            #pragma omp for nowait
            for(i=0;i<n;i++)
            {
                *(pt2+(i)) = 0;
                for(j=0;j<m;j++)
                {
                    *(pt2+(i)) += *(pt+(i*m+j)) * *(pt1+(j));
                }
            }
        }
    }
    end = omp_get_wtime();
    printf("n=%d,   m=%d,   t=%f\n", n, m, (end-start)*1000);
    //printf("Work took %f milli sec. time.\n", (end-start)*1000);
    //free( a );
    //free(b);
    //free(c);
    //}
    //}
    return 0;
}
karaosmanoglu

Posts: 1
Joined: Wed Feb 20, 2013 6:47 am

Re: Matrix Vector Multiplication Contest - Please Participate!

Dear Bariscan,

Thank you very much for participating in this thread. I've also tried your code, and you can see the results:

[mahdi@hpcn05 MVX]$ icc -openmp bariscan-mvx.c -o MVM
[mahdi@hpcn05 MVX]$ ./MVM
n=10000, m=10000, t=14302.963972
[mahdi@hpcn05 MVX]$ ./MVM
n=10000, m=10000, t=8116.252184
[mahdi@hpcn05 MVX]$ ./MVM
n=10000, m=10000, t=6580.485106
[mahdi@hpcn05 MVX]$ ./MVM
n=10000, m=10000, t=6311.797857

As is obvious, the results are not satisfying at all!
Besides, there is an important point I have to mention: your code is somewhat more complicated, and there is a contradiction here. The aim of OpenMP is to make parallelization easy, not to require more complicated routines for such a simple MVX. (It is worth mentioning that I have also tried a tiled (blocked) MVX, but performing MVX is not the only task of a scientific program, so I came back to a VERY SIMPLE NESTED LOOP and focused on its efficiency.)

However, I really appreciate your effort. Thanks again...

Regards,
Mahdi
kazempour


Re: Matrix Vector Multiplication Contest - Please Participate!

kazempour wrote:1) The code you see here is a somewhat simplified version of my past attempts at MVX.
-- About the h loop: the plots in the links above were generated without the h loop (i.e., using export OMP_NUM_THREADS=n); there is only a slight difference between the two methods.
-- About memory access: please note that in Fortran, column-wise memory access is more efficient. That is why I transpose the matrix first and then compute the MVX.

It is absolutely critical on multi-socket systems to get the data distribution in main memory right. In this case it is the array b that really matters.
In the initialisation of b, the loop nest traverses a in the right order but b in the wrong one, and it is this loop which, under a first-touch policy, determines where in main memory the pages of b get allocated. Here are some quick test results on a 4-socket, 64-core AMD Interlagos machine:

Code:
  do j = 1, m
    do i = 1, n
      b(i,j) = a(j,i)
    end do
  end do

The number of processors available = 64
The number of threads available = 1
10000 10000 79.47206497192383
The number of processors available = 64
The number of threads available = 2
10000 10000 103.5268306732178
The number of processors available = 64
The number of threads available = 4
10000 10000 83.90808105468750
The number of processors available = 64
The number of threads available = 8
10000 10000 47.27816581726074
The number of processors available = 64
The number of threads available = 16
10000 10000 53.04503440856934
The number of processors available = 64
The number of threads available = 32
10000 10000 58.20178985595703
The number of processors available = 64
The number of threads available = 64
10000 10000 58.41898918151855

Initialise b in parallel, "wrong" loop order:

Code:
!$omp parallel do
  do i = 1, n
    do j = 1, n
      b(i,j) = a(j,i)
    end do
  end do

Better, but still pretty bad scaling:

The number of processors available = 64
The number of threads available = 1
10000 10000 79.51188087463379
The number of processors available = 64
The number of threads available = 2
10000 10000 73.86612892150879
The number of processors available = 64
The number of threads available = 4
10000 10000 40.90499877929688
The number of processors available = 64
The number of threads available = 8
10000 10000 21.40188217163086
The number of processors available = 64
The number of threads available = 16
10000 10000 26.78894996643066
The number of processors available = 64
The number of threads available = 32
10000 10000 15.28501510620117
The number of processors available = 64
The number of threads available = 64
10000 10000 15.73491096496582

Initialise b in parallel, "right" loop order:

Code:
!$omp parallel do
  do j = 1, n
    do i = 1, n
      b(i,j) = a(j,i)
    end do
  end do

Much better scaling (at least up to 16 cores):

The number of processors available = 64
The number of threads available = 1
10000 10000 77.93092727661133
The number of processors available = 64
The number of threads available = 2
10000 10000 39.24703598022461
The number of processors available = 64
The number of threads available = 4
10000 10000 19.68216896057129
The number of processors available = 64
The number of threads available = 8
10000 10000 10.03599166870117
The number of processors available = 64
The number of threads available = 16
10000 10000 5.902051925659180
The number of processors available = 64
The number of threads available = 32
10000 10000 5.238056182861328
The number of processors available = 64
The number of threads available = 64
10000 10000 5.841970443725586

2) I used a reduction for "y" even though I know it is not strictly needed; I was worried that during the run the "y" array might be contended between threads and cause some inefficiency, so I tried to stay away from that problem. (Please note that adding/removing this clause does NOT change the slope.)

You are right, this won't affect performance much, except when m >> n: with a reduction, each thread keeps a private copy of the whole y array, and combining those copies at the end costs work proportional to m per thread, which only becomes significant when the inner loop over n is short.

3) Would you please explain a little bit more about the following quote:

Declaring sequential loop indices as shared is actually a bug (though it may not matter if the compiler never stores these values in memory).

Code:
!$omp parallel
!$omp& shared (b,a,j)
!$omp& private (i)
!$omp do
      do i = 1, n
      do j = 1, m
       b(i,j) = a(j,i)
....

should be
Code:
!$omp parallel
!$omp& shared (b,a)
!$omp& private (i,j)
!$omp do
      do i = 1, n
      do j = 1, m
       b(i,j) = a(j,i)
....

though any amount of compiler optimisation will likely mean that j is never stored in memory, so you won't get the wrong answer.
MarkB


Re: Matrix Vector Multiplication Contest - Please Participate!

Dear Mark,

Again, thank you very much for your efforts and the time you have spent on my problem.

You are absolutely right about the loop order in the transpose: my loop order is extremely inefficient, but I left it that way on purpose, to show that there is a big difference between memory access schemes.

(I profiled the code using ompP, and 93% of the run time was spent in the TRANSPOSE loop, but at this step I do not care about that at all! I have fixed it in other versions of MVX!)

However, I am focusing on the MVX part right now and in this post, and as you can see, I take the timing over the MVX loop only.
I am a little confused: in papers I see "LINEARLY SCALED SPEED-UP vs. THREADS" figures for such loops (when I compare the authors' pseudocode with my own code), and even for more complicated applications and codes, yet I cannot get that much efficiency even for this simple MVX. Unfortunately, I may have to abandon the OpenMP approach, and that would be a great loss of time for me.

Thank you again.

Sincerely yours,
Mahdi
kazempour


Re: Matrix Vector Multiplication Contest - Please Participate!

Hi there,

I'm not sure I explained this clearly enough. The timings I showed are for the matrix-vector loop, which I have not altered. But they are strongly affected by the order of the transpose loops, because that is where b is first accessed, and so it determines where in physical memory the pages of b are stored. Try it for yourself!

For large matrices, matrix-vector multiply is really just a bandwidth test: nothing much else matters, and on NUMA systems, the data affinity has a big effect on the attained bandwidth.

Mark.
MarkB


Re: Matrix Vector Multiplication Contest - Please Participate!

Can anybody reach linear speed-up in matrix-vector multiplication using OpenMP? If so, please share your experience!
kazempour


Re: Matrix Vector Multiplication Contest - Please Participate!

Even if you get the data affinity right, you will find that on most modern hardware the achievable memory bandwidth does not scale linearly with the number of cores which are accessing memory. This is purely a feature of the hardware, and nothing to do with OpenMP! Don't be put off using OpenMP by this: most real applications, especially if optimised to exploit cache locality, do not make such heavy demands on bandwidth.

On my 64-core system, I can get very close to linear speedup on 8 threads by assigning threads to different 8-core NUMA regions:

The number of threads available = 1
10000 10000 79.50186729431152
The number of threads available = 2
10000 10000 39.91198539733887
The number of threads available = 4
10000 10000 20.23506164550781
The number of threads available = 8
10000 10000 10.28203964233398

However, if I run all 8 threads on the same 8-core NUMA region, then speedup is limited to just over 3x.

The number of threads available = 1
10000 10000 78.04918289184570
The number of threads available = 2
10000 10000 41.08810424804688
The number of threads available = 4
10000 10000 25.30193328857422
The number of threads available = 8
10000 10000 24.48391914367676
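The thread placement described above is controlled outside the program, via the OpenMP runtime and the OS. A sketch of two common approaches (assumptions: Linux, the Intel OpenMP runtime for KMP_AFFINITY, and the numactl utility installed; node and core numbering are machine-specific):

```shell
# Spread 8 threads across NUMA regions instead of packing them onto one:
export OMP_NUM_THREADS=8

# Intel OpenMP runtime: place threads round-robin across packages
export KMP_AFFINITY=scatter
./MVX-fortran

# Alternatively, bind the run to chosen NUMA nodes with numactl
numactl --cpunodebind=0,1,2,3 --interleave=0,1,2,3 ./MVX-fortran
```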
MarkB
