[Omp] sparsematrix X vector multiplication performance decreases 30%

Ruud van der Pas Ruud.Vanderpas at Sun.COM
Mon Mar 28 00:40:10 PST 2005


Hi Drosos,

Thanks for sharing this problem with us.

> The problem concerns a sparse-matrix vector multiplication.
> When executed sequentially it needs 4.5 seconds! In parallel
> though it needs almost 6.0 seconds running in 4 CPUs.

This might be the case. It is not a given that a program parallelized
with OpenMP is always going to be faster. It depends on many things.
See below.

> And that turns to be a big mistake since the code in 4CPUs executes
> 30% slower than it executes sequentially!

Firs of all, for performance analysis purposes, it might be
good to also run this example on 2 CPUs to see whether it is
already slower then.

> 1:) Can anybody please let me know what is wrong 
>     with the above implementation?
> 2:) Can it be fixed someway?

Looking at the code, there are a couple of things that come
to my mind:

- Is there enough work to be done in order to amortize the
   cost of the OpenMP overhead:
   o What are the typical loop lengths for example?
   o What happens if you increase the problem size?
- Load balancing might be an issue in this case. The inner loop
   is most likely of a different length for different values of 'i'
   You could experiment with a more dynamic workload distribution
   scheme (e.g. schedule(dynamic) or schedule(guided))
- If the loops are very short, you might get hurt by false sharing when
   storing 'result[i]'
- Memory access could also be a bottleneck. You're doing random
   reads and multiple threads might hit the same memory bank. This
   will slow you down fairly quickly.

> 3:) Is this an OpenMP bug or something like that?

OpenMP does not say anything about performance. It is a functional
model. It is up to the user to use it in a sensible way :-)

You can never rule out a performance bug in the implementation
you're using, but this is not likely.

Kind regards,
Ruud
----------------------------------------------------------------
Senior Staff Engineer             Email: ruud.vanderpas at sun.com
Scalable Systems Group            Phone: +31-33-4515000 (x15920)
Sun Microsystems                  Fax  : +31-33-4515001
----------------------------------------------------------------
 > // suppose you have a sparse matrix M stored in CSR format
 > // pdata is the array containing the data
 > // prows is the array of rows of pdata such that row i
 > // of the corresponding full matrix starts from pdata[prows[i]]
 > // and ends at pdata[prows[i+1]-1].
 > // pcols is the array whose elements are the columns of the
 > // corresponding full matrix, e.g. pdata[i] has column pcols[i]
 >
 > // we implement the multiplication of CSR sparse matrix M with vector b.
 >
 > void sparsematrixXvector(double* pdata,
 >                          double* b,
 >                          double* result,
 >                          int* prows,
 >                          int* pcols,
 >                          int dim)
 >
 > {
 >   int i, j, index;
 >   // since there are not i-dependences it seems natural to
 >   // parallelize that i-loop like:
 > #pragma omp parallel for
 >   for (i = 0; i < dim; i++)
 >   {
 >      int from = prows[i];
 >      int to = prows[i+1];
 >      double sum = 0.0;
 >      for (index = from; index < to; index++)
 >      {
 >        j = pcols[index];
 >        sum += pdata[index]*b[j];
 >      }
 >      result[i] = sum;
 >   }
 > }




More information about the Omp mailing list