[Omp] sparsematrix X vector multiplication performance decreases
30%
Ruud van der Pas
Ruud.Vanderpas at Sun.COM
Mon Mar 28 00:40:10 PST 2005
Hi Drosos,
Thanks for sharing this problem with us.
> The problem concerns a sparse-matrix vector multiplication.
> When executed sequentially it needs 4.5 seconds! In parallel
> though it needs almost 6.0 seconds running in 4 CPUs.
This might be the case. It is not a given that a program parallelized
with OpenMP is always going to be faster. It depends on many things.
See below.
> And that turns to be a big mistake since the code in 4CPUs executes
> 30% slower than it executes sequentially!
Firs of all, for performance analysis purposes, it might be
good to also run this example on 2 CPUs to see whether it is
already slower then.
> 1:) Can anybody please let me know what is wrong
> with the above implementation?
> 2:) Can it be fixed someway?
Looking at the code, there are a couple of things that come
to my mind:
- Is there enough work to be done in order to amortize the
cost of the OpenMP overhead:
o What are the typical loop lengths for example?
o What happens if you increase the problem size?
- Load balancing might be an issue in this case. The inner loop
is most likely of a different length for different values of 'i'
You could experiment with a more dynamic workload distribution
scheme (e.g. schedule(dynamic) or schedule(guided))
- If the loops are very short, you might get hurt by false sharing when
storing 'result[i]'
- Memory access could also be a bottleneck. You're doing random
reads and multiple threads might hit the same memory bank. This
will slow you down fairly quickly.
> 3:) Is this an OpenMP bug or something like that?
OpenMP does not say anything about performance. It is a functional
model. It is up to the user to use it in a sensible way :-)
You can never rule out a performance bug in the implementation
you're using, but this is not likely.
Kind regards,
Ruud
----------------------------------------------------------------
Senior Staff Engineer Email: ruud.vanderpas at sun.com
Scalable Systems Group Phone: +31-33-4515000 (x15920)
Sun Microsystems Fax : +31-33-4515001
----------------------------------------------------------------
> // suppose you have a sparse matrix M stored in CSR format
> // pdata is the array containing the data
> // prows is the array of rows of pdata such that row i
> // of the corresponding full matrix starts from pdata[prows[i]]
> // and ends at pdata[prows[i+1]-1].
> // pcols is the array whose elements are the columns of the
> // corresponding full matrix, e.g. pdata[i] has column pcols[i]
>
> // we implement the multiplication of CSR sparse matrix M with vector b.
>
> void sparsematrixXvector(double* pdata,
> double* b,
> double* result,
> int* prows,
> int* pcols,
> int dim)
>
> {
> int i, j, index;
> // since there are not i-dependences it seems natural to
> // parallelize that i-loop like:
> #pragma omp parallel for
> for (i = 0; i < dim; i++)
> {
> int from = prows[i];
> int to = prows[i+1];
> double sum = 0.0;
> for (index = from; index < to; index++)
> {
> j = pcols[index];
> sum += pdata[index]*b[j];
> }
> result[i] = sum;
> }
> }
More information about the Omp
mailing list