Drop of performance for large arrays/matrices

General OpenMP discussion


Postby franglez » Tue Apr 29, 2008 2:08 am

I am using OpenMP to parallelize the following loop (shown in pseudo-code) on an Intel Core 2 under Linux.

// Initialization of dense matrix and vectors

for (unsigned int i = 0; i < NumLoops; i++) {
    #pragma omp parallel sections default(shared)
    {
        #pragma omp section
        {
            // Product v1out = A*v1 with BLAS routines.
        }

        #pragma omp section
        {
            // Product v2out = A*v2 with BLAS routines.
        }
    }
}
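For reference, a compilable sketch of this pattern, with a plain hand-written matrix-vector product standing in for the BLAS call (the names `matvec` and `two_products` are illustrative, not from the original post):

```c
#include <stddef.h>

/* Plain matrix-vector product: out = A*v, with A stored row-major.
 * Stands in here for the BLAS call (e.g. dgemv) in the post. */
static void matvec(const double *A, const double *v, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * v[j];
        out[i] = sum;
    }
}

/* The pattern above: the two independent products run as two OpenMP
 * sections; compile with -fopenmp (or equivalent) to enable threading. */
void two_products(const double *A, const double *v1, const double *v2,
                  double *v1out, double *v2out, size_t n)
{
    #pragma omp parallel sections default(shared)
    {
        #pragma omp section
        matvec(A, v1, v1out, n);

        #pragma omp section
        matvec(A, v2, v2out, n);
    }
}
```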

As expected, I'm getting speedups of about 1.9 while the dimension of the matrix A stays under 1000. However, performance drops as the matrix grows, falling to a speedup of about 1.5 at a dimension of 5000.

Has anybody experienced similar drops of performance with large matrices? What could be the reason for such behavior?

Thanks in advance.

Fran González
Posts: 1
Joined: Tue Apr 29, 2008 2:06 am

Re: Drop of performance for large arrays/matrices

Postby lfm » Tue Apr 29, 2008 11:27 am

First, make sure you have enough physical memory to hold the data. Assuming you aren't paging, the next likely problem is cache effects: if two different threads are each streaming through a large matrix, you aren't getting much cache locality. You are far better off parallelizing the matrix operations themselves and performing them consecutively. Use some kind of blocking algorithm, or use a library (e.g., Intel's MKL) that is already parallelized and tuned.
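A minimal sketch of that suggestion, assuming the same dense row-major layout as above (`matvec_par` is an illustrative name): split the rows of each product across the thread team and run the two products one after the other, so each thread streams through only its own band of A.

```c
#include <stddef.h>

/* out = A*v with the rows split across the OpenMP team.  Each thread
 * touches a contiguous band of A instead of the whole matrix, which
 * keeps the threads from competing for the same cache lines. */
void matvec_par(const double *A, const double *v, double *out, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[(size_t)i * n + j] * v[j];
        out[i] = sum;
    }
}

/* Do the two products consecutively, each internally parallel,
 * instead of running them as two concurrent sections. */
void two_products_consecutive(const double *A,
                              const double *v1, const double *v2,
                              double *v1out, double *v2out, size_t n)
{
    matvec_par(A, v1, v1out, n);
    matvec_par(A, v2, v2out, n);
}
```

In practice a parallel BLAS (MKL, or similar tuned libraries) already parallelizes the product internally, so linking against one and dropping the sections entirely is usually the simpler route.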

-- Larry
Posts: 135
Joined: Sun Oct 21, 2007 4:58 pm
Location: OpenMP ARB

Re: Drop of performance for large arrays/matrices

Postby Joseph M. Newcomer » Thu Aug 07, 2008 12:49 pm

Note that arranging your computation to be cache-aware can produce anywhere from 5x to 20x performance improvement in single-threaded code. The problem with portable code is that you have to discover the caching parameters of the host machine and have some way to tell the program to reorganize the computation for that platform's caching strategy; as far as I know, this is not formally supported anywhere. The only code I know of where this works well has either been hand-tuned to a particular platform's cache, or translated from input source to output source by a program that understands the cache model, and I have been told by someone who did the latter that a gain of 20x is not uncommon in single-threaded code.

In parallel code, the interactions of the threads in thrashing the cache are vastly more subtle and complex than with single-threaded code. Actually, I'm more amazed that the drop is so small (from 1.9 to 1.5) than that it exists at all.
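As an illustration of what such cache-aware reorganization looks like, here is a loop-tiled matrix multiply; the tile size `BS` is an assumption that would have to be tuned to the target machine's cache, which is exactly the portability problem described above.

```c
#include <stddef.h>
#include <string.h>

#define BS 64  /* assumed tile size; must be tuned per cache, per the post */

/* C = A*B for n-by-n row-major matrices, computed tile by tile so each
 * BS-by-BS block of A and B is reused from cache before being evicted. */
void matmul_blocked(const double *A, const double *B, double *C, size_t n)
{
    memset(C, 0, n * n * sizeof *C);
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS) {
                /* clip each tile at the matrix edge */
                size_t imax = ii + BS < n ? ii + BS : n;
                size_t kmax = kk + BS < n ? kk + BS : n;
                size_t jmax = jj + BS < n ? jj + BS : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```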
Joseph M. Newcomer

Re: Drop of performance for large arrays/matrices

Postby geoff » Mon Aug 11, 2008 12:24 pm

I encountered this issue as well with a sparse matrix multiplied by a dense vector (3,000,000 rows, 70,000,000 nonzeros). On a 4-core machine I only got a speedup of 1.25. When I moved away from a computer that was using fully buffered DIMMs (en.wikipedia.org/wiki/Fully_Buffered_DIMM), the situation got much better: the speedup went up to 2.07. Very frustrating.
Posts: 11
Joined: Thu Jun 12, 2008 7:50 am
