How to best optimize this load impbalance and data locality

General OpenMP discussion

How to best optimize this load impbalance and data locality

Postby ntt2011 » Sun Jan 27, 2013 10:12 pm

Hi everyone,

I have a specific problem of load imbalance and looking for a solution. I'm a physicist and don't want to reinvent the wheel :). If the optimal solution is already known, please help. The serial code is following

Code: Select all
double array[N], sum;

sum = 0.0;
for (i=0; i<N; i++) {
   for (j=0; j<i; j++) {
      sum += array[i] * array[j]
}


as you can see, index i does (i-1) multiplications. So it's longer for higher i. My current solution is interleaving the outer loop with "schedule(static,1)". This helps certainly, but is not perfect, and not enough data locality (index i = N-1 have to read the whole array, possibly thrashes the cache.)

Does anybody knows a better solution to this? thank you
ntt2011
 
Posts: 10
Joined: Sun Dec 30, 2012 6:03 am

Re: How to best optimize this load impbalance and data local

Postby ftinetti » Mon Jan 28, 2013 6:26 am

Hi,

I usually do something like

Code: Select all
   chunk_size = N / (omp_get_num_threads( ) *chunks_per_thread);
   ... schedule(static,chunk_size)


and test some values for chunks_per_thread. This certainly does not solve the issue:

... and not enough data locality (index i = N-1 have to read the whole array, possibly thrashes the cache.)


but the original code does have it anyway, and OpenMP does not optimize the serial code, it just parallelize it.

HTH,

Fernando.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm

Re: How to best optimize this load impbalance and data local

Postby MarkB » Tue Jan 29, 2013 4:02 am

Hi again!

It would be possible to improve the locality by doing a 2D block cyclic decomposition of the iteration space, which (I think!) can reduce the amount of data
accessed by a thread from O(N) to O(N/sqrt(p)). You can't achieve this with OpenMP's schedule kinds, however, so you would have to use a parallel region and
compute explicitly what each thread does based on its ID. Before you attempt this, though, it would be worth excluding other overheads: it is worth considering the
affinity of the array with respect to how it was last accessed before this operation, and therefore where it is likely to be cached. Could the overhead of the parallel
region itself be a factor? What is the typical value of N, and how long does the loop nest take to execute on one thread?

Hope that helps,
Mark.
MarkB
 
Posts: 454
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: How to best optimize this load impbalance and data local

Postby ntt2011 » Tue Feb 05, 2013 9:12 pm

Thank you. My N is about 40,000 and it's not simple multiplications like my example, but a much more complicated but invariant work.

I decided to do manual thread load assignment like you said, and tried the folding schedule ( 1 thread doing i and N-i blocks of indices ). I got a nice improvement.
ntt2011
 
Posts: 10
Joined: Sun Dec 30, 2012 6:03 am


Return to Using OpenMP

Who is online

Users browsing this forum: Google [Bot], Yahoo [Bot] and 9 guests