openMP coding plateau

Postby magicfoot » Wed Mar 06, 2013 11:16 am

My OpenMP loop, shown below, hits a performance plateau on large shared-memory machines: it scales well up to about 32 cores, and adding more cores beyond that gives no further speedup. The plateau may be an artifact of memory access, but I would like to know whether anyone can suggest a more efficient way to code this with OpenMP.

The sample is:

#pragma omp parallel for private(i,j)
for (i = 0; i < ie; i++) {
    for (j = 1; j < je; j++) { // don't update ex at j=0 or j=je; that is done in the PML section
        ex[i][j] = caex[i][j] * ex[i][j] + cbex[i][j] * ( hz[i][j] - hz[i][j-1] );
    }
}
magicfoot
 
Posts: 4
Joined: Sun May 22, 2011 11:46 am

Re: openMP coding plateau

Postby MarkB » Thu Mar 07, 2013 3:26 am

Hi there,

Couple of questions for you:
What are the typical values of ie and je, and what is the execution time for the loop on 1 thread?
Are the immediately preceding accesses to the arrays used in this loop scheduled to threads in the same way?

Mark.
MarkB
 
Posts: 433
Joined: Thu Jan 08, 2009 10:12 am

Re: openMP coding plateau

Postby magicfoot » Thu Mar 07, 2013 11:32 pm

Hi,

The values of ie and je lie in the range 1000 to 100000.

There is no timing data for the single loop but I can derive that. There are three of these loops in the program, all with different variables, and these loops use 98% of the total execution time.

The code immediately before and after this loop works on different variables. Are you considering memory affinity or cache coherence issues? Is there some way to stabilise that with OpenMP?
magicfoot
 
Posts: 4
Joined: Sun May 22, 2011 11:46 am

Re: openMP coding plateau

Postby MarkB » Fri Mar 08, 2013 4:23 am

Hi there,

magicfoot wrote:The values of ie and je lie in the range 1000 to 100000.

There is no timing data for the single loop but I can derive that. There are three of these loops in the program, all with different variables, and these loops use 98% of the total execution time.


That seems large enough that the overhead of the parallel region (typically in the range of tens to hundreds of microseconds) is likely to be negligible.

magicfoot wrote: Are you considering memory affinity or cache coherence issues? Is there some way to stabilise that with OpenMP?


On a multi-socket machine it can be important to get the distribution of data in main memory right. This means that the first access to large arrays (typically initialisation) should be made inside a parallel region, so that, under a first-touch placement policy, each page ends up in memory local to the thread that will use it. Your code might be getting some cache reuse (at least in L3), so making sure the same thread accesses the same data items in different parallel loops might help.
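A minimal sketch of the first-touch idea, under assumptions not stated in the thread: the arrays hold double and are indexed through row pointers, and the helper names alloc_grid_first_touch, free_grid and update_ex are hypothetical. The point is only that the initialisation loop and the compute loop use the same schedule(static) over i, so each thread first touches, and later updates, the same rows.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an ie x je grid as one contiguous block plus
   a row-pointer table, so that a[i][j] indexing still works.
   Assumption: element type is double (the thread does not say). */
double **alloc_grid_first_touch(int ie, int je, double value)
{
    double **rows = malloc((size_t)ie * sizeof *rows);
    double *data = malloc((size_t)ie * (size_t)je * sizeof *data);
    int i, j;

    if (rows == NULL || data == NULL) {
        free(rows);
        free(data);
        return NULL;
    }

    /* First touch happens here, in parallel, with the same static schedule
       as the compute loop below, so a row is initialised by the thread that
       will later update it. */
    #pragma omp parallel for private(j) schedule(static)
    for (i = 0; i < ie; i++) {
        rows[i] = data + (size_t)i * (size_t)je;
        for (j = 0; j < je; j++)
            rows[i][j] = value;
    }
    return rows;
}

/* Release a grid created by alloc_grid_first_touch. */
void free_grid(double **rows)
{
    if (rows != NULL) {
        free(rows[0]);   /* the contiguous data block */
        free(rows);
    }
}

/* The loop from the original post, with the schedule made explicit. */
void update_ex(double **ex, double **caex, double **cbex, double **hz,
               int ie, int je)
{
    int i, j;

    #pragma omp parallel for private(j) schedule(static)
    for (i = 0; i < ie; i++) {
        for (j = 1; j < je; j++)
            ex[i][j] = caex[i][j] * ex[i][j]
                     + cbex[i][j] * (hz[i][j] - hz[i][j - 1]);
    }
}
```

Whether this helps depends on the operating system's page placement policy and on threads staying on the same cores between loops, for example by setting OMP_PROC_BIND=true in the environment.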

The loop you posted looks very bandwidth-intensive, so you may simply be running into the limits of the hardware bandwidth scalability.
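To put a rough number on "bandwidth-intensive": each update of ex[i][j] reads caex, cbex, ex and hz and writes ex, while hz[i][j-1] was loaded on the previous iteration. Assuming double elements and ignoring any extra write-allocate traffic, that is about 5 x 8 = 40 bytes moved for 3 floating-point operations. The sketch below turns a measured loop time (for instance from omp_get_wtime()) into an effective bandwidth that can be compared with what the machine can sustain; the function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: double elements; four array reads plus one write per update. */
#define BYTES_PER_UPDATE (5 * sizeof(double))

/* Estimated bytes moved by one sweep of the loop (j runs from 1 to je-1). */
double loop_traffic_bytes(size_t ie, size_t je)
{
    return (double)ie * (double)(je - 1) * (double)BYTES_PER_UPDATE;
}

/* Effective bandwidth in GB/s given the measured time for one sweep. */
double effective_bandwidth_gbs(size_t ie, size_t je, double seconds)
{
    return loop_traffic_bytes(ie, je) / seconds / 1e9;
}
```

If this figure stops growing beyond about 32 cores and sits near the machine's sustainable memory bandwidth, the plateau is a hardware limit rather than a problem with the OpenMP code.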

Hope that helps,
Mark.
MarkB
 
Posts: 433
Joined: Thu Jan 08, 2009 10:12 am

