Opteron 6200 issue - dramatic speed-down for nested loops

General OpenMP discussion

Opteron 6200 issue - dramatic speed-down for nested loops

Postby fadeyda » Tue Jun 25, 2013 9:52 am

Hi folks!
I'm having serious problems with OpenMP-parallelized nested loops of the following type:
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS)
for(i=0;i<N_z;i++){
    for(j=0;j<N_r;j++) B[i*N_r+j]+=sin(A[i*N_r+j]);
}
I'm using a dual-Opteron TYAN motherboard with two Opteron 6272 CPUs, 128 GB of DDR3 and four GeForce GTX 690s. All this runs Debian Linux, kernel 3.2.0-4-amd64; the compiler is gcc 4.7.
I get a 1.5x SLOW-DOWN with 16 threads (THS in the code) compared to a single-threaded run. With pthreads I get a normal ~10x speedup. On a Nehalem server with 8 cores everything is okay: not an 8x but a 4x speedup, yet it is a speedup, not a slowdown! You are welcome to test my code on your systems (see attachment).

Attachment contents:
*.c/*.h - source
test - a binary compiled on my system (for pthreads, 16 threads; see the run script), so you can remove it
compile - compile script
run - run script
run.log - output log from my system

P.S. I used taskset and also tested the case where only cores 0, 2, 4, 6, ... are used. The result is the same: OpenMP does not work. All threads are at 100% load, but the execution time is 1.5 times greater than in the single-threaded version, while the pthreads version behaves fine.
Attachments
server_test.tar.bz2
src and run scripts + log for dual Opteron 6272
(6.44 KiB) Downloaded 325 times
fadeyda
 
Posts: 3
Joined: Tue Jun 25, 2013 7:17 am

Re: Opteron 6200 issue - dramatic speed-down for nested loop

Postby MarkB » Tue Jun 25, 2013 5:42 pm

Hi there,

You have a bug in your code: the loop index j needs to be declared private.
This unintentional sharing is likely to be the cause of the poor performance.
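For example, something along these lines should do it (a sketch based on the loop you posted):
Code: Select all
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS) private(j)
for(i=0;i<N_z;i++){
    for(j=0;j<N_r;j++) B[i*N_r+j]+=sin(A[i*N_r+j]);
}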

Hope that helps,
Mark.
MarkB
 
Posts: 487
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Opteron 6200 issue - dramatic speed-down for nested loop

Postby fadeyda » Wed Jun 26, 2013 12:38 am

Thanks, Mark! It works. It still seems very strange that a non-private j index does not hurt nearly as much on the Intel architecture.
To be honest, with a non-private j I got a 4x speed-up on 8 processors, not 8x. When I declared j private I got 8x on the 4+4-core Nehalem Xeons and ~12.5x with 16 threads on the 16+16-core Interlagos Opterons.

P.S. Reading the manuals is worth it... I'm a newbie to OpenMP; I used pthreads before.
fadeyda
 
Posts: 3
Joined: Tue Jun 25, 2013 7:17 am

Re: Opteron 6200 issue - dramatic speed-down for nested loop

Postby fadeyda » Wed Jun 26, 2013 6:32 am

Hi there again!
As MarkB suggested, j must be local to each thread. So I have tried the following:
Code: Select all
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS ) private(i,j)

and it works!
I have also tried another solution:
Code: Select all
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS)
for(i=0;i<N_z;i++){
    int j;  /* declared inside the loop body, so it is private to each thread */
    for(j=0;j<N_r;j++) B[i*N_r+j]+=sin(A[i*N_r+j]);
}

and it also works. But the most interesting thing is that when I use the following code:
Code: Select all
int i;
volatile int j;
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS )
for(i=0;i<N_z;i++){
    for(j=0;j<N_r;j++) B[i*N_r+j]+=sin(A[i*N_r+j]);
    }

I get a wrong result and the same slow-down: 3 seconds, versus 2 seconds for the single-threaded version.

My question is: what is really happening when I use a variable that is external to the OpenMP threads? I see that the result of the calculation is correct, but it could not be if the j variable were not in cache; and if j is in cache, there should be no conflict. I see that with volatile the result is wrong. In any case, why do I get such a dramatic slow-down? Practically all of the time here is spent on the sine calculation! It looks as though OpenMP tries to execute the threads in sequential order - but if so, why is the result wrong with a volatile j?

P.S. Thanks again, Mark, for your advice. But I am still confused by the behaviour of my code.
fadeyda
 
Posts: 3
Joined: Tue Jun 25, 2013 7:17 am

Re: Opteron 6200 issue - dramatic speed-down for nested loop

Postby MarkB » Wed Jun 26, 2013 6:41 am

fadeyda wrote: Thanks, Mark!


You're very welcome! I recommend using the default(none) clause and declaring all variables explicitly as shared/private/reduction etc. - it catches a lot of bugs of this kind.
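Applied to your loop, that might look something like this (just a sketch; it assumes N_z and N_r are ordinary variables rather than macros, so they have to be listed too):
Code: Select all
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS) \
        default(none) shared(A, B, N_z, N_r) private(i, j)
for(i=0;i<N_z;i++){
    for(j=0;j<N_r;j++) B[i*N_r+j]+=sin(A[i*N_r+j]);
}

With default(none) the compiler refuses to build the loop until every variable it references has an explicit data-sharing attribute, so a forgotten j shows up as a compile error rather than a race.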

fadeyda wrote: It still seems very strange that a non-private j index does not hurt nearly as much on the Intel architecture.
To be honest, with a non-private j I got a 4x speed-up on 8 processors, not 8x. When I declared j private I got 8x on the 4+4-core Nehalem Xeons and ~12.5x with 16 threads on the 16+16-core Interlagos Opterons.


It is probably due to the number of L3 caches in the system that are contending for ownership of the cache line (2 on the Intel system and 4 on the AMD one?).
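If you want to see that effect in isolation, a little test along these lines (just a sketch, not taken from your code; compile with gcc -fopenmp) times several threads repeatedly writing counters that share a cache line against counters padded onto their own lines:
Code: Select all
#include <stdio.h>
#include <omp.h>

#define NTH   16
#define ITERS 100000000L

/* Counters packed next to each other: several of them share a cache line,
   so the line bounces between the caches while the threads write to it. */
static volatile long packed_ctr[NTH];

/* The same counters, padded to 64 bytes so each one has a cache line to itself. */
static struct { volatile long c; char pad[64 - sizeof(long)]; } padded_ctr[NTH];

int main(void)
{
    double t0, t1, t2;
    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTH)
    {
        int id = omp_get_thread_num();
        long k;
        for (k = 0; k < ITERS; k++) packed_ctr[id]++;
    }
    t1 = omp_get_wtime();
    #pragma omp parallel num_threads(NTH)
    {
        int id = omp_get_thread_num();
        long k;
        for (k = 0; k < ITERS; k++) padded_ctr[id].c++;
    }
    t2 = omp_get_wtime();
    printf("packed: %.2f s   padded: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}

Each thread only ever touches its own counter, so there is no race here; any difference between the two timings is purely the cache line bouncing between the L3 caches.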
MarkB
 
Posts: 487
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

