Optimization flag for OpenMP in GCC compiler


Post by saurishdas » Sun Oct 27, 2013 2:10 am

Hi,
I am working on parallelizing a Conjugate Gradient matrix solver using OpenMP. A piece of my code is attached below:

Code:
#pragma omp parallel num_threads(NTt) default(none) private(j,k) shared(STA, COEFF, RLL, p_sparse_s, coef_a_sparse, res_sparse_s, normres_sparse, nx, ny, nz)
{
    #pragma omp for reduction(+:normres_sparse)
    for (i = 1; i <= nx; i++)
        for (j = 1; j <= ny; j++)
            for (k = 1; k <= nz; k++)
            {
                p_sparse_s[i][j][k] =   COEFF[0][i][j][k] * STA[i-1][j][k]
                                      + COEFF[2][i][j][k] * STA[i][j-1][k]
                                      + COEFF[4][i][j][k] * STA[i][j][k-1]
                                      + COEFF[6][i][j][k] * STA[i][j][k]
                                      + COEFF[5][i][j][k] * STA[i][j][k+1]
                                      + COEFF[3][i][j][k] * STA[i][j+1][k]
                                      + COEFF[1][i][j][k] * STA[i+1][j][k];

                res_sparse_s[i][j][k] = RLL[i][j][k] + coef_a_sparse * p_sparse_s[i][j][k];

                normres_sparse += (res_sparse_s[i][j][k] * res_sparse_s[i][j][k]) / (nx*ny*nz);
            }
}


Please note that here I have defined 3-D matrices of size 200 x 200 x 200, i.e. nx = ny = nz = 200.

I am using the GCC compiler on an Intel i7 quad-core processor, and without any optimization flag I am getting around 95% efficiency on 4 cores,
i.e. run time for the serial code = 4 h, run time for the parallel code = 1 h 3 min (on 4 cores).
But when I use the -O3 flag, the serial code takes 2 h 30 min and the parallel code around 55 min.
So although I am getting faster results, the parallel efficiency decreases.

So my questions are:
(1) For benchmarking, should one use an optimization flag? If the answer is YES, which optimization flag should I use with OpenMP?
(2) I have heard the term "false sharing" but don't know much about it. Could false sharing be the problem here, since I need to share a very large number of arrays?

Any suggestion/ resolution will be highly appreciated. :mrgreen:

Regards,
Saurish

Re: Optimization flag for OpenMP in GCC compiler

Post by pierrick » Mon Oct 28, 2013 6:32 am

Hi,

Looking at your code, I don't understand why i is not annotated as private.

(1) For benchmarking, should one use an optimization flag? If the answer is YES, which optimization flag should I use with OpenMP?


I would say YES. Users will compile with optimization, so you should do the same. Some optimizations are well known, like loop unrolling (and many others), and the compiler can perform them while you keep your code clean.
I think optimization flags are not specific to OpenMP (I am not sure about that), but it is well known that sometimes -O2 is faster than -O3. You can also use -Ofast, but you have to test each of them to find out which one is best.
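As a sketch, a comparison run might look like the following (the source file name cg_solver.c is only a placeholder for your own file):

```shell
# Build the same source at several optimization levels. In GCC, -fopenmp
# enables the OpenMP pragmas and must be given at compile and link time.
gcc -O2    -fopenmp cg_solver.c -o cg_O2
gcc -O3    -fopenmp cg_solver.c -o cg_O3
gcc -Ofast -fopenmp cg_solver.c -o cg_Ofast   # note: -Ofast relaxes strict IEEE floating-point rules

# Time each binary with the same fixed thread count before comparing.
OMP_NUM_THREADS=4 ./cg_O2
OMP_NUM_THREADS=4 ./cg_O3
OMP_NUM_THREADS=4 ./cg_Ofast
```

Since -Ofast changes floating-point semantics, it is worth checking that the solver still converges to the same residual before trusting its timings.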

About efficiency: sometimes the CPU is not the bottleneck for your program. Your memory bandwidth may be slower than the computation, and that may explain why efficiency decreases.
Looking at your code, this can happen because your data are not laid out contiguously in memory. When you write COEFF[5][i][j][k] on a pointer-to-pointer array you jump four times through memory, whereas if the data are in one contiguous block you can write COEFF[(((5 * COEFF.shape[1] + i) * COEFF.shape[2]) + j) * COEFF.shape[3] + k]. In that case the accesses are contiguous, so the computation should be faster (and you may be able to use SSE instructions).

(2) I have heard the term "false sharing" but don't know much about it. Could false sharing be the problem here, since I need to share a very large number of arrays?


I don't know what "false sharing" is, so I may say something wrong, but I think that shared data avoids copies, so having all variables shared should not cause any performance issue.

I hope it will help you,

Regards,
Pierrick

Re: Optimization flag for OpenMP in GCC compiler

Post by MarkB » Mon Oct 28, 2013 7:49 am

I think the main bottleneck in this code is likely to be memory bandwidth. Turning optimisation on (which is clearly the right thing to do, because it reduces the wall clock time) will reduce the number of instructions executed, but cannot really do anything about the number of loads/stores required. The memory system becomes saturated when 4 threads all demand data at the same time.

Reordering the COEFF array from COEFF[7][nx][ny][nz] to COEFF[nx][ny][nz][7] might improve the cache locality a bit: this might be what Pierrick is trying to say, but I'm not sure!

False sharing occurs where multiple threads access addresses which are on the same cache line (and at least one of the threads is writing the data). This does not look like a problem in your code as the data accessed by different threads is well separated in memory.

pierrick wrote: Looking at your code, I don't understand why i is not annotated as private.


i is the iteration variable of the parallel loop, so it is private by default.

Re: Optimization flag for OpenMP in GCC compiler

Post by saurishdas » Tue Oct 29, 2013 1:34 pm

Thanks for your reply :D

Yes, Mark and Pierrick, I completely agree with you about reordering the COEFF array. I made it COEFF[nx][ny][nz][7] and it runs faster, but the parallel efficiency remains the same.

After reading some documents I understand that the problem is not false sharing.

Actually, I came across a bug in the GCC compiler: it inhibits the automatic vectorization available with -O3 when the -fopenmp flag is used:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

I will try it with the icc compiler and let you know my findings. In the meantime, if you have any thoughts please let me know.

regards,
saurish

