I am working on parallelizing a Conjugate Gradient solver using OpenMP. A piece of my code is attached below:

Code:

```c
#pragma omp parallel num_threads(NTt) default(none) private(i, j, k) \
        shared(STA, COEFF, RLL, p_sparse_s, coef_a_sparse, res_sparse_s, normres_sparse, nx, ny, nz)
{
    #pragma omp for reduction(+:normres_sparse)
    for (i = 1; i <= nx; i++)
        for (j = 1; j <= ny; j++)
            for (k = 1; k <= nz; k++)
            {
                /* 7-point stencil: centre point plus its six neighbours */
                p_sparse_s[i][j][k] = COEFF[0][i][j][k] * STA[i-1][j][k]
                                    + COEFF[2][i][j][k] * STA[i][j-1][k]
                                    + COEFF[4][i][j][k] * STA[i][j][k-1]
                                    + COEFF[6][i][j][k] * STA[i][j][k]
                                    + COEFF[5][i][j][k] * STA[i][j][k+1]
                                    + COEFF[3][i][j][k] * STA[i][j+1][k]
                                    + COEFF[1][i][j][k] * STA[i+1][j][k];

                res_sparse_s[i][j][k] = RLL[i][j][k] + coef_a_sparse * p_sparse_s[i][j][k];

                /* accumulate the squared residual norm via the reduction */
                normres_sparse += (res_sparse_s[i][j][k] * res_sparse_s[i][j][k]) / (nx*ny*nz);
            }
}
```

Please note that the 3-D arrays here are 200 x 200 x 200, i.e. nx = ny = nz = 200.

I am using the GCC compiler on an Intel i7 quad-core processor, and without any optimization flag I get around 95% parallel efficiency on 4 cores: the serial code runs in 4 h and the parallel code in 1 h 3 min.

But when I use the -O3 flag, the serial code takes 2 h 30 min and the parallel code around 55 min. So although I get faster results, the parallel efficiency drops (to roughly 68%).

So my questions are:

(1) For benchmarking, should one use an optimization flag at all? If yes, which optimization flags should I use with OpenMP?

(2) I have heard the term "false sharing" but don't know much about it. Could false sharing be the problem here? I need to share a very large number of arrays.

Any suggestion/resolution will be highly appreciated.

Regards,

Saurish