I am working on parallelizing a Conjugate Gradient matrix solver using OpenMP. A piece of my code is attached below:
Code:
#pragma omp parallel num_threads(NTt) default(none) \
        private(i, j, k) \
        shared(STA, COEFF, RLL, p_sparse_s, coef_a_sparse, res_sparse_s, normres_sparse, nx, ny, nz)
#pragma omp for reduction(+:normres_sparse)
for (i = 1; i <= nx; i++)
    for (j = 1; j <= ny; j++)
        for (k = 1; k <= nz; k++) {
            /* Braces are needed here: without them only the first
               statement is inside the k loop. */
            p_sparse_s[i][j][k] = COEFF[i][j][k] * STA[i-1][j][k]
                                + COEFF[i][j][k] * STA[i][j-1][k]
                                + COEFF[i][j][k] * STA[i][j][k-1]
                                + COEFF[i][j][k] * STA[i][j][k]
                                + COEFF[i][j][k] * STA[i][j][k+1]
                                + COEFF[i][j][k] * STA[i][j+1][k]
                                + COEFF[i][j][k] * STA[i+1][j][k];
            res_sparse_s[i][j][k] = RLL[i][j][k]
                                  + coef_a_sparse * p_sparse_s[i][j][k];
            normres_sparse += (res_sparse_s[i][j][k] * res_sparse_s[i][j][k])
                            / (nx * ny * nz);
        }
Please note that the 3-D arrays here are of size 200 × 200 × 200, i.e. nx = ny = nz = 200.
I am using the GCC compiler on an Intel i7 quad-core processor. Without any optimization flag I get about 95% parallel efficiency on 4 cores,
i.e. run time for the serial code = 4 h, run time for the parallel code = 1 h 3 min (on 4 threads).
But when I use the -O3 flag, the serial code takes 2 h 30 min and the parallel code about 55 min.
So although I get faster results, the parallel efficiency decreases.
So my questions are:
(1) For benchmarking, should one use an optimization flag? If the answer is YES, which optimization flag should I use with OpenMP?
(2) I have heard the term “false sharing”, but I don’t know much about it. Could false sharing be the problem here, since I need to share a very large number of arrays?
Any suggestion/resolution will be highly appreciated.