My OpenMP loop (sample below) hits a performance plateau on large shared-memory machines: it scales well up to about 32 cores, and adding more cores beyond that gives no further speedup. The plateau may be an artifact of memory access, but I would like to know whether anyone can suggest a more efficient way to code this loop with OpenMP.
The sample is:
#pragma omp parallel for private(i,j)
for (i = 0; i < ie; i++) {
    for (j = 1; j < je; j++) { // don't do ex at j=0 or j=je, it will be done in the PML section
        ex[i][j] = caex[i][j] * ex[i][j] + cbex[i][j] * ( hz[i][j] - hz[i][j-1] );
    }
}
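
For anyone who wants to reproduce or experiment with this, here is a minimal sketch of the same update under two assumptions that are not part of the code above: the arrays are flat, contiguous row-major blocks of ie x je doubles, and they are initialized in parallel with the same static schedule as the update, so that each thread first touches the rows it later works on. The function names update_ex and first_touch_fill are hypothetical, chosen only for this sketch. Whether either change moves the plateau is untested; it is a variant to measure, not a known fix.

```c
#include <stddef.h>

/* Hypothetical helper: fill an ie x je row-major array in parallel.
 * Assumption: with a first-touch page placement policy, using the same
 * static schedule here and in update_ex keeps each row's memory close
 * to the thread that later updates it. */
void first_touch_fill(double *a, int ie, int je, double value)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ie; i++) {
        for (int j = 0; j < je; j++) {
            a[(size_t)i * je + j] = value;
        }
    }
}

/* Same arithmetic as the original loop, on flat row-major arrays.
 * Assumption: the four arrays do not overlap, hence restrict. */
void update_ex(double *restrict ex, const double *restrict caex,
               const double *restrict cbex, const double *restrict hz,
               int ie, int je)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ie; i++) {
        /* Row pointers make the inner loop a simple unit-stride sweep. */
        double *exr = ex + (size_t)i * je;
        const double *car = caex + (size_t)i * je;
        const double *cbr = cbex + (size_t)i * je;
        const double *hzr = hz + (size_t)i * je;

        /* j = 0 is skipped, as in the original (handled in the PML section). */
        for (int j = 1; j < je; j++) {
            exr[j] = car[j] * exr[j] + cbr[j] * (hzr[j] - hzr[j - 1]);
        }
    }
}
```

The idea would be to call first_touch_fill on each array right after allocation and before any serial initialization, then time update_ex at different thread counts. The loop reads four doubles and writes one for only a few floating-point operations per point, so it may simply be limited by memory bandwidth once enough cores are active.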
