I'm a new user of OpenMP (and a beginner at parallelization), and I'm trying to parallelize some numerical differentiation code that I've written. In essence, I have a set of subdomains which do not depend on each other. Each subdomain has a grid on which I'd like to compute the Laplacian of a function. Since the subdomains are independent, I naively tried to parallelize the for loop that iterates over the subdomains. However, the scaling falls further and further short of ideal as I add threads, and it actually gets worse going from 8 to 12 threads. What I'm confused about is whether this kind of behavior is to be expected or not. If anyone has some idea of how to handle these kinds of issues, I'd appreciate some help!
The relevant code (C++), which is the body of the function that returns the Laplacian over all subdomains (flattened into a single vector), is posted here:
Code:
std::vector<std::vector<double> > LL(n_domains);
int i;
/*BEGIN PARALLEL REGION*/
#pragma omp parallel shared(LL) private(i)
#pragma omp for schedule(static) nowait
for(i = 0; i < n_domains; i++)
    LL[i] = domains[i].laplacian();   // subdomains are independent of each other
/*END PARALLEL REGION*/
// Serial copy of each subdomain's result into the flat output vector L
for(int i = 0; i < n_domains; i++)
    for(std::size_t j = 0; j < LL[i].size(); j++)
        L[cumulative_npts[i]+j] = LL[i][j];
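In case it's useful context, here is a rough variant I've been considering (untested), which writes each subdomain's result straight into L so the intermediate LL and the serial copy loop go away. I'm assuming here that laplacian() returns a std::vector<double>, that L is already resized to the total number of points, and that <algorithm> is included for std::copy:

Code:
// Sketch of an alternative: write directly into the flat output vector L.
// The ranges [cumulative_npts[i], cumulative_npts[i] + LL[i].size()) are
// disjoint, so each thread writes to its own part of L and there is no race.
#pragma omp parallel for schedule(static)
for(int i = 0; i < n_domains; i++)
{
    std::vector<double> tmp = domains[i].laplacian();
    std::copy(tmp.begin(), tmp.end(), L.begin() + cumulative_npts[i]);
}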
In terms of the scaling:
1 thread --> 0.175 s, 2 threads --> 0.094 s, 4 threads --> 0.054 s, 8 threads --> 0.034 s, 12 threads --> 0.041 s.
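(If I've done the arithmetic right, that works out to speedups of roughly 1.9x, 3.2x, 5.1x, and 4.3x, i.e. parallel efficiencies of about 93%, 81%, 64%, and 36%.)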
A single iteration of the for loop takes either 0.0035 s or 0.007 s (there are 2 sizes of grids being used).
Sorry for the long post! I'd just like some help understanding why the scaling isn't better, or whether this is simply the best I can expect. I tried reading up on false sharing to see if that was the cause, but I'm afraid I don't completely understand why it would be a problem here.