OpenMP Scaling Questions

General OpenMP discussion

OpenMP Scaling Questions

Postby asen » Sat Jan 21, 2012 11:39 pm

Hi everyone,

I'm a new user of OpenMP (and a beginner at parallelization), and I'm trying to parallelize some numerical differentiation code that I've written. In essence, I have a set of subdomains which do not depend on each other. Each subdomain has a grid on which I'd like to compute the Laplacian of a function. Since the subdomains are independent, I naively tried to parallelize over the index of the for loop that iterates through the subdomains. However, the code scales worse and worse with increasing numbers of threads. What I'm confused about is whether this type of behavior is to be expected. If anyone has some idea of how to handle these kinds of issues, I'd appreciate some help!

The relevant code (C++), which is the body of the function that returns the Laplacian over all subdomains (flattened into a single vector), is posted here:

Code: Select all

        std::vector<double> L(n_allpts,0);
        std::vector<std::vector<double> > LL(n_domains);
        omp_set_num_threads(12);

        int i;
        /*BEGIN PARALLEL REGION*/
        #pragma omp parallel shared(LL) private(i)
        {
                #pragma omp for schedule(static) nowait
                for(i = 0; i < n_domains; i++)
                {
                        LL[i] = domains[i].laplacian();
                }
        }
        /*END PARALLEL REGION*/

        for(int i = 0; i < n_domains; i++)
        {
                for(int j = 0; j < LL[i].size(); j++)
                {
                        L[cumulative_npts[i]+j] = LL[i][j];
                }
        }

        return L;



In terms of the scaling:
1 thread --> 0.175 s, 2 threads --> 0.094 s, 4 threads --> 0.054 s, 8 threads --> 0.034 s, 12 threads --> 0.041 s.
A single iteration of the for loop takes either 0.0035 s or 0.007 s (there are 2 sizes of grids being used).

Sorry for the long post! I'd just like some help understanding why the scaling isn't better. I tried reading some material on false sharing to see if that was the cause, but I'm afraid I don't completely understand why it could be a problem.
asen
 
Posts: 3
Joined: Sat Jan 21, 2012 11:21 pm

Re: OpenMP Scaling Questions

Postby ftinetti » Mon Jan 23, 2012 8:10 am

Hi,

I think the succession of times up to 8 threads is rather expected if you have 8 cores, of course. The time for 12 threads is not terribly strange either, but we would need some data to understand it better, such as:
Number of Processor/s:
Number of core/s:
OS:
Compiler, version, compiler options:

HTH.
ftinetti
 
Posts: 571
Joined: Wed Feb 10, 2010 2:44 pm

Re: OpenMP Scaling Questions

Postby asen » Tue Jan 24, 2012 11:11 pm

There are 2 processors, each with 6 cores. The OS is Red Hat Enterprise 5.7. I'm using g++ 4.1.2 with the flags -O2 -DNDEBUG -lfftw3 -lm -w -larpack -fopenmp.
asen
 
Posts: 3
Joined: Sat Jan 21, 2012 11:21 pm

Re: OpenMP Scaling Questions

Postby ftinetti » Wed Jan 25, 2012 6:53 am

I see, thanks. If there are 12 cores in total, it might be better to look at the performance with 10 and 11 threads, so as to avoid contention with other processes (the OS, user applications, etc.). Even in that case (i.e., leaving a core or two to other processes), the performance gain could be less than expected, taking into account that there is not much processing to begin with (tens of milliseconds).

HTH.
ftinetti
 
Posts: 571
Joined: Wed Feb 10, 2010 2:44 pm

Re: OpenMP Scaling Questions

Postby asen » Mon Jan 30, 2012 4:46 pm

Well, the scaling is just as bad for 10 and 11 threads. Regardless, I believe that part of the issue was load balancing, so I wrote a piece of code to assign each patch to a thread before the actual computation. Now I'm executing the parallelism as:

Code: Select all
#pragma omp parallel shared(L)
{
    function_thread(omp_get_thread_num(),&L[0]);
}


where "function_thread" is just a function that iterates over all the patches (and writes the result to L) that are designated to be part of the current thread. This, however, still causes problems. Executing each thread in serial, I see that each one takes 0.013 to 0.015s to run. However, when I actually use the parallelism, the fastest thread takes 0.018s (which seems reasonable) but the slowest thread takes 0.028s. I'm timing using omp_get_wtime(). I've also checked the CPU times for the total work, and using one thread it is 0.17s; using 12 threads it is 0.36 s. For some reason, the amount of work being done seems to increase when I use OpenMP. Does this suggest anything obvious?
asen
 
Posts: 3
Joined: Sat Jan 21, 2012 11:21 pm

Re: OpenMP Scaling Questions

Postby ftinetti » Tue Jan 31, 2012 5:18 am

Load-imbalance issues are not always easy to solve. However, I would not say

For some reason, the amount of work being done seems to increase when I use OpenMP.

since what seems to increase is the threads' contention for some resource; I would suggest memory and, more specifically, cache memory. Another possible issue is synchronization, in case you have to avoid race conditions, but that does not seem to be the case from what I understand of your description.

Does this suggest anything obvious?

Well, no... or at least not to me. My question would be "Does it have a solution?" and my answer would be that it depends...
a) If the problem is cache contention, then it could be solved by changing the access and computation patterns, but I know this could be a major algorithmic change.
b) If the problem is an unbalanced workload, then it could be solved by changing function_thread() in an appropriate way... where "appropriate" is hard to define in terms of complexity.

As you can see, I do not have any really useful suggestion... just some guesses...
ftinetti
 
Posts: 571
Joined: Wed Feb 10, 2010 2:44 pm

Re: OpenMP Scaling Questions

Postby MarkB » Tue Feb 21, 2012 10:03 am

The problem could be due to communicating data between cores and/or sockets: the vector L is (presumably) being initialised on the master thread, accessed by all threads in the parallel region, then accessed by the master thread again at the end. The observed load imbalance can come from the fact that some threads will have the data cached locally (or stored in the local memory of their socket), while others will not. You could try parallelising the initialisation of the vector and see if that helps.
MarkB
 
Posts: 427
Joined: Thu Jan 08, 2009 10:12 am
