reducing OpenMP overhead in inner loop

General OpenMP discussion

Postby robnit » Thu Aug 21, 2008 6:03 am

Hi there,

I would like to parallelize the following problem (using Visual C++ 2008 on Windows) so that it runs at full load on a dual-core machine:

Code:
...
for (int j=0;j<Lines;j++)
      {
#pragma omp parallel for
         for (int k=0;k<Columns;k++)
         {
            do_some_calculations();
         }
      }
...


Lines is on the order of 1E6, Columns on the order of 100. I cannot parallelize the outer loop, since (in my real application) it is a timestep integration and the results of each timestep depend on the results of the previous one. So I am left trying to parallelize the inner (smaller) loop.
If I place the for directive as shown above, the execution time on a dual-core machine using both processors is actually larger than when executing the code on only one processor. If I slightly change the code to:

Code:
...
#pragma omp parallel
      {
      for (int j=0;j<Lines;j++)
      {
#pragma omp for
         for (int k=0;k<Columns;k++)
         {
            do_some_calculations();
         }
      }
      }
...


the execution time on the dual-core machine is now lower, but only by about 20%.
If I am not completely wrong, this is probably due to the OpenMP overhead involved in creating the threads and distributing the workload. That's probably also the reason why the second version is slightly faster (thread creation done outside the outer loop), but I am not 100% sure.
So what I would need is something like a thread pool to create the threads before the loops. Also, since Columns is a constant number which I know beforehand, it would be great if I could tell OpenMP before the loops how to distribute the workload. That way, OpenMP does not have to calculate the work distribution each time it encounters a for directive.

Is there a way to do this (or similar things) in OpenMP? Any other suggestions to solve my problem?

Any help is highly appreciated.

Thanks,

Robert

Re: reducing OpenMP overhead in inner loop

Postby geoff » Mon Aug 25, 2008 6:02 am

Hi Robert,

What about something like this:

Code:
...
#pragma omp parallel
      {
      int thread_id = omp_get_thread_num();
      int num_threads = 2;
     
      for (int j=0;j<Lines;j++)
      {
#pragma omp for
         for (int k=thread_id;k<Columns;k = k+num_threads)
         {
            do_some_calculations();
         }
      }
      }
...


Geoff

Re: reducing OpenMP overhead in inner loop

Postby mwolfe » Mon Aug 25, 2008 3:29 pm

Geoff's reply is invalid: for an 'omp for', the loop limits must be computed identically on all threads, and using thread_id in the loop bounds violates that.

Your second example does most of what you want: the 'thread pool' is created and instantiated once at the 'omp parallel' directive. All threads execute the 'for j' loop, and at the 'omp for' on the k loop, the OpenMP runtime distributes the k iterations among the threads. The default distribution depends on the compiler; you can choose one explicitly with a schedule clause. Try '#pragma omp for schedule(static)' or 'schedule(static,10)'.

You can find training for OpenMP at www.compunity.org

-mw

Re: reducing OpenMP overhead in inner loop

Postby geoff » Tue Aug 26, 2008 5:36 am

You are right MW, it should have been:

Code:
...
int thread_id;
#pragma omp parallel private(thread_id)
      {
      thread_id = omp_get_thread_num();
      int num_threads = 2;
     
      for (int j=0;j<Lines;j++)
      {
         for (int k=thread_id;k<Columns;k = k+num_threads)
         {
            do_some_calculations();
         }
      }
      }
...


This way you can measure the overhead associated with scheduling the threads using different methods (dynamic, static, ...).

However, I never noticed the j loop; I was focusing on k only (so my first example is garbage). If j is shared it will be updated unexpectedly, and if do_some_calculations() uses j there will be trouble in both my example and Robert's.

Geoff

