[Omp] Overhead of #pragma omp for static nowait
Greg Bronevetsky
greg at bronevetsky.com
Fri Dec 8 13:58:36 PST 2006
I mean the following compiler transformation:
#pragma omp for schedule(static,1) nowait
for(int i=0; i<n; i++){}
should become:
for(int i=omp_get_thread_num(); i<n; i+=omp_get_num_threads())
{}
and
#pragma omp for schedule(static) nowait
for(int i=0; i<n; i++){}
should become:
// number of threads that get 1 more iteration than the others
// (these are the threads with ids 0 .. midPoint-1)
int midPoint=n%omp_get_num_threads();
// number of iterations assigned to threads with smaller ids
int itersBeforeMe;
if(omp_get_thread_num()<midPoint)
  itersBeforeMe = omp_get_thread_num()*(n/omp_get_num_threads()+1);
else
  itersBeforeMe = midPoint*(n/omp_get_num_threads()+1)+
    (omp_get_thread_num()-midPoint)*(n/omp_get_num_threads());
// number of iterations assigned to this thread
int numIter;
if(omp_get_thread_num()<midPoint)
  numIter = n/omp_get_num_threads()+1;
else
  numIter = n/omp_get_num_threads();
for(int i=itersBeforeMe; i<itersBeforeMe+numIter; i++)
{}
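The block partition can be checked without an OpenMP runtime. The following is only a minimal sketch: the names static_block_range, block_ranges_tile and iter_range are made up for illustration, and tid and nthreads are passed as ordinary parameters (an assumption for testability) where the transformed loop would call omp_get_thread_num() and omp_get_num_threads(). It assumes the threads with ids below n % nthreads each take one extra iteration.

```c
#include <assert.h>

/* Half-open iteration range [begin, end) owned by one thread. */
typedef struct { int begin; int end; } iter_range;

/* Block partition of n iterations over nthreads threads.
 * Assumption: tid and nthreads are plain parameters here; inside a
 * parallel region they would come from omp_get_thread_num() and
 * omp_get_num_threads(). */
iter_range static_block_range(int n, int tid, int nthreads)
{
    assert(n >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    int base  = n / nthreads;   /* iterations every thread gets */
    int extra = n % nthreads;   /* the first `extra` threads get one more */
    iter_range r;
    if (tid < extra) {
        r.begin = tid * (base + 1);
        r.end   = r.begin + base + 1;
    } else {
        r.begin = extra * (base + 1) + (tid - extra) * base;
        r.end   = r.begin + base;
    }
    return r;
}

/* Returns 1 if the per-thread ranges tile [0, n) with no gaps or overlaps. */
int block_ranges_tile(int n, int nthreads)
{
    int next = 0;
    for (int tid = 0; tid < nthreads; tid++) {
        iter_range r = static_block_range(n, tid, nthreads);
        if (r.begin != next || r.end < r.begin)
            return 0;
        next = r.end;
    }
    return next == n;
}
```

For n=10 and 4 threads this gives the ranges [0,3), [3,6), [6,8), [8,10): threads 0 and 1 take three iterations each, threads 2 and 3 take two.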
Other chunk sizes or loop bounds would need more complex arithmetic to
set up the loop bounds, but the basic idea is the same. The overall
cost of the above implementation of
"#pragma omp for schedule(static,1) nowait" should be several ns per
iteration. However, I am seeing much higher overheads in my experiments.
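As a rough illustration of the "other chunk sizes" case, here is a sketch of the serial loop nest for a static schedule with an arbitrary chunk size, where chunks are dealt to threads round-robin. The function names and the hits array (a stand-in for the loop body) are hypothetical, and tid and nthreads are again plain parameters standing in for omp_get_thread_num() and omp_get_num_threads(). With chunk equal to 1 it reduces to the strided loop shown earlier.

```c
#include <assert.h>

/* Visit the iterations a chunked static schedule would give to thread
 * tid: this thread owns chunks tid, tid + nthreads, tid + 2*nthreads, ...
 * Assumption: tid and nthreads are parameters here rather than calls to
 * omp_get_thread_num() and omp_get_num_threads().
 * hits[i] is incremented for each visited i, standing in for the body. */
void static_chunked_visit(int n, int chunk, int tid, int nthreads, int *hits)
{
    assert(chunk > 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    int stride = chunk * nthreads;          /* distance between my chunks */
    for (int start = tid * chunk; start < n; start += stride) {
        int end = (start + chunk < n) ? start + chunk : n;
        for (int i = start; i < end; i++)
            hits[i]++;
    }
}

/* Returns 1 if, across all threads, every iteration in [0, n) is visited
 * exactly once. Limited to n <= 1024 to keep the sketch allocation-free. */
int chunked_covers_once(int n, int chunk, int nthreads)
{
    int hits[1024] = {0};
    assert(n >= 0 && n <= 1024);
    for (int tid = 0; tid < nthreads; tid++)
        static_chunked_visit(n, chunk, tid, nthreads, hits);
    for (int i = 0; i < n; i++)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

The only per-thread setup is one multiplication for the stride and one for the starting index, which is why the bookkeeping would be expected to cost a handful of cycles rather than thousands.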
Greg Bronevetsky
On Fri, 8 Dec 2006, Meadows, Lawrence F wrote:
> What do you mean by "converting to a set of serial loops"?
>
> -----Original Message-----
> From: omp-bounces at openmp.org [mailto:omp-bounces at openmp.org] On Behalf Of Greg Bronevetsky
> Sent: Friday, December 08, 2006 12:48 PM
> To: omp at openmp.org
> Subject: [Omp] Overhead of #pragma omp for static nowait
>
> I have recently executed the EPCC microbenchmarks on several machines and
> noticed that there is a consistent overhead of ~1us (~several thousand
> cycles) for #pragma omp for static nowait and its variants on the
> platforms I've tried. Given the simplicity of this scheduling policy, it
> seems to me that it should be possible to convert the parallel loop into a
> set of serial loops at compile-time. This would result in a loop that
> requires no inter-thread communication and costs only a few tens of
> cycles.
>
> What is the reason for this much higher than expected overhead? Is it just
> that the above compiler analysis is not typically performed, or is there a
> more fundamental reason? Here at LLNL, we have applications that would
> like to use OpenMP to parallelize loops with ~50 iterations and ~0.25us of
> work per iteration. ~1us overheads for the #pragma omp for static nowait
> make OpenMP too expensive for this task.
>
> Greg Bronevetsky
>