[Omp] Overhead of #pragma omp for static nowait
James Beyer
beyerj at cray.com
Fri Dec 8 14:25:57 PST 2006
Several questions:
How many threads are used and are they active when the parallel loop is hit?
What is the call overhead of the omp_get_thread_num() and
omp_get_num_threads() and any other calls the compiler might have to insert?
Have you looked at what the compiler is generating?
What compilers have you tried?
james
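
One rough way to approach the call-overhead question above is to time the two runtime calls in a tight loop. The sketch below is an illustration under stated assumptions, not something taken from this thread: avg_ns_per_query_pair is a hypothetical helper, the stubs stand in for the runtime when the file is built without OpenMP support, and clock() reports CPU time at coarse resolution, so only an average over many repetitions is meaningful.

```c
#include <time.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Stand-ins so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: average CPU time, in nanoseconds, of one
 * omp_get_thread_num() + omp_get_num_threads() pair, measured on the
 * calling thread over `reps` repetitions (reps must be > 0). */
static double avg_ns_per_query_pair(long reps)
{
    volatile long sink = 0;  /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (long r = 0; r < reps; r++)
        sink += omp_get_thread_num() + omp_get_num_threads();
    clock_t stop = clock();
    (void)sink;
    return 1e9 * (double)(stop - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Calling avg_ns_per_query_pair with a few million repetitions from inside a parallel region gives a per-thread estimate. The result also includes the loop and accumulation cost, so it is best read as an upper bound on the cost of the calls themselves.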
Greg Bronevetsky wrote:
> I mean the following compiler transformation:
> #pragma omp for schedule(static,1) nowait
> for(int i=0; i<n; i++){}
> should become:
> for(int i=omp_get_thread_num(); i<n; i+=omp_get_num_threads())
> {}
>
> and
> #pragma omp for schedule(static) nowait
> for(int i=0; i<n; i++){}
> should become:
> // number of threads that each get 1 more iteration than the others
> int midPoint=n%omp_get_num_threads();
> // number of iterations assigned to threads with smaller ids
> int itersBeforeMe;
> if(omp_get_thread_num()<midPoint)
>   itersBeforeMe = omp_get_thread_num()*(n/omp_get_num_threads()+1);
> else
>   itersBeforeMe = midPoint*(n/omp_get_num_threads()+1)+
>     (omp_get_thread_num()-midPoint)*(n/omp_get_num_threads());
> // number of iterations assigned to this thread
> // (only threads with id < midPoint get the extra iteration)
> int numIter;
> if(omp_get_thread_num()<midPoint)
>   numIter = n/omp_get_num_threads()+1;
> else
>   numIter = n/omp_get_num_threads();
>
> for(int i=itersBeforeMe; i<itersBeforeMe+numIter; i++)
> {}
>
> Other chunk sizes or loop bounds would involve more complex arithmetic to
> set up the loop bounds, but the basic idea is much the same. The overall
> cost of the above implementation of "#pragma omp for schedule(static,1) nowait"
> should be several ns per iteration. However, I am seeing much higher
> overheads in my experiments.
>
> Greg Bronevetsky
>
> On Fri, 8 Dec 2006, Meadows, Lawrence F wrote:
>
>
>> What do you mean by "converting to a set of serial loops"
>>
>> -----Original Message-----
>> From: omp-bounces at openmp.org [mailto:omp-bounces at openmp.org] On Behalf
>> Of Greg Bronevetsky
>> Sent: Friday, December 08, 2006 12:48 PM
>> To: omp at openmp.org
>> Subject: [Omp] Overhead of #pragma omp for static nowait
>>
>> I have recently executed the EPCC microbenchmarks on several machines and
>> noticed that there is a consistent overhead of ~1us (~several thousand
>> cycles) for #pragma omp for static nowait and its variants on the
>> platforms I've tried. Given the simplicity of this scheduling policy, it
>> seems to me that it should be possible to convert the parallel loop into a
>> set of serial loops at compile-time. This would result in a loop that
>> requires no inter-thread communication and costs only a few tens of
>> cycles.
>>
>> What is the reason for this much-higher-than-expected overhead? Is it just
>> that the above compiler analysis is not typically performed, or is there a
>> more fundamental reason? Here at LLNL, we have applications that would
>> like to use OpenMP to parallelize loops with ~50 iterations and ~0.25us of
>> work per iteration. ~1us overheads for the #pragma omp for static nowait
>> make OpenMP too expensive for this task.
>>
>> Greg Bronevetsky
>>
>> _______________________________________________
>> Omp mailing list
>> Omp at openmp.org
>> http://openmp.org/mailman/listinfo/omp
>>
>>
>>
>
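
The index arithmetic in the two transformations quoted above can be checked without an OpenMP runtime by passing the thread id and thread count in explicitly. The following is a minimal sketch under that assumption; static_block_range, static_cyclic_count and partitions_cover are hypothetical names, and a real compiler may split the iterations differently, since the specification only asks for chunks of approximately equal size when no chunk size is given.

```c
/* Hypothetical helper: contiguous block [*first, *first + *count) that
 * thread `tid` of `nthreads` would run for a loop of `n` iterations.
 * The first n % nthreads threads get one extra iteration. */
static void static_block_range(int n, int nthreads, int tid,
                               int *first, int *count)
{
    int base  = n / nthreads;   /* iterations every thread gets */
    int extra = n % nthreads;   /* number of threads that get one more */

    if (tid < extra) {
        *first = tid * (base + 1);
        *count = base + 1;
    } else {
        *first = extra * (base + 1) + (tid - extra) * base;
        *count = base;
    }
}

/* Hypothetical helper: number of iterations thread `tid` runs in the
 * cyclic loop  for (i = tid; i < n; i += nthreads). */
static int static_cyclic_count(int n, int nthreads, int tid)
{
    int count = 0;
    for (int i = tid; i < n; i += nthreads)
        count++;
    return count;
}

/* Returns 1 if the per-thread blocks tile [0, n) with no gaps or overlaps
 * and the cyclic counts also sum to n; returns 0 otherwise. */
static int partitions_cover(int n, int nthreads)
{
    int next = 0, cyclic_total = 0;
    for (int tid = 0; tid < nthreads; tid++) {
        int first, count;
        static_block_range(n, nthreads, tid, &first, &count);
        if (first != next)
            return 0;
        next += count;
        cyclic_total += static_cyclic_count(n, nthreads, tid);
    }
    return next == n && cyclic_total == n;
}
```

For the 50-iteration loops mentioned in the thread and a team of 4 threads (the team size is an assumption), threads 0 and 1 each get 13 iterations and threads 2 and 3 each get 12. Using tid <= extra in the count test instead of tid < extra would hand out one iteration too many whenever n is not a multiple of the thread count.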