Overhead cost

Post by dondilworth » Sat Nov 24, 2012 1:37 pm

My Fortran program runs a DO loop as follows:

      IF (INDEX .EQ. NTHREADS) THEN ! ALL LOADED; TRACE RAYS NOW

!$OMP PARALLEL SHARED( /MTRAYS/,/MTCOM/,/MTPTRACE/,/RAY/ ) IF (TEST) NUM_THREADS (ISFLAGS(173))
!$OMP DO SCHEDULE(STATIC,1)
         DO I = 1,INDEX
            CALL MTRAYTRACE(I)
         ENDDO
!$OMP END DO
!$OMP END PARALLEL
         GO TO 8801
      ENDIF

This works as it should, and on a very complicated problem I get about a 4X speed improvement on my 8-core AMD PC. This is very nice! The subroutine chain uses local and automatic variables, and all shared data are in indexed arrays in named common blocks. The program comes in from the top with a collection of cases to run. After it runs them all in parallel, it goes to statement 8801, where the next batch of cases is set up, and then comes in from the top again. Thus this parallel DO section is run many times. Integer INDEX is the number of cores to run. Here are the statistics for the complicated job:

CORE   0      1      2      3      4      6      8
Time   10.0   5.41   2.88   2.47   2.49   2.51   2.54

The problem is that when I give it a rather simple problem, it takes longer to run in multithread mode than in serial mode.

CORE   0       1       2       3       4       6       8
Time   0.273   0.285   0.293   0.293   0.300   0.316   0.324

I suspect that there is some overhead in creating and managing the threads, and this exceeds the time saving for the simple problem.

Did I use OpenMP correctly, is the overhead issue known and real, and is there any way to speed things up even more? Is this the best structure for this kind of problem?

Re: Overhead cost

Post by ftinetti » Tue Nov 27, 2012 4:12 am

Hi,

dondilworth wrote: I suspect that there is some overhead in creating and managing the threads, and this exceeds the time saving for the simple problem.

Agreed.

dondilworth wrote: Did I use OpenMP correctly, is the overhead issue known and real, and is there any way to speed things up even more? Is this the best structure for this kind of problem?

In order: I think you used OpenMP correctly; yes, the overhead issue is known and real; I don't think there is a way to speed things up much more; and whether this is the best structure is hard to say, but I don't think so.

Some overhead problems are mitigated by setting OMP_WAIT_POLICY to passive. If you try, share the results, please.

I'm curious about your hardware/software setup (specific processor model, compiler, compiler options, etc.). Could you give some extra details?

Fernando.

Re: Overhead cost

Post by MarkB » Tue Nov 27, 2012 4:38 am

Hi there,

The overhead for a parallel region is typically in the 10-100 microseconds range (depending on the number of threads, compiler and hardware used).
If you know how many times the parallel region is executed, you should be able to figure out whether this explains your results.

Hope that helps,
Mark.

Re: Overhead cost

Post by dondilworth » Tue Nov 27, 2012 12:31 pm

I have the following:

AMD FX-8120 Eight-Core Processor
16.0 GB
64-bit OS

My C++ command line is

/Zi /nologo /W2 /WX- /O2 /Ot /Oy- /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /GF /Gm- /EHsc /MT /GS /Gy- /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fp".\Release\SYNOPSYS200.pch" /Fa".\Release\" /Fo".\Release\" /Fd".\Release\" /FR".\Release\" /Gd /analyze- /errorReport:queue

and my Fortran command line is

/nologo /O3 /Ob0 /Oy- /Qipo /I"Release/" /reentrancy:none /extend_source:132 /Qopenmp /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Release/" /object:"Release/" /Fd"Release\vc100.pdb" /check:none /libs:dll /threads /winapp /c

I guess that going through the parallel loops is very fast, but when I come in from the top with another batch of data maybe the system has to set up all the threads all over again. This is only a guess, since I don't know what's going on in the background -- and that's why I submitted the question. Or does it only have to expend the overhead once?

I'll look up the wait policy option and see if I can figure it out. I'll let you know what happens.

If this is not the best structure, then perhaps you can suggest a better arrangement if there is one.

If the overhead in my case were to be 100 us, or 0.0001 seconds, and I enter the loop 1000 times, that's 0.1 seconds of overhead, which is in line with my results. So maybe there's no way to improve it. Still, I keep hoping.

Re: Overhead cost

Post by dondilworth » Tue Nov 27, 2012 12:43 pm

Well, I thought it would be simple. I added to my C++ code

retint = setenv( OMP_WAIT_POLICY, passive, 1 ); // try to speed up OpenMP stuff

and got the error message

Error 1 error C2065: 'OMP_WAIT_POLICY' : undeclared identifier c:\synopsysv14\synopsys.cpp 725 1 SYNOPSYS200

My headers include

#include <stdlib.h>

What did I do wrong? Can I set this from a Fortran code instead of from C++?

Re: Overhead cost

Post by ftinetti » Tue Nov 27, 2012 2:27 pm

Hi,

OS? Compiler?

dondilworth wrote: Well, I thought it would be simple. I added to my C++ code

retint = setenv( OMP_WAIT_POLICY, passive, 1 ); // try to speed up OpenMP stuff

and got the error message

Error 1 error C2065: 'OMP_WAIT_POLICY' : undeclared identifier c:\synopsysv14\synopsys.cpp 725 1 SYNOPSYS200

My headers include

#include <stdlib.h>

What did I do wrong? Can I set this from a Fortran code instead of from C++?


I suggest you set the environment variable before starting the program i.e. in the command line/shell, since the OpenMP environment/threads would be already set up by the time the first line of code of the program is executed.

dondilworth wrote: AMD FX-8120 Eight-Core Processor

I'm not surprised by the overhead and the lack of improvement when the run time is below 1 second, but something looks strange in

CORE   0      1      2      3      4      6      8
Time   10.0   5.41   2.88   2.47   2.49   2.51   2.54

since there is no improvement for 6 and 8 cores...

Fernando.

Re: Overhead cost

Post by MarkB » Wed Nov 28, 2012 5:37 am

dondilworth wrote: I guess that going through the parallel loops is very fast, but when I come in from the top with another batch of data maybe the system has to set up all the threads all over again. This is only a guess, since I don't know what's going on in the background -- and that's why I submitted the question. Or does it only have to expend the overhead once?


Typically threads are kept alive between parallel regions, but the overhead is still there for every parallel region: a good part of this comes from simply synchronising the threads at the end of the region.

It might be useful to use OMP_GET_WTIME() to figure out how much time is spent in the parallel region versus the rest of the code.

Re: Overhead cost

Post by dondilworth » Wed Nov 28, 2012 1:02 pm

ftinetti wrote: I suggest you set the environment variable before starting the program i.e. in the command line/shell, since the OpenMP environment/threads would be already set up by the time the first line of code of the program is executed.

This is helpful, but being a dummy I don't know how to do that. Where, exactly, in the Property Pages do I change what to set that environment variable?

Re: Overhead cost

Post by ftinetti » Wed Nov 28, 2012 1:47 pm

Hi,

ftinetti wrote: I suggest you set the environment variable before starting the program i.e. in the command line/shell, since the OpenMP environment/threads would be already set up by the time the first line of code of the program is executed.

dondilworth wrote: This is helpful, but being a dummy I don't know how to do that. Where, exactly, in the Property Pages do I change what to set that environment variable?


No problem, I can try to help (no guaranteed success...). I need you to tell me about your OS and the way you are currently running your program.

Also, you could follow Mark's suggestion:

MarkB wrote: It might be useful to use OMP_GET_WTIME() to figure out how much time is spent in the parallel region versus the rest of the code.

since it would provide valuable information.

Fernando.

Re: Overhead cost

Post by dondilworth » Fri Nov 30, 2012 7:14 am

Fernando:

I have Windows 7 64 bit, and I run from Visual Studio 2010 Professional. It is a mixed-language project that starts in C++ which then calls Fortran subroutines. Does that tell you what you need to know?

Regarding the strange timings I posted above, I am also puzzled. Why would eight cores run slower than 4? Also, why would 1 core (which goes through the OpenMP loop once per trip) run faster than 0 cores, which does the same thing but not inside an OMP construct? Weird. I speculate that Windows has to use some of the cores for its own purposes, and if I try to access them I have to wait my turn. Is that plausible?

DD