Many implementations use "pools of threads", so the first parallel region takes a bit longer than the following regions. This is because the threads are actually created ("gotten") when the first region is encountered. After that they are reused. This is one reason that an OMP_SET_STACKSIZE call has not been added to the OpenMP spec: so that threads can be reused. This, of course, is not the only way that OpenMP can be implemented.
In most cases the user has no way of knowing how OpenMP is implemented. While the GNU manual doesn't state how it is done, the code is available to look at (though I have never done so). As for Intel, I believe they use a pool approach (from my conversations with Intel engineers in the past). You can look at the various literature on the web for more information (do a Yahoo search on something like "+openmp +intel +pool"):
(see the header "Thread Pooling" on the following page:) http://www.intel.com/software/products/ ... atform.htm
(search for "pool" in the following document:)
This seems to indicate that the Intel compiler does what you want.
I do know that Sun's implementation of OpenMP uses "thread pools" (since I work on it).