Issues distributing parallel functions to multiple threads

General OpenMP discussion

Postby zeusz4u » Mon Jan 23, 2012 5:49 am

Hi,

Can someone please tell me what I am doing wrong with the following?
Please take a look at the snippet below. The single-threaded version runs fine, but I want to distribute the 16 function calls to 4 cores, each executing 4 calls of the function, in parallel of course. After adding the pragmas it even takes longer to execute than the single-threaded version did. I don't want to distribute loop iterations, I've done that before; I want to distribute the computeBlack76() function calls to different threads. I have to do 16 calculations in one loop, so it would be optimal to have 4 threads running 4 calculations each.

This is the version that is not working (I found a two-threaded version of QuickSort implemented in the very same way):

Code:
for (int i = 0; i < numPasses; i++)
{
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        computeBlack76('C', 318, 72, 0.676712328767123, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 317, 208, 0.446575342465753, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 78, 125, 0.972602739726027, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 276, 398, 0.975342465753425, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 312, 81, 0.517808219178082, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 165, 390, 0.167123287671233, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 93, 337, 0.254794520547945, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 307, 256, 0.986301369863014, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 286, 168, 0.619178082191781, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 434, 92, 0.542465753424658, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 361, 199, 0.994520547945206, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 233, 393, 0.268493150684932, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 415, 103, 0.408219178082192, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 271, 175, 0.550684931506849, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 353, 370, 0.73972602739726, 0.05, 0.7);
        #pragma omp section
        computeBlack76('C', 163, 449, 0.495890410958904, 0.05, 0.7);
    }
}


It seems to me that my CPU is executing the very same 16 calculations 4 times, on the four different threads.
Is there a way to fix this?

I have been working on this all day and have not found an explanation for it.

I'd appreciate any kind of help or suggestion.
zeusz4u
 
Posts: 4
Joined: Mon Jan 23, 2012 5:41 am

Re: Issues distributing parallel functions to multiple threads

Postby ftinetti » Mon Jan 23, 2012 8:17 am

Hi,

It seems to me that my CPU is executing the very same 16 calculations 4 times, on the four different threads.

How do you know this?

HTH.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm

Re: Issues distributing parallel functions to multiple threads

Postby zeusz4u » Mon Jan 23, 2012 11:57 am

ftinetti wrote:Hi,

It seems to me that my CPU is executing the very same 16 calculations 4 times, on the four different threads.

How do you know this?

HTH.



I suppose this is happening because I'm monitoring CPU activity with the "top" command in Linux: I can see CPU usage go up to 400% (i.e. 4 cores), yet the overall execution time is much longer than in the single-threaded case.

Is there any way to tell the compiler to execute those calculations in parallel? What am I doing wrong?
zeusz4u
 
Posts: 4
Joined: Mon Jan 23, 2012 5:41 am

Re: Issues distributing parallel functions to multiple threads

Postby zeusz4u » Mon Jan 23, 2012 12:26 pm

I am measuring execution time with a high-resolution timer function:

clock_gettime(CLOCK_REALTIME, &time1);


In this regard, running the last pass - where I do 10 000 000 passes - it executes in 37 seconds on the Core i7 CPU using the above pragmas, but without them it executes in about 17 seconds. By using OMP on 4 cores, I would expect to get about 1/4 of the original execution time (or something similar, but in any case the program should be quicker than before).

Can someone point out the problem here?

I have used the QuickSort example I found as a base for my calculations. It's somewhat similar, but I have 16 calculations to divide among 4 threads, rather than just 2 between 2 threads:

Code:
void QuickSort(int numList[], int nLower, int nUpper)
{
   if (nLower < nUpper)
   {
      // create partitions
      int nSplit = Partition(numList, nLower, nUpper);
      #pragma omp parallel sections
      {
         #pragma omp section
         QuickSort(numList, nLower, nSplit - 1);

         #pragma omp section
         QuickSort(numList, nSplit + 1, nUpper);
      }
   }
}



A small remark: I'm using a 64-bit Linux server with CentOS 6.2, an Intel Core i7, and 50 GB of RAM. I tried compiling the code and making measurements with both the Intel and g++ compilers. My goal is to have this compiled and running with the Intel C++ compiler, and still get a better result than without OMP.
zeusz4u
 
Posts: 4
Joined: Mon Jan 23, 2012 5:41 am

Re: Issues distributing parallel functions to multiple threads

Postby ftinetti » Mon Jan 23, 2012 3:10 pm

Hi,

I suppose this is happening because I'm monitoring CPU activity with the "top" command in Linux: I can see CPU usage go up to 400% (i.e. 4 cores), yet the overall execution time is much longer than in the single-threaded case.


The top command is just telling you that all four processors are being used; the overall runtime is not necessarily spent on useful computing, just on computing. Example: with spin locks every core is busy, but just waiting for something to happen.

Is there any way to tell the compiler to execute those calculations in parallel? What am I doing wrong?

Well, that is what the sections construct is for... but now I think you are suggesting something different from your previous post... Anyway, there should be no problem such as a section being executed more than once, since the spec defines:
Each structured block is executed once by one of the threads in the team in the context
of its implicit task.

However, there could be a scheduling problem (from a performance point of view), since the spec defines:
The method of scheduling the structured blocks among the threads in the team is
implementation defined.

i.e. there could be an extreme case in which only one thread executes every section. This does not seem to be the case, since top reports 400% CPU usage, but anything else could be happening. My first suggestions would be:
1) Use the function omp_get_wtime() to measure execution time.
2) Set OMP_WAIT_POLICY to passive.
3) If every call to computeBlack76() takes about the same execution time, group them into 4 sections.
4) Using a for with a switch/case inside seems unnatural, but maybe it helps for checking performance measurements.
5) Tasks seem to be another natural way of distributing execution among threads.

HTH.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm

Re: Issues distributing parallel functions to multiple threads

Postby zeusz4u » Tue Jan 24, 2012 7:19 am

I tried running the program with 4 computations only, grouped in 4 sections, so each thread should be executing 1 calculation, as follows:

Code:
for (int i = 0; i < numPasses; i++)
{
   omp_set_num_threads(4);
   #pragma omp parallel sections
   {
      #pragma omp section
      {
         computeBlack76('C', 318, 72, 0.676712328767123, 0.05, 0.7);
      }
      #pragma omp section
      {
         computeBlack76('C', 312, 81, 0.517808219178082, 0.05, 0.7);
      }
      #pragma omp section
      {
         computeBlack76('C', 286, 168, 0.619178082191781, 0.05, 0.7);
      }
      #pragma omp section
      {
         computeBlack76('C', 415, 103, 0.408219178082192, 0.05, 0.7);
      }
   }
   #pragma omp barrier
}


I also inserted a barrier here, as I want all 4 threads to finish before starting the next iteration.

3) If every call to computeBlack76() takes about the same execution time, then group them in 4 sections.


Well, this would be the ideal case here: each of the 4 sections would require roughly the same execution time. Using the code above, I have to wait until the longest calculation finishes, and that determines the time required for each pass; even if one calculation takes slightly longer than the others, I would still expect at least half the original execution time here.

4) Using a for with a case inside seems to be unnatural, but maybe helps for checking performance measurements.

Can you be a little more specific? I don't understand either the benefits or the purpose of using a case statement, as I want to parallelize, not have different cases in each iteration that execute separately, one by one.

2) Set OMP_WAIT_POLICY to passive

I'm not sure whether it was set or not. I used the shell to set this environment variable:
Code:
OMP_WAIT_POLICY=PASSIVE

Before doing this I echoed the shell variable and it had no value; I don't think it was even defined. So it is now passive if I display its content. BTW, I understand that the #pragma omp barrier should have the same effect, shouldn't it?


I'm doing tests and step-by-step execution, but I still cannot figure out which thread executes which part of the code. I'm using Visual Studio 2010 with the Intel Parallel Studio extension... I've enabled OpenMP for the project, and I can see it being used. I have a breakpoint at the first section; in the Visual Studio debugger I see 1 master thread and 3 worker threads. I hit F10, it stays on the same line in the code, and it jumps to a different thread in the Threads window.
zeusz4u
 
Posts: 4
Joined: Mon Jan 23, 2012 5:41 am

Re: Issues distributing parallel functions to multiple threads

Postby ftinetti » Tue Jan 24, 2012 7:48 am

Hi,

Some simple suggestions:
1) If you want to see which thread executes each section, just use (and print the result of) the function omp_get_thread_num().
2) Having barriers is not the same as setting OMP_WAIT_POLICY to passive. The env. var. "controls" or "suggests" how waiting threads behave (e.g. at barriers); please see the spec for the full explanation.

I do not know VS, so I will not be able to help with that either...

HTH.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm

Re: Issues distributing parallel functions to multiple threads

Postby MarkB » Thu May 03, 2012 2:35 am

zeusz4u wrote:In this regard, running the last pass - where I do 10 000 000 passes - it executes in 37 seconds on the Core i7 CPU using the above pragmas, but without them it executes in about 17 seconds. By using OMP on 4 cores, I would expect to get about 1/4 of the original execution time (or something similar, but in any case the program should be quicker than before).

Can someone point out the problem here?


I think the clue is in these timings: each pass takes 1.7 microseconds when executed sequentially, and this is too small to offset the overheads of the parallel region (the parallel execution time suggests this overhead is of the order of 3 microseconds on 4 cores, which is around what I would expect).

So the problem is that there is not enough computation in the 16 function calls to offset the costs of parallelisation. Is there any scope in the application for doing more than these 16 calls in parallel?
MarkB
 
Posts: 481
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

