very poor efficiency on simple benchmark code


Postby joe2748 » Wed Nov 23, 2011 5:14 pm

I have written a very simple code to test OpenMP; its job is to estimate pi.

However, on my Lenovo W510 with
cpu: Intel Core i7 @ 1.6 GHz
os: Fedora Core 15
compiler: gcc 4.6.1, 4.4.6, and Intel Composer XE 2011
I get very poor efficiency with 2 and 4 threads.
When using a friend's laptop (an old Core Duo) I see nearly perfect efficiency with 2 threads,
and when using the work desktop with a 12-core AMD I see nearly perfect efficiency with 2-12 threads.

Below is the code:

#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <omp.h>
#include <math.h>


double estimate_pi(double radius, int nsteps){

   int i;
   double h=2*radius/nsteps;
   double sum=0;
   for (i=1;i<nsteps;i++){
      sum+=sqrt(pow(radius,2)-pow(-radius+i*h,2));
      //sum+=.5*sum;
   }
   sum*=h;
   sum=2*sum/(radius*radius);
   //printf("radius:%f --> %f\n",radius,sum);
   return sum;


}

int main(int argc, char* argv[]){

   
   double ser_est,par_est;
   long int radii_range;
   if (argc>1) radii_range=atoi(argv[1]);
   else radii_range=500;   

   int nthreads;
   if (argc>2) nthreads=atoi(argv[2]);
   else nthreads=omp_get_num_procs();
   
   printf("Estimating Pi by averaging %ld estimates.\n",radii_range);
   printf("OpenMP says there are %d processors available.\n",omp_get_num_procs());

   int r;
   double start, stop, serial_time, par_time;



   par_est=0;
   double tmp=0;
   ser_est=0;
   start=omp_get_wtime();
   for (r=1;r<=radii_range;r++){
      tmp=estimate_pi(r,1e6);
      ser_est+=tmp;
   }
   stop=omp_get_wtime();
   serial_time=stop-start;
   ser_est=ser_est/radii_range;

   omp_set_num_threads(nthreads);
   start=omp_get_wtime();
   #pragma omp parallel for private(r,tmp) reduction(+:par_est)
   for (r=1;r<=radii_range;r++){
      tmp=estimate_pi(r,1e6);
      par_est+=tmp;
   }
   stop=omp_get_wtime();
   par_time=stop-start;
   par_est=par_est/radii_range;
   
   printf("Serial Estimate: %f\nParallel Estimate:%f\n\n",ser_est,par_est);
   printf("Serial Time: %f\nParallel Time:%f\nNumber of Threads: %d\nSpeedup: %f\nEfficiency: %f\n",serial_time,par_time,nthreads,serial_time/par_time, serial_time/par_time/nthreads);


}




I've compiled this with gcc 4.6.1, gcc 4.4.6 and the free trial of Intel's compiler. I've used GOMP_CPU_AFFINITY to keep threads
from being scheduled on "cores" that are actually the same physical core but look like 2 due to hyperthreading. Regardless of compiler, I
get results similar to
[joe@w]$ export GOMP_CPU_AFFINITY="0,1"
[joe@w]$ ./bench1 500 2
Estimating Pi by averaging 500 estimates.
OpenMP says there are 2 processors available.
Serial Estimate: 3.141593
Parallel Estimate:3.141593

Serial Time: 6.127195
Parallel Time:6.108861
Number of Threads: 2
Speedup: 1.003001
Efficiency: 0.501501

[joe@w]$ export GOMP_CPU_AFFINITY="0,2"
[joe@w]$ ./bench1 500 2
Estimating Pi by averaging 500 estimates.
OpenMP says there are 2 processors available.
Serial Estimate: 3.141593
Parallel Estimate:3.141593

Serial Time: 6.068996
Parallel Time:3.568947
Number of Threads: 2
Speedup: 1.700500
Efficiency: 0.850250


So setting the processor affinity is helping, but I am still not getting efficiency near 1.
It looks like:
a) there is plenty of work to go around
b) there is no interprocess communication
c) there is not too much synchronization
d) there should be no problem with false sharing (I think)

So I can't see why I can't get higher efficiency with this code on this machine.
On the work machine

joe@d$ ./bench1 500 2
Estimating Pi by averaging 500 estimates.
OpenMP says there are 48 processors available.
Serial Estimate: 3.141593
Parallel Estimate:3.141593

Serial Time: 8.334875
Parallel Time:4.152589
Number of Threads: 2
Speedup: 2.007151
Efficiency: 1.003576



I notice that the work machine gets perfect efficiency, although the serial
code takes longer than on my laptop! The work machine has 4
AMD Opteron 6172 processors at 2.1 GHz and runs Ubuntu 11.04.
As mentioned before, a friend's old Core Duo also gives an efficiency of 1.

Does anyone have any idea why my efficiency is so poor? Using 4 cores
I go down to efficiency of about .6.

Thanks for the help!
joe2748
 
Posts: 3
Joined: Wed Nov 23, 2011 4:46 pm

Re: very poor efficiency on simple benchmark code

Postby ftinetti » Thu Nov 24, 2011 1:21 pm

Hi,

I've experimented a little bit with your code and the efficiency is ok (0.98) in a dual quad-core Xeon. Some minor comments/questions on your performance problem:
1) Sharing execution cores via (multi)/(hyper)thread usually tends to penalize performance.
2) An efficiency of 0.85 (the one you achieved by "solving" the shared-core penalty via GOMP_CPU_AFFINITY) is not bad at all; I would be pretty happy with that result.
3) Is it possible that your i7 computer has some frequency scaling enabled (it's rather usual by default on laptops)?
4) Is it possible there are other CPU-bound programs running thus having some CPU contention?

HTH.
ftinetti
 
Posts: 567
Joined: Wed Feb 10, 2010 2:44 pm

Re: very poor efficiency on simple benchmark code

Postby joe2748 » Thu Nov 24, 2011 11:30 pm

Thanks for the reply ftinetti. I realize that .85 is pretty good efficiency, but on all other computers I have tried (and now yours),
efficiency is much higher. The real issue is that the real application I'm working on is also displaying much worse efficiency on
the laptop. As far as I can tell it should be possible to get efficiency near 1, and I don't understand why I can't.

I have been careful not to run the code while anything else might be tying up the processors, and in my BIOS I chose "best performance"
on all settings, so I don't think frequency scaling is an issue.

However, I have just noticed something truly confusing. I put this laptop to sleep last night, and just picked it up. It has not been shut down,
nothing has been recompiled, nothing changed since I posted my results last night. However, this is the output of the program currently:


[joe@w benchmarks]$ export GOMP_CPU_AFFINITY="0,2,4,6"
[joe@w benchmarks]$ ./bench1 500 2
Estimating Pi by averaging 500 estimates.
OpenMP says there are 4 processors available.
Serial Estimate: 3.141593
Parallel Estimate:3.141593

Serial Time: 11.213485
Parallel Time:5.601070
Number of Threads: 2
Speedup: 2.002025
Efficiency: 1.001013
[joe@w benchmarks]$ ./bench1 500 4
Estimating Pi by averaging 500 estimates.
OpenMP says there are 4 processors available.
Serial Estimate: 3.141593
Parallel Estimate:3.141593

Serial Time: 11.227187
Parallel Time:2.785253
Number of Threads: 4
Speedup: 4.030940
Efficiency: 1.007735



Amazingly, after putting the computer to sleep the serial code takes longer, but the efficiency is in line with what I expect from
running the code on other computers. Again, there has been no recompile, no nothing. Just put the computer to sleep and wake it
back up....

Any explanation?
joe2748
 
Posts: 3
Joined: Wed Nov 23, 2011 4:46 pm

Re: very poor efficiency on simple benchmark code

Postby ftinetti » Fri Nov 25, 2011 3:54 am

Hi,

I'm even more surprised by eff. > 1! (but I know it happens, sometimes...)

Well, I think we now have more data supporting a "scaling" or "frequency variation" issue. I think (sorry, I don't have references on this) that the BIOS frequency settings do not necessarily determine the OS settings. The OS usually takes its settings from its own configuration data, even when the BIOS is configured for "best performance". But don't take my word for it; I don't know enough about this to explain what really happens...

Another issue I always forget (and also do not know well enough... sorry): the i7 has something called "Intel Turbo Boost Technology",
which I think your processor has; take a look at
http://ark.intel.com/products/43122/Int ... e-1_60-GHz)
if your i7 model is the 720QM, or look up Turbo Boost for your specific model.
Turbo Boost lets the processor raise its clock frequency above the nominal value when power, current, and temperature limits allow,
http://www.intel.com/content/www/us/en/ ... ology.html
and (I think) it could affect performance measurements like yours, which span only a few seconds. I think this measurement "noise" could be reduced if the sequential as well as the parallel runtime is on the order of several minutes. Remember: this is only my guess...

HTH.
ftinetti
 
Posts: 567
Joined: Wed Feb 10, 2010 2:44 pm

Re: very poor efficiency on simple benchmark code

Postby joe2748 » Fri Nov 25, 2011 2:31 pm

Thanks ftinetti!

It looks like it is Turbo Boost affecting my performance.
I haven't figured out how to disable Turbo Boost altogether, but I have read that it does not work on battery power.
So on AC power, and thus with Turbo Boost on, I get

[joe@w]$ ./bench1 500 4
Estimating Pi by averaging 500 estimates.
OpenMP says there are 4 processors available.
Serial Estimate: 3.141593
Parallel Estimate:3.141593

Serial Time: 5.801716
Parallel Time:2.334106
Number of Threads: 4
Speedup: 2.485627
Efficiency: 0.621407


Unplugging the AC to turn off turboboost results in
[joe@w]$ ./bench1 500 4
Estimating Pi by averaging 500 estimates.
OpenMP says there are 4 processors available.
Serial Estimate: 3.141593
Parallel Estimate:3.141593

Serial Time: 11.022563
Parallel Time:2.774667
Number of Threads: 4
Speedup: 3.972571
Efficiency: 0.993143


So it looks like that solves the mystery! Turboboost is scaling up the frequency
when I'm using fewer cores, thus lowering the apparent efficiency of the parallel
code.

In general, this is a nice feature. But it makes it a little difficult to determine
how effective your parallel programming is. I guess I will search for a way to turn
it off for a while, until I finish working on this project.

Thanks again for the help!
joe2748
 
Posts: 3
Joined: Wed Nov 23, 2011 4:46 pm

Re: very poor efficiency on simple benchmark code

Postby ftinetti » Fri Nov 25, 2011 3:10 pm

I see, thanks for the details. I've been searching for a while, and it seems (again, no guarantee...) that Turbo Boost is
enabled/disabled in the BIOS. Some links saying this (neither fully read nor checked... I do not have an i7 at hand... well... I do not know anybody near me having one, after all...)
http://www.intel.com/support/processors ... 029908.htm
http://www.tomshardware.com/forum/27458 ... urbo-boost
http://forum.notebookreview.com/lenovo- ... -t410.html
http://forum.notebookreview.com/lenovo- ... boost.html

I agree that this specific feature makes parallel performance evaluation a real problem...

Edit: and I think disabling Turbo Boost is the only fair way to evaluate parallel performance, because an i7 with Turbo Boost enabled behaves like a different processor in the serial run than in the parallel run (even though the physical cores are the same, the different frequency changes performance a lot).

Well... thank you again.
ftinetti
 
Posts: 567
Joined: Wed Feb 10, 2010 2:44 pm

Re: very poor efficiency on simple benchmark code

Postby bjland2 » Mon Nov 28, 2011 9:12 pm

Maybe you need a better computer. :)
bjland2
 
Posts: 1
Joined: Mon Nov 28, 2011 9:08 pm

Re: very poor efficiency on simple benchmark code

Postby ftinetti » Tue Nov 29, 2011 3:41 am

"Maybe you need a better computer."

Why? What would "a better computer" be?
ftinetti
 
Posts: 567
Joined: Wed Feb 10, 2010 2:44 pm

Re: very poor efficiency on simple benchmark code

Postby ono » Thu May 10, 2012 8:47 am

FYI, Turbo Boost can be disabled and re-enabled on demand using the following tools/methods per operating system:

On my Sandy Bridge i5 iMac with 4 cores, the 4-core/single-core speedup is no higher than 3.2x when Turbo Boost is enabled (the default). When I disable it, the speedup gets close to 3.9x, which is the expected value.
ono
 
Posts: 1
Joined: Thu May 10, 2012 8:32 am
Location: Kraków, Poland

Re: very poor efficiency on simple benchmark code

Postby miliksitek » Sun Sep 08, 2013 5:17 am

His computer is good enough.
miliksitek
 
Posts: 3
Joined: Sat Sep 07, 2013 1:41 am

