OpenMP C++ Not able to get linear speedup with no. of cores

General OpenMP discussion

OpenMP C++ Not able to get linear speedup with no. of cores

Postby vmaddire » Mon Jun 24, 2013 11:37 am

Please see the below results and let me know where I can optimize my code further to get a better speedup.

Result

Machine used: Mac Book Pro Processor: 2.5 GHz Intel Core i5(at least 4 logical cores)
Memory: 4GB 1600 MHz Compiler: Mac OSX Compiler

Sequential Time:0.016466
Using two threads:0.0120111
Using four threads:0.0109911(Speed Up ~ 1.5)
Using 8 threads: 0.0111289

II Machine: OS: Linux Hardware: Intel(R) Core™ i5-3550 CPU @ 3.30GHz × 4 Memory: 7.7 GiB Compiler: G++ Version 4.6

Sequential Time:0.0128901
Using two threads:0.00838804
Using four threads:0.00612688(Speed up = 2)
Using 8 threads: 0.0101049

Please let me know what's the overhead in my code that is not giving a linear speedup. There is nothing much in the code. I am calling the function "findParallelUCHWOUP" in the main function like this:

#pragma omp parallel for private(th_id)
for (th_id = 0; th_id < nthreads; th_id++)
findParallelUCHWOUP(points, th_id + 1, nthreads, inp_size, first[th_id], last[th_id]);

Code:
http://codepad.org/Rrf29jNi
vmaddire
 
Posts: 4
Joined: Mon Jun 24, 2013 11:01 am

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby MarkB » Mon Jun 24, 2013 5:55 pm

Does the scalability get better if you increase the number of points (say by a factor of 10 or 100) or not?
MarkB
 
Posts: 477
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby vmaddire » Mon Jun 24, 2013 6:43 pm

Not by much. I increases the number of points by a factor of 10. It's almost the same.
vmaddire
 
Posts: 4
Joined: Mon Jun 24, 2013 11:01 am

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby MarkB » Tue Jun 25, 2013 3:32 am

That suggests that it may be memory bandwidth which is limiting the scalability: your code looks quite bandwidth intensive, since it trawls through the array of points and doesn't do very much computation with the loaded data.
MarkB
 
Posts: 477
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby vmaddire » Tue Jun 25, 2013 11:38 am

Could you please tell me how to avoid this? Or is there a tool that I can use to identify these memory access patterns? Any help would be highly appreciated.

Thanks,
Vijay
vmaddire
 
Posts: 4
Joined: Mon Jun 24, 2013 11:01 am

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby MarkB » Tue Jun 25, 2013 12:53 pm

vmaddire wrote:Could you please tell me how to avoid this? Or is there a tool that I can use to identify these memory access patterns? Any help would be highly appreciated.


There's not much you can do without restructuring your code to improve the data re-use (I guess this is really part of a larger code).
Possibly the best tool for identifying this kind of problem is Rogue Wave's Threadspotter, but it does not come cheap!
You can do a certain amount with hardware counters (e.g. via the PAPI interface, but it does need some experience.

Hope that helps,
Mark.
MarkB
 
Posts: 477
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby vmaddire » Tue Jun 25, 2013 1:22 pm

Thanks for your quick response. Any pointers on how to restructure the code so that the memory access times reduce? May be a tutorial or some documentation.

It is really strange that a simple divide step is not scaling with the number of processors. The algorithm very much supports SPMD style and I thought OpenMp is perfect for these kind of algorithms.

Just a Comment:
And even if we figure out a way to tune the data access to suit this hardware. May be It needs a further tuning when I run it on a different hardware? Is there a way to write multi-threaded applications at a much higher abstraction which gives the same speedup on any hardware?

Thanks,
Vijay
vmaddire
 
Posts: 4
Joined: Mon Jun 24, 2013 11:01 am

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby MarkB » Tue Jun 25, 2013 2:16 pm

Thanks for your quick response. Any pointers on how to restructure the code so that the memory access times reduce? May be a tutorial or some documentation.


You are welcome. Essentially you need to do as much computation as possible with data that gets loaded from main memory into cache. This might help: http://users.ece.cmu.edu/~franzf/papers/gttse07.pdf

It is really strange that a simple divide step is not scaling with the number of processors. The algorithm very much supports SPMD style and I thought OpenMp is perfect for these kind of algorithms.


OpenMP is perfect: it's the hardware that is the problem, not the language!

And even if we figure out a way to tune the data access to suit this hardware. May be It needs a further tuning when I run it on a different hardware? Is there a way to write multi-threaded applications at a much higher abstraction which gives the same speedup on any hardware?


Optimisation is always to some extent hardware-specific, but general improvements such as improving cache locality will likely be beneficial on most systems.
MarkB
 
Posts: 477
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: OpenMP C++ Not able to get linear speedup with no. of co

Postby ftinetti » Wed Jun 26, 2013 6:53 am

Hi,

Just a few comments, beyond those of Mark (which I suggest you take into account, btw):

Any pointers on how to restructure the code so that the memory access times reduce? May be a tutorial or some documentation.

Specifically in this case, maybe you could look for the so called tiling or blocking optimization.

It is really strange that a simple divide step is not scaling with the number of processors. The algorithm very much supports SPMD style and I thought OpenMp is perfect for these kind of algorithms.


Specifically looking at your code, I'm not sure about load balancing, since the workload depends on processing at findParallelUCHWOUP(...) which, in turn, seems to depend on the values of inp_size, first[th_id] and last[th_id] (maybe I'm wrong)... did you try to measure workload? Maybe just measuring findParallelUCHWOUP(...) runtime for the each pair first-last would be enough...

HTH,

Fernando.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm


Return to Using OpenMP

Who is online

Users browsing this forum: Yahoo [Bot] and 15 guests