Massive runtime increase when using omp

Postby cellard0or » Mon Feb 24, 2014 4:47 am

Hey there,
now that the results of my OpenMP-augmented program are correct, I have another issue:
When running the code below on a 24-core node (one node of the Cray XC30 system), the runtime of program.run() increases with the number of threads used. A single-threaded run() takes 17s, whereas with 24 threads each call needs 33(!)s (each thread executes run() once). The runtime does not decrease when re-entering run(), so this is not just initialization overhead.
I know that there is some overhead in using omp, but since all threads work completely independently, I think something is wrong. When I just start 24 processes on one node instead of the omp program, they all take only 19s, which confirms that something is wrong with my usage of omp.
The whole code does not contain any other omp statements. There are also no global variables left.
I am stuck here and cannot find out why the omp version runs so much slower. I could just use the process approach mentioned above but I am very curious about any misuse and/or bug which could cause these bad runtimes.
By profiling the execution one can see that with more threads a function "lock_wait_private" takes more and more runtime. I cannot imagine where these locks could come from though (assuming the loop scheduling is not too expensive).
This work is part of my master's thesis and I would be very happy if this could be cleared up. Feel free to ask for further information!

Code:
class Program {
    Mat image;
    ...
    void run( string inputFileName ) {
        ...
        someFunctionInAnotherFile( image, ... ); // call by reference, image gets manipulated
        ...
    }
    ...
};

int main( ) {
    // inputVector (the list of input file names) is filled before this point
    #pragma omp parallel default(none) shared(inputVector)
    {
        Program program; // declared inside the region: one private instance per thread

        #pragma omp for schedule(guided,1)
        for( unsigned int inputNumber = 0; inputNumber < inputVector.size( ); ++inputNumber ) {
            program.run( inputVector[ inputNumber ] );
        }
    }
}
cellard0or
 
Posts: 10
Joined: Thu Feb 13, 2014 8:17 am

Re: Massive runtime increase when using omp

Postby MarkB » Tue Feb 25, 2014 3:39 am

cellard0or wrote:By profiling the execution one can see that with more threads a function "lock_wait_private" takes more and more runtime.


This may be a symptom of contention for memory allocation: does your code do a lot of new and deletes inside the parallel region?

Which compiler are you using on the Cray?
MarkB
 
Posts: 480
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Massive runtime increase when using omp

Postby cellard0or » Wed Feb 26, 2014 5:50 am

There is no new/delete in our code, only a couple of function-local variables. I cannot tell what the library we use (opencv) does internally, though.
And why would the possible contention be no problem in the process approach (which is much faster, as I wrote)?
Could it be that the OpenMP threads are allocating memory near the main process and not near their executing cores?

I am using gcc on the Cray machine. Due to our use of opencv I have not been able to compile with Intel or Cray compiler to date.
cellard0or
 
Posts: 10
Joined: Thu Feb 13, 2014 8:17 am

Re: Massive runtime increase when using omp

Postby MarkB » Thu Feb 27, 2014 4:45 am

If you have separate processes, each process allocates off its own heap, so there is no contention. With a threaded program, threads share the same heap, so contention can occur.
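To make the shared-heap point concrete, here is a minimal sketch of how one might check for allocator contention. It assumes the slowdown comes from many short-lived heap allocations; `allocHeavyWork` and `timePerThread` are hypothetical names, and the loop is only a stand-in for whatever a library does internally. Every thread runs the same allocation-heavy work and its wall time is recorded, so the per-thread times can be compared for 1, 2 and 24 threads.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
#include <chrono>
// Serial fallbacks so the sketch still builds without -fopenmp.
static double omp_get_wtime() {
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical stand-in for a routine that allocates temporaries internally:
// one short-lived heap block per iteration.
std::size_t allocHeavyWork(int iterations) {
    std::size_t checksum = 0;
    for (int i = 0; i < iterations; ++i) {
        std::vector<int> tmp(64 + i % 64, i); // heap allocation
        checksum += tmp.size();               // use the block so it is not optimized away
    }                                         // tmp is freed here
    return checksum;
}

// Runs the same work once on every thread and returns each thread's wall time.
std::vector<double> timePerThread(int iterations) {
    std::vector<double> seconds(omp_get_max_threads(), 0.0);
    #pragma omp parallel
    {
        const double start = omp_get_wtime();
        volatile std::size_t sink = allocHeavyWork(iterations);
        (void)sink;
        seconds[omp_get_thread_num()] = omp_get_wtime() - start;
    }
    return seconds;
}
```

If the per-thread times grow as OMP_NUM_THREADS increases although the threads share no data, the allocator is a plausible suspect. This does not prove it, since memory bandwidth and thread placement can produce a similar picture.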

cellard0or wrote:Could it be that the OpenMP threads are allocating memory near the main process and not near their executing cores?


This can be a problem if memory is allocated by the master thread. Do you see any speedup if you use 2 threads?

cellard0or wrote:I am using gcc on the Cray machine. Due to our use of opencv I have not been able to compile with Intel or Cray compiler to date.


It would be worth trying to get the code to run with another compiler if possible. You can also try the CNL malloc environment variable settings recommended here,
http://www.nersc.gov/users/computationa ... g-options/ if they are not the default on your system.

It would also be worth checking that your threads are being correctly bound to cores using this program https://github.com/olcf/XC30-Training/b ... ity/Xthi.c
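For a quick look at thread placement without the linked program, a much simpler sketch along these lines may be enough. It is not a copy of Xthi.c: it assumes Linux with glibc (for `sched_getcpu`), and `sampleThreadCpus` and `printThreadCpus` are hypothetical names. It only reports the core each thread was running on at the moment of sampling, not the full affinity mask.

```cpp
#include <sched.h>   // sched_getcpu (assumption: Linux with glibc)
#include <cstddef>
#include <cstdio>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch still builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Returns, for each OpenMP thread, the CPU it was running on when sampled.
// Entries stay at -1 for threads that did not take part in the region.
std::vector<int> sampleThreadCpus() {
    std::vector<int> cpus(omp_get_max_threads(), -1);
    #pragma omp parallel
    {
        cpus[omp_get_thread_num()] = sched_getcpu();
    }
    return cpus;
}

// Prints one line per thread so that unbound or stacked threads stand out.
void printThreadCpus() {
    const std::vector<int> cpus = sampleThreadCpus();
    for (std::size_t t = 0; t < cpus.size(); ++t)
        std::printf("thread %zu on cpu %d\n", t, cpus[t]);
}
```

Calling `printThreadCpus()` several times during a run gives a rough picture: with working pinning each thread should keep reporting the same core, and no two threads should report the same one.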
MarkB
 
Posts: 480
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Massive runtime increase when using omp

Postby cellard0or » Thu Feb 27, 2014 4:54 am

Hello again,

I just implemented the multithreading myself (it was not a big deal in this case) and it runs exactly as slowly as the OMP version.
This means there is some synchronization (or some other unnecessary slowdown) happening when using threads which does not happen when using stand-alone processes.
I see 2 possible causes for this: either (as I posted above) the threads allocate memory near the main process, which is slow on NUMA nodes,
or the threads need to synchronize while allocating memory and therefore cannot exploit the underlying memory parallelism.
On second thought, the latter explanation seems more likely, because maybe the threads do not allocate memory themselves but their main process does it, which would inevitably introduce synchronization. Is that the case? (I don't know enough about processes/threads at a low level to answer this myself.)

EDIT: Funny, I just wanted to submit this text when your reply appeared^^ I will answer your post in another post!
cellard0or
 
Posts: 10
Joined: Thu Feb 13, 2014 8:17 am

Re: Massive runtime increase when using omp

Postby cellard0or » Thu Feb 27, 2014 5:04 am

MarkB wrote:This can be a problem if memory is allocated by the master thread. Do you see any speedup if you use 2 threads?

Yes, when using only 2 threads there is almost no slowdown. And you answered a question from my last post: The main process will do the allocation so there has to be synchronization. Maybe I should let the master just idle so that he can do all the allocation work faster.

MarkB wrote:It would also be worth checking that your threads are being correctly bound to cores using this program https://github.com/olcf/XC30-Training/b ... ity/Xthi.c

Well, I already tried to use thread pinning, but I will check with the program whether that worked, thanks. EDIT:

MarkB wrote:It would be worth trying to get the code to run with another compiler if possible. You can also try the CNL malloc environment variable settings recommended here,
http://www.nersc.gov/users/computationa ... g-options/ if they are not the default on your system.

Switching to another compiler seems much too involved at this time; maybe we can try that later. But I will try out the malloc parameters you mentioned. EDIT: This did not improve runtimes.
cellard0or
 
Posts: 10
Joined: Thu Feb 13, 2014 8:17 am

Re: Massive runtime increase when using omp

Postby MarkB » Thu Feb 27, 2014 6:52 am

cellard0or wrote: The main process will do the allocation so there has to be synchronization. Maybe I should let the master just idle so that he can do all the allocation work faster.


I'm not sure I explained clearly enough! If memory is allocated by the master thread outside of parallel regions, then by default all this memory may be physically allocated on the same NUMA domain, which can lead to a bottleneck if threads running on cores not in this NUMA domain all access the data.

If the threads in the parallel region are calling opencv routines, which are internally allocating and freeing memory, then the above is not a problem and leaving the master idle won't help.
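For the first case described in this post (data allocated by the master thread and then used by all threads), the usual remedy is parallel initialization. The following is a minimal sketch, assuming the operating system places a page on the NUMA domain of the thread that first writes to it, as Linux does by default. `makeFirstTouched` and `sumAll` are hypothetical names. It does not apply to memory that a library allocates internally inside the parallel region.

```cpp
#include <cstddef>
#include <memory>

// Allocates n doubles and initializes them in parallel, so that under a
// first-touch policy each page ends up near the thread that wrote it first.
std::unique_ptr<double[]> makeFirstTouched(std::size_t n, double value) {
    // new double[n] leaves the elements uninitialized: no page is touched yet.
    std::unique_ptr<double[]> data(new double[n]);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i)
        data[i] = value; // first write decides where the page is placed
    return data;
}

// Later loops use the same static schedule, so each thread mostly reads
// the pages it touched first.
double sumAll(const double* data, std::size_t n) {
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < static_cast<long>(n); ++i)
        total += data[i];
    return total;
}
```

The important detail is that the initialization loop and the later compute loops use the same static schedule. A container such as std::vector would zero-fill its elements on the constructing thread and touch every page there, which is why the sketch uses an uninitialized array.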
MarkB
 
Posts: 480
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Massive runtime increase when using omp

Postby cellard0or » Thu Feb 27, 2014 7:16 am

There is no allocation outside a parallel region except the fileName array which is used to distribute the work.
cellard0or
 
Posts: 10
Joined: Thu Feb 13, 2014 8:17 am

Re: Massive runtime increase when using omp

Postby ftinetti » Thu Feb 27, 2014 7:28 am

Hi,

Maybe this is just noise (I apologize if that's the case), but there seems to be a problem when new/malloc is used intensively in multithreaded (OpenMP) code, as suggested in

http://stackoverflow.com/questions/7992 ... ithreading

It seems that a "different malloc" should be used, and I'm guessing (a poor-quality guess, I'd say) that this could be just as involved as switching compilers.

Just my two cents,

Fernando.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm

Re: Massive runtime increase when using omp

Postby cellard0or » Fri Feb 28, 2014 11:32 am

That could indeed be a possible cause. At least it shows that I am not alone with that problem.
Thank you for your reply!
cellard0or
 
Posts: 10
Joined: Thu Feb 13, 2014 8:17 am

