Hybrid (MPI/OpenMP) programming problem (NUMA?)

Post by gavell » Sat Apr 14, 2012 10:17 am

Hello everyone!
I wrote a parallel program in the hybrid model (matrix-vector multiplication) and ran it on a cluster (16 nodes, 2 sockets per node, 4 cores per socket). I performed a series of tests for different numbers of MPI processes and OpenMP threads; this graph shows the results:
The X-axis is the MPI process count, the Y-axis is time, and the legend gives the number of threads (CCS and CRS are the matrix storage formats).
My question is: why is there a speed-up for 2 threads, but not for larger numbers of threads?
One thing I noticed is that when I change the thread affinity from "KMP_AFFINITY=granularity=fine,scatter" to "KMP_AFFINITY=granularity=fine,compact", I get these results:
I suppose this is related to the NUMA architecture and shared memory, but I would like to know the details. Maybe someone can help me.

Re: Hybrid (MPI/OpenMP) programming problem (NUMA?)

Post by MarkB » Tue Apr 17, 2012 6:51 am

Are you always running one MPI process per node in your experiments? If so, then you may just be seeing memory bandwidth saturation.
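
As a rough illustration of why a sparse matrix-vector product tends to hit the memory bandwidth limit long before it runs out of cores, here is a back-of-the-envelope sketch. It assumes a CRS matrix with 8-byte values and 4-byte column indices, and it ignores the traffic for the vectors and row pointers. The function names and the 10 GB/s figure are hypothetical placeholders, not measurements from this cluster.

```python
def crs_spmv_intensity(value_bytes=8, index_bytes=4):
    """Flops per byte of matrix data streamed in a CRS matrix-vector product."""
    flops_per_nonzero = 2  # one multiply and one add per stored nonzero
    bytes_per_nonzero = value_bytes + index_bytes  # assumed storage sizes
    return flops_per_nonzero / bytes_per_nonzero


def bandwidth_bound_gflops(bandwidth_gb_per_s, intensity):
    """Upper limit on GFLOP/s if memory bandwidth is the only constraint."""
    return bandwidth_gb_per_s * intensity


intensity = crs_spmv_intensity()
print(f"{intensity:.3f} flop/byte")
# 10 GB/s is a made-up bandwidth, used only to show the arithmetic.
print(f"{bandwidth_bound_gflops(10.0, intensity):.2f} GFLOP/s at a hypothetical 10 GB/s")
```

With so few flops per byte moved, extra threads on a socket mostly wait on memory, which is consistent with the saturation described above.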

In the first graph, the two-thread case is likely running one thread per socket, so you get access to twice the memory bandwidth
than with one thread, but adding further threads gives no extra benefit because the memory bandwidth is already being used up by the first two threads.

In the second graph, the first four threads probably run on the same socket, so you see little benefit from them,
but threads 4-7 run on the second socket, and you again see an increase in available bandwidth.
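
The reasoning above can be sketched as a toy model. It assumes an idealised placement (scatter alternates threads between sockets, compact fills one socket before using the next) and that a single thread is enough to saturate its socket's memory bandwidth. Real KMP_AFFINITY mappings depend on the machine topology, so treat the function names and the model itself as hypothetical illustrations rather than a description of what the runtime does.

```python
SOCKETS = 2           # from the cluster description: 2 sockets per node
CORES_PER_SOCKET = 4  # and 4 cores per socket


def socket_of(thread, policy):
    """Socket a thread lands on under an idealised placement policy."""
    if policy == "scatter":
        return thread % SOCKETS            # alternate between sockets
    if policy == "compact":
        return thread // CORES_PER_SOCKET  # fill one socket before the next
    raise ValueError(f"unknown policy: {policy}")


def relative_bandwidth(num_threads, policy):
    """Number of sockets in use, i.e. bandwidth relative to one saturated socket."""
    return len({socket_of(t, policy) for t in range(num_threads)})


for policy in ("scatter", "compact"):
    row = [relative_bandwidth(n, policy) for n in (1, 2, 4, 8)]
    print(policy, row)
```

In this model, scatter reaches both sockets as soon as there are two threads, while compact only reaches the second socket once there are more than four threads, which matches the shape of the two graphs as described.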
Location: EPCC, University of Edinburgh
