I wrote a parallel program using a hybrid MPI/OpenMP model (matrix-vector multiplication) and ran it on a cluster (16 nodes, 2 sockets per node, 4 cores per socket). I performed a series of tests for different numbers of MPI processes and OpenMP threads; this graph shows the results:

The X-axis is the MPI process count, the Y-axis is time, and the legend gives the number of threads (CCS and CRS are the matrix storage formats).
My question is: why is there a speed-up with 2 threads, but not with a larger number of threads?
One thing I noticed is that when I change the thread affinity from "KMP_AFFINITY=granularity=fine,scatter" to "KMP_AFFINITY=granularity=fine,compact", I get these results:

I suppose this is related to the NUMA architecture and shared memory, but I would like to understand the details. Can anyone explain what is going on?
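
For reference, here is a minimal Python sketch of how the two affinity settings mentioned above are commonly described as placing threads on one node of this cluster (2 sockets, 4 cores per socket). It is only a simplified model, not the actual runtime behaviour: it assumes a single process whose affinity mask covers all 8 cores of the node, no hyper-threading, and a made-up core numbering. The function names `place_threads` and `threads_per_socket` are hypothetical and introduced only for illustration. With several MPI processes per node, the MPI launcher's own pinning may restrict which cores each process can use, which this model ignores.

```python
from collections import Counter

SOCKETS_PER_NODE = 2   # node layout taken from the question
CORES_PER_SOCKET = 4   # node layout taken from the question


def place_threads(num_threads, policy):
    """Return one (socket, core) pair per thread under a simplified model.

    Assumption: "compact" fills one socket before using the next,
    "scatter" deals threads out across sockets in round-robin order.
    """
    if num_threads > SOCKETS_PER_NODE * CORES_PER_SOCKET:
        raise ValueError("model covers at most one thread per core")
    placement = []
    for t in range(num_threads):
        if policy == "compact":
            # Consecutive threads sit next to each other on the same socket.
            socket, core = divmod(t, CORES_PER_SOCKET)
        elif policy == "scatter":
            # Consecutive threads alternate between the sockets.
            socket = t % SOCKETS_PER_NODE
            core = t // SOCKETS_PER_NODE
        else:
            raise ValueError("policy must be 'compact' or 'scatter'")
        placement.append((socket, core))
    return placement


def threads_per_socket(num_threads, policy):
    """Count how many threads the model puts on each socket."""
    counts = Counter(socket for socket, _ in place_threads(num_threads, policy))
    return [counts.get(s, 0) for s in range(SOCKETS_PER_NODE)]


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "compact", threads_per_socket(n, "compact"),
              "scatter", threads_per_socket(n, "scatter"))
```

Under this model, 2 or 4 threads stay on one socket with compact but are split across both sockets with scatter, so the two settings differ in which socket's memory each thread is close to. Whether that accounts for the timings above is exactly what I am unsure about.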
