false sharing

General OpenMP discussion

false sharing

Postby HeinzM » Thu Feb 21, 2013 4:58 am

Hello,

I think I have a problem with "false sharing". I solve a large, sparse, banded matrix with a Cholesky factorisation. It works with OpenMP, but it does not scale very well. Using 8 cores instead of 4 gives no(!) speedup.

In "Using OpenMP" I read, what false sharing is and under what circumstances it probably occours: using shared data, writing these data in array-locations located near by each other, doing this very often.

I have exactly these conditions. The book recommends using private data instead of shared data. Of course I could copy the local parts of the matrix into threadprivate arrays and do the computation there, but the private data of the threads will be lost after the parallel region is finished.

I don't see how to copy the private data back, i.e. to gather it into one array that still exists after the end of the parallel region. Is there a possibility? I can't find one.


Thanks and best regards,
Heinz

Re: false sharing

Postby MarkB » Thu Feb 21, 2013 5:09 am

Hi Heinz,

One solution might be to declare a shared array with an extra outer dimension whose size is the number of threads. If each thread uses its thread ID to index this extra dimension, then false sharing should be largely avoided, and the data will persist beyond the parallel region.
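A minimal sketch of what this could look like (the names SKETCH, WORK, N and RESULT are illustrative, not taken from your code):

SUBROUTINE SKETCH(N, RESULT)
   USE omp_lib
   IMPLICIT NONE
   INTEGER, INTENT(IN)       :: N
   REAL(KIND=8), INTENT(OUT) :: RESULT(N)
   REAL(KIND=8), ALLOCATABLE :: WORK(:,:)
   INTEGER :: NT, TID, J
   NT = omp_get_max_threads()
   ALLOCATE(WORK(N, NT))                  ! one column per thread; columns are N*8 bytes apart
   WORK = 0.0D0
!$OMP PARALLEL PRIVATE(TID, J)
   TID = omp_get_thread_num() + 1         ! OpenMP thread IDs start at 0, Fortran indices at 1
!$OMP DO
   DO J = 1, N
      WORK(J, TID) = WORK(J, TID) + 1.0D0 ! each thread writes only its own column
   END DO
!$OMP END DO
!$OMP END PARALLEL
   RESULT = SUM(WORK, DIM=2)              ! WORK persists after the parallel region; combine it here
   DEALLOCATE(WORK)
END SUBROUTINE SKETCH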

If you could post the main computational loop(s) from the code, I may be able to advise you better!

Hope that helps,
Mark.

Re: false sharing

Postby HeinzM » Thu Feb 21, 2013 11:32 pm

Hi Mark,

MarkB wrote:One solution might be to declare a shared array with an extra outer dimension whose size is the number of threads. If each thread uses its thread ID to index this extra dimension, then false sharing should be largely avoided, and the data will persist beyond the parallel region.


Thank you, that could work. But it would depend on the length of a cache line. Do you know how long it is on an Intel i7 board? And which cache is important: L1, L2, or L3? I don't know much about these caches.

Regards
Heinz

Re: false sharing

Postby MarkB » Fri Feb 22, 2013 3:07 am

Hi Heinz,

HeinzM wrote:Hi Mark,
Thank you, that could work. But it would depend on the length of a cache line. Do you know how long it is on an Intel i7 board? And which cache is important: L1, L2, or L3? I don't know much about these caches.



You can get false sharing between L2 caches on a chip and between L3 caches on multiple sockets, though the latter is likely to be the worst in terms of performance hits.
I believe the line size is 64 bytes in both cases.

Mark.

Re: false sharing

Postby HeinzM » Fri Feb 22, 2013 5:23 am

Hi Mark,

64 bytes are 8 REAL(KIND=8) array elements. Now I'm sure I have a false sharing problem. I will try the extra dimension of an array.
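If you do try the extra dimension, one small refinement (just a sketch, assuming 64-byte lines and 8-byte reals; WORK, N, NT and TID are illustrative names) is to round each thread's slice up to a multiple of 8 elements, so that neighbouring slices can never end up in the same cache line:

   INTEGER, PARAMETER :: LINE_ELEMS = 8   ! 64 bytes / 8 bytes per REAL(KIND=8) element
   INTEGER :: NPAD
   NPAD = ((N + LINE_ELEMS - 1) / LINE_ELEMS) * LINE_ELEMS   ! round N up to a multiple of 8
   ALLOCATE(WORK(NPAD, NT))               ! each thread then uses only WORK(1:N, TID)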

Thanks and best regards,
Heinz

Re: false sharing

Postby HeinzM » Sun Feb 24, 2013 9:47 am

Hello,

I tried the approach with an extra dimension, but it has no effect. It even runs more slowly than without it. The extra overhead is not small, I think.

Maybe I do not have false sharing, I don't know...

Here is some example code showing the main structure. Perhaps someone has some advice for me.
(Could the EQUIVALENCE statement cause the problem?)

As I said, it works with OpenMP, but it does not scale very well. Using 8 cores instead of 4 gives no(!) speedup; it is even a little bit slower.



SUBROUTINE LE0130Sl()
!
!!! CHOLESKY factorisation of a sparse, banded, symmetric matrix,
!!! stored with a skyline scheme
!
REAL (KIND=8) :: FR
INTEGER :: FI(1)
COMMON FR(1) ! large array, size defined in the main program
EQUIVALENCE (FR(1),FI(1))



DO I=IA,N
!
!
! doing some bookkeeping overhead for the skyline storage
!
! the matrix is stored by columns in the one-dimensional array FR,
! the columns are NOT equally sized
!

!$OMP parallel do if(NB>4*NThreads) default(firstprivate) shared(FR,FI) schedule(guided,chunk)


DO K=KA,KE

! first compute the addresses, i.e. the start and the end of the dot products


ADRI= ......
ADRK= ......
ADRHI= ......
L= ......


! the following loops do most of the work:
!
! dot products of the columns of the matrix
! The length L of the dot product ranges from about 1000 down to 0, depending on the finite-element mesh.
! The length always decreases to 0 because of the triangular form of the matrix; therefore I used the if clause in the OpenMP directive.
!
! the vectors FR(ADRI+II) and FR(ADRK+II) do not overlap each other

IF(L.gt.0) THEN
S=0.0
DO II=0,L-1
S=S+FR(ADRI+II)*FR(ADRK+II)
END DO
FR(ADR)=(FR(ADR)-S)*FR(ADRHI) ! in this line I assumed false sharing, because I update the matrix element, but it seems not to be so
ELSE
FR(ADR)=FR(ADR)*FR(ADRHI)
END IF


END DO
!$OMP end parallel do
!
END DO
RETURN
END SUBROUTINE



Thank you and best regards,
Heinz

Re: false sharing

Postby MarkB » Mon Feb 25, 2013 3:55 am

Hi Heinz,

HeinzM wrote:Maybe I do not have false sharing, I don't know...


I think it's unlikely that false sharing is a big problem in this code: the writes to the shared array are relatively infrequent compared with the reads in the dot products.
I don't think the EQUIVALENCE is a problem either.

Let's try to eliminate some other possible causes of the bad scaling:

Are you convinced that the amount of time spent outside of parallel regions is insignificant? It may be worth confirming this by putting a timer around the parallel region and summing up the time spent in it.
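For example, a rough sketch with omp_get_wtime (the loop body here is only a stand-in for your real code; TPAR and the other names are illustrative):

SUBROUTINE TIME_PARALLEL_SKETCH(IA, N, TPAR)
   ! Rough sketch: accumulate the time spent inside the parallel region and
   ! compare it with the total time of the call to LE0130Sl.
   USE omp_lib
   IMPLICIT NONE
   INTEGER, INTENT(IN)       :: IA, N
   REAL(KIND=8), INTENT(OUT) :: TPAR
   REAL(KIND=8) :: T0, S
   INTEGER :: I, K
   TPAR = 0.0D0
   DO I = IA, N
      ! ... sequential skyline bookkeeping would go here ...
      T0 = omp_get_wtime()
!$OMP PARALLEL DO PRIVATE(S)
      DO K = 1, 1000
         S = REAL(K, KIND=8)              ! stand-in for the dot-product work
      END DO
!$OMP END PARALLEL DO
      TPAR = TPAR + (omp_get_wtime() - T0)
   END DO
END SUBROUTINE TIME_PARALLEL_SKETCH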

Is the amount of time in each parallel region long enough to offset the associated overheads? How long does the call to LE0130Sl take sequentially, and how many times is the parallel construct encountered (i.e. how many iterations are there in the DO I=IA,N loop)?

Is the load balance OK? Guided schedule may not work very well for loops where the amount of work decreases with iteration number, as it issues large chunks first. A quick test would be to run the parallel loop backwards (i.e. DO K=KE,KA,-1 ).

Ultimately, the code likely spends most of its time doing dot-products which are very memory bandwidth intensive. You may simply be hitting the hardware limits here. Are you on a multi-socket system? If so, then NUMA effects can cause serious bandwidth degradation. Is the array FR initialised sequentially or in parallel? In the former case, most likely all the data ends up being allocated on one socket, which can result in a bottleneck.
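On a NUMA system, a sketch of such a parallel (first-touch) initialisation would look something like this (NTOTAL is an assumed name for the total size of FR):

SUBROUTINE FIRST_TOUCH_INIT(FR, NTOTAL)
   ! Sketch: touching FR in parallel with a static schedule places each page
   ! near the threads that will later use it on a NUMA machine.
   IMPLICIT NONE
   INTEGER, INTENT(IN)         :: NTOTAL
   REAL(KIND=8), INTENT(INOUT) :: FR(NTOTAL)
   INTEGER :: J
!$OMP PARALLEL DO SCHEDULE(STATIC)
   DO J = 1, NTOTAL
      FR(J) = 0.0D0
   END DO
!$OMP END PARALLEL DO
END SUBROUTINE FIRST_TOUCH_INIT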

Hope that gives you a few ideas of things to try,
Mark.

Re: false sharing

Postby HeinzM » Mon Feb 25, 2013 1:09 pm

Hi Mark,

MarkB wrote:Hope that gives you a few ideas of things to try.


Many thanks for your effort. I will keep analysing and will be back soon.

Thanks and regards,
Heinz

Re: false sharing

Postby HeinzM » Thu Feb 28, 2013 1:31 pm

Hi Mark,

I checked what you advised.

MarkB wrote:Are you convinced that the amount of time spent outside of parallel regions is insignificant? It may be worth confirming this by putting a timer around the parallel region and summing up the time spent in it.

Yes, the amount of time spent outside of the parallel region is very low, about 1%.

MarkB wrote:Is the amount of time in each parallel region long enough to offset the associated overheads? How long does the call to LE0130Sl take sequentially, and how many times is the parallel construct encountered (i.e. how many iterations are there in the DO I=IA,N loop)?

The parallel construct is encountered once for each row of the sparse matrix, i.e. several hundred thousand times. This could be the bottleneck.

MarkB wrote:Is the load balance OK? Guided schedule may not work very well for loops where the amount of work decreases with iteration number, as it issues large chunks first. A quick test would be to run the parallel loop backwards (i.e. DO K=KE,KA,-1 ).

Running the loop in reverse is even a little bit slower, so the load balance seems to be OK.

MarkB wrote:Ultimately, the code likely spends most of its time doing dot-products which are very memory bandwidth intensive. You may simply be hitting the hardware limits here. Are you on a multi-socket system? If so, then NUMA effects can cause serious bandwidth degradation. Is the array FR initialised sequentially or in parallel? In the former case, most likely all the data ends up being allocated on one socket, which can result in a bottleneck.

No, I have no multi-socket system, no NUMA. I use a simple Intel i7 board with 4 cores and hyperthreading. Could hyperthreading be the bottleneck?

The first vector within the dot product is the same for all dot products in the loop. So I tried copying it into its own allocatable vector outside the matrix stored in FR(....). I thought that simultaneously reading the same array elements from several threads could slow things down. I did get a small speedup, but only 3%. However, the code became unreadable, so I dropped the idea.

I have no more ideas for achieving better scaling. It seems I have to accept that this particular code does not yield more. With four threads it speeds up by a factor of three; eight threads even slow it down a little.

Another code of mine, which does no dot products and no matrix-vector operations, scales well on the same hardware: eight threads give a speedup of nearly seven.

Mark, thank you for your help,
Best regards,
Heinz

Re: false sharing

Postby MarkB » Fri Mar 01, 2013 5:07 am

Hi Heinz,

HeinzM wrote:The parallel construct is encountered once for each row of the sparse matrix, i.e. several hundred thousand times. This could be the bottleneck.


The typical overhead for a parallel region is in the tens of microseconds range (it depends a bit on the implementation, the hardware, and the number of threads). For example, 10 microseconds of overhead incurred several hundred thousand times already adds up to a few seconds. If you know how long the whole call takes, you should be able to estimate whether this is part of the problem or not.

HeinzM wrote:No, I have no multi-socket system, no NUMA. I use a simple Intel i7 board with 4 cores and hyperthreading. Could hyperthreading be the bottleneck?


With a bandwidth-intensive code, you are unlikely to see much benefit from hyperthreading, since running one thread per core can already use up all the available bandwidth. Adding an extra thread per core can even cause some slowdown due to additional contention.

HeinzM wrote:The first vector within the dot product is the same for all dot products in the loop. So I tried copying it into its own allocatable vector outside the matrix stored in FR(....). I thought that simultaneously reading the same array elements from several threads could slow things down.


Simultaneous repeated reads should be no problem: each thread will get a copy of this data in its local cache.

My feeling is that this code is simply limited by the memory bandwidth, so there is likely not much you can do to improve the scaling.
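If you want to confirm that, one crude check (just a sketch, with an arbitrary array size) is to time a simple streaming loop with 1, 4 and 8 threads; if it stops scaling at roughly the same thread count as your factorisation, memory bandwidth is the limit:

PROGRAM TRIAD_SKETCH
   ! Crude, illustrative bandwidth test: A = B + 3*C over large arrays.
   USE omp_lib
   IMPLICIT NONE
   INTEGER, PARAMETER :: M = 20000000
   REAL(KIND=8), ALLOCATABLE :: A(:), B(:), C(:)
   REAL(KIND=8) :: T0
   INTEGER :: J
   ALLOCATE(A(M), B(M), C(M))
   B = 1.0D0
   C = 2.0D0
   T0 = omp_get_wtime()
!$OMP PARALLEL DO
   DO J = 1, M
      A(J) = B(J) + 3.0D0*C(J)
   END DO
!$OMP END PARALLEL DO
   PRINT *, 'triad time (s):', omp_get_wtime() - T0
   ! roughly 3 arrays * M elements * 8 bytes are moved; divide by the time for bytes/s
END PROGRAM TRIAD_SKETCH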

HeinzM wrote:Mark, thank you for your help


You're very welcome!
Mark.

