MarkB wrote:One solution might be to declare a shared array with an extra, outer, dimension of size the number of threads. If each thread uses its thread ID to index this extra dimension, then false sharing should be largely avoided, and the data will persist beyond the parallel region.
HeinzM wrote:Hi Mark,
Thank you, that could work. But it would depend on the length of a cache line. Do you know how long it is on an Intel i7 board? And which cache matters: L1, L2 or L3? I don't know much about these caches.
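To make the per-thread idea concrete, here is a minimal sketch in C (the same OpenMP constructs exist in Fortran). It assumes a 64-byte cache line, which is the line size on Intel Core i7 processors at all cache levels as far as I know. `CACHE_LINE`, `MAX_THREADS`, `padded_slot` and `dot_per_thread` are names made up for this illustration, not anything from the original code.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define MAX_THREADS 64  /* hypothetical upper bound on the thread count */

/* One slot per thread. The padding keeps the `value` fields of two
   different slots 64 bytes apart, so they never share a cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_slot;

/* Shared array: its contents persist after the parallel region ends. */
static padded_slot partial[MAX_THREADS];

double dot_per_thread(const double *x, const double *y, int n)
{
    assert(n >= 0);
    for (int t = 0; t < MAX_THREADS; ++t)
        partial[t].value = 0.0;

    #pragma omp parallel
    {
        int tid = 0;  /* stays 0 when built without OpenMP */
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        assert(tid < MAX_THREADS);

        /* Each thread accumulates only into its own slot. */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            partial[tid].value += x[i] * y[i];
    }

    /* Combine the per-thread results sequentially. */
    double sum = 0.0;
    for (int t = 0; t < MAX_THREADS; ++t)
        sum += partial[t].value;
    return sum;
}
```

For a plain sum a `reduction(+:sum)` clause would be simpler. The padded array is only worth the trouble when the per-thread data has to survive beyond the parallel region, as described above.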
HeinzM wrote:Maybe I do not have false sharing; I don't know.
MarkB wrote:Are you convinced that the amount of time spent outside of parallel regions is insignificant? It is maybe worth confirming this by putting a timer around the parallel region and summing up the time spent in it.
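A rough sketch of such a timer, in C for brevity (`omp_get_wtime` is also available in Fortran). The loop body is a stand-in, and `wall_seconds`, `parallel_seconds` and `scale_vector` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds. Without OpenMP this falls back to
   clock(), which measures CPU time rather than wall time. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Accumulated time spent inside the parallel region, over all calls. */
double parallel_seconds = 0.0;

/* Stand-in for the real work: scales a vector in parallel. */
void scale_vector(double *a, int n, double factor)
{
    assert(n >= 0);
    double t0 = wall_seconds();

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] *= factor;

    /* Measured by the encountering thread, so fork/join cost is included. */
    parallel_seconds += wall_seconds() - t0;
}
```

Comparing `parallel_seconds` at the end of the run with the total run time shows how much of the program is actually inside the parallel region.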
MarkB wrote:Is the amount of time in each parallel region long enough to offset the associated overheads? How long does the call to LE0130Sl take sequentially, and how many times is the parallel construct encountered (i.e. how many iterations are there in the DO I=IA,N loop)?
MarkB wrote:Is the load balance OK? Guided schedule may not work very well for loops where the amount of work decreases with iteration number, as it issues large chunks first. A quick test would be to run the parallel loop backwards (i.e. DO K=KE,KA,-1 ).
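The backwards-loop test could look roughly like this in C. The kernel `row_work` is a made-up stand-in whose cost shrinks as the index grows, which is the situation described above. Running the loop from the last index down means the large chunks that guided scheduling hands out first contain the cheap iterations.

```c
#include <assert.h>

/* Hypothetical kernel: the cost of iteration k shrinks as k grows. */
static double row_work(int k, int n)
{
    double s = 0.0;
    for (int j = k; j < n; ++j)
        s += 1.0;
    return s;
}

/* Same work as a forward loop, but iterated from n-1 down to 0. */
double triangular_sum_reversed(int n)
{
    assert(n >= 0);
    double total = 0.0;

    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (int k = n - 1; k >= 0; --k)
        total += row_work(k, n);

    return total;
}
```

Timing this against the forward version (or against `schedule(dynamic)` with a modest chunk size) gives a quick indication of whether load imbalance is the problem.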
MarkB wrote:Ultimately, the code likely spends most of its time doing dot-products which are very memory bandwidth intensive. You may simply be hitting the hardware limits here. Are you on a multi-socket system? If so, then NUMA effects can cause serious bandwidth degradation. Is the array FR initialised sequentially or in parallel? In the former case, most likely all the data ends up being allocated on one socket, which can result in a bottleneck.
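For the multi-socket case, a sketch of parallel initialisation in C. It assumes a first-touch page placement policy (the usual default on Linux, as far as I know), where a page ends up in the memory of the socket whose thread writes it first. `alloc_first_touch` and `sum_static` are hypothetical names, and the zero fill stands in for the real initialisation of FR.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an array and touch it in parallel, so that each thread
   first-writes the part it will later work on. */
double *alloc_first_touch(size_t n)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = 0.0;

    return a;
}

/* Compute loop using the same static schedule as the initialisation,
   so each thread mostly reads the pages it touched first. */
double sum_static(const double *a, size_t n)
{
    double s = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:s)
    for (long i = 0; i < (long)n; ++i)
        s += a[i];

    return s;
}
```

On a single-socket machine this makes no difference to data placement.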
HeinzM wrote:The parallel construct is encountered once for each row of the sparse matrix, e.g. several hundred thousand times. This could be the bottleneck.
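If the repeated fork/join does turn out to be the cost, one thing to try is a single parallel region around the outer loop with a worksharing loop inside. Below is a sketch in C under the assumption that nothing in the outer loop body has to be executed by just one thread (otherwise an `omp single` would be needed there). The function `accumulate_rows` and its arguments are invented for the illustration and do not reflect the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Adds every row of a rows-by-cols matrix (row-major) into acc[0..cols-1].
   The parallel region is entered once, not once per row. */
void accumulate_rows(const double *m, double *acc, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);

    #pragma omp parallel
    {
        /* Every thread runs the outer loop; i is private to each thread. */
        for (int i = 0; i < rows; ++i) {
            /* The inner iterations are divided among the threads. */
            #pragma omp for schedule(static)
            for (int k = 0; k < cols; ++k)
                acc[k] += m[(size_t)i * cols + k];
            /* The implicit barrier at the end of the worksharing loop
               keeps the threads in step from one row to the next. */
        }
    }
}
```

There is still one barrier per row, so whether this helps has to be measured. If each row contains too little work, running that part sequentially may be the better choice.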
HeinzM wrote:No, I have no multi-socket system, no NUMA. I use a simple Intel I7 board with 4 cores and hyperthreading. Could hyperthreading be the bottleneck?
HeinzM wrote:The first vector in the dot product is the same for all dot products in the loop. So I tried copying it into its own allocatable vector, outside the matrix stored in FR(....). I thought that several threads reading the same array elements simultaneously could slow things down.
HeinzM wrote:Mark, thank you for your help