I checked what you advised.
MarkB wrote:Are you convinced that the amount of time spent outside of parallel regions is insignificant? It is maybe worth confirming this by putting a timer around the parallel region and summing up the time spent in it.
Yes, the amount of time spent outside of the parallel region is very low, about 1%.
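A minimal sketch of that kind of measurement, assuming omp_get_wtime from omp_lib is used as the timer. The array x, the sizes n and m, and the variables t_par and t_total are placeholders for illustration, not my real code:

```fortran
program time_parallel_fraction
  use omp_lib
  implicit none
  integer, parameter :: n = 1000, m = 2000   ! placeholder sizes
  double precision, allocatable :: x(:)
  double precision :: t_start, t0, t_par, t_total, s
  integer :: i, k

  allocate(x(m))
  x = 1.0d0
  t_par = 0.0d0
  s = 0.0d0

  t_start = omp_get_wtime()
  do i = 1, n                      ! stands in for the outer DO I=IA,N loop
    t0 = omp_get_wtime()
    !$omp parallel do reduction(+:s)
    do k = 1, m                    ! stands in for the parallel inner loop
      s = s + x(k) * x(k)
    end do
    !$omp end parallel do
    t_par = t_par + (omp_get_wtime() - t0)   ! accumulate time spent in the region
  end do
  t_total = omp_get_wtime() - t_start

  print '(a,f8.2,a)', 'time outside parallel regions: ', &
        100.0d0 * (t_total - t_par) / t_total, ' %'
  print *, 'checksum:', s          ! keeps the compiler from dropping the loop
end program time_parallel_fraction
```

This needs to be compiled with OpenMP enabled (for example gfortran -fopenmp). Note that the time measured this way includes the cost of entering and leaving the parallel construct, so it says nothing about how much of the "parallel" time is overhead.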
MarkB wrote:Is the amount of time in each parallel region long enough to offset the associated overheads? How long does the call to LE0130Sl take sequentially, and how many times is the parallel construct encountered (i.e. how many iterations are there in the DO I=IA,N loop)?
The parallel construct is encountered once for each row of the sparse matrix, e.g. several hundred thousand times. This could be the bottleneck.
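If the cost of entering the parallel construct that often is the problem, one option would be to open a single parallel region around the outer loop and use only a worksharing DO inside it. A minimal sketch, assuming the outer iterations have no other dependencies. The arrays a and y and the sizes are made up for the example and do not reflect my actual data structures:

```fortran
program hoist_parallel_region
  implicit none
  integer, parameter :: n = 200, m = 1000   ! placeholder sizes
  double precision, allocatable :: a(:,:), y(:)
  integer :: i, k

  allocate(a(m, n), y(m))
  a = 1.0d0
  y = 0.0d0

  !$omp parallel private(i, k)     ! thread team is created only once
  do i = 1, n                      ! every thread runs the outer loop
    !$omp do schedule(static)
    do k = 1, m                    ! inner iterations are shared among threads
      y(k) = y(k) + a(k, i)
    end do
    !$omp end do                   ! implicit barrier before the next i
  end do
  !$omp end parallel

  print *, 'y(1) =', y(1), ' expected', dble(n)
end program hoist_parallel_region
```

There is still one barrier per outer iteration, so this only removes the cost of starting and ending the region each time, not the synchronisation itself.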
MarkB wrote:Is the load balance OK? Guided schedule may not work very well for loops where the amount of work decreases with iteration number, as it issues large chunks first. A quick test would be to run the parallel loop backwards (i.e. DO K=KE,KA,-1 ).
Running the loop in reverse is even a little slower, so the load balance seems to be OK.
MarkB wrote:Ultimately, the code likely spends most of its time doing dot-products which are very memory bandwidth intensive. You may simply be hitting the hardware limits here. Are you on a multi-socket system? If so, then NUMA effects can cause serious bandwidth degradation. Is the array FR initialised sequentially or in parallel? In the former case, most likely all the data ends up being allocated on one socket, which can result in a bottleneck.
No, I do not have a multi-socket system, so no NUMA. I use a simple Intel i7 board with 4 cores and hyperthreading. Could hyperthreading be the bottleneck?
The first vector in the dot product is the same for all dot products in the loop. So I tried copying it into its own allocatable vector, outside the matrix stored in FR(....). I thought that several threads reading the same array elements simultaneously could slow things down. I did get a small speed-up, but only 3%. The code became unreadable, though, so I dropped it.
I have no more ideas for achieving better scaling. It seems I have to accept that this particular code doesn't yield more. With four threads it speeds up by a factor of three. Eight threads even slow it down a little.
Another code of mine, which does no dot products and no matrix-times-vector operations, scales well on the same hardware. With eight threads it speeds up by a factor of nearly seven.
Mark, thank you for your help,