3d arrays and OpenMP

Postby johannes » Thu Apr 24, 2014 8:57 am

Hi all,
I am a beginner with OpenMP and quite desperate after several days without progress.

1. Bad loops?
I am solving 3D finite volume temperature fields using Intel Fortran (ifort) on Windows 7. I have tried many variations of OMP directives, but never got a speed-up of more than a factor of 2, even with 16 cores. What bothers me is that in the attached test program the CPU time recordings are so odd: the CPU time is more or less independent of NTHREADS. If you rename test3.piz to test3.zip you will find my .exe inside. I compile simply with ifort /c /Qopenmp test3.f90, link with link test3.obj, and run with test3.

Test3.f90 is written the way it is because the typical loop in my 'big' program looks like this:
Code: Select all
do k=1,Nz
   do j=1,Ny
     do i=1,Nx
       c(i,j,k)=a(i,j,k)*b(i,j,k)+ other matrix elements - other matrix elements
     enddo
   enddo
enddo


- Does this type of loop structure impede the use of OpenMP, and how can I make it better? I also have loops like
Code: Select all
     do i=1,N ; some arrays a(3,i) ; enddo 

which also do not run better.
- Are there special compiler directives to make it better (e.g. avoid conflicts with hyperthreading)?
- What should I use for diagnostics? (I have to admit that I'm using the old-fashioned way of compiling with a .bat file. I do not use the visual stuff.)
- Do you remember the main stumbling block when you started with OpenMP?

2. That OpenMP works on my PC at all is supported by this piece of code, which behaves as expected:
Code: Select all
!$OMP PARALLEL PRIVATE(i,j,k)  reduction(+:prod)
!$omp do
do k=1,Nz
do j=1,Ny
  do i=1,Nx
   prod=prod+a(i,j,k)*b(i,j,k) 
  enddo
enddo
enddo
!$omp end do
!$OMP END PARALLEL


Hoping somebody can provide me with a key idea.
Best regards,
Johannes
Attachment: test3.zip (contains test3.f90 and test3.exe)
johannes
 
Posts: 7
Joined: Thu Apr 24, 2014 8:43 am

Re: 3d arrays and OpenMP

Postby MarkB » Mon Apr 28, 2014 4:18 am

What are you using to measure the execution time? You need to make sure that you are measuring wall clock time and not the accumulated CPU time across all the threads....
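
A minimal sketch of the difference, assuming a compiler with OpenMP enabled (for example ifort /Qopenmp or gfortran -fopenmp). The program name, array size and the sqrt workload are made up for illustration. On many implementations CPU_TIME reports processor time summed over all threads, so it can stay roughly constant as threads are added while the wall-clock time from OMP_GET_WTIME drops.

```fortran
! Sketch: wall-clock time vs. accumulated CPU time around a parallel loop.
! All names and sizes here are hypothetical, not taken from test3.f90.
program timing_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 20000000        ! hypothetical problem size
  real(kind=8), allocatable :: a(:)
  real(kind=8) :: wall0, wall1, total
  real :: cpu0, cpu1
  integer :: i

  allocate (a(n))
  a = 1.0d0
  total = 0.0d0

  call cpu_time(cpu0)                        ! processor time, often summed over threads
  wall0 = omp_get_wtime()                    ! elapsed wall-clock time

  !$omp parallel do reduction(+:total)
  do i = 1, n
     total = total + sqrt(a(i))
  end do
  !$omp end parallel do

  wall1 = omp_get_wtime()
  call cpu_time(cpu1)

  print *, 'sum       =', total
  print *, 'wall time =', wall1 - wall0
  print *, 'CPU time  =', cpu1 - cpu0
end program timing_sketch
```

Comparing the two printed times for 1, 2 and 4 threads shows which of them actually reflects the speed-up.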
MarkB
 
Posts: 456
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: 3d arrays and OpenMP

Postby johannes » Tue Apr 29, 2014 2:33 am

I was using
Code: Select all
t=OMP_GET_WTIME()
which seems to be consistent with the CPU_TIME difference (t1-t0) divided by Nthreads.

Re: 3d arrays and OpenMP

Postby MarkB » Tue Apr 29, 2014 7:39 am

I think the parallel loop goes faster on one thread than the serial one because you are not initialising the c array beforehand. The serial loop most likely results in lots of page faults as the c array is mapped into physical memory. If you are lucky, the mappings will persist even though you deallocate and reallocate the array, and so the parallel loop is not affected.
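
One way to take the page-fault cost out of the measurement is to write to every array, including c, before starting the timer. The following is only a sketch under assumed names and sizes (nx, ny, nz and the constant fill values are hypothetical, not from test3.f90):

```fortran
! Sketch: touch all arrays before timing so that page mapping is not measured.
program first_touch_sketch
  use omp_lib
  implicit none
  integer, parameter :: nx = 200, ny = 200, nz = 200   ! hypothetical sizes
  real(kind=8), allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  real(kind=8) :: t0, t1
  integer :: i, j, k

  allocate (a(nx,ny,nz), b(nx,ny,nz), c(nx,ny,nz))

  ! Initialise every array, c included, outside the timed region
  !$omp parallel do private(i,j)
  do k = 1, nz
     do j = 1, ny
        do i = 1, nx
           a(i,j,k) = 1.0d0
           b(i,j,k) = 2.0d0
           c(i,j,k) = 0.0d0
        end do
     end do
  end do
  !$omp end parallel do

  ! Timed loop: the pages of c are already mapped at this point
  t0 = omp_get_wtime()
  !$omp parallel do private(i,j)
  do k = 1, nz
     do j = 1, ny
        do i = 1, nx
           c(i,j,k) = a(i,j,k)*b(i,j,k)
        end do
     end do
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print *, 'c(1,1,1) =', c(1,1,1), ' time =', t1 - t0
end program first_touch_sketch
```

Initialising inside a parallel loop with the same loop structure as the timed loop is a deliberate choice here: on systems with more than one memory controller it tends to place each page near the thread that will use it later.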

The loop you are measuring is a terrible memory bandwidth hog, so I expect the lack of speedup beyond two threads is simply due to the fact that two threads are enough to saturate the memory bandwidth on your hardware. You may need to think about restructuring your code to improve its temporal locality / cache reuse.
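
As a rough illustration of one such restructuring, here is a sketch of fusing two loops so that an intermediate result is used while it is still in a register or cache instead of being read back from memory. The 1D arrays, the names d and d_ref, and the arithmetic are invented for the example and do not come from the original program.

```fortran
! Sketch: two separate passes over memory vs. one fused pass.
program fusion_sketch
  implicit none
  integer, parameter :: n = 5000000          ! hypothetical size
  real(kind=8), allocatable :: a(:), b(:), c(:), d(:), d_ref(:)
  integer :: i

  allocate (a(n), b(n), c(n), d(n), d_ref(n))
  a = 1.5d0
  b = 2.0d0

  ! Unfused: c is written by the first loop and read again by the second
  !$omp parallel do
  do i = 1, n
     c(i) = a(i)*b(i)
  end do
  !$omp end parallel do
  !$omp parallel do
  do i = 1, n
     d_ref(i) = c(i) + a(i)
  end do
  !$omp end parallel do

  ! Fused: c(i) and a(i) are reused immediately, saving one pass over both arrays
  !$omp parallel do
  do i = 1, n
     c(i) = a(i)*b(i)
     d(i) = c(i) + a(i)
  end do
  !$omp end parallel do

  print *, 'max difference =', maxval(abs(d - d_ref))
end program fusion_sketch
```

Whether fusion helps in the real code depends on which arrays are shared between neighbouring loops, so this only shows the shape of the transformation.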

Re: 3d arrays and OpenMP

Postby johannes » Thu May 01, 2014 12:17 am

Hi Mark,
I tried with SCHEDULE(....,some chunksize), but this didn't speed up either.

- Should the ultimate procedure be that I shall design some 'domain decomposition' manually?
- Is there a way to find out the actual 'memory bandwidth' without making experiments?
Best regards,
Johannes

Re: 3d arrays and OpenMP

Postby MarkB » Thu May 01, 2014 2:07 am

johannes wrote:I tried with SCHEDULE(....,some chunksize), but this didn't speed up either.

- Should the ultimate procedure be that I shall design some 'domain decomposition' manually?
- Is there a way to find out the actual 'memory bandwidth' without making experiments?


There's no load imbalance in the loop, so there's no reason to expect that any schedule other than STATIC would improve the performance, I'm afraid. You could consider using transformations such as loop fusion and loop tiling in your main code to try to improve reuse of cached data and reduce the memory traffic. If you want to measure the memory bandwidth of your system, the STREAM benchmark might be useful: http://www.cs.virginia.edu/stream/
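
For a quick first impression before running the real benchmark, a triad-style loop can be timed by hand. This is not the STREAM code, just a sketch in its spirit; the array size, repeat count and byte counting (two 8-byte reads and one 8-byte write per iteration) are assumptions, and the result is only a rough estimate.

```fortran
! Sketch: crude bandwidth estimate from a triad-style loop (not the STREAM benchmark).
program bandwidth_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 10000000         ! hypothetical; should far exceed the cache size
  integer, parameter :: nrep = 10            ! hypothetical repeat count
  real(kind=8), allocatable :: a(:), b(:), c(:)
  real(kind=8) :: t0, t1, best, gbytes
  integer :: i, irep

  allocate (a(n), b(n), c(n))

  ! Initialise in parallel so the pages are mapped before timing
  !$omp parallel do
  do i = 1, n
     a(i) = 1.0d0
     b(i) = 2.0d0
     c(i) = 0.0d0
  end do
  !$omp end parallel do

  best = huge(best)
  do irep = 1, nrep
     t0 = omp_get_wtime()
     !$omp parallel do
     do i = 1, n
        c(i) = a(i) + 3.0d0*b(i)
     end do
     !$omp end parallel do
     t1 = omp_get_wtime()
     best = min(best, t1 - t0)                ! keep the fastest repetition
  end do

  ! Counted traffic: 3 arrays x 8 bytes x n elements
  gbytes = 3.0d0*8.0d0*real(n, kind=8)/1.0d9
  print *, 'threads       =', omp_get_max_threads()
  print *, 'c(n)          =', c(n)
  print *, 'best time [s] =', best
  print *, 'approx GB/s   =', gbytes/best
end program bandwidth_sketch
```

Running it with OMP_NUM_THREADS set to 1, 2, 4 and so on shows at which thread count the estimated bandwidth stops growing.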

Re: 3d arrays and OpenMP

Postby johannes » Thu May 01, 2014 6:35 am

Hi Mark,
STREAM is interesting. Hoping I understand.
I guess 'domain decomposition' is what you call 'tiling'.
BR, Johannes

Re: 3d arrays and OpenMP

Postby MarkB » Thu May 01, 2014 6:45 am

johannes wrote:I guess 'domain decomposition' is what you call 'tiling'.


Tiling is different from domain decomposition: there's a brief description on Wikipedia http://en.wikipedia.org/wiki/Loop_tiling
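
To make the distinction concrete: tiling reorders the iterations of a loop nest so that a small block of data is reused while it is in cache, whereas domain decomposition splits the data between threads or processes. A minimal tiling sketch follows, using a matrix transpose as a stand-in; the sizes n and tile are hypothetical and the tile size would need tuning on real hardware.

```fortran
! Sketch: loop tiling applied to a matrix transpose.
program tiling_sketch
  implicit none
  integer, parameter :: n = 2000, tile = 64  ! hypothetical sizes
  real(kind=8), allocatable :: a(:,:), b(:,:), b_ref(:,:)
  integer :: i, j, ii, jj

  allocate (a(n,n), b(n,n), b_ref(n,n))
  call random_number(a)

  ! Untiled reference: a is read with stride n in the inner loop
  do j = 1, n
     do i = 1, n
        b_ref(i,j) = a(j,i)
     end do
  end do

  ! Tiled: work on tile x tile blocks so both arrays stay in cache for a while.
  ! Each jj block writes its own columns of b, so threads do not overlap.
  !$omp parallel do private(ii, i, j)
  do jj = 1, n, tile
     do ii = 1, n, tile
        do j = jj, min(jj + tile - 1, n)
           do i = ii, min(ii + tile - 1, n)
              b(i,j) = a(j,i)
           end do
        end do
     end do
  end do
  !$omp end parallel do

  print *, 'tiled result matches reference: ', all(b == b_ref)
end program tiling_sketch
```

The min() bounds handle the case where n is not a multiple of the tile size.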

revisited Re: 3d arrays and OpenMP

Postby johannes » Sun Sep 14, 2014 1:56 am

Hi all,
let me revisit my observation of bad OpenMP performance on my Windows 7 64-bit installations. I experimented a lot with array sizes and OMP directives. Finally I found a lecture by Ruud van der Pas, from which I want to use page 32 as a benchmark code.
http://www.compunity.org/training/tutorials/4%20OpenMP_and_Performance.pdf

I compiled the code below with latest Intel Fortran using
ifort /Qopenmp vanderpas.f90
link vanderpas.obj


Regardless of whether Hyperthreading was turned on or off on a dual-processor Intel Xeon X5570 system (4 cores per processor), the gain with 2 threads is less than a factor of 2, and with more threads it gets no better, or even worse.
1 thread: 0.4 sec
2 threads: 0.25 sec
3 and more threads: not better

My plea: could anyone recompile this code with their favourite compiler and post the timings? Is there something missing in my compiler calls?

Code: Select all
! Tutorial IWOMP 2010 slide 32
! modif with allocate
use omp_lib
implicit none
integer :: is, ie, m, n
real(kind=8), allocatable :: x(:,:,:)
real(kind=8) :: scale
integer :: i, j, k
real*8 :: endtime, starttime
integer :: NTHREADS, irepeat
n=20 ; m=7500
allocate (x(m,n,n))
x(:,:,:)=1.
scale=0.5
print *,'Enter number of threads'
read *,NTHREADS
call OMP_SET_NUM_THREADS(NTHREADS)
starttime = OMP_get_wtime()
repeat: do irepeat=1,100   ! just to make execution time a bit longer
   ! Original van der Pas:
   !$omp parallel default(none) &
   !$omp private(i,j,k) shared(m,n,scale,x)
   do k = 2, n
      do j = 2, n
         !$omp do schedule(static)
         do i = 1, m
            x(i,j,k) = x(i,j,k-1) + x(i,j-1,k)*scale
         end do
         !$omp end do nowait
      end do
   end do
   !$omp end parallel
end do repeat
endtime = OMP_get_wtime()
print *, 'done, OMPtime=',SNGL(endtime - starttime)
end


BR
johannes

Re: 3d arrays and OpenMP

Postby ftinetti » Mon Sep 15, 2014 5:15 pm

Hi johannes,

2 x Intel(R) Xeon(R) CPU E5405 @ 2.00GHz (4 cores each, 8 cores total)

$ gfortran -v
Using built-in specs.
Target: x86_64-linux-gnu
...
gcc version 4.4.5

$ gfortran -fopenmp ...

$ a.out
Enter number of threads
1
done, OMPtime= 8.5746050

$ a.out
Enter number of threads
2
done, OMPtime= 4.2897229

$ a.out
Enter number of threads
4
done, OMPtime= 2.1507766

$ a.out
Enter number of threads
8
done, OMPtime= 1.0773643


johannes wrote:Is there something missing in my compiler calls?


I don't think so. Are you sure boost is turned off?

Did you try with greater n and m, so that more data is processed?

HTH,

Fernando.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm
