[Omp] OpenMP Parallel Do Loops
Breshears, Clay
clay.breshears at intel.com
Thu Nov 17 08:47:38 PST 2005
Craig -
I expect there are some memory access/contention issues at work here.
By changing the schedule I was able to get better performance. I'm
using a dual-processor system to run the code, and here are some
results for a 1000x1000 case:
  1 thread:                26.7 seconds
  2 threads:              109.2
  2 threads (dynamic,8):   51.4
  2 threads (static,8):    41.5
  2 threads (static,16):   37.0
Still not as good as serial execution. Changing the schedule may
alleviate some of the memory problems, but data size and granularity
can still be coming into play here.
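
A minimal sketch of the schedule change described above, assuming the same loop structure as Craig's program. The drotg/drot calls are replaced by a stand-in that records the step i at which W(x,j) would be annihilated (what the commented-out W(x,j)=i line in his code does), so no BLAS library is needed. The 64x64 size and the names schedule_sketch, step, marked and expected are hypothetical. The chunk size of 16 is simply the fastest setting in the timings above, not a tuned value. One plausible reason larger chunks help is that consecutive j values work on neighbouring rows, which sit next to each other in memory in Fortran's column-major layout, so giving neighbouring j values to different threads can make them contend for the same cache lines.

```fortran
program schedule_sketch
  implicit none
  ! Hypothetical problem size, far smaller than the 1000x1000 case timed above.
  integer, parameter :: m = 64, n = 64
  integer :: step(m, n)
  integer :: i, j, x, marked, expected

  step = 0

  ! One parallel region around the whole sweep; only the j loop is shared out.
!$omp parallel private(i, x) shared(step)
  do i = 1, m + n - 2
    ! static,16 hands each thread blocks of 16 consecutive columns.
!$omp do schedule(static, 16)
    do j = 1, n
      x = m + 2*j - i - 1
      ! Stand-in for the drotg/drot calls: record the step at which
      ! W(x,j) would be annihilated.
      if (j < x .and. x <= m) step(x, j) = i
    end do
!$omp end do
  end do
!$omp end parallel

  ! Every entry below the diagonal should have been marked by some step.
  marked = count(step > 0)
  expected = 0
  do j = 1, n
    expected = expected + max(0, m - j)
  end do
  print *, 'entries marked:', marked, ' expected:', expected
end program schedule_sketch
```

Built with OpenMP enabled (for example gfortran -fopenmp), the j loop is divided between the threads; without the flag the directives are treated as comments and the program runs serially. The two printed counts should agree in both cases.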
clay
-----Original Message-----
From: Omp-bounces at openmp.org [mailto:Omp-bounces at openmp.org] On Behalf
Of ta.cbra at maths.strath.ac.uk
Sent: Monday, November 14, 2005 4:52 PM
To: Neil Summers
Cc: Omp at openmp.org
Subject: Re: [Omp] OpenMP Parallel Do Loops
Neil,
Thank you for your reply. I have attached a more detailed version of my
code that actually applies the Givens rotations (this uses the BLAS
routines drotg and drot).
I have tried to implement your suggestions in this new code, but I am
still unable to get any kind of speedup when I increase the number of
processors.
Can anyone see an obvious reason why? Any help is greatly appreciated!
      program CDGR
      include 'omp_lib.h'
c Declare variable types
      integer :: i, j, x, m, n
      double precision, dimension(:,:), allocatable :: W
      double precision :: cc,ss,time
      integer np, me
c timearray receives the user and system times from dtime
      real :: timearray(2)
c Get the size of matrix to use
      write(*,*) 'What size of matrix do you wish to use?'
      write(*,*) 'Number of rows (m) ='
      read(*,*) m
      write(*,*) 'Number of columns (n) ='
      read(*,*) n
      write(*,*)'m= ',m,' and n = ',n,' thank you.'
      allocate(W(m,n))
      do i=1,m
         do j=1,n
            W(i,j)=1000*rand(i+j)
         end do
      end do
!     do i=1,m
!        write(*,*)(W(i,j),j=1,n)
!     end do
      time=dtime(timearray)
c Show time step (i) that each element would be annihilated during
c to leave the matrix W upper triangular
!$OMP parallel private(x,cc,ss,me,i) shared(W,np,m,n)
      np = omp_get_num_threads()
      me = omp_get_thread_num()
      do i=1,m+n-2
c Every node uses the same value of i but the j values
c are shared out and can be performed at the same time
!$OMP do schedule(dynamic,1)
         do j=1,n
            x=m+2*j-i-1
c make sure element W(x,j) is within the matrix W
            if (j .lt. x) then
               if (x .le. m) then
                  call drotg(W(x-1,j), W(x,j), cc, ss)
                  W(x,j)=0d0
                  call drot(n-j,W(x-1,j+1:n),1,W(x,j+1:n),1,cc,ss)
!                 W(x,j)=i
               endif
            endif
         enddo
!$OMP end do
      enddo
!$OMP end parallel
      time=dtime(timearray)
      write(*,*)'CDGR with ',np
      write(*,*)'m=',m,'n=',n,'time=',time
c Print W if you want to see how it was annihilated
!     do i=1,m
!        write(*,*)(W(i,j),j=1,n)
!     end do
      deallocate(W)
      stop
      end
On Mon, 14 Nov 2005, Neil Summers wrote:
> Two things I noticed on a quick scan of your code.
>
> 1) You should define the parallel region outside the i loop.
> Creating a parallel region within a do loop causes excessive
> overhead, as the program forks and joins on every iteration.
> With the parallel region outside the i loop, use omp do to
> split the work up between threads, i.e.
>
> !$OMP parallel private(x,me,np)
>       do i=1,m+n-2
> !$OMP do
>          do j=1,n
>             ...
>          enddo
>       enddo
> !$OMP end parallel
>
> 2) I'm surprised you get the right results.
> By defining firstprivate(m,n), these are then undefined
> on exiting the parallel region, so I would guess the second
> iteration of i would not happen correctly.
> You don't need these private, so I'd leave them shared.
>
> Neil
--
CB.
**************************************
* *
* Craig Brand *
* University of Strathclyde *
* Department of Mathematics *
* e-mail ta.cbra at maths.strath.ac.uk *
* *
**************************************
_______________________________________________
Omp mailing list
Omp at openmp.org
http://openmp.org/mailman/listinfo/omp_openmp.org