Threading an outer loop

Post by trav9 » Wed Mar 20, 2013 10:28 am

I am trying to build a simple test case. I thought I had a decent understanding of how to thread a loop, but I am unable to see any speedup. The code is below:

Code:
PROGRAM TEST
use omp_lib          ! Fortran 95; omp_get_thread_num, omp_get_num_threads
implicit none
   
   integer j,i,k,tot,x,nthreads
   integer,dimension(3)::counter
   real*8,dimension(3)::A
 

   A(1) = 10
   A(2) = 600
   A(3) = 10000000


   tot = OMP_get_max_threads()         ! tells you maximum allowable thread on machine
   write(*,*) 'maximum allowable threads =', tot
   x = 2
   call OMP_SET_NUM_THREADS(x)           ! allows you to set the number of threads outside of parallel environment



   !$OMP PARALLEL
   nthreads = omp_get_num_threads()    ! get number of threads being used
   !$OMP DO SCHEDULE(DYNAMIC)
   do j = 1,3,1
      do i = 1,150000,1
         do k = 1,150000,1
            A(j) = A(j)+i-(k/j)
         enddo
      enddo
   enddo
   !$OMP END DO NOWAIT
   WRITE (*,*) 'Parallel threads used: ',nthreads
   !$OMP END PARALLEL


    write(*,*) 'A =', A
   
 
end


I see the same run time whether I run it in serial or threaded. I basically just want to thread the outer j loop, with each thread taking one value of j and doing the rest of the calculations, since each j is independent of the other j values. Any ideas on what is happening/I'm doing wrong?

Re: Threading an outer loop

Post by MarkB » Wed Mar 20, 2013 11:13 am

Hi there,

trav9 wrote: Any ideas on what is happening/I'm doing wrong?


Here are some possibilities:

Firstly, how are you measuring the time? Are you sure you are measuring wall clock time and not total CPU time?

Secondly, does the run time look sensible compared to the total number of arithmetic operations executed (i.e. are you sure that the compiler is not optimising away one or more of the loops)?

Finally, if the compiler is not smart enough, your code could have a bad case of false sharing. Does it help to rewrite it like this?

Code:
    PROGRAM TEST
    use omp_lib          ! Fortran 95; omp_get_thread_num, omp_get_num_threads
    implicit none
       
       integer j,i,k,tot,x,nthreads
       integer,dimension(3)::counter
       real*8,dimension(3)::A
       real*8 :: tmp
     

       A(1) = 10
       A(2) = 600
       A(3) = 10000000


       tot = OMP_get_max_threads()         ! tells you maximum allowable thread on machine
       write(*,*) 'maximum allowable threads =', tot
       x = 2
       call OMP_SET_NUM_THREADS(x)           ! allows you to set the number of threads outside of parallel environment



       !$OMP PARALLEL
       nthreads = omp_get_num_threads()    ! get number of threads being used
       !$OMP DO SCHEDULE(DYNAMIC) PRIVATE(TMP)
       do j = 1,3,1
          tmp = A(j)                       ! accumulate in a thread-private scalar
          do i = 1,150000,1
             do k = 1,150000,1
                tmp = tmp+i-(k/j)
             enddo
          enddo
          A(j) = tmp                       ! one store to the shared array per j
       enddo
       !$OMP END DO NOWAIT
       WRITE (*,*) 'Parallel threads used: ',nthreads
       !$OMP END PARALLEL


        write(*,*) 'A =', A
       
     
    end
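
On the first point above, a minimal sketch of how wall clock time and CPU time can be compared in one run. The program name, the loop body and the iteration count are made up for illustration. It assumes a compiler (gfortran, for example) where cpu_time reports processor time summed over all threads, which is common but not guaranteed by the Fortran standard.

```fortran
program timing_sketch
   use omp_lib                        ! provides omp_get_wtime
   implicit none
   integer :: j, i
   double precision :: a(3), tmp
   double precision :: wall0, wall1   ! wall clock time in seconds
   real :: cpu0, cpu1                 ! processor time from the cpu_time intrinsic

   a = 0.0d0
   call cpu_time(cpu0)
   wall0 = omp_get_wtime()

   ! Illustrative workload only; each j is independent of the others.
   !$omp parallel do private(i, tmp)
   do j = 1, 3
      tmp = 0.0d0
      do i = 1, 50000000
         tmp = tmp + sqrt(dble(i)) / j
      end do
      a(j) = tmp
   end do
   !$omp end parallel do

   wall1 = omp_get_wtime()
   call cpu_time(cpu1)

   write(*,*) 'a         =', a
   write(*,*) 'wall time =', wall1 - wall0, 'seconds'
   write(*,*) 'cpu time  =', cpu1 - cpu0, 'seconds'
end program timing_sketch
```

If the loop is really running in parallel, the wall time should drop as threads are added while the CPU time stays roughly constant or grows slightly.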



Hope that helps,

Mark.

Re: Threading an outer loop

Post by trav9 » Wed Mar 20, 2013 11:34 am

Hi Mark,

Thank you for the response. I'm actually measuring real, system, and user time. The run time appears to be right as well. Your third suggestion seems to be what is happening. Is there any other way to do a similar process, but without making A private? My final goal is to apply this to a much larger code. In it there are calculations performed on many vectors/matrices dependent on a size which I represent as j. I'm afraid making them all private would require too much memory.

Re: Threading an outer loop

Post by ftinetti » Wed Mar 20, 2013 12:54 pm

Hi,

I think there are some problems with shared/private data, aren't there? I changed the code a little, to

Code:
PROGRAM TEST

use omp_lib          ! Fortran 95; omp_get_thread_num, omp_get_num_threads, omp_get_wtime

implicit none

   integer j,i,k,tot,x,nthreads
   integer,dimension(3)::counter
   real*8,dimension(3)::A

   double precision :: tick           ! timing

   tot = OMP_get_max_threads()         ! tells you maximum allowable thread on machine
   write(*,*) 'maximum allowable threads =', tot

   do x = 1, 3, 1   ! to change number of threads

     A(1) = 10
     A(2) = 600
     A(3) = 10000000

     tick = omp_get_wtime()

     call OMP_SET_NUM_THREADS(x)       ! allows you to set the number of threads outside of parallel environment

     !$OMP PARALLEL SHARED(A) PRIVATE(j, i, k, nthreads)
     nthreads = omp_get_num_threads()  ! get number of threads being used

     !$OMP DO
     do j = 1,3,1
       do i = 1,150000,1
         do k = 1,150000,1
           A(j) = A(j)+i-(k/j)
         enddo
       enddo
     enddo

     !$OMP END DO NOWAIT
     WRITE (*,*) 'Parallel threads used: ', nthreads
     !$OMP END PARALLEL

     tick = omp_get_wtime()-tick

     write(*,*) 'A =', A, 'in ', tick, 'seconds'
   enddo
end


and it seems to work, since

$ gfortran -O3 -fopenmp test.f90 -o test
$ ./test
maximum allowable threads = 8
Parallel threads used: 1
A = 10.000000000000000 843761250000600.00 1125015010000000.0 in 329.74776512398967 seconds
Parallel threads used: 2
Parallel threads used: 2
A = 10.000000000000000 843761250000600.00 1125015010000000.0 in 210.17432722699596 seconds
Parallel threads used: 3
Parallel threads used: 3
Parallel threads used: 3
A = 10.000000000000000 843761250000600.00 1125015010000000.0 in 119.59034825599520 seconds

HTH,

Fernando.

Re: Threading an outer loop

Post by MarkB » Thu Mar 21, 2013 8:14 am

trav9 wrote:Is there any other way to do a similar process, but without making A private? My final goal is to apply this to a much larger code. In it there are calculations performed on many vectors/matrices dependent on a size which I represent as j. I'm afraid making them all private would require too much memory.


The problem occurs in your test code because the number of iterations of the j loop is small (3). False sharing occurs when different threads access addresses which lie on the same cache line (whose size is typically 64 bytes on modern x86 architectures). Eight elements of A (8 bytes each) span one cache line, so provided you can choose a chunk size for the parallel loop which is larger than eight, without causing load imbalance in the loop, the problem may not occur in a code with larger arrays. Alternatively you can cross your fingers and hope the compiler does the job for you (see below).
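
For a small array like the one in the test code, one way to keep A shared and still avoid false sharing is to pad it so that each element that gets updated sits on its own cache line. This is only a sketch: the pad parameter is a made-up name, it assumes 64-byte cache lines and 8-byte reals, and the loop bounds are reduced from 150000 to 15000 so that it runs quickly.

```fortran
program padding_sketch
   use omp_lib
   implicit none
   integer, parameter :: pad = 8      ! assumed: 64-byte cache line / 8-byte element
   integer :: j, i, k
   double precision :: A(pad, 3)      ! only A(1,j) is used; A(2:pad,j) is padding
   double precision :: tick

   A = 0.0d0
   A(1,1) = 10
   A(1,2) = 600
   A(1,3) = 10000000

   tick = omp_get_wtime()

   ! A stays shared. Fortran arrays are column-major, so A(1,j) and A(1,j+1)
   ! are pad*8 = 64 bytes apart and cannot share a 64-byte cache line.
   !$OMP PARALLEL DO SCHEDULE(DYNAMIC) PRIVATE(i, k)
   do j = 1, 3
      do i = 1, 15000
         do k = 1, 15000
            A(1,j) = A(1,j) + i - (k/j)
         enddo
      enddo
   enddo
   !$OMP END PARALLEL DO

   tick = omp_get_wtime() - tick

   write(*,*) 'A =', A(1,:), 'in ', tick, 'seconds'
end program padding_sketch
```

Padding costs memory in proportion to pad, so it only makes sense for small arrays like this one. For large arrays, a chunk size that keeps neighbouring threads at least a cache line apart is the cheaper option.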

ftinetti wrote:I think there are some problems with shared/private data, aren't there?


Only nthreads is incorrectly scoped (A is shared by default, and i, j and k are all private since they are Fortran loop iterators), and that will not affect performance.
Whether you see false sharing or not will depend on whether the compiler optimisation removes the loads and stores of A in the innermost loop and does the accumulation in a register: I guess this is happening with gfortran and -O3.
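
A minimal sketch of two ways the scoping of nthreads could be made correct: make it private, or keep it shared and let a single thread write it. The program name and the variable nshared are made up for illustration.

```fortran
program scoping_sketch
   use omp_lib
   implicit none
   integer :: nthreads, nshared

   call omp_set_num_threads(2)

   ! Option 1: each thread gets its own copy of nthreads.
   !$OMP PARALLEL PRIVATE(nthreads)
   nthreads = omp_get_num_threads()
   write(*,*) 'thread', omp_get_thread_num(), 'of', nthreads
   !$OMP END PARALLEL

   ! Option 2: the variable stays shared, but only one thread writes it.
   ! The implicit barrier at END SINGLE makes the value visible to all threads.
   !$OMP PARALLEL
   !$OMP SINGLE
   nshared = omp_get_num_threads()
   !$OMP END SINGLE
   !$OMP END PARALLEL

   write(*,*) 'Parallel threads used: ', nshared
end program scoping_sketch
```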

