Parallel program showing speedup, but same wall time [F90]

Postby Aertsvijand » Mon Jun 17, 2013 9:04 am

I'm following an introductory class on (parallel) programming, in which we devoted two lessons to parallelizing Fortran 90 code with OpenMP. As an exam assignment, we had to parallelize a program by ourselves. I chose to work with Game of Life (http://www.pdc.kth.se/education/tutorials/summer-school/mpi-exercises/mpi-lab-codes/game_of_life-serial.f90/view) and this is what I came up with:
Code:
!----------------------
!  Conway Game of Life
!    serial version
!----------------------

program life
 
  use omp_lib
 
  implicit none
 
  integer, parameter :: ni=2000, nj=2000
  integer :: i, j, n, im, ip, jm, jp, nsum, isum, num_thr, nsteps
  integer, allocatable, dimension(:,:) :: old, new
  real :: arand, et, t1, e1, t2, e2, tarray(2)
 
  ! request the number of iterations
 
  write(*,'(A)',advance='no') "Please enter the number of iterations: "
  read(*,*) nsteps
 
  ! initiate time measurement
 
  et=0.0
  num_thr=1
  t1=dtime(tarray)
  e1=secnds(et)

  ! allocate arrays, including room for ghost cells

  allocate(old(0:ni+1,0:nj+1), new(0:ni+1,0:nj+1))

 
  do j = 1, nj
     do i = 1, ni
        call random_number(arand)
        old(i,j) = nint(arand)
     enddo
  enddo

  !  iterate
 

  time_iteration: do n = 1, nsteps

     ! corner boundary conditions

     old(0,0) = old(ni,nj)
     old(0,nj+1) = old(ni,1)
     old(ni+1,nj+1) = old(1,1)
     old(ni+1,0) = old(1,nj)

     ! left-right boundary conditions

     old(1:ni,0) = old(1:ni,nj)
     old(1:ni,nj+1) = old(1:ni,1)

     ! top-bottom boundary conditions

     old(0,1:nj) = old(ni,1:nj)
     old(ni+1,1:nj) = old(1,1:nj)
     
     !$omp parallel private(jm,j,jp,im,i,ip,nsum)
     !$omp do
     do j = 1, nj       
        do i = 1, ni

           im = i - 1
           ip = i + 1
           jm = j - 1
           jp = j + 1
           nsum = old(im,jm) + old(im,j) + old(im,jp) &
                + old(i,jm )             + old(i,jp ) &
                + old(ip,jm) + old(ip,j) + old(ip,jp)

           select case (nsum)
           case (3)
              new(i,j) = 1
           case (2)
              new(i,j) = old(i,j)
           case default
              new(i,j) = 0
           end select

        enddo     
     enddo
     !$omp enddo
     !$omp end parallel

     ! copy new state into old state

     old = new

  enddo time_iteration

  ! Iterations are done; sum the number of live cells
 
  isum = sum(new(1:ni,1:nj))
 
  ! Calculate resources used
 
  t2=dtime(tarray)
  e2=secnds(et)
 
  ! Print final number of live cells, including resources used.
 
  write(*,'(A14,A9,A10,A14,A9)') " Living Cells"," Threads"," CPU time"," Elapsed time"," Speedup"
  write(*,'(I14,I9,F10.4,F14.4,F9.4)') isum,num_thr,t2,e2-e1,t2/(e2-e1)

  deallocate(old, new)

end program life


And this is the output
Code:
.../Par_Prog/OpenMP $ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   26.0216       26.0234   0.9999


.../Par_Prog/OpenMP $ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   87.2094       25.1367   3.4694

So the parallel code reports a speedup, but takes the same wall time as the serial code, which I don't understand. Could somebody enlighten me?

Postby ftinetti » Mon Jun 17, 2013 10:44 am

Hi,

There are some issues in your code and in your speedup calculation:
0) What is the difference between ./ser_game_of_life and ./par_game_of_life?
1) The variable num_thr is never set to the actual number of OpenMP threads.
2) You are using the CPU time of the parallel code as the serial time, which is not fair, because that CPU time also includes the parallel overhead. The standard way of computing speedup is to divide the serial wall-clock time by the parallel wall-clock time; the OpenMP function omp_get_wtime() is usually suggested for measuring both.
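
A minimal sketch of the wall-clock measurement suggested in point 2, assuming a build with -fopenmp. The program name, the array size and the dummy loop are placeholders for illustration and are not part of the Game of Life code:

```fortran
! Minimal sketch: wall-clock timing with omp_get_wtime().
! Build with OpenMP enabled, e.g.: gfortran -fopenmp wtime_demo.f90
! (program name, array size and dummy loop are illustrative assumptions)
program wtime_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 5000000
  integer :: i
  double precision :: t_start, t_end, total
  double precision, allocatable :: a(:)

  allocate(a(n))
  total = 0.0d0

  t_start = omp_get_wtime()          ! wall-clock time, not CPU time

  !$omp parallel do reduction(+:total)
  do i = 1, n
     a(i) = sqrt(dble(i))
     total = total + a(i)
  end do
  !$omp end parallel do

  t_end = omp_get_wtime()

  write(*,'(A,I0)')     "Max threads: ", omp_get_max_threads()
  write(*,'(A,F10.4)')  "Elapsed (s): ", t_end - t_start
  write(*,'(A,ES12.4)') "Checksum:    ", total

  deallocate(a)
end program wtime_demo
```

Speedup is then the wall-clock time of the serial run divided by the wall-clock time of the parallel run, each measured around the same region.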

HTH,

Fernando.
PS: the source code seems a little bit strange to me, because the first and last rows and columns of array "new" are never assigned... but maybe I'm missing something...

Postby Aertsvijand » Mon Jun 17, 2013 12:09 pm

Hey Fernando, thanks for replying!

0) ser_game_of_life is compiled with gfortran -o ser_game_of_life game_of_life.f90, while par_game_of_life is compiled with gfortran -o par_game_of_life -fopenmp game_of_life.f90, i.e. the serial and parallel versions of the program.
1) I fixed that with
Code:
     ...
     !$omp parallel private(jm,j,jp,im,i,ip,nsum)
     !$omp master
     !$ num_thr = omp_get_num_threads()
     !$omp end master
     !$omp do
     ...

2) I'm not quite following the explanation you give about the measurement of time. I have looked into omp_get_wtime(), but that function is only recognised when used in combination with the -fopenmp compiler flag, so I'm not sure how to use it...
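
One way around this, sketched here under assumptions, is to time with the standard Fortran intrinsic system_clock, which needs no OpenMP support, and to put OpenMP-only calls on lines starting with the "!$" conditional-compilation sentinel, which are compiled only when -fopenmp is given. The program name and the dummy loop are hypothetical placeholders:

```fortran
! Sketch of a timer that compiles with and without -fopenmp.
! system_clock is standard Fortran; lines starting with "!$" are compiled
! only when OpenMP is enabled. All names here are illustrative assumptions.
program portable_timer
  !$ use omp_lib
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer(i8) :: c_start, c_end, c_rate
  integer :: i, num_thr
  double precision :: elapsed, total

  num_thr = 1                              ! value kept by the serial build
  !$ num_thr = omp_get_max_threads()       ! overridden in the OpenMP build

  call system_clock(c_start, c_rate)       ! start tick and ticks per second

  total = 0.0d0
  !$omp parallel do reduction(+:total)
  do i = 1, 5000000
     total = total + sqrt(dble(i))
  end do
  !$omp end parallel do

  call system_clock(c_end)
  elapsed = dble(c_end - c_start) / dble(c_rate)

  write(*,'(A,I0)')     "Threads:     ", num_thr
  write(*,'(A,F10.4)')  "Elapsed (s): ", elapsed
  write(*,'(A,ES12.4)') "Checksum:    ", total
end program portable_timer
```

The same source then yields a serial binary without -fopenmp and a parallel binary with it, and both report wall-clock time from the same clock.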

About the outermost rows/columns: they function as "dummy"-rows/columns, since the next-to-outermost cells wouldn't have the required neighbours. This is fixed by making some kind of torus of the map by copying the last "real" row to the upper dummy row, the first "real" row to the lower dummy row and the same for the dummy columns.

However, the main issue still stands: why does the program take just as much real time in both the serial and parallel versions?

Postby ftinetti » Mon Jun 17, 2013 4:44 pm

Aertsvijand wrote: However, the main issue still stands: why does the program take just as much real time in both the serial and parallel versions?


Hmmm... it may take a little bit of work, but I think we can find out.

First, please post some details about the environment:
a) Computer: processor and number of processors. If you are on Linux, the output of
$ cat /proc/cpuinfo
would be good enough.
b) Compiler and compiler options used for generating the serial as well as the parallel version.

Second: run the serial version, i.e. the one generated without the openmp compiler option and post the output (maybe it is the previous one you posted, but post it anyway, just for completeness).

Third: run the sequence
$ export OMP_NUM_THREADS=1
$ ./par_game_of_life
... <program output here>
$ export OMP_NUM_THREADS=2
$ ./par_game_of_life
... <program output here>
(if you have 4 cores or more)
$ export OMP_NUM_THREADS=4
$ ./par_game_of_life

Maybe at this point you'll find the explanation by yourself, but post the results anyway.

HTH,

Fernando.

Postby Aertsvijand » Tue Jun 18, 2013 1:31 am

I'm using Linux Mint running as a virtual machine in VirtualBox. I have dedicated my two cores, which support Hyper-Threading, to the virtual machine.

a) CPU info by using $ cat /proc/cpuinfo
Code:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:


b) Compiler and compiler options
Compiler: gfortran, i.e. the GCC Fortran compiler
Code:
$ gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.2-2ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1)

Serial: gfortran -o ser_game_of_life game_of_life.f90
Parallel: gfortran -o par_game_of_life -fopenmp game_of_life.f90

c) Serial version
Code:
$ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   27.6657       27.6680   0.9999


d) Sequence
Code:
$ export OMP_NUM_THREADS=1
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   40.2865       40.2891   0.9999

$ export OMP_NUM_THREADS=2
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        2   51.2912       26.4102   1.9421

$ export OMP_NUM_THREADS=4
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        4   84.7533       24.5039   3.4588


I guess the program is speeding up, but the overhead of the parallelisation is causing the program to be just as fast as the serial version?

Postby ftinetti » Tue Jun 18, 2013 3:51 am

Aertsvijand wrote: I guess the program is speeding up, but the overhead of the parallelisation is causing the program to be just as fast as the serial version?


Exactly, and unfortunately you don't have more than 2 cores to see any actual improvement with respect to the serial time. What happens on a non-virtual pair of Xeon processors is similar:
$ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   53.3193       53.3184   1.0000

$ export OMP_NUM_THREADS=1
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   98.2101       98.2090   1.0000

$ export OMP_NUM_THREADS=2
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1  102.4504       51.3730   1.9942


Now, please compile with some compiler optimization option, e.g.

gfortran -O2 -o ser_game_of_life game_of_life.f90
gfortran -O2 -o par_game_of_life -fopenmp game_of_life.f90

and please post the three runtimes (serial, parallel with one thread, and parallel with two threads). Usually, runtimes change a lot with optimized code (or, at least, with no debug-specific code generation).

HTH,

Fernando.

Postby Aertsvijand » Tue Jun 18, 2013 4:52 am

Well indeed, what a difference!

$ gfortran -o ser_game_of_life game_of_life.f90
$ gfortran -o par_game_of_life -fopenmp game_of_life.f90
$ gfortran -O2 -o ser_game_of_life_opt game_of_life.f90
$ gfortran -O2 -o par_game_of_life_opt -fopenmp game_of_life.f90


$ ./ser_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   23.4895       23.4844   1.0002

$ ./ser_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1    6.5964        6.6016   0.9992


$ export OMP_NUM_THREADS=1


$ ./par_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   38.0304       38.0508   0.9995

$ ./par_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1    7.5285        7.5273   1.0001


$ export OMP_NUM_THREADS=2


$ ./par_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        2   47.7430       24.5195   1.9471

$ ./par_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        2    9.4526        5.0625   1.8672
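
For reference, a small sketch that recomputes the speedup the way it was suggested earlier in the thread, i.e. serial wall time divided by parallel wall time, using the elapsed times of the -O2 runs above. The program and variable names are made up for illustration:

```fortran
! Sketch: speedup and efficiency from the elapsed (wall-clock) times
! posted above for the -O2 builds. Names are illustrative assumptions.
program speedup_from_timings
  implicit none
  real, parameter :: t_serial = 6.6016   ! ser_game_of_life_opt
  real, parameter :: t_par1   = 7.5273   ! par_game_of_life_opt, 1 thread
  real, parameter :: t_par2   = 5.0625   ! par_game_of_life_opt, 2 threads

  ! speedup = serial wall time / parallel wall time
  write(*,'(A,F7.4)') "Speedup, 1 thread:     ", t_serial / t_par1
  write(*,'(A,F7.4)') "Speedup, 2 threads:    ", t_serial / t_par2

  ! efficiency = speedup / number of threads
  write(*,'(A,F7.4)') "Efficiency, 2 threads: ", (t_serial / t_par2) / 2.0
end program speedup_from_timings
```

With these numbers the two-thread run comes out at a speedup of roughly 1.30 over the serial build, while the one-thread OpenMP run is slower than the serial build (roughly 0.88), which is the parallel overhead showing up.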


I guess I'll have plenty to explain about what I learned from this assignment when I meet with the professor :-) Maybe it wasn't the best example of a program to parallelize, but at least I learned a lot. Thanks for the help ;)