OpenMP updating sharing array very slow

Postby jiwa » Tue May 21, 2013 12:04 am

Hi all, I am quite new to OpenMP and have just started using it for some work involving big file I/O. Here is the parallel region of my Fortran code:
Code: Select all
t1=secnds(0.0)
call omp_set_num_threads(4)
!$OMP PARALLEL private(x1,x2,y1,y2,z1,z2,xx1,xx2,yy1,yy2,zz1,zz2,dx,dy,dz,volume,volume_max,m)
!$OMP DO
do k=1,nlayer
    write(*,*) "layer: ", k, secnds(0.0)-t1
    do j=1,nrow
        do i=1,ncolumn
            x1=grd_x(i); x2=grd_x(i+1)
            y1=grd_y(j); y2=grd_y(j+1)
            z1=grd_z(k); z2=grd_z(k+1)
            volume_max=0.0d0
            m=1
            do l=1,nn
                xx1=coord(l,1)-coord(l,4)/2.0d0; xx2=xx1+coord(l,4)
                yy1=coord(l,2)-coord(l,5)/2.0d0; yy2=yy1+coord(l,5)
                zz1=coord(l,3)-coord(l,6)/2.0d0; zz2=zz1+coord(l,6)
                ! nn is greater than 10 million
                ! some heavy-lifting work here to calculate m
                ! this single loop takes ~5 seconds on a single thread
                dx=minval((/x2,xx2/))-maxval((/x1,xx1/))
                dx=maxval((/0.0d0,dx/))
                dy=minval((/y2,yy2/))-maxval((/y1,yy1/))
                dy=maxval((/0.0d0,dy/))
                dz=minval((/z2,zz2/))-maxval((/z1,zz1/))
                dz=maxval((/0.0d0,dz/))
                volume=dx*dy*dz
                if(volume>volume_max) then
                    volume_max=volume; m=code_num(l)
                endif
             enddo
            big_array(i,j,k)=m
        enddo !i
    enddo !j
enddo !k
!$OMP END DO NOWAIT
!$OMP END PARALLEL
write(*,*)'Total calculation time is (T_t):    ',secnds(0.0)-t1, '  seconds.'

My problem is with the big_array update statement: big_array(i,j,k)=m.

If I comment out this statement, the parallel performance is just what I expected: T_t=42s for the serial code; T_t=41s for the parallel code (threads=2); T_t=31s for the parallel code (threads=4), given nlayer=10.

But if I keep this array update statement, the parallel code becomes very slow: T_t=413s. I have tried putting '!$OMP CRITICAL' or '!$OMP FLUSH' in front of this statement, but the problem remains.

This seems strange to me, because the array update is outside the time-consuming loop, and each thread should have a different (i,j,k), so multiple threads should never need to access the same memory location of this array.

Can anybody give me a hint about what's going on here?

Many thanks

Ji
jiwa
 
Posts: 5
Joined: Mon May 20, 2013 11:01 pm

Re: OpenMP updating sharing array very slow

Postby MarkB » Tue May 21, 2013 2:23 am

Hi there,

A couple of questions to try and help me figure out what's going on:

How many threads were you running to get the 413s time?
If you have the big_array assignment in, what is the sequential time, and what is the parallel time on one thread?
What are the values of nrow and ncolumn?

jiwa wrote: If I comment out this statement, the parallel performance is just what I expected: T_t=42s for the serial code; T_t=41s for the parallel code (threads=2); T_t=31s for the parallel code (threads=4), given nlayer=10.


Why do you expect such poor speedup?

Thanks,
Mark.
MarkB
 
Posts: 422
Joined: Thu Jan 08, 2009 10:12 am

Re: OpenMP updating sharing array very slow

Postby jiwa » Tue May 21, 2013 4:48 am

Thanks Mark.

The 413s was run on 4 threads.

The sequential time with the big_array assignment is about 40s. I did the one-thread parallel test but forgot the time (I'm at home now). nrow and ncolumn are quite small for this trial problem: 10 and 20. So big_array is not big here.

The real problem I am trying to solve has about 500 for nrow and ncolumn and about 100 to 200 for nlayer. The idea is to reduce the runtime of a half-week job to half a day or so on a 16-processor machine.

I tested the 10-layer trial problem with 10 threads in parallel, without the big_array assignment line, and it only took 4 to 9 seconds. So speedup isn't a problem from what I can see.

Thanks

Re: OpenMP updating sharing array very slow

Postby ftinetti » Tue May 21, 2013 6:06 am

Hi Ji,

Could you post a complete example to play around with a little?

Fernando.
ftinetti
 
Posts: 567
Joined: Wed Feb 10, 2010 2:44 pm

Re: OpenMP updating sharing array very slow

Postby MarkB » Tue May 21, 2013 6:06 am

Are you actually outputting any results? If not, it is possible that the compiler optimisation is eliminating dead code and not doing all the computation, except in the case where big_array is being assigned to, and is in a parallel region (which would require interprocedural analysis).
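
One way to rule out dead-code elimination in a timing test is to print a value that depends on every element of the result. The following is a minimal sketch only: the program name and the fill formula are hypothetical placeholders standing in for the real loop nest, and the array sizes are taken from the trial problem.

```fortran
program checksum_demo
  implicit none
  ! Sizes of the trial problem; the computation below is a placeholder.
  integer, parameter :: ncolumn = 20, nrow = 10, nlayer = 10
  integer, allocatable :: big_array(:,:,:)
  integer :: i, j, k

  allocate(big_array(ncolumn, nrow, nlayer))

  ! Hypothetical stand-in for the real work: any result stored in big_array.
  !$omp parallel do private(i, j)
  do k = 1, nlayer
     do j = 1, nrow
        do i = 1, ncolumn
           big_array(i, j, k) = i + j + k
        end do
     end do
  end do
  !$omp end parallel do

  ! The printed sum depends on every element, so the stores above are
  ! observable and the compiler cannot discard the loop nest as dead code.
  write(*,*) 'checksum: ', sum(big_array)
end program checksum_demo
```

With such a checksum in place, timings with and without a given statement compare runs that are both forced to do the full computation.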

Re: OpenMP updating sharing array very slow

Postby jiwa » Tue May 21, 2013 6:56 am

To Fernando: the rest of the code is rather simple; I can post it tomorrow. It's just about reading a file of grid coordinates into grd_x, grd_y and grd_z, and reading another big coordinate file into coord(nn,6). The big coordinate file contains ~10 million lines of coordinate data (700 MB), which is essential to make the nn loop time-consuming.

To Mark: the purpose of this code is to update a grid property index array (big_array(i,j,k)) by comparing each cell's spatial location and volume with a huge database held in coord(nn,6) and code_num(nn). So my output is big_array. If the assignment line is commented out, the code is meaningless.

btw: all the arrays are dynamic arrays; they are allocated before the parallel region.

Thanks
Ji

Re: OpenMP updating sharing array very slow

Postby MarkB » Tue May 21, 2013 7:20 am

jiwa wrote: To Mark: the purpose of this code is to update a grid property index array (big_array(i,j,k)) by comparing each cell's spatial location and volume with a huge database held in coord(nn,6) and code_num(nn). So my output is big_array. If the assignment line is commented out, the code is meaningless.


Sorry, what I meant was: does your test code actually write out the values in big_array (or something that depends on them)?

Re: OpenMP updating sharing array very slow

Postby ftinetti » Tue May 21, 2013 8:54 am

Hi again,

jiwa wrote: To Fernando: the rest of the code is rather simple; I can post it tomorrow.

Thanks, I'll try to see if there is something I missed.

jiwa wrote: It's just about reading a file of grid coordinates into grd_x, grd_y and grd_z, and reading another big coordinate file into coord(nn,6). The big coordinate file contains ~10 million lines of coordinate data (700 MB), which is essential to make the nn loop time-consuming.

Do not worry about input files, since
a) We are interested only in performance right now
b) The processing requirements of the code you posted are almost data-independent
Thus, we can fill in the arrays with constant values. The important stuff is, I think, to have a run similar to the one you reported, i.e. the actual values of nlayer, nrow, ncolumn and nn (for the original post).

Fernando.

Re: OpenMP updating sharing array very slow

Postby jiwa » Tue May 21, 2013 7:01 pm

Here is a trimmed version of the code, which should reproduce my problem on your machine.
My computer details: Win7 64-bit, Core i5 2.60GHz, 4 GB RAM
Compiler: MS Visual Studio 2008 with Intel Fortran Compiler

Code: Select all
program ppp
implicit none

integer, parameter :: ncolumn=20,nrow=10,nlayer=10,nn=5000000
real*8, allocatable :: grd_x(:),grd_y(:),grd_z(:),coord(:,:)
integer, allocatable :: code_num(:),big_array(:,:,:)
real*8 x1,x2,y1,y2,z1,z2,xx1,xx2,yy1,yy2,zz1,zz2,dx,dy,dz,volume,volume_max,t1
integer i,j,k,l,m

allocate(grd_x(ncolumn+1),grd_y(nrow+1),grd_z(nlayer+1),coord(nn,6),code_num(nn),big_array(ncolumn,nrow,nlayer))
grd_x=1.0; grd_y=2.0; grd_z=3.0; coord=4.0; code_num=5; big_array=6

t1=secnds(0.0)
call omp_set_num_threads(4)
!$OMP PARALLEL private(x1,x2,y1,y2,z1,z2,xx1,xx2,yy1,yy2,zz1,zz2,dx,dy,dz,volume,volume_max,m)
!$OMP DO
do k=1,nlayer
    write(*,*) "layer: ", k, secnds(0.0)-t1
    do j=1,nrow
        do i=1,ncolumn
            x1=grd_x(i); x2=grd_x(i+1)
            y1=grd_y(j); y2=grd_y(j+1)
            z1=grd_z(k); z2=grd_z(k+1)
            volume_max=0.0d0
            m=1
            do l=1,nn
                xx1=coord(l,1)-coord(l,4)/2.0d0; xx2=xx1+coord(l,4)
                yy1=coord(l,2)-coord(l,5)/2.0d0; yy2=yy1+coord(l,5)
                zz1=coord(l,3)-coord(l,6)/2.0d0; zz2=zz1+coord(l,6)
                dx=minval((/x2,xx2/))-maxval((/x1,xx1/))
                dx=maxval((/0.0d0,dx/))
                dy=minval((/y2,yy2/))-maxval((/y1,yy1/))
                dy=maxval((/0.0d0,dy/))
                dz=minval((/z2,zz2/))-maxval((/z1,zz1/))
                dz=maxval((/0.0d0,dz/))
                volume=dx*dy*dz
                if(volume>volume_max) then
                    volume_max=volume; m=code_num(l)
                endif
             enddo
            big_array(i,j,k)=m
        enddo !i
    enddo !j
enddo !k
!$OMP END DO NOWAIT
!$OMP END PARALLEL
write(*,*)'Total calculation time is (T_t):    ',secnds(0.0)-t1, '  seconds.'

open(1,file='out.txt'); write(1,*) big_array; close(1)

end

4 thread parallel output:
Code: Select all
c:\fortran\parallel\Release>parallel.exe
layer:            7  0.000000000000000E+000
layer:            4  0.000000000000000E+000
layer:            1  4.699999999866122E-002
layer:            9  4.699999999866122E-002
layer:           10   158.511999999995
layer:            8   159.260999999999
layer:            5   161.413000000000
layer:            2   162.582999999999
layer:            6   312.250000000000
layer:            3   324.464999999997
Total calculation time is (T_t):       416.940999999999        seconds.


1 thread parallel output:
Code: Select all
c:\fortran\parallel\Release>parallel.exe
layer:            1  0.000000000000000E+000
layer:            2   37.4400000000023
layer:            3   75.9559999999983
layer:            4   113.521000000001
layer:            5   151.896999999997
layer:            6   189.913999999997
layer:            7   230.786000000000
layer:            8   268.928000000000
layer:            9   306.913999999997
layer:           10   345.290000000001
Total calculation time is (T_t):       382.807999999997        seconds.


Sequential code output:
Code: Select all
c:\fortran\parallel\Release>parallel.exe
layer:            1 -1.156250000349246E-003
layer:            2   28.9838437500002
layer:            3   60.4798437499994
layer:            4   90.6348437499983
layer:            5   121.803843750000
layer:            6   150.741843750002
layer:            7   179.850843749999
layer:            8   208.695843750000
layer:            9   238.257843749998
layer:           10   267.195843750000
Total calculation time is (T_t):       295.899843749998        seconds.


Sequential code output (without big_array assignment line):
Code: Select all
c:\fortran\parallel\Release>parallel.exe
layer:            1 -7.499999992433004E-004
layer:            2   2.99425000000338
layer:            3   6.00525000000198
layer:            4   9.01524999999674
layer:            5   12.2452499999999
layer:            6   15.4272500000006
layer:            7   18.5632499999992
layer:            8   21.6512500000026
layer:            9   24.6782499999972
layer:           10   27.7042500000025
Total calculation time is (T_t):       30.7462499999965        seconds.

Re: OpenMP updating sharing array very slow

Postby MarkB » Wed May 22, 2013 3:26 am

Hi there,

I still think that removing the assignment to big_array is causing the compiler to optimise away most of the code (I've seen compilers do strange things in this situation before, such as only executing every nth iteration of the innermost loop).

I strongly suspect the lack of scaling of the code is due to memory bandwidth contention: the code is basically just repeatedly trawling through the coord array with no re-use.
On an AMD Interlagos system (which has a better memory subsystem than the i5) with the PGI compiler I get:

sequential: 144s
parallel 1 thread: 144s
parallel 2 threads: 86s
parallel 5 threads: 55s
parallel 10 threads: 51s

which is still pretty poor scaling.
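
To put numbers on that, speedup can be taken as the sequential time divided by the parallel time, and efficiency as speedup divided by thread count. A small sketch using only the timings listed above (the program and variable names are hypothetical):

```fortran
program speedup_table
  implicit none
  ! Thread counts and wall-clock times (seconds) copied from the list above.
  integer, parameter :: nthreads(4) = [1, 2, 5, 10]
  real, parameter :: t_par(4) = [144.0, 86.0, 55.0, 51.0]
  real, parameter :: t_seq = 144.0
  integer :: n
  real :: speedup

  do n = 1, size(nthreads)
     ! speedup = T_sequential / T_parallel; efficiency = speedup / threads
     speedup = t_seq / t_par(n)
     write(*,'(i3,a,f5.2,a,f5.2)') nthreads(n), ' threads: speedup ', &
          speedup, ', efficiency ', speedup / nthreads(n)
  end do
end program speedup_table
```

The speedup works out to roughly 1.7 on 2 threads, 2.6 on 5 and 2.8 on 10, so the efficiency falls from about 0.84 to about 0.28.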

You might be able to improve the performance by swapping the loop order so that the do l=1,nn loop is outermost, which will require expanding volume_max into a 3-D array. Then the coord array only gets trawled once instead of nlayer*ncolumn*nrow times.
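
A minimal sketch of this interchange, not the original poster's code: the l loop is moved outside the j and i loops and volume_max becomes a 3-D array. As an assumption made here to keep the k loop parallel without a barrier per l, the l loop sits just inside k, so coord is swept once per layer rather than once in total. The small sizes, the random placeholder data, the program name, and the use of scalar min/max in place of minval/maxval on array constructors are also illustrative choices.

```fortran
program interchange_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical sizes: nn is kept small so the sketch runs quickly.
  integer, parameter :: ncolumn = 20, nrow = 10, nlayer = 10, nn = 1000
  real(dp), allocatable :: grd_x(:), grd_y(:), grd_z(:), coord(:,:)
  real(dp), allocatable :: volume_max(:,:,:)
  integer, allocatable :: code_num(:), big_array(:,:,:)
  real(dp) :: xx1, xx2, yy1, yy2, zz1, zz2, dx, dy, dz, volume
  integer :: i, j, k, l

  allocate(grd_x(ncolumn+1), grd_y(nrow+1), grd_z(nlayer+1))
  allocate(coord(nn,6), code_num(nn))
  allocate(volume_max(ncolumn,nrow,nlayer), big_array(ncolumn,nrow,nlayer))

  ! Placeholder data: a unit-spaced grid and unit boxes at random centres.
  grd_x = [(real(i-1, dp), i = 1, ncolumn+1)]
  grd_y = [(real(j-1, dp), j = 1, nrow+1)]
  grd_z = [(real(k-1, dp), k = 1, nlayer+1)]
  call random_number(coord)
  coord(:,1) = coord(:,1)*ncolumn
  coord(:,2) = coord(:,2)*nrow
  coord(:,3) = coord(:,3)*nlayer
  coord(:,4:6) = 1.0d0
  code_num = [(l, l = 1, nn)]

  volume_max = 0.0d0
  big_array = 1          ! same default as m=1 in the original loop nest

  ! Each thread owns whole layers, so no two threads write the same element.
  !$omp parallel do private(i, j, l, xx1, xx2, yy1, yy2, zz1, zz2, dx, dy, dz, volume)
  do k = 1, nlayer
     do l = 1, nn
        ! Box extents are computed once per (k,l), not once per cell.
        xx1 = coord(l,1) - coord(l,4)/2.0d0; xx2 = xx1 + coord(l,4)
        yy1 = coord(l,2) - coord(l,5)/2.0d0; yy2 = yy1 + coord(l,5)
        zz1 = coord(l,3) - coord(l,6)/2.0d0; zz2 = zz1 + coord(l,6)
        dz = max(0.0d0, min(grd_z(k+1), zz2) - max(grd_z(k), zz1))
        if (dz <= 0.0d0) cycle   ! no overlap with this layer: volume is 0
        do j = 1, nrow
           dy = max(0.0d0, min(grd_y(j+1), yy2) - max(grd_y(j), yy1))
           do i = 1, ncolumn
              dx = max(0.0d0, min(grd_x(i+1), xx2) - max(grd_x(i), xx1))
              volume = dx*dy*dz
              if (volume > volume_max(i,j,k)) then
                 volume_max(i,j,k) = volume
                 big_array(i,j,k) = code_num(l)
              end if
           end do
        end do
     end do
  end do
  !$omp end parallel do

  ! Print something that depends on the result so the work is not elided.
  write(*,*) 'checksum: ', sum(big_array)
end program interchange_sketch
```

Because l still increases monotonically for every cell and the comparison stays strict, each cell should end up with the same code as in the original ordering. OpenMP has to be enabled at compile time (for example -fopenmp with gfortran, or /Qopenmp with Intel Fortran on Windows); without it the directives are treated as comments and the sketch runs serially.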

Hope that helps,
Mark.