problem about reduction

Postby LungShengChien » Mon Feb 25, 2008 1:16 am

Dear all:
I am trying to parallelize the following code:

Code: Select all
   double A_inf = 0.0 ;
   for (i=1 ; i <=nx ; i++){
      for(j=1 ; j <= ny ; j++){
         for(k=1 ; k <= nz ; k++){
            err = fabs( A(i,j,k) - A_e(i,j,k) ) ;
            A_inf = QMAX( err, A_inf ) ;
         }// for k
      }// for j
   }// for i



A and A_e are 3D matrices and QMAX is a macro, defined by
#define QMAX(a, b) ((b) < (a) ? (a) : (b))

The OpenMP spec for C/C++ does not let me use the reduction clause directly for a max, so I
try two ways.
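
(For reference, if the reduction clause supported max for C/C++ the way the Fortran one does, the whole computation would collapse into the sketch below. This is only what I would like to write, not what the spec lets me write.)

Code: Select all
   double A_inf = 0.0 ;
#pragma omp parallel for default(none) shared( nx, ny, nz, NUM_THREAD, A, A_e ) \
      private(i,j,k,err)  reduction(max:A_inf)  num_threads(NUM_THREAD)
   for (i=1 ; i <=nx ; i++){
      for(j=1 ; j <= ny ; j++){
         for(k=1 ; k <= nz ; k++){
            err = fabs( A(i,j,k) - A_e(i,j,k) ) ;
            A_inf = QMAX( err, A_inf ) ;   // the clause would combine the per-thread maxima at the end
         }// for k
      }// for j
   }// for i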

Scheme 1:
Code: Select all
      double *A_inf_list = NULL ;
      A_inf_list = (double*) malloc( sizeof(double)* NUM_THREAD ) ;
      assert( A_inf_list ) ;
      for ( i = 0 ; i < NUM_THREAD ; i++){
         A_inf_list[i] = 0.0 ;
      }
      
      int th_id ;
#pragma omp parallel for default(none) shared( nx, ny, nz, NUM_THREAD, A, A_e, A_inf_list ) \
      private(i,j,k,th_id,err)  num_threads(NUM_THREAD)
      for (i=1 ; i <=nx ; i++){
         // [fetch openmp thread variable]
         th_id = omp_get_thread_num() ;
         assert( 0 <= th_id ) ;
         assert( th_id < NUM_THREAD ) ;
         // [end fetch openmp thread variable]
         err = 0.0 ;
         for(j=1 ; j <= ny ; j++){
            for(k=1 ; k <= nz ; k++){
               err =   fabs( A(i,j,k) - A_e(i,j,k) ) ;
               A_inf_list[th_id] = QMAX( err, A_inf_list[th_id] ) ;
            }// for k
         }// for j
      }// for i

      A_inf = 0.0 ;
      for ( i = 0 ; i < NUM_THREAD ; i++){
         A_inf = QMAX(A_inf_list[i], A_inf ) ;
      }      


Scheme 2:
Code: Select all
      double *A_inf_list = NULL ;
      A_inf_list = (double*) malloc( sizeof(double)* NUM_THREAD ) ;
      assert( A_inf_list ) ;
      for ( i = 0 ; i < NUM_THREAD ; i++){
         A_inf_list[i] = 0.0 ;
      }
      
      int th_id ;
#pragma omp parallel for default(none) shared( nx, ny, nz, NUM_THREAD, A, A_e, A_inf_list ) \
      private(i,j,k,th_id,err)  num_threads(NUM_THREAD)
      for (i=1 ; i <=nx ; i++){
         // [fetch openmp thread variable]
         th_id = omp_get_thread_num() ;
         assert( 0 <= th_id ) ;
         assert( th_id < NUM_THREAD ) ;
         // [end fetch openmp thread variable]
         err = 0.0 ;
         for(j=1 ; j <= ny ; j++){
            for(k=1 ; k <= nz ; k++){
               err = QMAX( err, fabs( A(i,j,k) - A_e(i,j,k) ) ) ;
            }// for k
         }// for j
         A_inf_list[th_id] = QMAX( err, A_inf_list[th_id] ) ;
      }// for i

      A_inf = 0.0 ;
      for ( i = 0 ; i < NUM_THREAD ; i++){
         A_inf = QMAX(A_inf_list[i], A_inf ) ;
      }



Intuitively, scheme 2 should be faster than scheme 1, since it writes to the shared array
A_inf_list only once per i iteration instead of once per (i,j,k). What I did not expect is that
scheme 1 shows no parallel efficiency at all: the more threads I use, the slower scheme 1 becomes.

I ran the experiment on Fedora 7 with icpc v10 on an Intel(R) Xeon(R) CPU X5365 @ 3.00GHz.
The compiler options are
icpc -pipe -w -openmp -O2 -mp -DQT_NO_DEBUG -DQT_SHARED -DQT_THREAD_SUPPORT


The experimental results are as follows. We choose nx = ny = nz = N with N = 256, 512, 640, 768 and 1024.
For schemes 1 and 2 the rows correspond to 1, 2 and 4 threads.
------------------------------------------------------------------
              N = 256    N = 512    N = 640     N = 768    N = 1024
sequential     273 ms    1980 ms    3898 ms     6883 ms    16975 ms

scheme 1
  1 thread     376 ms    2980 ms    6002 ms     9996 ms    24231 ms
  2 threads    498 ms    4410 ms    8605 ms    15991 ms    36936 ms
  4 threads    917 ms    6611 ms   13828 ms    22423 ms    53399 ms

scheme 2
  1 thread     325 ms    2626 ms    5043 ms     9088 ms    21508 ms
  2 threads    164 ms    1376 ms    2695 ms     4516 ms    10741 ms
  4 threads     83 ms     657 ms    1298 ms     2433 ms     5703 ms
------------------------------------------------------------------

My question is: why does scheme 1 get slower as the number of threads increases?


Lung Sheng Chien
Department of Mathematics,
Tsing Hua university, R.O.C

Re: problem about reduction

Postby ejd » Mon Feb 25, 2008 10:03 pm

I believe what you are seeing is a problem of "false sharing". We have done quite well over the years optimizing our hardware for serial execution. Unfortunately, this is a case where those serial "optimizations" get us into trouble. The problem in scheme 1 is your use of the array A_inf_list. You are indexing into the array by the thread number, thinking that each element is a separate entity. The problem is that when an element is fetched from memory, it is not just that element that is fetched but a whole cache line (i.e., multiple adjacent elements of the array). So when another thread wants to store a value into this array, the whole cache-coherence mechanism gets in the way - just as if the threads were sharing a single storage element (giving rise to the name "false sharing").

One way around this is to waste some storage. Instead of allocating A_inf_list with just NUM_THREAD elements and indexing into it by the thread number, allocate enough elements so that each accessed element fills a whole cache line together with its "wasted space". Then index into the array by the thread number times some stride, so that each thread's element lands on its own cache line. I don't know how big a cache line on an Intel Xeon X5365 is, but generally it will be something like 32, 64, or 128 bytes. If it is a "false sharing" problem, which is what it appears to be on other hardware (this is not a problem specific to Intel), then you will see a substantial improvement in scheme 1.
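
(If you want to check the line size on Linux instead of guessing, glibc exposes it through sysconf; note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension rather than portable POSIX, and some systems simply report 0:)

Code: Select all
#include <stdio.h>
#include <unistd.h>

int main(void)
{
   /* glibc extension: L1 data cache line size in bytes; 0 or -1 if the system does not report it */
   long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
   printf("L1 data cache line: %ld bytes\n", line);
   return 0;
}

The same value is available from the shell with "getconf LEVEL1_DCACHE_LINESIZE".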

So for your scheme 1 code, assuming that a cache line is 64 bytes (which I am not sure of), you want to use array elements 0, 8, 16, etc. of A_inf_list for the threads (remember each element is a double or 8 bytes) instead of consecutive elements. Thus the code would look something like the following (modified lines marked):
Code: Select all
          double *A_inf_list = NULL ;
          A_inf_list = (double*) malloc( sizeof(double)* (8*NUM_THREAD)) ; // more than enough ??
          assert( A_inf_list ) ;
          for ( i = 0 ; i < 8*NUM_THREAD ; i++){  // could just zero elements used
             A_inf_list[i] = 0.0 ;
          }
         
          int th_id ;
    #pragma omp parallel for default(none) shared( nx, ny, nz, NUM_THREAD, A, A_e, A_inf_list ) \
          private(i,j,k,th_id,err)  num_threads(NUM_THREAD)
          for (i=1 ; i <=nx ; i++){
             // [fetch openmp thread variable]
             th_id = omp_get_thread_num() ;
             assert( 0 <= th_id ) ;
             assert( th_id < NUM_THREAD ) ;
             // [end fetch openmp thread variable]
             err = 0.0 ;
             for(j=1 ; j <= ny ; j++){
                for(k=1 ; k <= nz ; k++){
                   err =   fabs( A(i,j,k) - A_e(i,j,k) ) ;
                   A_inf_list[8*th_id] = QMAX( err, A_inf_list[8*th_id] ) ;  // offset to element
                }// for k
             }// for j
          }// for i

          A_inf = 0.0 ;
          for ( i = 0 ; i < NUM_THREAD ; i++){
             A_inf = QMAX(A_inf_list[i*8], A_inf ) ;  // offset to appropriate element
          }     

This is just a "quick and dirty" example of one possible change that should help, but I hope you get the idea. There is a lot of literature that covers this problem in detail if you want a really complete answer.
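
For what it's worth, another way to express the same padding, shown here only as a sketch (the 64-byte line size is still just an assumption), is to make it explicit with a struct so the indexing stays simple:

Code: Select all
/* Hypothetical padded slot: each thread's running maximum gets its own cache line. */
#define CACHE_LINE 64   /* assumed line size in bytes - check your hardware */
typedef struct {
   double value;
   char   pad[CACHE_LINE - sizeof(double)];   /* fill out the rest of the line */
} padded_max;

padded_max *A_inf_list = (padded_max*) malloc( sizeof(padded_max) * NUM_THREAD ) ;
/* ... and inside the loop the update becomes:
   A_inf_list[th_id].value = QMAX( err, A_inf_list[th_id].value ) ;  */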

Re: problem about reduction

Postby LungShengChien » Tue Feb 26, 2008 9:33 pm

Thanks for ejd's comment. I modified the code and tested 4 cases,
with padding stride = 8, 16, 32 and 64 doubles (i.e., 64, 128, 256 and 512 bytes between the threads' slots).
All of them work; the experimental results are shown below.

Code: Select all
  double *A_inf_list = NULL ;
  A_inf_list = (double*) malloc( sizeof(double)*( stride*NUM_THREAD + 1 ) ) ;
  assert( A_inf_list ) ;
  for ( i = 0 ; i < NUM_THREAD ; i++){
     A_inf_list[ stride*i ] = 0.0 ;
  }

  int th_id ;
#pragma omp parallel for default(none) shared( nx, ny, nz, stride, NUM_THREAD, A, A_e, A_inf_list ) \
      private(i,j,k,th_id,err)  num_threads(NUM_THREAD)
  for (i=1 ; i <=nx ; i++){
     // [fetch openmp thread variable]
     th_id = omp_get_thread_num() ;
     assert( 0 <= th_id ) ;
     assert( th_id < NUM_THREAD ) ;
     // [end fetch openmp thread variable]
     err = 0.0 ;
     for(j=1 ; j <= ny ; j++){
        for(k=1 ; k <= nz ; k++){
           err = QMAX( err, fabs( A(i,j,k) - A_e(i,j,k) ) ) ;
           A_inf_list[ stride*th_id ] = QMAX( err, A_inf_list[ stride*th_id ] ) ;
        }// for k
     }// for j
  }// for i

  A_inf = 0.0 ;
  for ( i = 0 ; i < NUM_THREAD ; i++){
     A_inf = QMAX( A_inf_list[ i*stride ], A_inf ) ;
  }
  free( A_inf_list ) ;
                       


The experimental results for scheme 1 with padding are as follows. We choose nx = ny = nz = N;
the timings below are for N = 256, 512 and 640. For each stride the rows correspond to 1, 2 and 4 threads.

scheme 1:
              N = 256    N = 512    N = 640
stride = 1:
  1 thread    197 ms     1488 ms    2833 ms
  2 threads   310 ms     2481 ms    4081 ms
  4 threads   552 ms     4138 ms    8631 ms

stride = 8:
  1 thread    197 ms     1547 ms    2832 ms
  2 threads   101 ms      724 ms    1410 ms
  4 threads    49 ms      406 ms     767 ms

stride = 16:
  1 thread    190 ms     1538 ms    2894 ms
  2 threads   101 ms      724 ms    1410 ms
  4 threads    50 ms      394 ms     767 ms

stride = 32:
  1 thread    196 ms     1458 ms    3007 ms
  2 threads   101 ms      724 ms    1410 ms
  4 threads    49 ms      393 ms     767 ms

stride = 64:
  1 thread    197 ms     1494 ms    2954 ms
  2 threads   101 ms      724 ms    1409 ms
  4 threads    54 ms      394 ms     767 ms
--------------------------------------------------------------

Remark: stride = 4 does not work; its performance is unstable. With stride = 4 the threads' slots are only 4*8 = 32 bytes apart, which is smaller than a typical 64-byte cache line, so neighbouring threads can still share a line.

Moreover, following ejd's comment in the topic "Doing a reduction with MAX/MIN in C", I modified
my code as follows:

scheme 3
Code: Select all
  A_inf = 0.0 ;
#pragma omp parallel default(none) shared( nx, ny, nz, NUM_THREAD, A, A_e, A_inf ) \
        private(i,j,k,err)  num_threads(NUM_THREAD)
{
   err = 0.0 ;
   #pragma omp for
   for (i=1 ; i <= nx ; i++){
      for(j=1 ; j <= ny ; j++){
         for(k=1 ; k <= nz ; k++){
            err = QMAX( err, fabs( A(i,j,k) - A_e(i,j,k) ) ) ;
         } // for k
      }// for j
      if ( err > A_inf ){
         #pragma omp critical
         { // other thread may update A_inf first, so we need to check again
            if ( err > A_inf ) A_inf = err ;
         }
      }
   }// for i
}


The experimental results show that scheme 2 and scheme 3 perform about the same.

However, I think scheme 3 is better for me, since it requires changing a smaller part of the code.
Lung Sheng Chien
Department of Mathematics,
Tsing Hua university, R.O.C

