What's wrong with my program?

General OpenMP discussion

What's wrong with my program?

Postby hjnln » Mon Dec 16, 2013 7:06 am

The subroutine:
Code: Select all
void valuep(float x, float y, float *xaxis, float *yaxis, float *r, long long int totalnum, float *p)
{
  long long int i, j;
  float dist, path = 0, tp = 0;
#pragma omp parallel for default(none) num_threads(num) shared(totalnum,x,y,xaxis,yaxis,r) private(i,dist,path) reduction(+:tp)
  for (i = 1; i <= totalnum; i++)
    {
      path = 0;
      dist = (x - xaxis[i])*(x - xaxis[i]) + (y - yaxis[i])*(y - yaxis[i]);
      if (dist < r[i])
        {
          path = sqrt(r[i] - dist);
        }
      tp = tp + path;
    }
#pragma omp barrier
  *p = tp;
}


The main function:
Code: Select all
float ph;
for (j = 1; j <= 130000; j++)
  {
    for (i = 1; i <= 130000; i++)
      {
        valuep(i, j, x, y, r, totalnum, &ph);
        phase[(j-1)*130000 + i] = ph;
      }
    printf("j= %lld   ", j);
  }


The problem:
My computer has 24 cores.
totalnum = 2730256, which is large enough.
When I set num=20, each outer-loop iteration (each printf("j= %lld ",j)) takes 57 s.
But when I set num=5, it takes only 44 s.
More threads, but not less time, so I don't know what is wrong with my program. Has anyone run into this problem before?
Can anyone help me? Thank you very much! :oops:
hjnln
 
Posts: 5
Joined: Sun May 12, 2013 10:11 pm

Re: What's wrong with my program?

Postby MarkB » Mon Dec 16, 2013 8:12 am

Hi there,

Your program may be suffering from NUMA effects.
The problem is that your code is very memory bandwidth intensive: almost all the time will be spent loading the xaxis, yaxis and r arrays.
If these arrays are initialised by one thread, then (under the default first-touch policy) they will most likely all be allocated in the memory of one NUMA node,
and frequent accesses to that one node by lots of threads become a serious bottleneck.

Some possible solutions:

  • If you are on a linux system, run the code with numactl --interleave=all to change the allocation policy from first touch to round-robin
  • Parallelise the initialisation of all your arrays with OpenMP (there is a short sketch of this after the list)
  • Best of all, restructure your code to get better memory locality, for example by making the parallel loop compute the ph value for a block of x values, rather than just a single one. This should improve the sequential performance as well as the parallel scaling, though you will need to code the reduction operations "by hand", using atomics, for example. If this doesn't make sense, let me know and I can sketch some code for you.
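To illustrate the second point, here is a minimal, untested sketch of parallel first-touch initialisation. The names xaxis, yaxis, r and totalnum come from your code; init_arrays is just a made-up helper, and I am assuming the arrays are allocated with malloc and that the compute loop keeps the default static schedule, so that each thread initialises the same part of the arrays it will later read.

Code: Select all
#include <stdlib.h>

/* First-touch initialisation sketch: each thread writes, and therefore
   places on its own NUMA node, the portion of the arrays it will later
   read in the compute loop (both loops assumed to use a static schedule). */
void init_arrays(float **xaxis, float **yaxis, float **r, long long int totalnum)
{
  long long int i;

  *xaxis = malloc((totalnum + 1) * sizeof(float));
  *yaxis = malloc((totalnum + 1) * sizeof(float));
  *r     = malloc((totalnum + 1) * sizeof(float));

#pragma omp parallel for schedule(static)
  for (i = 1; i <= totalnum; i++)
    {
      /* the first write to each page decides which NUMA node owns it */
      (*xaxis)[i] = 0.0f;   /* put the real values here */
      (*yaxis)[i] = 0.0f;
      (*r)[i]     = 0.0f;
    }
}

If the arrays are instead read from a file by a single thread, the numactl option above is the easier fix.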
Hope that helps,
Mark.
MarkB
 
Posts: 477
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: What's wrong with my program?

Postby hjnln » Mon Dec 16, 2013 8:33 pm

Hi~ Mark,
I am on a Linux system (RedHat).
I have realised what the problem with my code is: it probably comes from loading the xaxis, yaxis and r arrays all the time (NUMA effects; I only learned this term from you just now :cry: ).
I have tried to restructure my code many times, but it does not work well. So I would like you to sketch some code for me, and I also hope you can give me some suggestions to help me master this part. My major is physics and I am a beginner in OpenMP; I think I will meet this problem frequently in my work.
Thanks for your help.

hjnln
 
Posts: 5
Joined: Sun May 12, 2013 10:11 pm

Re: What's wrong with my program?

Postby MarkB » Tue Dec 17, 2013 5:48 am

This is just a sketch to give you the idea: please don't just copy it, as it's not very elegant, and I haven't tested it.
You will need to experiment to find a reasonable value of the block size BS: it needs to be small enough such that tp[] fits in the level 1 cache.

Code: Select all
       
float ph[BS];
long long int i, j, k;
for (j = 1; j <= 130000; j++) {
    for (i = 1; i <= 130000; i += BS) {        /* assumes BS divides 130000 */
        valuep(i, j, x, y, r, totalnum, ph);   /* ph already decays to a float* */
        for (k = 0; k < BS; k++) {
            phase[(j-1)*130000 + i + k] = ph[k];
        }
    }
    printf("j= %lld   ", j);
}


void valuep(int x, int y, float *xaxis, float *yaxis, float *r, long long int totalnum, float *p)
{
    long long int i, k;
    float dist, tp[BS], fx, fy, tri, txaxis, tyaxis;

    for (k = 0; k < BS; k++) {       /* p[] is accumulated with atomics below, so zero it first */
        p[k] = 0;
    }
#pragma omp parallel default(none) num_threads(num) shared(totalnum,x,y,xaxis,yaxis,r,p) private(fx,fy,i,k,dist,tp,tri,txaxis,tyaxis)
    {
        fy = (float)y;
        for (k = 0; k < BS; k++) {
            tp[k] = 0;               /* per-thread partial sums for the block of x values */
        }
#pragma omp for
        for (i = 1; i <= totalnum; i++) {
            tri    = r[i];
            txaxis = xaxis[i];
            tyaxis = yaxis[i];
            for (k = 0; k < BS; k++) {
                fx   = (float)(x + k);
                dist = (fx - txaxis)*(fx - txaxis) + (fy - tyaxis)*(fy - tyaxis);
                if (dist < tri) {
                    tp[k] += sqrt(tri - dist);
                }
            }
        }                            /* end of the worksharing loop: tp[] holds this thread's totals */
        for (k = 0; k < BS; k++) {
#pragma omp atomic
            p[k] += tp[k];
        }
    }                                /* end parallel region */
}
MarkB
 
Posts: 477
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

