Why large scale DGEMM parallelization appears strange?

Post by loveislonely » Thu Aug 28, 2008 10:10 am

Hi, I am working on a program that uses DGEMM for matrix multiplication. The DGEMM call has already been parallelized with OpenMP:

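Code:
C     Ask the OpenMP runtime how many threads a parallel region gets.
C     Lines starting with C$ are compiled only when OpenMP is enabled.
C$OMP Parallel
C$OMP Single
C$      NP=omp_get_num_threads()
C$      MinCoW=16
C$OMP End Single
C$OMP End Parallel
C     ColPW: columns of C per DGEMM call; NWork: number of calls.
C     N is the number of columns of C(M,N).
      ColPW = Max((N+NP-1)/NP,MinCoW)
      NWork = (N+ColPW-1)/ColPW
C     Stride of the N index in B: 1 if B is transposed, else LDB.
C     IncB and IncC are the element offsets between successive blocks.
      If(XStr2.eq.'T'.or.XStr2.eq.'C') then
        IncB = 1
      else
        IncB = LDB
      endIf
      IncB = IncB*ColPW
      IncC = ColPW*LDC
C     Each iteration multiplies A by one block of XN columns.
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
      Do 100 IP = 0, (NWork-1)
        XN = Min(N-IP*ColPW,ColPW)
        Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $    XLDB,Beta,C(1+IP*IncC),XLDC)
100   Continue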


I am using the pgf77 BLAS libraries libf77blas-amd64 / libatlas-em64t for this calculation.

Now the problem is: when I run the matrix multiplication in parallel (the matrices are 3432x3432), the speedup is perfect up to 7 processors, but on 8 processors it becomes really poor (less than 3 times). However, when I change the size of the matrices, e.g. to 924x924, the speedup on 8 processors is normal again. I tried allocating more memory for the 3432x3432 multiplication on 8 processors, but even with 10 GB (the limit of our hardware) the speedup stays the same. Can anyone here help me? Thank you very much!
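
For reference, the partitioning arithmetic from the code above can be replayed on its own for the sizes mentioned here. This is only a minimal sketch, not part of the original program: the names Chunks and Show are made up for illustration, and it assumes MinCoW=16 and NP equal to the number of threads, as in the code above.

```fortran
      Program Chunks
C     Sketch only: replays the column-partitioning arithmetic from the
C     code above for the sizes mentioned in this post. The program and
C     subroutine names are made up for illustration.
      Implicit None
      Call Show(3432, 7)
      Call Show(3432, 8)
      Call Show( 924, 8)
      End

      Subroutine Show(N, NP)
      Implicit None
      Integer N, NP, MinCoW, ColPW, NWork, Last
      Parameter (MinCoW = 16)
C     Columns per DGEMM call, rounded up, but at least MinCoW.
      ColPW = Max((N+NP-1)/NP, MinCoW)
C     Number of DGEMM calls (loop iterations) needed to cover N.
      NWork = (N+ColPW-1)/ColPW
C     Width of the final, possibly shorter, column block.
      Last = N - (NWork-1)*ColPW
      Write(*,*) 'N=', N, ' NP=', NP, ' ColPW=', ColPW,
     $           ' NWork=', NWork, ' Last=', Last
      End
```

By this arithmetic, 3432 columns on 7 threads give blocks of 491 columns (the last one 486), on 8 threads exactly 429 columns each, and 924 columns on 8 threads give blocks of 116 (the last one 112). So the 8-thread case is split evenly, which suggests the partition itself is probably not what causes the imbalance.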

Re: Why large scale DGEMM parallelization appears strange?

Post by mwolfe » Thu Sep 04, 2008 9:00 am

I've been looking at this, and I have essentially no ideas just yet.
I don't think it's a memory limit: 3432x3432x8 (bytes/double) is still only 100 MB or so, and three arrays that big are way less than even 1 GB.
I don't think it's an alignment issue, though you might try dimensioning the arrays 3433x3433 and only using the 3432x3432 subarray for data.
I don't think it's a scheduling issue, or it would affect the 924x924 matrix as well.
I will try some more experiments, but I'm leaving on vacation for a week shortly, so it may be another 10 days or so before I have any ideas.
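
As a quick check of the memory estimate above, taking 8 bytes per double-precision element and counting the three arrays A, B and C:

```latex
\begin{align*}
3432 \times 3432 \times 8\ \text{bytes} &= 94{,}228{,}992\ \text{bytes} \approx 94\ \text{MB},\\
3 \times 94{,}228{,}992\ \text{bytes} &= 282{,}686{,}976\ \text{bytes} \approx 283\ \text{MB} \ll 1\ \text{GB}.
\end{align*}
```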
-mw
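
The padding experiment suggested above (dimension the arrays one larger and use only the subarray) might look roughly like this. It is a sketch on a tiny case, not the original program: the name PadLD and the values N=4, LD=5 are illustrative, and for the matrices in this thread one would presumably use N=3432 and LD=3433. It assumes a BLAS library providing DGEMM is available at link time.

```fortran
      Program PadLD
C     Sketch only: shows the "pad the leading dimension" idea on a tiny
C     case. N and LD are illustrative values, not from the original
C     program. Needs a BLAS library at link time for DGEMM.
      Implicit None
      Integer N, LD, I, J
      Parameter (N = 4, LD = N + 1)
      Double Precision A(LD,LD), B(LD,LD), C(LD,LD)
      External DGEMM
C     Fill only the N x N subarrays; the extra row and column are
C     padding and are never referenced by DGEMM.
      Do 20 J = 1, N
         Do 10 I = 1, N
            A(I,J) = 1.0D0
            B(I,J) = 2.0D0
            C(I,J) = 0.0D0
 10      Continue
 20   Continue
C     The matrix sizes passed are N, but the leading dimensions are LD.
      Call DGEMM('N', 'N', N, N, N, 1.0D0, A, LD, B, LD,
     $           0.0D0, C, LD)
C     Each entry of C is a sum of N products 1*2.
      Write(*,*) 'C(1,1) =', C(1,1)
      End
```

The only change relative to an unpadded call is that LDA, LDB and LDC are passed as LD instead of N, so consecutive columns start LD elements apart in memory.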

Re: Why large scale DGEMM parallelization appears strange?

Post by loveislonely » Fri Sep 05, 2008 4:00 am

Hi MW,

Thank you very much for your analysis. After several days of trying to track down the cause, I finally concluded that the problem lies in the PGI BLAS library I had been using for the matrix multiplication: once I switched to the latest PGI library, the speedup on 8 processors became normal. Thank you.

Best wishes,
Sharp

