- Code: Select all
C     Determine the thread count once, inside a parallel region
C$OMP Parallel
C$OMP Single
C$    NP = omp_get_num_threads()
C$    MinCoW = 16
C$OMP End Single
C$OMP End Parallel

C     Columns per work unit (at least MinCoW) and number of work units
      ColPW = Max((N+NP-1)/NP, MinCoW)
      NWork = (N+ColPW-1)/ColPW   ! N is the number of columns of C(M,N)

C     Column stride into B depends on whether B is transposed
      If (XStr2.eq.'T' .or. XStr2.eq.'C') then
         IncB = 1
      else
         IncB = LDB
      endIf
      IncB = IncB*ColPW
      IncC = ColPW*LDC

C     Each work unit multiplies one panel of ColPW columns of C
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
      Do 100 IP = 0, NWork-1
         XN = Min(N-IP*ColPW, ColPW)
         Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $              XLDB,Beta,C(1+IP*IncC),XLDC)
  100 Continue
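For reference, here is a minimal Python sketch of the blocking arithmetic above (the `partition` helper is my own name, not part of the Fortran code); it mirrors `ColPW`, `NWork`, and the per-block `XN`, assuming positive `N` and `NP` so that Fortran's `(N+NP-1)/NP` integer division equals a ceiling:

```python
import math

def partition(N, NP, min_cow=16):
    """Mirror of the Fortran blocking: columns per work unit and block sizes."""
    col_pw = max(math.ceil(N / NP), min_cow)    # ColPW = Max((N+NP-1)/NP, MinCoW)
    n_work = math.ceil(N / col_pw)              # NWork = (N+ColPW-1)/ColPW
    # XN = Min(N-IP*ColPW, ColPW) for each IP
    blocks = [min(N - ip * col_pw, col_pw) for ip in range(n_work)]
    return col_pw, n_work, blocks

# N = 3432 with 8 threads splits into eight equal 429-column panels
print(partition(3432, 8))
```

So for the 3432×3432 case with 8 threads the work divides evenly (eight blocks of 429 columns), while for 924 columns the last of the eight blocks is slightly smaller (112 columns).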

I am using the pgf77-compiled ATLAS BLAS libraries (libf77blas-amd64 / libatlas-em64t) for this calculation.

Now the problem: when I run the parallelized matrix-multiplication jobs with matrices of size 3432×3432, the speedup is nearly perfect up to 7 processors, but with 8 processors it drops sharply (less than 3×). However, when I change the matrix size to, e.g., 924×924, the speedup on 8 processors is normal again. I tried giving the 3432×3432 run on 8 processors more memory, but even at 10 GB (the limit of our hardware) the speedup stays the same. Can anyone here help me? Thank you very much!