OpenMP efficiency problem with Fortran code

Post by zhouping » Tue Jul 29, 2008 7:20 pm

I have some questions about Fortran code parallelized with OpenMP. The code is compiled with Visual Studio 2005 and Intel Fortran Compiler 10.0, and the project runs on Windows XP on a machine with 8 cores. After testing, I found that the parallel efficiency is very low. Please give me some advice.

First, here is the serial code.
Code:
      SUBROUTINE INTERF_BT(IIII,XN,YN,ZN,BX1,BX2,BY1,BY2,VNE1,VNE2,VNE3,
     &                     WEIGH,H,DMARX,DMARY,DMARZ,STRS1,STRS2,STRS3,
     &                     STRS4,STRS5,WWW,GS,FX,FY,FZ,FRX,FRY,FRZ,
     &                     VX,VY,VZ,VRX,VRY,VRZ,HX,HY,HZ,HRX,HRY,
     &                     TN,DK,MATE,E_GP,U_GP,HRATIO_GP,EleArea)
      IMPLICIT REAL*8(A-H,O-Z)
      COMMON/PRTMAIN/L0,NE,NB,NK,NELE,NDIS,NVEL,NFOR,NMAT,NCOD,NREF
      COMMON/REPORTERR/ERRORPORT
      DIMENSION IIII(4,NE),XN(L0),YN(L0),ZN(L0),
     &          BX1(NE),BX2(NE),BY1(NE),BY2(NE),
     &          VNE1(3,NE),VNE2(3,NE),VNE3(3,NE),
     &          WEIGH(NE),H(NE),DMARX(L0),DMARY(L0),DMARZ(L0),
     &          STRS1(NB,NE),STRS2(NB,NE),STRS3(NB,NE),
     &          STRS4(NB,NE),STRS5(NB,NE),WWW(NB),GS(NB),
     &          FX(L0),FY(L0),FZ(L0),FRX(L0),FRY(L0),FRZ(L0),
     &          VX(L0),VY(L0),VZ(L0),VRX(L0),VRY(L0),VRZ(L0),
     &          HX(4,NE),HY(4,NE),HZ(4,NE),HRX(4,NE),HRY(4,NE),
     &          MATE(NE),E_GP(10),U_GP(10),HRATIO_GP(10),
     &          EleArea(NE)

      DIMENSION IX(4),XX(4),YY(4),ZZ(4),DDF(5),DDM(3),GAMA(4),SQ(5),
     &          FLX(4),FLY(4),FLZ(4),FLRX(4),FLRY(4),FLRZ(4),
     &          FGX(4),FGY(4),FGZ(4),FGRX(4),FGRY(4),FGRZ(4),
     &          HLX(4),HLY(4),HLZ(4),HLRX(4),HLRY(4),HLRZ(4),
     &          QBX(4),QBY(4),QBZ(4),QBRX(4),QBRY(4),BQ(5)
     
      RS=0.01D0
      RW=0.01D0
      RF=0.01D0
     
      DO II=1,NE
      ...thread independent code
C---------------------------------
      IF(expression) THEN
      ...thread independent code
C-------------------
      ...thread independent code
C---------------------
      ...thread independent code
C-----------------------------
      ...thread independent code
      [b]DO I=1,4
      K=IX(I)
      DMARX(K)=DMARX(K)+DINERTIA*V1
      DMARY(K)=DMARY(K)+DINERTIA*V2
      DMARZ(K)=DMARZ(K)+DINERTIA*V3
      ENDDO[/b]
C------------------------------------
      DO I = 1, 5
      ...thread independent code
      IF(IX(3).NE.IX(4)) THEN
      DO I=1,4
      ...thread independent code
      ENDDO
      ...thread independent code
      DO I=1,4
      ...thread independent code
      ENDDO
C-----------------------------
      DO I=1,4
      ...thread independent code
      ENDDO
      ENDIF
C---------------------------------
      DO I=1,4
      ...thread independent code
      ENDDO
      [b]DO I=1,4
      K=IX(I)
      FX(K)=FX(K)+FGX(I)
      FY(K)=FY(K)+FGY(I)
      FZ(K)=FZ(K)+FGZ(I)
      FRX(K)=FRX(K)+FGRX(I)
      FRY(K)=FRY(K)+FGRY(I)
      FRZ(K)=FRZ(K)+FGRZ(I)
      ENDDO[/b]     
      ENDIF
      ENDDO
     
      IF(ERRORPORT.GT.0.D0) WRITE(12,*) 'INTERF_BT'
      RETURN
      END


Second, here is the parallel code with OpenMP:
Code:
      SUBROUTINE INTERF_BT(IIII,XN,YN,ZN,BX1,BX2,BY1,BY2,VNE1,VNE2,VNE3,
     &                     WEIGH,H,DMARX,DMARY,DMARZ,STRS1,STRS2,STRS3,
     &                     STRS4,STRS5,WWW,GS,FX,FY,FZ,FRX,FRY,FRZ,
     &                     VX,VY,VZ,VRX,VRY,VRZ,HX,HY,HZ,HRX,HRY,
     &                     TN,DK,MATE,E_GP,U_GP,HRATIO_GP,EleArea)
      IMPLICIT REAL*8(A-H,O-Z)
      COMMON/PRTMAIN/L0,NE,NB,NK,NELE,NDIS,NVEL,NFOR,NMAT,NCOD,NREF
      COMMON/REPORTERR/ERRORPORT
      DIMENSION IIII(4,NE),XN(L0),YN(L0),ZN(L0),
     &          BX1(NE),BX2(NE),BY1(NE),BY2(NE),
     &          VNE1(3,NE),VNE2(3,NE),VNE3(3,NE),
     &          WEIGH(NE),H(NE),DMARX(L0),DMARY(L0),DMARZ(L0),
     &          STRS1(NB,NE),STRS2(NB,NE),STRS3(NB,NE),
     &          STRS4(NB,NE),STRS5(NB,NE),WWW(NB),GS(NB),
     &          FX(L0),FY(L0),FZ(L0),FRX(L0),FRY(L0),FRZ(L0),
     &          VX(L0),VY(L0),VZ(L0),VRX(L0),VRY(L0),VRZ(L0),
     &          HX(4,NE),HY(4,NE),HZ(4,NE),HRX(4,NE),HRY(4,NE),
     &          MATE(NE),E_GP(10),U_GP(10),HRATIO_GP(10),
     &          EleArea(NE)

      DIMENSION IX(4),XX(4),YY(4),ZZ(4),DDF(5),DDM(3),GAMA(4),SQ(5),
     &          FLX(4),FLY(4),FLZ(4),FLRX(4),FLRY(4),FLRZ(4),
     &          FGX(4),FGY(4),FGZ(4),FGRX(4),FGRY(4),FGRZ(4),
     &          HLX(4),HLY(4),HLZ(4),HLRX(4),HLRY(4),HLRZ(4),
     &          QBX(4),QBY(4),QBZ(4),QBRX(4),QBRY(4),BQ(5)
     
      RS=0.01D0
      RW=0.01D0
      RF=0.01D0
   
!$omp parallel do num_threads(2)
!$omp&default(private)
!$omp&shared(IIII, XN, YN, ZN, BX1, BX2, BY1, BY2, VNE1, VNE2, VNE3,
!$omp&       WEIGH, H, STRS1, STRS2, STRS3, STRS4, STRS5, WWW, GS,
!$omp&       VX, VY, VZ, VRX, VRY, VRZ, HX, HY, HZ, HRX, HRY,
!$omp&       TN, DK, MATE, E_GP, U_GP, HRATIO_GP, EleArea, RS, RW, RF,
!$omp&       L0,NE,NB,NK,NELE,NDIS,NVEL,NFOR,NMAT,NCOD,NREF)
!$omp&reduction(+ : DMARX, DMARY, DMARZ, FX, FY, FZ, FRX, FRY, FRZ)   
      DO II=1,NE
      ...thread independent code
C---------------------------------
      IF(expression) THEN
      ...thread independent code
C-------------------
      ...thread independent code
C---------------------
      ...thread independent code
C-----------------------------
      ...thread independent code
      [b]DO I=1,4
      K=IX(I)
      DMARX(K)=DMARX(K)+DINERTIA*V1
      DMARY(K)=DMARY(K)+DINERTIA*V2
      DMARZ(K)=DMARZ(K)+DINERTIA*V3
      ENDDO[/b]
C------------------------------------
      DO I = 1, 5
      ...thread independent code
      IF(IX(3).NE.IX(4)) THEN
      DO I=1,4
      ...thread independent code
      ENDDO
      ...thread independent code
      DO I=1,4
      ...thread independent code
      ENDDO
C-----------------------------
      DO I=1,4
      ...thread independent code
      ENDDO
      ENDIF
C---------------------------------
      DO I=1,4
      ...thread independent code
      ENDDO
      [b]DO I=1,4
      K=IX(I)
      FX(K)=FX(K)+FGX(I)
      FY(K)=FY(K)+FGY(I)
      FZ(K)=FZ(K)+FGZ(I)
      FRX(K)=FRX(K)+FGRX(I)
      FRY(K)=FRY(K)+FGRY(I)
      FRZ(K)=FRZ(K)+FGRZ(I)
      ENDDO[/b]     
      ENDIF
      ENDDO
!$omp end parallel do

      IF(ERRORPORT.GT.0.D0) WRITE(12,*) 'INTERF_BT'
      RETURN
      END


The bold-face parts (marked [b]...[/b] in the code) update the reduction variables; all other variables are independent across threads.

The loop bound NE may be 1000 to 10000 or even more, and L0 may likewise be 1000 to 10000 or even more.
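
For readers who want to reproduce the pattern in isolation, here is a minimal free-form sketch of a scatter-add into a nodal array under an OpenMP array reduction. It is not the subroutine above: the program name, the array conn (standing in for IIII), the sizes, and the contribution values are all made up for illustration. With an array in a reduction clause, each thread works on its own zero-initialized private copy of the whole array, and the copies are summed into the original at the end of the loop, so that extra work grows with the array length and the number of threads. How much this matters for the real code would have to be measured.

```fortran
! Minimal sketch with hypothetical names and sizes (not the posted subroutine).
program scatter_add_sketch
  implicit none
  integer, parameter :: ne = 10000   ! number of elements (assumed value)
  integer, parameter :: l0 = 10000   ! number of nodes (assumed value)
  integer :: conn(4, ne)             ! hypothetical connectivity, stands in for IIII
  double precision :: fx(l0), fx_serial(l0), contrib
  integer :: ii, i, k

  ! Arbitrary connectivity: element ii touches 4 consecutive nodes, wrapping at l0.
  do ii = 1, ne
    do i = 1, 4
      conn(i, ii) = mod(ii + i - 2, l0) + 1
    end do
  end do

  ! Serial reference result.
  fx_serial = 0.0d0
  do ii = 1, ne
    contrib = 0.25d0 * dble(ii)      ! exactly representable, so sums are order independent
    do i = 1, 4
      k = conn(i, ii)
      fx_serial(k) = fx_serial(k) + contrib
    end do
  end do

  ! Parallel version: fx is a whole-array reduction variable, so every thread
  ! accumulates into its own private copy, merged after the loop.
  fx = 0.0d0
  !$omp parallel do default(none) shared(conn) private(i, k, contrib) reduction(+ : fx)
  do ii = 1, ne
    contrib = 0.25d0 * dble(ii)
    do i = 1, 4
      k = conn(i, ii)
      fx(k) = fx(k) + contrib
    end do
  end do
  !$omp end parallel do

  ! The contributions are multiples of 0.25, so this difference should be zero.
  print *, 'max abs difference:', maxval(abs(fx - fx_serial))
end program scatter_add_sketch
```

The sketch can be built with OpenMP enabled (for example -fopenmp with gfortran, or /Qopenmp with Intel Fortran on Windows). Without that option the directives are treated as comments and the loop simply runs serially.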

I tried num_threads = 2, 4, 6, and 8, but the efficiency is very low, especially with 6 and 8 threads. With num_threads set to 2 and 4, the efficiency increased by about 0.7 and 0.8, respectively. With num_threads set to 6 and 8, the efficiency is even lower than that of the serial code on one core. I don't know why. Please give me some advice.