At work, I have been asked to develop a rather big Fortran program (big for me) for which execution speed is important, since it will have to run for about a whole week to be useful (once it is completed and checked).
But I have run into a big problem: it gives different results depending on the number of threads (or processors) used.
I am about a quarter of the way through the development, as far as I can estimate; I have already run several different parts of the program, and they behave correctly.
In particular, the part which I expected to be the most "hungry" for machine resources has already worked correctly on a single processor, then two, eight... and finally all forty of the machine.
I am presently working on a critical part of the program, which counts a specific kind of event happening within a recording. (This recording is one input file among many; I already posted a message here about sorting out a "mess" between them.)
The structure of the program is the following:
First, three files are read under an "OMP MASTER" directive, to acquire the (shared) data:
-The first file is named the "ana" and gives data useful for all the following work.
(This data is an array of a bit more than 2 million ASCII lines of 40 characters.)
-The second file is the "cvt" and gives auxiliary data to be used with the previous one.
(This data is a pair of real arrays with several subscripts: Ccvt1(99999, 0:5, 0:8) and Ccvt2(99999, 0:5, 0:8).)
-The third file is just the list of the numbers of the entities that will be processed later on.
Then the program constructs a matrix with all that data (this is the most "hungry" part, subroutine fabmat).
Then each column is cleaned by subroutines doublons and ValInter.
Then some events in each column are counted by the subroutine rainflow, giving the FAT scalar output.
Finally, the FAT output from rainflow must be accumulated over the 326 columns of the matrix, and this has to be done for each entity, from EID = 1 to EIDMAX (several hundred thousand entities, each giving its own matrix).
Pseudo code of the program:
C$OMP MASTER
open(1, file=file1, err=799) ! access='READONLY',
call lecana(lina, LNGANA) ! LECture (reading) of the "ana" file
call leccvt(Ccvt1, Ccvt2, FScvt2) ! LECture (reading) of the "cvt" file (opens and closes Fortran unit 2)
open(3, file='1D/elements.input', err=700) ! file containing the numbers of the entities
71 READ(3, '(A)', err=702, end=77)LINE ! leaves the read loop at end of file
77 print '(A, I6)', 'rfomp1:00 end=77 fin de elements.input-3 Ok'
close(3) ! closes the elements.input file
open(4, file='rfomp1.out', status='replace') ! output file
C$OMP END MASTER
CALL OMP_SET_NUM_THREADS(40) ! 1) ! 2) ! 3) ! 4) ! 16) ! 8) !
C$OMP PARALLEL PRIVATE(cont, Diml, EID, Sstf, TID, tmp)
C$OMP+ SHARED(Ccvt1, Ccvt2, elem, elemout, lina)
C$OMP DO SCHEDULE(DYNAMIC)
do EID=1, EIDMAX ! loop on the entities whose numbers were read in the third file
   call fabmat(cont, Diml, elem(EID), lina, LNGANA, nocc, TID) ! FABricate MATrix
   do col=1, 326 ! loop on the 326 columns of the previous matrix
      call doublons(cont, Diml, ntf, valinter) ! first part of the cleaning
      call ValInt(Arr, Diml, ntf, valinter) ! second part of the cleaning
      call rainflow(Arr, ntf, Diml, EID, FAT, p, q) ! counting of events
   enddo ! end of the loop on the columns of the matrix
enddo ! end of the loop on the entities processed
c All threads join master thread and disband
C$OMP END DO
C$OMP END PARALLEL
As it is now, the program gives different results depending on the number of threads allowed.
For reference, a single thread gives the results in the first column; the other columns come from runs with more threads.
Entity_ID 1_thread 2_T. 3_T. 2_T. 4_T. 3_T.
110005 0.700 0.700 0.602 0.532 2.912 2.446
110006 2.407 0.719 2.446 2.912 2.912 2.602
110007 2.851 2.602 2.912 2.602
110026 2.330 2.534
It looks as if there were a kind of mix between the values obtained, since many of them "appear" from time to time, more than once.
I made several other runs, and even runs with the same number of threads (e.g. 2 above) do not produce the same results. Could this be explained by the workload from other users of the machine, which changes the time slices allotted by the O.S. and hence the order in which the calculation is carried out, so that there is again a "mess" between the inputs or between intermediate results?
I really do not know which directive I should check; the skeleton of my program given above may help to make a step or two towards an explanation and a solution.
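One thing that may be worth checking first is the data-sharing status of every variable used inside the parallel loop. In the pseudo code above, Arr, ntf, valinter, FAT, p, q and nocc, for instance, appear in neither the PRIVATE nor the SHARED clause; if they are declared in the enclosing routine, OpenMP treats them as shared by default, so several threads would work in the same storage. Whether that is what happens in the real program is an assumption, not something established here. The following is only a minimal sketch (written in free-form Fortran; the module private_scratch_sketch, the routine per_entity and the array work are hypothetical names, with work standing in for a scratch array such as Arr). It uses DEFAULT(NONE), which makes the compiler reject any variable whose status has not been stated explicitly.

```fortran
module private_scratch_sketch
  implicit none
contains
  ! Hypothetical stand-in for the per-entity work: one result per entity.
  subroutine per_entity(n, res)
    integer, intent(in)  :: n
    real,    intent(out) :: res(n)
    integer :: i, k
    real    :: work(8)   ! scratch array (assumed role similar to Arr)

    ! DEFAULT(NONE): every variable used in the loop must be listed,
    ! otherwise the compiler reports an error instead of sharing it silently.
    !$omp parallel do schedule(dynamic) default(none) &
    !$omp& shared(n, res) private(i, k, work)
    do i = 1, n
       do k = 1, 8
          work(k) = real(i * k)   ! fill this thread's own scratch copy
       end do
       res(i) = sum(work)         ! read it back: 36*i for entity i
    end do
    !$omp end parallel do
  end subroutine per_entity
end module private_scratch_sketch
```

If work were moved from the PRIVATE clause to the SHARED clause, all threads would fill and sum the same array at the same time, and res could then differ from one run to the next, which resembles the kind of mix described above.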
Later on, I will face another issue when the time comes to write the results to an output file.
I may need to make the rainflow subroutine store them in an "elemout" array, flush it before "END PARALLEL", and then write the array to the output file afterwards.
Or would an atomic construct be needed, just for the "write" to the output file?
Do I need to enable nested parallelism, since the loops on the 326 columns of each matrix are enclosed within the loop on the EIDMAX entities to be processed?
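A minimal sketch of the output scheme considered above, under stated assumptions: the function fake_fat is a hypothetical stand-in for the FAT value delivered by rainflow, and the names accumulate_sketch, process_entities and write_results are invented for the illustration (free-form Fortran). In this sketch the loop on the columns is an ordinary sequential loop executed by whichever thread handles the entity, so it does not rely on nested parallelism; each thread accumulates into its own private total and stores it in a distinct element of elemout; the file is written by a single thread once the parallel loop is over.

```fortran
module accumulate_sketch
  implicit none
  integer, parameter :: ncol = 326   ! number of columns per matrix
contains
  ! Hypothetical stand-in for rainflow: one FAT value per (entity, column).
  pure function fake_fat(eid, col) result(fat)
    integer, intent(in) :: eid, col
    real :: fat
    fat = real(mod(eid + col, 5))
  end function fake_fat

  ! Accumulates FAT over the columns of each entity into elemout(eid).
  subroutine process_entities(eidmax, elemout)
    integer, intent(in)  :: eidmax
    real,    intent(out) :: elemout(eidmax)
    integer :: eid, col
    real    :: total

    !$omp parallel do schedule(dynamic) default(none) &
    !$omp& shared(eidmax, elemout) private(eid, col, total)
    do eid = 1, eidmax
       total = 0.0
       do col = 1, ncol                  ! sequential loop inside one thread
          total = total + fake_fat(eid, col)
       end do
       elemout(eid) = total              ! each entity has its own element
    end do
    !$omp end parallel do
  end subroutine process_entities

  ! Writes the results serially, to be called after the parallel loop.
  subroutine write_results(unit, eidmax, elemout)
    integer, intent(in) :: unit, eidmax
    real,    intent(in) :: elemout(eidmax)
    integer :: eid
    do eid = 1, eidmax
       write(unit, '(I8, F12.3)') eid, elemout(eid)
    end do
  end subroutine write_results
end module accumulate_sketch
```

Since the end of the parallel loop implies a barrier, elemout is complete when write_results is called, and no write statement is executed inside the parallel region.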
Thank you for helping me get a clearer view of all this work.