Problem in Nested for Loop

General OpenMP discussion

Problem in Nested for Loop

Postby shahz » Sat Jan 30, 2010 8:14 pm

Hi,
Hope u all will be fine. I need your help, i have been working on code to parallelize it since more than one week but didn't succeeded till now to get required results. I am going to attach my code in two files. I want to compute the parallel computation time using one and two threads using my core2duo system. But using 2nd thread the time increases as compared to 1st thread time, instead of decreasing and becoming half. Its my first programming using openmp thats why may be i am not able to realize my faults.
In one attached file i am going to parallelize the nested for loops but all in vain as i studied that openmp is not good in nested parallelism, u will find lot of pragmas in that region that i has tried. In second file i am using sections to parallelize the block which call the user defined function but still i am not getting the required efficiency. If both files will work then at last i will merge both files to get final full one parallelized program.Plz help me as much as u can and elaborate ur reply with examples. Plz reply soon and help me our as its very important for me. Thanx
Attachments
files.tar.gz
Related code file.
(2.8 KiB) Downloaded 719 times
shahz
 
Posts: 4
Joined: Sat Jan 30, 2010 7:22 pm

Re: Problem in Nested for Loop

Postby anv » Fri Feb 05, 2010 11:50 am

Let me try to answer your questions. In your first example you cannot expect any speedup because you use parallel construct and no worksharing constructs inside (all of them are commented). Thus (1) all threads will duplicate the same work, and (2) you will get undefined results because of lot of race conditions in the code (e.g. "sumX=0;", "sumX=sumX+...", while one thread will collect some data another thread will override it with 0).
I would suggest you starting form one combined <parallel for> construct here, because your parallel region consists of the only outer loop. Later you can experiment with nested parallel regions and collapse clause, but try to get correct results with the only simple construct first. What you need to do carefully is to specify sharing attributes for all writable variables (you can leave read-only variables shared by default). For example, all loop iteration variables need to be private (only parallel loop iteration variables get private attribute implicitly, sequential loop iteration variable is shared by default, you need to make them private explicitly). And the variables used in calculations (sumX, sumY, SUM) need to be private, so that each thread works with its own copy. As you want to write result in a file char by char, you was right to use schedule(dynamic) and ordered clauses, though you should expect some slowdown because of ordered clause. Each iteration of (X,Y) loop writes in separate location of edgeImage.data array, so I think you don't need ordered here, but to correctly write data into file you definitely need ordered (otherwise data will be intermixed randomly).
Hope this will help you to parallelize the first program.

As to your second program, I haven't got the idea of parallelization here. You read only three numbers from a file, this is so small chunk of work that you unlikely can get any speedup from parallelization. Anyway, you have fundamental synchronization problem in your code. You use shared file handle, that means different threads work simultaneously with the file. Imagine you set file position it thread 0 to char 2 and expect it to start reading there. But before the read another thread can move the position to char 18 or 22, so finally you will get undefined results. To solve this you can either make read operations ordered (which eventually kills all the parallelism), of open the file inside parallel using private file handles so that each thread has its own file position (not sure how much parallelism you can get here taking into account hardware serialization of access to the file).

Best regards,
Andrey
anv
 
Posts: 31
Joined: Wed Dec 12, 2007 9:36 am

Re: Problem in Nested for Loop

Postby shahz » Tue Mar 02, 2010 12:50 am

Thanx for guiding me.Now i will try to implement as u have suggested and will come back to you if i will find some problem.
Thanx again for ur this kind gesture.
shahz
 
Posts: 4
Joined: Sat Jan 30, 2010 7:22 pm

Re: Problem in Nested for Loop(again problem)

Postby shahz » Mon Mar 08, 2010 4:21 pm

Hi,
I have tried what you have told, Andrey. I am attaching the updated file. Again instead of decreasing time in the second thread time is increasing as compared to 1st thread. I am using the following gray-scale image : "http://upload.wikimedia.org/wikipedia/commons/d/dc/Webcam_grayscale.jpg"(just save it as Bmp) in my program. Kindly help me out from this mess its a request.
I shall be very thankful to you.
Attachments
edgesob1.cc.tar.gz
Program
(2.29 KiB) Downloaded 1007 times
shahz
 
Posts: 4
Joined: Sat Jan 30, 2010 7:22 pm

Re: Problem in Nested for Loop

Postby anv » Wed Mar 10, 2010 6:06 am

Sorry for long silence. Let me try to help you a bit more.

1. Even if your program would work correct, you can hardly expect any speedup from parallelization because (1) ordered clause and (2) write to file in parallel loop. Ordered only make sense if for example two threads make a lot of work in parallel on each iteration and only shortly make ordered things at the end. In your case the work on each iteration is too small to pay for ordered. Similarly the writing into file takes longer than the rest of calculations, so it impacts the parallelism significantly.

2. I see the following errors in attached program:
- length of the vector is (Nx * Ny * 3) as each pixel takes 3 bytes in .bmp file. But you only work on one third of the data in a loops (X, Y), because you iterate (from 0 to Nx) by (from 0 to Ny) and get one char on each iteration. as a result I got corrupted bmp file after running your program (it should be 3 times longer due to header data).
- you put array assignment under ordered clause but the write to file is not under the ordered. You would better do the opposite.

3. I tried to re-write a bit your program, got image from the link you pointed, converted it into bitmap and run the program. Results were 0.5 sec on 1 thread and 0.4 sec on 2 threads. Then I moved final file writing out of the timing region and got 0.2 sec on 1 thread and 0.1 sec on 2 threads - perfect scaling, isn't it. I attached the result program for you to try.

Major changes I made:
- moved final file writing out of parallel region (and out of timing region finally). Thus no more ordered clause is needed, and the schedule can be any now, I tried both static and dynamic with little difference.
- combined reading and writing so that (1) read and write input header in one operation each, (2) read input data 3 bytes at a time (save only the first byte, the two other should be the same), (3) write final array in one operation (my timing shows that it takes 0.3 sec on my laptop).
- made input array 3 times shorter, so that only one char per pixel saved (as initial image is black-and white the two other chars are identical to first char for each pixel). Correspondingly I've written each value 3 times in the result array (I guess these are (red,green,blue) triplets, they are the same for black-and-white images). Making result array 3x longer than input array greatly simplifies final writing as one operation (should be much faster).

I am not an expert in binary image formats, so I could mistake in the program. But I couldn't found what is color table, because I saw in my converted bitmap that after 54 bytes of header the pixel data started immediately. Thus I skipped the color table, and finally was able to open the result image. It has the same header as initial image and re-calculated pixel data.

Regards,
Andrey
Attachments
edgesob1-new.zip
(2.59 KiB) Downloaded 1042 times
anv
 
Posts: 31
Joined: Wed Dec 12, 2007 9:36 am

Re: Problem in Nested for Loop

Postby shahz » Thu Mar 11, 2010 8:55 pm

Hi anv,
Thanks alot my friend atlast i have gotten what i wanted and its just because of ur help. Ur attached program and ur comments help me alot. When i run the program attached by i got segmentation error and little more errors. After making them correct i have done and its because of u.
After a long silence i got reply and its solved my problem. Now i am thinking that u should teach me also,u can be my teacher :) .
Please can u give me ur email id so that i can contact u directly?
Thanks again,bye tc
This forum is really very helpfull.

Regards!!!
shahz
shahz
 
Posts: 4
Joined: Sat Jan 30, 2010 7:22 pm

Re: Problem in Nested for Loop

Postby msohail » Thu Mar 06, 2014 9:24 am

Hi
I am trying to parallised a long long code which has been divided in subroutines. Here I attached a subroutine in which I try to collapse the nested loop but did not use any private directive.
At the end the code should be able copy the global file that other subroutines can use .

SUBROUTINE CALCV

!$omp parallel do collapse (3)

DO K=2,NKM1
DO J=2,NJM1
DO I=2,NIM1

IF (IBLANK(I,J,K).EQ.0) CYCLE ! block

VISE=VIS(I ,J,K)*(1.-FX(I ,J,K))+VIS(I+1,J,K)*FX(I ,J,K)
VISW=VIS(I-1,J,K)*(1.-FX(I-1,J,K))+VIS(I ,J,K)*FX(I-1,J,K)
VISN=VIS(I,J ,K)*(1.-FY(I,J ,K))+VIS(I,J+1,K)*FY(I,J ,K)
VISS=VIS(I,J-1,K)*(1.-FY(I,J-1,K))+VIS(I,J ,K)*FY(I,J-1,K)
VIST=VIS(I,J,K )*(1.-FZ(I,J,K ))+VIS(I,J,K+1)*FZ(I,J,K )
VISB=VIS(I,J,K-1)*(1.-FZ(I,J,K-1))+VIS(I,J,K )*FZ(I,J,K-1)

VOLE=VOL(I ,J,K)*(1.-FX(I ,J,K))+VOL(I+1,J,K)*FX(I ,J,K)
VOLW=VOL(I-1,J,K)*(1.-FX(I-1,J,K))+VOL(I ,J,K)*FX(I-1,J,K)
VOLN=VOL(I,J ,K)*(1.-FY(I,J ,K))+VOL(I,J+1,K)*FY(I,J ,K)
VOLS=VOL(I,J-1,K)*(1.-FY(I,J-1,K))+VOL(I,J ,K)*FY(I,J-1,K)
VOLT=VOL(I,J,K )*(1.-FZ(I,J,K ))+VOL(I,J,K+1)*FZ(I,J,K )
VOLB=VOL(I,J,K-1)*(1.-FZ(I,J,K-1))+VOL(I,J,K )*FZ(I,J,K-1)
!
DENE=DEN(I ,J,K)*(1.-FX(I ,J,K))+DEN(I+1,J,K)*FX(I ,J,K)
DENW=DEN(I-1,J,K)*(1.-FX(I-1,J,K))+DEN(I ,J,K)*FX(I-1,J,K)
DENN=DEN(I,J ,K)*(1.-FY(I,J ,K))+DEN(I,J+1,K)*FY(I,J ,K)
DENS=DEN(I,J-1,K)*(1.-FY(I,J-1,K))+DEN(I,J ,K)*FY(I,J-1,K)
DENT=DEN(I,J,K )*(1.-FZ(I,J,K ))+DEN(I,J,K+1)*FZ(I,J,K )
DENB=DEN(I,J,K-1)*(1.-FZ(I,J,K-1))+DEN(I,J,K )*FZ(I,J,K-1)

Q11E
&=(1.-FX(I ,J,K))*(XX(I ,J,K)**2+XY(I ,J,K)**2+XZ(I ,J,K)**2)
& +FX(I ,J,K) *(XX(I+1,J,K)**2+XY(I+1,J,K)**2+XZ(I+1,J,K)**2)
Q11W
&=(1.-FX(I-1,J,K))*(XX(I-1,J,K)**2+XY(I-1,J,K)**2+XZ(I-1,J,K)**2)
& +FX(I-1,J,K) *(XX(I ,J,K)**2+XY(I ,J,K)**2+XZ(I ,J,K)**2)

Q22N
&=(1.-FY(I,J ,K))*(YX(I,J ,K)**2+YY(I,J ,K)**2+YZ(I,J ,K)**2)
& +FY(I,J ,K)* (YX(I,J+1,K)**2+YY(I,J+1,K)**2+YZ(I,J+1,K)**2)
Q22S
&=(1.-FY(I,J-1,K))*(YX(I,J-1,K)**2+YY(I,J-1,K)**2+YZ(I,J-1,K)**2)
& +FY(I,J-1,K) *(YX(I,J ,K)**2+YY(I,J ,K)**2+YZ(I,J ,K)**2)

Q33T
&=(1.-FZ(I,J,K ))*(ZX(I,J,K )**2+ZY(I,J,K )**2+ZZ(I,J,K )**2)
& +FZ(I,J,K ) *(ZX(I,J,K+1)**2+ZY(I,J,K+1)**2+ZZ(I,J,K+1)**2)
Q33B
&=(1.-FZ(I,J,K-1))*(ZX(I,J,K-1)**2+ZY(I,J,K-1)**2+ZZ(I,J,K-1)**2)
& +FZ(I,J,K-1) *(ZX(I,J,K )**2+ZY(I,J,K )**2+ZZ(I,J,K )**2)
!
DV1(I,J,K)=XY(I,J,K)
DV2(I,J,K)=YY(I,J,K)
DV3(I,J,K)=ZY(I,J,K)

UTNE=0.125*(
& U(I ,J ,K )+U(I ,J+1,K )+U(I+1,J+1,K )+U(I+1,J ,K )
&+U(I ,J ,K+1)+U(I ,J+1,K+1)+U(I+1,J+1,K+1)+U(I+1,J ,K+1))

UTNW=0.125*(
& U(I-1,J ,K )+U(I-1,J+1,K )+U(I ,J+1,K )+U(I ,J ,K )
&+U(I-1,J ,K+1)+U(I-1,J+1,K+1)+U(I ,J+1,K+1)+U(I ,J ,K+1))

UBNE=0.125*(
& U(I ,J ,K-1)+U(I ,J+1,K-1)+U(I+1,J+1,K-1)+U(I+1,J ,K-1)
&+U(I ,J ,K )+U(I ,J+1,K )+U(I+1,J+1,K )+U(I+1,J ,K ))

UBNW=0.125*(
& U(I-1,J ,K-1)+U(I-1,J+1,K-1)+U(I ,J+1,K-1)+U(I ,J ,K-1)
&+U(I-1,J ,K )+U(I-1,J+1,K )+U(I ,J+1,K )+U(I ,J ,K ))

UTSE=0.125*(
& U(I ,J-1,K )+U(I ,J ,K )+U(I+1,J ,K )+U(I+1,J-1,K )
&+U(I ,J-1,K+1)+U(I ,J ,K+1)+U(I+1,J ,K+1)+U(I+1,J-1,K+1))

UTSW=0.125*(
& U(I-1,J-1,K )+U(I-1,J ,K )+U(I ,J ,K )+U(I ,J-1,K )
&+U(I-1,J-1,K+1)+U(I-1,J ,K+1)+U(I ,J ,K+1)+U(I ,J-1,K+1))

UBSE=0.125*(
& U(I ,J-1,K-1)+U(I ,J ,K-1)+U(I+1,J ,K-1)+U(I+1,J-1,K-1)
&+U(I ,J-1,K )+U(I ,J ,K )+U(I+1,J ,K )+U(I+1,J-1,K ))

UBSW=0.125*(
& U(I-1,J-1,K-1)+U(I-1,J ,K-1)+U(I ,J ,K-1)+U(I ,J-1,K-1)
&+U(I-1,J-1,K )+U(I-1,J ,K )+U(I ,J ,K )+U(I ,J-1,K ))

WTNE=0.125*(
& W(I ,J ,K )+W(I ,J+1,K )+W(I+1,J+1,K )+W(I+1,J ,K )
&+W(I ,J ,K+1)+W(I ,J+1,K+1)+W(I+1,J+1,K+1)+W(I+1,J ,K+1))

WTNW=0.125*(
& W(I-1,J ,K )+W(I-1,J+1,K )+W(I ,J+1,K )+W(I ,J ,K )
&+W(I-1,J ,K+1)+W(I-1,J+1,K+1)+W(I ,J+1,K+1)+W(I ,J ,K+1))

WBNE=0.125*(
& W(I ,J ,K-1)+W(I ,J+1,K-1)+W(I+1,J+1,K-1)+W(I+1,J ,K-1)
&+W(I ,J ,K )+W(I ,J+1,K )+W(I+1,J+1,K )+W(I+1,J ,K ))

WBNW=0.125*(
& W(I-1,J ,K-1)+W(I-1,J+1,K-1)+W(I ,J+1,K-1)+W(I ,J ,K-1)
&+W(I-1,J ,K )+W(I-1,J+1,K )+W(I ,J+1,K )+W(I ,J ,K ))

WTSE=0.125*(
& W(I ,J-1,K )+W(I ,J ,K )+W(I+1,J ,K )+W(I+1,J-1,K )
&+W(I ,J-1,K+1)+W(I ,J ,K+1)+W(I+1,J ,K+1)+W(I+1,J-1,K+1))

WTSW=0.125*(
& W(I-1,J-1,K )+W(I-1,J ,K )+W(I ,J ,K )+W(I ,J-1,K )
&+W(I-1,J-1,K+1)+W(I-1,J ,K+1)+W(I ,J ,K+1)+W(I ,J-1,K+1))

WBSE=0.125*(
& W(I ,J-1,K-1)+W(I ,J ,K-1)+W(I+1,J ,K-1)+W(I+1,J-1,K-1)
&+W(I ,J-1,K )+W(I ,J ,K )+W(I+1,J ,K )+W(I+1,J-1,K ))

WBSW=0.125*(
& W(I-1,J-1,K-1)+W(I-1,J ,K-1)+W(I ,J ,K-1)+W(I ,J-1,K-1)
&+W(I-1,J-1,K )+W(I-1,J ,K )+W(I ,J ,K )+W(I ,J-1,K ))

XXYYE=(1.-FX(I ,J,K))*XX(I ,J,K)*YY(I ,J,K)
& +FX(I ,J,K) *XX(I+1,J,K)*YY(I+1,J,K)
XXYYW=(1.-FX(I-1,J,K))*XX(I-1,J,K)*YY(I-1,J,K)
& +FX(I-1,J,K) *XX(I ,J,K)*YY(I ,J,K)

SC2=VISE/VOLE*( (UTNE+UBNE-UTSE-UBSE)*XXYYE )
& -VISW/VOLW*( (UTNW+UBNW-UTSW-UBSW)*XXYYW )
SC2=0.5*SC2

YYYYN=(1.-FY(I,J ,K))*YY(I,J ,K)*YY(I,J ,K)
& +FY(I,J ,K) *YY(I,J+1,K)*YY(I,J+1,K)
YYYYS=(1.-FY(I,J-1,K))*YY(I,J-1,K)*YY(I,J-1,K)
& +FY(I,J-1,K) *YY(I,J ,K)*YY(I,J ,K)

SC5=VISN/VOLN*( (V(I,J+1,K)-V(I,J,K))*YYYYN )
& -VISS/VOLS*( (V(I,J,K)-V(I,J-1,K))*YYYYS )

ZZYYT=(1.-FZ(I,J,K ))*ZZ(I,J,K )*YY(I,J,K )
& +FZ(I,J,K ) *ZZ(I,J,K+1)*YY(I,J,K+1)
ZZYYB=(1.-FZ(I,J,K-1))*ZZ(I,J,K-1)*YY(I,J,K-1)
& +FZ(I,J,K-1) *ZZ(I,J,K )*YY(I,J,K )

SC8=VIST/VOLT*( (WTNE+WTNW-WTSE-WTSW)*ZZYYT )
& -VISB/VOLB*( (WBNE+WBNW-WBSE-WBSW)*ZZYYB )
SC8=0.5*SC8

SU(I,J,K)=SC2+SC5+SC8 ! stress

PRESE=P(I ,J,K)*(1.-FX(I ,J,K))+P(I+1,J,K)*FX(I ,J,K)
PRESW=P(I-1,J,K)*(1.-FX(I-1,J,K))+P(I ,J,K)*FX(I-1,J,K)
PRESN=P(I,J ,K)*(1.-FY(I,J ,K))+P(I,J+1,K)*FY(I,J ,K)
PRESS=P(I,J-1,K)*(1.-FY(I,J-1,K))+P(I,J ,K)*FY(I,J-1,K)
PREST=P(I,J,K )*(1.-FZ(I,J,K ))+P(I,J,K+1)*FZ(I,J,K )
PRESB=P(I,J,K-1)*(1.-FZ(I,J,K-1))+P(I,J,K )*FZ(I,J,K-1)

SP(I,J,K)=0.
SU(I,J,K)=SU(I,J,K) ! stress
& +DV1(I,J,K)*(PRESW-PRESE) ! pressure
& +DV2(I,J,K)*(PRESS-PRESN)
& +DV3(I,J,K)*(PRESB-PREST)
& -(9.8*9)/(5.6*5.6)*(DEN(I,J,K)-1) !MAHWISH GRAVITY FORCES
& +((AF+BT)*DENO(I,J,K)*VO(I,J,K)-BT*VM(I,J,K))*VOL(I,J,K)/DELT ! MAHWISH DENSITY transient

END DO
END DO
END DO

!$OMP END PARALLEL DO

!$omp parallel do collapse(3)

DO I=1,NI
DO J=1,NJ
DO K=1,NK
Q(I,J,K)=V(I,J,K) ( update the variable )
END DO
END DO
END DO
!$OMP END PARALLEL DO

c CALL UMIST(Q)

CALL MODV (call boundary condition)

RESORV=0.
!$omp parallel do collapse (3)


DO K=2,NKM1
DO J=2,NJM1
DO I=2,NIM1

IF (IBLANK(I,J,K).EQ.0) CYCLE ! Block ( This is due to the presence of block )

AC=AE(I,J,K)+AW(I,J,K)+AN(I,J,K)+AS(I,J,K)+AT(I,J,K)+AB(I,J,K)
AP(I,J,K)=AC-SP(I,J,K)
& +AF*DEN(I,J,K)*VOL(I,J,K)/DELT ! MAHWISH DENSITY transient

RESOR=AE(I,J,K)*V(I+1,J,K)+AW(I,J,K)*V(I-1,J,K)
& +AN(I,J,K)*V(I,J+1,K)+AS(I,J,K)*V(I,J-1,K)
& +AT(I,J,K)*V(I,J,K+1)+AB(I,J,K)*V(I,J,K-1)+SU(I,J,K)
& -AP(I,J,K)*V(I,J,K)

IF(SP(I,J,K).EQ.-GREAT) RESOR=0.
RESORV=RESORV+ABS(RESOR)

C UNDERRELAX

AP(I,J,K)=AP(I,J,K)/URFV
SU(I,J,K)=SU(I,J,K)+(1.-URFV)*AP(I,J,K)*V(I,J,K)

C For SIMPLE use the following code: (and, comment out code for SIMPLEC below)

c DV1(I,J,K)=DV1(I,J,K)/AP(I,J,K)
c DV2(I,J,K)=DV2(I,J,K)/AP(I,J,K)
c DV3(I,J,K)=DV3(I,J,K)/AP(I,J,K)

C For SIMPLEC use the following code: (and, comment out code for SIMPLE above)

DV1(I,J,K)=DV1(I,J,K)/(AP(I,J,K)-AC)
DV2(I,J,K)=DV2(I,J,K)/(AP(I,J,K)-AC)
DV3(I,J,K)=DV3(I,J,K)/(AP(I,J,K)-AC)

END DO
END DO
END DO

!$OMP END PARALLEL DO

CALL SIP3D(V,RESORV,FRACV,NSWPV,NVMIN)

RETURN
END
msohail
 
Posts: 2
Joined: Thu Mar 06, 2014 9:02 am

Re: Problem in Nested for Loop

Postby MarkB » Thu Mar 06, 2014 10:11 am

I'm not sure exactly what your question is, but to get a correct code, you need to declare all the scalar variables accessed inside the loop in a PRIVATE clause. I would strongly recommend using the DEFAULT(NONE) clause and declaring all variables as either shared or private to avoid accidental omissions.
MarkB
 
Posts: 450
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Problem in Nested for Loop

Postby msohail » Thu Mar 06, 2014 10:29 am

Thanks alot . The collapse looks ok to yo u
msohail
 
Posts: 2
Joined: Thu Mar 06, 2014 9:02 am

Re: Problem in Nested for Loop

Postby MarkB » Fri Mar 07, 2014 4:39 am

The collapse clauses look fine to me. In each case the three nested loops all have independent iterations.
MarkB
 
Posts: 450
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Next

Return to Using OpenMP

Who is online

Users browsing this forum: No registered users and 6 guests