General OpenMP discussion

Hello all parallel coders! I need your insights...

I have parallelized a big loop and would like to check whether the way I lay out the data is optimal:

Code:
integer, parameter :: num_elem=40, num_domains=10000
integer :: IPVT(num_elem,num_domains), info, i, counter, chunk
double precision :: A(num_elem,num_elem,num_domains), Bx(num_elem,num_domains), By(num_elem,num_domains), &
                    Tx(num_elem,num_domains), Ty(num_elem,num_domains), TB(4,num_domains), Cx(num_elem), Cy(num_elem)
double precision :: f(num_elem,num_domains), x(num_elem,num_domains)
logical :: converged

call initialize(Cy,Cx,counter,chunk,x,f,converged)
do while(.not. converged)
   ! x and f are referenced in the loop, so default(none) requires them to be listed
   !$omp parallel do default(none) &
   !$omp private(i, info) &
   !$omp shared(num_domains,A,IPVT,Ty,Tx,Bx,By,Cy,Cx,TB,x,f) &
   !$omp reduction(+:counter) &
   !$omp schedule(dynamic,chunk)
   do i=1,num_domains
      call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
      call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
      counter=counter+1
   enddo

   call check_convergence(x,f,converged)
enddo

- In this loop, each thread processes many domains.
- Inside calculate_domain(): A, IPVT, Ty, Tx, TB, f, Bx and By are intent(out), while Cy, Cx and x are intent(in).
- Inside solve_domain(): A, IPVT, Ty, Tx, TB, f, Bx and By are intent(in), while x is intent(inout).
- The schedule is dynamic because the work inside each subdomain is not the same.

I get some speedup, but not as much as expected. This part of the code accounts for 80% of the total run time, so I would expect a maximum theoretical speedup of 5x. I get 2.5-3x. (I vary the number of threads from 2 to 48 on a 48-core machine; the maximum speedup is at 20-22 cores.)
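
For reference, the 5x ceiling is what Amdahl's law gives if the 80% fraction scaled perfectly and the remaining 20% stayed serial (an idealisation; here f is the parallel fraction and p the number of threads):

```latex
S(p) = \frac{1}{(1-f) + f/p}, \qquad f = 0.8
\quad\Longrightarrow\quad
\lim_{p\to\infty} S(p) = \frac{1}{0.2} = 5,
\qquad
S(20) = \frac{1}{0.2 + 0.04} \approx 4.17,
\qquad
S(48) = \frac{1}{0.2 + 0.8/48} \approx 4.62 .
```

Under that model, about 4.2x would already be reachable at 20 threads, so a measured 2.5-3x suggests overhead inside or around the parallel loop rather than the serial 20% alone.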

Thoughts I have:
1) Use threadprivate data. This would multiply the memory used by the number of threads.
2) Use allocatable arrays, so that each domain is sized as needed and not at the size of the biggest domain.
3) Use a static schedule by manually dividing the domains.
4) Put the data of each domain in a type structure and then have an array of structures. That is:
Code:
type domain_data
   double precision :: A(num_elem,num_elem), Bx(num_elem)
   ...
end type domain_data
type(domain_data), dimension(num_domains) :: all_domains_data

5) Any combination of the above.
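
As a rough illustration of how options 2 and 4 could be combined, here is a minimal sketch of an array of derived types with allocatable components. The per-domain sizes in `sizes` and the tiny `num_domains` are made up for the example; the thread only states the maximum size (40) and does not show how the real sizes are obtained.

```fortran
program aos_sketch
  implicit none
  integer, parameter :: num_domains = 4      ! tiny, for illustration only

  type domain_data
    double precision, allocatable :: A(:,:), Bx(:), By(:)
    integer, allocatable :: IPVT(:)
  end type domain_data

  type(domain_data) :: all_domains_data(num_domains)
  integer :: sizes(num_domains)              ! hypothetical per-domain sizes
  integer :: i, n

  sizes = [40, 12, 25, 7]

  ! Each iteration allocates and zeroes only its own element, so no two
  ! threads touch the same domain.
  !$omp parallel do default(none) private(i, n) shared(all_domains_data, sizes) schedule(static)
  do i = 1, num_domains
    n = sizes(i)
    allocate(all_domains_data(i)%A(n,n), all_domains_data(i)%Bx(n), &
             all_domains_data(i)%By(n), all_domains_data(i)%IPVT(n))
    all_domains_data(i)%A = 0.0d0
    all_domains_data(i)%Bx = 0.0d0
    all_domains_data(i)%By = 0.0d0
    all_domains_data(i)%IPVT = 0
  end do
  !$omp end parallel do

  do i = 1, num_domains
    print '(a,i0,a,i0,a,i0)', 'domain ', i, ': A is ', &
          size(all_domains_data(i)%A, 1), ' x ', size(all_domains_data(i)%A, 2)
  end do
end program aos_sketch
```

Whether this layout actually helps would have to be measured; it keeps each domain's data together and avoids padding every domain to the largest size, at the cost of one allocation per component.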

Can you give me your thoughts? Someone more experienced maybe?

Petros
p3tris

Posts: 9
Joined: Fri May 06, 2011 5:57 pm

### Re: Sharing memory among threads

p3tris wrote:-Inside calculate_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(out) while Cy,Cx and x are intent(in).
-Inside solve_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(in) while x is intent(inout).

Are A,IPVT,Ty,Tx,TB,f,Bx and By required after the parallel loop? If not, you could make them private, and remove their num_domains dimension, which would reduce your memory requirements.

You may be suffering from some NUMA effects on a large system: if all your data is initialised on the master thread, it might all be allocated on the same node.
Making data private will help, but you may want to consider parallelising the initialisation of any shared arrays as well to try and avoid this.
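
A minimal sketch of parallel initialisation, assuming the operating system places each memory page on the NUMA node of the thread that first writes to it (the usual first-touch default on Linux, but an assumption here). The array names follow the original post; the zero fill is only a stand-in for the real initialisation.

```fortran
program first_touch_sketch
  implicit none
  integer, parameter :: num_elem = 40, num_domains = 10000
  double precision, allocatable :: A(:,:,:), x(:,:), f(:,:)
  integer :: i

  ! Allocation alone normally does not place the pages; the first write does.
  allocate(A(num_elem,num_elem,num_domains), x(num_elem,num_domains), f(num_elem,num_domains))

  ! Each thread writes the domains it is assigned, so under a first-touch
  ! policy those pages end up close to that thread.
  !$omp parallel do default(none) private(i) shared(A, x, f) schedule(static)
  do i = 1, num_domains
    A(:,:,i) = 0.0d0
    x(:,i) = 0.0d0
    f(:,i) = 0.0d0
  end do
  !$omp end parallel do

  print *, 'initialised ', num_domains, ' domains; sum(x) = ', sum(x)
end program first_touch_sketch
```

The placement only pays off if the compute loop later gives the same domains to the same threads. With schedule(dynamic) that mapping changes from one region to the next, so this would probably need to be paired with a static schedule, and with threads pinned to cores.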
MarkB

Posts: 670
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

### Re: Sharing memory among threads

MarkB wrote:
p3tris wrote:-Inside calculate_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(out) while Cy,Cx and x are intent(in).
-Inside solve_domain(): A,IPVT,Ty,Tx,TB,f,Bx and By are intent(in) while x is intent(inout).

Are A,IPVT,Ty,Tx,TB,f,Bx and By required after the parallel loop? If not, you could make them private, and remove their num_domains dimension, which would reduce your memory requirements.

You may be suffering from some NUMA effects on a large system: if all your data is initialised on the master thread, it might all be allocated on the same node.
Making data private will help, but you may want to consider parallelising the initialisation of any shared arrays as well to try and avoid this.

First of all, thanks for answering. The actual code is this:
Code:
integer, parameter :: num_elem=40, num_domains=10000
integer :: IPVT(num_elem,num_domains), info, i, counter, chunk
double precision :: A(num_elem,num_elem,num_domains), Bx(num_elem,num_domains), &
                    By(num_elem,num_domains), Tx(num_elem,num_domains), Ty(num_elem,num_domains), &
                    TB(4,num_domains), Cx(num_elem), Cy(num_elem)
double precision :: f(num_elem,num_domains), x(num_elem,num_domains)
logical :: converged

call initialize(Cy,Cx,counter,chunk,x,f,converged)
do while(.not. converged)
   ! x and f are referenced in the loop, so default(none) requires them to be listed
   !$omp parallel do default(none) &
   !$omp private(i, info) &
   !$omp shared(num_domains,A,IPVT,Ty,Tx,Bx,By,Cy,Cx,TB,x,f) &
   !$omp reduction(+:counter) &
   !$omp schedule(dynamic,chunk)
   do i=1,num_domains
      call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
      call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
      counter=counter+1
   enddo

   call check_convergence(x,f,converged)
enddo

As you can see, all the data are reused many times, always inside the parallel part, except x and f, which are also accessed by the master thread. Everything except Cy and Cx is initialized inside the parallel region. The outer do while() loop can reach several thousand iterations!
p3tris

### Re: Sharing memory among threads

p3tris wrote: As you can see, all the data are reused many times, always inside the parallel part, except x and f, which are also accessed by the master thread. Everything except Cy and Cx is initialized inside the parallel region. The outer do while() loop can reach several thousand iterations!

OK, I think my suggestion is still valid if A,IPVT,Ty,Tx,TB,f,Bx and By are only used to communicate data between the calls to calculate_domain and solve_domain within one iteration of the i loop. It should be fine to declare these as private arrays without the num_domains dimension: they will be allocated on the stack of each thread, so the overhead for creating/destroying them for each parallel region should be negligible.
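
A minimal sketch of this suggestion, assuming the work arrays really are scratch space within one iteration of the i loop. The loop body is a made-up stand-in for calculate_domain and solve_domain, and only A and Bx are shown.

```fortran
program private_work_sketch
  implicit none
  integer, parameter :: num_elem = 40, num_domains = 1000
  ! Work arrays without the num_domains dimension: one copy per thread.
  double precision :: A(num_elem,num_elem), Bx(num_elem)
  double precision, allocatable :: x(:,:)
  integer :: i

  allocate(x(num_elem,num_domains))
  x = 1.0d0

  !$omp parallel do default(none) private(i, A, Bx) shared(x) schedule(dynamic)
  do i = 1, num_domains
    ! Stand-in for calculate_domain: fill the thread's private scratch arrays.
    A = 2.0d0
    Bx = sum(A(:,1)) * x(:,i)
    ! Stand-in for solve_domain: update the shared solution for domain i.
    x(:,i) = Bx / dble(num_elem)
  end do
  !$omp end parallel do

  print *, 'max x = ', maxval(x)
end program private_work_sketch
```

Private copies are undefined on entry to the region and discarded at its end, so this only fits data that never has to survive from one parallel region to the next.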
MarkB

### Re: Sharing memory among threads

The problem is, I need them from one parallel call to the other. So they cannot be private, because they would be lost at the end of each parallel region. That is why I was thinking of threadprivate, which persists across parallel regions.
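
A minimal sketch of threadprivate persistence. The module `domain_work` and the counter `times_used` are hypothetical names for illustration, and the assignment to A stands in for an expensive factorisation that is done once and then reused.

```fortran
module domain_work
  implicit none
  integer, parameter :: num_elem = 40
  double precision, save :: A(num_elem,num_elem)
  integer, save :: times_used = 0
  ! One persistent copy of A and times_used per thread.
  !$omp threadprivate(A, times_used)
end module domain_work

program threadprivate_sketch
  use omp_lib
  use domain_work
  implicit none
  integer :: region

  ! Persistence between regions is only guaranteed when dynamic thread
  ! adjustment is off and every region uses the same number of threads.
  call omp_set_dynamic(.false.)

  do region = 1, 3
    !$omp parallel default(none)
    if (times_used == 0) A = dble(omp_get_thread_num())  ! computed once per thread
    times_used = times_used + 1                          ! reused afterwards
    !$omp end parallel
  end do

  !$omp parallel default(none)
  !$omp critical
  print '(a,i0,a,i0,a,f6.1)', 'thread ', omp_get_thread_num(), &
        ' used its copy ', times_used, ' times; A(1,1) = ', A(1,1)
  !$omp end critical
  !$omp end parallel
end program threadprivate_sketch
```

One caveat: the copies belong to threads, not to domains. With schedule(dynamic), domain i may be handled by a different thread in each pass, so a threadprivate array could only hold per-domain state if the domain-to-thread assignment were fixed, for example with a static schedule.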
p3tris

### Re: Sharing memory among threads

p3tris wrote:The problem is, I need them from one parallel call to the other.

OK, you obviously understand the code better than me! I must be missing something as I don't see how that works with your intents.....
MarkB

### Re: Sharing memory among threads

Oh, sorry, I should have made it clearer: x and f are needed in the next iteration as the previous values.
The reason I need A and B to be persistent is that I do a so-called dishonest update: sometimes they are not recalculated and the previous values are reused. This saves a lot of work (calculation, algebrization, LU factorization).

Hope I explained a bit better!
p3tris

### Re: Sharing memory among threads

Aha, I see!

It might be useful to time both the parallel loop and the convergence check on different numbers of threads. If the convergence check is sequential, then the bottleneck might be the data movement of x and f to/from a single core's cache on each iteration.
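
One way to take these timings is with omp_get_wtime, accumulating the two phases separately over the outer iterations. This is a sketch only: the loop body and the maxval test are made-up stand-ins for calculate_domain, solve_domain and check_convergence, and the chunk size of 16 is arbitrary.

```fortran
program timing_sketch
  use omp_lib
  implicit none
  integer, parameter :: num_elem = 40, num_domains = 10000
  double precision, allocatable :: x(:,:), f(:,:)
  double precision :: t0, t_loop, t_check
  integer :: i, iter
  logical :: converged

  allocate(x(num_elem,num_domains), f(num_elem,num_domains))
  x = 1.0d0
  f = 0.0d0
  t_loop = 0.0d0
  t_check = 0.0d0

  do iter = 1, 100
    t0 = omp_get_wtime()
    !$omp parallel do default(none) private(i) shared(x, f) schedule(dynamic, 16)
    do i = 1, num_domains
      f(:,i) = 0.5d0 * x(:,i)       ! stand-in for the per-domain work
      x(:,i) = x(:,i) - f(:,i)
    end do
    !$omp end parallel do
    t_loop = t_loop + (omp_get_wtime() - t0)

    t0 = omp_get_wtime()
    converged = maxval(abs(f)) < 1.0d-12   ! stand-in for the sequential check
    t_check = t_check + (omp_get_wtime() - t0)
    if (converged) exit
  end do

  print '(a,i0)', 'threads:            ', omp_get_max_threads()
  print '(a,f10.6,a)', 'parallel loop:      ', t_loop, ' s'
  print '(a,f10.6,a)', 'convergence check:  ', t_check, ' s'
end program timing_sketch
```

Running this with different OMP_NUM_THREADS values shows whether the time in the sequential check stays flat or grows as threads are added, which is the effect described above.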
MarkB

### Re: Sharing memory among threads

MarkB wrote:Aha, I see!

It might be useful to time both the parallel loop and the convergence check on different numbers of threads. If the convergence check is sequential, then the bottleneck might be the data movement of x and f to/from a single core's cache on each iteration.

Yep, the check is sequential, so I need to gather all of x and f and then call the check from the master thread.

With this in mind, would making the data threadprivate make a difference?
p3tris

### Re: Sharing memory among threads

p3tris wrote: Yep, the check is sequential, so I need to gather all of x and f and then call the check from the master thread.

With this in mind, would making the data threadprivate make a difference?

Hmmm, I don't quite see how that would work, unless you can parallelise the convergence check!
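
If the check can be written as a per-domain quantity combined across domains, it could be parallelised with a reduction. The sketch below assumes the criterion is "largest |f| below a tolerance"; that criterion and the value of `tol` are assumptions, since the body of check_convergence is not shown in the thread.

```fortran
program parallel_check_sketch
  implicit none
  integer, parameter :: num_elem = 40, num_domains = 10000
  double precision, parameter :: tol = 1.0d-8   ! hypothetical tolerance
  double precision, allocatable :: f(:,:)
  double precision :: worst
  integer :: i
  logical :: converged

  allocate(f(num_elem,num_domains))
  f = 1.0d-10
  f(3, 1234) = 1.0d-3          ! pretend one domain has not converged yet

  worst = 0.0d0
  ! Each thread scans its own domains; the max reduction combines the results.
  !$omp parallel do default(none) private(i) shared(f) reduction(max:worst) schedule(static)
  do i = 1, num_domains
    worst = max(worst, maxval(abs(f(:,i))))
  end do
  !$omp end parallel do

  converged = worst < tol
  print *, 'worst residual = ', worst, ' converged = ', converged
end program parallel_check_sketch
```

The same reduction could in principle be folded into the main domain loop, so that each thread checks a domain right after solving it and x and f never have to travel to a single core.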
MarkB
