Sharing memory among threads

General OpenMP discussion

Re: Sharing memory among threads

Post by MarkB » Tue May 22, 2012 1:38 am

p3tris wrote: 3) Use static schedule by manually dividing the domains


This could certainly help improve the data affinity/cache reuse: worth a try!
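A minimal sketch of what manually dividing the domains could look like, assuming contiguous blocks of domains per thread. The module and routine names (domain_partition, block_range, sweep_domains) and the owner array are made up for illustration; the loop body is only a stand-in for the real calculate_domain/solve_domain calls.

```fortran
module domain_partition
  implicit none
contains

  ! Contiguous block [lo, hi] of domains owned by thread tid (0-based)
  ! out of nthreads. The first mod(n, nthreads) threads get one extra
  ! domain. If n < nthreads, some threads get an empty block (hi < lo).
  pure subroutine block_range(n, nthreads, tid, lo, hi)
    integer, intent(in)  :: n, nthreads, tid
    integer, intent(out) :: lo, hi
    integer :: base, extra
    base  = n / nthreads
    extra = mod(n, nthreads)
    lo = tid * base + min(tid, extra) + 1
    hi = lo + base - 1
    if (tid < extra) hi = hi + 1
  end subroutine block_range

  ! Each thread works only on its own block, every time this is called,
  ! so a given domain is always handled by the same thread.
  subroutine sweep_domains(num_domains, owner)
    !$ use omp_lib
    integer, intent(in)  :: num_domains
    integer, intent(out) :: owner(num_domains)
    integer :: tid, nthreads, lo, hi, i

    !$omp parallel default(none) shared(num_domains, owner) &
    !$omp private(tid, nthreads, lo, hi, i)
    tid = 0
    nthreads = 1
    !$ tid = omp_get_thread_num()
    !$ nthreads = omp_get_num_threads()
    call block_range(num_domains, nthreads, tid, lo, hi)
    do i = lo, hi
      owner(i) = tid   ! stand-in for the per-domain work
    end do
    !$omp end parallel
  end subroutine sweep_domains

end module domain_partition
```

If the domains differ a lot in cost, equal-sized blocks will be unbalanced, so the block boundaries would probably need to be weighted by domain size rather than by count.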
MarkB
 
Posts: 480
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Sharing memory among threads

Post by ftinetti » Tue May 22, 2012 7:26 am

Hi,

I get some speedup, but not as much as expected. This part of the code accounts for 80% of the total program run time, so I would expect a theoretical maximum speedup of 5x. I get 2.5-3x (I vary the number of threads from 2 to 48 on a 48-core machine; the maximum speedup is at 20-22 cores).

What is the runtime of the do loop for 20-22 cores? Maybe it is already too short to be reduced further by adding more cores?

1) Use threadprivate data. This will mean multiplying the memory used by num_threads.
2) Use allocatable arrays to be sure each domain is sized as needed and not at the size of the biggest domain.

From the point of view of memory, both alternatives multiply memory usage, but I think they would be useful only if there is some performance gain. From a performance point of view, I think that memory access patterns are the priority. Now, would some of these alternatives enhance the memory access pattern/s (e.g. by reducing memory contention and/or avoiding NUMA effects)?

3) Use static schedule by manually dividing the domains

Did you try with static and different chunks?

Aha, I see!

It might be useful to time both the parallel loop and the convergence check on different numbers of threads. If the convergence check is sequential, then the bottleneck might be the data movement of x and f to/from a single core's cache on each iteration.

Yep, the check is sequential, so I need to gather all of x and f and then call the check from the master thread.

With this in mind, would making the data threadprivate make a difference?

I don't know about "moving data to be threadprivate", but is it possible to do at least some of the intermediate convergence check computations in parallel? The sequential part would then only combine those intermediate results to complete the check.
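A minimal sketch of that idea, under the assumption that the check boils down to a 2-norm of f over all domains (I don't know the real check, so this is only a guess). The sum of squares splits by domain, so each thread can accumulate its share and only the final square root is sequential. The names convergence_sketch and residual_norm2 are made up for illustration.

```fortran
module convergence_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! 2-norm of f taken over all domains; column i of f is assumed to
  ! hold the values belonging to domain i.
  function residual_norm2(f) result(nrm)
    real(dp), intent(in) :: f(:, :)
    real(dp) :: nrm, sumsq
    integer :: i

    sumsq = 0.0_dp
    ! Each thread sums the squares of its own domains; OpenMP combines
    ! the per-thread partial sums at the end of the loop.
    !$omp parallel do default(none) shared(f) private(i) reduction(+:sumsq)
    do i = 1, size(f, 2)
      sumsq = sumsq + sum(f(:, i)**2)
    end do
    !$omp end parallel do

    nrm = sqrt(sumsq)
  end function residual_norm2

end module convergence_sketch
```

In the real code the partial sum could probably be accumulated right after solve_domain inside the existing loop, while f(:,i) is still in that core's cache. Note that a floating-point reduction may add terms in a different order on each run, so the result can differ in the last bits. If the check has a different form, this does not apply.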

Just a minor comment: in the code:
Code:
do while(.not. converged)
  !$omp parallel do default(none) &
  !$omp private(i, info) &
  !$omp shared(num_domains,A,IPVT,Ty,Tx,Bx,By,Cy,Cx,TB,x,f) &
  !$omp reduction(+:counter) &
  !$omp schedule(dynamic,chunk)
  do i=1,num_domains
    call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
    call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
    counter=counter+1
  enddo

why don't you write
Code:
do while(.not. converged)
  !$omp parallel do default(none) &
  !$omp private(i, info) &
  !$omp shared(num_domains,A,IPVT,Ty,Tx,Bx,By,Cy,Cx,TB,x,f) &
  !$omp schedule(dynamic,chunk)
  do i=1,num_domains
    call calculate_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i),Cy,Cx)
    call solve_domain(A(:,:,i),IPVT(:,i),Ty(:,i),Tx(:,i),Bx(:,i),By(:,i),TB(:,i),x(:,i),f(:,i))
  enddo
  !$omp end parallel do
  counter = counter + num_domains


Regards,

Fernando.
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm

Re: Sharing memory among threads

Post by p3tris » Tue May 22, 2012 7:43 am

What is the runtime of the do loop for 20-22 cores? Maybe it is already too short to be reduced further by adding more cores?

Yes, when I reach 20-22 cores the work in the loop is too small to benefit from any more cores.

From the point of view of memory, both alternatives multiply memory usage, but I think they would be useful only if there is some performance gain. From a performance point of view, I think that memory access patterns are the priority. Now, would some of these alternatives enhance the memory access pattern/s (e.g. by reducing memory contention and/or avoiding NUMA effects)?

My idea was that, with threadprivate, the data would be allocated in memory close to the CPU that uses it and would therefore be faster to access. The machine has 4 CPUs with 12 cores each and a shared L3 cache, so this should increase locality.
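A related way to get that locality without threadprivate is first-touch initialization, sketched below. It assumes the default Linux policy, where a memory page is placed on the NUMA node of the thread that first writes to it; other systems may behave differently. The names first_touch_sketch and init_first_touch are made up for illustration, and x and f are assumed to have one column per domain.

```fortran
module first_touch_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Initialize the shared arrays in parallel, one column per domain, so
  ! that each column is first written by the thread that will work on it.
  subroutine init_first_touch(x, f)
    real(dp), intent(out) :: x(:, :), f(:, :)
    integer :: i

    !$omp parallel do default(none) shared(x, f) private(i) schedule(static)
    do i = 1, size(x, 2)
      x(:, i) = 0.0_dp
      f(:, i) = 0.0_dp
    end do
    !$omp end parallel do
  end subroutine init_first_touch

end module first_touch_sketch
```

This only pays off if the compute loop uses the same static schedule and the threads stay on their cores (for example with OMP_PROC_BIND=true); with schedule(dynamic) a domain can end up on any thread, so the placement is lost. Placement is also per page, so small columns belonging to different threads can still share a page.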

Did you try with static and different chunks?

Yes, I did. I made a shell script with 3 nested loops: OMP_NUM_THREADS=1..48, chunk=1..1001..50 and OMP_SCHEDULE="<dynamic,static,guided>,chunk".
The best results were obtained with dynamic 100.

I don't know about "moving data to be threadprivate", but is it possible to make at least some intermediate convergence check computations in parallel? Thus, the sequential one would use those intermediate computations in the complete check.

I understand. That would mean changing the norm from the 2-norm (which needs all of x and f to be calculated) to the infinity norm. It's a change to the algorithm that I'll look into later on.

Just a minor comment: in the code:

Yep, the code is a reduced version of the real one. Actually, the counter is not incremented on every loop iteration (because of the dishonest update I noted in a previous post).

Thanks for the ideas!
p3tris
 
Posts: 9
Joined: Fri May 06, 2011 5:57 pm

Re: Sharing memory among threads

Post by tomriddle » Thu May 31, 2012 8:59 am

Slightly off-topic, but I had massive trouble sharing memory/global variables among threads with Perl not so long ago. I even got ridiculed on the Perl IRC and was told it's the worst thing you can do, as it is basically seen as bad practice...

So how does this translate over to C++? I've thought about having shared variables over threads before, but never actually got that far into the development of my MMORPG server...
I'm a web developer in Ipswich who does web design - Ask me if you need help!
tomriddle
 
Posts: 2
Joined: Thu May 31, 2012 8:36 am
