## Figure 3.11

Use this forum to discuss the book *Using OpenMP: Portable Shared Memory Parallel Programming* by Barbara Chapman, Gabriele Jost and Ruud van der Pas. See viewtopic.php?f=8&t=465 for book information and to download the examples. Post your feedback about the book and examples to this forum.

### Figure 3.11

As someone learning OpenMP, I have looked at the Fortran code for Figure 3.11.
To test a range of problem sizes, I have placed the code in a DO loop:

```fortran
      DO m = 500, 25000, 500
         n = m
```

I have also used CPU_TIME and SYSTEM_CLOCK to measure processor and elapsed times.

The test hardware is: HP Z400, Intel(R) Xeon(R) W3520 CPU @ 2.67 GHz, 12.0 GB RAM, Windows 7 Enterprise.
The test compiler is: Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.5.344 Build 20120612.

I have found that the initialisation part of the code takes a significant amount of time, far more than the mxv code.
```fortran
      print *, 'Initializing matrix B and vector c'
      c(1:n) = 1.0
      do i = 1, m
         b(i,1:n) = i
      end do
```

My testing of alternative coding, such as
`do j = 1,n; do i = 1,m; b(i,j) = i; end do; end do` or
`forall (i=1:m, j=1:n) b(i,j) = i`,
provided some improvement, but did not reduce the initialisation time to that of the mxv code.
My conclusion is that "grabbing" the memory for the first time takes a significant amount of time. This is a real issue for the total test time.

From a non-OMP viewpoint, I am also concerned that the inner loop over j steps across a large memory footprint for b(i,j), which is the problem in a matrix-vector multiplication:
```fortran
         do j = 1, n
            a(i) = a(i) + b(i,j)*c(j)
         end do
```

I was pleased to see that, for large problem sizes (up to 4.6 GB), memory accessing associated with loss of cache did not appear to be an issue in this test.
I tested an alternative transposed MXV-t, but this did not show an improvement:
```fortran
      a = 0
!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(m,n,a,b,c) PRIVATE(i,j)
      do j = 1, n
         do i = 1, m
            a(i) = a(i) + b(i,j)*c(j)
         end do
      end do
!$OMP END PARALLEL DO
```

or possibly
```fortran
      do j = 1, n
         a = a + b(:,j)*c(j)
      end do
```

I have created 3 test programs: the existing mxv routine, a transposed mxv, and a transposed mxv using array syntax. While the transposed version performs better as serial code, the OMP mxv routine in Fig 3.11 performs better. All perform significantly better than the initialisation loop, where the memory for the arrays is first being used!

My ultimate goal is to create an OpenMP version of a direct solver for large sets of linear equations (symmetric, positive definite).
My two non-OMP alternatives are a Gaussian solver or a Crout skyline solver. All my attempts with the Gaussian solver to date appear to have failed on the memory address bandwidth, although I am hoping to learn a lot more as I go along. The Crout solver, which has a dominant sequential update, does not appear to be suited to OMP.

I am keen to hear from others about my comments on Figure 3.11.

John Campbell
johncampbell

Posts: 10
Joined: Tue Aug 27, 2013 6:48 pm

### Re: Figure 3.11

Hi John,

The first time a program touches memory, it will be expensive because page faults occur (the OS has to decide where in physical memory to put each page). Typically you would not measure this as part of a benchmark, since it is unlikely to be a significant part of the run time for a real application.

My ultimate goal is to create an OpenMP version of a direct solver for large sets of linear equations. (symmetric, positive definate).

Getting good performance for this depends on doing careful blocking for both cache and register re-use. Library routines (e.g. PLASMA http://icl.cs.utk.edu/plasma/ ) will likely do a much better job than anything you can hand code.

Hope that helps,
Mark.
MarkB

Posts: 615
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

### Re: Figure 3.11

Mark,

I have a question regarding the coding of a Gaussian elimination solver. The problem I have is understanding the overhead of entering the parallel region many times. My basic approach to the Gauss solver is something like:
```fortran
   do n = 1, num_equations
!   generate row of active equations
      num_band = 0
      do k = 1, max_band
         if (inactive equation) cycle
         num_band = num_band+1
         row_i(num_band) = k
         ...
      end do
!
!$OMP PARALLEL DO
      do ii = 1, num_band
         i = row_i(ii)
         c = row_f(ii)/row_f(0)
         do jj = ii, num_band
            j = row_i(jj)
            stif(j-i,n+i) = stif(j-i,n+i) - c*row_f(jj)
         end do
      end do
!$OMP END PARALLEL DO
!
   end do
```

As num_equations can be very large, I wonder if moving the DO n loop into the parallel region, while processing it sequentially and running the active-equation selection as a SINGLE section, might improve performance. This could be something like (although incomplete):
```fortran
!$OMP PARALLEL
!$OMP DO sequential
   do n = 1, num_equations
!   generate row of active equations
!$OMP SINGLE
      num_band = 0
      do k = 1, max_band
         if (inactive equation) cycle
         num_band = num_band+1
         row_i(num_band) = k
         ...
      end do
!$OMP END SINGLE
!
!$OMP DO
      do ii = 1, num_band
         i = row_i(ii)
         c = row_f(ii)/row_f(0)
         do jj = ii, num_band
            j = row_i(jj)
            stif(j-i,n+i) = stif(j-i,n+i) - c*row_f(jj)
         end do
      end do
!$OMP END DO
!
   end do
!$OMP END PARALLEL
```

From my understanding of OpenMP implementation, the purpose of my change is to minimise the initiation of the OMP region to once, rather than num_equations times. However, I do not know how to control the n loop as serial and the k loop as a single process.

There is ample computation in the two inner i,j loops to achieve parallel performance; however, my implementation of this shows a bottleneck in the "get" and "put" of stif(..). I think your reference to blocking to improve caching needs to be better understood. My impression is that cache bottlenecks and the memory delays of the previous post are significant issues for implementing !$OMP on an Intel PC, which is a cheap and very common platform for parallel processing in comparison to previous workstations.

The advantage of the skyline solver is that there is only a single put of stif(..), but at the expense of requiring serial operation of loops n and i, reducing the parallel implementation to loop j only. My impression of my failure with this approach is that I am initiating the !$OMP region num_equations x num_band times and this overhead becomes too expensive. (The memory cache issue remains significant also?)
Could you comment on my interpretation of this repeated entry into the !\$OMP region ?

I am finding that implementing AVX instructions as a serial process provides achievable benefit in an easier way, especially for dot_product-style calculations. Are the Intel i7-style processors not yet suited to OMP computation?

John
johncampbell

Posts: 10
Joined: Tue Aug 27, 2013 6:48 pm

### Re: Figure 3.11

Hi John,

Reducing the number of entries/exits to parallel regions may reduce the overheads a bit, though you are replacing one parallel region with two implicit barriers (at the END SINGLE and END DO). Depending on the implementation, this might not save you very much.

Your code is pretty much correct, if you simply remove the DO sequential and allow every thread to (redundantly) execute all the iterations of the outermost loop. Using MASTER followed by an explicit BARRIER may be slightly less expensive than SINGLE for large numbers of threads.

I think that data affinity is more likely to be the main issue here. You need to try to arrange that the same thread accesses the same columns of the stif array in subsequent instances of the parallel loop as much as possible, without compromising the load balance, so that cache reuse is maximised. You may be able to do this by changing the loop schedule (and possibly running the ii loop in reverse), or you might need to use just a parallel region and explicitly compute which iterations each thread needs to do.

As an analogy, let's consider Gaussian elimination on a full dense matrix with no pivoting:
```fortran
 do k = 1, n-1
   do i = k+1, n
      a(i, k) = a(i, k) / a(k, k)
   end do
!$omp parallel do
   do j = k+1, n
      do i = k+1, n
         a(i, j) = a(i, j) - a(i, k) * a(k, j)
      end do
   end do
 end do
```

This doesn't have very good data affinity because the distribution of columns of a(:,:) to threads changes from one iteration of the k loop to the next.
This can be fixed by running the j loop backwards and using a cyclic schedule:

```fortran
 do k = 1, n-1
   do i = k+1, n
      a(i, k) = a(i, k) / a(k, k)
   end do
!$omp parallel do schedule(static,1)
   do j = n, k+1, -1
      do i = k+1, n
         a(i, j) = a(i, j) - a(i, k) * a(k, j)
      end do
   end do
 end do
```

johncampbell wrote: I am finding that implementing AVX instructions as a serial process provides achievable benefit in an easier way, especially for dot_product-style calculations. Are the Intel i7-style processors not yet suited to OMP computation?

Parallelising dot products efficiently with OpenMP relies on the vectors being long enough to offset the parallel-region overheads, and also on good data affinity. If the vector is all in one core's cache at the start of the dot product, for example, then the cost of remote data accesses can be significant.

Hope that helps,
Mark.
MarkB

Posts: 615
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh