OpenMP performance on different hardware/OS


Re: OpenMP performance on different hardware/OS

Postby ilmarw » Sun Jan 20, 2008 1:41 pm

Anyone?

ilmar
ilmarw
 
Posts: 5
Joined: Tue Jan 08, 2008 3:47 am

Re: OpenMP performance on different hardware/OS

Postby lfm » Tue Jan 22, 2008 9:43 am

The first pragma and the line after it look like this:
#pragma omp parallel shared(A, col, row)
for (k = 0; k<SIZE-1; k++) {

There is no worksharing directive, so every thread redundantly executes the entire for loop rather than dividing up the iterations. And adding one would not help: that loop cannot be run in parallel because of its data dependences.
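
The difference is easy to see with a toy program (just a sketch, untested; work() is a made-up placeholder that reports which thread ran which iteration):

Code: Select all
#include <stdio.h>
#include <omp.h>

#define SIZE 8

/* placeholder for a loop body */
void work(int k)
{
  printf("thread %d ran iteration %d\n", omp_get_thread_num(), k);
}

int main(void)
{
  int k;

  /* No worksharing directive: EVERY thread executes ALL of the
     iterations, so each iteration runs once per thread. */
#pragma omp parallel private(k)
  for (k = 0; k < SIZE-1; k++)
    work(k);

  printf("----\n");

  /* With the "for" worksharing directive the iterations are divided
     among the threads and each one runs exactly once. */
#pragma omp parallel for private(k)
  for (k = 0; k < SIZE-1; k++)
    work(k);

  return 0;
}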
lfm
 
Posts: 135
Joined: Sun Oct 21, 2007 4:58 pm
Location: OpenMP ARB

Re: OpenMP performance on different hardware/OS

Postby ilmarw » Tue Jan 22, 2008 9:51 am

OK, I see it now. Which dependences do you mean, the fact that the variable n depends on k?

ilmar
ilmarw
 
Posts: 5
Joined: Tue Jan 08, 2008 3:47 am

Re: OpenMP performance on different hardware/OS

Postby lfm » Wed Feb 06, 2008 10:55 am

Here is the code:
Code: Select all
for (k = 0; k<SIZE-1; k++) {
    /* set col values to column k of A */
    for (n = k; n<SIZE; n++) {
      col[n] = A[n][k];
    }

    /* scale values of A by multiplier */
    for (n = k+1; n<SIZE; n++) {
      A[k][n] /= col[k];
    }

    /* set row values to row k of A */
    for (n = k+1; n<SIZE; n++) {
      row[n] = A[k][n];
    }

    /* Here we update A by subtracting the appropriate values from row
       and column.  Note that these adjustments to A can be done in
       any order */
#pragma omp parallel for shared(A, row, col)
    for (i = k+1; i<SIZE; i++) {
      for (j = k+1; j<SIZE; j++) {
        A[i][j] = A[i][j] - row[i] * col[j];
      }
    }
  }


Let's look at iteration m of the outer loop. It uses A[m:SIZE-1][m], A[m][m+1:SIZE-1], and A[m+1:SIZE-1][m+1:SIZE-1]. It modifies A[m][m+1:SIZE-1] and A[m+1:SIZE-1][m+1:SIZE-1]. So on iteration m+1, the value of A[m+1][m+2:SIZE-1] (for example) is used, which was computed on the previous iteration. Thus there is a loop-carried true dependence from iteration m to iteration m+1 and the loop cannot be parallelized as written.

One way to get more parallelism here is to block the computation. This URL seems to have a good explanation:
http://www.cs.berkeley.edu/~demmel/cs267/lecture12/lecture12.html#link_5
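
Short of full blocking, one smaller change is worth trying: instead of creating a new thread team on every k iteration with "parallel for", open a single parallel region around the whole factorization and use "single" for the serial setup and "for" for the update. Just a sketch, untested, and it assumes the declarations from the code above:

Code: Select all
#pragma omp parallel shared(A, row, col) private(k, n, i, j)
  {
    for (k = 0; k < SIZE-1; k++) {
      /* serial setup, done by one thread; the implicit barrier at
         the end of "single" holds the other threads until it is done */
#pragma omp single
      {
        for (n = k; n < SIZE; n++)
          col[n] = A[n][k];
        for (n = k+1; n < SIZE; n++)
          A[k][n] /= col[k];
        for (n = k+1; n < SIZE; n++)
          row[n] = A[k][n];
      }

      /* the independent update is workshared; the implicit barrier
         at the end of "for" separates the k iterations */
#pragma omp for
      for (i = k+1; i < SIZE; i++)
        for (j = k+1; j < SIZE; j++)
          A[i][j] = A[i][j] - row[i] * col[j];
    }
  }

Every thread executes the k loop itself, kept in step by the barriers, so the loop-carried dependence is still respected; you just pay the team-creation cost once instead of SIZE-1 times.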

-- Larry
lfm
 
Posts: 135
Joined: Sun Oct 21, 2007 4:58 pm
Location: OpenMP ARB

Re: OpenMP performance on different hardware/OS

Postby jmhal » Sat Mar 01, 2008 7:46 am

For all the examples at http://kallipolis.com/openmp/, the parallel version runs slower than the serial version. Here's my hardware setup:

Intel(R) Core(TM)2 CPU T5300 @ 1.73GHz
Ubuntu Linux 7.10
GCC 4.2.1
Intel C Compiler 10.1

When the parallel version is running, top shows only one process. I can't prove it for sure, but I strongly suspect the threads are not being divided evenly among the cores. Has anyone gotten a good speedup using Linux?
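
One note on top: an OpenMP program is a single process with multiple threads, so one entry in top is expected; press "H" to have top show the individual threads. To confirm the runtime is actually giving you more than one thread, a tiny check like this should work (untested sketch, using the standard OpenMP runtime calls; compile with gcc -fopenmp or icc -openmp):

Code: Select all
#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* report the team size once, then have every thread check in */
#pragma omp parallel
  {
#pragma omp single
    printf("team size: %d\n", omp_get_num_threads());

    printf("hello from thread %d\n", omp_get_thread_num());
  }
  return 0;
}

If that prints a team size of 1, check that OMP_NUM_THREADS is set and that the OpenMP flag was actually passed to the compiler.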
jmhal
 
Posts: 3
Joined: Tue Feb 12, 2008 6:29 pm

Re: OpenMP performance on different hardware/OS

Postby ejd » Tue Apr 08, 2008 2:24 am

I downloaded the two files combined.c and combined_mp.c from the site you gave (http://kallipolis.com/openmp/) and ran them on an IBM box. It was running Linux 2.6.9-11.ELsmp and had 2 Intel Pentium 4 processors running at 3.6 GHz. The compiler I had access to was the Intel C Compiler 9.1.037. Here is what I saw:
Code: Select all
% icc combined.c
% time a.out
e started at 0
e done at 5970000
pi started at 5970000
pi done at 11060000
integration started at 11060000
integration done at 19930000
Values: e*pi = 8.539734,  integral = 9.666667
Total elapsed time: 19930.000 seconds
19.901u 0.035s 0:20.02 99.5%    0+0k 0+0io 0pf+0w

% setenv OMP_DYNAMIC FALSE
% setenv OMP_NUM_THREADS 2
% icc -openmp combined_mp.c
combined_mp.c(33) : (col. 1) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
combined_mp.c(65) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
combined_mp.c(31) : (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
% time a.out
e started at 0
pi started at 0
e done at 14600000
integration started at 14600000
pi done at 15190000
integration started at 15190000
integration done at 30840000
Values: e*pi = 8.539734,  integral = 9.666667
Total elapsed time: 30980.000 seconds
30.963u 0.028s 0:15.64 198.0%   0+0k 0+0io 0pf+0w

The first thing to note is that the value returned from clock() in the program is not accurate when running in parallel. The programmer also didn't do the calculation of seconds correctly (they divided by 1000 instead of CLOCKS_PER_SEC); lfm had made note of this in a previous post. Looking at the elapsed times reported by the time command, I am seeing about a 21.8% decrease in elapsed time.
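
The underlying problem is that on Linux and Solaris clock() returns CPU time for the whole process, accumulated across all threads, which is why the parallel runs above report a larger "Total elapsed time" even though the wall-clock time went down. If the program needs its own timer, omp_get_wtime() returns wall-clock seconds and is part of the OpenMP API. A minimal sketch:

Code: Select all
#include <stdio.h>
#include <omp.h>

int main(void)
{
  double t0, t1;

  t0 = omp_get_wtime();   /* wall-clock time, not CPU time */

  /* ... the work being timed goes here ... */

  t1 = omp_get_wtime();
  printf("Total elapsed time: %.3f seconds\n", t1 - t0);
  return 0;
}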

Running this on an older SPARC system running Solaris 10 and using the Sun Studio 12 compiler, I am seeing about a 29.4% decrease in elapsed time.
Code: Select all
% cc -xO3 combined.c
% time a.out
e started at 0
e done at 120000
pi started at 120000
pi done at 7680000
integration started at 7680000
integration done at 17620000
Values: e*pi = 8.539734,  integral = 9.666667
Total elapsed time: 17620.000 seconds
17.0u 0.0s 0:17 96% 0+0k 0+0io 0pf+0w

% setenv OMP_DYNAMIC FALSE
% setenv OMP_NUM_THREADS 2
% cc -xO3 -xopenmp combined_mp.c
% time a.out
e started at 0
pi started at 0
e done at 230000
integration started at 230000
integration done at 9670000
pi done at 12650000
integration started at 12650000
Values: e*pi = 8.539734,  integral = 9.666667
Total elapsed time: 17570.000 seconds
17.0u 0.0s 0:12 131% 0+0k 0+0io 0pf+0w

So with this program, as written, I don't think you are going to see large decreases in elapsed time using 2 processors. Unfortunately, I don't have a Linux system available right now that is close to yours. I did try it on the same old SPARC system using 4 threads and got a 47% reduction in elapsed time.
Code: Select all
% setenv OMP_NUM_THREADS 4
% time a.out
integration started at 0
integration started at 0
e started at 0
pi started at 0
e done at 390000
integration started at 390000
integration done at 9340000
pi done at 14920000
integration started at 14920000
Values: e*pi = 8.539734,  integral = 9.666667
Total elapsed time: 17310.000 seconds
17.0u 0.0s 0:09 170% 0+0k 0+0io 0pf+0w
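
For what it's worth, that is about what Amdahl's law predicts. If a fraction p of the runtime is parallel, the best speedup on N processors is 1 / ((1 - p) + p/N). The 2-thread SPARC run went from about 17 to 12 seconds of elapsed time, which corresponds to p of roughly 0.6, and plugging p = 0.6 and N = 4 into the formula predicts about a 1.8x speedup, or roughly 9.5 seconds, which is close to the 0:09 measured above.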

What sort of speedup are you seeing?
ejd
 
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am
