Openmp performance on different hardware/OS

Postby anon » Tue Nov 06, 2007 7:50 am

Hi,

I am new to OpenMP and have been trying to run an example program to get a feel for how it works. The example program performs an LU decomposition on a matrix (see http://kallipolis.com/openmp/2.html for the LU_mp.cpp source). When I compile this code using Visual Studio 2005 Professional on an Intel Xeon dual-core machine, I get the results quoted on that page (i.e., 12 secs with 1 thread, 6 secs with 2 threads) using both the Microsoft and Intel 9.1 compilers for Windows. However, if I run the same code on a Linux machine (quad-core Opteron) using either the Intel 9.1 or gcc 4.2.2 compilers, I don't see any speedup: the time taken with 1 and 2 threads is approximately equal. Could the implementations of OpenMP on the different systems account for this, or am I missing something fundamental?

TIA

anon
anon
 

Re: Openmp performance on different hardware/OS

Postby Pedro » Tue Nov 06, 2007 9:16 am

Hi,
I am new to OpenMP too. I tested all the codes from the same site on a Pentium 4 HT and, like you, saw no improvement in the LU decomposition. Also, when I compiled and ran complete_mp.c, the reduction (variable sum) didn't work even though I used the same code. I have just asked for help in another forum. I am not sure whether the problem is that I have a Pentium 4 HT while the site uses a Pentium D. Did you get the right result with this program?

regards

Pedro
Pedro
 
Posts: 5
Joined: Sun Nov 04, 2007 4:55 am

Re: Openmp performance on different hardware/OS

Postby anon » Tue Nov 06, 2007 9:41 am

Hi Pedro,

I made one slight adjustment to the code to allow the number of threads to be passed in via the command line, and also added an omp parallel for pragma on the outer loop (see code below). This program reports the following timings when running with 1 and 2 threads on my dual-processor 3.6GHz Intel Xeon (Windows XP):

Completed decomposition in 12.688 seconds using 1 thread(s).
Completed decomposition in 6.516 seconds using 2 thread(s).

The strange thing is that I can't get the same results using the same code on other hardware. I read somewhere about Windows threads and pthreads being different, and I wonder if this has something to do with it, but I don't know much about either so I am definitely clutching at straws.

Anon

#include <omp.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE 500

int main(int argc, char *argv[])
{
    double start, stop; // for keeping track of running time
    double A[SIZE][SIZE];
    double col[SIZE], row[SIZE];
    int i, j, k, n;

    int num_threads = 1;
    if (argc > 1)
    {
        num_threads = atoi(argv[1]);
    }
    omp_set_num_threads(num_threads);

    // preload A with random values
    for (i = 0; i<SIZE; i++)
    {
        for (j = 0; j<SIZE; j++)
        {
            A[i][j] = rand();
        }
    }

    // time start now
    start = clock();

    // The core algorithm
    for (k = 0; k<SIZE-1; k++)
    {
        // set col values to column k of A
        for (n = k; n<SIZE; n++)
        {
            col[n] = A[n][k];
        }

        // scale values of A by multiplier
        for (n = k+1; n<SIZE; n++)
        {
            A[k][n] /= col[k];
        }

        // set row values to row k of A
        for (n = k+1; n<SIZE; n++)
        {
            row[n] = A[k][n];
        }

        // Here we update A by subtracting the appropriate values from row
        // and column. Note that these adjustments to A can be done in
        // any order
        #pragma omp parallel for
        for (i = k+1; i<SIZE; i++)
        {
            #pragma omp parallel for shared(A, row, col)
            for (j = k+1; j<SIZE; j++)
            {
                A[i][j] = A[i][j] - row[i] * col[j];
            }
        }
    }

    // we're done so stop the timer
    stop = clock();

    printf("Completed decomposition in %.3f seconds using %d thread(s).\n",
           (stop-start) / CLOCKS_PER_SEC, num_threads);

    return 0;
}
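
Editor's sketch: the program above puts a second parallel for on the inner j loop. When nested parallelism is disabled, which is typically the default, that inner region runs on a team of one thread, so it adds region start-up cost on every i iteration without adding parallelism. A single parallel for on the i loop, with j private to each thread, should be enough. The function name update_trailing and the flat row-major layout are assumptions made for this example; the update expression is copied unchanged from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Update the trailing submatrix for elimination step k.
 * a is an n-by-n matrix stored row-major in a flat array
 * (a layout assumed for this sketch, not taken from the post). */
void update_trailing(double *a, int n, int k,
                     const double *row, const double *col)
{
    assert(a && row && col && k >= 0 && k < n);

    /* One parallel region per step: iterations of i are split across
     * the team. Without OpenMP the pragma is ignored and the loop
     * simply runs serially. */
    #pragma omp parallel for
    for (int i = k + 1; i < n; i++) {
        /* j is declared inside the parallel loop, so it is private */
        for (int j = k + 1; j < n; j++) {
            a[(size_t)i * n + j] -= row[i] * col[j];
        }
    }
}
```

Each thread writes a disjoint set of rows of a and only reads row and col, so no further synchronisation should be needed inside the loop.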
anon
 

Re: Openmp performance on different hardware/OS

Postby lfm » Tue Nov 06, 2007 10:07 am

There may be some issues with the clock() function on different OSs. You may want to use omp_get_wtime() which is guaranteed to return wall-clock time.
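
Editor's sketch of the timing difference described here. On Linux, clock() reports processor time for the whole process, which adds up across threads, so a run that finishes in half the wall-clock time on two threads can report about the same clock() value as the one-thread run. Microsoft's C runtime returns elapsed wall-clock time from clock() instead, which would be consistent with the Windows timings looking right. The helper names wall_seconds and cpu_seconds are made up for this example, and the timespec_get fallback is only there so the file still builds without OpenMP.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds measured from an arbitrary fixed point. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    /* Fallback when compiled without OpenMP: C11 timespec_get */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}

/* Processor time as reported by clock(), the call used in the
 * program posted above. */
double cpu_seconds(void)
{
    clock_t c = clock();
    assert(c != (clock_t)-1); /* -1 means processor time is unavailable */
    return (double)c / CLOCKS_PER_SEC;
}
```

To compare the two, record both values before and after the decomposition loop and print both differences. With two busy threads on Linux, the cpu_seconds difference would be expected to come out at roughly twice the wall_seconds difference.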

I modified the code to make the arrays static (I just pulled the declarations out of main) so I could make them bigger without worrying about stacksize. On size 2000, on a random 4 cpu older linux itanium box, compiled with Intel's compiler, -openmp, using omp_get_wtime, I get:

[lfmeadow@rufus lfmeadow]$ !ic
icc -openmp test.c
test.c(63) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
test.c(60) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
[lfmeadow@rufus lfmeadow]$ a.out 4
Completed decomposition in 5.444 seconds using 4 thread(s).
[lfmeadow@rufus lfmeadow]$ a.out 3
Completed decomposition in 7.314 seconds using 3 thread(s).
[lfmeadow@rufus lfmeadow]$ a.out 2
Completed decomposition in 9.393 seconds using 2 thread(s).
[lfmeadow@rufus lfmeadow]$ a.out 1
Completed decomposition in 13.670 seconds using 1 thread(s).

And on a random 2 cpu 2 core x86 box, linux, intel compiler, -openmp, same code, I get:
[lfmeadow@fxeqlin03 ~]$ icc -openmp test.c
test.c(63) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
test.c(60) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
[lfmeadow@fxeqlin03 ~]$ a.out 4
Completed decomposition in 12.093 seconds using 4 thread(s).
[lfmeadow@fxeqlin03 ~]$ a.out 3
Completed decomposition in 13.124 seconds using 3 thread(s).
[lfmeadow@fxeqlin03 ~]$ a.out 2
Completed decomposition in 16.437 seconds using 2 thread(s).
[lfmeadow@fxeqlin03 ~]$ a.out 1
Completed decomposition in 29.054 seconds using 1 thread(s).

It isn't currently convenient for me to run on a windows box (read, I can't remember how right offhand and I'm lazy).

The absolute times aren't important; I didn't mess with optimization flags, and tiling for cache would help, etc; rather, I wanted to demonstrate that this code does indeed show speedup with OpenMP on a couple of different hardware platforms.

Cheers.

-- Larry
lfm
 
Posts: 135
Joined: Sun Oct 21, 2007 4:58 pm
Location: OpenMP ARB

Re: Openmp performance on different hardware/OS

Postby Pedro » Tue Nov 06, 2007 3:27 pm

Hi Anon,
I took your code and compiled it with Intel's compiler under Windows XP on a Pentium 4 HT. It seems that hyperthreading is not enough to improve performance, as it only simulates two logical processors. With SIZE = 2000, the decomposition took longer with 2 threads than with one. The only explanation I have is that I need more than one real processor to benefit from OpenMP with this code.

regards,
Pedro
Pedro
 
Posts: 5
Joined: Sun Nov 04, 2007 4:55 am

Re: Openmp performance on different hardware/OS

Postby lfm » Tue Nov 06, 2007 5:33 pm

Yes, you definitely need two cores. The code is memory bandwidth and floating point intensive; the HT might help hide latency a little but it doesn't give you any more floating point performance.
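
Editor's sketch: omp_get_num_procs() typically counts logical processors, so a single-core Pentium 4 with Hyper-Threading would report two. One way to avoid oversubscribing the floating-point units is to cap the thread count at the number of physical cores. The helper choose_threads is hypothetical, and the physical core count is passed in by the caller because this sketch does not try to detect it.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of processors the OpenMP runtime reports (logical CPUs on
 * typical implementations); 1 when built without OpenMP. */
int logical_processors(void)
{
#ifdef _OPENMP
    int n = omp_get_num_procs();
#else
    int n = 1;
#endif
    assert(n >= 1);
    return n;
}

/* Hypothetical helper: clamp a requested thread count to the range
 * [1, physical_cores]. physical_cores is supplied by the caller. */
int choose_threads(int requested, int physical_cores)
{
    assert(physical_cores >= 1);
    if (requested < 1)
        return 1;
    return requested < physical_cores ? requested : physical_cores;
}
```

The result of choose_threads could then be passed to omp_set_num_threads() in place of the raw command-line value.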
lfm
 
Posts: 135
Joined: Sun Oct 21, 2007 4:58 pm
Location: OpenMP ARB

Re: Openmp performance on different hardware/OS

Postby Anon » Tue Nov 27, 2007 2:28 am

Hi Larry,

Thanks for the suggestions regarding omp_get_wtime() and pulling the array declarations outside the main function. I am now getting similar timings to you on a 2 processor dual-core Linux box using gcc 4.2.2, so I can see the speedup using OpenMP.
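
Editor's sketch of the array change discussed in this thread. With 8-byte doubles, a 2000 x 2000 matrix takes 32,000,000 bytes, which is often more than the default stack limit, so declaring it inside main can crash at startup. Moving it to file scope places it in static storage instead. The helpers fill_random, get and matrix_bytes are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SIZE 2000

/* File scope: static storage, so the stack limit does not apply */
static double A[SIZE][SIZE];

/* Preload A with pseudo-random values, as the original program does. */
void fill_random(unsigned seed)
{
    srand(seed);
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            A[i][j] = rand();
}

/* Bounds-checked read of one element. */
double get(int i, int j)
{
    assert(i >= 0 && i < SIZE && j >= 0 && j < SIZE);
    return A[i][j];
}

/* Total storage occupied by the matrix, in bytes. */
size_t matrix_bytes(void)
{
    return sizeof A;
}
```

Allocating the matrix with malloc would also keep it off the stack and would let the size be chosen at run time.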

Regards,

Anon
Anon
 

Re: Openmp performance on different hardware/OS

Postby ilmarw » Thu Jan 10, 2008 7:16 am

Hi all,

Using static arrays and omp_get_wtime, I get the following results using the program above on an Intel(R) Xeon(R) 3.00GHz (2x2 CPUs) running gcc version 4.2.1:
ilmarw@****:/scratch/ilmarw/openmp$ make test
gcc-4.2 -fopenmp -O3 -c test.c
gcc-4.2 -fopenmp -o test test.o
ilmarw@****:/scratch/ilmarw/openmp$ ./test 4
Completed decomposition in 24.938 seconds using 4 thread(s).
ilmarw@****:/scratch/ilmarw/openmp$ ./test 3
Completed decomposition in 24.185 seconds using 3 thread(s).
ilmarw@****:/scratch/ilmarw/openmp$ ./test 2
Completed decomposition in 30.246 seconds using 2 thread(s).
ilmarw@****:/scratch/ilmarw/openmp$ ./test 1
Completed decomposition in 36.641 seconds using 1 thread(s).

The speedup is minimal (I use an array size of 3000). Why do I not get similar results to those lfm got? Is it a compiler problem?
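
Editor's sketch for putting numbers on "minimal". Speedup is T1/Tn and efficiency is speedup divided by the thread count. The third function is the Karp-Flatt estimate of the serial fraction, e = (1/S - 1/n) / (1 - 1/n), which follows from Amdahl's law. The function names are made up for this example; the timings in the note below are the ones posted in this thread.

```c
#include <assert.h>

/* Speedup of an n-thread run over the 1-thread run. */
double speedup(double t1, double tn)
{
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

/* Fraction of ideal (linear) speedup achieved with n threads. */
double efficiency(double t1, double tn, int n)
{
    assert(n >= 1);
    return speedup(t1, tn) / n;
}

/* Karp-Flatt estimate of the serial fraction implied by a measured
 * speedup on n threads: e = (1/S - 1/n) / (1 - 1/n). */
double serial_fraction(double t1, double tn, int n)
{
    assert(n > 1);
    double s = speedup(t1, tn);
    return (1.0 / s - 1.0 / n) / (1.0 - 1.0 / n);
}
```

For the 4-thread run above, speedup(36.641, 24.938) is about 1.47 and the estimated serial fraction is about 0.57. For lfm's Itanium run, speedup(13.670, 5.444) is about 2.51 with a serial fraction of about 0.20. A large estimate does not have to mean the code is literally serial: for a bandwidth-bound kernel like this one it may instead reflect the cores competing for memory bandwidth.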

Sincerely, Ilmar
Last edited by ilmarw on Thu Jan 10, 2008 9:26 am, edited 3 times in total.
ilmarw
 
Posts: 5
Joined: Tue Jan 08, 2008 3:47 am

Re: Openmp performance on different hardware/OS

Postby ilmarw » Thu Jan 10, 2008 9:45 am

When running the LU decomposition code from http://kallipolis.com/openmp/2.html I get these results:

On my dual core Macbook (size: 2000):
ilmarw@******:~/openmp$ ./LU
Completed decomposition in 86.886 seconds
ilmarw@******:~/openmp$ ./LU_mp
Completed decomposition in 49.564 seconds

On the machine mentioned in previous post (size: 3000):
ilmarw@******:/scratch/ilmarw/openmp$ ./LU
Completed decomposition in 35.003 seconds
ilmarw@******:/scratch/ilmarw/openmp$ ./LU_mp
Completed decomposition in 21.836 seconds

I get roughly the same speedup with two processors as with four. Shouldn't it be more than that?

Sincerely, Ilmar
ilmarw
 
Posts: 5
Joined: Tue Jan 08, 2008 3:47 am

Re: Openmp performance on different hardware/OS

Postby ilmarw » Thu Jan 10, 2008 9:57 am

Maybe I'm getting slightly off topic, I am sorry about that.

I am just wondering: if the first pragma in http://kallipolis.com/openmp/LU_mp.c is not commented out, the program is many times slower than the serial version. What is the reason for this? It seems that the workload is not distributed among the processors.

Sincerely, Ilmar
ilmarw
 
Posts: 5
Joined: Tue Jan 08, 2008 3:47 am
