OpenMP with Sun performance analyzer

Postby sagovinson » Mon Jun 30, 2008 5:59 am

Recently I have been trying to parallelize C code with OpenMP, using the Sun Performance Analyzer as the profiling tool. However, the results show no parallelization at all. The Timeline tab in the Analyzer shows threads 1 and 2 but only CPU 1, and the two threads execute serially. I also tried other simple programs, but the version using the OpenMP API is always slower than the serial version. I wonder whether it is possible to do this in a virtual machine at all. In summary, what could be causing this, and what should I do to make the code run in parallel?
BTW: I don't know of a free image upload site at the moment, so I can't show the Analyzer output.

The OS runs in a virtual machine: 32-bit Linux (Debian 4.0) with 2 CPUs. The compiler suite is Sun Studio 12.
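
One way to check whether the OpenMP runtime inside the virtual machine sees both CPUs is a small diagnostic along the following lines. This is only a sketch and not part of the program in question; the function names threads_in_region and visible_processors are made up for illustration. The code falls back to 1 when compiled without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads that actually joined a parallel region,
   or 1 when the program was built without OpenMP. */
int threads_in_region(void)
{
    int count = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
        /* One thread records the team size; the others wait at the
           implicit barrier at the end of the single construct. */
#pragma omp single
        count = omp_get_num_threads();
    }
#endif
    return count;
}

/* Number of processors the OpenMP runtime reports as available,
   or 1 when the program was built without OpenMP. */
int visible_processors(void)
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}
```

Printing both values at the start of main would show whether the runtime reports 2 processors and whether a parallel region really gets 2 threads. It says nothing about how the hypervisor schedules the virtual CPUs onto physical ones.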

Running cat /proc/cpuinfo gives the following:
Code: Select all
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E5335  @ 2.00GHz
stepping        : 8
cpu MHz         : 1999.743
cache size      : 4096 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E5335  @ 2.00GHz
stepping        : 8
cpu MHz         : 1999.743
cache size      : 4096 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes


The C code I used is:
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>

#define THREADS 2

void mxv(int m, int n, double * a,double * b,double * c)
{
   int i,j;

#pragma omp parallel for default(none) \
   shared(m,n,a,b,c) private(i,j)
   for (i=0;i<m;i++)
   {
      a[i]=0.0;
      for (j=0;j<n;j++)
          a[i]+=b[i*n+j]*c[j];
   }
}

int main()
{
   double *a,*b,*c;
   int i,j,m,n;
   time_t start,end;
   m=5000;
   n=10000;
   
   if ((a=(double *)malloc(m*sizeof(double)))==NULL)
      perror("memory allocation for a");
   if ((b=(double *)malloc(m*n*sizeof(double)))==NULL)
      perror("memory allocation for b");
   if ((c=(double *)malloc(n*sizeof(double)))==NULL)
      perror("memory allocation for c");
   
   for (j=0;j<n;j++)
      c[j]=2.0;
   for (i=0;i<m;i++)
       for (j=0;j<n;j++)
          b[i*n+j]=i;
   #ifdef _OPENMP
   omp_set_num_threads(THREADS);
   omp_set_dynamic;
   #endif
   
   start=clock();
   mxv(m,n,a,b,c);
   end=clock();
   
   printf("Elapsed time for Multi is %dsecs\n",(end-start));
   
   free(a);free(b);free(c);
   return 0;
}      
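
One detail in the listing above: the statement omp_set_dynamic; names the function without calling it, so it has no effect. A minimal sketch of what the call might look like follows. Passing 0 (which disables dynamic adjustment of the thread count) is an assumption about the intent, and configure_threads is a hypothetical helper name, not something from the original program.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: request a fixed team size and return the maximum
   team size the runtime reports afterwards (1 when built without OpenMP). */
int configure_threads(int nthreads)
{
#ifdef _OPENMP
    omp_set_dynamic(0);             /* 0 = runtime may not adjust the team size */
    omp_set_num_threads(nthreads);  /* team size for later parallel regions */
    return omp_get_max_threads();
#else
    (void)nthreads;                 /* unused in a serial build */
    return 1;
#endif
}
```

Calling configure_threads(THREADS) in place of the two statements inside the #ifdef _OPENMP block would make the request explicit. Whether this changes the timings in the virtual machine is a separate question.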

The commands I used to collect the results are:
$cc -xopenmp -xO3 -g matri_multi.c
$collect ./a.out
$analyzer &
sagovinson
 
Posts: 2
Joined: Fri Jun 27, 2008 7:34 am

Re: OpenMP with Sun performance analyzer

Postby ejd » Mon Jun 30, 2008 6:49 am

There are several things to note here:
(1) clock() doesn't return seconds. It returns processor time in clock ticks; divide by CLOCKS_PER_SEC to get seconds.
(2) Sun Studio Performance Analyzer doesn't currently support running in a virtual machine (this is being looked at)
(3) Serial and parallel run in about the same amount of time (on the small system I tried). Part of this could be due to memory allocation. Since all the memory is allocated by the serial process, when running in parallel, depending on the hardware, it might take longer to access the memory from the other processor.
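
To illustrate point (1), here is a minimal sketch that separates wall-clock time from CPU time. The helper names wall_seconds and cpu_seconds are made up for illustration. The wall-clock helper uses omp_get_wtime() when compiled with OpenMP and falls back to POSIX gettimeofday() otherwise.

```c
#include <stddef.h>
#include <time.h>
#include <sys/time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds, measured from an arbitrary fixed point.
   Only differences between two calls are meaningful. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
#endif
}

/* CPU time in seconds used by the process so far. clock() counts ticks,
   so the result must be divided by CLOCKS_PER_SEC. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Taking both readings before and after the call to mxv and subtracting gives two numbers. In a parallel run that scales well, the wall-clock difference should drop while the CPU-time difference stays about the same or grows, because clock() typically reports CPU time for the whole process rather than for one thread.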

Code: Select all
From the time man page (format of time info):
  %Uuser %Ssystem %Eelapsed %PCPU ...
where:
  %U     Total number of CPU-seconds that the process spent in user mode.
  %S     Total number of CPU-seconds that the process spent in kernel.
  %E     Elapsed real time (in [hours:]minutes:seconds).
  %P     Percentage of the CPU that this job got, computed as (%U + %S) / %E.

% cc a.c
% time a.out
Elapsed time for Multi is 400000secs
0.760u 0.332s 0:01.09 100.0%    0+0k 0+0io 0pf+0w

% cc -xO3 a.c
% time a.out
Elapsed time for Multi is 390000secs
0.433u 0.371s 0:00.81 98.7%     0+0k 0+0io 0pf+0w
% time a.out
Elapsed time for Multi is 390000secs
0.460u 0.357s 0:00.82 98.7%     0+0k 0+0io 0pf+0w

% cc -xopenmp -xO3 a.c
% time a.out
Elapsed time for Multi is 400000secs
0.464u 0.362s 0:00.65 126.1%    0+0k 0+0io 0pf+0w
ejd
 
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

Re: OpenMP with Sun performance analyzer

Postby sagovinson » Mon Jun 30, 2008 9:19 am

Thanks for the helpful notes.

clock() actually measures the process's CPU time. I found a way, supported by GNU C, to measure the elapsed time between two points in the code. I put it around my parallel region and also changed my original code a little. I also tried the time command instead of the Sun Performance Analyzer. However, as you can see from the results below, the parallelized version is even slower than the serial one. Do you think this is still a memory access problem? What would be a solution?
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>

#define THREADS 2

void mxv(int m, int n, double * a,double * b,double * c)
{
   int i,j;

#pragma omp parallel for default(none) \
   shared(m,n,a,b,c) private(i,j)
   for (i=0;i<m;i++)
   {
      a[i]=0.0;
      for (j=0;j<n;j++)
          a[i]+=b[i*n+j]*c[j];
   }
}

int main()
{
   double *a,*b,*c;
   int i,j,m,n;
   time_t start,end,*now;
   
   m=5000;
   n=10000;
   
   if ((a=(double *)malloc(m*sizeof(double)))==NULL)
      perror("memory allocation for a");
   if ((b=(double *)malloc(m*n*sizeof(double)))==NULL)
      perror("memory allocation for b");
   if ((c=(double *)malloc(n*sizeof(double)))==NULL)
      perror("memory allocation for c");
   
   #ifdef _OPENMP
   omp_set_num_threads(THREADS);
   omp_set_dynamic;
   #endif   
      
   start = time(NULL);
#pragma omp parallel default(none) \
   shared(m,n,a,b,c) private(i,j)
{
#pragma omp for nowait
   for (j=0;j<n;j++)
      c[j]=2.0;
#pragma omp for nowait
   for (i=0;i<m;i++)
   {    a[i]=-1957.0;
       for (j=0;j<n;j++)
          b[i*n+j]=i;
   }
}         

   mxv(m,n,a,b,c);
   
   end=time(NULL);
   
   printf("Elapsed time for parallel is %.2fsecs\n",difftime(end,start));
   
   free(a);free(b);free(c);
   return 0;
}      


Checking the compiler's OpenMP parallelization report (-xloopinfo):
Code: Select all
$ cc -xopenmp -xO3 -xloopinfo matr_multi.c
"matr_multi.c", line 14: PARALLELIZED, user pragma used
"matr_multi.c", line 17: not parallelized, loop inside OpenMP region
"matr_multi.c", line 48: PARALLELIZED, user pragma used
"matr_multi.c", line 51: PARALLELIZED, user pragma used
"matr_multi.c", line 53: not parallelized, loop inside OpenMP region


Running results:
Code: Select all
$ cc -g -xO3 matr_multi.c
$ time ./a.out
Elapsed time for parallel is 11.00secs

real    0m11.064s
user    0m0.164s
sys     0m4.504s
$ time ./a.out
Elapsed time for parallel is 8.00secs

real    0m8.554s
user    0m0.268s
sys     0m3.068s

$ cc -xopenmp -g -xO3 matr_multi.c
$ time ./a.out
Elapsed time for parallel is 13.00secs

real    0m13.609s
user    0m0.148s
sys     0m3.864s
$ time ./a.out
Elapsed time for parallel is 13.00secs

real    0m14.371s
user    0m0.172s
sys     0m3.872s


From man time on Debian:
Code: Select all
NAME
       time - overview of time

DESCRIPTION
   Real time and process time
       Real  time  is  defined  as time measured from some fixed point, either
       from a standard point in the past (see the description of the Epoch and
       calendar  time below), or from some point (e.g., the start) in the life
       of a process (elapsed time).

       Process time is defined as the amount of CPU time used  by  a  process.
       This  is  sometimes  divided into user and system components.  User CPU
       time is the time spent executing code in user mode.  System CPU time is
       the  time spent by the kernel executing in system mode on behalf of the
       process (e.g., executing system calls).  The  time(1)  command  can  be
       used  to determine the amount of CPU time consumed during the execution
       of a program.  A program can determine the amount of CPU  time  it  has
       consumed using times(2), getrusage(2), or clock(3).
sagovinson
 
Posts: 2
Joined: Fri Jun 27, 2008 7:34 am

Re: OpenMP with Sun performance analyzer

Postby ejd » Sun Jul 06, 2008 9:58 pm

I don't have your setup, but I did try your program on what I had (a Red Hat Linux box with 2 Pentium 4 processors running at 3.6GHz and with 2KB cache). I compiled using Sun Studio 12 and I got the following:

Code: Select all
% cc -xO3 a.c -g
% time a.out
Elapsed time for parallel is 1.00secs
0.448u 0.373s 0:00.82 98.7%     0+0k 0+0io 0pf+0w

% cc -xopenmp a.c -g
% time a.out
Elapsed time for parallel is 0.00secs
0.505u 0.623s 0:00.62 180.6%    0+0k 0+0io 0pf+0w

From this, I am seeing a speedup. It is not a big speedup, but then again the program has a pretty short run time. I could increase the array size, but then I would most likely start seeing more effects from memory. I did try the Performance Analyzer, and it shows the most time is spent on line 54 (b[i*n+j]=i), followed by line 18 (a[i]+=b[i*n+j]*c[j]). The processor I used is faster, but less than twice as fast. At this point I wonder how much overhead the virtualization is costing you.
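
For a sense of how much memory is involved, the following sketch adds up the three arrays the program allocates. The helper name footprint_bytes is made up for illustration, and the figure assumes an 8-byte double.

```c
/* Bytes needed for a[m], b[m*n] and c[n], all arrays of double.
   unsigned long long avoids overflow for the sizes used here. */
unsigned long long footprint_bytes(unsigned long long m, unsigned long long n)
{
    return (m + m * n + n) * (unsigned long long)sizeof(double);
}
```

With m=5000 and n=10000 this comes to 400,120,000 bytes, about 400 MB, almost all of it in b. How that compares with the memory given to the virtual machine is not stated in this thread, so this is only a number to check against, not a diagnosis.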
ejd
 
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

