Poor scaling on Core2Duo, better on OpteronX2: a code problem?

General OpenMP discussion

Poor scaling on Core2Duo, better on OpteronX2: a code problem?

Postby drososkourounis » Sat Mar 22, 2008 1:37 pm

Dear OpenMP experts,
I am trying to parallelize, using OpenMP directives, a kernel that is very important in Domain Decomposition methods. I am attaching the code and the benchmark results on an Intel Core 2 Duo T5500 and a Dual-Core AMD Opteron 2220 SE (2.8 GHz, 1 MB cache per core). The code compiles simply by typing one of:

$ make clean && make CXX=g++-4.2.2 main
$ make clean && make CXX=CC main
$ make clean && make CXX=icpc main

To run it:

$ bin/main 100000 10 100 8

The third argument is a dummy (obsolete). The outcome of the benchmark is that the same code scales much better on the Opteron architecture than on the Intel Core 2 Duo, even though the Intel C++ compiler is supposed to work better on Intel machines. I am not satisfied with the speed-up on either of them, though. Is it a problem of the underlying architectures or a problem of the code itself? Is there some other way of rewriting the code that would achieve better speed-up on the Intel Core 2 Duo platform? Two parallel sections (sketched below) did not do any better on the Intel architecture.
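For the record, the "two parallel sections" variant I tried looked roughly like this (a reconstructed sketch, not the attached code; it assumes an even nmatrices):

Code:
// Each section handles half of the matrices, so at most two threads
// do useful work regardless of the team size.
#pragma omp parallel sections
{
#pragma omp section
    for (int i = 0; i < nmatrices/2; i++)
        pMatrices[i].multiply(i, omp_get_thread_num());

#pragma omp section
    for (int i = nmatrices/2; i < nmatrices; i++)
        pMatrices[i].multiply(i, omp_get_thread_num());
}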

I am aware of first-touch issues on NUMA architectures, and that is why I have coded the memory allocation of the matrices in parallel, rather than allocating sequentially and then trying to resolve first-touch problems afterwards. The assumption here is that since each thread allocates and first writes its own matrix, each core will own its own memory.
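For reference, the first-touch pattern in its minimal form looks roughly like this (a sketch under the assumption of a Linux-style first-touch page placement policy; the names are illustrative, not from the attached code):

Code:
#include <omp.h>

// First-touch sketch: a page is committed to the memory of the node whose
// thread first writes it, so initialize the data with the same thread
// layout that will later access it.
void first_touch_init(double* data, int n)
{
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        data[i] = 0.0;   // first write places this page near this thread
}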

I recompiled the Linux kernel trying different options, such as the Intel Core2/Xeon processor family, 2 cores, and generally everything related to Intel Core 2 SMP, without the Big Kernel Lock and without preemption. Both machines run the same kernel, yet the results indicate that there is a problem on the Intel platform. I also tried FreeBSD 7.0 and recompiled its kernel with the ULE scheduler; the code on the Intel Core 2 Duo scales more or less as it does under Linux, so changing the operating system did not pay off. I am therefore suspecting something wrong in the code itself, since the speed-ups on the Opteron server were not satisfactory either.

The benchmark results, Opteron X2 2.8 GHz:
Code:
Allocate: works well
# threads   time   speed-up
# ===========================
# g++ 4.2.2
1   4226    1.0
2   2178    1.94
3   1648    2.564
4   1121    3.7698
# icpc
1   3371    1.0
2   1740    1.937
3   1318    2.557
4   907     3.716
# Sun CC
1   3938    1.0
2   2006    1.963
3   1719    2.29
4   1111    3.544

Multiply: not that good
# threads   time   speed-up
# ===========================
# g++ 4.2.2
1   451     1.0
2   268     1.682
3   210     2.147
4   169     2.66
# icpc
1   455     1.0
2   268     1.682
3   208     2.168
4   169     2.668
# Sun CC
1   423     1.0
2   233     1.815
3   196     2.158
4   166     2.548


Intel Core 2 Duo 1.66 GHz:
Code:
Allocate: works well here too
# threads   time   speed-up
# ===========================
# g++ 4.2.2
1   5500    1.0
2   2903    1.894
# icpc
1   4383    1.0
2   2304    1.902
# Sun CC
1   5065    1.0
2   2611    1.939

Multiply: awfully bad scaling
# threads   time   speed-up
# ===========================
# g++ 4.2.2
1   420     1.0
2   303     1.386
# icpc
1   320     1.0
2   280     1.142
# Sun CC
1   329     1.0
2   274     1.2


Thanks in advance,
Drosos.

Code:
#include <iostream>
#include <cstdlib>   // strtol, exit
#include <cmath>
#include <omp.h>
#include "timing.h"

using std::cout;

const double pi = 4.0*atan(1.0);

class SystemMatrix
{
public:
   int nrows;
   int ncols;
   int nonzeros;
   int* pRows;
   int* pCols;
   double* pData;
   double* x;
   double* y;

   int numberOfDiagonalBlocks;
   int numberOfRowsPerBlock;

public:
   SystemMatrix()
   {
      nrows = 0;
      ncols = 0;
      nonzeros = 0;
      pRows = 0;
      pCols = 0;
      pData = 0;
      x = 0;   // null x and y too, so the destructor is
      y = 0;   // safe even if make() is never called
   }

   void setStructure(int diagonalblocks, int rowsperblock)
   {
      numberOfDiagonalBlocks = diagonalblocks;
      numberOfRowsPerBlock = rowsperblock;
   }


   ~SystemMatrix()
   {
      delete[] pRows;
      delete[] pCols;
      delete[] pData;
      delete[] x;
      delete[] y;
   }

   void make()
   {
      int diagonal_block;
      int block_row;
      int i, j;

      nrows = numberOfDiagonalBlocks * numberOfRowsPerBlock;
      ncols = nrows;
      nonzeros = numberOfRowsPerBlock*numberOfRowsPerBlock*numberOfDiagonalBlocks;

      pRows = new int[nrows + 1];
      pCols = new int[nonzeros];
      pData = new double[nonzeros];
      x = new double[nrows];
      y = new double[nrows];

      initX();

      // Build one dense block per diagonal block, stored in CSR format.
      i = 0;
      pRows[0] = 0;   // the first row starts at offset 0
      for (diagonal_block = 0; diagonal_block < numberOfDiagonalBlocks; diagonal_block++)
      {
         for (block_row = 0; block_row < numberOfRowsPerBlock; block_row++)
         {
            i++;
            pRows[i] = pRows[i-1] + numberOfRowsPerBlock;

            int from = pRows[i-1];
            int to = pRows[i];

            for (j = from; j < to; j++)
            {
               // columns of this diagonal block start at the block's own offset
               pCols[j] = diagonal_block*numberOfRowsPerBlock + j - from;
               pData[j] = cos((j-from)*pi/(to-from-1));
            }
         }
      }
   }

   void initX()
   {
      int i;
      for (i = 0; i < nrows; i++)
      {
          x[i] = 1.0;
      }
   }

   void multiply(int matrix_id, int thread_id) const
   {
       int i;
       int j;
       int index;
       double sum;

       // Serialize the diagnostic output; note that this critical section
       // and the I/O itself add overhead inside the timed region.
#pragma omp critical
{
       cout << "thread_id = " << thread_id
            << " performs multiplication of matrix with id = " << matrix_id << "\n";
}

       // Plain CSR sparse matrix-vector product: y = A*x.
       for (i = 0; i < nrows; i++)
       {
           sum = 0.0;
           for (index = pRows[i]; index < pRows[i+1]; index++)
           {
               j = pCols[index];
               sum += pData[index]*x[j];
           }
           y[i] = sum;
       }
   }

   void memory()
   {
       double MB = 1024*1024.0;
       cout << "rows = " << nrows << "\n";
       cout << "cols = " << ncols << "\n";
       cout << "nonzeros = " << nonzeros << "\n";
       cout << "memory = " << (4*(nrows+1) + 4*nonzeros + 8*nonzeros)/MB << " MB\n";
   }

};

int main(int argc, char* argv[])
{
    if (argc != 5)
    {
        cout << "usage: " << argv[0] << " diagonalBlocks rowsPerBlock nloops nmatrices\n";
        cout << "examp: " << argv[0] << " 761800 8 100 4\n";
        exit(1);
    }
   
    int diagonal_blocks = (int) strtol(argv[1], NULL, 0);
    int rowsperblock = (int) strtol(argv[2], NULL, 0);
    int nloops = (int) strtol(argv[3], NULL, 0);
    int nmatrices = (int) strtol(argv[4], NULL, 0);
   
    int i;
    int thread_id;
    int nthreads;
    SystemMatrix* pMatrices = new SystemMatrix[nmatrices];
    int* tasks = new int[nmatrices];   // note: currently unused

#pragma omp parallel
{
    // only one thread needs to record the team size
    #pragma omp single
    nthreads = omp_get_num_threads();
}
    cout << "number of threads = " << nthreads << "\n";

    timing secs;
    secs.tick();

#pragma omp parallel for private(i, thread_id) shared(diagonal_blocks, rowsperblock)
    for (i = 0; i < nmatrices; i++)
    {
        thread_id = omp_get_thread_num();
        cout << "i = " << i << " thread = " << thread_id << "\n";
        pMatrices[i].setStructure(diagonal_blocks, rowsperblock);
        pMatrices[i].make();   // each thread allocates and first-touches its own matrices
    }

    secs.tack();
    secs.timeNeeded("initialization took");
    secs.tick();

#pragma omp parallel for private(i, thread_id)
    for (i = 0; i < nmatrices; i++)
    {
        thread_id = omp_get_thread_num();
        pMatrices[i].multiply(i, thread_id);
    }

    secs.tack();
    secs.timeNeeded("multiplication took");

    delete[] pMatrices;
    delete[] tasks;
    return 0;
}
Attachments
ddsimulation.tar.gz
The code.
(6.25 KiB) Downloaded 329 times
drososkourounis
 
Posts: 3
Joined: Sat Mar 22, 2008 1:24 pm

Re: Poor scaling on Core2Duo, better on OpteronX2: a code problem?

Postby ejd » Mon Mar 24, 2008 11:48 am

I am not familiar with "timing.h", and a quick search of the web didn't give me any clue as to what package it comes from. So the first question I have is: are you sure that the information it returns is accurate? Is there a reason you decided to use it rather than the omp_get_wtime routine?
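For reference, timing a region with omp_get_wtime looks roughly like this (a minimal sketch; the loop body is a placeholder):

Code:
#include <cstdio>
#include <omp.h>

int main()
{
    double t0 = omp_get_wtime();   // portable wall-clock time in seconds

#pragma omp parallel for
    for (int i = 0; i < 1000000; i++)
    {
        // ... work being timed ...
    }

    double t1 = omp_get_wtime();
    std::printf("elapsed: %f seconds\n", t1 - t0);
    return 0;
}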
ejd
 
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

Re: Poor scaling on Core2Duo, better on OpteronX2: a code problem?

Postby drososkourounis » Mon Mar 24, 2008 9:52 pm

Dear ejd,
thank you for the reply. The code is attached; if you download and run it, you can see for yourself. The file timing.h is included. Yesterday I found out about omp_get_wtime while browsing other threads in the forum. I used it instead, and the results were no different. If you browse the code you will see that you do not really need omp_get_wtime here, because time is measured after the OpenMP-parallelized loops have ended; once all the parallel work has finished and the threads have joined back into one, there is nothing to worry about. That is why omp_get_wtime didn't change anything. I have also implemented a pthreads version of the code above, and the results were the same.

The funny thing is that the same code with the same compiler works much better on the Opteron. On the other hand, a full matrix-matrix multiplication routine, the simple i,j,k loop (sketched below), is easily parallelized on my Core 2 Duo. That is why I am worried about the code and cannot call it a hardware problem: since other codes can be made parallel easily and cleanly, something must be wrong with this one.
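The matrix-matrix routine I mean is essentially this (a minimal sketch; sizes and names are illustrative):

Code:
#include <vector>

// Dense C = A*B with the plain i,j,k loops on n-by-n row-major matrices,
// parallelized over the rows of C.
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}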

If you want, download the attached code, run it on your architecture, and report back the timings you get.

Thanks again.
drososkourounis
 
Posts: 3
Joined: Sat Mar 22, 2008 1:24 pm

