Parallelization of matrix inversion Cholesky-Crout algorithm

Postby szmitek » Sun Dec 09, 2012 6:55 am

I am trying to parallelize the two following functions to reduce the execution time of a matrix inversion algorithm based on Cholesky-Crout decomposition. With OpenMP the functions run longer than they do sequentially, on machines with 2, 3 and 16 cores, and increasing the number in "num_threads" increases the execution time further. How can I parallelize these functions correctly (mainly for a processor with two cores)? How can I reduce the execution time? Could you help me?

Code:
Matrix MultiplayMatrixByItsTransposition(Matrix &org)
{
    unsigned int i, j, k;
    Matrix newMatrix = CreateMatrix(org.size);
    //Main matrix rows
    #pragma omp parallel for private(i, j, k) num_threads(2)
    for(i = 0; i < newMatrix.size; ++i)
        //Main matrix columns
        for(j = 0; j < newMatrix.size; ++j)
            //Left matrix columns and top matrix rows.
            for(k = 0; k < newMatrix.size; ++k)
                newMatrix.values[i][j] += org.values[k][i] * org.values[k][j];
    return newMatrix;
}
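One likely reason this multiply gains nothing from the parallel directive is the access pattern: the innermost loop reads org.values[k][i] and org.values[k][j] with k varying, so every iteration jumps to a different row. A sketch of the same A^T * A product with the loop order i,k,j, which makes the inner traversal contiguous, is below. The types SimpleMatrix, MakeMatrix and MultiplyTransposeBySelf are stand-in names introduced here for illustration; the thread's own CreateMatrix is not shown.

```cpp
#include <cassert>
#include <vector>

// Minimal stand-ins for the poster's types (hypothetical; the real
// CreateMatrix is not shown in the thread).
struct SimpleMatrix {
    std::vector<std::vector<double>> values;
    unsigned int size;
};

SimpleMatrix MakeMatrix(unsigned int n) {
    return SimpleMatrix{
        std::vector<std::vector<double>>(n, std::vector<double>(n, 0.0)), n};
}

// A^T * A with loop order i,k,j: the innermost loop now walks the rows
// org.values[k] and out.values[i] contiguously, which is usually far
// friendlier to the cache than the original i,j,k order.
SimpleMatrix MultiplyTransposeBySelf(const SimpleMatrix &org) {
    SimpleMatrix out = MakeMatrix(org.size);
    unsigned int i, j, k;
    #pragma omp parallel for private(i, j, k)
    for (i = 0; i < out.size; ++i)
        for (k = 0; k < out.size; ++k) {
            const double a = org.values[k][i];  // loaded once per (i, k) pair
            for (j = 0; j < out.size; ++j)
                out.values[i][j] += a * org.values[k][j];
        }
    return out;
}
```

Each thread writes only to its own rows of out, so no synchronization is needed; whether this beats the serial version still depends on matrix size and measurement method (see the clock() discussion later in the thread).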

Code:
Matrix InveseDownTriangleMatrix(Matrix &org)
{
    Matrix m = CreateDiagonalIMatrix(org.size);
    for(unsigned int i = 0; i < m.size; ++i)
    {
        double temp = org.values[i][i];
        for(unsigned int j = 0; j <= i; ++j)
        {
            org.values[i][j] /= temp;
            m.values[i][j] /= temp;
        }
        unsigned int k, j;
        #pragma omp parallel for private(j, temp, k) num_threads(2)
        for(j = i+1; j < m.size; ++j)
        {
            temp = org.values[j][i];
            for(k = 0; k <= i; ++k)
            {
                if(i == j)
                    break;
                org.values[j][k] -= org.values[i][k] * temp;
                m.values[j][k] -= m.values[i][k] * temp;
            }
        }
    }
    return m;
}


Matrix is a structure:
Code:
struct Matrix{
    double **values;
    unsigned int size;
};
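CreateMatrix is referenced but never shown in the thread. A hypothetical reconstruction of a zero-initialised allocator for this struct is below, purely to make the examples self-contained; the poster's actual version may differ. One relevant detail: allocating each row separately with double** gives no guarantee that rows are contiguous in memory, which can hurt cache behaviour.

```cpp
#include <cassert>

struct Matrix {
    double **values;
    unsigned int size;
};

// Hypothetical reconstruction of CreateMatrix (not shown in the thread):
// one plausible zero-initialised row-per-allocation layout.
Matrix CreateMatrix(unsigned int n) {
    Matrix m;
    m.size = n;
    m.values = new double *[n];
    for (unsigned int i = 0; i < n; ++i)
        m.values[i] = new double[n]();  // () zero-initialises the row
    return m;
}

// Matching cleanup, so the example leaks nothing.
void DestroyMatrix(Matrix &m) {
    for (unsigned int i = 0; i < m.size; ++i)
        delete[] m.values[i];
    delete[] m.values;
    m.values = nullptr;
    m.size = 0;
}
```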


I am using
Code:
omp_set_dynamic(0);

Re: Parallelization of matrix inversion Cholesky-Crout algorithm

Postby ftinetti » Mon Dec 10, 2012 5:26 am

Hi,

I only have questions, right now:
1) How are you measuring time?
2) Actual execution times?
3) For
... with 2, 3 and 16 cores

you change the value in
Code:
... num_threads(2)

right?
4) Can you send the specific execution time per function (i.e. MultiplayMatrixByItsTransposition and InveseDownTriangleMatrix)?
5) Just curious: computer (processor/s, RAM)? OS? Compiler?

Fernando.

Re: Parallelization of matrix inversion Cholesky-Crout algorithm

Postby Oldboy » Tue Dec 11, 2012 12:00 pm

Looking at the resources here at OpenMP I found some matrix multiply code written by Tim Mattson.
Both serial and parallel codes can be compared. An FX-8120 (8 threads) was around 5½ times faster with OpenMP directives.

Code is also available for a linked list with a Fibonacci recursion on every step.
However, the gain with parallel was only around 2½ times.

Re: Parallelization of matrix inversion Cholesky-Crout algorithm

Postby szmitek » Wed Dec 12, 2012 8:02 am

I am measuring time in this way (the function choleskyDecomposition takes less time and I do not use OpenMP for it yet):
Code:
clock_t start, stop, time, wholeTime;
time = 0;
wholeTime = 0;
start = clock();
Matrix l = choleskyDecomposition(org);
stop = clock();
time = (stop-start)/CLOCKS_PER_SEC - wholeTime;
wholeTime += time;
std::cout << "Decomposition to L: " << time << "s." << std::endl;
Matrix invL = InveseDownTriangleMatrix(l);
stop = clock();
time = (stop-start)/CLOCKS_PER_SEC - wholeTime;
wholeTime += time;
std::cout << "Inversion of L: " << time << "s." << std::endl;
Matrix inv = MultiplayMatrixByItsTransposition(invL);
stop = clock();
time = (stop-start)/CLOCKS_PER_SEC - wholeTime;
wholeTime += time;
std::cout << "Multiplication L^-T * L^-1: " << time << "s." << std::endl;


2 GB RAM, 2 cores Intel Core 2 Duo P7350 @ 2.00GHz, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4):
Matrix 1000x1000 sequentially: 1s (MultiplayMatrixByItsTransposition) / 25 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP: 2 s (MultiplayMatrixByItsTransposition) / 28 s (InveseDownTriangleMatrix)
Matrix 2000x2000 sequentially: 30 s (MultiplayMatrixByItsTransposition) / 221 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP: 43 s (MultiplayMatrixByItsTransposition) / 250 s (InveseDownTriangleMatrix)

2 GB RAM, Dual Core AMD Opteron Processor 265 (according to cat /proc/cpuinfo 4 processors), gcc version 4.1.2 20070626 (Red Hat 4.1.2-13):
Matrix 1000x1000 sequentially: 7s (MultiplayMatrixByItsTransposition) / 46 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP: 10 s (MultiplayMatrixByItsTransposition) / 57 s (InveseDownTriangleMatrix)
Matrix 2000x2000 sequentially: 56 s (MultiplayMatrixByItsTransposition) / 393 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP: 77 s (MultiplayMatrixByItsTransposition) / 466 s (InveseDownTriangleMatrix)

20 GB RAM, 16 x QEMU Virtual CPU version 0.12.5 2.4 GHz, gcc version 4.3.2 (Debian 4.3.2-1.1):
Matrix 1000x1000 sequentially: 2s (MultiplayMatrixByItsTransposition) / 62 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP: 3 s (MultiplayMatrixByItsTransposition) / 65 s (InveseDownTriangleMatrix)
Matrix 2000x2000 sequentially: 23 s (MultiplayMatrixByItsTransposition) / 591 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP: 28 s (MultiplayMatrixByItsTransposition) / 632 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP num_threads(5): 4 s (MultiplayMatrixByItsTransposition) / 64 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP num_threads(7): 6 s (MultiplayMatrixByItsTransposition) / 60 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP num_threads(15): 9 s (MultiplayMatrixByItsTransposition) / 57 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP num_threads(5): 56 s (MultiplayMatrixByItsTransposition) / 630 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP num_threads(7): 52 s (MultiplayMatrixByItsTransposition) / 638 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP num_threads(16): 69 s (MultiplayMatrixByItsTransposition) / 635 s (InveseDownTriangleMatrix)

How can I reduce the execution time with OpenMP? Am I using the OpenMP directives incorrectly in my functions? How should they be used?

Re: Parallelization of matrix inversion Cholesky-Crout algorithm

Postby ftinetti » Wed Dec 12, 2012 11:14 am

Hi again,

Just in case: please use omp_get_wtime(), and use it on InveseDownTriangleMatrix, so that we will be able to focus on the most time-consuming function.

Fernando.

Re: Parallelization of matrix inversion Cholesky-Crout algorithm

Postby MarkB » Thu Dec 13, 2012 7:25 am

clock() returns CPU time (which will be accumulated across all the threads), not wall clock time. If your performance scales perfectly with the number of threads, the CPU time will be constant. As Fernando suggests, try using omp_get_wtime() instead.
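This point explains the whole mystery in the measurements above: clock() sums CPU time over all threads, so a perfectly scaling parallel region shows no improvement at all. A minimal sketch of wall-clock timing is below; WallSeconds and TimeBusyLoop are names introduced here for illustration, and the std::chrono fallback is only for builds without -fopenmp.

```cpp
#include <cassert>

#ifdef _OPENMP
#include <omp.h>
// omp_get_wtime() returns elapsed wall-clock time in seconds.
static double WallSeconds() { return omp_get_wtime(); }
#else
// Fallback when not compiled with OpenMP: std::chrono also gives wall time.
#include <chrono>
static double WallSeconds() {
    using clk = std::chrono::steady_clock;
    return std::chrono::duration<double>(clk::now().time_since_epoch()).count();
}
#endif

// Wall-clock timing of a work section, as Fernando and Mark suggest.
// Unlike clock(), the measured interval does not grow with the number
// of threads doing the work.
double TimeBusyLoop() {
    const double start = WallSeconds();
    volatile double sink = 0.0;  // keeps the loop from being optimised away
    for (int i = 0; i < 1000000; ++i)
        sink = sink + i * 0.5;
    return WallSeconds() - start;  // elapsed wall time in seconds
}
```

Wrapping each of the three stages in the poster's driver with WallSeconds() calls, instead of clock(), would show whether the parallel regions actually shorten the elapsed time.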

