## Parallelization of matrix inversion Cholesky-Crout algorithm

General OpenMP discussion

### Parallelization of matrix inversion Cholesky-Crout algorithm

I am trying to parallelize the two functions below to reduce the execution time of a matrix inversion algorithm based on the Cholesky-Crout decomposition. With OpenMP, the functions run longer than the sequential versions on machines with 2, 3 and 16 cores, and increasing the value in "num_threads" increases the execution time further. How do I parallelize these functions correctly (mainly for a two-core processor), and how can the execution time be reduced? Could you help me?

```cpp
Matrix MultiplayMatrixByItsTransposition(Matrix &org)
{
    unsigned int i, j, k;
    Matrix newMatrix = CreateMatrix(org.size);
    // Main matrix rows
    #pragma omp parallel for private(i, j, k) num_threads(2)
    for(i = 0; i < newMatrix.size; ++i)
        // Main matrix columns
        for(j = 0; j < newMatrix.size; ++j)
            // Left matrix columns and top matrix rows
            for(k = 0; k < newMatrix.size; ++k)
                newMatrix.values[i][j] += org.values[k][i] * org.values[k][j];
    return newMatrix;
}
```
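For what it's worth, this kind of loop nest is usually restructured before parallelizing: accumulate each dot product into a local scalar (so threads never repeatedly write the shared result element inside the k loop), and exploit the symmetry of orgᵀ·org to halve the work. A minimal sketch, assuming a std::vector-based matrix rather than the double** Matrix above (the name transpose_times_self is made up here):

```cpp
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Sketch: out = org^T * org. The result is symmetric, so only j >= i is
// computed and then mirrored. A signed loop variable keeps older
// (OpenMP 2.5) compilers such as gcc 4.1 happy.
Mat transpose_times_self(const Mat &org) {
    const int n = static_cast<int>(org.size());
    Mat out(n, std::vector<double>(n, 0.0));
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
        for (int j = i; j < n; ++j) {
            double sum = 0.0;                // thread-local accumulator
            for (int k = 0; k < n; ++k)
                sum += org[k][i] * org[k][j];
            out[i][j] = out[j][i] = sum;     // each element written exactly once
        }
    return out;
}
```

schedule(dynamic) compensates for the triangular (uneven) amount of work per i; whether it beats the default static schedule here is something worth measuring rather than assuming.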

```cpp
Matrix InveseDownTriangleMatrix(Matrix &org)
{
    Matrix m = CreateDiagonalIMatrix(org.size);
    for(unsigned int i = 0; i < m.size; ++i)
    {
        double temp = org.values[i][i];
        for(unsigned int j = 0; j <= i; ++j)
        {
            org.values[i][j] /= temp;
            m.values[i][j] /= temp;
        }
        unsigned int k, j;
        #pragma omp parallel for private(j, temp, k) num_threads(2)
        for(j = i+1; j < m.size; ++j)
        {
            temp = org.values[j][i];
            for(k = 0; k <= i; ++k)
            {
                if(i == j)
                    break;
                org.values[j][k] -= org.values[i][k] * temp;
                m.values[j][k] -= m.values[i][k] * temp;
            }
        }
    }
    return m;
}
```
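A likely culprit in this function is that the `#pragma omp parallel for` sits inside the outer i loop, so a parallel region is created and torn down m.size times; for a 2000×2000 matrix that is 2000 fork/join cycles, which can easily eat the gain from two cores. (Also, the `if(i == j) break;` can never trigger, since j starts at i+1.) One common restructuring is a single enclosing parallel region with the row updates work-shared at each step. A sketch under assumed types (a std::vector-based matrix instead of the double** struct; the `omp for` on an unsigned loop variable needs OpenMP 3.0, so gcc 4.4 yes, gcc 4.1 no):

```cpp
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Sketch: invert a lower-triangular matrix by Gauss-Jordan elimination
// with ONE parallel region for the whole computation. All threads run
// the outer loop redundantly; `single` scales row i once, `for` splits
// the elimination of rows i+1..n-1 across the team. The implicit
// barriers after `single` and `for` keep the steps in order.
Mat inverse_lower_triangular(Mat org) {   // taken by value: modified locally
    const unsigned n = org.size();
    Mat m(n, std::vector<double>(n, 0.0));
    for (unsigned d = 0; d < n; ++d) m[d][d] = 1.0;  // identity

    #pragma omp parallel
    for (unsigned i = 0; i < n; ++i) {
        #pragma omp single
        {
            const double diag = org[i][i];
            for (unsigned j = 0; j <= i; ++j) {
                org[i][j] /= diag;
                m[i][j] /= diag;
            }
        }
        #pragma omp for
        for (unsigned j = i + 1; j < n; ++j) {
            const double t = org[j][i];
            for (unsigned k = 0; k <= i; ++k) {
                org[j][k] -= org[i][k] * t;
                m[j][k]   -= m[i][k] * t;
            }
        }
    }
    return m;
}
```

The work per j loop shrinks as i grows, so even with this restructuring the later iterations have too little work to amortize the barriers; that is a property of the algorithm, not of the directives.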

Matrix is a structure:
```cpp
struct Matrix
{
    double **values;
    unsigned int size;
};
```

I am using `omp_set_dynamic(0);`
szmitek


### Re: Parallelization of matrix inversion Cholesky-Crout algor

Hi,

I only have questions right now:
1) How are you measuring time?
2) What are the actual execution times?
3) For "... with 2, 3 and 16 cores", you change the value in `num_threads(2)`, right?
4) Can you give the specific execution time per function (i.e. MultiplayMatrixByItsTransposition and InveseDownTriangleMatrix)?
5) Just curious: which computer (processor/s, RAM)? OS? Compiler?

Fernando.
ftinetti


### Re: Parallelization of matrix inversion Cholesky-Crout algor

Looking at the resources here at OpenMP, I found some matrix multiply code written by Tim Mattson.
Both the serial and parallel codes can be compared: an FX-8120 (8 threads) was around 5½ times faster with the OpenMP directives.

Code is also available for a linked list with a Fibonacci recursion at every step.
However, the gain from the parallel version there was only around 2½ times.
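For reference, the linked-list pattern described there is usually handled with OpenMP tasks (3.0+). A made-up minimal sketch, not Tim Mattson's actual code (Node, fib and process_list are names invented here): one thread walks the list and creates a task per node, while the rest of the team executes the tasks.

```cpp
#include <cassert>

// Sketch: traverse a linked list, computing an expensive recursive
// Fibonacci per node. `single` makes one thread create the tasks; the
// implicit barrier at the end of the single region waits for them all.
struct Node { int value; Node* next; };

long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

long process_list(Node* head) {
    long total = 0;
    #pragma omp parallel
    #pragma omp single
    for (Node* p = head; p != nullptr; p = p->next) {
        #pragma omp task firstprivate(p) shared(total)
        {
            const long f = fib(p->value);
            #pragma omp atomic
            total += f;
        }
    }
    return total;
}
```

Task creation and the atomic update are pure overhead, which is consistent with tasking speedups trailing the dense-loop case, as observed above.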
Oldboy


### Re: Parallelization of matrix inversion Cholesky-Crout algor

I am measuring time this way (the choleskyDecomposition function takes less time, and I do not use OpenMP for it yet):
```cpp
clock_t start, stop, time, wholeTime;
time = 0;
wholeTime = 0;
start = clock();
Matrix l = choleskyDecomposition(org);
stop = clock();
time = (stop-start)/CLOCKS_PER_SEC - wholeTime;
wholeTime += time;
std::cout << "Dekompozycja do L: " << time << "s." << std::endl;
Matrix invL = InveseDownTriangleMatrix(l);
stop = clock();
time = (stop-start)/CLOCKS_PER_SEC - wholeTime;
wholeTime += time;
std::cout << "Inwersja L: " << time << "s." << std::endl;
Matrix inv = MultiplayMatrixByItsTransposition(invL);
stop = clock();
time = (stop-start)/CLOCKS_PER_SEC - wholeTime;
wholeTime += time;
std::cout << "Mnożenie L^-T * L^-1: " << time << "s." << std::endl;
```

2 GB RAM, 2 cores Intel Core 2 Duo P7350 @ 2.00GHz, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4):
Matrix 1000x1000 sequentially: 1 s (MultiplayMatrixByItsTransposition) / 25 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP: 2 s (MultiplayMatrixByItsTransposition) / 28 s (InveseDownTriangleMatrix)
Matrix 2000x2000 sequentially: 30 s (MultiplayMatrixByItsTransposition) / 221 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP: 43 s (MultiplayMatrixByItsTransposition) / 250 s (InveseDownTriangleMatrix)

2 GB RAM, Dual Core AMD Opteron Processor 265 (according to cat /proc/cpuinfo 4 processors), gcc version 4.1.2 20070626 (Red Hat 4.1.2-13):
Matrix 1000x1000 sequentially: 7 s (MultiplayMatrixByItsTransposition) / 46 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP: 10 s (MultiplayMatrixByItsTransposition) / 57 s (InveseDownTriangleMatrix)
Matrix 2000x2000 sequentially: 56 s (MultiplayMatrixByItsTransposition) / 393 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP: 77 s (MultiplayMatrixByItsTransposition) / 466 s (InveseDownTriangleMatrix)

20 GB RAM, 16 x QEMU Virtual CPU version 0.12.5 2.4 GHz, gcc version 4.3.2 (Debian 4.3.2-1.1):
Matrix 1000x1000 sequentially: 2 s (MultiplayMatrixByItsTransposition) / 62 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP: 3 s (MultiplayMatrixByItsTransposition) / 65 s (InveseDownTriangleMatrix)
Matrix 2000x2000 sequentially: 23 s (MultiplayMatrixByItsTransposition) / 591 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP: 28 s (MultiplayMatrixByItsTransposition) / 632 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP num_threads(5): 4 s (MultiplayMatrixByItsTransposition) / 64 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP num_threads(7): 6 s (MultiplayMatrixByItsTransposition) / 60 s (InveseDownTriangleMatrix)
Matrix 1000x1000 with openMP num_threads(15): 9 s (MultiplayMatrixByItsTransposition) / 57 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP num_threads(5): 56 s (MultiplayMatrixByItsTransposition) / 630 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP num_threads(7): 52 s (MultiplayMatrixByItsTransposition) / 638 s (InveseDownTriangleMatrix)
Matrix 2000x2000 with openMP num_threads(16): 69 s (MultiplayMatrixByItsTransposition) / 635 s (InveseDownTriangleMatrix)

How can I reduce the execution time with OpenMP? Am I using the OpenMP directives in my functions incorrectly? How should they be used?
szmitek


### Re: Parallelization of matrix inversion Cholesky-Crout algor

Hi again,

Just in case: please use omp_get_wtime(), and use it on InveseDownTriangleMatrix, so that we can focus on the most time-consuming function.

Fernando.
ftinetti


### Re: Parallelization of matrix inversion Cholesky-Crout algor

clock() returns CPU time (which will be accumulated across all the threads), not wall clock time. If your performance scales perfectly with the number of threads, the CPU time will be constant. As Fernando suggests, try using omp_get_wtime() instead.
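The difference is easy to demonstrate even without OpenMP: during a sleep, clock() barely advances while a wall clock keeps ticking. (With OpenMP enabled, omp_get_wtime() is the wall-clock call to use; std::chrono stands in for it in this sketch. As a side note, `(stop-start)/CLOCKS_PER_SEC` in the posted timing code is integer division, so sub-second amounts truncate to zero.)

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <utility>

// Time a 200 ms sleep two ways: CPU time via clock(), wall time via a
// steady clock. The sleeping thread consumes almost no CPU, so the two
// results diverge. Under multithreaded load they diverge the other way:
// CPU time can be up to num_threads times the wall time.
std::pair<double, double> cpu_and_wall_seconds_of_sleep() {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();

    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    const double cpu = double(std::clock() - c0) / CLOCKS_PER_SEC;
    const double wall = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - w0).count();
    return {cpu, wall};
}
```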
MarkB
