Adding matrix elements via different loops

General OpenMP discussion

Adding matrix elements via different loops

Postby B4nnar » Mon Oct 21, 2013 9:56 am

Hello, I'm new to the OpenMP forum as well to omp itself. I have been studing subject yesterday to get overall knowledge about the library, it's goals, tools etc.
I got 3 exercises to do, all related to OpenMP, yet instead of spamming forum index with 3 threads at once, i'll post them one by one - after the previous has been done the next will be posted. Thanks in advance for any help and for your time.

Description:
Write and run two C/OpenMP programs for adding elements of a square matrix a.
Implement two versions of loops as shown on this page.
The value of n should be 100 *(number of threads).
Time both codes. Which of the two versions runs faster. Explain why?
Loops:
Code: Select all
1)
for (int j=0; j<n; j++)
    for (int i=0; i<n; i++)
        sum += a[i][j];

2)
for (int i=0; i<n; i++)
    for (int j=0; j<n; j++)
        sum += a[i][j];


I'm using Win x86 64bit and omp library with codeblocks with mingw gcc compilator.
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int** getarray(int nrows)
{
    int ncolumns = nrows, i;
    int **array;
   array = malloc(nrows * sizeof(int *));

   if(array == NULL)
      return;

   for(i = 0; i < nrows; i++) {
        array[i] = malloc(ncolumns * sizeof(int));
      if(array[i] == NULL)
         return;
    }
    return array;
}

int main (int argc, char *argv[]) {

int nthreads, n,sum=0;
int i,j,**a;
srand (time(NULL));

time_t start, end;
time(&start);

#pragma omp parallel private(i,j) shared(a, sum, nthreads)
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        n = 100*nthreads;
        a = getarray(n);
    }

    /* Initializing matrix */
    #pragma omp for
    for(i=0; i<n; i++)
        for(j=0; j<n; j++)
            a[i][j] = rand()%100;

    printf("Thread %d starting...\n",omp_get_thread_num());

    #pragma omp single
    time(&start);

    #pragma omp for reduction(+:sum)
    for (j=0; j<n; j++)
        for (i=0; i<n; i++)
            sum += a[i][j];

}   /* end of parallel region */

time(&end);
printf("\nMethod \"a\". Elapsed time: %f; sum: %d threads %d\n", difftime(end, start), sum, nthreads);

    return 0;
}
"getarray" is standard function for memory allocation, just skip that part. Right now, I don't really know which values should be inserted into matrix but it's no matter.
This code features just one option, ofcourse there is another version of program with switched loop - it's pointless to paste it too, you get the point.

Does my approach is correct one? Is the timing done properly? Should initialization of variables and time(&start) be done within "single" clause?
Last, but not least, which and why one of those is faster?
I have been elaborating the subject, at first I thought it's similar to issue with paralleling outer and inner loops. With the first we usually can avoid the overhead, but meaby I miss the point. It seems that the 1st one is faster though.
Last edited by B4nnar on Wed Oct 23, 2013 11:35 am, edited 1 time in total.
B4nnar
 
Posts: 4
Joined: Mon Oct 21, 2013 9:09 am

Re: Adding matrix elements via different loops

Postby ftinetti » Mon Oct 21, 2013 10:29 am

Hi,

Just a few comments:
  • Use omp_get_wtime() for timing, as well as declare start and end of the appropriate type
  • I would declare "start" and "end" as shared, since I never remember what the default is. Also, I would time the specific section of code, i.e.

    Code: Select all
        #pragma omp single
        assign start

        #pragma omp for reduction(+:sum)
        for (j=0; j<n; j++)
            for (i=0; i<n; i++)
                sum += a[i][j];

        #pragma omp single
        assign end
  • About
    Last, but not least, which and why one of those is faster?
    I have been elaborating the subject, at first I thought it's similar to issue with paralleling outer and inner loops. With the first we usually can avoid the overhead, but meaby I miss the point. It seems that the 1st one is faster though.

    I suggest you look at the difference using only one thread (i.e. sequential processing). I also suggest you use n = 1000 *(number of threads) or even 10000 *(number of threads) so that timing would be a little bit less noisy (except if you are going to use, say, 1000 threads, of course).

HTH,

Fernando.
ftinetti
 
Posts: 575
Joined: Wed Feb 10, 2010 2:44 pm

Re: Adding matrix elements via different loops

Postby B4nnar » Mon Oct 21, 2013 11:02 am

Did I understand you correctly about the comments:
Changed:
Code: Select all
#pragma omp parallel private(i,j) shared(a, sum, nthreads)

->>
#pragma omp parallel private(i,j) shared(a, sum, nthreads, start, end)


Also, now start & end are doubles instead of time_t. Is the usage correct:
Code: Select all
    #pragma omp single
    start = omp_get_wtime();

    #pragma omp for reduction(+:sum)
    for (j=0; j<n; j++)
        for (i=0; i<n; i++)
            sum += a[i][j];

    #pragma omp single
    end = omp_get_wtime();

/* print outside of parallel block */
printf("\nMethod \"b\". Elapsed time: %.8f; sum: %d threads %d\n", end-start, sum, nthreads);


Results (on my i5 3570k with msi g45)
Code: Select all
/* loop type 1 */
    for (j=0; j<n; j++)
        for (i=0; i<n; i++)
            sum += a[i][j];

/* n 1000*ntheads */
~0,0085seconds
/* n 10000*nthreads */
~1,125seconds


/* loop type 2 */
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            sum += a[i][j];

/* n 1000*ntheads */
~0,002seconds
/* n 10000*nthreads */
~0,223seconds
Seems I was right about the outcome, yet I havent discovered any real reason behind this. Is it linked to the outer and inner loop issue I had mentioned in previous post? Was looking in some efficiency-related literature, no results though.

Thanks for time as always.
B4nnar
 
Posts: 4
Joined: Mon Oct 21, 2013 9:09 am

Re: Adding matrix elements via different loops

Postby ftinetti » Mon Oct 21, 2013 1:41 pm

Hi,

Every change you have made is right, I think.

Seems I was right about the outcome, yet I havent discovered any real reason behind this. Is it linked to the outer and inner loop issue I had mentioned in previous post? Was looking in some efficiency-related literature, no results though.


Yes about the outcome. Results are with only one thread, right?

In fact, the difference is related to cache usage: the matrix is stored as an array of arrays, i.e. there are n arrays each one a row of n consecutive elements. Thus, accessing the matrix by elements in the same row, i.e.
Code: Select all
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            sum += a[i][j];

takes advantage of consecutive elements in the same cache line as well as usual pre-fetching techniques. Otherwise, you lose both: multiple elements in the same cache line as well as prefetching advantages.

HTH,

Fernando.
ftinetti
 
Posts: 575
Joined: Wed Feb 10, 2010 2:44 pm

Re: Adding matrix elements via different loops

Postby B4nnar » Mon Oct 21, 2013 1:56 pm

Yes, results were made based on single threading. Via 4 threads aviable on my cpu those values are drastically lower.

Thank you ftinetti for clarification, time and help. Problem is solved.
B4nnar
 
Posts: 4
Joined: Mon Oct 21, 2013 9:09 am


Return to Using OpenMP

Who is online

Users browsing this forum: Yahoo [Bot] and 6 guests