tasks inside parallel loops

Postby karoaper » Wed Oct 07, 2009 10:53 pm

Hello. I have a question about using task constructs inside work-sharing constructs such as parallel for.
Consider the code below. It's completely useless, but it demonstrates what I'm trying to accomplish.
We want to compute the arithmetic series 1+2+...+MAXVAL, repeating it for each element of the arrays sums1 and sums2. sums1 is much bigger than sums2 and thus requires more time to fill.

I'm using OpenMP to parallelize the loop, breaking the summation 1+2+...+MAXVAL into chunks and using parallel for to distribute the chunks across NUM_THREADS threads. Each thread is given its own slice of the sums1 and sums2 arrays, which it fills incrementally; afterwards the per-thread results are added up to form the total sums. So one level of parallelism is the chunking of the summation loop.

I also want to fill sums1 and sums2 separately/concurrently, and there are two methods I can use: parallel sections, or a task construct on the smaller sums2. Since sums2 is much smaller, the thread that fills it under the parallel sections construct finishes sooner and simply waits at the barrier, which results in poor utilization. That is why I thought of using a task for the smaller sums2 computation instead of explicitly creating two threads with parallel sections. My hope is that the thread that encounters the task can defer it and go fill sums1, while some other thread that has finished its chunk picks the task up. Now, as you can see, I have to have the taskwait directive; if I leave it out, the task-based method computes incorrect values for sums2. However, having it there seems to make the code slower rather than faster. In fact, the parallel sections method is faster than the task method.

Another problem: with the task-based approach I can't set NUM_THREADS above 8; doing so causes a segmentation fault. With parallel sections I can create as big a team as I want. Also, with the task-based approach, increasing the number of threads seems to decrease performance rather than increase it. Strange goings-on.

Does anyone have any information about using tasks together with work-sharing constructs, and about the best way to handle cases like this (where one section completes sooner than the other) inside other parallelization constructs (work-sharing, in this case)?

Thanks!

Code:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define Malloc(type,n) (type *)malloc((n)*sizeof(type))
#define NUM_THREADS 8  //NUMBER OF THREADS IN THE TEAM
#define CHUNK 10  //SIZE OF PARALLEL FOR CHUNKS

#define SECTIONBASED  //TURNS ON PARALLEL SECTIONS BASED METHOD (the default: the #ifdef below only tests TASKBASED)
//#define TASKBASED  //TURNS ON TASK BASED METHOD FOR CONCURRENCY

#define MAXVAL 1000000  //MAX VALUE FOR THE ARITHMETIC SERIES
#define SIZES 10000  //SIZE OF FIRST ARRAY
#define SIZES_SM 10  //SIZE OF SECOND SMALLER ARRAY
int main()
{
   double *sums1 = Malloc(double, NUM_THREADS*SIZES);   //allocate space for the size of array times the number of threads, since each thread will be adding part of the series
   double *sums2 = Malloc(double, NUM_THREADS*SIZES_SM);  //same as before but for the smaller array
   int i,j,k,tid;

   #pragma omp parallel firstprivate(sums2,sums1) private(i,j,k,tid) num_threads(NUM_THREADS)
   {
      tid = omp_get_thread_num();
      printf("%d\n",tid);

      sums1 += tid*SIZES;  //the part of the sums1 array that each thread can add to
      sums2 += tid*SIZES_SM;  //same as above but for the smaller array
      
      for(i=0;i<SIZES;i++)
         sums1[i] = 0;   //initialize sums1

      for(i=0;i<SIZES_SM;i++)
         sums2[i] = 0;   //initialize the sums2
      
      omp_set_nested(1);   //enable nested parallelism so the inner parallel sections (SECTIONBASED) actually gets its own team
      
      #pragma omp for schedule(dynamic,CHUNK) nowait      
      for(j=1;j<=MAXVAL;j++)
      {
#ifdef TASKBASED   
         #pragma omp task shared(j) private(k) firstprivate(sums2)
         {
            for(k=0;k<SIZES_SM;k++)
               sums2[k] += j;
         }
         
         #pragma omp taskwait   //needed for correctness (otherwise tasks from successive iterations race on sums2), but waiting here, before the sums1 loop, also prevents any overlap
         

         for(k=0;k<SIZES;k++)
            sums1[k] += j;
#else
         #pragma omp parallel sections shared(j) private(k) firstprivate(sums1,sums2) num_threads(2)
         {
            #pragma omp section
            {
               for(k=0;k<SIZES;k++)
                  sums1[k] += j;
            }   
            #pragma omp section
            {      
               for(k=0;k<SIZES_SM;k++)
                  sums2[k] += j;
            }
         }
#endif         
      }
   }
   
//add up the results of the threads to make up the total sums, store the results in the part computed by the master thread
   for(i=1;i<NUM_THREADS;i++)
   {
      for(j=0;j<SIZES;j++)
         sums1[j] += sums1[i*SIZES+j];

      for(j=0;j<SIZES_SM;j++)
         sums2[j] += sums2[i*SIZES_SM+j];
   }

   printf("sums bigger\n----------------------------------------");
   for(j=0;j<SIZES;j++)   
      printf("%lf\n",sums1[j]);

   printf("sums smaller\n----------------------------------");
   for(j=0;j<SIZES_SM;j++)   
      printf("%lf\n",sums2[j]);

   free(sums1);   //the originals still point at the full blocks (the threads only advanced their firstprivate copies)
   free(sums2);
   return 0;
}
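
For completeness, here is the variant I would try next: create the task first, do the thread's own sums1 work, and only hit the taskwait afterwards, so the task has a chance to run on another thread in the meantime. This is an untested sketch of the same toy problem (the my1/my2 pointer names are just illustrative, and whether the runtime actually defers the tied task is implementation-dependent):

Code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 8
#define CHUNK 10
#define MAXVAL 1000000
#define SIZES 10000
#define SIZES_SM 10

int main()
{
   /* calloc zero-initializes, so no per-thread init loops are needed */
   double *sums1 = calloc(NUM_THREADS*SIZES, sizeof(double));
   double *sums2 = calloc(NUM_THREADS*SIZES_SM, sizeof(double));
   int i, j, k;

   #pragma omp parallel private(j,k) num_threads(NUM_THREADS)
   {
      int tid = omp_get_thread_num();
      double *my1 = sums1 + tid*SIZES;     /* this thread's slice of sums1 */
      double *my2 = sums2 + tid*SIZES_SM;  /* this thread's slice of sums2 */

      #pragma omp for schedule(dynamic,CHUNK) nowait
      for (j = 1; j <= MAXVAL; j++)
      {
         /* small job goes into a task; firstprivate(j) captures j's value
            at creation time, so there is no shared(j) hazard */
         #pragma omp task firstprivate(j,my2) private(k)
         {
            for (k = 0; k < SIZES_SM; k++)
               my2[k] += j;
         }

         /* big job runs on this thread while the task is (maybe) deferred */
         for (k = 0; k < SIZES; k++)
            my1[k] += j;

         /* wait only now, so the next iteration's task can't race on my2 */
         #pragma omp taskwait
      }
   }

   /* fold the per-thread slices into thread 0's slice */
   for (i = 1; i < NUM_THREADS; i++)
   {
      for (j = 0; j < SIZES; j++)
         sums1[j] += sums1[i*SIZES + j];
      for (j = 0; j < SIZES_SM; j++)
         sums2[j] += sums2[i*SIZES_SM + j];
   }

   printf("sums1[0] = %lf, sums2[0] = %lf\n", sums1[0], sums2[0]);
   free(sums1);
   free(sums2);
   return 0;
}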

Re: tasks inside parallel loops

Postby karoaper » Thu Oct 08, 2009 8:48 am

I guess I've been thinking about my problem wrong. My actual problem, and the illustrative code I posted, are trying to do the wrong thing: nesting concurrency inside a work-sharing construct like a parallel loop is guaranteed to be suboptimal for achieving the highest utilization of CPUs. It's a classic case of thinking of concurrency as utilization of threads, when in reality it is utilization of CPUs. In my previous code, whether I use parallel sections or the task method, I am forcing some thread, which already could be doing useful work, to handle the summation of sums1 or sums2. In other words, I get no increased efficiency, because the CPUs are already busy on part of the problem (the chunks); all I'm adding is unnecessary task/thread creation, i.e. overhead. Looking at the activity monitor, the highest user (and lowest system) utilization of the CPUs happens when I simply compute the sums1 and sums2 arrays serially within each chunk, rather than concurrently.

So, in my previous code, maximal efficiency is achieved with simply:
Code:
      #pragma omp for schedule(dynamic,CHUNK) nowait     
      for(j=1;j<=MAXVAL;j++)
      {
            for(k=0;k<SIZES_SM;k++)
               sums2[k] += j;

            for(k=0;k<SIZES;k++)
               sums1[k] += j;
       
      }
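
As a sanity check for either version: every element of sums1 and sums2 should end up equal to the closed-form value of the series, MAXVAL*(MAXVAL+1)/2 = 500000500000 for MAXVAL = 1000000 (every partial sum is an integer well below 2^53, so the doubles stay exact regardless of summation order). A small check that could be pasted in just before the final return (my addition, not part of the original program):

Code:
   /* closed-form value of 1+2+...+MAXVAL */
   double expected = (double)MAXVAL * (MAXVAL + 1.0) / 2.0;

   for (j = 0; j < SIZES; j++)
      if (sums1[j] != expected)
         printf("sums1[%d] is wrong: %lf != %lf\n", j, sums1[j], expected);

   for (j = 0; j < SIZES_SM; j++)
      if (sums2[j] != expected)
         printf("sums2[%d] is wrong: %lf != %lf\n", j, sums2[j], expected);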


