I've seen other posts regarding variability in execution times, but it wasn't clear to me what the solution was.

1) I am using gcc 4.1.2 on Red Hat Linux 5.3

2) my code is making use of omp_get_wtime()

3) my code is below

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

#define N 100000000

int
main(int argc, char **argv)
{
    int i, *a;
    long long sum = 0;
    double t;

    a = (int*)malloc(N*sizeof(int));

    t = omp_get_wtime();

    #pragma omp parallel for
    for (i=0; i<N; i++) {
        a[i] = i;
    }

    #pragma omp parallel for reduction(+:sum)
    for (i=0; i<N; i++) {
        sum += a[i];
    }

    printf ("sum = %lld, t = %f\n", sum, omp_get_wtime() - t);

    return 0;
}

The variability in execution times is shown below for a quad-core system:

sum = 4999999950000000, t = 1.297301
sum = 4999999950000000, t = 0.701856
sum = 4999999950000000, t = 1.451137
sum = 4999999950000000, t = 0.697694
sum = 4999999950000000, t = 0.704821
sum = 4999999950000000, t = 1.502765
sum = 4999999950000000, t = 1.269791
sum = 4999999950000000, t = 1.226138

If I remove both pragma directives around the for loops, I get the following, much more stable, execution times:

sum = 4999999950000000, t = 1.734803
sum = 4999999950000000, t = 1.731349
sum = 4999999950000000, t = 1.727734
sum = 4999999950000000, t = 1.728106
sum = 4999999950000000, t = 1.720753
sum = 4999999950000000, t = 1.734948
sum = 4999999950000000, t = 1.734952
sum = 4999999950000000, t = 1.732308
sum = 4999999950000000, t = 1.723079
sum = 4999999950000000, t = 1.731947
sum = 4999999950000000, t = 1.726258

What am I missing?

Also, what is the underlying mechanism that determines how many cores are available and thereby creates the appropriate number of threads to handle the "embarrassingly parallel" for loop?

Thanks

-kook