I have encountered some strange behaviour in a program I am parallelizing. To benchmark the parallelized function I duplicated it four times (one copy per core of my PC). At the point where the function is called, I invoke the four copies in consecutive order, changing omp_set_num_threads() from 1 to 4 between the calls and verifying in between that the setting actually took effect. I then profile the run with gprof. The strange result is that the most time-consuming call is the 2-thread one, followed by 1 thread, then 4 threads, with the 3-thread call being the fastest. This is a very strange result, at least in my opinion. So I wonder whether gprof is unreliable for this (and if so, whether there is a better way?), and whether there could be any reasonable explanation for this result.
I have tried with omp_set_dynamic() both enabled and disabled. I have even tried changing the order of the function calls, in case some CPU or motherboard mechanism was making each successive call faster than the last. None of the functions shares its input with another: each gets its own copy, made before any of the functions is called, so there should be no difference in memory allocation between them.
If anybody has a clue, or just wants to speculate a little, it would be welcome.
Thanks in advance