What determines the CPU usage of OpenMP threads?
Background and details:
I have a C application that uses a #pragma omp parallel for loop to do
some pretty heavy processing that typically takes several tens of
seconds. I am running it on 64-bit Linux (using gcc 4.5.1 or 4.4.6,
depending on the machine) on machines with 8 to 32 cores. During
development and "in production" (for the past half year or so) it
basically saturated the machine: when I checked the CPU usage with top,
I saw as many threads as there were cores, each using 100% CPU, i.e. a
total of 1600% CPU usage on the 16-core machine.
Now, in the last few weeks, I see something different: there are as many
threads as before, but the total CPU usage is 100% or 500% or some
n*100% in between, where a few threads use 100% each and the remaining
threads get much less CPU. Here is an example top output with about 500%
usage (run on a 16-core machine with OMP_NUM_THREADS=12 set in the
environment):
top - 16:00:03 up 384 days, 44 min, 23 users, load average: 6.69, 3.40, 1.43
Tasks: 420 total, 13 running, 396 sleeping, 1 stopped, 10 zombie
Cpu(s): 36.0%us, 0.5%sy, 0.0%ni, 63.4%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 72633M total, 69799M used, 2833M free, 226M buffers
Swap: 4102M total, 479M used, 3623M free, 64874M cached
PID  USER     PR NI  VIRT SWAP CODE DATA  RES  SHR S %CPU %MEM   TIME+ COMMAND
8944 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R  100  3.5 1:18.77 myprogram
8952 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R  100  3.5 0:29.72 myprogram
8955 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   99  3.5 0:30.18 myprogram
8950 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   26  3.5 0:22.84 myprogram
8954 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   26  3.5 0:23.39 myprogram
8949 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   25  3.5 0:25.18 myprogram
8953 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   25  3.5 0:23.33 myprogram
8947 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   21  3.5 0:22.88 myprogram
8948 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   21  3.5 0:22.84 myprogram
8957 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   21  3.5 0:27.32 myprogram
8956 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   20  3.5 0:22.77 myprogram
8951 username 20  0 2897m 370m  124 2.8g 2.5g 2868 R   19  3.5 0:22.84 myprogram
My program also prints the run time of the relevant part:
51.455s (wall-clock) 131.910s (CPU, 12 CPUs)
which shows only a very moderate gain from the parallelization (the CPU
time is just about 2.6 times the wall-clock time), despite 12 CPUs being
available to OpenMP.
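For reference, figures of this kind can be collected roughly as follows.
This is only a minimal sketch, assuming the POSIX clock_gettime()
interface; it is not the actual timing code of the program, and the
function names (wall_seconds, process_cpu_seconds, busy_cores,
report_timing) are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Wall-clock seconds from a monotonic clock. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* CPU seconds consumed by the whole process, summed over all threads. */
double process_cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average number of cores kept busy: CPU time divided by wall-clock time. */
double busy_cores(double cpu, double wall)
{
    assert(wall > 0.0);
    return cpu / wall;
}

/* Print a timing line in the same shape as the output quoted above,
 * with the CPU/wall-clock ratio appended. */
void report_timing(double wall, double cpu, int ncpus)
{
    printf("%.3fs (wall-clock) %.3fs (CPU, %d CPUs), %.2f cores busy on average\n",
           wall, cpu, ncpus, busy_cores(cpu, wall));
}
```

The idea would be to read both clocks immediately before and after the
parallel loop and pass the two differences to report_timing(). With the
numbers quoted above, busy_cores(131.910, 51.455) comes out at roughly
2.56, far below the 12 threads requested.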
The problem is that, as far as I can tell, nothing has changed, either
in the code or in the installation of the machine. The loop in question
is a simple parallel OpenMP for loop with some explicitly shared
variables:
#pragma omp parallel for default(none) \
        shared(aCube, aPixgrid, cd33, crpix3, crval3, ..., znorm)
I don't restrict the scheduling in any way.
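One way to narrow this down might be to time each thread inside the
parallel region, to tell an uneven distribution of work apart from
threads that are simply not getting CPU time. The following is a rough
sketch under assumptions, not the real loop: process_item() is a
hypothetical stand-in for the loop body, run_instrumented() and
print_thread_times() are made-up names, and the POSIX clocks
CLOCK_MONOTONIC and CLOCK_THREAD_CPUTIME_ID are assumed to be available.
The schedule(runtime) clause defers the choice of schedule to the
OMP_SCHEDULE environment variable, so different schedules can be tried
without recompiling.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Read a POSIX clock and return its value in seconds. */
double clock_seconds(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for one iteration of the real loop body. */
double process_item(int i)
{
    double acc = 0.0;
    for (int k = 0; k < 1000; k++)
        acc += (double)(i % 7) * k;
    return acc;
}

/*
 * Run n iterations and record, per thread, the elapsed wall-clock time
 * and the CPU time spent in the loop. Returns the number of threads in
 * the team (1 when compiled without OpenMP).
 */
int run_instrumented(int n, double *result,
                     double *wall, double *cpu, int max_threads)
{
    int nthreads = 1;
    assert(n >= 0 && max_threads >= 1);

    #pragma omp parallel default(none) \
            shared(n, result, wall, cpu, max_threads, nthreads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();
#endif
        double w0 = clock_seconds(CLOCK_MONOTONIC);
        double c0 = clock_seconds(CLOCK_THREAD_CPUTIME_ID);

        /* nowait: stop each thread's clocks before it waits for the others */
        #pragma omp for schedule(runtime) nowait
        for (int i = 0; i < n; i++)
            result[i] = process_item(i);

        if (tid < max_threads) {
            wall[tid] = clock_seconds(CLOCK_MONOTONIC) - w0;
            cpu[tid]  = clock_seconds(CLOCK_THREAD_CPUTIME_ID) - c0;
        }
    }
    return nthreads;
}

/* Print one line per thread: wall-clock time next to CPU time. */
void print_thread_times(int nthreads, const double *wall, const double *cpu)
{
    for (int t = 0; t < nthreads; t++)
        printf("thread %2d: %.3fs wall, %.3fs CPU\n", t, wall[t], cpu[t]);
}
```

If the wall-clock times differ a lot between threads while each thread's
CPU time is close to its own wall-clock time, the work is unevenly
distributed. If some threads show far less CPU time than wall-clock
time, they spent most of the loop waiting rather than computing, which
would point away from the loop itself and towards the machine.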
I would be grateful for some hints on what could cause this or how to
debug what's going on.
Peter.
