General OpenMP discussion

strange lack of speedup over seemingly-independent for loop

Postby tlippy01 » Sun Jul 28, 2013 5:10 pm

Hello all,

I'm having some trouble understanding why OpenMP isn't coming even close to the speedup I'd expect. In the serial version, I read in a graph, create a vector of nodes, and sample them in a for loop. Everything except the for loop is trivial (maybe 1/3 of a second), and the total runtime is about 35 seconds. So, I put an OpenMP pragma on the loop; it splits the work over 4 cores and runs in about 30 seconds, with one core churning at 100% and the other three at 70%.

I was expecting almost linear speedup, so this was disappointing. There were some objects that I thought might be causing write-collisions between threads, so I tried fiddling with some OpenMP features, but to no avail. So, I decided to put the entire read/create/sample sequence inside the parallel loop, so that each thread has its own private copy of everything (which I think is normally/more elegantly accomplished with "firstprivate", but I wanted to be sure). This is the relevant code:

Code: Select all
#pragma omp parallel for schedule(static)
for(int thread=0; thread<num_threads; thread++){
  // each iteration (one per thread) builds its own private graph and counts
  FactorGraph<DenseCounts> fg = FactorGraph<DenseCounts>(file);
  auto vars = fg.get_variables();
  DenseCounts counts = fg.get_counts();
  decltype(vars) to_sample;
  typedef decltype(vars)::value_type VarType;
  copy_if(vars.begin(), vars.end(), back_inserter(to_sample), [](VarType v){ return not v->get_observed(); });
  // split the unobserved variables evenly; the last thread takes the remainder
  int per_thread = to_sample.size() / num_threads;
  int start = thread * per_thread;
  int end = (thread + 1) * per_thread;
  if(thread == num_threads - 1){
    end = to_sample.size();
  }
  cout << "thread #" << omp_get_thread_num() << ": " << start << " " << end << endl;
  for(int k=start; k<end; k++){
    to_sample[k]->sample(counts);
  }
}

With one thread (export OMP_NUM_THREADS=1) I get:
thread #0: 0 25474

real 0m36.578s
user 0m36.576s
sys 0m0.024s

With four threads (export OMP_NUM_THREADS=4):
sample #0
thread #0: 0 6368
thread #2: 12736 19104
thread #1: 6368 12736
thread #3: 19104 25474

real 0m29.892s
user 1m22.472s
sys 0m23.480s

Note that the "thread #X" lines print almost immediately, so the setup isn't the problem; all the time is spent in the nested for loop. Each thread gets the same amount of work, 1/4 of the original (the computations are all essentially the same), so why is the speedup under 20%? The cores behave the same as described in the first paragraph, one at 100% and the rest at 70%, so I'm guessing whatever the problem was remains.
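For reference, the firstprivate version I had in mind, instead of the explicit per-thread copies above, would look roughly like this (assuming FactorGraph's copy constructor deep-copies its variables and counts, which is exactly what I wasn't sure about):

Code: Select all
FactorGraph<DenseCounts> fg = FactorGraph<DenseCounts>(file);

#pragma omp parallel firstprivate(fg)
{
  // each thread works on its own private copy of fg
  auto vars = fg.get_variables();
  decltype(vars) to_sample;
  typedef decltype(vars)::value_type VarType;
  copy_if(vars.begin(), vars.end(), back_inserter(to_sample), [](VarType v){ return not v->get_observed(); });
  DenseCounts counts = fg.get_counts();

  int nt = omp_get_num_threads();
  int tid = omp_get_thread_num();
  int per_thread = to_sample.size() / nt;
  int start = tid * per_thread;
  int end = (tid == nt - 1) ? (int)to_sample.size() : start + per_thread;

  for(int k=start; k<end; k++){
    to_sample[k]->sample(counts);
  }
}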

Any help/advice on where to look is much appreciated.

-Tom
tlippy01
 
Posts: 5
Joined: Sun Jul 28, 2013 12:18 pm

Re: strange lack of speedup over seemingly-independent for l

Postby ftinetti » Mon Jul 29, 2013 8:40 am

Hi Tom,

I think I'm missing something, but anyway:
1) Please include details of the computer, specifically the processor(s) being used (vendor, model, etc.; /proc/cpuinfo),
2) What does sample(counts) do? If possible, include the corresponding code.
3) Just in case: please time
Code: Select all
for(int k=start; k<end; k++){
   to_sample[k]->sample(counts);
}

just to confirm that almost all the run time is spent in that code region.

Fernando.
ftinetti
 
Posts: 567
Joined: Wed Feb 10, 2010 2:44 pm

Re: strange lack of speedup over seemingly-independent for l

Postby MarkB » Tue Jul 30, 2013 6:24 am

Hi Tom,

The high system time in the 4-thread run ought to be a clue. Some possible causes are:

Context switching: are there other processes consuming CPU?
Locking: are you using some shared data structure inside to_sample[k]->sample(counts) which is enforcing mutual exclusion?
Memory allocation/deallocation: does to_sample[k]->sample(counts) do a lot of this?
Page swapping: are you running out of memory?

What does a run with 2 threads look like?
Can you get some profiling information to find out where the time is being spent?
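
As a first step, something like this rough sketch (reusing the names from your snippet) will show how long each thread spends in the inner loop, independent of the start-up work:

Code: Select all
double t0 = omp_get_wtime();
for(int k=start; k<end; k++){
  to_sample[k]->sample(counts);
}
double t1 = omp_get_wtime();
#pragma omp critical
cout << "thread " << omp_get_thread_num() << " loop time: " << (t1 - t0) << " s" << endl;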

Mark.
MarkB
 
Posts: 422
Joined: Thu Jan 08, 2009 10:12 am

Re: strange lack of speedup over seemingly-independent for l

Postby tlippy01 » Fri Aug 02, 2013 4:43 am

Thanks for the fast replies, Fernando and Mark (and apologies for the delayed response). Since writing what follows, I have found at least part of the problem, in the form of a global RNG: fixing that got some improvement. The speedup is still only approaching 2x, so maybe something about my code isn't "hyperthreadable"; I'm not familiar with the details of how HT works but will be looking into it. I need to set this aside for a week or two, but when I get back to it I'll keep posting on this thread because I think it might be a useful case study for fellow newcomers. What follows are answers to a few of the questions you had. Thanks again, and more soon.
-Tom

I'm running this on an i7 ULV with 2 cores and HT, so 4 threads. /proc/cpuinfo repeats the following 4 times:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping : 9
microcode : 0x17
cpu MHz : 1776.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4788.75
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

When I wrap the startup code (from the "FactorGraph" construction to the nested for loop) and the loop itself in timers, and print the results in microseconds, I get:
startup: 838350 loop: 20565871
startup: 893896 loop: 26708657
startup: 858535 loop: 28466323
startup: 912165 loop: 28573302

real 0m29.503s
user 1m21.412s
sys 0m23.268s

So it looks like the startup code takes <4% of the time.

The code behind sample(counts) is complex: there is inheritance, CRTP, and heavy use of C++11 features. But since each thread creates its own FactorGraph (and DenseCounts) objects, they shouldn't share any structure. I do have several class member functions declared static; might this be a problem?
tlippy01
 
Posts: 5
Joined: Sun Jul 28, 2013 12:18 pm

Re: strange lack of speedup over seemingly-independent for l

Postby ftinetti » Fri Aug 02, 2013 5:14 am

Hi Tom,

Just a few comments:
1) The processor actually has only two cores (http://ark.intel.com/products/65714), and HT doesn't add much performance in HPC, so the speedup limit is around 2.
2) There are some issues related to performance measurements and the so-called "Turbo Frequency" or "Turbo Boost"; you could take a look at previous posts about this, such as viewtopic.php?f=3&t=1481 and viewtopic.php?f=3&t=1289.
3) RNG has always been an issue in parallel code (independently of HT); you may find many posts about it, take some time to search.
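
Regarding 3), just as an illustration (the seeding scheme here is only an example, not necessarily right for your model), the usual fix is one private generator per thread with distinct seeds:

Code: Select all
#include <omp.h>
#include <random>

int main(){
#pragma omp parallel
  {
    // one generator per thread, seeded differently so the streams differ
    std::mt19937_64 rng(12345u + omp_get_thread_num());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double x = dist(rng);   // each thread draws from its own stream
    (void)x;
  }
  return 0;
}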

HTH,

Fernando.
ftinetti
 
Posts: 567
Joined: Wed Feb 10, 2010 2:44 pm

Re: strange lack of speedup over seemingly-independent for l

Postby tlippy01 » Thu Aug 08, 2013 6:07 pm

OK, I've spent some more time on this, but it's still not showing improvement. I've tried to eliminate possible sources of the problem (see below): every variable is now declared private, and it explicitly divides the tasks evenly between threads ("get_range"). So, my understanding is, if all the firstprivate variables have correct copy semantics, nothing should collide between the threads. I did a test (using the commented-out lines) which does speed up correctly, so I think this eliminates the RNG and "counts" object as the issue. It makes me think something in my model class is the problem.

The model class is a container: I had each thread print out the addresses of all the objects in its model, and confirmed that no pointers are shared between the private copies. So I don't know what could be causing this: is there a tool/method to determine what object is locking between the threads, or otherwise slowing things down? I've tried strace, and found many, many "madvise MADV_DONTNEED", "mprotect PROT_READ|PROT_WRITE" and "mprotect resumed" entries, which seem to be related to the problem, but I can't tell why they're occurring.

One final observation: when I compare the actual runtimes using two threads, they both start at the same time; the first completes as expected (in about half the serial time), while the other continues running, as though it didn't actually start until the first ended...

Does any of this help narrow down the issue? Right now I'm a bit jealous of this: http://www.haskell.org/haskellwiki/ThreadScope

Thanks,
-Tom

Code: Select all
vector<double> vals(num_threads, 0.0);
#pragma omp parallel for schedule(static) default(none) firstprivate(model,abs_start_time,num_threads)
for(int j=0; j<num_threads; j++){
  auto vars = model.get_variables();
  decltype(vars) unobserved, to_sample;
  typedef decltype(vars)::value_type VarType;
  copy_if(vars.begin(), vars.end(), back_inserter(to_sample), [](VarType v){ return not v->get_observed(); });
  int start, end;
  std::tie(start, end) = get_range(j, num_threads, to_sample.size());
  auto& counts = model.get_counts();
  auto& alphabets = model.get_alphabets();

  mt19937_64 rng;
  // note: time(0) usually returns the same second in every thread,
  // so all the private generators end up with identical streams
  rng.seed(static_cast<unsigned int>(time(0)));

  for(int i=start; i<end; i++){
    //for(int k=0; k<1000; k++){
    //  vals[j] += counts({"classA", "classB"}, vector<unsigned int>({0, 0})) + rng();
    //  counts.increment({"classA", "classB"}, vector<int>({0, 0}), rng());
    //}
    to_sample[i]->sample(counts, rng);
  }
}
tlippy01
 
Posts: 5
Joined: Sun Jul 28, 2013 12:18 pm

Re: strange lack of speedup over seemingly-independent for l

Postby tlippy01 » Sat Aug 10, 2013 9:47 am

Quick update: when I compile without optimizations (-O0) I get the expected speedup, though of course everything runs much, much slower: everything took about 35s before, regardless of number of threads, but now 1 thread takes 10 minutes, 2 threads take 5 minutes, etc, and the CPUs are at 100% user. So, should I take this to mean that my code is semantically correct, but some optimization flag is merging memory locations or the like and causing thrashing?

Since even -O1 causes the problem (high system CPU time), I guess I'll try enabling each optimization flag in -O1 individually to see which one triggers it.
tlippy01
 
Posts: 5
Joined: Sun Jul 28, 2013 12:18 pm

Re: strange lack of speedup over seemingly-independent for l

Postby MarkB » Mon Aug 12, 2013 4:24 am

Hi Tom,

It is possible that the mprotect calls are coming from the memory allocator: see the comments on this thread: http://stackoverflow.com/questions/8474 ... o-mprotect. The problem may be that your code is doing a lot of news and deletes.
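
Purely as an illustration (the names and sizes below are made up, not from your code), the difference between allocating temporaries inside the hot loop and hoisting them out looks like this:

Code: Select all
#include <vector>

// hypothetical example: 'tmp' stands in for whatever sample() allocates
void sample_range_slow(int start, int end){
  for(int k = start; k < end; k++){
    std::vector<double> tmp(1024);   // new/delete every iteration: threads contend in the allocator
    // ... per-item work using tmp ...
  }
}

void sample_range_fast(int start, int end){
  std::vector<double> tmp(1024);     // allocated once per thread, reused
  for(int k = start; k < end; k++){
    // ... per-item work reusing tmp ...
  }
}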

However, you should not expect too much performance gain from hyperthreading. Codes which benefit from this are typically those which are memory latency bound, i.e. if the CPU is frequently stalled waiting on memory accesses, but even then the best you can really hope for is a 1.2x to 1.3x boost. Codes which are memory bandwidth or arithmetic bound won't benefit so much, and may even slow down.

Turning down/off optimisation will likely cause a lot more memory accesses, and so make hyperthreading (relatively) more beneficial. It will also reduce the importance of memory allocation overheads.

Hope that helps,
Mark.
MarkB
 
Posts: 422
Joined: Thu Jan 08, 2009 10:12 am

Re: strange lack of speedup over seemingly-independent for l

Postby tlippy01 » Mon Aug 12, 2013 4:36 am

Thanks Mark: I should point out that after the first post I'm no longer using hyper-threading, i.e. I'm only asking for two cores rather than four. I noticed that my sanity check (using the commented section of the code in my previous post rather than the "sample" calls) gives the results you describe: 2x speedup with 2 cores, then another small speedup with 4 cores.
-Tom
tlippy01
 
Posts: 5
Joined: Sun Jul 28, 2013 12:18 pm

