Why AMD+parallel got the slowest?

Re: Why AMD+parallel got the slowest?

Postby ftinetti » Fri Apr 08, 2011 11:49 am

It's good to see it working.

Remember that production code usually runs with -O2 or -O3.

Remember that production code usually takes advantage of highly optimized libraries such as MKL and ACML.

Even though, still feel AMD may not be as good as Intel if specifications are the same???


That's too "heavy" a question... not for me...
ftinetti
 
Posts: 581
Joined: Wed Feb 10, 2010 2:44 pm

Re: Why AMD+parallel got the slowest?

Postby ejd » Mon Apr 11, 2011 11:32 am

I am not sure what exactly to make of your results. You show one run using gfortran version 4.1.2, which I didn't think even supported OpenMP (documentation seems to indicate OMP support didn't go in till version 4.4). At -O0 you should get minimal optimization, but I would still expect to see better times using 2 threads than running serially. When you run with 1 thread and compare it to a serial run, you will generally see that the parallel version runs slower than the serial version because there is some overhead (even if the compiler dual-paths the code, the 1-thread run is at best as fast as the serial one).

The codes you run do make a difference in the performance you will see. Some codes will run better on an Intel chip and some on an AMD. This makes it extremely hard to say one is better than another.

Here are some timings that I see:
Code:
These runs were made on a RH Linux EL6 system running an Intel 975 3.33GHz chip (4 cores, 8 threads) with 6GB memory running your program:

gfortran 4.4.4 using -O3 -fopenmp:
serial:     93.90 sec
1 thread:  129.06 sec
2 threads:  64.73 sec
4 threads:  96.60 sec
8 threads:  90.87 sec

ifort V12.0 using -O3 -openmp:
serial:     22.78 sec
1 thread:   22.78 sec
2 threads:  11.44 sec
4 threads:   5.81 sec
8 threads:   5.78 sec
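
One way to read timings like these is as speedup (serial time divided by parallel time) and efficiency (speedup divided by thread count). The following is only a minimal sketch, assuming a POSIX shell with awk; `speedup_table` is a made-up helper name, and the numbers fed to it are the ifort rows from this post.

```shell
#!/bin/sh
# Sketch: speedup = serial_time / parallel_time, efficiency = speedup / threads.
# speedup_table is a hypothetical helper name, not part of any tool.
# First argument: serial time in seconds. Stdin: lines of "<threads> <seconds>".
speedup_table() {
    awk -v serial="$1" '
        NF == 2 {
            s = serial / $2
            printf "threads=%d speedup=%.2f efficiency=%.1f%%\n", $1, s, 100 * s / $1
        }'
}

# ifort V12.0 timings from this post (serial run: 22.78 sec)
speedup_table 22.78 <<'EOF'
1 22.78
2 11.44
4 5.81
8 5.78
EOF
```

The same function can be fed the gfortran rows; a speedup below 1 for the 1-thread run is the OpenMP overhead mentioned above.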
ejd
 
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

Re: Why AMD+parallel got the slowest?

Postby baddylover » Mon Apr 11, 2011 6:57 pm

Very interesting! I'm sure I didn't make a mistake: the gfortran results did come from Scientific Linux 5.3 64-bit.
Anyway, I'll check it again later today.
Meanwhile, I have just got Scientific Linux 6.0 64-bit installation DVDs. I'll install SL6.0 soon and see what results I will get.
baddylover
 
Posts: 25
Joined: Thu Apr 07, 2011 6:17 am

Re: Why AMD+parallel got the slowest?

Postby baddylover » Tue Apr 12, 2011 1:24 am

The following are the outputs (again) from my AMD machine running SL5.3 64-bit.
gfortran 4.1.2 does support OpenMP! (Maybe it's just not documented?)
One thing I don't understand is this line

cpu MHz : 800.000

from cat /proc/cpuinfo. Why not 3.4GHz? (Again, I'm really new to all this.)

Anyway, I'm going to install SL6.0 64 bit now.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[fuchun@localhost openmp]$ cat ./ttt
clear
cat /proc/cpuinfo
gfortran -v
gfortran -fopenmp -O2 mc2.f90
time ./a.out
gfortran -O2 mc2.f90
time ./a.out
[fuchun@localhost openmp]$
[fuchun@localhost openmp]$ bash ./ttt

processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips : 6836.33
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips : 6830.80
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 2
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips : 6831.30
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 3
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips : 6830.88
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)
j= 3 started:
j= 1 started:
j= 5 started:
j= 7 started:
j= 1 finished.
j= 2 started:
j= 3 finished.
j= 4 started:
j= 5 finished.
j= 6 started:
j= 7 finished.
j= 8 started:
j= 2 finished.
j= 4 finished.
j= 6 finished.
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 0m16.159s
user 1m4.534s
sys 0m0.001s
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 1m4.719s
user 1m4.701s
sys 0m0.000s
[fuchun@localhost openmp]$
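
A quick sanity check on `time` output like the above is the ratio user/real, which approximates how many cores were busy on average. This is only a sketch, assuming a POSIX shell with awk; `to_seconds` and `busy_cores` are made-up helper names, and the inputs are the real/user values from the log above.

```shell
#!/bin/sh
# Sketch: convert the "XmY.YYYs" format printed by the shell's time keyword
# to seconds, then compute user/real (roughly the average number of busy cores).
# to_seconds and busy_cores are hypothetical helper names.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

busy_cores() {   # usage: busy_cores <real> <user>
    awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", u / r }'
}

busy_cores 0m16.159s 1m4.534s    # build with -fopenmp -O2
busy_cores 1m4.719s  1m4.701s    # serial build with -O2
```

A ratio close to 4 for the OpenMP build and close to 1 for the serial build would be consistent with all four cores being used by the parallel loop.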
baddylover
 
Posts: 25
Joined: Thu Apr 07, 2011 6:17 am

Re: Why AMD+parallel got the slowest?

Postby baddylover » Tue Apr 12, 2011 4:07 am

Here are the results from my AMD machine running SL6.0 64-bit.
They are very similar to the results from SL5.3 64-bit.

BTW, I did a Google search and now roughly understand the 800 MHz "cpu MHz" value.

Basically, for the time being I'm happy with gfortran -fopenmp -O2 on my AMD machine running SL6.0 64-bit,
and will start doing my "serious work" with this setup.

Thanks, everybody, for the help.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[f1@FCHUANGSL60BIT64 openmp]$ gfortran -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC)
[f1@FCHUANGSL60BIT64 openmp]$


[f1@FCHUANGSL60BIT64 openmp]$ gfortran -fopenmp -O2 mc2.f90
[f1@FCHUANGSL60BIT64 openmp]$ time ./a.out
j= 1 started:
j= 3 started:
j= 7 started:
j= 5 started:
j= 1 finished.
j= 2 started:
j= 3 finished.
j= 4 started:
j= 5 finished.
j= 6 started:
j= 7 finished.
j= 8 started:
j= 2 finished.
j= 4 finished.
j= 6 finished.
j= 8 finished.
1 1234691.3624830523
2 2469382.7249661046
3 3704074.0874491567
4 4938765.4499322092
5 6173456.8124152618
6 7408148.1748983134
7 8642839.5373813659
8 9877530.8998644184

real 0m16.178s
user 1m4.490s
sys 0m0.032s
[f1@FCHUANGSL60BIT64 openmp]$ gfortran -O2 mc2.f90
[f1@FCHUANGSL60BIT64 openmp]$ time ./a.out
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.3624830523
2 2469382.7249661046
3 3704074.0874491567
4 4938765.4499322092
5 6173456.8124152618
6 7408148.1748983134
7 8642839.5373813659
8 9877530.8998644184

real 1m4.426s
user 1m4.413s
sys 0m0.002s
[f1@FCHUANGSL60BIT64 openmp]$

[f1@FCHUANGSL60BIT64 openmp]$ gfortran -fopenmp mc2.f90
[f1@FCHUANGSL60BIT64 openmp]$ time ./a.out
j= 5 started:
j= 1 started:
j= 7 started:
j= 3 started:
j= 1 finished.
j= 2 started:
j= 5 finished.
j= 6 started:
j= 3 finished.
j= 4 started:
j= 7 finished.
j= 8 started:
j= 2 finished.
j= 6 finished.
j= 4 finished.
j= 8 finished.
1 1234691.3624830523
2 2469382.7249661046
3 3704074.0874491567
4 4938765.4499322092
5 6173456.8124152618
6 7408148.1748983134
7 8642839.5373813659
8 9877530.8998644184

real 3m56.808s
user 15m43.934s
sys 0m0.138s
[f1@FCHUANGSL60BIT64 openmp]$

[f1@FCHUANGSL60BIT64 openmp]$ gfortran mc2.f90
[f1@FCHUANGSL60BIT64 openmp]$ time ./a.out
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.3624830523
2 2469382.7249661046
3 3704074.0874491567
4 4938765.4499322092
5 6173456.8124152618
6 7408148.1748983134
7 8642839.5373813659
8 9877530.8998644184

real 1m58.256s
user 1m58.234s
sys 0m0.001s
[f1@FCHUANGSL60BIT64 openmp]$
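
To see where a slow case such as the unoptimized OpenMP build comes from, it can help to repeat the same binary at several thread counts. A minimal sketch, assuming a POSIX shell; `OMP_NUM_THREADS` is the standard OpenMP environment variable, while `sweep_threads` and the thread counts 1, 2, 4 are choices made for this example.

```shell
#!/bin/sh
# Sketch: run one command several times with different OpenMP thread counts.
# sweep_threads is a hypothetical helper name; the counts 1 2 4 are arbitrary.
sweep_threads() {
    for n in 1 2 4; do
        echo "== OMP_NUM_THREADS=$n =="
        OMP_NUM_THREADS=$n "$@"
    done
}

# Intended use (assumes ./a.out was built with -fopenmp):
#   sweep_threads /usr/bin/time -p ./a.out
```

Comparing the 1-thread run against the serial build separates OpenMP overhead from scaling problems.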
baddylover
 
Posts: 25
Joined: Thu Apr 07, 2011 6:17 am

Re: Why AMD+parallel got the slowest?

Postby ftinetti » Tue Apr 12, 2011 10:41 am

I'm curious about

BTW, I did a Google search and now roughly understand the 800 MHz "cpu MHz" value.

1) URLs?
2) What did you understand?

Thanks in advance.
ftinetti
 
Posts: 581
Joined: Wed Feb 10, 2010 2:44 pm

Re: Why AMD+parallel got the slowest?

Postby baddylover » Tue Apr 12, 2011 6:27 pm

Here is the URL:
http://fixunix.com/mandriva/331310-proc ... w-mhz.html

One post there says:

"The OS is using dynamic cpu frequency control. With four cores,
your machine is probably rarely working at more than a percent
or two of capacity, so the speed is throttled back to reduce
power consumption and heat.

You can install (or maybe enable) software that allows setting
the machine for maximum performance, and it will then use a
different scheduler and run all cores at full speed at all times."
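
To watch the dynamic frequency control described in that quote, the per-core clock can be pulled out of /proc/cpuinfo while a job is running. A minimal sketch, assuming a Linux system with awk; `core_mhz` is a made-up helper name, and the sysfs path at the end is the usual cpufreq location, which may be absent on some machines.

```shell
#!/bin/sh
# Sketch: list the current clock of each core as reported by /proc/cpuinfo.
# core_mhz is a hypothetical helper name; the optional file argument allows
# inspecting a saved copy of the cpuinfo output instead of the live file.
core_mhz() {
    awk -F':' '
        /^processor/ { gsub(/[ \t]/, "", $2); cpu = $2 }
        /^cpu MHz/   { gsub(/[ \t]/, "", $2); print "cpu" cpu ": " $2 " MHz" }
    ' "${1:-/proc/cpuinfo}"
}

[ -r /proc/cpuinfo ] && core_mhz

# On kernels with cpufreq support the active policy is usually visible here
# (the file may be missing, for example inside virtual machines):
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null || true
```

Running it once while idle and once while ./a.out is busy should show which cores are clocked up.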

For your information, here is the output of "cat /proc/cpuinfo" from my AMD machine running SL6.0 64-bit:

[f1@FCHUANGSL60BIT64 ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.32
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.90
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 2
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.90
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 3
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.90
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

[f1@FCHUANGSL60BIT64 ~]$
baddylover
 
Posts: 25
Joined: Thu Apr 07, 2011 6:17 am

Re: Why AMD+parallel got the slowest?

Postby baddylover » Tue Apr 12, 2011 6:49 pm

When ./a.out is running, either one or all four of the CPUs show "cpu MHz : 3400.000", depending on whether the do loop runs serially or in parallel.
Here is one output taken while the serial do loop is running:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[f1@FCHUANGSL60BIT64 openmp]$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.32
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 3400.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.90
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 2
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.90
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 3
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 965 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6830.90
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

[f1@FCHUANGSL60BIT64 openmp]$
baddylover
 
Posts: 25
Joined: Thu Apr 07, 2011 6:17 am

Re: Why AMD+parallel got the slowest?

Postby ftinetti » Wed Apr 13, 2011 1:16 pm

Hmmm...

one post there says,

"The OS is using dynamic cpu frequency control. With four cores,
your machine is probably rarely working at more than a percent
or two of capacity, so the speed is throttled back to reduce
power consumption and heat.

...

When ./a.out is running, either one or all four of the CPUs show "cpu MHz : 3400.000", depending on whether the do loop runs serially or in parallel.
Here is one output taken while the serial do loop is running:


I see. Please remember that it's almost impossible to interpret performance results/samples with this setting enabled. Please disable it for further performance study/analysis.

Also, I think this setting is rarely found in an HPC production environment. HPC almost always implies full speed (max MHz); frequency control is usual and suggested on desktops/laptops.
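
One possible way to do that on Linux is to request the "performance" cpufreq governor on every core before benchmarking. This is a sketch only, assuming the standard cpufreq sysfs layout and root privileges; `set_governor` is a made-up helper, and a distribution service may also manage the governor and override it, so please check your own system.

```shell
#!/bin/sh
# Sketch: request a cpufreq governor on every core so clocks stay at maximum
# during benchmarks. Needs root on a real system.
# set_governor is a hypothetical helper; its second argument (the sysfs root)
# exists only so the function can be tried against a scratch directory.
set_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue      # skip cores without a writable cpufreq entry
        echo "$gov" > "$f" && echo "set $f"
    done
}

# Typical call (as root):
#   set_governor performance
```

Afterwards, "cpu MHz" in /proc/cpuinfo should stay at the maximum value even when the machine is idle.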
ftinetti
 
Posts: 581
Joined: Wed Feb 10, 2010 2:44 pm

Re: Why AMD+parallel got the slowest?

Postby tob » Wed May 04, 2011 12:44 am

ejd wrote:I am not sure what exactly to make of your results. You show one run using gfortran version 4.1.2, which I didn't think even supported OpenMP (documentation seems to indicate OMP support didn't go in till version 4.4).


That's not quite right. GCC has supported OpenMP v2.5 since 4.2; only OpenMP v3.0 support was added with GCC 4.4 (while OpenMP v3.1 might get added for 4.7). Red Hat backported the OpenMP support to its GCC 4.1.x versions; thus, there are gfortran 4.1.2 versions which support OpenMP.
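
Rather than relying on the documentation, one can ask the compiler itself: with -fopenmp, GCC defines the macro `_OPENMP` as the yyyymm release date of the OpenMP specification it implements. A small sketch, assuming a POSIX shell and a `gcc` on the PATH; `omp_version` is a made-up helper name, and the table only covers specification dates up to 5.0.

```shell
#!/bin/sh
# Sketch: map the value of the _OPENMP macro (a yyyymm date) to a spec version.
# omp_version is a hypothetical helper name.
omp_version() {
    case "$1" in
        200505) echo "2.5" ;;
        200805) echo "3.0" ;;
        201107) echo "3.1" ;;
        201307) echo "4.0" ;;
        201511) echo "4.5" ;;
        201811) echo "5.0" ;;
        *)      echo "unknown ($1)" ;;
    esac
}

# Ask the installed gcc which value it defines when -fopenmp is on
# (macro stays empty if gcc is missing or was built without OpenMP support).
macro=$(echo | gcc -fopenmp -dM -E - 2>/dev/null | awk '$2 == "_OPENMP" { print $3 }')
if [ -n "$macro" ]; then
    echo "_OPENMP=$macro -> OpenMP $(omp_version "$macro")"
fi
```

On a Red Hat GCC 4.1.2 with the backport, this check should report a value if OpenMP is really available there.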

ejd wrote:
Code:
gfortran 4.4.4 using -O3 -fopenmp:
serial:     93.90 sec
ifort V12.0 using -O3 -openmp:
serial:     22.78 sec



The comparison is not completely fair, as ifort by default ignores the Fortran standard with regard to parentheses. You need to specify -assume protect_parens to make sure that the parentheses aren't optimized away in the following expression:
Code:
x(j) = x(j)+((dble(k1)/90000.0)*(dble(k2)/90000.0))**2


Or, if you want to optimize them away, use gfortran's -fno-protect-parens. Additionally, ifort's -O3 corresponds more closely to GCC's -O3 -ffast-math. Taking all the options together, I get the following serial timings with Intel's ifort 11.1 and gfortran 4.7 on an Intel Core2 (E8400, 3GHz, CentOS 5.5).

Code:
0m11.916s   for gfortran -O3 -ffast-math -march=native -funroll-loops -fno-protect-parens bench.f90
0m24.556s   for ifort -O3 -xHost bench.f90


And on an Intel Xeon X5570 (2.93GHz, SUSE Linux Enterprise 11sp1):
Code:
For gfortran -O3 -ffast-math -march=native -funroll-loops -fno-protect-parens bench.f90
- Serial: 0m9.525s
- 1 thread: 0m9.572s
- 2 threads: 0m4.762s
- 3 threads: 0m3.597s
- 4 threads: 0m2.453s
For ifort -O3 -xHost bench.f90
- Serial: 0m25.611s
- 1 thread: 0m25.615s
- 2 threads: 0m13.010s
- 3 threads: 0m9.809s
- 4 threads: 0m6.608s


And on an AMD Athlon64 X2 (4800+, openSUSE Factory), I get:
Code:
For gfortran -O3 -ffast-math -march=native -funroll-loops -fno-protect-parens bench.f90
- Serial: 0m12.259s
- 1 thread: 0m12.034s
- 2 threads: 0m6.194s

For ifort -O3 -xHost bench.f90
- Serial: 0m52.454s
- 1 thread: 0m52.829s
- 2 threads: 0m28.017s


This shows how much the timing can depend on the compiler options (and the compiler defaults); other examples show how much the relative performance between compilers depends on the program in question. For GCC vs. Intel, I have also seen factor-of-2 differences for bigger programs, in favour of Intel or of GCC/gfortran depending on the program.
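
On the parentheses point earlier in this post: the reason the grouping matters at all is that floating-point arithmetic is not associative, so a compiler that regroups an expression can change the result as well as the speed. The sketch below shows the effect with addition for simplicity (the benchmark expression uses multiplication and division, so this is an analogy, not the same computation). It assumes an awk that computes in IEEE double precision; `reassoc_demo` is a made-up name.

```shell
#!/bin/sh
# Sketch: the same three numbers summed with two different groupings.
# With IEEE double precision, 1 is lost when it is first added to -1e16,
# so the two groupings give different answers.
# reassoc_demo is a hypothetical helper name.
reassoc_demo() {
    awk 'BEGIN {
        a = 1e16; b = -1e16; c = 1
        printf "(a+b)+c = %.1f\n", (a + b) + c
        printf "a+(b+c) = %.1f\n", a + (b + c)
    }'
}

reassoc_demo
```

This is why a fair comparison should use options that treat parentheses the same way in both compilers.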

Admittedly, my comparison is not completely fair either, as I did not compare with ifort 12 (the latest version), nor did I make a comprehensive study of which compiler is fastest with equivalent options.

Tobias
tob
 
Posts: 10
Joined: Thu Sep 23, 2010 12:53 am
