Why AMD+parallel got the slowest?

Postby baddylover » Thu Apr 07, 2011 6:21 am

Hi,

I'm really new to OpenMP.
I recently ran a simple gfortran program and got the following results:

Intel i5 650 (2 cores, 2 GB memory), Windows 7:
serial do: about 2 minutes
parallel do: about 1 minute

Intel i5 520 (2 cores, 2 GB memory), Windows 7:
serial do: about 2 minutes
parallel do: about 1 minute

AMD Phenom II X4 965 Black (3.4 GHz, 4 cores, 4 GB memory), Windows XP:
serial do: about 2 minutes
parallel do: about 4 minutes

AMD Phenom II X4 965 Black (3.4 GHz, 4 cores, 6 GB memory), Scientific Linux 5.3 (64-bit):
serial do: about 2 minutes
parallel do: about 4 minutes

Why is the AMD machine the slowest in parallel? I expected it to take about 0.5 minutes.
I want to understand this so I can decide which CPUs are worth buying for parallel computing with gfortran + OpenMP.

baddylover
baddylover
 
Posts: 25
Joined: Thu Apr 07, 2011 6:17 am

Re: Why AMD+parallel got the slowest?

Postby ejd » Thu Apr 07, 2011 9:25 am

Unfortunately, without more to go on, I have no idea why this is happening. One program may run better on one type of processor than on another, while a different program may run very badly on that same processor. The cause could be memory layout, processor design, compiler implementation, etc. You have to look at each case and dig into it to figure out why.
ejd
 
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

Re: Why AMD+parallel got the slowest?

Postby ftinetti » Thu Apr 07, 2011 10:57 am

Hi, I just have a few questions:
1) How are you measuring time?
2) Are you using the same compiler version and compiler options on every machine? Which ones?
3) Did you check how many threads are created? What was the result?
4) Do you have some (representative) code to show?

Maybe (hopefully) we can figure something out using the answers...
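
For questions 1 and 3, here is a minimal sketch of how a program could report its own wall-clock time and thread count. It is not the original poster's code: the program name, the loop body and the variable names are made up for illustration, and it assumes a gfortran build with OpenMP support (compile with gfortran -fopenmp).

```fortran
program timing_sketch
  use omp_lib                          ! provides omp_get_wtime and omp_get_num_threads
  implicit none
  integer, parameter :: n = 8          ! illustrative problem size
  double precision :: x(n), t0, t1
  integer :: j, k, nthreads

  nthreads = 1
  t0 = omp_get_wtime()                 ! wall-clock time in seconds

  !$omp parallel private(j, k)
  !$omp single
  nthreads = omp_get_num_threads()     ! team size, recorded by one thread only
  !$omp end single
  !$omp do
  do j = 1, n
     x(j) = 0.0d0
     do k = 1, 1000000
        x(j) = x(j) + dble(k) * 1.0d-6 ! placeholder work, not the mc2 kernel
     end do
  end do
  !$omp end do
  !$omp end parallel

  t1 = omp_get_wtime()
  print *, 'threads used:', nthreads
  print *, 'elapsed wall-clock seconds:', t1 - t0
  print *, 'checksum:', sum(x)
end program timing_sketch
```

Measuring inside the program separates the loop time from process start-up, and printing the team size shows directly whether the expected number of threads was created.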
ftinetti
 
Posts: 582
Joined: Wed Feb 10, 2010 2:44 pm

Re: Why AMD+parallel got the slowest?

Postby baddylover » Thu Apr 07, 2011 10:19 pm

The following are from my office computer (Intel i5 650 running Windows XP Professional).
The parallel do took about 1 minute, while the serial do took about 2 minutes.
I will post results from my AMD computer at home later today.

E:\openmp>gfortran -v
Using built-in specs.
Target: i586-pc-mingw32
Configured with: ../gcc-trunk/configure --prefix=/mingw --enable-languages=c,fortran --with-gmp=/home/FX/gfortran/dependencies --disable-werror --enable-threads --disable-nls --build=i586-pc-mingw32 --enable-libgomp --disable-shared --disable-win32-registry --with-dwarf2 --disable-sjlj-exceptions
Thread model: win32
gcc version 4.5.0 20090421 (experimental) [trunk revision 146519] (GCC)

E:\openmp>type mc2.f90

program mc2
  implicit none
  integer*4, parameter :: nmc = 8
  real*8, dimension(nmc) :: x = 0.0d0
  integer*4 :: j, k1, k2

  ! Each thread handles a share of the j iterations; loop indices are private.
  !$OMP PARALLEL private(j,k1,k2)
  !$OMP DO
  do j = 1, nmc
    x(j) = 0.0d0
    print *, 'j=', j, ' started:'
    ! 30000 x 30000 accumulation into the shared array element x(j)
    do k1 = 1, 30000
      do k2 = 1, 30000
        x(j) = x(j) + ((dble(k1)/90000.0)*(dble(k2)/90000.0))**2
      end do
    end do
    x(j) = x(j)*j
    print *, 'j=', j, ' finished.'
  end do
  !$OMP END DO NOWAIT
  !$OMP END PARALLEL

  ! Print the results after the parallel region.
  do j = 1, nmc
    print *, j, x(j)
  end do
end program mc2

E:\openmp>gfortran -fopenmp mc2.f90

E:\openmp>type test.bat
echo %TIME%
a.exe
echo %TIME%

E:\openmp>test

E:\openmp>echo 11:35:12.44
11:35:12.44

E:\openmp>a.exe
j= 3 started:
j= 5 started:
j= 1 started:
j= 7 started:
j= 7 finished.
j= 8 started:
j= 3 finished.
j= 4 started:
j= 5 finished.
j= 6 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 8 finished.
j= 4 finished.
j= 6 finished.
1 1234691.3624830756
2 2469382.7249661512
3 3704074.0874491567
4 4938765.4499322092
5 6173456.8124152618
6 7408148.1748983134
7 8642839.5373813659
8 9877530.8998644184

E:\openmp>echo 11:36:13.42
11:36:13.42

E:\openmp>gfortran mc2.f90

E:\openmp>test

E:\openmp>echo 11:38:14.84
11:38:14.84

E:\openmp>a.exe
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.3624830756
2 2469382.7249661512
3 3704074.0874492265
4 4938765.4499323023
5 6173456.8124153782
6 7408148.1748984531
7 8642839.5373815298
8 9877530.8998646047

E:\openmp>echo 11:40:07.41
11:40:07.41

E:\openmp>

Re: Why AMD+parallel got the slowest?

Postby baddylover » Fri Apr 08, 2011 6:30 am

Here are the results from my home computer (AMD Phenom II X4 965 Black, 3.4 GHz, 4 cores, 6 GB memory) running Scientific Linux 5.3 (64-bit).
The serial do took about 2 minutes, while the parallel do took about 5 minutes.
I noticed that the gfortran version is different, so I will now reboot my PC into Windows XP and test the program there.
I'll post the results shortly.

[fuchun@fuchun-huang-6 openmp]$ gfortran -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)
[fuchun@fuchun-huang-6 openmp]$
[fuchun@fuchun-huang-6 openmp]$ cat test.sh
date --rfc-2822
./a.out
date --rfc-2822
[fuchun@fuchun-huang-6 openmp]$

[fuchun@fuchun-huang-6 openmp]$ cat mc2.f90

program mc2
  implicit none
  integer*4, parameter :: nmc = 8
  real*8, dimension(nmc) :: x = 0.0d0
  integer*4 :: j, k1, k2

  ! Each thread handles a share of the j iterations; loop indices are private.
  !$OMP PARALLEL private(j,k1,k2)
  !$OMP DO
  do j = 1, nmc
    x(j) = 0.0d0
    print *, 'j=', j, ' started:'
    ! 30000 x 30000 accumulation into the shared array element x(j)
    do k1 = 1, 30000
      do k2 = 1, 30000
        x(j) = x(j) + ((dble(k1)/90000.0)*(dble(k2)/90000.0))**2
      end do
    end do
    x(j) = x(j)*j
    print *, 'j=', j, ' finished.'
  end do
  !$OMP END DO NOWAIT
  !$OMP END PARALLEL

  ! Print the results after the parallel region.
  do j = 1, nmc
    print *, j, x(j)
  end do
end program mc2

[fuchun@fuchun-huang-6 openmp]$

[fuchun@fuchun-huang-6 openmp]$ ./test.sh
Fri, 08 Apr 2011 23:06:24 +1000
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442
Fri, 08 Apr 2011 23:08:25 +1000
[fuchun@fuchun-huang-6 openmp]$ gfortran -fopenmp mc2.f90
[fuchun@fuchun-huang-6 openmp]$ ./test.sh
Fri, 08 Apr 2011 23:09:29 +1000
j= 1 started:
j= 3 started:
j= 5 started:
j= 7 started:
j= 5 finished.
j= 6 started:
j= 3 finished.
j= 4 started:
j= 1 finished.
j= 2 started:
j= 7 finished.
j= 8 started:
j= 6 finished.
j= 2 finished.
j= 4 finished.
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442
Fri, 08 Apr 2011 23:14:41 +1000

[fuchun@fuchun-huang-6 openmp]$

Re: Why AMD+parallel got the slowest?

Postby baddylover » Fri Apr 08, 2011 6:36 am

Additional note: I didn't install or run any programs other than those that come with a standard installation of Scientific Linux 5.3.

Re: Why AMD+parallel got the slowest?

Postby baddylover » Fri Apr 08, 2011 6:58 am

Here are the results from my home AMD computer running Windows XP.
The serial do took about 1.5 minutes, while the parallel do took about 2 minutes.
The gfortran version is the same as the one on my office PC (Intel i5 650).
Why?

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

H:\openmp>gfortran -v
Using built-in specs.
Target: i586-pc-mingw32
Configured with: ../gcc-trunk/configure --prefix=/mingw --enable-languages=c,fortran --with-gmp=/home/FX/gfortran/dependencies --disable-werror --enable-threads --disable-nls --build=i586-pc-mingw32 --enable-libgomp --disable-shared --disable-win32-registry --with-dwarf2 --disable-sjlj-exceptions
Thread model: win32
gcc version 4.5.0 20090421 (experimental) [trunk revision 146519] (GCC)

H:\openmp>

H:\openmp>type test.bat
echo %TIME%
a.exe
echo %TIME%

H:\openmp>

H:\openmp>type mc2.f90
program mc2
  implicit none
  integer*4, parameter :: nmc = 8
  real*8, dimension(nmc) :: x = 0.0d0
  integer*4 :: j, k1, k2

  ! Each thread handles a share of the j iterations; loop indices are private.
  !$OMP PARALLEL private(j,k1,k2)
  !$OMP DO
  do j = 1, nmc
    x(j) = 0.0d0
    print *, 'j=', j, ' started:'
    ! 30000 x 30000 accumulation into the shared array element x(j)
    do k1 = 1, 30000
      do k2 = 1, 30000
        x(j) = x(j) + ((dble(k1)/90000.0)*(dble(k2)/90000.0))**2
      end do
    end do
    x(j) = x(j)*j
    print *, 'j=', j, ' finished.'
  end do
  !$OMP END DO NOWAIT
  !$OMP END PARALLEL

  ! Print the results after the parallel region.
  do j = 1, nmc
    print *, j, x(j)
  end do
end program mc2

H:\openmp>

H:\openmp>gfortran mc2.f90

H:\openmp>test

H:\openmp>echo 23:48:44.45
23:48:44.45

H:\openmp>a.exe
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.3624830756
2 2469382.7249661512
3 3704074.0874492265
4 4938765.4499323023
5 6173456.8124153782
6 7408148.1748984531
7 8642839.5373815298
8 9877530.8998646047

H:\openmp>echo 23:50:15.34
23:50:15.34

H:\openmp>


H:\openmp>gfortran -fopenmp mc2.f90

H:\openmp>test

H:\openmp>echo 23:46:06.51
23:46:06.51

H:\openmp>a.exe
j= 3 started:
j= 1 started:
j= 5 started:
j= 7 started:
j= 1 finished.
j= 2 started:
j= 3 finished.
j= 4 started:
j= 2 finished.
j= 5 finished.
j= 6 started:
j= 7 finished.
j= 8 started:
j= 4 finished.
j= 6 finished.
j= 8 finished.
1 1234691.3624830756
2 2469382.7249661512
3 3704074.0874491567
4 4938765.4499322092
5 6173456.8124152618
6 7408148.1748983134
7 8642839.5373813659
8 9877530.8998644184

H:\openmp>echo 23:48:02.79
23:48:02.79

H:\openmp>

Re: Why AMD+parallel got the slowest?

Postby ftinetti » Fri Apr 08, 2011 10:29 am

Hi,

I get similarly bad results without optimization on a machine with two Opterons (64-bit Linux):

$ gfortran -fopenmp mc2.f90

$ export OMP_NUM_THREADS=1
$ time a.out
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 2m37.577s
user 2m37.549s
sys 0m0.003s

$ export OMP_NUM_THREADS=2
[fernando@smp0 testAMD]$ time a.out
j= 1 started:
j= 5 started:
j= 5 finished.
j= 6 started:
j= 1 finished.
j= 2 started:
j= 6 finished.
j= 7 started:
j= 2 finished.
j= 3 started:
j= 7 finished.
j= 8 started:
j= 3 finished.
j= 4 started:
j= 8 finished.
j= 4 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 4m28.276s
user 8m42.891s
sys 0m0.102s

However, results change with optimization:

$ gfortran -fopenmp -O2 mc2.f90

$ export OMP_NUM_THREADS=2
$ time a.out
j= 1 started:
j= 5 started:
j= 5 finished.
j= 6 started:
j= 1 finished.
j= 2 started:
j= 6 finished.
j= 7 started:
j= 2 finished.
j= 3 started:
j= 7 finished.
j= 8 started:
j= 3 finished.
j= 4 started:
j= 8 finished.
j= 4 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 0m30.762s
user 1m1.480s
sys 0m0.023s

$ export OMP_NUM_THREADS=1
$ time a.out
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 1m1.594s
user 1m1.581s
sys 0m0.002s


Summary: with -O2, 1 thread (i.e. serial, no parallel) takes about 1 min., 2 threads take about 30 sec.
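
For reference, the `real` timings above can be turned into speedup and parallel efficiency figures. The symbols are the usual textbook ones and are introduced here, not taken from the posts: T_1 is the wall time with 1 thread, T_p the wall time with p threads, S the speedup and E the efficiency.

```latex
S = \frac{T_1}{T_p}, \qquad E = \frac{S}{p}

\text{with -O2, } p = 2:\quad S = \frac{61.594}{30.762} \approx 2.00, \qquad E \approx 1.00

\text{without optimization, } p = 2:\quad S = \frac{157.577}{268.276} \approx 0.59, \qquad E \approx 0.29
```

So the optimized build scales almost ideally on two threads, while the unoptimized build is slower with two threads than with one.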

Please try on your computers and let us know the results.

Just guessing at the reason(s): debug (unoptimized) code serializes and/or incurs too many performance penalties on AMD processors.
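
One candidate explanation, which nobody in this thread has verified, is false sharing. The eight real*8 elements of x occupy 64 contiguous bytes, so they may all sit in one cache line, and unoptimized code may store x(j) to memory on every inner iteration; the threads would then keep invalidating each other's copy of that line. With -O2 the compiler is likely to keep the running sum in a register, which would hide the effect. The sketch below is a variant of mc2 that accumulates into a private scalar and writes x(j) once per j. The names acc and nk are mine, and the loop bound is reduced from 30000 so the example finishes quickly.

```fortran
program mc2_private_acc
  implicit none
  integer, parameter :: nmc = 8
  integer, parameter :: nk = 3000      ! assumption: smaller than the 30000 used in the thread
  double precision :: x(nmc), acc
  integer :: j, k1, k2

  ! j is private automatically as the parallel loop index
  !$omp parallel do private(k1, k2, acc)
  do j = 1, nmc
     acc = 0.0d0                       ! private running sum, one copy per thread
     do k1 = 1, nk
        do k2 = 1, nk
           acc = acc + ((dble(k1) / 90000.0d0) * (dble(k2) / 90000.0d0))**2
        end do
     end do
     x(j) = acc * j                    ! single store to the shared array per j
  end do
  !$omp end parallel do

  do j = 1, nmc
     print *, j, x(j)
  end do
end program mc2_private_acc
```

If this variant built without -O2 scales well on the AMD machine, that would support the false-sharing guess; if it is still slow, the cause lies elsewhere.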

Re: Why AMD+parallel got the slowest?

Postby baddylover » Fri Apr 08, 2011 11:03 am

Here are the results with -O2 from the AMD machine running Windows XP.
The serial do took about 44 seconds, while the parallel do took about 23 seconds.
Encouraging. Thanks a lot for the help.
I'll post results with -O2 from the AMD machine running SL 5.3 shortly.

H:\openmp>gfortran -O2 mc2.f90

H:\openmp>test

H:\openmp>echo 3:51:58.15
3:51:58.15

H:\openmp>a.exe
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.3624829219
2 2469382.7249658438
3 3704074.0874487660
4 4938765.4499316877
5 6173456.8124146098
6 7408148.1748975320
7 8642839.5373804532
8 9877530.8998633754

H:\openmp>echo 3:52:42.59
3:52:42.59

H:\openmp>

H:\openmp>gfortran -fopenmp -O2 mc2.f90

H:\openmp>test

H:\openmp>echo 3:53:28.73
3:53:28.73

H:\openmp>a.exe
j= 1 started:
j= 3 started:
j= 7 started:
j= 5 started:
j= 1 finished.
j= 2 started:
j= 5 finished.
j= 6 started:
j= 3 finished.
j= 4 started:
j= 7 finished.
j= 8 started:
j= 2 finished.
j= 4 finished.
j= 6 finished.
j= 8 finished.
1 1234691.3624829219
2 2469382.7249658438
3 3704074.0874491567
4 4938765.4499322092
5 6173456.8124152618
6 7408148.1748983134
7 8642839.5373813659
8 9877530.8998644184

H:\openmp>echo 3:53:51.57
3:53:51.57

H:\openmp>

Re: Why AMD+parallel got the slowest?

Postby baddylover » Fri Apr 08, 2011 11:25 am

Here are the results from the AMD machine running SL 5.3 (64-bit):
the serial do took 1m4s, while the parallel do took 0m16s.
Exactly what I expected! Great speedup from OpenMP!
Next I'll study the -O2 option.
Thanks a lot for the help.

Even so, I still feel AMD may not be as good as Intel when the specifications are the same?


[fuchun@fuchun-huang-6 openmp]$ gfortran -O2 mc2.f90
[fuchun@fuchun-huang-6 openmp]$ time ./a.out
j= 1 started:
j= 1 finished.
j= 2 started:
j= 2 finished.
j= 3 started:
j= 3 finished.
j= 4 started:
j= 4 finished.
j= 5 started:
j= 5 finished.
j= 6 started:
j= 6 finished.
j= 7 started:
j= 7 finished.
j= 8 started:
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 1m4.458s
user 1m4.447s
sys 0m0.000s

[fuchun@fuchun-huang-6 openmp]$ gfortran -fopenmp -O2 mc2.f90
[fuchun@fuchun-huang-6 openmp]$ time ./a.out
j= 1 started:
j= 3 started:
j= 5 started:
j= 7 started:
j= 3 finished.
j= 4 started:
j= 1 finished.
j= 2 started:
j= 5 finished.
j= 6 started:
j= 7 finished.
j= 8 started:
j= 4 finished.
j= 6 finished.
j= 2 finished.
j= 8 finished.
1 1234691.36248305
2 2469382.72496610
3 3704074.08744916
4 4938765.44993221
5 6173456.81241526
6 7408148.17489831
7 8642839.53738137
8 9877530.89986442

real 0m16.187s
user 1m4.486s
sys 0m0.001s
[fuchun@fuchun-huang-6 openmp]$
