I started using OpenMP several months ago, and during my tests I found that OpenMP does NOT scale linearly on my computer, even for a simple matrix-vector multiplication (MVX). I can neither find a mistake in my parallelization (code, compilation, or run settings) nor get efficient results (i.e., linear speedup: a run with 2 threads should take about half the time of a run with 1 thread, and so on).
So I kindly ask all of you to share the following information with me (for either Fortran or C programs), if possible:
1) Do you get linear speedups (or better)?
-- If the answer is YES, would you please share the results? What is wrong with my implementation?
-- If the answer is NO, do you have a logical explanation for that?
2) Would you please run the following code on your own platforms and send me feedback about the results? (I want to check whether the behavior is system dependent.)
To keep the results comparable with each other, please run the MVX for the following sizes, and DO NOT FORGET TO REPORT YOUR SYSTEM SPECS:
i) Square Matrix of Size 1000*1000
ii) Square Matrix of Size 10,000*10,000
iii) Matrix of Size 5000*15000
iv) Matrix of Size 15000*5000
v) Matrix of Size 300*2000
I look forward to hearing from you. I would really appreciate your help and these results, as my research topic is closely related to OpenMP implementations!
The following two links (images) show the output of my runs:
Image 1: MVX Run Time vs Matrix Elements
link: http://www.flickr.com/photos/93385967@N ... hotostream
Image 2: MVX Run Time vs Matrix Elements
link: http://www.flickr.com/photos/93385967@N ... hotostream
My code is below; the "Matrix Vector Multiplication Part" is the section that matters to me:
      program main
      INCLUDE "omp_lib.h"
      integer i,j,k, h, m, n, q, proc_num, thread_num
      real*8 t1, t2
      real*8 time
      real, allocatable :: a(:,:)
      real, allocatable :: b(:,:)
      real, allocatable :: x(:)
      real, allocatable :: y(:)
      n = 10000
      m = 10000
      DO WHILE (m.LT.100001)
        DO WHILE (n.LT.100001)
          allocate(a(m,n))
          allocate(b(n,m))
          allocate(x(n))
          allocate(y(m))
          h = 1
          DO WHILE (h.LT.9)
            call omp_set_num_threads(h)
            print*,'The Thread numbers is set to :',h
            proc_num = omp_get_num_procs ( )
            thread_num = omp_get_max_threads ( )
            print*, ' Compute matrix vector multiplications y = A*x.'
            print*, ' The number of processors available = ', proc_num
            print*, ' The number of threads available = ', thread_num
C
C Set the matrix A.
C j is the inner loop index, so each thread needs its own copy.
C
!$omp parallel
!$omp& shared (a)
!$omp& private (i,j)
!$omp do
            do i = 1, m
              do j = 1, n
                a(i,j) = (10*i+j)
              end do
            end do
!$omp end do
!$omp end parallel
C TRANSPOSE A (j is private here as well)
!$omp parallel
!$omp& shared (b,a)
!$omp& private (i,j)
!$omp do
            do i = 1, n
              do j = 1, m
                b(i,j) = a(j,i)
              end do
            end do
!$omp end do
!$omp end parallel
C
C Set the Vector x
C
!$omp parallel
!$omp& shared (x)
!$omp& private (i)
!$omp do
            do i = 1, n
              x(i) = i
            end do
!$omp end do
!$omp end parallel
C
C Initialization
C
!$omp parallel
!$omp& shared (y)
!$omp& private (i)
!$omp do
            do i = 1, m
              y(i) = 0.0
            end do
!$omp end do
!$omp end parallel
C #######################################
C
C Matrix Vector Multiplication Part
C
C #######################################
            t1 = OMP_GET_WTIME()
!$omp parallel
!$omp& shared (b,x)
!$omp& private (i,j)
!$omp do reduction(+:y)
            do i = 1, m
              y(i) = 0.0
              do j = 1, n
                y(i) = y(i) + b(j,i) * x(j)
              end do
            end do
!$omp end do
!$omp end parallel
            t2 = OMP_GET_WTIME()
C Elapsed time of the multiplication in milliseconds
            time = (t2-t1)*1000
            print*, m, n, time
            h = h*2
          END DO
          deallocate(a)
          deallocate(b)
          deallocate(x)
          deallocate(y)
          n = n + 4000
        END DO
        n = 10000
        m = m + 4000
      END DO
      stop
      end
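For comparison, here is a minimal sketch of the same multiplication kernel with y shared instead of listed in reduction(+:y). It rests on the observation that, when the i loop is the one divided among threads, each y(i) is updated by only one thread. The program name mvx_shared_y, the use of "use omp_lib" in place of the include file, the default(none) clause, and the small 1000*1000 size are assumptions made for illustration; this is not the code that produced the timings reported in this post.

```fortran
      program mvx_shared_y
!     Sketch only: the same kernel, but y is shared and there is
!     no reduction clause. Sizes are small, illustrative values.
      use omp_lib
      implicit none
      integer i, j, m, n
      double precision t1, t2
      real, allocatable :: b(:,:), x(:), y(:)
      m = 1000
      n = 1000
      allocate(b(n,m), x(n), y(m))
!     Same values as the listing above: b(j,i) = a(i,j) = 10*i + j
!$omp parallel do default(none) shared(b,m,n) private(i,j)
      do i = 1, m
        do j = 1, n
          b(j,i) = 10*i + j
        end do
      end do
!$omp end parallel do
      do j = 1, n
        x(j) = j
      end do
      t1 = omp_get_wtime()
!     Each y(i) is written by exactly one thread, so sharing y is
!     race-free and no per-thread copy of y has to be combined.
!$omp parallel do default(none) shared(b,x,y,m,n) private(i,j)
      do i = 1, m
        y(i) = 0.0
        do j = 1, n
          y(i) = y(i) + b(j,i) * x(j)
        end do
      end do
!$omp end parallel do
      t2 = omp_get_wtime()
!     m, n and the elapsed time in milliseconds
      print *, m, n, (t2-t1)*1000
      deallocate(b, x, y)
      end program mvx_shared_y
```

The thread count can be varied with the OMP_NUM_THREADS environment variable, and m and n can be changed to the sizes listed earlier, so that the timings of the two variants can be set side by side.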
To compile and run this Fortran code, I use the following commands. The output for one of the steps (m = n = 10000) is shown below; the last column is the run time of the multiplication in milliseconds:
[mahdi@hpcn00 MVX]$ ifort -openmp -o MVX-fortran mahdi-MVX.f
[mahdi@hpcn00 MVX]$ ./MVX-fortran
The Thread numbers is set to : 1
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 1
10000 10000 121.527910232544
The Thread numbers is set to : 2
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 2
10000 10000 72.3700523376465
The Thread numbers is set to : 4
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 4
10000 10000 64.0790462493896
The Thread numbers is set to : 8
Compute matrix vector multiplications y = A*x.
The number of processors available = 8
The number of threads available = 8
10000 10000 63.7650489807129
The Thread numbers is set to : 1
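To put numbers on the scaling, the run times above can be turned into speedup and parallel efficiency. The sketch below does only that arithmetic, using the usual definitions speedup = T(1 thread) / T(p threads) and efficiency = speedup / p. The program name mvx_speedup and the variable names are chosen for illustration. With the four timings above, the speedup comes out at roughly 1.68, 1.90 and 1.91 for 2, 4 and 8 threads, so the efficiency at 8 threads is only about 0.24.

```fortran
      program mvx_speedup
!     Sketch only: speedup and efficiency from the run times
!     printed above for m = n = 10000 (milliseconds).
      implicit none
      integer k
      integer nthr(4)
      double precision t(4), s, e
      nthr = (/ 1, 2, 4, 8 /)
      t(1) = 121.527910232544d0
      t(2) = 72.3700523376465d0
      t(3) = 64.0790462493896d0
      t(4) = 63.7650489807129d0
      do k = 1, 4
!       speedup = T(1 thread) / T(p threads)
        s = t(1) / t(k)
!       efficiency = speedup / p (1.0 would be linear scaling)
        e = s / nthr(k)
        print '(i3, 2f8.3)', nthr(k), s, e
      end do
      end program mvx_speedup
```

Each output line lists the thread count, the speedup, and the efficiency, which makes results from different machines easier to compare than raw run times.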
The platform I run my tests on has the following specs:
OS: Scientific Linux 5.7 - Kernel: 2.6.18
Compiler : Intel parallel studio XE 2013
CPU : 2 quad-core Intel Xeon 5345 (8 cores)
RAM : 32 GB
You are also welcome to contact me via the following email address:
kazempour[at]ee[dot]bilkent[dot]edu[dot]tr
Please don't hesitate to share any information with me.
Best Wishes and Regards,
Mahdi
