Program crashes on AIX but rocks on Linux

General OpenMP discussion

Postby drososkourounis » Wed Sep 10, 2008 11:11 am

Hi there,
I have written a piece of code which I am trying to run in parallel on AIX Power5 CPUs with the xlC_r compiler. The code links to several libraries which I compile natively on each architecture. My experience with Linux was satisfactory. Great scalability even on Intel Core 2 Duo and Opteron servers. I tried 3 different compilers on Linux (g++-4.2.2, sunCC-5.9, intel icpc-10.1) and the result was the same with all. The code compiles fine and scales with all compilers.

However, on AIX the code crashes at a specific point: somehow the corresponding loop does not do the work it should and does something else instead. The loops are the following:

#pragma omp parallel for private(i)
for (i = 0; i < n; i++)
{
    // this is an array of pointers to objects which are
    // sparse direct linear solvers and provide the methods
    // init(), factorize() and solve()
    pLinearSolvers[i].factorize();
}

An identical earlier loop calling pLinearSolvers[i].init() works fine, and the solver continues its work and produces the correct result if the loop involving LinearSolver::factorize() is made serial. So the same way of parallelizing works for one loop but not for the other, which makes me suspect that something is wrong inside pLinearSolvers[i].factorize(). Why, though, does it work so well on Linux with 3 different compilers? I believe it is a UNIVERSAL STANDARD for OpenMP that

Data in function calls in OpenMP parallel loops become all private


Does this exclude class members that may be referenced inside those functions?

The other loop that has problems is the following:

#pragma omp parallel for private(i)
for (i = 0; i < n; i++)
{
blockEigenValueDecomposition(i);
}

Inside blockEigenValueDecomposition(i) the function allocates a local matrix A and initializes it to zero. It then copies the nonzero entries into A from a global sparse matrix, which is a class member visible to all threads (it should become shared since it is a class member, shouldn't it?). After allocating the local eigenvectors matrix V and the local eigenvalues vector Lambda, it calls the LAPACK eigenvalue decomposition. However, although the matrix is SPD (symmetric positive definite), the eigenvalues are negative when I run with 2 threads, while on a single thread they are fine. None of this happens on Linux.

Any ideas what may be wrong here?

I tried to link with:

-lessl -lm
LAPACK compiled and linked with -lessl

-lessl_r -lm_r
LAPACK compiled and linked with -lessl_r

I also compiled GOTO BLAS (the serial version) with both xlf and xlf_r,
then linked LAPACK with each of those and tried again. Nothing; the same error every time.

Any ideas?

Re: Program crashes on AIX but rocks on Linux

Postby ejd » Thu Sep 11, 2008 6:46 am

drososkourounis wrote:Hi there,
I have written a piece of code which I am trying to run in parallel on AIX Power5 CPUs with the xlC_r compiler. The code links to several libraries which I compile natively on each architecture. My experience with Linux was satisfactory. Great scalability even on Intel Core 2 Duo and Opteron servers. I tried 3 different compilers on Linux (g++-4.2.2, sunCC-5.9, intel icpc-10.1) and the result was the same with all. The code compiles fine and scales with all compilers.

However, on AIX the code crashes at a specific point: somehow the corresponding loop does not do the work it should and does something else instead.

Since you have tried three different compilers on Linux and the program works, there is a good chance that the AIX compiler has a problem. While it is possible that the OpenMP spec has been interpreted incorrectly by all three groups, that is less likely than one vendor having made a mistake (though not impossible). I would tend to think that the best place to ask your question is on an IBM forum.

drososkourounis wrote:I believe it is a UNIVERSAL STANDARD for OpenMP that

Data in function calls in OpenMP parallel loops become all private


While I am not totally sure I understand your comment (do you mean the formal parameters of the call?), the OpenMP spec tries to define the data-sharing attributes in all cases. For example, the OpenMP V3.0 spec, section 2.9.1.2 "Data-sharing Attribute Rules for Variables Referenced in a Region but not in a Construct", states:

• Formal arguments of called routines in the region that are passed by reference inherit the data-sharing attributes of the associated actual argument.

which is one case that differs from what you have stated.

drososkourounis wrote:Does this exclude class members that may be referenced inside those functions?

I am not a C++ expert, so I am not quite sure I understand how you are using the class member. Class members are not variables (in the OpenMP sense) and are handled differently from variables in OpenMP. Could you provide a small example of what you are doing?

