[Omp] A question about OpenMP 2.5

Haab, Grant grant.haab at intel.com
Thu Mar 22 10:53:50 PDT 2007


Dieter,

I do see your point about unaligned data now.

Both the Compaq Alpha Compilers and KAP/Pro toolset compilers supported
access to 1-byte data types with OpenMP compilation.

I believe the Alpha Unix OS (can't remember the name) would issue
unaligned access messages whenever a piece of data was accessed
unaligned to a four-byte boundary.  The code would still run correctly,
but the unaligned accesses were very slow compared to the aligned ones.

Greg's example code with the byte array and your example of unaligned
four-byte data would both have data races with either OpenMP compiler,
based on experiments I did years ago.
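
A minimal sketch of the lost update behind the byte-array race, assuming
a hypothetical machine whose only store is a whole 4-byte word, so that a
byte store becomes load-word, patch, store-word. It is a single-threaded
replay of one bad interleaving, not output from either compiler, and the
names word_memory, load_word, patch_byte, store_word, lost_update_demo
and serialized_demo are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical memory model: the hardware moves whole 4-byte words only. */
typedef struct {
    uint32_t word;              /* the word that holds buf[0..3] */
} word_memory;

/* Step 1 of an emulated byte store: fetch the whole word. */
static uint32_t load_word(const word_memory *m)
{
    return m->word;
}

/* Step 2: replace byte `index` (0..3) in a private copy of the word. */
static uint32_t patch_byte(uint32_t copy, unsigned index, uint8_t value)
{
    unsigned shift = 8u * index;
    uint32_t mask = (uint32_t)0xFFu << shift;
    return (copy & ~mask) | ((uint32_t)value << shift);
}

/* Step 3: write the whole word back. */
static void store_word(word_memory *m, uint32_t copy)
{
    m->word = copy;
}

/* Replays the bad interleaving: both "threads" load before either stores. */
static uint32_t lost_update_demo(void)
{
    word_memory m = { 0 };
    uint32_t t_copy = load_word(&m);        /* thread t, iteration i=0 */
    uint32_t r_copy = load_word(&m);        /* thread r, iteration i=1 */
    t_copy = patch_byte(t_copy, 0, 0xAA);
    r_copy = patch_byte(r_copy, 1, 0xBB);
    store_word(&m, t_copy);
    store_word(&m, r_copy);                 /* stale byte 0 overwrites t's write */
    return m.word;
}

/* Same two byte stores, but each one completes before the next begins. */
static uint32_t serialized_demo(void)
{
    word_memory m = { 0 };
    store_word(&m, patch_byte(load_word(&m), 0, 0xAA));
    store_word(&m, patch_byte(load_word(&m), 1, 0xBB));
    return m.word;
}
```

In the interleaved replay the final word keeps only r's byte, while the
serialized replay keeps both. With byte-granularity stores in hardware
the two writes never share a read-modify-write, so nothing is lost.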

Finally, an OpenMP implementation that doesn't allow 2-byte datatypes
would be a nightmare to port codes to.

So I respectfully disagree with both your feeling (1st paragraph) and
guess (3rd paragraph) below.

- Grant

-----Original Message-----
From: Dieter an Mey [mailto:anmey at rz.rwth-aachen.de] 
Sent: Thursday, March 22, 2007 12:07 PM
To: Haab, Grant
Cc: Greg Bronevetsky; omp at openmp.org
Subject: Re: [Omp] A question about OpenMP 2.5

My feeling is that the (OpenMP) compilers we are looking at don't really
have any problem if they follow the language specifications.

I have never used an Alpha processor.
But imagine you want to access 4-byte data which are not aligned to a
4-byte boundary; then you will probably run into such a problem if the
processor is only able to load and store 4-byte-aligned data.

Thinking about Fortran, I would guess that an OpenMP compiler for such a
processor will not support any 2-byte datatypes at the same time,
because otherwise you could, by bad programming style (common,
equivalence), force 4-byte data onto 2-byte boundaries and run into that
problem.
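
A rough sketch of that layout in C rather than Fortran, assuming a
hypothetical byte buffer standing in for a common block with a 2-byte
item followed directly by a 4-byte item. The names common_block,
OFFSET_I4, store_i4, load_i4 and words_spanned are invented for
illustration, and alignas requires C11.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a COMMON block: a 2-byte item at offset 0
   followed immediately by a 4-byte item at offset 2, with no padding. */
typedef struct {
    alignas(4) unsigned char storage[8];
} common_block;

enum { OFFSET_I2 = 0, OFFSET_I4 = 2 };

/* Store the 4-byte item; memcpy keeps the unaligned access well defined in C. */
static void store_i4(common_block *c, uint32_t value)
{
    memcpy(c->storage + OFFSET_I4, &value, sizeof value);
}

/* Load the 4-byte item back from its 2-byte-aligned position. */
static uint32_t load_i4(const common_block *c)
{
    uint32_t value;
    memcpy(&value, c->storage + OFFSET_I4, sizeof value);
    return value;
}

/* Number of aligned 4-byte words touched by `size` bytes starting at `offset`. */
static unsigned words_spanned(size_t offset, size_t size)
{
    size_t first = offset / 4;
    size_t last = (offset + size - 1) / 4;
    return (unsigned)(last - first + 1);
}
```

Here words_spanned(OFFSET_I4, 4) is 2, so a processor restricted to
aligned 4-byte loads and stores would need two memory operations for
this one item, which is the situation described above.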

regards,
Dieter



Haab, Grant wrote:
> Dieter,
> 
> The problem Greg is describing is not data alignment at all, but
> instead what minimum data size can be used so that loads and stores are
> performed atomically by the processor and memory system hardware.  Most
> processors support byte-sized atomicity for regular loads and stores,
> but several have pointed out that the Alpha processors supported a
> minimum of 4-byte atomicity.
> 
> I know of no general-purpose processor that supports less than
> byte-granularity loads and stores, because a byte is the minimum
> addressable unit for most processors.  (I'm sure somebody will find a
> counterexample though ;-)
> 
> I don't believe the compiler can easily fix this problem because C and
> Fortran don't allow you to pad array elements to the minimum atomic
> load/store size.  That would break unions, equivalence and the like,
> not to mention make users very irate that their character array now
> takes 4 times more space!
> 
> - Grant
> 
> 
> 
> -----Original Message-----
> From: omp-bounces at openmp.org [mailto:omp-bounces at openmp.org] On Behalf
> Of Dieter an Mey
> Sent: Thursday, March 22, 2007 4:51 AM
> To: Greg Bronevetsky
> Cc: omp at openmp.org
> Subject: Re: [Omp] A question about OpenMP 2.5
> 
> I see what you mean.
> As a user I would expect that the compiler takes care of proper
> alignment etc. to avoid these "false sharing" effects which could lead
> to a data race.
> 
> I wonder to what extent this can really cause any problems on current
> hardware and how this has been taken care of by the current
> OpenMP-compliant compilers.
> 
> I assume that the compiler has to guarantee, and in many or all cases
> can guarantee, that elements are aligned such that any two elements of
> a structure, class etc. can be loaded and stored.
> 
> In Fortran you can try to force bad alignment by common blocks or
> equivalence, which would be bad programming practice anyway.
> I tried to create such a bad case, but I have not been "successful" yet.
> 
> I don't know to what extent C/C++ programmers can do this (unions or so?)
> 
> The question is, can primitive datatypes be forced to be so badly
> aligned that the compiler cannot generate single load/store
> instructions for those data elements.
> 
> regards
> Dieter
> 
> Greg Bronevetsky wrote:
>> The difference is illustrated by the following example. Suppose that
>> all memory operations operate at 4-byte granularity. The code in
>> question is:
>>    char buf[BUF_SIZE];
>>    #pragma omp for
>>    for(i=0; i<BUF_SIZE; i++)
>>       buf[i] = ?; 
>> Suppose that buf[] is 4-byte-aligned, thread t gets iteration i=0 and
>> thread r gets iteration i=1. t writes to address &buf, bringing the
>> memory range [&buf - &buf+4] into its cache. r writes to &buf+1, also
>> bringing the memory range [&buf - &buf+4] into its cache. When these
>> cache lines are finally evicted, each contains data that the other
>> does not. As such, regardless of which cache line we pick, we will
>> lose data.
>>
>> In short, when the system moves data at 4-byte granularity, writes by
>> multiple threads to the same 4-byte region are data races. It should
>> be noted that the above is the reverse of Dieter's example. We're
>> worrying about code that operates on memory locations of size x, while
>> the hardware supports memory transfers of size y. If x>=y (Dieter's
>> example), we have no problem. The problem is cases where x<y (the
>> above example).
>>
>>                              Greg Bronevetsky
>>
>> On Wed, 21 Mar 2007, Dieter an Mey wrote:
>>
>>> Well, Bronis and Greg, I still don't see why it should make any
>>> difference to any potential data race whether the "memory location"
>>> which is spoiling my fun is written with bit or page atomicity by the
>>> memory system of the hardware I am using.
>>> The results are thus unspecified or broken and may be correct a
>>> thousand times but may be wrong the 1001st time.
>>>
>>> I agree completely that there may be situations where it may be
>>> highly desirable to know which atomicity I have to deal with.
>>>
>>> For example on a Sparc system 64-bit floating point numbers may be 
>>> written or loaded by two 4-byte memory operations.
>>>
>>> And I would be happy to have an atomic directive for load and store 
>>> operations and not only for updates.
>>>
>>> best regards,
>>> Dieter
>>>
>>>
>>>> Bronis R. de Supinski wrote:
>>>> Dieter and all:
>>>>
>>>> Re:
>>>>>   >    If multiple threads write to the same ** memory location **
>>>> What is a memory location? It is a central question to
>>>> the memory model and is why Greg has said this has
>>>> implications for the memory model.
>>>>
>>>>>   >    without synchronization, the resulting ** memory content **
>>>>>   >    is unspecified. If at least one thread reads from
>>>> Anything that says some memory location becomes "unspecified"
>>>> is an issue for the memory model. The memory model must define
>>>> what the state of memory is after any action (legal or not).
>>>> In the case of a location becoming unspecified, it is equivalent
>>>> to a write of that location of random value lambda. The memory
>>>> model needs to state that this occurs.
>>>>
>>>>>   >    a shared ** memory location ** and at least one thread writes
>>>>>   >    to it without synchronization, the value seen by any reading
>>>>>   >    thread is unspecified.
>>>> Currently, we have no precise definition of a memory
>>>> location because stating that a memory location is more
>>>> than one bit could imply that an implementation must
>>>> write that much data atomically. In this case, we are
>>>> not talking about the OpenMP "atomic" construct but
>>>> hardware atomicity.
>>>>
>>>> Simply saying b is a pointer does not solve the problem.
>>>> Consider a simple variant of Brad's example in which bit
>>>> operations are used to write individual bits in a single byte.
>>>> By the suggested "variable" definitions the code would still
>>>> be correct. However, I know of no current hardware that
>>>> provides atomic writes to individual bits. The reality
>>>> is that writes to the same byte are a data race, even if
>>>> the code describes them as array operations to distinct
>>>> bits. I am certain our vendors would (rightly) oppose being
>>>> required to make that code work.
>>>>
>>>> Note that it is not clear where to define the hardware
>>>> atomicity level, which is why the specification has tried
>>>> to avoid doing so. I could easily argue that the right
>>>> level of write atomicity for a DSM implementation is at
>>>> the page granularity. While I don't think anyone would
>>>> accept that, it is very unclear where we stop. If Brad's
>>>> example used a char array, does it work? I would hope so...
>>>>
>>>>> This text just describes the circumstances of a data race.
>>>> Defining data races and what happens under them is the
>>>> primary role of the memory model. The example demonstrates
>>>> that we probably need to make some statement about the
>>>> minimum level at which the programmer can assume write
>>>> atomicity (in the hardware sense). This is a much bigger
>>>> issue than what I had intended to cover in the memory
>>>> model revisions, which were really just intended to be
>>>> clarifications and consolidations.
>>>>
>>>> Bronis
>>>>
>>>>
>>>>
>>>>> regards
>>>>> Dieter
>>>>>  >
>>>>>
>>>>> Brad Bell wrote:
>>>>>> I have a question about the OpenMP 2.5 standard
>>>>>>     http://www.openmp.org/drupal/mp-documents/spec25.pdf
>>>>>>
>>>>>> In Section 1.2.3 Data Terminology of spec25.pdf,
>>>>>> the following text appears:
>>>>>>
>>>>>>    variable
>>>>>>    A named data object, whose value can be defined and
>>>>>>    redefined during the execution of a program.
>>>>>>
>>>>>>    Only an object that is not part of another object is
>>>>>>    considered a variable. For example, array elements,
>>>>>>    structure components, array sections and substrings
>>>>>>    are not considered variables.
>>>>>>
>>>>>>
>>>>>> In Section 1.4.1 Structure of the OpenMP Memory Model of
>>>>>> spec25.pdf, the following text appears:
>>>>>>
>>>>>>    If multiple threads write to the same shared variable
>>>>>>    without synchronization, the resulting value of the variable
>>>>>>    in memory is unspecified. If at least one thread reads from
>>>>>>    a shared variable and at least one thread writes to it without
>>>>>>    synchronization, the value seen by any reading thread is
>>>>>>    unspecified.
>>>>>>
>>>>>> It appears to me, given the text above, that Example A.1.1.c
>>>>>> in the OpenMP 2.5 standard is not correct (or at least misleading).
>>>>>> Here is the code for that example:
>>>>>>
>>>>>>     void a1(int n, float *a, float *b)
>>>>>>     {
>>>>>>         int i;
>>>>>>     #pragma omp parallel for
>>>>>>         for (i=1; i<n; i++) /* i is private by default */
>>>>>>             b[i] = (a[i] + a[i-1]) / 2.0;
>>>>>>     }
>>>>>>
>>>>>> 1. As I understand the parallel command above, different threads
>>>>>> may execute the loop for different values of i.
>>>>>>
>>>>>> 2. As I understand, the variable b is a shared variable because it
>>>>>> is defined before the loop.
>>>>>>
>>>>>> 3. The argument b to the routine a1 may be an array, for example
>>>>>> it may be declared in the calling program by
>>>>>>     float b[SIZE];
>>>>>> where SIZE is any positive integer constant greater than or equal
>>>>>> to n.
>>>>>>
>>>>>> 4. In the case of 3 above, b is a variable, and b[i] is not a
>>>>>> variable, hence multiple threads may be writing to the same
>>>>>> variable; namely b.
>>>>>>
>>>>>> 5. Thus, in the case described above, the result of the loop is
>>>>>> undefined.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Omp mailing list
>>>>>> Omp at openmp.org
>>>>>> http://openmp.org/mailman/listinfo/omp
>>>>>>
>>>>> --
>>>>> --------------------------------------------------------------------
>>>>> Dieter an Mey
>>>>> High Performance Computing               Hochleistungsrechnen
>>>>> RWTH Aachen University                   Rechen- und Kommunikations-
>>>>> Center for Computing and Communication   zentrum der RWTH Aachen
>>>>> phone: ++49-(0)241-80-24377              Seffenter Weg 23
>>>>> fax:   ++49-(0)241-80-22134              52074 Aachen, Germany
>>>>> email: anmey at rz.rwth-aachen.de
>>>>> --------------------------------------------------------------------
>>>>>
>>>
>>
>>
>>
> 


