[Omp] First Touch initialization
Shah, Sanjiv
sanjiv.shah at intel.com
Wed Mar 8 04:46:44 PST 2006
A further note about the single-array case. You are correct that neighbouring elements must have contiguous virtual addresses and be reachable by pointer arithmetic.
However, I think you are forgetting that there is a page table underneath the virtual address system that controls the mapping of virtual addresses to physical pages of memory. On NUMA, by using policies like first touch, you are also making sure that the physical pages are allocated in the memory local to the "touching" processors. So an array that is linear in virtual address space may be spread all over the machine physically.
Depending on your array sizes in relation to the page sizes, that can have a huge impact.
Since a page is the unit of placement, this also introduces false sharing issues at page-size granularity.
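
To make the array-size versus page-size point concrete, here is a small arithmetic sketch. The helper names are hypothetical, and the 4096-byte page size and page-aligned array start used in the note below are assumptions, not something stated in this thread.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages an array of n_bytes occupies,
 * assuming it starts on a page boundary (rounds up). */
size_t pages_spanned(size_t n_bytes, size_t page_size)
{
    return (n_bytes + page_size - 1) / page_size;
}

/* Hypothetical helper: how many whole elements fit in one page. */
size_t elems_per_page(size_t elem_size, size_t page_size)
{
    return page_size / elem_size;
}
```

With 8-byte doubles and an assumed 4096-byte page, one page holds 512 elements. An array of 1,000,000 doubles spans 1954 pages, so four threads can each own hundreds of pages and only the few pages at chunk boundaries are shared. An array of 1000 doubles spans just 2 pages, so it can land on at most two nodes no matter how many threads touch it.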
Sanjiv
--
Sanjiv, 217-419-4390
-----Original Message-----
From: Omp-bounces at openmp.org <Omp-bounces at openmp.org>
To: Francisco Jesús Martínez Serrano <franjesus at gmail.com>
CC: omp at openmp.org <omp at openmp.org>
Sent: Wed Mar 08 04:09:06 2006
Subject: Re: [Omp] First Touch initialization
Hi Francisco,
> Initializing shared arrays in parallel at the very beginning of the program
> will distribute the contents of each array according to the access pattern
> hence, in NUMA machines access will be much faster since it's local-node.
>
> We have tried it and it works indeed (Intel Fortran compiler v9 on 4-way
> Opteron),
> but I don't understand why.
First touch (or data placement in general) is not something
typically handled by a compiler. It is controlled by the Operating
System. Solaris has cc-NUMA support for example, but I believe
Linux supports it too these days.
The general rule is that the thread first touching (a chunk of)
the data gets it in its local memory. Typically such first touch
happens when initializing or reading the data for the first time.
If for example, you use "malloc" to allocate a chunk of memory,
nothing has happened yet. All the OS does is to reserve that
chunk for you.
The minute a thread then _accesses_ a portion (or all) of it,
first touch causes it to be owned by that thread.
This is why one can speed up an OpenMP program running on a
cc-NUMA system by parallelizing the data initialization phase.
Even adding a redundant initialization upfront could work out
well (in case the first touch is through sequential I/O for
example).
How you want to initialize the data in parallel depends on
how you access the data later on.
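
As a minimal sketch of the idea (not code from the original poster), the function below initializes an array in parallel and then reads it with the same static schedule, so each thread mostly works on the pages it touched first. The function name is hypothetical, and actual page placement depends on the OS policy being first touch. Compiled without OpenMP support the pragmas are ignored and the code runs serially.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical example: allocate n doubles, first-touch them in parallel,
 * then sum them with the same loop schedule.
 * Returns the sum, or -1.0 if the allocation fails. */
double first_touch_sum(size_t n)
{
    /* malloc typically only reserves the range; pages are placed on access. */
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;

    /* First touch: with schedule(static) each thread writes one contiguous
     * chunk, so the pages of that chunk are placed near that thread.
     * A signed loop index is used for older OpenMP versions. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 1.0;

    /* Later access uses the same schedule, so each thread reads the
     * chunk it initialized. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += a[i];

    free(a);
    return sum;
}
```

The important design choice is that the initialization loop and the compute loop use the same schedule and iteration range. If the later loops access the data differently (by columns, or with a dynamic schedule), the initialization should be changed to match.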
Kind regards,
Ruud
PS I did mention "malloc" for a good reason. With "calloc" the
data gets pre-initialized to zero and may therefore end up on
the wrong node.
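
One way to act on this, sketched under the assumption that a zero-filled array is what is wanted: allocate with "malloc" and let the threads do the zeroing. The helper name is hypothetical and is not part of any library.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical stand-in for calloc(n, sizeof(double)): the zeroing is done
 * by the OpenMP threads, so the first touch is spread across them instead
 * of happening on whichever thread performs the allocation.
 * Returns NULL if the allocation fails; the caller frees the result. */
double *alloc_zeroed_parallel(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    /* Each thread zeroes (and thereby first-touches) its own static chunk. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;

    return a;
}
```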
----------------------------------------------------------------
Senior Staff Engineer Email: ruud.vanderpas at sun.com
Scalable Systems Group Phone: +31-33-4515000 (x15920)
Sun Microsystems Fax : +31-33-4515001
----------------------------------------------------------------
_______________________________________________
Omp mailing list
Omp at openmp.org
http://openmp.org/mailman/listinfo/omp_openmp.org