ejd wrote:You haven't given me much information to go on.
Sorry about that; I was still straightening things out in my head at the time. The server runs SUSE Linux, and each processor is a 1.3 GHz Itanium 2 with 1 GB of memory. I believe each node has two processors, with the nodes tied together by SGI's NUMAlink. The shared memory pool is several gigabytes and, as my code is written, holds the main dataset. I read from that pool into temporary buffers of tens of megabytes, update the fluid data, and write it back to the shared pool. For synchronization, the domain is decomposed along one dimension so that each thread updates one block of data, with a lock at either end of each block to ensure that two threads never read and write the same region at the same time.
The main shared memory pool is allocated and initialized in the master thread, and that is the part I'm working on now. Having seen how easy it was to use threadprivate for the temporary buffers, I'm going to try partitioning the field data into private memory for each thread and then use regions of shared memory for communication between threads - almost like an MPI implementation. The funny thing is that this won't actually require much rewriting. Here's an example of what I wrote for the buffer memory and what I intend to do with most of the global data:
Code:
static buffer *tempBuffer = NULL;
static buffer *leftBuffer = NULL;
#pragma omp threadprivate(tempBuffer, leftBuffer)

/* Inside a parallel region: schedule(static,1) gives each thread one
   iteration, so each thread allocates its threadprivate buffers once.
   Note the braces - without them only the first malloc is in the loop. */
#pragma omp for schedule(static,1)
for (int i = 0; i < numThreads; i++) {
    tempBuffer = (buffer*) malloc(sx*sy*2*sizeof(buffer));
    leftBuffer = (buffer*) malloc(sx*sy*sizeof(buffer));
    int threadNum = omp_get_thread_num();
    printf("threadNum %d address = %p\n", threadNum, (void*)tempBuffer);
}
This is all based on my impressions of things, though. I don't know how to profile the OpenMP code: the people who run the server make no mention of any profiling software, and I can't find any kind of profiler in Visual Studio on my desktop either.
I also have an unrelated question that's been bothering me. For my main dataset, I currently allocate one large block of linear memory and then address it with *(address + x + (y*sizeX) + (z*sizeX*sizeY)). I did this because I didn't want to use triple indirection, but I've been wondering which is really the fastest way to manage large sets of data.
Thanks again for the help.