I'm having trouble creating private arrays for a simulation program. Each thread needs its own buffer for intermediate results, but I can't find an efficient way to provide it. At first I used automatic storage, declaring the buffer as buffer[x][y][z] inside the parallel for loop, and hoped it would persist between chunks of the loop so that each thread kept one permanently allocated buffer.
However, once I scaled the program up, these automatic buffers overflowed the stack. The only solution I could think of was to malloc the buffer at the start of each block and free it at the end, so a large array is allocated and destroyed for every chunk of the for loop. To make matters worse, I've read that malloc and free are made thread-safe by built-in locks. Every time step, each thread works through several blocks and has to do this for each one, which is far from ideal.
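To make the pattern concrete, here is a simplified sketch of what I'm doing now, assuming OpenMP in C. The names (run_timestep, NX, NY, NZ, N_BLOCKS) and the sizes are placeholders, not my real code, and the "work" is just a stand-in:

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder sizes, not the real simulation's dimensions. */
enum { NX = 16, NY = 16, NZ = 16, N_BLOCKS = 8 };

/* Processes every block once and returns the sum of the per-block results. */
double run_timestep(void)
{
    double total = 0.0;

    #pragma omp parallel for reduction(+:total)
    for (int b = 0; b < N_BLOCKS; ++b) {
        /* Allocated afresh for every block: this is the cost I want to avoid. */
        double (*buffer)[NY][NZ] = malloc(sizeof(double[NX][NY][NZ]));
        assert(buffer != NULL);  /* sketch only; real code should handle failure */

        /* Stand-in for writing intermediate results into the buffer. */
        for (int x = 0; x < NX; ++x)
            for (int y = 0; y < NY; ++y)
                for (int z = 0; z < NZ; ++z)
                    buffer[x][y][z] = 1.0;

        /* Stand-in for consuming the intermediate results. */
        double sum = 0.0;
        for (int x = 0; x < NX; ++x)
            for (int y = 0; y < NY; ++y)
                for (int z = 0; z < NZ; ++z)
                    sum += buffer[x][y][z];

        total += sum;
        free(buffer);  /* destroyed again at the end of the block */
    }
    return total;
}
```

In the real program run_timestep would be called once per time step, so the malloc/free pair runs N_BLOCKS times per step.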
I tried allocating the memory outside the parallel region and using firstprivate, but that only copies the pointer, so every thread ends up pointing at the same buffer. The parallel region is made up of two single sections and the parallel for, so I can't think of a way to allocate a buffer for each thread and have it persist from chunk to chunk and from time step to time step. Any help would be greatly appreciated.
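A small sketch of the firstprivate problem, again assuming OpenMP in C with made-up names (iterations_sharing_buffer, original) and an arbitrary buffer size. Each thread gets its own copy of the pointer variable, but the copies all hold the same address:

```c
#include <stdlib.h>

/* Returns how many loop iterations saw a private pointer that still
 * refers to the single allocation made outside the loop, or -1 if the
 * allocation failed. */
int iterations_sharing_buffer(int n_iters)
{
    double *buffer = malloc(1024 * sizeof *buffer);
    if (buffer == NULL)
        return -1;

    double *const original = buffer;  /* shared, read-only copy of the address */
    int same = 0;

    /* firstprivate gives each thread its own pointer variable,
     * initialised from the original; the memory it points to is not copied. */
    #pragma omp parallel for firstprivate(buffer) reduction(+:same)
    for (int i = 0; i < n_iters; ++i) {
        if (buffer == original)
            ++same;
    }

    free(buffer);
    return same;
}
```

Every iteration reports the same address, so all threads would be writing into one shared block rather than a buffer each.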
