Could you give us some more detail, in particular what the code inside the loop looks like?
Also, since this is C, where arrays are stored in row-major order, you really want to swap the order of the loops so that the inner loop walks along a row (consecutive memory) rather than down a column. This will make the code run much faster serially and also improve the scalability of the parallel version. On top of that, there is less overhead, because the parallel for-loop becomes the outermost loop, so the parallel region is entered once instead of once per iteration of the enclosing loop.
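Whether this can be done, though, depends on what is computed within the loop nest: swapping the loops is only safe if the result does not depend on the order in which the elements are visited.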
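Since I haven't seen your loop body, here is only a rough sketch of what I mean. The array size, the function names and the element-wise scaling are all made up for illustration; I am assuming each element is updated independently, which is what makes the swap legal here. Compile with your compiler's OpenMP flag (e.g. `-fopenmp` for GCC); without it the pragmas are ignored and the code runs serially.

```c
#define ROWS 256   /* hypothetical dimensions, not from the question */
#define COLS 256

/* Column-first order: the inner loop jumps COLS elements through memory
 * on every step, and the parallel region is started once per column. */
void scale_column_first(double a[ROWS][COLS], double factor)
{
    for (int j = 0; j < COLS; j++) {
        #pragma omp parallel for
        for (int i = 0; i < ROWS; i++) {
            a[i][j] *= factor;   /* assumed independent per element */
        }
    }
}

/* Row-first order: the inner loop touches consecutive elements, and the
 * parallel for is the outermost loop, so the threads are started once. */
void scale_row_first(double a[ROWS][COLS], double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            a[i][j] *= factor;
        }
    }
}
```

Both functions compute the same result for this kind of independent update; only the memory access pattern and the number of times the parallel region is entered differ. If your loop body reads values written in other iterations, the swap may not be valid as-is.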