The key to tuning OpenMP codes is to understand both where in the code the performance bottlenecks are and what is causing them (e.g. sequential code, load imbalance, cache misses, synchronisation, false sharing, etc.).
To begin with, you could do a set of runs on, say, 1, 2, 4, 6 and 8 threads to get a feel for the scaling behaviour.
You might then consider putting omp_get_wtime() calls around the whole code and around each parallel construct, accumulating the time spent in each construct.
This can tell you how well/badly each parallel construct is scaling, and also (by subtraction) how much time is spent in the sequential part of the program.
For a free performance analysis tool I would recommend Scalasca (http://www.scalasca.org/),
though as is usual with such tools there is something of a learning curve to get it working and to understand the output.
Hope that helps,