I am facing the very same problem. I have a lot of loops in my code, which means there is a lot of parallelisation. Still, the efficiency is minimal. I hope kazempour will post here if he has found a solution for this.
I suggest you follow the ideas in Mark's post. Also, having a "lot of loops" does not necessarily imply there is a lot of parallelisation. Put another way: parallelising every loop will not necessarily give the best, or even better, performance.
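To illustrate why a loop is not automatically a parallel loop, here is a minimal sketch in Python. It is not taken from your code. The function names (`scale`, `running_sum`, `naive_chunked`) are made up for this example, and `naive_chunked` only imitates a naive parallel split by running the loop on each chunk in isolation, without using any real threading.

```python
def scale(values, factor):
    # Each iteration touches only its own element, so the
    # iterations are independent and can be split freely.
    out = [0.0] * len(values)
    for i in range(len(values)):
        out[i] = values[i] * factor
    return out


def running_sum(values):
    # Iteration i needs the total produced by iteration i-1:
    # a loop-carried dependency.
    out = [0.0] * len(values)
    total = 0.0
    for i in range(len(values)):
        total += values[i]
        out[i] = total
    return out


def naive_chunked(loop, values, chunks=2):
    # Hypothetical helper: run `loop` on each chunk separately and
    # concatenate the results, as a naive parallel split would.
    size = max(1, (len(values) + chunks - 1) // chunks)
    result = []
    for start in range(0, len(values), size):
        result.extend(loop(values[start:start + size]))
    return result


data = [1.0, 2.0, 3.0, 4.0]

# Independent loop: the chunked result matches the serial one.
print(naive_chunked(lambda v: scale(v, 2.0), data) == scale(data, 2.0))

# Dependent loop: the second chunk restarts from zero, so the
# chunked result differs from the serial one.
print(naive_chunked(running_sum, data) == running_sum(data))
```

The first loop can be split without changing the result. The second cannot be split this way, because each chunk would need the total from the chunk before it. Even for loops of the first kind, parallelising only pays off if each iteration does enough work to outweigh the cost of distributing it.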