You are correct: there are too many tasks. I think every compiler has some limit on the number of tasks, and once that limit is reached, the tasks are serialized(?). Anyway, as you suggested, I modified the code by placing the task construct at the level of the outermost i-loop, so that the number of tasks generated is limited to N instead of N^3. With 8 threads, the performance is now close to the OMP DO version (which is still the fastest).
Modified code: I have removed the outer task construct from the code block that calls test(..); the rest is intact.
Code:
!$OMP TASK FIRSTPRIVATE(...) PRIVATE(...) DEFAULT(SHARED) & !variables are either private or firstprivate here
!$OMP FIRSTPRIVATE(i) PRIVATE(j,k)
<embarrassingly parallel code>
!$OMP END TASK
END SUBROUTINE test
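For anyone who wants to try the pattern, here is a minimal self-contained sketch of the one-task-per-i arrangement described above. The array a, the size n, the program name, and the loop body are placeholders made up for illustration, not the actual code; only the structure (a single thread looping over i and calling test, which wraps its work in one task) follows the snippet above.

```fortran
program task_per_outer_iteration
  implicit none
  integer, parameter :: n = 64          ! hypothetical problem size
  real(8), allocatable :: a(:,:,:)      ! hypothetical shared work array
  integer :: i

  allocate(a(n, n, n))

  !$omp parallel default(shared)
  !$omp single
  ! One thread creates the tasks; the rest of the team executes them.
  do i = 1, n
     call test(i)                       ! one task per i, so N tasks in total
  end do
  !$omp end single
  !$omp end parallel
  ! The implicit barriers above guarantee all tasks have finished here.

  print *, 'checksum = ', sum(a)

contains

  subroutine test(i)
    integer, intent(in) :: i
    integer :: j, k

    ! i is copied into the task at creation time; j and k are per-task scratch.
    !$omp task firstprivate(i) private(j, k) default(shared)
    do k = 1, n
       do j = 1, n
          a(j, k, i) = real(i + j + k, 8)   ! placeholder for the real work
       end do
    end do
    !$omp end task
  end subroutine test

end program task_per_outer_iteration
```

This should build with something like `gfortran -fopenmp` (or the equivalent OpenMP flag for ifort or pgf90). Each task writes a distinct slab a(:,:,i), so the tasks do not need any synchronization among themselves.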
I am using the latest ifort (12.1.5) and pgf90 (12.3). Unfortunately, the ifort-compiled version produces NaNs and takes more time than the PGI version.
Any ideas for optimizing this further are welcome.