I'm trying to implement a simple panel-based Cholesky factorization using a DAG scheduler implemented via locks.
Main code: http://pastebin.com/m5e186800
Wrapper program: http://pastebin.com/d25c339a8
This works fine in serial, achieving a good residual of O(1e-14).
In parallel, it seems that the data in matrix A isn't being communicated between threads. Specifically, block updates (TASK_UB) get the wrong data, which should have come from the column solve (TASK_SC); likewise, column solves aren't getting the correct data from the diagonal block factorizations (TASK_FD).
As far as I can tell this shouldn't happen: setting and unsetting the locks on these blocks should imply a flush, which forces the data to be consistent between threads. Yet with all compilers, on problems of sufficient size, I get poor residuals of O(1). This is on a system with two quad-core Intel chips.
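To make that expectation concrete, here is a minimal sketch of the handoff pattern described above. It is not the pastebin code: it assumes OpenMP locks in C, and the names block_t, produce, consume and run_handoff are hypothetical. One thread fills a block and marks it done while holding the block's lock; the other polls the flag under the same lock and only then reads the data. The stub section is only there so the sketch also builds serially without OpenMP.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption: only used when OpenMP is disabled). */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static int  omp_get_thread_num(void)        { return 0; }
static int  omp_get_num_threads(void)       { return 1; }
#endif

#define NB 4  /* hypothetical block size */

/* Hypothetical block: its data plus a "done" flag, both guarded by lock. */
typedef struct {
    omp_lock_t lock;
    int done;
    double data[NB];
} block_t;

/* Producer (think TASK_SC): write the block and mark it done under the lock. */
static void produce(block_t *b)
{
    omp_set_lock(&b->lock);
    assert(!b->done);              /* each block is produced exactly once */
    for (int i = 0; i < NB; ++i)
        b->data[i] = i + 1.0;
    b->done = 1;
    omp_unset_lock(&b->lock);      /* unset implies a flush of the writes */
}

/* Consumer (think TASK_UB): poll the flag under the lock, then read. */
static double consume(block_t *b)
{
    double sum = 0.0;
    int ready = 0;
    while (!ready) {
        omp_set_lock(&b->lock);    /* set implies a flush before the reads */
        ready = b->done;
        if (ready)
            for (int i = 0; i < NB; ++i)
                sum += b->data[i];
        omp_unset_lock(&b->lock);
    }
    return sum;
}

/* Run one producer/consumer handoff and return what the consumer saw. */
double run_handoff(void)
{
    block_t b;
    double result = 0.0;

    b.done = 0;
    omp_init_lock(&b.lock);
    #pragma omp parallel num_threads(2) shared(b, result)
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        if (tid == 0)
            produce(&b);
        /* With a single thread this runs after produce, so it cannot spin forever. */
        if (tid == nth - 1)
            result = consume(&b);
    }
    omp_destroy_lock(&b.lock);
    return result;
}
```

In this sketch both the flag and the data are only ever touched while the lock is held. If the real code reads or writes part of a block outside the matching lock, or tests a status flag without holding it, the implied flush would not cover that access.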
Any ideas on why this isn't working would be appreciated.