Your reasoning is correct for how typical hardware works, but the OpenMP memory model is weaker than the memory model of most real hardware (deliberately so, so that OpenMP can be implemented on a wide range of systems).
A possible sequence of events is:
thread 1: flush (line 28)
thread 0: data = 42 (line 14)
thread 0: flush (line 18)
thread 0: flag = 1 (line 20)
thread 1: while (flag < 1) condition is false, so the loop exits (line 29)
thread 1: read data (line 34)
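This sequence does not satisfy the required write(0)-flush(0)-flush(1)-read(1) ordering (see the top of page 18 of the OpenMP 4.0 RC2 specification), because thread 1's only flush happened before thread 0 wrote data.
After the flush on line 35, which thread 1 reaches only after it has seen flag == 1, the ordering is guaranteed, so a read of data after that point sees the value thread 0 wrote.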
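A minimal sketch of the corrected hand-off, assuming a producer/consumer shape like the one in the question (the asker's code isn't reproduced here, and the function name `handoff` and the variable `seen` are hypothetical). The point it illustrates is the flush placed after the spin loop and before data is read. I've also made the accesses to flag atomic, which is the usual way to keep the flag itself free of a data race. It assumes the runtime gives the region two threads, or runs the sections in order if it only gets one; otherwise the consumer could spin forever.

```c
/* Shared state, as in the question's example. */
static int data = 0;
static int flag = 0;

/* Hypothetical helper for this sketch: one section produces, the other
   consumes. Returns the value of data the consumer read. */
int handoff(void)
{
    int seen = -1;   /* shared by default: declared outside the construct */

    data = 0;
    flag = 0;

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            /* Producer (thread 0 in the question). */
            data = 42;
            /* flush(0): orders the write to data before the write to flag. */
            #pragma omp flush
            #pragma omp atomic write
            flag = 1;
            #pragma omp flush
        }

        #pragma omp section
        {
            /* Consumer (thread 1 in the question). */
            int f = 0;
            while (f < 1) {
                /* Flush on every iteration so flag is actually reread. */
                #pragma omp flush
                #pragma omp atomic read
                f = flag;
            }
            /* flush(1): executed only after flag == 1 was observed, which
               completes the write(0)-flush(0)-flush(1)-read(1) chain. */
            #pragma omp flush
            seen = data;
        }
    }
    return seen;
}
```

Compile with OpenMP enabled (for example `gcc -fopenmp`); without it the pragmas are ignored and the two sections simply run one after the other, so `handoff()` still returns 42.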
Thank you for looking more into this.
You're very welcome!