Google might help with finding published material. Basically, the compiler creates a private copy of the reduction variable for each thread and initializes it to the identity value of the reduction operator (0 for +, 1 for *, and so on). Each thread then accumulates into its own private copy. This part is pretty universal. Then, at the end of the construct, all the private copies are combined with the original value and the result is assigned to the original shared reduction variable.
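To make those steps concrete, here is a minimal sketch of what a sum reduction might look like if you wrote it out by hand, next to the built-in clause. The function names manual_sum and builtin_sum are made up for illustration, and the critical-section combine is just the simplest choice, not what any particular compiler is known to emit.

```c
/* Hand-written equivalent of reduction(+:sum): sums 1..n.
 * Illustrative only; a real runtime may combine differently. */
long manual_sum(int n)
{
    long sum = 0;                /* the original shared reduction variable */

    #pragma omp parallel
    {
        long local = 0;          /* private copy, set to the identity of + */

        /* Each thread accumulates its share of iterations privately. */
        #pragma omp for nowait
        for (int i = 1; i <= n; i++)
            local += i;

        /* Combine step: one thread at a time folds its copy in. */
        #pragma omp critical
        sum += local;
    }   /* implicit barrier at the end of the parallel region */

    return sum;                  /* safe to read after the region ends */
}

/* The same computation, letting the compiler do all of the above. */
long builtin_sum(int n)
{
    long sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += i;

    return sum;
}
```

Build with your compiler's OpenMP flag (for example -fopenmp with gcc). Without it the pragmas are ignored and both functions just run serially, which still gives the same answer.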
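The tricks happen in the combine. The obvious way to do it is with a critical section, but some runtimes use a tree structure (so the combine takes about log2(N) steps for N threads instead of N) or hardware assistance such as atomic operations or special combining networks. Also, watch out for nowait: it is a clause on the worksharing construct, not on a barrier, and it removes the implied barrier at the end of that construct. With nowait, the reduced value is not guaranteed to be in the original variable until the threads reach a later barrier.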
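For the tree-style combine, a rough sketch of the idea follows. It assumes each thread has already stored its private result in a slot of an array called partial; that array and the function name tree_combine are hypothetical, and real runtimes do this internally with their own data structures.

```c
/* Pairwise (tree) combine of n per-thread partial sums.
 * After the call the total is in partial[0]; the other slots
 * are left holding intermediate values.
 * Takes about log2(n) rounds instead of n serialized updates. */
long tree_combine(long *partial, int n)
{
    for (int stride = 1; stride < n; stride *= 2) {
        /* Within one round every pair touches different slots,
         * so the pairs can be combined concurrently. */
        #pragma omp parallel for
        for (int i = 0; i < n - stride; i += 2 * stride)
            partial[i] += partial[i + stride];
        /* The implicit barrier ending the parallel for keeps
         * one round from starting before the last one finishes. */
    }
    return n > 0 ? partial[0] : 0;
}
```

With five slots holding 1, 2, 3, 4, 5, the rounds use strides 1, 2 and 4, and partial[0] ends up as 15. Whether this beats a critical section depends on the thread count and the cost of the combine operation.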