For each variable involved in a reduction, the compiler
makes a private copy of the variable for each processor.
The executable code for the loop containing the reduction manipulates
the private copy of the reduction variable in three
separate parts. First, the private copy is initialized
prior to executing the loop with the identity element for
the reduction operation (e.g., 0 for addition).
Second, the reduction operation is applied to the private copy
within the parallel loop. Finally, the program performs
a global accumulation following the loop execution whereby
all non-identity elements of the local copies of the variable are
accumulated into the original variable. Synchronization locks are
used to guard accesses to the original variable to guarantee that the
updates are atomic.
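
The three parts described above can be illustrated with a minimal hand-written sketch. It is not the compiler's generated code: it assumes a sum reduction, uses Python threads in place of processors, and the names `parallel_sum`, `IDENTITY`, `num_workers`, and `initial` are hypothetical, chosen only for illustration.

```python
import threading

IDENTITY = 0  # identity element for addition (assumed reduction operation)


def parallel_sum(values, num_workers=4, initial=0):
    """Sum `values` into a shared variable using one private copy per worker."""
    shared = {"total": initial}    # stands in for the original variable
    lock = threading.Lock()        # guards updates to shared["total"]

    def worker(chunk):
        private = IDENTITY         # part 1: initialize the private copy
        for v in chunk:            # part 2: apply the reduction privately
            private += v
        if private != IDENTITY:    # part 3: skip copies still at the identity
            with lock:             # the lock makes the global update atomic
                shared["total"] += private

    # Strided slices give each worker a disjoint share of the iterations.
    threads = [
        threading.Thread(target=worker, args=(values[i::num_workers],))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"]
```

Because each worker touches the shared variable at most once, the lock is acquired at most `num_workers` times in total rather than once per loop iteration, which is the main benefit of keeping the accumulation private until the loop has finished.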