Reduction is a common component of many applications, but can often be
the limiting factor for parallelization. Previous work on reductions has
focused on detecting reduction idioms and parallelizing the reduction
operation by minimizing data communication or improving data
locality. While these techniques can be useful, they are mostly limited
to simple code structures. In this paper, we propose a method for
exploiting more parallelism by isolating the reduction from users of the
intermediate results. Our second main contribution is
enabling the parallelization of more complex reduction codes, including
those that use intermediate reduction results. The
proposed transformations are often implemented by programmers in an
ad hoc manner, but to the best of our knowledge no previous work has
automated these transformations for many-core architectures. Using two
benchmark applications, we show that the automatic transformations can
yield significant speedups over the original code.
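
As a rough illustration of what isolating a reduction from the users of its intermediate results can look like, consider the minimal sketch below. It is not the transformation developed in this paper; the function names (`fused`, `isolated`) and the choice of a running sum are assumptions made only for the example. In the fused loop every iteration depends on the previous one through the running sum. After isolation, the reduction is computed on its own as a prefix sum, and the loop that consumes the intermediate results has no cross-iteration dependence.

```python
from itertools import accumulate


def fused(values, weights):
    """Original shape: the reduction and the use of its intermediate
    result are interleaved, so iterations must run in order."""
    out = []
    running = 0
    for v, w in zip(values, weights):
        running += v               # reduction step
        out.append(running * w)    # user of the intermediate result
    return out


def isolated(values, weights):
    """Hypothetical transformed shape: the reduction is separated
    from the users of its intermediate results."""
    # Step 1: the reduction alone, keeping every intermediate result
    # (an inclusive prefix sum). Parallel scan algorithms exist for this
    # step; accumulate() is a sequential stand-in here.
    prefix = list(accumulate(values))
    # Step 2: the users. Each iteration reads only its own prefix value,
    # so the iterations are independent of one another.
    return [p * w for p, w in zip(prefix, weights)]
```

Both functions compute the same result; only the second exposes the independent loop that a many-core target could distribute across threads.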