With heterogeneous computing on the rise, executing programs
efficiently on different devices from a single source has become
increasingly important. OpenCL, with its bulk-synchronous
programming model, has been proposed as a framework for writing
such performance-portable programs.
The execution order of work-items in a program is unconstrained
except at barrier synchronization points, which gives an
implementation some freedom when scheduling work-items between
those points.
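To make this concrete, the following illustrative OpenCL C kernel (not taken from the paper; the kernel name and body are assumptions chosen for illustration) shows the two scheduling regions created by a barrier: within each region an implementation may execute the work-items of a work-group in any order, but no work-item may cross the barrier until all of them have reached it.

/* Illustrative OpenCL C kernel: the barrier splits the work-group into two
 * scheduling regions; within each region, work-items may run in any order. */
__kernel void scale_then_sum(__global float *data, __local float *tile) {
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    /* Region 1: each work-item writes its own local-memory slot.
     * The implementation may interleave these work-items freely. */
    tile[lid] = data[gid] * 2.0f;

    /* No work-item proceeds past this point until every work-item in the
     * work-group has reached it. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Region 2: now safe to read a neighbour's value written before the barrier. */
    size_t next = (lid + 1) % get_local_size(0);
    data[gid] = tile[lid] + tile[next];
}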
Many OpenCL (and CUDA) compilers have been designed to target
multicore CPU architectures. However, prior work has scheduled
work-items with a primary focus on correctness and vectorization.
To the best of our knowledge, no existing implementation considers
the impact of work-item scheduling on data locality.
We propose an OpenCL compiler that performs data-locality-centric
work-item scheduling. By analyzing the memory addresses accessed in
loops within a kernel, our technique makes better decisions on how
to schedule work-items so as to construct better memory access
patterns, thereby improving performance. Our approach achieves
geomean speedups of 3.32× over AMD's and 1.71× over Intel's
implementations on the Parboil and Rodinia benchmarks.
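As a rough intuition for why the schedule matters on a CPU, consider a simplified sketch in plain C (not the paper's actual compiler transformation; the function names and the fixed kernel body are illustrative assumptions) of a work-group whose body writes out[y * width + x]. Serializing the work-group with x as the innermost loop yields unit-stride, cache-friendly accesses, while serializing with y innermost strides by width elements on every iteration.

#include <stddef.h>

/* x innermost: consecutive iterations touch consecutive addresses. */
void workgroup_x_inner(float *out, size_t width, size_t lsx, size_t lsy) {
    for (size_t y = 0; y < lsy; ++y)
        for (size_t x = 0; x < lsx; ++x)
            out[y * width + x] = (float)(x + y);   /* unit-stride writes */
}

/* y innermost: each iteration jumps by `width` floats, touching a new
 * cache line almost every access. */
void workgroup_y_inner(float *out, size_t width, size_t lsx, size_t lsy) {
    for (size_t x = 0; x < lsx; ++x)
        for (size_t y = 0; y < lsy; ++y)
            out[y * width + x] = (float)(x + y);   /* stride-width writes */
}

Choosing between such loop orders based on the address expressions observed in the kernel's loops is the kind of decision a data-locality-centric work-item scheduler can make.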