Heterogeneous architectures, by definition, include multiple processing components with very different microarchitectures
and execution models. In particular, computing platforms from supercomputers to smartphones can now
incorporate both CPU and GPU processors. Disparities between CPU and GPU processor architectures have naturally
led to distinct programming models and development patterns for each component. Developers targeting a specific
system decompose their application, assign different parts to different heterogeneous components, and express each
part in its assigned component's native model. But without additional effort, that application will not be suitable for another
architecture with a different balance of heterogeneous components. Developers addressing a variety of platforms must
either write a separate implementation for every potential heterogeneous component or fall back to a safe CPU
implementation, incurring a high development cost or a loss of system performance, respectively. The disadvantages
of developing for heterogeneous systems are vastly reduced if a single source-code implementation can be mapped to
either a CPU or a GPU architecture with high performance.
A convention has emerged from the OpenCL community defining how to write kernels for performance portability
among different GPU architectures. This paper demonstrates that OpenCL programs written according to this
convention contain enough abstract performance information to enable effective translations to CPU architectures as
well. The challenge is that an OpenCL implementation must prioritize these programming conventions over the
most natural mapping of the language specification to the target architecture. In particular, prior work implementing
OpenCL on CPU platforms neglects an OpenCL kernel's implicit expression of performance properties such as
spatial or temporal locality. We outline some concrete transformations that can be applied to an OpenCL kernel to
suitably map the abstract performance properties to CPU execution constructs. We show that such transformations
result in marked performance improvements over existing CPU OpenCL implementations for GPU-portable OpenCL
kernels. Ultimately, we show that the performance of GPU-portable OpenCL kernels, when using our methodology,
is comparable to the performance of native multicore CPU programming models such as OpenMP.