Heterogeneous computing platforms are becoming increasingly important in
supercomputing. Many systems now integrate CPUs and GPUs that cooperate
on a single node. Much effort is invested in tuning GPU kernels.
However, some systems have no GPUs, and on others the GPUs may be busy.
Maintaining separate GPU and CPU versions of the same code is expensive.
It would therefore be ideal to retarget GPU-optimized kernels to run
efficiently on a CPU.
Many efforts have been made to compile OpenCL kernels to run efficiently
on CPUs. Such approaches typically run work-groups in parallel on
different CPU threads and execute the work-items of each work-group within
one thread, either serially via loop-based serialization or in parallel
via SIMD vectorization.
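
As a minimal sketch of loop-based serialization, the C code below runs a
vector-add kernel body once per work-item inside an ordinary loop. The names
vec_add_item, run_work_group, and run_ndrange are hypothetical and are not
taken from any particular OpenCL runtime. The outer loop over work-groups is
sequential here only to keep the sketch short; a real runtime would
presumably assign each work-group to a different CPU thread.

```c
#include <stddef.h>

/* Hypothetical kernel body for a single work-item: c[gid] = a[gid] + b[gid]. */
static void vec_add_item(const float *a, const float *b, float *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Loop-based serialization: one CPU thread executes every work-item of a
 * work-group in turn. The loop variable plays the role of the local ID. */
static void run_work_group(const float *a, const float *b, float *c,
                           size_t group_id, size_t local_size)
{
    for (size_t lid = 0; lid < local_size; ++lid) {
        size_t gid = group_id * local_size + lid; /* global ID of this work-item */
        vec_add_item(a, b, c, gid);
    }
}

/* Assumption: work-groups run one after another here. They are independent,
 * so this loop is where a runtime could distribute them across CPU threads. */
void run_ndrange(const float *a, const float *b, float *c,
                 size_t num_groups, size_t local_size)
{
    for (size_t g = 0; g < num_groups; ++g)
        run_work_group(a, b, c, g, local_size);
}
```

Calling run_ndrange(a, b, c, 2, 4) processes eight elements as two
work-groups of four work-items each. Because the loop body in run_work_group
has no control flow, a compiler can also vectorize it across work-items.
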
SIMD vectorization is particularly difficult in the presence of control
divergence. This thesis proposes a technique for transforming divergent loops
in OpenCL kernels so that vectorization opportunities can be extracted
where possible and memory access patterns can be improved. The transformations
presented show promising speedups for kernels that follow GPU
programming best practices, and slowdowns for kernels that do not.
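
To make the control divergence problem concrete, the C sketch below shows a
loop whose trip count differs per work-item, first in serialized form and
then in a generic lock-step form with a per-lane predicate. This is a
textbook illustration only and may differ from the transformations this
thesis presents. The names sum_serial, sum_lockstep, and WG_SIZE are
hypothetical.

```c
#include <stddef.h>

#define WG_SIZE 4 /* hypothetical work-group size, also the assumed SIMD width */

/* Serialized form: each work-item runs its own loop to completion. The inner
 * trip count depends on per-item data, so neighbouring work-items diverge and
 * the outer loop cannot simply be mapped onto SIMD lanes. */
void sum_serial(const int *trip, int *out)
{
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        int acc = 0;
        for (int i = 0; i < trip[lid]; ++i)
            acc += i;
        out[lid] = acc;
    }
}

/* Lock-step form: the loops are interchanged so that all work-items advance
 * through iteration i together. A predicate masks out work-items whose own
 * loop has already finished, and the loop ends once no work-item is active. */
void sum_lockstep(const int *trip, int *out)
{
    int acc[WG_SIZE] = {0};
    int any_active = 1;

    for (int i = 0; any_active; ++i) {
        any_active = 0;
        /* Fixed trip count and branch-free body: a candidate for SIMD. */
        for (size_t lid = 0; lid < WG_SIZE; ++lid) {
            int active = i < trip[lid];
            acc[lid] += active ? i : 0;
            any_active |= active;
        }
    }
    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        out[lid] = acc[lid];
}
```

Both functions compute the same results. The lock-step form runs for as many
iterations as the longest-running work-item needs, so lanes that finish early
do masked, wasted work. It therefore tends to pay off only when the trip
counts within a work-group are similar.
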