Dynamic Loop Vectorization for Executing OpenCL Kernels on CPUs
Publication Year: 2014
  Izzat El Hajj
  University of Illinois Master's Thesis, May 2014

Heterogeneous computing platforms are becoming increasingly important in supercomputing, and many systems now integrate CPUs and GPUs on a single node. Much effort is invested in tuning GPU kernels. However, some systems have no GPUs, and on others the GPUs may be busy. Maintaining separate GPU and CPU versions of the same code is expensive, so it would be ideal to retarget GPU-optimized kernels to run efficiently on a CPU. Many efforts have been made to compile OpenCL kernels for efficient CPU execution. Such approaches typically run work-groups in parallel on different CPU threads and execute the work-items within a work-group on a single thread, either serially via loop-based serialization or in parallel via SIMD vectorization. SIMD vectorization is particularly difficult in the presence of control divergence. This thesis proposes a technique for transforming divergent loops in OpenCL kernels so that vectorization opportunities can be extracted when possible and memory access patterns can be improved. The transformations show promising speedups for kernels that follow GPU programming best practices and slowdowns for kernels that do not.
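
The loop-based serialization mentioned in the abstract can be illustrated with a minimal C sketch. It is not taken from the thesis. The function names (scale_work_item, run_work_group, run_ndrange) and the scaling kernel are hypothetical, chosen only to show how a per-work-item kernel body might be wrapped in a loop so that one CPU thread runs a whole work-group. A real runtime would hand different work-groups to different threads; here they run one after another for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel body: each work-item scales one element.
 * In OpenCL C the index would come from get_global_id(0); here it is
 * computed from explicit parameters (an assumption of this sketch). */
static void scale_work_item(size_t local_id, size_t group_base,
                            const float *in, float *out, float factor)
{
    size_t gid = group_base + local_id;
    out[gid] = in[gid] * factor;
}

/* Loop-based serialization: all work-items of one work-group are
 * executed one after another on the calling thread. */
void run_work_group(size_t group_id, size_t local_size,
                    const float *in, float *out, float factor)
{
    size_t group_base = group_id * local_size;
    for (size_t lid = 0; lid < local_size; ++lid) {
        scale_work_item(lid, group_base, in, out, factor);
    }
}

/* Runs every work-group in turn. A CPU runtime would typically
 * distribute the iterations of this outer loop across threads. */
void run_ndrange(size_t num_groups, size_t local_size,
                 const float *in, float *out, float factor)
{
    assert(in != NULL && out != NULL);
    for (size_t g = 0; g < num_groups; ++g) {
        run_work_group(g, local_size, in, out, factor);
    }
}
```

Because every iteration of the inner loop here does the same work, a compiler could plausibly map it onto SIMD lanes. When the kernel body contains a loop whose trip count differs per work-item, the iterations diverge, which is the case the abstract identifies as difficult.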