Locality-Centric Thread Scheduling for Bulk-synchronous Programming Models on CPU Architectures
Best Paper Award Nominee
Publication Year:
  Hee-Seok Kim, Izzat El Hajj, John A. Stratton, Steve S. Lumetta, Wen-mei Hwu
  International Symposium on Code Generation and Optimization (CGO)

With heterogeneous computing on the rise, executing programs efficiently on different devices from a single source has become increasingly important. OpenCL, which has a bulk-synchronous programming model, has been proposed as a framework for writing such performance-portable programs. The execution order of work-items in a program is unconstrained except at barrier synchronization events, giving an implementation some freedom when scheduling work-items between synchronization points. Many OpenCL (and CUDA) compilers have been designed to target multicore CPU architectures. However, prior work has scheduled work-items with a primary focus on correctness and vectorization. To the best of our knowledge, no existing implementation considers the impact of work-item scheduling on data locality. We propose an OpenCL compiler that performs data-locality-centric work-item scheduling. By analyzing the memory addresses accessed in loops within a kernel, our technique makes better decisions about how to schedule work-items so that they form better memory access patterns, thereby improving performance. Our approach achieves geomean speedups of 3.32× over AMD's and 1.71× over Intel's implementations on the Parboil and Rodinia benchmarks.
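To illustrate why the scheduling order between barriers can matter for locality, the following is a minimal sketch in C, not the compiler described in the paper. It assumes a hypothetical kernel in which work-item `wi` reads `in[k * NUM_ITEMS + wi]` at iteration `k` of an in-kernel loop; the sizes, function names, and the unit-stride counter are all illustrative assumptions. One schedule runs each work-item to completion before starting the next, the other interchanges the loops so that all work-items execute one kernel-loop iteration before moving to the next.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ITEMS 8   /* work-items in one work-group (hypothetical size) */
#define TRIP_COUNT 4  /* iterations of the loop inside the kernel (hypothetical) */

/* Assumed access pattern: work-item wi reads in[k * NUM_ITEMS + wi] at iteration k. */
static size_t address(int wi, int k) {
    return (size_t)k * NUM_ITEMS + (size_t)wi;
}

/* Tracks the previous address and counts accesses that land on the next element. */
typedef struct { size_t last; int have_last; int unit_stride; } Trace;

static void touch(Trace *t, size_t addr) {
    if (t->have_last && addr == t->last + 1) t->unit_stride++;
    t->last = addr;
    t->have_last = 1;
}

/* Schedule A: run each work-item to completion before starting the next one. */
int schedule_work_item_major(const float *in, float *out) {
    Trace t = {0, 0, 0};
    for (int wi = 0; wi < NUM_ITEMS; wi++) {
        float acc = 0.0f;
        for (int k = 0; k < TRIP_COUNT; k++) {
            touch(&t, address(wi, k));
            acc += in[address(wi, k)];
        }
        out[wi] = acc;
    }
    return t.unit_stride;
}

/* Schedule B: interchange the loops so all work-items execute iteration k together.
   Per-work-item state (acc) must be expanded into an array to survive the interchange. */
int schedule_loop_major(const float *in, float *out) {
    Trace t = {0, 0, 0};
    float acc[NUM_ITEMS] = {0.0f};
    for (int k = 0; k < TRIP_COUNT; k++) {
        for (int wi = 0; wi < NUM_ITEMS; wi++) {
            touch(&t, address(wi, k));
            acc[wi] += in[address(wi, k)];
        }
    }
    for (int wi = 0; wi < NUM_ITEMS; wi++) out[wi] = acc[wi];
    return t.unit_stride;
}

/* Runs both schedules on the same input, checks that the results agree, and
   returns how many more unit-stride accesses schedule B produces than A. */
int compare_schedules(void) {
    float in[NUM_ITEMS * TRIP_COUNT], out_a[NUM_ITEMS], out_b[NUM_ITEMS];
    for (int i = 0; i < NUM_ITEMS * TRIP_COUNT; i++) in[i] = (float)i;
    int a = schedule_work_item_major(in, out_a);
    int b = schedule_loop_major(in, out_b);
    for (int wi = 0; wi < NUM_ITEMS; wi++) assert(out_a[wi] == out_b[wi]);
    return b - a;
}
```

Both schedules are legal because no barrier separates the iterations, and both produce the same output. For this assumed access pattern, the loop-major order walks memory contiguously while the work-item-major order strides by `NUM_ITEMS`; a kernel that reads `in[wi * TRIP_COUNT + k]` instead would favor the opposite order, which is why the choice depends on analyzing the addresses each kernel actually accesses.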