DL: Data Layout Transformation System for Heterogeneous Computing
   
Publication Year:
  2012
Authors:
  I-Jui Sung, Daniel Liu, Wen-mei Hwu
   
Published:
  IEEE Innovative Parallel Computing (InPar 2012), San Jose, CA, May 13-14, 2012
   
Abstract:

For many-core architectures like GPUs, efficient off-chip memory access is crucial to high performance; applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape access patterns and improve off-chip memory access behavior, but several challenges have limited the use of automated data layout transformation systems on GPUs, namely how to efficiently handle arrays of aggregates and how to transparently marshal data between the layouts required by different performance-sensitive kernels and legacy host code. While GPUs have higher memory bandwidth and are natural candidates for marshaling data between layouts, the relatively constrained GPU memory capacity, compared to that of the CPU, implies that not only the temporal cost of marshaling but also the spatial overhead must be considered for any practical layout transformation system.

This paper presents DL, a practical GPU data layout transformation system that addresses these problems. First, a novel approach to laying out arrays of aggregate types across GPU and CPU architectures is proposed to further improve memory parallelism and kernel performance beyond what is achieved by human programmers using discrete arrays today. Our proposed new layout can be derived in situ from the traditional Array of Structures, Structure of Arrays, and adjacent Discrete Arrays layouts used by programmers. Second, DL has a run-time library implemented in OpenCL that transparently and efficiently converts, or marshals, data to accommodate application components that have different data layout requirements. We present insights that lead to the design of this highly efficient run-time marshaling library. In particular, the in situ transformation implemented in the library is comparable to or faster than optimized traditional out-of-place transformations while avoiding doubling the GPU DRAM usage. Third, we present experimental results showing that the new layout approach leads to substantial performance improvement at the application level even when all marshaling cost is taken into account.
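
To make the space trade-off between out-of-place and in situ marshaling concrete, the following is a minimal Python sketch of converting a flat Array of Structures buffer into a Structure of Arrays buffer in both ways. It is an illustration only, not DL's OpenCL library and not the new layout proposed in the paper: the function names and the flat-list representation are assumptions made for this example, and the in-place version uses textbook cycle-following transposition with a visited bitmap, which a production implementation would likely avoid.

```python
def aos_to_soa_out_of_place(aos, n_records, n_fields):
    """Return a new SoA buffer; needs a second buffer as large as the input."""
    soa = [None] * len(aos)
    for r in range(n_records):
        for f in range(n_fields):
            # AoS index: record-major. SoA index: field-major.
            soa[f * n_records + r] = aos[r * n_fields + f]
    return soa


def aos_to_soa_in_place(buf, n_records, n_fields):
    """Rearrange buf from AoS to SoA without a second data buffer.

    AoS -> SoA is a transposition of an n_records x n_fields matrix.
    The element at index i moves to (i * n_records) mod (total - 1);
    the first and last elements stay where they are.
    """
    total = n_records * n_fields
    if total <= 1:
        return
    last = total - 1
    # Assumption for simplicity: one flag per element to mark finished cycles.
    visited = [False] * total
    for start in range(1, last):
        if visited[start]:
            continue
        carried = buf[start]
        i = start
        while True:
            dest = (i * n_records) % last
            # Drop the carried value at its destination, pick up what was there.
            buf[dest], carried = carried, buf[dest]
            visited[dest] = True
            i = dest
            if i == start:
                break
```

The out-of-place version is simpler but temporarily doubles the memory footprint, which is the overhead the abstract says matters on capacity-constrained GPU memory. The in-place version only needs the original buffer plus the bookkeeping flags used in this sketch.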