HyperLink   Data Layout Transformation Through In-Place Transposition
Publication Year:
  I-Jui Sung
  Diss. University of Illinois at Urbana-Champaign, 2013.

Matrix transposition is an important algorithmic building block for many numeric algorithms like multidimensional FFT. It has also been used to convert the storage layout of arrays. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application of in-place transposition algorithms from CPU lacks the amount of parallelism and locality required by GPU to achieve good performance. In this thesis we present the first known in-place matrix transposition approach for the GPUs. Our implementation is based on a staged transposition algorithm where each stage is performed using an elementary tiled-wise transposition. With both low-level optimizations to the elementary tiled-wise transpositions as well as high-level improvements to existing staged transposition algorithm, our design is able to reach more than 20GB/s sustained throughput on modern GPUs, and a 3X speedup. Furthermore, for many-core architectures like the GPUs, efficient off-chip memory access is crucial to high performance; the applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape the access patterns to improve off-chip memory access behavior, but several challenges had limited the use of automated data layout transformation systems on GPUs, namely how to efficiently handle arrays of aggregates, and transparently marshal data between layouts required by different performance sensitive kernels and legacy host code. While GPUs have higher memory bandwidth and are natural candidates for marshaling data between layouts, the relatively constrained GPU memory capacity, compared to that of the CPU, implies that not only the temporal cost of marshaling but also the spatial overhead must be considered for any practical layout transformation systems. As an application of the in-place transposition methodology, a novel apii proach to laying out arrays of aggregate types across GPU and CPU architectures is proposed to further improve memory parallelism and kernel performance beyond what is achieved by human programmers using discrete arrays today. Second, the system, DL, has a run-time library implemented in OpenCL that transparently and efficiently converts, or marshals, data to accommodate application components that have different data layout requirements. We present insights that lead to the design of this highly efficient run-time marshaling library. Third, we show experimental results that the new layout approach leads to substantial performance improvement at the applications level even when all marshaling cost is taken into account.