Matrix transposition is an important algorithmic building block for many
numeric algorithms, such as the multidimensional FFT. It has also been used
to convert the storage layout of arrays. Intuitively, in-place transposition
should be a good fit for GPU architectures, given their limited on-board
memory capacity and high memory throughput. However, directly applying
in-place transposition algorithms designed for CPUs fails to provide the
parallelism and locality that GPUs require to achieve good performance.
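To illustrate why, consider the classic cycle-following formulation of
in-place transposition (a standard CPU-oriented sketch, not the approach
developed in this thesis): each permutation cycle must be followed
sequentially, leaving little parallelism to exploit and producing scattered,
uncoalesced memory accesses.

    // Cycle-following in-place transposition of an m x n row-major matrix.
    // The element at linear index i moves to index (i*m) mod (m*n - 1); each
    // cycle of this permutation is followed one element at a time, so the
    // algorithm is inherently sequential within a cycle.
    #include <cstdlib>

    void transpose_in_place(float* a, int m, int n) {
        const long long len = (long long)m * n;
        bool* moved = (bool*)calloc(len, sizeof(bool));
        for (long long start = 1; start < len - 1; ++start) {
            if (moved[start]) continue;
            long long i = start;
            float carried = a[start];
            do {                                   // follow one permutation cycle
                long long next = (i * m) % (len - 1);
                float displaced = a[next];
                a[next] = carried;                 // drop the carried element in place
                carried = displaced;
                moved[next] = true;
                i = next;
            } while (i != start);
        }
        free(moved);
    }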
In this thesis we present the first known in-place matrix transposition
approach for GPUs. Our implementation is based on a staged transposition
algorithm in which each stage is performed using an elementary tile-wise
transposition. With both low-level optimizations to the elementary tile-wise
transpositions and high-level improvements to the existing staged
transposition algorithm, our design reaches a sustained throughput of more
than 20 GB/s on modern GPUs, and a 3X speedup.
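As a point of reference for the elementary operation, the sketch below shows
a conventional tile-wise transposition kernel in CUDA (a minimal
illustration, written out-of-place for clarity; it is not the thesis
implementation, which composes such tile transpositions into an in-place
staged algorithm). Each thread block stages one tile through on-chip shared
memory so that both the global load and the global store are coalesced.

    #define TILE 32

    // Transpose a (height x width) row-major matrix tile by tile. Launch with
    // grid((width+TILE-1)/TILE, (height+TILE-1)/TILE) and block(TILE, TILE).
    __global__ void transpose_tilewise(const float* in, float* out,
                                       int width, int height) {
        __shared__ float tile[TILE][TILE + 1];   // +1 pad avoids bank conflicts
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;     // origin of the transposed tile
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }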
Furthermore, for many-core architectures like GPUs, efficient off-chip
memory access is crucial to high performance; applications are often limited
by off-chip memory bandwidth. Transforming data layout is an effective way
to reshape access patterns and improve off-chip memory access behavior, but
several challenges have limited the use of automated data layout
transformation systems on GPUs: how to efficiently handle arrays of
aggregates, and how to transparently marshal data between the layouts
required by different performance-sensitive kernels and by legacy host code.
While GPUs have higher memory bandwidth and are natural candidates for
marshaling data between layouts, the relatively constrained GPU memory
capacity, compared to that of the CPU, implies that not only the temporal
cost of marshaling but also its spatial overhead must be considered in any
practical layout transformation system.
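The difficulty with arrays of aggregates can be made concrete with a small
CUDA example (the aggregate type and kernels are illustrative assumptions,
not code from the thesis): when each thread reads one field of an
array-of-structures (AoS) element, consecutive threads access memory at a
stride of the structure size, defeating coalescing, whereas discrete
per-field arrays yield fully contiguous accesses.

    struct Particle { float x, y, z, w; };       // hypothetical aggregate type

    // AoS: consecutive threads read p[i].x at a stride of sizeof(Particle),
    // so each warp touches four times as many memory segments as necessary.
    __global__ void scale_x_aos(Particle* p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x *= s;
    }

    // Discrete arrays: consecutive threads read consecutive floats,
    // giving fully coalesced accesses.
    __global__ void scale_x_discrete(float* x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }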
As an application of the in-place transposition methodology, this thesis
makes three contributions. First, a novel approach to laying out arrays of
aggregate types across GPU and CPU architectures is proposed to further
improve memory parallelism and kernel performance beyond what human
programmers achieve with discrete arrays today.
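One way to picture a layout that goes beyond discrete arrays is sketched
below (an illustrative tiled hybrid of AoS and structure-of-arrays; the
field names and tile size are assumptions, and this is not necessarily the
exact layout proposed in the thesis): aggregates are grouped into fixed-size
tiles, and each field is stored contiguously within a tile, preserving
coalesced access while keeping the fields of nearby elements close together.

    #define TILE 64

    // Tiled hybrid layout: within each tile, every field occupies a
    // contiguous run of TILE floats.
    struct ParticleTile { float x[TILE], y[TILE], z[TILE], w[TILE]; };

    // Element i lives in tile i / TILE at in-tile offset i % TILE.
    __device__ float load_x(const ParticleTile* tiles, int i) {
        return tiles[i / TILE].x[i % TILE];
    }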
Second, the proposed system, DL, provides a run-time library implemented in
OpenCL that transparently and efficiently converts, or marshals, data to
accommodate application components that have different data layout
requirements. We present the insights that led to the design of this highly
efficient run-time marshaling library. Third, we present experimental
results showing that the new layout approach leads to substantial
performance improvements at the application level, even when all marshaling
costs are taken into account.
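To give a flavor of in-place marshaling (a deliberate simplification, not
DL's implementation; the tile and field sizes are assumptions), the sketch
below has each thread block stage one tile of aggregates through shared
memory, synchronize, and write the same global locations back in field-major
order, so the layout conversion needs no auxiliary global buffer.

    #define T 64   // aggregates per tile (assumed)
    #define S 4    // float fields per aggregate (assumed)

    // Convert one tile from AoS order to field-major order in place:
    // read the whole tile into shared memory, barrier, then scatter it
    // back to the same global range in the new order.
    __global__ void marshal_tile_in_place(float* data) {
        __shared__ float tile[T * S];
        float* base = data + (long long)blockIdx.x * T * S;
        for (int i = threadIdx.x; i < T * S; i += blockDim.x)
            tile[i] = base[i];
        __syncthreads();
        for (int i = threadIdx.x; i < T * S; i += blockDim.x) {
            int a = i / S, f = i % S;            // aggregate and field indices
            base[f * T + a] = tile[a * S + f];   // AoS -> field-major
        }
    }

Launching one block per tile, for example
marshal_tile_in_place<<<numTiles, 128>>>(d_data), converts the whole array
tile by tile with no extra device memory beyond the on-chip staging buffer.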