Matrix transposition is an important algorithmic building block for many numerical algorithms such as the FFT. With more and
more algebra libraries offloading computation to GPUs, high-performance in-place transposition becomes a necessity. Intuitively, in-place
transposition should be a good fit for GPU architectures due to their limited on-board memory capacity and high memory throughput.
However, direct application of CPU in-place transposition algorithms lacks the parallelism and locality required by GPUs to
achieve good performance. In this paper we present our in-place matrix transposition approach for GPUs, which is performed as a sequence of
elementary tile-wise transpositions. We propose low-level optimizations for the elementary transpositions and find the best-performing
configurations for them. We then compare all sequences of elementary transpositions that achieve a full transposition, and determine which is the most
favorable for each matrix. We present a heuristic to guide the selection of tile sizes, and compare it to brute-force search. We
diagnose the drawback of our approach, and propose a solution using minimal padding. With fast padding and unpadding
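To make the notion of an elementary tile-wise transposition concrete, the following is a minimal CUDA sketch, not the implementation evaluated in this paper, that transposes a square matrix in place by swapping TILE x TILE tiles through shared memory. The kernel name, the tile size, and the restriction to square matrices whose dimension is a multiple of TILE are assumptions made for illustration; rectangular matrices require composing sequences of elementary transpositions as discussed above.

// Illustrative sketch only: in-place transposition of a square n x n matrix
// by exchanging TILE x TILE tiles through shared memory. Assumes n is a
// multiple of TILE and a (TILE, TILE) thread block.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 32

__global__ void transpose_square_inplace(float *A, int n)
{
    __shared__ float tile_a[TILE][TILE + 1];  // +1 column avoids bank conflicts
    __shared__ float tile_b[TILE][TILE + 1];

    int bx = blockIdx.x;   // tile column index
    int by = blockIdx.y;   // tile row index
    if (by > bx) return;   // each unordered tile pair is handled by one block

    int x  = bx * TILE + threadIdx.x;   // element in tile (by, bx)
    int y  = by * TILE + threadIdx.y;
    int mx = by * TILE + threadIdx.x;   // element in the mirror tile (bx, by)
    int my = bx * TILE + threadIdx.y;

    // Load the tile and its mirror into shared memory.
    tile_a[threadIdx.y][threadIdx.x] = A[y * n + x];
    tile_b[threadIdx.y][threadIdx.x] = A[my * n + mx];
    __syncthreads();

    // Write each tile back, transposed, to the other tile's location.
    // On the diagonal (bx == by) both writes coincide and the tile is
    // simply transposed in place.
    A[y * n + x]   = tile_b[threadIdx.x][threadIdx.y];
    A[my * n + mx] = tile_a[threadIdx.x][threadIdx.y];
}

int main(void)
{
    const int n = 1024;                       // matrix dimension, multiple of TILE
    size_t bytes = (size_t)n * n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);            // blocks below the diagonal exit early
    transpose_square_inplace<<<grid, block>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    // Spot-check: element (1, 2) should now hold the original value at (2, 1).
    printf("h[1*n+2] = %f (expected %f)\n", h[1 * n + 2], (float)(2 * n + 1));

    cudaFree(d);
    free(h);
    return 0;
}

Handling only one unordered tile pair per block keeps the exchange race-free without extra synchronization, at the cost of launching idle blocks below the diagonal; the paper's elementary transpositions generalize this idea to the rectangular case.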