Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes
Publication Year: 2015
  Javier Cabezas, Lluis Vilanova, Isaac Gelado, Thomas B. Jablin, Nacho Navarro, Wen-mei Hwu
  Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15)

In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability of modern GPUs to ensure that data can be accessed regardless of its physical location, which allows our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can choose more efficient computation and data distribution configurations than previous work. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses throughout the whole kernel execution further reduces their overhead. Results show 1.98× and 3.89× execution speedups on 2 and 4 GPUs, respectively, for a wide range of dense computations compared to the original versions on a single GPU.
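The relationship between array distribution and remote accesses can be illustrated with a small host-side model. The sketch below is not AMGE's algorithm or API: the function names `partition` and `remote_fraction`, the 1D stencil access pattern, and the contiguous block distribution are all assumptions made for illustration. It splits an array evenly across GPUs, assigns each output element to the GPU that owns it, and counts how many reads would land in another GPU's memory. The 5% figure is taken from the abstract; measuring it as a fraction of all reads is an assumption here.

```python
def partition(n_items, n_gpus):
    """Split range(n_items) into n_gpus contiguous [lo, hi) chunks,
    as evenly as possible (hypothetical helper, not AMGE's API)."""
    base, extra = divmod(n_items, n_gpus)
    bounds = []
    start = 0
    for g in range(n_gpus):
        size = base + (1 if g < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def remote_fraction(n_elems, n_gpus, halo):
    """Fraction of reads that touch another GPU's memory for a 1D stencil.

    Assumed access pattern: output element i reads elements
    i-halo .. i+halo (clamped to the array) and is computed on the
    GPU that owns element i.
    """
    total = 0
    remote = 0
    for lo, hi in partition(n_elems, n_gpus):
        for i in range(lo, hi):
            first = max(0, i - halo)
            last = min(n_elems, i + halo + 1)
            for j in range(first, last):
                total += 1
                if not (lo <= j < hi):
                    remote += 1  # element j lives on a different GPU
    return remote / total


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        frac = remote_fraction(4096, gpus, halo=1)
        print(f"{gpus} GPU(s): remote read fraction = {frac:.5f}")
```

In this model only the elements at chunk boundaries are read remotely, so the remote fraction shrinks as chunks grow relative to the stencil halo. A runtime that knows the access pattern could use this kind of estimate to prefer distributions that keep remote traffic low.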