In this paper we present AMGE, a programming framework and
runtime system that transparently decomposes GPU kernels and executes them on
multiple GPUs in parallel. AMGE exploits the remote memory access capability in
modern GPUs to ensure that data can be accessed regardless of its physical
location, allowing our runtime to safely decompose and distribute arrays across
GPU memories. It optionally performs a compiler analysis that detects array access
patterns in GPU kernels. Using this information, the runtime can select computation
and data distribution configurations that are more efficient than those used in
previous works. The GPU execution model allows AMGE to hide the cost of remote
accesses as long as they account for less than 5% of all memory accesses. We
demonstrate that a thread block scheduling policy that spreads remote accesses
throughout the kernel execution further reduces their
overhead. Results show 1.98× and 3.89× execution speedups
for 2 and 4 GPUs, respectively, on a wide range of dense computations compared to
the original single-GPU versions.
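The remote memory access capability referred to above corresponds to the peer access and unified virtual addressing features exposed by the CUDA runtime. The following minimal sketch (with hypothetical kernel and buffer names, not AMGE's actual interface) illustrates the underlying mechanism: a kernel launched on one GPU directly dereferences a pointer whose backing memory resides on another GPU, which is what allows a runtime to distribute an array's partitions across GPU memories.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical kernel: threads on GPU 0 read a tile that physically
    // resides in GPU 1's memory. With unified virtual addressing and peer
    // access enabled, the dereference is serviced as a remote access.
    __global__ void sum_remote(const float *remote_tile, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(out, remote_tile[i]);
    }

    int main() {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 0, 1); // can GPU 0 read GPU 1?
        if (!can_access) { printf("peer access not supported\n"); return 1; }

        // Allocate one array partition in GPU 1's memory.
        const int n = 1 << 20;
        float *tile_on_gpu1;
        cudaSetDevice(1);
        cudaMalloc(&tile_on_gpu1, n * sizeof(float));
        cudaMemset(tile_on_gpu1, 0, n * sizeof(float));

        // Enable remote access from GPU 0 and launch there, passing the
        // pointer to data that lives in GPU 1's memory.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        float *out;
        cudaMalloc(&out, sizeof(float));
        cudaMemset(out, 0, sizeof(float));
        sum_remote<<<(n + 255) / 256, 256>>>(tile_on_gpu1, out, n);
        cudaDeviceSynchronize();

        float h = 0.0f;
        cudaMemcpy(&h, out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %f\n", h);
        return 0;
    }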