High-performance computing increasingly relies on heterogeneous systems
with specialized hardware accelerators to improve application
performance. For example, NVIDIAs CUDA programming system and
general-purpose GPUs have emerged as a widespread accelerator in HPC
systems. This trend has exacerbated challenges of data placement as
accelerators often have fast local memories to fuel their computational
demands, but slower interconnects to feed those memories. Crucially,
real-world data-transfer performance is strongly influenced not just by
the underlying hardware, but by the capabilities of the programming
systems. Understanding how application performance is affected by the
logical communication exposed through abstractions, as well as the
underlying system topology, is crucial for developing high-performance
applications and architectures. This report presents initial
data-transfer microbenchmark results from two POWER-based systems
obtained during work towards developing an automated system performance
characterization tool.