Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects
Publication Year: 2019
  Carl Pearson, Abdul Dakkak, Cheng Li, Jinjun Xiong, Wen-mei Hwu
  Pearson, Carl, et al. "Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects." Proceedings of the 10th ACM/SPEC International Conference on Performance Engineering. ACM, 2019.

Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and to let GPUs be used effectively for tasks with lower compute intensity.
This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer.
No longer is a malloc followed by memcpy necessarily sufficient;
data transfer performance on these systems is now impacted by application configuration, system specification, data locality, and interconnect bandwidth.

This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios.
Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls in ad-hoc measurements by controlling CPU caches and clock frequencies and by excluding synchronization costs from the measurements where possible.
This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects.
These systems are chosen as representative configurations of current high-performance GPU platforms.
Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems.
This work also reports insights into how high-level system design choices affect the performance of these data transfers, and how application developers can optimize applications on these systems.