Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and enable GPUs to be effectively leveraged for tasks with lower computational intensity.
This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding data placement and managing data transfers to the system or application developer.
No longer is a malloc followed by a memcpy necessarily sufficient;
data transfer performance on these systems is now impacted by application configuration, system specification, data locality, and interconnect bandwidth.
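For example, whether a host allocation is pageable or pinned changes the path a transfer takes: a pageable buffer forces the CUDA driver to stage the copy through an internal pinned buffer, while registering (pinning) the same allocation permits direct DMA and typically changes the achieved bandwidth. The following minimal sketch illustrates the distinction; it is not drawn from Comm|Scope itself, the buffer size is illustrative, and error checking is omitted:

```c
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
  const size_t bytes = 1ull << 28;    // 256 MiB, illustrative size
  char *src = (char *)malloc(bytes);  // pageable host allocation
  char *dst;
  cudaMalloc(&dst, bytes);

  // Pageable path: the driver stages the copy through an
  // internal pinned buffer, adding an extra host-side copy.
  cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);

  // Pinned path: registering the allocation enables direct DMA
  // from the host buffer, typically improving transfer bandwidth.
  cudaHostRegister(src, bytes, cudaHostRegisterDefault);
  cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);

  cudaHostUnregister(src);
  cudaFree(dst);
  free(src);
  return 0;
}
```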
This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios.
Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls of ad-hoc measurements by controlling CPU caches and clock frequencies and by excluding synchronization costs from timed regions where possible.
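One such pitfall is timing an asynchronous copy with host-side wall-clock timers, which folds stream synchronization overhead into the reported transfer time. A common remedy, sketched below as one plausible approach rather than Comm|Scope's exact implementation (the helper name and parameters are illustrative), is to bracket the transfer with CUDA events so that elapsed time is taken between device-side timestamps and the host synchronization falls outside the measured interval:

```c
#include <cuda_runtime.h>

// Sketch: time a host-to-device copy with CUDA events so that
// host-side synchronization cost is excluded from the measurement.
// dst, src, and bytes are assumed to be set up by the caller.
float time_h2d_copy(void *dst, const void *src, size_t bytes,
                    cudaStream_t stream) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, stream);
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(stop, stream);

  // Synchronize outside the timed region; the elapsed time is the
  // interval between the two device-side events, not the host wait.
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}
```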