With the emergence of highly multithreaded architectures, performance
monitoring techniques face new challenges in efficiently
locating sources of performance discrepancies in the program
source code. For example, the state-of-the-art performance counters in highly multithreaded graphics processing units (GPUs) report only the aggregate occurrences of microarchitectural events at the end of program execution. Furthermore, even where fine-grained sampling of performance counters is supported, such sampling distorts the actual program behavior and renders the sampled values inaccurate. On the other hand, it is difficult to obtain high-resolution performance information at low sampling rates in the presence of thousands of concurrently running threads. In this paper, we
present a novel software-based approach for monitoring memory hierarchy performance in highly multithreaded general-purpose graphics processors. The proposed analysis is based on memory
traces collected for snapshots of an application's execution. A trace-based memory hierarchy model, combined with a Monte Carlo experimental methodology, generates statistical bounds on performance measures; rather than depending on the exact inter-thread ordering of individual events, it studies the behavior of the overall system. The statistical approach overcomes the classical problem
of disturbed execution timing due to fine-grained instrumentation.
The approach scales well because we deploy an efficient parallel trace-collection technique to reduce trace-generation overhead and a simple memory hierarchy model to reduce simulation time. The
proposed scheme also keeps track of individual memory operations
in the source code and can quantify their efficiency with respect to
the memory system. A cross-validation of our results shows close
agreement with the values read from the hardware performance
counters on an NVIDIA Tesla C2050 GPU. Based on the high-resolution profile data produced by our model, we optimized memory accesses in the sparse matrix-vector multiply kernel and achieved speedups ranging from 2.4x to 14.8x, depending on the characteristics of the input matrices.
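
As an illustration of the Monte Carlo methodology described above, the following is a minimal sketch, not the paper's implementation: each trial replays per-thread address traces through a simple set-associative cache model under a randomly sampled inter-thread interleaving, and the spread of the resulting hit rates yields statistical bounds on the measure of interest. The cache geometry, trial count, and interleaving policy are all illustrative assumptions.

    import random
    import statistics

    # Illustrative assumption: a tiny set-associative cache with LRU
    # replacement stands in for the (more complex) GPU memory hierarchy.
    class Cache:
        def __init__(self, num_sets=64, ways=4, line_bytes=128):
            self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
            self.sets = [[] for _ in range(num_sets)]  # each set: LRU-ordered line tags
            self.hits = self.accesses = 0

        def access(self, addr):
            line = addr // self.line_bytes
            s = self.sets[line % self.num_sets]
            self.accesses += 1
            if line in s:
                self.hits += 1
                s.remove(line)          # move the line to the MRU position
            elif len(s) >= self.ways:
                s.pop(0)                # evict the LRU line
            s.append(line)

    def replay_once(thread_traces, rng):
        # One Monte Carlo trial: replay all per-thread address traces through
        # the cache model under one randomly sampled inter-thread interleaving.
        cache = Cache()
        cursors = [0] * len(thread_traces)
        live = [t for t, trace in enumerate(thread_traces) if trace]
        while live:
            t = rng.choice(live)        # sample the interleaving, not the exact ordering
            cache.access(thread_traces[t][cursors[t]])
            cursors[t] += 1
            if cursors[t] == len(thread_traces[t]):
                live.remove(t)
        return cache.hits / cache.accesses

    def hit_rate_bounds(thread_traces, trials=100, seed=0):
        # Statistical bounds on the hit rate across randomized interleavings
        # (approximate 95% confidence interval around the sample mean).
        rng = random.Random(seed)
        rates = [replay_once(thread_traces, rng) for _ in range(trials)]
        mean = statistics.mean(rates)
        half = 1.96 * statistics.stdev(rates) / trials ** 0.5
        return mean - half, mean + half

Because each trial samples a different interleaving, the resulting bounds characterize the overall system rather than any single execution ordering, which is why instrumentation-induced timing distortion does not bias the estimate.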