Tolerating First Level Memory Access Latency In High-Performance Systems.
Publication Year: 1992
  William Y. Chen, Scott A. Mahlke, Wen-mei Hwu
  Proceedings of the 21st Annual Int'l Conference on Parallel Processing, pp.(I) 36-43, St Charles, IL, Aug. 1992

To improve performance, future parallel systems will continue to increase the processing power of each node. As node processors become able to execute more instructions concurrently, however, they become more sensitive to first level memory access latency. This paper presents a set of hardware and software techniques, collectively referred to as register preloading, to effectively tolerate long first level memory access latency. The techniques include speculative execution, loop unrolling, dynamic memory disambiguation, and strip-mining. Results show that register preloading provides excellent tolerance of first level memory access latencies of up to 15 cycles for an issue-4 node processor.