In order to improve performance, future parallel systems
will continue to increase the processing power of each node
in a system. However, as node processors execute more
instructions concurrently, they become more sensitive to
first-level memory access latency. This paper presents a set
of hardware and software techniques, collectively referred
to as register preloading, that tolerate long first-level
memory access latency. The techniques include
speculative execution, loop unrolling, dynamic memory
disambiguation, and strip-mining. Results show that register
preloading provides excellent tolerance of first-level memory
access latencies of up to 15 cycles for an issue-4 node processor.