Processor design techniques such as pipelining, superscalar execution, 
and VLIW have dramatically decreased the average number of clock 
cycles per instruction.  As a result, each execution cycle has become 
more significant to overall system performance.  To maximize the 
effectiveness of each cycle, one must expose instruction-level 
parallelism and employ techniques that tolerate memory latency.  
However, without special architectural support, a superscalar compiler 
cannot effectively accomplish these two tasks in the presence of 
control and memory access dependences.
Preloading is a class of architectural support that allows memory 
reads to be performed early despite potential violations of control 
and memory access dependences.  With preload support, a superscalar 
compiler can perform more aggressive code reordering, providing 
increased tolerance of cache and memory access latencies and 
increased instruction-level parallelism.  This thesis discusses the 
architectural features and compiler support required to effectively 
utilize preload instructions to increase overall system performance.
The first hardware support is preload register update, a data preload 
mechanism that supports load scheduling to hide first-level cache hit 
latency.  Preload register update keeps the load destination registers 
coherent when load instructions are moved above store instructions 
that reference the same location.  With this addition, superscalar 
processors can more effectively tolerate longer data access latencies.
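For illustration, the following minimal C-level sketch (hypothetical 
code, not an example from this thesis) shows a load hoisted above an 
ambiguous store; the if statement in the scheduled version models the 
preload register update hardware rather than compiler-generated code.

    /* Original order: the store may write the location the load reads. */
    int before(int *p, int *q) {
        *p = 1;                 /* ambiguous store           */
        int y = *q;             /* load; may alias *p        */
        return y + 1;           /* use of the loaded value   */
    }

    /* After load scheduling: the load is issued early, while its use
       stays below the store.  The if statement models the preload
       register update hardware, which rewrites the preloaded
       destination register when the store hits the same address.     */
    int after(int *p, int *q) {
        int y = *q;             /* preload, issued early     */
        *p = 1;                 /* ambiguous store           */
        if (p == q) y = *p;     /* hardware keeps y coherent */
        return y + 1;           /* use sees the correct value */
    }
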
The second hardware support is the memory conflict buffer.  The 
memory conflict buffer extends preload register update by allowing 
the uses of a load, as well as the load itself, to be moved above 
ambiguous stores.  Correct program execution is maintained by the 
memory conflict buffer together with repair code provided by the 
compiler.  With this addition, substantial speedup over an aggressive 
code scheduling model is achieved for a set of control-intensive 
non-numerical programs.
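A hedged sketch of this extended transformation, again with 
hypothetical code, is given below; the conflict check and the repair 
code are emitted by the compiler, while the address comparison itself 
is performed by the memory conflict buffer hardware.

    /* With the memory conflict buffer, both the load and its use are
       moved above the ambiguous store.  The if statement stands in
       for the compiler-inserted conflict check; its body is the
       compiler-provided repair code that redoes the load and its uses. */
    int after_mcb(int *p, int *q) {
        int y = *q;             /* preload, above the ambiguous store */
        int z = y + 1;          /* dependent use, also moved up       */
        *p = 1;                 /* ambiguous store                    */
        if (p == q) {           /* conflict detected by the hardware  */
            y = *q;             /* repair code: redo the load         */
            z = y + 1;          /*   and its dependent uses           */
        }
        return z;
    }
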
The last hardware support is the preload buffer.  Large data sets and 
slow memory subsystems result in unacceptable performance for 
numerical programs.  The preload buffer allows loads to be performed 
early while eliminating problems with cache pollution and extended 
register live ranges.  Adding the prestore buffer allows loads to be 
scheduled in the presence of ambiguous stores.  Preload buffer support 
combined with cache prefetching is shown to achieve better performance 
than cache prefetching alone for a set of benchmarks.
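As a rough software analogy (the actual preload buffer is a hardware 
structure, and the buffer size and preload distance below are 
arbitrary), loading values early into a separate buffer rather than 
into registers or the cache can be sketched in C as follows:

    /* Software analogy of preload-buffer scheduling for a numerical
       loop: each value is loaded DIST iterations before its use and
       held in a small buffer, so the early loads neither occupy
       registers for the full latency nor displace cache lines.       */
    #define DIST 8                       /* illustrative preload distance */

    double sum(const double *a, int n) {
        double buf[DIST];                /* stands in for the preload buffer */
        double s = 0.0;
        int i;
        for (i = 0; i < DIST && i < n; i++)
            buf[i] = a[i];               /* initial preloads */
        for (i = 0; i < n; i++) {
            double x = buf[i % DIST];    /* consume the preloaded value */
            if (i + DIST < n)
                buf[i % DIST] = a[i + DIST];  /* issue the next preload early */
            s += x;
        }
        return s;
    }
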
In all cases, preloading decreases bus traffic and reduces the miss 
rate when compared with either no prefetching or cache prefetching.