Two orthogonal hardware techniques for reducing the latency of load
instructions have recently been proposed: table-based address prediction
and early address calculation. The key idea behind both of
these techniques is to speculatively perform loads early in the
processor pipeline using predicted values for the loads'
addresses. These techniques have required either a large hardware
table or complex register bypass logic to accurately predict the
important loads in the presence of a large
number of less-important loads. This paper proposes a
compiler-directed approach that allows streamlined versions of both
techniques to be used together effectively. The compiler
provides directives to indicate which prediction mechanism to use or,
when appropriate, that a prediction should not be made. Each hardware
mechanism can therefore be focused on its target cases, so that a smaller
prediction table and simpler bypass logic suffice. Our results show
that through straightforward compiler heuristics, we obtain an average
speedup of 34% with a 256-entry direct-mapped address table and only
one cached register. With the help of address profiling, an additional
4% speedup can be obtained.
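
As a rough illustration of the idea, the following Python sketch models a per-load compiler directive that selects between a small direct-mapped address table and an early address calculation from one cached base register. It is a minimal sketch under stated assumptions, not the design evaluated here: the names `Directive`, `AddressTable`, and `speculative_address` are hypothetical, the table is assumed to predict last address plus stride, and indexing by load PC modulo the table size is assumed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Directive(Enum):
    """Hypothetical per-load hint supplied by the compiler."""
    NO_PREDICT = 0   # do not speculate on this load
    TABLE = 1        # use the address prediction table
    EARLY_CALC = 2   # use early address calculation


@dataclass
class Entry:
    tag: int         # full load PC, used to detect index conflicts
    last_addr: int   # most recent address seen for this load
    stride: int = 0  # assumed stride predictor: last delta between addresses


class AddressTable:
    """Direct-mapped table indexed by load PC (assumed: PC modulo size)."""

    def __init__(self, entries: int = 256) -> None:
        self.entries = entries
        self.slots: List[Optional[Entry]] = [None] * entries

    def predict(self, pc: int) -> Optional[int]:
        entry = self.slots[pc % self.entries]
        if entry is None or entry.tag != pc:
            return None  # miss: no prediction for this load
        return entry.last_addr + entry.stride

    def update(self, pc: int, addr: int) -> None:
        idx = pc % self.entries
        entry = self.slots[idx]
        if entry is not None and entry.tag == pc:
            entry.stride = addr - entry.last_addr
            entry.last_addr = addr
        else:
            # Direct-mapped: a conflicting load simply replaces the entry.
            self.slots[idx] = Entry(tag=pc, last_addr=addr)


def speculative_address(directive: Directive, table: AddressTable, pc: int,
                        offset: int, cached_base: Optional[int]) -> Optional[int]:
    """Return an early load address, or None if no speculation is made."""
    if directive is Directive.TABLE:
        return table.predict(pc)
    if directive is Directive.EARLY_CALC and cached_base is not None:
        # Assumed model: base register value is cached, so base + offset
        # can be formed early in the pipeline.
        return cached_base + offset
    return None
```

Because loads marked `NO_PREDICT` or `EARLY_CALC` never touch the table in this sketch, fewer loads compete for its entries, which is the intuition behind a smaller table sufficing.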