As semiconductor technologies have continued to scale according to Moore's Law, complexity,
power consumption, and energy dissipation have become first-order considerations in
microprocessor design. In processors that issue instructions out of order, store-load forwarding
is a source of significant complexity and energy dissipation. To decrease the complexity
and improve the energy efficiency of store-load forwarding, this thesis proposes the forwarding
cache (FC), an address-indexed, set-associative alternative to the age-indexed, fully associative
store queue (SQ).
The SQ is a content-addressable memory (CAM) that holds in-flight stores in program
order. Because the SQ is age-indexed, a load's address may match one or more stores located
anywhere in the SQ. Thus, the SQ search is fully associative and priority-encoded. In today's
wide-issue processors, the SQ is large (24 to 32 entries), multiported (to accommodate the
issue of multiple loads in a single cycle), and fast (no slower than an L1 data cache hit). The
energy and complexity required to perform a fast search in a highly associative, multiported
CAM are substantial.
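
The search described above can be illustrated with a minimal behavioral sketch. It is not a model of any particular hardware: the names `StoreEntry` and `sq_search` are hypothetical, ages are assumed to be program-order sequence numbers (smaller means older), and accesses are assumed to be same-sized and aligned so that an address match is a simple equality test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    age: int   # program-order sequence number (assumption: smaller = older)
    addr: int  # store address
    data: int  # store data


def sq_search(sq: List[StoreEntry], load_addr: int, load_age: int) -> Optional[int]:
    """Return the data of the youngest matching store that is older than the load."""
    best = None
    for entry in sq:
        # Fully associative: every entry's address is compared with the load's.
        if entry.addr == load_addr and entry.age < load_age:
            # Priority encoding: among all matches, the youngest older store wins.
            if best is None or entry.age > best.age:
                best = entry
    return None if best is None else best.data


sq = [
    StoreEntry(age=1, addr=0x100, data=11),
    StoreEntry(age=3, addr=0x200, data=22),
    StoreEntry(age=5, addr=0x100, data=33),
    StoreEntry(age=9, addr=0x100, data=44),
]
forwarded = sq_search(sq, load_addr=0x100, load_age=7)
```

In hardware, the loop corresponds to one comparator per entry per port operating in parallel, followed by a priority encoder, which is where the energy and complexity cost arises.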
The contributions of this work are as follows. First, this thesis shows empirically that address
multiversioning and the accompanying age-ordered, priority-encoded search are rarely
necessary to perform store-load forwarding correctly. While others have observed the same
empirical result on a particular processor configuration using a particular load speculation policy,
this thesis extends the analysis to a broad variety of processors using several load speculation
policies. Second, this thesis proposes the forwarding cache (FC), an address-indexed,
set-associative cache that performs store-load forwarding. Third, this thesis investigates the
sensitivity of the FC's performance and energy dissipation to several design parameters, including
size, associativity, number of banks, and number of ports. The results show that a
small, simple, set-associative FC performs comparably to the complex, fully associative SQ on
both control-intensive and scientific workloads while dissipating nearly an order of magnitude
less energy than the SQ.
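
To make the contrast with the SQ concrete, the following is a minimal sketch of an address-indexed, set-associative forwarding structure. It illustrates the general idea only and should not be read as the design evaluated in this thesis: the class name `ForwardingCacheSketch`, the modulo set index, the oldest-written replacement policy, and the choice to silently drop evicted entries are all assumptions, and commit, squash, and misspeculation recovery are omitted.

```python
from collections import OrderedDict
from typing import Optional


class ForwardingCacheSketch:
    """Hypothetical address-indexed, set-associative forwarding structure."""

    def __init__(self, num_sets: int = 4, ways: int = 2):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered map per set: address (acting as the tag) -> data.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, addr: int) -> OrderedDict:
        # Assumption: low-order address bits select the set.
        return self.sets[addr % self.num_sets]

    def store(self, addr: int, data: int) -> None:
        s = self._set_for(addr)
        if addr in s:
            # No multiversioning: a later store overwrites the earlier one.
            s.pop(addr)
        elif len(s) >= self.ways:
            # Assumed policy: evict the least recently written entry.
            s.popitem(last=False)
        s[addr] = data

    def load(self, addr: int) -> Optional[int]:
        # Only the ways of one set are compared; no age-ordered search.
        return self._set_for(addr).get(addr)


fc = ForwardingCacheSketch(num_sets=2, ways=2)
fc.store(0, 10)
fc.store(0, 20)   # overwrites the single version held for address 0
hit = fc.load(0)
```

A load compares against at most `ways` entries rather than every in-flight store, which is the source of the reduced search cost in this sketch.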