Multiversioning in the Store Queue is the Root of All Store-Forwarding Evil
Publication Year: 2007
  Sam S. Stone
  Diss. University of Illinois at Urbana-Champaign, 2007.

As semiconductor technologies have continued to scale according to Moore's Law, complexity, power consumption, and energy dissipation have become first-order considerations in microprocessor design. In processors that issue instructions out-of-order, store-load forwarding is a source of significant complexity and energy dissipation. To decrease the complexity and improve the energy efficiency of store-load forwarding, this thesis proposes the forwarding cache (FC), an address-indexed, set-associative alternative to the age-indexed, fully associative store queue (SQ). The SQ is a content-addressable memory (CAM) that holds in-flight stores in program order. Because the SQ is age-indexed, a load's address may match one or more stores located anywhere in the SQ. Thus, the SQ search is fully associative and priority-encoded. In today's wide-issue processors, the SQ is large (24 to 32 entries), multiported (to accommodate the issue of multiple loads in a single cycle), and fast (no slower than an L1 data cache hit). The energy and complexity required to perform a fast search in a highly associative, multiported CAM are substantial. The contributions of this work are as follows. First, this thesis shows empirically that address multiversioning and the accompanying age-ordered, priority-encoded search are rarely necessary to perform store-load forwarding correctly. While others have observed the same empirical result on a particular processor configuration using a particular load speculation policy, this thesis extends the analysis to a broad variety of processors using several load speculation policies. Second, this thesis proposes the forwarding cache (FC), an address-indexed, set-associative cache that performs store-load forwarding. Third, this thesis investigates the sensitivity of the FC's performance and energy dissipation to several design parameters, including size, associativity, number of banks, and number of ports.
The results show that a small, simple, set-associative FC performs comparably to the complex, fully associative SQ on both control-intensive and scientific workloads, while dissipating nearly ten times less energy than the SQ.
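
The contrast between the two lookup styles can be illustrated with a small software model. This is only a sketch under stated assumptions, not the thesis's hardware design: the class names, the set and way counts, the modulo indexing, and the eviction policy are all hypothetical, and the abstract does not say how the FC detects or recovers from the rare cases where multiple versions of an address matter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    age: int    # program-order sequence number (smaller = older)
    addr: int
    value: int


class StoreQueue:
    """Age-indexed model: several in-flight stores may share one address."""

    def __init__(self) -> None:
        self.entries: list[Store] = []

    def insert(self, store: Store) -> None:
        self.entries.append(store)

    def forward(self, load_age: int, addr: int) -> Optional[int]:
        # Fully associative search: the load address is compared to every entry.
        matches = [s for s in self.entries
                   if s.addr == addr and s.age < load_age]
        if not matches:
            return None
        # Priority encoding: choose the youngest store older than the load.
        return max(matches, key=lambda s: s.age).value


class ForwardingCache:
    """Address-indexed, set-associative model: one version per address."""

    def __init__(self, num_sets: int = 4, ways: int = 2) -> None:
        self.num_sets = num_sets  # hypothetical size, not from the thesis
        self.ways = ways
        self.sets: list[dict[int, int]] = [{} for _ in range(num_sets)]

    def insert(self, store: Store) -> None:
        entries = self.sets[store.addr % self.num_sets]
        if store.addr not in entries and len(entries) >= self.ways:
            # Assumed policy: evict the address that entered the set first.
            entries.pop(next(iter(entries)))
        entries[store.addr] = store.value  # overwrite any previous version

    def forward(self, addr: int) -> Optional[int]:
        # Only the ways of one set are examined; no age comparison is made.
        return self.sets[addr % self.num_sets].get(addr)


sq, fc = StoreQueue(), ForwardingCache()
for st in (Store(age=1, addr=0x10, value=111), Store(age=3, addr=0x10, value=222)):
    sq.insert(st)
    fc.insert(st)

print(sq.forward(load_age=4, addr=0x10), fc.forward(0x10))  # 222 222
print(sq.forward(load_age=2, addr=0x10), fc.forward(0x10))  # 111 222
```

A load younger than both stores gets the same value from either structure. Only a load that sits between two in-flight stores to the same address sees a difference, which is the multiversioning case that the abstract reports is rarely needed in practice.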