Wide-issue processors continue to achieve higher performance by
exploiting greater instruction-level parallelism. Dynamic techniques
such as out-of-order execution and hardware speculation have proven
effective at increasing instruction throughput. Run-time optimization
promises to provide an even higher level of performance by adaptively
applying aggressive code transformations on a larger scope. This
paper presents a new hardware mechanism for generating and deploying
run-time optimized code. The mechanism can be viewed as a filtering
system that resides in the retirement stage of the processor
pipeline, accepts an instruction execution stream as input, and
produces instruction profiles and sets of linked, optimized traces as
output. The code deployment mechanism uses an extension of the branch
prediction hardware to migrate execution into the new code without
modifying the original code. These new components do not add
delay to the execution of the program except during short
bursts of reoptimization. This technique provides a strong platform
for run-time optimization because the hot execution regions are
extracted, optimized, and written to main memory for execution and
because these regions persist across context switches. The current
design of the framework supports a suite of optimizations including
partial function inlining (even into shared libraries), code
straightening, loop unrolling, and peephole optimization.
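
To make the description above concrete, the sketch below models the
filter-and-deploy idea in software: a retirement-stage filter counts
retiring instructions, captures a trace once a program counter becomes
hot, writes it to a separate code area, and records a redirection entry
that a branch-predictor-like table would consult to steer fetch into the
new code without modifying the original code. All names, table sizes,
thresholds, and the straight-line trace-capture policy are illustrative
assumptions and are not drawn from the hardware design presented in this
paper.

/*
 * Minimal software sketch of the filter-and-deploy idea, assuming a
 * simple hashed profile table and a small redirection table.  All names,
 * sizes, and policies here are illustrative, not the paper's design.
 */
#include <stdint.h>
#include <stdio.h>

#define HOT_ENTRIES   256   /* profile counters, indexed by hashed PC        */
#define HOT_THRESHOLD 16    /* retirements before a region is considered hot */
#define TRACE_LEN     8     /* instructions captured per trace               */
#define REMAP_ENTRIES 64    /* deployed-trace redirection entries            */

/* One retired instruction as observed by the retirement-stage filter. */
typedef struct { uint64_t pc; int is_branch; int taken; uint64_t target; } retired_t;

/* Profile counters: how often each (hashed) PC has retired. */
static uint32_t hot_table[HOT_ENTRIES];

/* Redirection table: original entry PC -> index of the generated trace.
 * Stands in for the branch-prediction extension that steers fetch into
 * the new code without modifying the original code. */
typedef struct { uint64_t orig_pc; uint64_t trace_idx; int valid; } remap_t;
static remap_t trace_remap[REMAP_ENTRIES];

/* "Main memory" area where generated traces are written for execution. */
static uint64_t trace_area[REMAP_ENTRIES][TRACE_LEN];
static int traces_used;

static unsigned hash_pc(uint64_t pc) { return (unsigned)(pc >> 2) % HOT_ENTRIES; }

/* Fetch-side check: if this PC has a deployed trace, report where it lives. */
static int lookup_remap(uint64_t pc, uint64_t *trace_idx)
{
    for (int i = 0; i < traces_used; i++)
        if (trace_remap[i].valid && trace_remap[i].orig_pc == pc) {
            *trace_idx = trace_remap[i].trace_idx;
            return 1;
        }
    return 0;
}

/* Called once per retired instruction; returns 1 when a trace is deployed. */
static int retire_filter(const retired_t *r)
{
    uint64_t ignore;
    if (lookup_remap(r->pc, &ignore))        /* already optimized */
        return 0;

    unsigned h = hash_pc(r->pc);
    if (++hot_table[h] < HOT_THRESHOLD || traces_used >= REMAP_ENTRIES)
        return 0;
    hot_table[h] = 0;

    /* Capture a trace starting at the hot PC.  A real filter would record
     * the retired path and optimize it; consecutive PCs stand in here. */
    for (int i = 0; i < TRACE_LEN; i++)
        trace_area[traces_used][i] = r->pc + 4ull * (uint64_t)i;

    trace_remap[traces_used] = (remap_t){ .orig_pc = r->pc,
                                          .trace_idx = (uint64_t)traces_used,
                                          .valid = 1 };
    traces_used++;
    return 1;
}

int main(void)
{
    /* Simulate a loop body at PC 0x1000 retiring repeatedly. */
    retired_t r = { .pc = 0x1000, .is_branch = 0, .taken = 0, .target = 0 };
    for (int i = 0; i < 2 * HOT_THRESHOLD; i++)
        if (retire_filter(&r))
            printf("deployed trace for hot PC 0x%llx\n", (unsigned long long)r.pc);

    uint64_t idx;
    if (lookup_remap(0x1000, &idx))
        printf("fetch at 0x1000 redirected to trace %llu\n", (unsigned long long)idx);
    return 0;
}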