As VLIW/EPIC processors are increasingly used in real-time,
signal-processing, and embedded applications, the importance of minimizing
code size and reducing power consumption is growing. This paper describes
a new architectural mechanism, called Modulo Schedule Buffers, that
provides an elegant interface for the execution of modulo-scheduled
loops. While the performance is similar to that of kernel-only modulo
scheduling, this mechanism has a number of advantages, including
minimal code expansion. Rather than generating fully scheduled
kernels, the compiler generates a sequential form of the
modulo-scheduled loop body. From this sequential form, the hardware
internally synthesizes the prologue, kernel, and epilogue. In
addition, while-loops can be scheduled with fewer constraints
and fewer explicit prologues/epilogues than with existing mechanisms.
Because the hardware controls loop execution, the burden of modulo
schedule loop control is lifted from the predicate register file,
allowing for a less rigorous predication implementation. Finally,
under the EQ explicit-latency model, hardware control limits the
interrupt latency to the execution latency of one iteration,
rather than that of the whole loop invocation.
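
As a rough illustration of how a prologue, kernel, and epilogue can be derived from a sequential loop body, the following Python sketch tags each operation with a pipeline stage and enumerates what would issue on each pass through the buffered schedule. It is a simplified software model under stated assumptions, not the hardware design described in this paper: the `(name, stage)` encoding, the function name `synthesize_passes`, and the one-pass-per-initiation-interval view are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical encoding: (operation name, pipeline stage of the modulo schedule).
Op = Tuple[str, int]


def synthesize_passes(body: List[Op], num_iterations: int):
    """Expand a sequential, stage-annotated loop body into per-pass issue lists.

    Each pass corresponds to one initiation interval. On pass k, the
    operations of stage s belong to iteration k - s, and they issue only
    if that iteration exists. Returns a list of (phase, issued) pairs,
    where issued is a list of (operation name, iteration index).
    """
    if num_iterations <= 0:
        return []
    num_stages = max(stage for _, stage in body) + 1
    passes = []
    for k in range(num_iterations + num_stages - 1):
        issued = [(name, k - stage) for name, stage in body
                  if 0 <= k - stage < num_iterations]
        active_stages = sum(1 for s in range(num_stages)
                            if 0 <= k - s < num_iterations)
        if active_stages == num_stages:
            phase = "kernel"      # every stage holds a live iteration
        elif k < num_iterations:
            phase = "prologue"    # pipeline still filling
        else:
            phase = "epilogue"    # pipeline draining
        passes.append((phase, issued))
    return passes


if __name__ == "__main__":
    # Assumed three-stage loop body: load, then multiply, then store.
    body = [("load", 0), ("mul", 1), ("store", 2)]
    for phase, issued in synthesize_passes(body, 4):
        print(phase, issued)
```

In this model the loop body is stored once, and the ramp-up and ramp-down passes come from enabling or disabling stages rather than from separate prologue and epilogue code.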