The performance of modern superscalar and very long instruction word (VLIW) processors
depends on their ability to execute multiple instructions per cycle. These processors contain
multiple data paths and multiple functional units to concurrently execute independent
instructions from the instruction stream. To realize their performance potential, these
processors require the compiler to expose increasing levels of instruction-level parallelism
(ILP). Unfortunately, recent studies have shown that conventional optimization and
scheduling methods cannot expose enough parallelism for full utilization of these processors [1].