Clustering is one solution to the demand for wide-issue machines and
fast clock cycles because it allows for smaller, less ported register
files and simpler bypass logic while remaining scaleable. Much of the
previous work on scheduling for clustered architectures has focused on
acyclic code. While minimizing schedule length of acyclic code is
paramount, the primary objective when scheduling cyclic code is to
maximize the throughput or steady state performance. This paper
investigates a pre-modulo scheduling pass that performs cluster
assignment in a way that minimizes performance degradation do to
explicit communication required as the loops are split over clusters.
The cluster assignment algorithm annotates and adjusts the graph for use
by the scheduler so that any traditional modulo scheduling algorithm,
having no knowledge of clustering, can produce a valid and efficient
schedule for a clustered machine.
The algorithm was tested on various machine configurations
including ones with general function units and fully specialized
function units. The results show that, given a reasonable number of
buses, the assignment algorithm allows for a modulo schedule on a
clustered machine equal to that of a unified machine for 98% and 95% of
loops on general and fully specialized machines, respectively, for both
two and four cluster configurations.