Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly parallel applications for these platforms without the substantial experience and knowledge needed to maximize their performance. This creates a need for more structured optimization methods with means to estimate their performance effects. Furthermore, these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and presents a relatively simple methodology for reducing the workload involved in the optimization process.
This work is based on one such highly parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance.
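As a rough illustration of these demands, the sketch below models how per-thread register use and per-block shared memory bound the number of thread blocks a streaming multiprocessor (SM) can run concurrently. The per-SM limits are the published GeForce 8800 GTX (G80) figures; the simple integer-division model and the example kernel parameters are our assumptions (real allocation follows granularity rules), but they capture why a small change in resource usage can sharply change available parallelism.

```cuda
#include <cstdio>

// Rough per-SM resource model for the GeForce 8800 GTX (G80):
// 8192 registers, 16 KB shared memory, 768 threads, 8 blocks per SM.
// A kernel's per-thread register count and per-block shared memory
// jointly decide how many blocks an SM can run at once.
int blocksPerSM(int threadsPerBlock, int regsPerThread, int smemPerBlock)
{
    const int kRegs = 8192, kSmem = 16 * 1024, kThreads = 768, kBlocks = 8;
    int byRegs    = kRegs / (regsPerThread * threadsPerBlock);
    int byThreads = kThreads / threadsPerBlock;
    int bySmem    = smemPerBlock > 0 ? kSmem / smemPerBlock : kBlocks;
    int n = byRegs;
    if (byThreads < n) n = byThreads;
    if (bySmem    < n) n = bySmem;
    if (kBlocks   < n) n = kBlocks;
    return n;
}

int main()
{
    // Hypothetical kernel: 16x16 tiles (256 threads), 11 registers per
    // thread, 2 KB shared memory per block. Here register pressure, not
    // shared memory, is the binding limit (prints 2 concurrent blocks).
    printf("%d concurrent blocks per SM\n", blocksPerSM(256, 11, 2 * 1024));
    return 0;
}
```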
We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected.
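To make the notion of a configuration concrete, below is a minimal sketch of one point in the matrix-multiplication optimization space: a tiled kernel whose tile width TILE is a tunable parameter. The value 16 and the kernel structure are our illustration, not the paper's exact code; varying TILE changes the shared-memory footprint and thread count of each block, and with them how many blocks fit on each multiprocessor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Tile width is one axis of the configuration space. 16 is a
// hypothetical choice; the best value is hard to guess without search.
#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March paired tiles of A and B across the shared dimension
    // (n is assumed to be a multiple of TILE for brevity).
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main()
{
    const int n = 512;
    const size_t bytes = (size_t)n * n * sizeof(float);
    float *A, *B, *C;
    // Managed memory is a modern convenience used here for brevity.
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matmulTiled<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0f * n);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```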
We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.
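To make the pruning step concrete, here is a small host-side sketch of Pareto filtering over per-configuration scores. The metric names and toy values are our illustration; the paper derives its metrics from static code properties, but any pair of higher-is-better scores works with the same dominance test.

```cuda
#include <cstdio>
#include <vector>

// One optimization configuration, scored by two statically derived
// metrics (names are ours for illustration); higher is better for both.
struct Config {
    int   id;          // index into the configuration space
    float efficiency;  // e.g., useful work per instruction issued
    float utilization; // e.g., fraction of thread capacity kept busy
};

// Config a dominates config b if it is at least as good on both
// metrics and strictly better on at least one.
static bool dominates(const Config& a, const Config& b)
{
    return a.efficiency >= b.efficiency && a.utilization >= b.utilization &&
           (a.efficiency > b.efficiency || a.utilization > b.utilization);
}

// Keep only configurations on the Pareto-optimal curve: those not
// dominated by any other. O(n^2), which is fine for configuration
// spaces of a few hundred points.
std::vector<Config> paretoFront(const std::vector<Config>& all)
{
    std::vector<Config> front;
    for (const Config& c : all) {
        bool dominated = false;
        for (const Config& other : all)
            if (dominates(other, c)) { dominated = true; break; }
        if (!dominated) front.push_back(c);
    }
    return front;
}

int main()
{
    // Toy scores; in practice these come from static analysis of each
    // compiled kernel variant, not from running it.
    std::vector<Config> space = {
        {0, 0.9f, 0.4f}, {1, 0.7f, 0.7f}, {2, 0.6f, 0.6f},
        {3, 0.4f, 0.9f}, {4, 0.5f, 0.5f},
    };
    // Prints configs 0, 1, and 3; only these survivors need to be
    // compiled and timed.
    for (const Config& c : paretoFront(space))
        printf("keep config %d (eff=%.1f, util=%.1f)\n",
               c.id, c.efficiency, c.utilization);
    return 0;
}
```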