Contemporary many-core processors such as the GeForce 8800 GTX enable
application developers to utilize various levels of parallelism to
enhance the performance of their applications. However, iterative
optimization for such a system may lead to a local performance maximum,
due to the complexity of the system. We propose program optimization
carving, a technique that begins with a complete optimization space and
prunes it down to a set of configurations that is likely to contain the
global maximum. The remaining configurations can then be evaluated to
determine the one with the best performance. The technique can reduce
the number of configurations to be evaluated by as much as 98% and is
successful at finding a near-best configuration. For some applications,
we show that this approach is significantly superior to random sampling
of the search space.