The newest generations of graphics processing unit (GPU) architecture, such as the NVIDIA GeForce 8-series, feature new interfaces that improve programmability and generality over previous GPU generations. Using NVIDIA's Compute Unified Device Architecture (CUDA), the GPU is presented to developers as a flexible parallel architecture. This flexibility introduces the opportunity to perform a wide variety of parallelization optimizations on applications, but it can be difficult to choose and control optimizations to give reliable performance benefit. This work presents a study that examines a broad space of optimization combinations performed on several applications ported to the GeForce 8800 GTX. By doing an exhaustive search of the optimization space, we find configurations that are up to 74% faster than those previously thought optimal. We explain the effects that optimizations can have on this architecture and how they differ from those on more traditional processors. For some optimizations, small changes in resource usage per thread can have very significant performance ramifications due to the thread assignment granularity of the platform and the lack of control over scheduling and allocation behavior of the runtime. We conclude with suggestions for better controlling resource usage and performance on this platform.