The computer industry has transitioned to multi-core and
many-core parallel systems. The CUDA programming environment from
NVIDIA is an attempt to make programming many-core GPUs more
accessible to programmers. However, maximizing performance with CUDA
still places many burdens on the programmer. One
such burden is dealing with the complex memory hierarchy. Efficient and
correct usage of the various memories is essential, making a difference of
2-17x in performance. Currently, the task of choosing the appropriate
memory and coding the data transfers between memories is still
left to the programmer. We believe that this task can be better performed
by automated tools. We present CUDA-lite, an enhancement to CUDA,
as one such tool. CUDA-lite leverages programmer knowledge, supplied
via annotations, to perform transformations. Preliminary results indicate
that auto-generated code can perform comparably to hand-written code.