As an open, royalty-free framework for writing programs that execute across
heterogeneous platforms, OpenCL gives programmers access to a variety
of data-parallel processors, including CPUs, GPUs, the Cell processor, and DSPs.
All OpenCL-compliant implementations support a core specification, thus
ensuring robust functional portability of any OpenCL program. This thesis
presents the CUDAtoOpenCL source-to-source tool that translates code from
CUDA to OpenCL, enabling applications to run on a variety
of devices. However, current compiler optimizations are not sufficient to
translate performance from a single expression of the program across a wide
range of architectures. To achieve true performance portability, an
open standard like OpenCL needs to be augmented with automatic high-level
optimization and transformation tools, which can generate optimized code
and configurations for any target device.
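As a small illustration of the kind of rewriting involved (a minimal sketch, not output generated by the translator), the fragment below shows a vector-addition kernel in CUDA and an equivalent kernel in OpenCL C; the mapping of __global__ to __kernel, of global pointer arguments to the __global address space, and of threadIdx, blockIdx, and blockDim to get_local_id, get_group_id, and get_local_size is the standard correspondence between the two programming models.

    // CUDA: each thread adds one element of the vectors.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // OpenCL C equivalent of the same kernel.
    __kernel void vec_add(__global const float *a, __global const float *b,
                          __global float *c, int n) {
        int i = get_group_id(0) * get_local_size(0) + get_local_id(0);
        if (i < n) c[i] = a[i] + b[i];
    }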
This thesis details the design and implementation of the
CUDAtoOpenCL translator, which is built on the Cetus compiler framework. This
thesis also describes key insights from our studies optimizing selected
benchmarks for two distinct GPU architectures: the NVIDIA GTX280
and the ATI Radeon HD 5870. The results show that the type and degree of
optimization applied to each benchmark must be adapted to the specifications
of the target architecture. In particular, the two devices differ in the
hardware architecture of the basic compute unit, register file organization,
on-chip memory limitations, DRAM coalescing patterns, and floating-point unit
throughput, and each of these differences interacts with each optimization
differently.
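The coalescing interaction, for example, is visible directly in a kernel's memory access pattern. The fragments below are minimal sketches (not benchmarks from this thesis) contrasting a coalesced access, in which consecutive work-items read consecutive addresses that the hardware can merge into wide DRAM transactions, with a strided access, whose cost depends on the memory system of each device.

    // Coalesced: work-item i touches element i, so a warp/wavefront reads
    // one contiguous block of memory.
    __kernel void copy_coalesced(__global const float *in, __global float *out) {
        int i = get_global_id(0);
        out[i] = in[i];
    }

    // Strided: neighbouring work-items are 'stride' elements apart, so the
    // same warp/wavefront may generate many separate DRAM transactions.
    __kernel void copy_strided(__global const float *in, __global float *out,
                               int stride) {
        int i = get_global_id(0) * stride;
        out[i] = in[i];
    }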