Implementing a GPU Programming Model on a non-GPU Accelerator Architecture
  Stephen M. Kofsky, Daniel R. Johnson, John A. Stratton, Wen-mei W. Hwu, Sanjay J. Patel, Steven S. Lumetta
  Proceedings of the Workshop on Applications for Multi- and Many-cores, June 2010

Parallel codes are written primarily for performance.  It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites.  While performance portability and its limits have been studied thoroughly on single-processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems.  Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging.  In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU.  Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel.  We demonstrate performance improvements with these optimizations over naive translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.

@inproceedings{kofsky2010implementing,
  author = {Kofsky, Stephen M. and Johnson, Daniel R. and Stratton, John A. and Hwu, Wen-Mei W. and Patel, Sanjay J. and Lumetta, Steven S.},
  title = {Implementing a {GPU} programming model on a Non-{GPU} accelerator architecture},
  booktitle = {Proceedings of the 2010 Workshop on Applications for Multi- and Many-Core Processors},
  year = {2010},
  location = {Saint-Malo, France},
  pages = {40--51},
  numpages = {12},
  url = {http://dx.doi.org/10.1007/978-3-642-24322-6_5},
  doi = {10.1007/978-3-642-24322-6_5},
  acmid = {2185876},
  publisher = {Springer-Verlag},
  address = {Berlin, Heidelberg}
}