OpenCL is undoubtedly becoming one of the most popular parallel
programming languages as it provides a standardized and portable
programming model. However, adopting OpenCL for Coarse-Grained
Reconfigurable Arrays (CGRAs) is challenging because their
architectural capabilities diverge from those of GPUs. In particular, CGRAs are
designed to accelerate loop execution by software pipelining on a grid
of functional units, exploiting instruction-level parallelism. A GPU,
in contrast, executes data-parallel kernels using a large number of
parallel threads.
Therefore, an OpenCL compiler and runtime for CGRAs must map the
threaded parallel programming model to a loop-parallel execution model
so that the architecture can best utilize its resources. In this paper,
we propose and evaluate a design for an OpenCL compiler framework for
CGRAs. The proposed design is composed of a serializer and a post
optimizer. The serializer transforms parallel execution of work-items into
an equivalent loop-based iterative execution in order to avoid
expensive multithreading on CGRAs. The resulting code is further
optimized by the post optimizer to maximize the coverage of
software-pipelinable innermost loops. To this end, the post optimizer
can apply various loop-level optimizations to the loops that the
serializer introduces for iterative execution of OpenCL kernels. We
analyze the proposed framework on a set of well-studied standard OpenCL
kernels by comparing the performance of various implementations of the
benchmarks.
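
As a rough illustration of the serialization idea described above, the following C sketch turns the work-items of a one-dimensional vector-add kernel into loop iterations. It is a minimal sketch under stated assumptions, not the framework's actual output: the names vec_add_work_item and vec_add_serialized are hypothetical, the work-item ID is passed as an explicit argument instead of being read through get_global_id(0), and the global size is assumed to be a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an OpenCL kernel body executed by one work-item.
   The global ID is an explicit parameter here (assumption for illustration). */
static void vec_add_work_item(size_t gid, const float *a, const float *b,
                              float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Serialized form: work-groups become the outer loop and the work-items of
   each group become the innermost loop, which is the loop a software
   pipeliner would target. */
void vec_add_serialized(size_t global_size, size_t local_size,
                        const float *a, const float *b, float *c)
{
    /* Assumption: the global size is a whole number of work-groups. */
    assert(local_size > 0 && global_size % local_size == 0);

    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t lid = 0; lid < local_size; ++lid) {
            size_t gid = group * local_size + lid;
            vec_add_work_item(gid, a, b, c);
        }
    }
}
```

Calling vec_add_serialized(8, 4, a, b, c) on eight-element arrays runs two work-groups of four work-items each, one iteration at a time, with no threads involved. Kernels containing barriers would need additional handling that this sketch does not attempt.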