IMPACT



Publication Year:
	2009

Authors
	Dennis Lin, Victor Huang, Quang Nguyen, Joshua Blackburn, Christopher I. Rodrigues, Thomas Huang, Minh N. Do, Sanjay J. Patel, Wen-mei Hwu

Published:
	IEEE Signal Processing Magazine 26(6), 103--112, 2009

Abstract:
	In this article, we focus on the applicability of parallel computing architectures to video processing applications. We demonstrate different optimization strategies in detail using the 3-D convolution problem as an example, and show how they affect performance on both many-core GPUs and symmetric multiprocessor CPUs. Applying these strategies to case studies from three video processing domains brings out some trends. The highly uniform, abundant parallelism in many video processing kernels means that they are well suited to a simple, massively parallel task-based model such as CUDA. As a result, we often see performance increases of ten times or more when running on many-core hardware. Some kernels, however, push the limits of CUDA, because their memory accesses cannot be shaped into regular, vectorizable patterns or because they cannot be efficiently decomposed into small independent tasks. Such kernels, like the depth propagation kernel in the section "Synthesis Example: Depth Image-Based Rendering," may achieve a modest speedup, but they are probably better suited to a more flexible parallel programming model. We look forward to additional advances as more researchers learn to harness the processing capabilities of the latest generation of computation hardware.