HyperLink   High Performance Histogramming on Massively Parallel Processors
Publication Year:
  Greg Ross
  University of Illinois Masters Disertation, August 2014

Histogramming is a technique by which input datasets are mined to extract features and patterns. Histograms have wide range of uses in computer vision, machine learning, database processing, quality control for manufacturing, and many applications bene t from advance knowledge about the distribution of data. Computing a histogram is, essentially, the antithesis of parallel processing. Without the use of slow atomic operations or serial execution when contributing data to a histogram bin in an input-driven method, there would likely be inaccuracies in the resulting output. An output-driven method would eliminate the need for atomic operations but would amplify read bandwidth requirements, reduce overall throughput, and result in a zero or negative gain in performance. We introduce a method to pack multiple bins into a memory word with the goal of better utilizing GPU resources. This method improves GPU cupancy relative to earlier histogram kernel implementations, increases the number of working threads to better hide the latency of atomic operations and collisions while maintaining reasonable throughput. This technique will be demonstrated to improve performance of histogram functions of various sizes with a variety of inputs, including a study on a particular application. While the results are heavily driven by dependancies on input data patterns, the conclusions gathered in this thesis will outline that the packed atomics histogramming kernel can and usually does outperform other implementations in all but a select number of exceptions.