In-place data manipulation is highly desirable on
many-core architectures with limited on-board memory. This
paper deals with the in-place implementation of a class of primitives
that perform data movements in one direction. We call
these primitives Data Sliding (DS) algorithms. Notable among
them are relational algebra primitives (such as select and
unique), padding to insert empty elements in a data structure,
and stream compaction to reduce memory requirements. Their
in-place implementation in a bulk synchronous parallel model,
such as that of GPUs, is especially challenging due to the difficulty
of synchronizing threads executing on different compute units.
Using a novel adjacent work-group synchronization technique,
we propose two algorithmic schemes for regular and irregular
DS algorithms. With a set of 5 benchmarks, we validate our
approaches and compare them to the state-of-the-art implementations
of these benchmarks. Our regular DS algorithms achieve up to
9.11× and 73.25× the throughput of their competitors on NVIDIA
and AMD GPUs, respectively. Our irregular DS algorithms outperform
the NVIDIA Thrust library by
up to 3.24× on the three most recent generations of NVIDIA
GPUs.