Numerical applications frequently contain nested loop
structures that process large arrays of data. The
execution of these loop structures often produces memory
reference patterns that utilize data caches poorly.
Limited associativity and cache capacity result in
cache conflict misses. Also, non-unit stride access
patterns can cause low utilization of cache lines.
Data copying has been proposed and investigated as a
way to reduce cache conflict misses, but this
technique has a high execution overhead because it performs
the copy operations entirely in software.
We propose a combined hardware and software technique
called data relocation and prefetching, which eliminates
much of the overhead of data copying through the use of
special hardware. Furthermore, by relocating the data
while performing software prefetching, the overhead of
copying the data can be reduced further. Experimental
results for data relocation and prefetching are encouraging
and show a large improvement in cache performance.
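
To make the software baseline concrete, the following is a minimal sketch of
conventional data copying, not of the proposed hardware mechanism. The matrix
size N, the tile size TILE, and the function names copy_tile and
sum_tile_by_columns are hypothetical and chosen only for illustration. A tile
of a row-major matrix is copied into a small contiguous buffer so that a
column-wise walk touches a compact region instead of striding by N elements.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen for illustration only. */
enum { N = 64, TILE = 8 };

/* Copy the TILE x TILE block whose top-left corner is (row0, col0)
 * from the row-major N x N matrix src into the contiguous buffer buf.
 * This loop is the software copy overhead the text refers to. */
static void copy_tile(const double *src, size_t row0, size_t col0, double *buf)
{
    for (size_t i = 0; i < TILE; i++)
        for (size_t j = 0; j < TILE; j++)
            buf[i * TILE + j] = src[(row0 + i) * N + (col0 + j)];
}

/* Sum a copied tile column by column. In the original matrix this walk
 * would have a stride of N elements; in the buffer the stride is TILE,
 * so the whole tile occupies a contiguous block of memory. */
static double sum_tile_by_columns(const double *buf)
{
    double sum = 0.0;
    for (size_t j = 0; j < TILE; j++)
        for (size_t i = 0; i < TILE; i++)
            sum += buf[i * TILE + j];
    return sum;
}
```

Every element of the tile is read and written once by copy_tile before any
useful work is done on it, so the copy pays off only when the tile is reused
enough times to amortize that cost.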