Numerical applications frequently contain nested loop 
structures that process large arrays of data.  The 
execution of these loop structures often produces memory 
reference patterns that utilize data caches poorly.  
Limited associativity and cache capacity result in 
cache conflict misses.  Also, non-unit stride access 
patterns can cause low utilization of cache lines.
Data copying has been proposed and investigated as a 
way to reduce cache conflict misses, but this technique 
has a high execution overhead because it performs the 
copy operations entirely in software.
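
As a rough illustration of software data copying, the sketch below 
copies a tile of a row-major matrix into a contiguous buffer, so that 
later accesses to the tile use a small stride and do not depend on the 
leading dimension of the full array.  The names copy_tile and 
column_sum are hypothetical and chosen only for this example; the 
sketch is not the implementation evaluated in this work.

```c
#include <stddef.h>

/* Hypothetical helper: copy a rows x cols tile of a row-major matrix
 * (leading dimension ld) into the contiguous buffer dst.
 * Every element is moved by an ordinary load and store, which is the
 * software overhead that data copying incurs. */
void copy_tile(const double *src, size_t ld,
               size_t rows, size_t cols, double *dst)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[i * cols + j] = src[i * ld + j];
}

/* Hypothetical helper: sum one column of a tile stored with the given
 * row stride.  In the original matrix the stride is ld; in the copied
 * tile it is only cols, so the accessed elements are packed more
 * densely in memory. */
double column_sum(const double *a, size_t stride,
                  size_t rows, size_t col)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        s += a[i * stride + col];
    return s;
}
```

Calling column_sum on the copied tile with stride cols returns the same 
value as calling it on the original matrix with stride ld.  The copy 
pays off only if the tile is reused often enough to amortize the cost 
of the copy loop.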
  
   We propose a combined hardware and software technique 
called data relocation and prefetching, which eliminates 
much of the overhead of data copying through the use of 
special hardware.  In addition, relocating the data 
while performing software prefetching reduces the 
overhead of copying the data even further.  Experimental 
results for data relocation and prefetching are encouraging 
and show a large improvement in cache performance.