Chapter 6: Optimizing Cache Usage
In Example 6-11, each iteration of the copy loop issues eight
_mm_load_ps and _mm_stream_ps pairs. Each pair moves 16 bytes, so the
eight pairs together write back all of the data that was prefetched
(one 128-byte cache line). The prefetches and the streaming stores are
executed in separate loops to minimize the number of transitions
between reading and writing data, which significantly improves the
bandwidth of the memory accesses.
    // copy 128 bytes per iteration: eight 16-byte load/store pairs.
    // j advances by 16 elements per 128 bytes, so a and b hold 8-byte
    // elements and each 16-byte transfer moves the index forward by 2.
    for (j=kk; j<kk+NUMPERPAGE; j+=16) {
      _mm_stream_ps((float*)&b[j],
        _mm_load_ps((float*)&a[j]));
      _mm_stream_ps((float*)&b[j+2],
        _mm_load_ps((float*)&a[j+2]));
      _mm_stream_ps((float*)&b[j+4],
        _mm_load_ps((float*)&a[j+4]));
      _mm_stream_ps((float*)&b[j+6],
        _mm_load_ps((float*)&a[j+6]));
      _mm_stream_ps((float*)&b[j+8],
        _mm_load_ps((float*)&a[j+8]));
      _mm_stream_ps((float*)&b[j+10],
        _mm_load_ps((float*)&a[j+10]));
      _mm_stream_ps((float*)&b[j+12],
        _mm_load_ps((float*)&a[j+12]));
      _mm_stream_ps((float*)&b[j+14],
        _mm_load_ps((float*)&a[j+14]));
    } // finished copying one block
  }
  // finished copying N elements
  // order the streaming stores ahead of any stores that follow
  _mm_sfence();
Example 6-11 A Memory Copy Routine Using Software Prefetch
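The excerpt above shows only the copy phase of the routine. To make the two-phase structure concrete, the following is a minimal self-contained sketch, not the manual's listing. The function name prefetch_copy, the LINE_DOUBLES constant, the use of double as the element type, the value 512 for NUMPERPAGE (one 4096-byte page of doubles), and the form of the prefetch loop are all assumptions made for illustration; the inner loop over i is simply the eight load/store pairs written compactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* _mm_prefetch, _mm_load_ps, _mm_stream_ps, _mm_sfence */

#define NUMPERPAGE   512 /* assumed: doubles per 4096-byte page */
#define LINE_DOUBLES 16  /* doubles per 128-byte block */

/* Hypothetical helper: copy n doubles from a to b one page-sized block
   at a time. Assumes both pointers are 16-byte aligned and that n is a
   multiple of NUMPERPAGE. */
void prefetch_copy(double *b, const double *a, size_t n)
{
    size_t kk, j, i;

    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);
    assert(n % NUMPERPAGE == 0);

    for (kk = 0; kk < n; kk += NUMPERPAGE) {
        /* Phase 1: reads only. Request one prefetch per 128-byte block
           of the current page (assumed form of the prefetch loop). */
        for (j = kk; j < kk + NUMPERPAGE; j += LINE_DOUBLES)
            _mm_prefetch((const char *)&a[j], _MM_HINT_NTA);

        /* Phase 2: copy the page. Each load/store pair moves 16 bytes
           (two doubles), so eight pairs cover one 128-byte block. */
        for (j = kk; j < kk + NUMPERPAGE; j += LINE_DOUBLES)
            for (i = 0; i < LINE_DOUBLES; i += 2)
                _mm_stream_ps((float *)&b[j + i],
                              _mm_load_ps((const float *)&a[j + i]));
    }

    /* Order the streaming stores ahead of any stores that follow. */
    _mm_sfence();
}
```

A caller would pass two 16-byte-aligned buffers whose length is a multiple of NUMPERPAGE, for example prefetch_copy(dst, src, 2 * NUMPERPAGE). The prefetch is only a hint, so the result is correct whether or not the hardware honors it; the block size and prefetch distance would need tuning on any particular processor.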
Source: IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006