IA-32 Intel® Architecture Optimization
Using the 8-byte Streaming Stores and Software Prefetch
Example 6-11 presents a copy algorithm that uses the second-level cache.
The algorithm performs the following steps:
1. Uses a blocking technique to transfer 8-byte data from memory into the second-level cache using the _mm_prefetch intrinsic, 128 bytes at a time, to fill a block. The size of a block should be less than one half of the size of the second-level cache, but large enough to amortize the cost of the loop.
2. Loads the data into an xmm register using the _mm_load_ps intrinsic.
3. Transfers the 8-byte data to a different memory location via the _mm_stream intrinsics, bypassing the cache. For this operation, it is important to ensure that the page table entry for the prefetched memory is preloaded in the TLB.
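The three steps above can be sketched as one self-contained C function. This is a minimal sketch under stated assumptions, not the manual's listing. The name copy_prefetch_stream is hypothetical, and the code assumes that both buffers are 16-byte aligned and that the element count is a multiple of the block size. It uses _mm_stream_ps as one member of the _mm_stream family. It also makes three choices that are not taken from the source: a bounds guard on the TLB-priming load, a volatile temporary so the compiler keeps that load, and a prefetch loop that starts at the first line of the block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* _mm_prefetch, _mm_load_ps, _mm_stream_ps, _mm_sfence */

#define NUMPERPAGE   512 /* doubles per 4096-byte page (block size = page size) */
#define LINE_DOUBLES 16  /* doubles per 128-byte prefetch step */

/* Hypothetical helper: copies n doubles from src to dst.
   Assumes 16-byte-aligned pointers and n a multiple of NUMPERPAGE. */
static void copy_prefetch_stream(double *dst, const double *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % NUMPERPAGE == 0);

    /* volatile so the TLB-priming load is not optimized away (assumption) */
    volatile double temp = 0.0;

    for (size_t kk = 0; kk < n; kk += NUMPERPAGE) {
        /* TLB priming: touch the next block, but never read past the end. */
        if (kk + NUMPERPAGE < n)
            temp = src[kk + NUMPERPAGE];

        /* Step 1: prefetch the whole block, 128 bytes per iteration. */
        for (size_t j = kk; j < kk + NUMPERPAGE; j += LINE_DOUBLES)
            _mm_prefetch((const char *)&src[j], _MM_HINT_NTA);

        /* Steps 2 and 3: load 16 bytes (two doubles) into an xmm register,
           then write them with a streaming store that bypasses the cache. */
        for (size_t j = kk; j < kk + NUMPERPAGE; j += 2)
            _mm_stream_ps((float *)&dst[j],
                          _mm_load_ps((const float *)&src[j]));
    }

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
}
```

A caller would pass buffers declared with 16-byte alignment, for example static _Alignas(16) double a[N], b[N]; with N a multiple of 512, and then call copy_prefetch_stream(b, a, N).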
Example 6-11 A Memory Copy Routine Using Software Prefetch
#define PAGESIZE 4096
#define NUMPERPAGE 512   // # of elements to fit a page
double a[N], b[N], temp;
for (kk=0; kk<N; kk+=NUMPERPAGE) {
    temp = a[kk+NUMPERPAGE];   // TLB priming
    // use block size = page size,
    // prefetch entire block, one cache line per loop
    for (j=kk+16; j<kk+NUMPERPAGE; j+=16) {
        _mm_prefetch((char*)&a[j], _MM_HINT_NTA);
    }
(continued)
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006