Mathematics of Prefetch Scheduling Distance
Tb    data transfer latency, which is equal to number of lines
      per iteration * line burst latency
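The definition of Tb above is a single product, which can be sketched in C as follows. The function name, the parameter names, and the choice of cycles as the unit are assumptions made for illustration; they do not come from the manual.

```c
#include <assert.h>

/* Hypothetical helper (name and units are assumptions):
 * Tb = number of lines per iteration * line burst latency.
 * Both latencies are expressed in cycles here. */
static unsigned data_transfer_latency(unsigned lines_per_iter,
                                      unsigned line_burst_latency)
{
    /* A zero burst latency would make the estimate meaningless. */
    assert(line_burst_latency > 0);
    return lines_per_iter * line_burst_latency;
}
```

For instance, with the made-up values of 2 lines per iteration and a line burst latency of 8 cycles, data_transfer_latency(2, 8) yields a Tb of 16 cycles.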
Note that the potential effects of µop reordering are not factored into the
estimations discussed.
Examine Example E-1, which uses the prefetchnta instruction with a
prefetch scheduling distance of 3, that is, psd = 3. The data prefetched
in iteration i will actually be used in iteration i+3. Tc represents the
cycles needed to execute top_loop, assuming all the memory accesses hit
L1, while il (iteration latency) represents the cycles needed to execute
this loop with the actual run-time memory footprint. Tc can be
determined by computing the critical path latency of the code dependency
graph. This work is quite arduous without help from special performance
characterization tools or compilers. A simple heuristic for estimating
the Tc value is to count the number of instructions in the critical path
and multiply that number by an artificial CPI. A reasonable CPI value
would be somewhere between 1.0 and 1.5, depending on the quality of
code scheduling.
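The CPI heuristic described above could be written as a small C helper. This is only a sketch: the function name is hypothetical, and the instruction count passed in is something the reader would obtain by hand from the critical path of their own loop.

```c
#include <assert.h>

/* Hypothetical helper (not from the manual): estimate Tc, in cycles,
 * as the number of instructions on the critical path multiplied by
 * an assumed ("artificial") CPI. */
static double estimate_tc(unsigned critical_path_instrs, double cpi)
{
    /* A non-positive CPI has no meaning for this estimate. */
    assert(cpi > 0.0);
    return critical_path_instrs * cpi;
}
```

Evaluating the helper at both ends of the suggested CPI range gives a bracket for Tc. For a critical path of 10 instructions (an illustrative count), estimate_tc(10, 1.0) gives 10 cycles and estimate_tc(10, 1.5) gives 15 cycles.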
Example E-1 Calculating Insertion for Scheduling Distance of 3
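top_loop:
    prefetchnta [edx+esi+32*3]      ; stream 1: prefetch 3 iterations (96 bytes) ahead
    prefetchnta [edx*4+esi+32*3]    ; stream 2: prefetch 3 iterations (96 bytes) ahead
    . . . . .
    movaps xmm1, [edx+esi]          ; stream 1: load bytes 0-15 of this iteration
    movaps xmm2, [edx*4+esi]        ; stream 2: load bytes 0-15 of this iteration
    movaps xmm3, [edx+esi+16]       ; stream 1: load bytes 16-31
    movaps xmm4, [edx*4+esi+16]     ; stream 2: load bytes 16-31
    . . . . .
    . . .
    add esi, 32                     ; advance 32 bytes per iteration
    cmp esi, ecx                    ; compare the offset with the bound in ecx
    jl top_loop                     ; repeat while esi < ecx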
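A rough C analogue of Example E-1 may help show how psd = 3 turns into an address offset. It is a sketch under several assumptions. The summation body is a made-up stand-in for the elided instructions. The names PSD, FLOATS_PER_ITER and sum_with_prefetch are hypothetical. __builtin_prefetch is a GCC/Clang builtin (not part of standard C) whose locality argument of 0 requests a non-temporal prefetch; on x86 targets compilers typically map it to prefetchnta, though that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

#define PSD 3               /* prefetch scheduling distance, in iterations */
#define FLOATS_PER_ITER 8   /* 8 floats = 32 bytes, matching "add esi, 32" */

/* Sums two float streams, prefetching PSD iterations ahead of the
 * data currently being consumed. The sum is only a placeholder
 * workload; the original loop body is elided in the source. */
static float sum_with_prefetch(const float *a, const float *b, size_t n)
{
    assert(n % FLOATS_PER_ITER == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; i += FLOATS_PER_ITER) {
        /* Data requested in this iteration is used PSD iterations later. */
        size_t ahead = i + (size_t)PSD * FLOATS_PER_ITER;
        if (ahead < n) {    /* avoid forming addresses past the arrays */
            __builtin_prefetch(&a[ahead], 0, 0);  /* read, non-temporal */
            __builtin_prefetch(&b[ahead], 0, 0);
        }
        for (size_t j = 0; j < FLOATS_PER_ITER; ++j)
            total += a[i + j] + b[i + j];
    }
    return total;
}
```

The bounds check on ahead is a design choice of this sketch; the assembly in Example E-1 issues its prefetches unconditionally.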
Source: IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006