Sun Microelectronics
293
17. Grouping Rules and Stalls
17.7.1.4 Read-After-Write and Interaction with Store Buffer
If a load hits the D-Cache and overlaps a store in the store buffer, the load will
not return data until two clocks after the store updates the D-Cache. The overlap
check is pessimistic, because only the lower 14 bits of the effective memory
address are checked. If a store is issued one clock earlier than an overlapping
load that hits the D-Cache, the load data will be returned seven clocks later than
normal. If a load misses the D-Cache and bits 13..4 of the load’s effective memory
address are the same as those of a store in the store buffer, the load data will
not be returned until six clocks after the store leaves the store buffer. If a
store is issued one clock earlier than a D-Cache miss load and bits 13..4 of the
two addresses are the same, the load data will be returned six clocks later than
for a normal D-Cache miss load.
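The two address comparisons described above can be sketched as follows. This is a minimal illustrative model, not the hardware's actual logic. The function names, the byte-range formulation of the D-Cache hit case, and the assumption that accesses are naturally aligned (so they never wrap past bit 13) are ours. Only the bit ranges (the lower 14 bits, and bits 13..4) come from the text.

```python
LOW14_MASK = (1 << 14) - 1   # selects bits 13..0 of an effective address
LINE_FIELD_MASK = 0x3FF      # ten bits, used for bits 13..4 after shifting


def hit_overlap(load_addr: int, load_size: int,
                store_addr: int, store_size: int) -> bool:
    """D-Cache hit case (assumed model): compare byte ranges using only
    the lower 14 address bits, so bits 14 and above are ignored."""
    l = load_addr & LOW14_MASK
    s = store_addr & LOW14_MASK
    return l < s + store_size and s < l + load_size


def miss_match(load_addr: int, store_addr: int) -> bool:
    """D-Cache miss case: bits 13..4 of the two addresses are equal."""
    return (((load_addr >> 4) & LINE_FIELD_MASK)
            == ((store_addr >> 4) & LINE_FIELD_MASK))


# Addresses that differ only above bit 13 still compare as overlapping,
# which is why the check is called pessimistic.
print(hit_overlap(0x10100, 4, 0x24100, 4))   # True (false positive)
print(hit_overlap(0x1000, 4, 0x1004, 4))     # False (adjacent words)
print(miss_match(0x1008, 0x5000))            # True (bits 13..4 equal)
```

Under this model, a store and a load to unrelated addresses that happen to be a multiple of 16 KB apart are treated as overlapping and incur the delays described above.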
MEMBAR #StoreLoad or #MemIssue will block younger loads from returning
data until three clocks after no older stores are outstanding (see Section 17.7.2,
“Store Dependencies,” on page 294). In the best case, a load use will be stalled in
the E Stage until 15 clocks after the previous store is dispatched.
17.7.1.5 Other Timing Issues
Additional clocks are added to the time a load takes to return data for E-Cache
misses and for arbitration for the D- and E-Caches. An E-Cache miss adds at least
twelve clocks plus the system interconnect latency for the first word of the
block, compared to a D-Cache hit. A D-Cache hit following an E-Cache miss returns
data one clock after the E-Cache miss data is returned. A load that misses the
D-Cache and hits the E-Cache following an E-Cache miss returns data nine clocks
after the last word of data from the E-Cache miss is delivered on the system
interconnect. Back-to-back E-Cache misses to clean lines can be issued at a
maximum rate of one every four clocks plus the system latency for the first word
of the block. Writeback of dirty data can be overlapped if the system supports it;
the latency to the first word of read data is at least 18 processor clocks.
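The arithmetic in the paragraph above can be written out as a small calculator. This is a rough sketch under stated assumptions: the function names are ours, the system interconnect latency is a hypothetical input that depends on the system, and "maximum rate" is read as a minimum spacing between successive misses. The constants 12 and 4 come from the text, and both results are lower bounds.

```python
def ecache_miss_extra_clocks(system_latency: int) -> int:
    """Minimum clocks an E-Cache miss adds compared to a D-Cache hit:
    twelve clocks plus the interconnect latency to the first word."""
    return 12 + system_latency


def back_to_back_miss_interval(system_latency: int) -> int:
    """Minimum spacing, in clocks, between back-to-back E-Cache misses
    to clean lines: four clocks plus the first-word system latency."""
    return 4 + system_latency


# Hypothetical first-word interconnect latency of 20 clocks.
latency = 20
print(ecache_miss_extra_clocks(latency))     # 32
print(back_to_back_miss_interval(latency))   # 24
```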
LD{X}FSR blocks dispatch of younger floating-point / graphics instructions that
reference floating-point registers, FB{P}fcc, MOVfcc, ST{X}FSR, and LD{X}FSR
instructions until four clocks after the data is returned in delayed return mode,
and five clocks after the load data is returned otherwise. For example, if there
are no outstanding load misses from the D-Cache:
LDFSR (D-Cache hit)    G    E    C    N1   N2   N3   W    W1   W2
FMULS f7,f7,f8                                                 G
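The LD{X}FSR dispatch rule can also be expressed as a one-line calculation. This is a minimal sketch, assuming that "until N clocks after the data is returned" means the blocked instruction can dispatch N clocks after the return clock. The function name and the clock value in the usage lines are hypothetical; the delays of four and five clocks come from the text.

```python
def earliest_dispatch_clock(data_return_clock: int,
                            delayed_return_mode: bool) -> int:
    """Earliest clock at which a blocked younger instruction can dispatch
    after an LD{X}FSR whose data returns at data_return_clock."""
    delay = 4 if delayed_return_mode else 5
    return data_return_clock + delay


# Hypothetical return clock of 10.
print(earliest_dispatch_clock(10, delayed_return_mode=True))    # 14
print(earliest_dispatch_clock(10, delayed_return_mode=False))   # 15
```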