Sun Microelectronics
16. Code Generation Guidelines
• The instruction buffer almost always contains several instructions when an
  I-Cache miss occurs (an average of about 6.6).
• The instruction buffer is filled faster (up to 4 instructions per cycle) than it is
  emptied.
All these factors contribute to reducing the apparent I-Cache miss latency from 6
cycles (assuming an E-Cache hit) to 0.14 cycles on average for fpppp; that is, on
average, the pipeline is stalled for 0.14 cycles when an I-Cache miss occurs.
The effectiveness of the instruction buffer and the prefetcher on fpppp
demonstrates that techniques that create large sequential blocks of code (such as
loop unrolling) can be used efficiently on UltraSPARC, even if these blocks do not
fit in the I-Cache. On the other hand, for code properly scheduled to take
advantage of the four issue slots on UltraSPARC, the rate of instruction
“consumption” may easily exceed the rate of instruction fetching, thus making
I-Cache misses more apparent.
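The trade-off between buffered instructions and consumption rate can be
illustrated with a deliberately simple back-of-the-envelope model. This is only a
sketch, not a description of the actual pipeline: it assumes the buffer drains at a
constant rate during the miss and receives nothing until the miss is serviced. The
function name and its parameters are hypothetical.

```python
def apparent_miss_stall(miss_latency, buffered, consume_rate):
    """Toy estimate of pipeline stall cycles for one I-Cache miss.

    Assumption (not from the manual): the instruction buffer drains at a
    constant consume_rate (instructions per cycle) and is not refilled
    until the miss has been serviced.
    """
    if consume_rate <= 0:
        # Nothing is being consumed, so the pipeline never runs dry.
        return 0.0
    cycles_covered = buffered / consume_rate
    return max(0.0, miss_latency - cycles_covered)


# 6-cycle miss (E-Cache hit) with 6.6 instructions buffered on average.
slow_consumer = apparent_miss_stall(6, 6.6, 1.0)  # buffer hides the whole miss
fast_consumer = apparent_miss_stall(6, 6.6, 4.0)  # all four issue slots busy
print(slow_consumer, fast_consumer)
```

Under these assumptions, code that consumes one instruction per cycle never sees
the miss, while code that keeps all four issue slots busy empties the buffer in
under two cycles and is exposed to most of the latency.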
16.2.5 uTLB and iTLB Misses
The one-entry uTLB contains the virtual page number and the associated physical
page number of the line accessed last. If the line currently accessed is in the same
page, the instructions from that line are simply forwarded to the next stage. If the
line is from a different virtual page, the translation is obtained from the iTLB a
cycle later. The cost of crossing a page boundary is thus one cycle (the smallest
possible page size, 8 Kbytes, is assumed). This may or may not translate into a
one-cycle penalty for the whole processor. For a tight loop whose code spans
two pages, this cost may be significant, especially if the instruction buffer is
empty at the time of the page crossing. For this reason, it is desirable to position
short loops within a single page (that is, to avoid page crossings).
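As an illustration of the placement rule, the following sketch checks whether a
loop body spans a page boundary and how much padding would move it entirely into
the next page. It assumes the 8-Kbyte page size used in the text and 4-byte SPARC
instructions; the function names and the example addresses are hypothetical and
do not correspond to any particular assembler or linker feature.

```python
PAGE_SIZE = 8 * 1024  # smallest page size (8 Kbytes), as assumed in the text


def crosses_page(start_addr, size_bytes, page_size=PAGE_SIZE):
    """Return True if [start_addr, start_addr + size_bytes) spans a page boundary."""
    if size_bytes <= 0:
        return False
    first_page = start_addr // page_size
    last_page = (start_addr + size_bytes - 1) // page_size
    return first_page != last_page


def padding_to_avoid_crossing(start_addr, size_bytes, page_size=PAGE_SIZE):
    """Bytes of padding to insert before the loop so that it sits in one page.

    Returns 0 if the loop already fits, and None if the loop is larger than
    a page and a crossing cannot be avoided.
    """
    if size_bytes > page_size:
        return None
    if not crosses_page(start_addr, size_bytes, page_size):
        return 0
    # Push the loop start up to the next page boundary.
    return page_size - (start_addr % page_size)


# A 16-instruction loop (16 * 4 = 64 bytes) that starts 32 bytes before a boundary.
loop_start, loop_size = 0x1FE0, 16 * 4
print(crosses_page(loop_start, loop_size))               # True
print(padding_to_avoid_crossing(loop_start, loop_size))  # 32
```

Padding costs code space, so a code generator would presumably apply it only to
loops that are short and frequently executed.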
An iTLB miss is handled by software through the use of the TSB, and takes about
32 cycles. Consequently, an iTLB miss may be very costly in terms of idle
processor cycles. In order to minimize the frequency of iTLB misses, UltraSPARC
provides a large number of entries (64) in the iTLB and allows pages as large as
4 Mbytes to be used. Nonetheless, techniques that allocate pages based on
profiling are encouraged to further decrease the iTLB miss cost.
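The effect of page size on iTLB coverage can be worked out directly from the
figures above. The sketch below computes the amount of code the iTLB can map at
once, under the simplifying assumption that every one of the 64 entries maps a
distinct page of the same size; the helper name is hypothetical.

```python
ITLB_ENTRIES = 64      # number of iTLB entries, from the text
KBYTE = 1024
MBYTE = 1024 * KBYTE


def itlb_reach(page_size, entries=ITLB_ENTRIES):
    """Bytes of code mapped at once if each entry maps a distinct page."""
    return entries * page_size


reach_small = itlb_reach(8 * KBYTE)  # 64 * 8 Kbytes = 512 Kbytes
reach_large = itlb_reach(4 * MBYTE)  # 64 * 4 Mbytes = 256 Mbytes
print(reach_small // KBYTE, "Kbytes")
print(reach_large // MBYTE, "Mbytes")
```

With 8-Kbyte pages, a program whose frequently executed code is spread over more
than 512 Kbytes can be expected to take iTLB misses, which is why grouping hot
code onto few pages or using larger pages helps.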
16.2.6 Branch Prediction
UltraSPARC predicts the outcome of branches and fetches the next instructions
likely to be executed based on that outcome. While this is all done dynamically in
hardware, the compiler has an impact on the initialization of the state machine.