Sun Microelectronics
16. Code Generation Guidelines
struction would be fetched in that case. If the target is accessed from more than
one place, it should be aligned so that it accommodates the largest possible
group. If accesses to the I-Cache are expected to miss, it may be desirable to align
targets on a 16-byte (or even 32-byte) boundary so that 4 instructions are forwarded
to the next stage. Such an alignment can at least assure that 4 (8 for 32-byte align-
ment) instructions can be processed between cache misses, assuming that the
code does not branch out of the sequence of instructions (which is generally not
the case for integer programs).
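
The alignment arithmetic above can be sketched as follows. This is a minimal illustration, not part of any toolchain: the helper names are hypothetical, and it assumes only that SPARC instructions are a fixed 4 bytes wide and that the alignment is a power of two.

```python
INSTR_BYTES = 4  # SPARC instructions are fixed-width, 4 bytes each


def padding_to_align(addr: int, alignment: int) -> int:
    """Bytes of padding needed to move addr up to the next alignment boundary."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (-addr) % alignment


def instrs_before_boundary(addr: int, alignment: int) -> int:
    """Sequential instructions available from addr up to the next boundary."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (alignment - addr % alignment) // INSTR_BYTES
```

A target that already sits on a 16-byte boundary has 4 instructions ahead of it before the next boundary, and one on a 32-byte boundary has 8, which matches the counts given in the text. A target at the last word of a 16-byte block has only 1.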
16.2.2.3 Impact of the Delay Slot on Instruction Fetch
If the last instruction of a line is a branch, the next sequential line in the I-Cache
must be fetched even if the branch is predicted taken, since the delay slot must be
sent to the grouping logic. This leads to inefficient fetches, since an entire
E-Cache access must be dedicated to fetching the missing delay slot. Take care
not to place delayed CTIs (control transfer instructions) that are predicted taken at
the end of a cache line.
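
A code generator could flag the placement described above with a check along these lines. This is only a sketch under stated assumptions: the 32-byte line size is an assumed parameter that should be set for the actual target, and the input format (a list of address, is-delayed-CTI, predicted-taken tuples) and function names are hypothetical.

```python
INSTR_BYTES = 4   # SPARC instructions are fixed-width, 4 bytes each
LINE_BYTES = 32   # assumed I-Cache line size; adjust for the target


def is_last_slot_of_line(addr, line_bytes=LINE_BYTES):
    """True if the instruction at addr occupies the final slot of its line."""
    return addr % line_bytes == line_bytes - INSTR_BYTES


def badly_placed_ctis(instrs, line_bytes=LINE_BYTES):
    """Return addresses of predicted-taken delayed CTIs at the end of a line.

    instrs is a hypothetical list of
    (address, is_delayed_cti, predicted_taken) tuples.
    """
    return [addr for addr, is_cti, taken in instrs
            if is_cti and taken and is_last_slot_of_line(addr, line_bytes)]
```

Each address returned marks a CTI whose delay slot falls in the next line, so moving the CTI one slot earlier or later (for example by rescheduling an independent instruction) would avoid the extra fetch.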
16.2.2.4 Instruction Alignment for the Grouping Logic
UltraSPARC can execute up to four instructions per cycle. The first three instruc-
tions in a group occupy slots that in most cases are interchangeable with respect
to resources. Only special cases of instructions that can only be executed in IEU₁
followed by IEU₀ candidates violate this interchangeability (described in Section
17.5, “Integer Execution Unit (IEU) Instructions,” on page 284). The fourth slot
can only be used for PC-based branches or for floating-point instructions. Conse-
quently, in order to get the most performance out of UltraSPARC, the code
should be organized so that either a floating-point operation (FPop) or a branch
is aligned with the fourth slot. For floating-point code, it should be relatively
easy for the compiler to take advantage of the added execution bandwidth provided
by the fourth slot. For integer code, aligning the branch so that it is issued
fourth in a group must be balanced with other factors that may be more important,
such as not placing a branch at the end of a cache line. Moreover, if dependency
analysis shows that a group of four instructions could be issued, but the
fourth instruction is not a branch or an FPop while one of the first three is a
branch, the compiler must evaluate the following trade-off before switching the
two instructions (assuming no data dependency):
• Moving the fourth instruction ahead of the branch (cross-block scheduling)
  and generating possible compensation code for the alternate path.