Sun Microelectronics
16. Code Generation Guidelines
16.3.2 D-Cache Timing
The latency of a load to the D-Cache depends on the opcode. For unsigned loads,
data can be used two cycles after the load. For instance, if the first two
instructions in the instruction buffer are a load and an instruction dependent on
that load, the grouping logic will break the group after the load and a bubble
will be inserted in the pipeline the following cycle. Code compiled for an earlier
SPARC processor with a load-use penalty of one cycle will show a penalty of about
0.1 CPI just for this rule; thus, it is very important to separate loads from
their use.
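The effect of separating a load from its use can be illustrated with a small model. The sketch below is not a description of the UltraSPARC grouping logic: it assumes a simplified in-order pipeline that issues one instruction per cycle, and the names Instr, count_bubbles, and ALU_LATENCY are hypothetical. Only the two-cycle unsigned-load latency is taken from the text.

```python
from typing import NamedTuple, Optional, Tuple

LOAD_LATENCY = 2  # unsigned-load latency in cycles, as stated in the text
ALU_LATENCY = 1   # assumption: other results are usable in the next cycle


class Instr(NamedTuple):
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read
    is_load: bool = False


def count_bubbles(program):
    """Count stall cycles in a single-issue, in-order model (a simplification)."""
    ready = {}    # register -> first cycle in which its value may be used
    cycle = 0     # cycle in which the next instruction would issue if not stalled
    bubbles = 0
    for ins in program:
        earliest = max([ready.get(r, 0) for r in ins.srcs], default=0)
        if earliest > cycle:
            bubbles += earliest - cycle   # wait for the load data
            cycle = earliest
        if ins.dest is not None:
            latency = LOAD_LATENCY if ins.is_load else ALU_LATENCY
            ready[ins.dest] = cycle + latency
        cycle += 1
    return bubbles


# Load immediately followed by its use: one bubble in this model.
back_to_back = [
    Instr("r1", ("r2",), is_load=True),
    Instr("r3", ("r1",)),
]

# An independent instruction scheduled between the load and its use: no bubble.
separated = [
    Instr("r1", ("r2",), is_load=True),
    Instr("r5", ("r6",)),
    Instr("r3", ("r1",)),
]
```

In this model, count_bubbles(back_to_back) reports one stall cycle and count_bubbles(separated) reports none, which mirrors the scheduling advice above.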
16.3.2.1 Signed Loads
All signed loads smaller than 64 bits must be separated from their use by three
cycles; otherwise, an extra bubble is inserted in the pipeline to force the
separation between the load and its use. Floating-point loads are not sign
extended, so they have a latency of two cycles.
Once a signed load (smaller than 64 bits) is encountered in the instruction stream,
all subsequent consecutive loads (signed or unsigned) also return data in three
cycles; otherwise, there would be a collision between two loads returning data.
As soon as a cycle without a load appears in the pipeline, the latency of loads is
brought back to two cycles.
Note:
The SPARC-V8 LD instruction is replaced with LDUW in SPARC-V9; the
new instruction does not require sign extension.
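The signed-load rule in this section can be sketched as a small state machine. This is an illustrative model only: it assumes at most one load per cycle, represents each cycle as "signed" (a signed load smaller than 64 bits), "unsigned" (any other load), or None (no load in that cycle), and the function name load_latencies is hypothetical.

```python
def load_latencies(slots):
    """Return the modeled load latency for each cycle, or None for cycles with no load.

    slots: one entry per cycle, each "signed", "unsigned", or None.
    """
    latencies = []
    extended = False   # True while loads must return data in three cycles
    for slot in slots:
        if slot is None:
            # A cycle without a load brings the latency back to two cycles.
            extended = False
            latencies.append(None)
            continue
        if slot == "signed":
            # A signed load smaller than 64 bits starts the three-cycle regime.
            extended = True
        latencies.append(3 if extended else 2)
    return latencies


example = ["unsigned", "signed", "unsigned", None, "unsigned"]
```

For the example sequence, the model gives latencies of 2, 3, 3 for the first three loads and 2 for the load that follows the empty cycle.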
16.3.3 Data Alignment
SPARC-V9 requires that all accesses be aligned on an address equal to the size of
the access. Otherwise, a mem_address_not_aligned trap is generated. This is
especially important for double-precision floating-point loads, which should be
aligned on an 8-byte boundary. If misalignment is determined to be possible at
compile time, it is better to use two LDF (load floating-point, single precision)
instructions and avoid the trap. UltraSPARC supports single-precision loads mixed
with double-precision operations, so that the case above can execute without
penalty (except for the additional load). If a trap does occur, UltraSPARC
dedicates a trap vector for this specific misalignment, which reduces the overall
penalty of the trap.
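The alignment rule and the compile-time choice described above can be sketched as follows. This is a minimal illustration, not compiler code: the function names would_trap and plan_double_load are hypothetical, the compiler's knowledge is reduced to a single guaranteed alignment in bytes, and LDDF is used here as the SPARC-V9 mnemonic for the double-precision floating-point load.

```python
def would_trap(address, size):
    """True if an access of `size` bytes at `address` violates the alignment rule."""
    return address % size != 0


def plan_double_load(guaranteed_alignment):
    """Pick a load sequence for a double-precision operand.

    guaranteed_alignment: the alignment in bytes that is known at compile time
    (an assumption of this sketch; real compilers track this differently).
    """
    if guaranteed_alignment >= 8:
        return ["LDDF"]            # one 8-byte load, cannot trap
    if guaranteed_alignment >= 4:
        return ["LDF", "LDF"]      # two 4-byte loads avoid the trap
    # Weaker alignment is outside the case discussed in the text.
    raise ValueError("operand is not known to be 4-byte aligned")
```

For instance, an 8-byte access at address 0x1004 would trap under this rule, while plan_double_load(4) selects two single-precision loads for an operand that is only known to be 4-byte aligned.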
Grouping load data is desirable, since a D-Cache sub-block can contain either
four properly aligned single-precision operands or two properly aligned double-
precision operands (eight and four respectively for a D-Cache line). As we shall