Floating point unit
85
Programming the MIPS32® 74K™ Core Family, Revision 02.14
FP hardware can rule out an exception, and that leads to a delay on every non-trivial FP operation. With a half-rate
FPU, this stall will most likely be 6-7 clocks.
Software which can tolerate some deviation from IEEE precision can avoid these delays by opting to replace all
denormalized inputs and results by zero. This is controlled by the FCSR[FS,FO,FN] register bits described in
"FPU (co-processor 1) control registers" and its notes. If you have also disabled all IEEE traps, you get no
possibility of FP exceptions and no extra main pipeline delay.
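As a rough illustration of the setting described above, the following C sketch computes an FCSR value with the flush bits set and the IEEE trap enables cleared. The bit positions (FS = 24, FO = 22, FN = 21, enables in bits 11:7, cause in bits 17:12), the choice to set all three flush bits together, and the helper names are assumptions made for this example; check them against the control-register description before relying on them.

```c
#include <stdint.h>

/* Bit positions assumed from the MIPS32 FCSR layout; confirm them against
   "FPU (co-processor 1) control registers" before use. */
#define FCSR_FS          (1u << 24)     /* flush to zero */
#define FCSR_FO          (1u << 22)     /* flush override */
#define FCSR_FN          (1u << 21)     /* flush to nearest */
#define FCSR_ENABLE_MASK (0x1Fu << 7)   /* IEEE trap enable bits */
#define FCSR_CAUSE_MASK  (0x3Fu << 12)  /* pending exception cause bits */

/* Hypothetical helper: return an FCSR value with denormal flushing on and
   all IEEE traps disabled. Other fields (rounding mode, condition codes,
   sticky flags) are left untouched. Cause bits are cleared so that writing
   the value back cannot itself raise a pending exception. */
uint32_t fcsr_no_exception_mode(uint32_t fcsr)
{
    fcsr |= FCSR_FS | FCSR_FO | FCSR_FN;
    fcsr &= ~(FCSR_ENABLE_MASK | FCSR_CAUSE_MASK);
    return fcsr;
}

#if defined(__mips__) && defined(__mips_hard_float)
/* On a MIPS hard-float target, FCSR is FPU control register $31 and is
   read and written with cfc1/ctc1. */
void fp_enter_no_exception_mode(void)
{
    uint32_t fcsr;
    __asm__ volatile("cfc1 %0, $31" : "=r"(fcsr));
    fcsr = fcsr_no_exception_mode(fcsr);
    __asm__ volatile("ctc1 %0, $31" : : "r"(fcsr));
}
#endif
```

Keeping the mask computation separate from the register access means the bit manipulation can be checked on any host, while the cfc1/ctc1 wrapper is only compiled for a MIPS hard-float target.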
6.5.5 Delays when main pipeline waits for FPU to accept an instruction
FP instructions are queued (some queues are shared with other co-processors, if fitted) for transmission to the FPU
hardware. If that queue (which has 8 entries) fills up, the CPU will be unable to issue more FP instructions, and
since FP instructions are issued in-order, that will quickly clog up the CPU.
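The effect of a full queue can be pictured with a deliberately simplified toy model. It is not a description of 74K timing: the one-instruction-per-cycle issue attempt, the fixed drain interval and the function name are all assumptions made for illustration. Only the queue depth of 8 comes from the text above.

```c
#define FP_QUEUE_DEPTH 8u   /* queue size stated in the text above */

/* Toy model (hypothetical): every cycle the CPU tries to queue one FP
   instruction, and the FPU accepts one queued instruction every
   drain_interval cycles. Returns the number of cycles in which the CPU
   could not issue because the queue was full. */
unsigned fp_issue_stalls(unsigned n_instr, unsigned drain_interval)
{
    unsigned queued = 0, issued = 0, stalls = 0;

    if (drain_interval == 0)
        drain_interval = 1;             /* avoid a queue that never drains */

    for (unsigned cycle = 1; issued < n_instr; cycle++) {
        /* The FPU takes one instruction on its accept cycles. */
        if (queued > 0 && cycle % drain_interval == 0)
            queued--;

        if (queued < FP_QUEUE_DEPTH) {  /* room in the queue: issue */
            queued++;
            issued++;
        } else {                        /* queue full: issue is blocked */
            stalls++;
        }
    }
    return stalls;
}
```

In this model a burst of up to 8 FP instructions never stalls, whatever the drain rate, but a longer burst stalls as soon as the FPU accepts instructions more slowly than the CPU issues them.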
6.5.6 Delays on mfc1/mtc1 instructions
mtc1 goes down the main pipe and gets its GP register data just like any other instruction (from the register file, a
completion buffer or a bypass), then passes it across to the FPU. In the FPU pipeline, the mtc1 looks like an FP
load which hits: the data is sent to the FP unit a predictable number of cycles after it is issued.
mfc1 (in the FPU pipeline) resembles an FP store. The FP data is sent back along the same FPU-to-EU data path as is
used in a store, but is then written into the completion buffer (CB) which belongs to the integer AGEN pipeline’s
version of the same mfc1 instruction. The timing is awkward because the hardware has to find a free completion
buffer write port. Once the data is in the CB, the mfc1 is a candidate for graduation. Since the FPU pipeline is long
and usually runs slower than the integer pipeline, the effective latency of mfc1 can be high. A program will run
faster if the mfc1 can be placed 10-15 instruction positions ahead of its consumer.
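The scheduling advice above can be approximated from C by starting the FP-to-integer conversion for the next loop iteration before consuming the current one. This is only a sketch: it assumes the compiler implements the float-to-int cast with a convert followed by mfc1, and that it keeps roughly this ordering. The function name and data layout are made up for the example, and a real loop would need more independent work between the conversion and its use to cover 10-15 instruction positions.

```c
#include <stddef.h>

/* Hypothetical example: sum table entries selected by float-valued
   positions. Each (int) cast moves a value from the FPU to a GP register,
   so the cast for element i+1 is started before table[idx] for element i
   is used, giving the move time to complete. */
long gather_sum(const float *pos, const int *table, size_t n)
{
    long sum = 0;

    if (n == 0)
        return 0;

    int idx = (int)pos[0];              /* first conversion, ahead of the loop */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = (int)pos[i + 1];     /* start the next conversion early */
        sum += table[idx];              /* consume the earlier result */
        idx = next;
    }
    sum += table[idx];                  /* last element */
    return sum;
}
```

Inspecting the generated assembly is the only way to confirm where the mfc1 actually lands relative to its consumer.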
6.5.7 Delays caused by dependency on FPU status register fields
The conditional branch instructions bc1f/bc1t and the conditional moves movf/movt execute in the main pipeline,
but test an FP condition bit generated by the various FPU compare instructions. bc1f/bc1t (like other conditional
branches) are executed speculatively in the execution unit. FP condition values are not passed through CBs, so the
check for a mispredict is not made until the branch instruction tries to graduate. That means that mispredicted FP
branches are a couple of cycles more expensive than regular mispredictions.
MIPS recommends that you don’t use the “branch likely” (bc1fl/bc1tl) versions of these instructions in new code.
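Because a mispredicted FP branch costs more than an ordinary one, it can be worth writing data-dependent FP tests so that the compiler has the option of using a compare plus conditional move or set instead of bc1f/bc1t. The two hypothetical functions below compute the same result. Whether a given compiler really emits a branch for the first and branch-free code for the second is an assumption that has to be checked in the generated assembly.

```c
#include <stddef.h>

/* Branch form: the FP compare result typically feeds a conditional
   branch, which may mispredict on unpredictable data. */
size_t count_below_branchy(const float *x, size_t n, float limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] < limit)
            count++;
    }
    return count;
}

/* Select form: the comparison is used as a 0/1 value, which gives the
   compiler the chance to avoid a branch on the FP condition. */
size_t count_below_select(const float *x, size_t n, float limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(x[i] < limit);
    return count;
}
```

The select form only pays off when the condition is hard to predict; for well-predicted branches the two versions should perform similarly.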
6.5.8 Slower operation in MIPS I™ compatibility mode
Historic 32-bit MIPS CPUs had only 16 “even-numbered” floating point registers usable for arithmetic, with
odd-numbered registers working together with them to let you load, store and transfer double-precision (64-bit)
values. Software written for those old CPUs is incompatible with the full modern FPU, so there’s a compatibility
bit provided in Status[FR]: set it to zero to run MIPS I compatible code. This comes at the cost of slower repeat
rates for FP instructions, because in compatibility mode not all the bypasses shown in the pipeline diagram above
are active.
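As a small sketch of how software might interpret the compatibility bit, the helpers below decode Status[FR] from a value of the CP0 Status register that has already been read (reading it needs kernel privilege and is not shown). The bit position 26 and the helper names are assumptions for this example. The register counts follow the description above: 16 even-numbered registers in compatibility mode, and the full set of 32 otherwise.

```c
#include <stdint.h>

#define STATUS_FR (1u << 26)  /* assumed position of FR in CP0 Status */

/* Returns 1 when the full FPU register model is selected, or
   0 when MIPS I compatibility mode (FR = 0) is in use. */
int status_fr(uint32_t status)
{
    return (status & STATUS_FR) != 0;
}

/* Number of FP registers usable for double-precision arithmetic,
   following the description in the text. */
unsigned fp_double_regs(uint32_t status)
{
    return status_fr(status) ? 32u : 16u;
}
```

Code that is performance sensitive could use a check like this at start-up to report when it is running in the slower compatibility mode.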
Programming the MIPS32 74K Core Family, Document Number MD00541, Revision 02.14, March 30, 2011.