Programming the MIPS32® 74K™ Core Family, Revision 02.14
• div/divu: divide - the quotient goes into lo and the remainder into hi.
Many of the most powerful instructions in the MIPS DSP ASE are variants of multiply or multiply-accumulate
operations, and are described in Chapter 7, “The MIPS32® DSP ASE”. The DSP ASE also provides three additional
“accumulators” which behave like the hi/lo pair.
No multiply/divide operation ever produces an exception - even divide-by-zero is silent - so compilers typically insert
explicit check code where it’s required.
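As an illustration of the kind of explicit check meant here, the following is a minimal C sketch. The function name checked_divmod and its interface are hypothetical (they are not from this manual or from any toolchain), and the second check covers a signed-overflow case which the text above does not mention but which C also leaves undefined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: signed divide with the checks the hardware omits.
 * Returns false and leaves *quot and *rem untouched when the result would
 * be undefined; otherwise stores the quotient and remainder. */
bool checked_divmod(int32_t num, int32_t den, int32_t *quot, int32_t *rem)
{
    assert(quot != NULL && rem != NULL);

    if (den == 0)
        return false;           /* the divide itself would not trap */
    if (num == INT32_MIN && den == -1)
        return false;           /* quotient does not fit in 32 bits */

    *quot = num / den;          /* quotient: what div leaves in lo */
    *rem  = num % den;          /* remainder: what div leaves in hi */
    return true;
}
```

A caller tests the return value before using the outputs, which is roughly what compiler-inserted check code does before it lets a divide result be used.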
The 74K core multiplier is high performance and pipelined; multiply/accumulate instructions can run at a rate
of 1 per clock, but a 32×32 3-operand multiply takes six clocks longer than a simple ALU operation. Divides use
a bit-per-clock algorithm, which is short-cut for smaller dividends. Multiply/divide instructions are generally
slow enough that it is difficult to arrange programs so that their results will be ready when needed.
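One common way to live with the result latency is to keep a running sum in the accumulator and read it only once, after the loop. The sketch below shows that shape in C. The function name dot_product is hypothetical, and whether a given compiler turns the loop body into a multiply-accumulate on hi/lo is an assumption about the toolchain, not something this manual guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: 32x32 -> 64-bit multiply-accumulate loop.
 * The 64-bit sum is consumed only after the loop ends, so no iteration
 * has to wait for a finished multiply result in a general register. */
int64_t dot_product(const int32_t *a, const int32_t *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];   /* widen first so the product is 64-bit */
    return acc;
}
```

Inspect the generated code before relying on this; the pattern helps only if the compiler keeps the sum in the accumulator.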
4.5 Tuning software for the 74K™ family pipeline
This section is addressed to low-level programmers who are tuning software by hand and to those working on
efficient compilers or code translators.
74K family cores have a complex out-of-order pipeline, which makes fine-grain instruction interactions very
difficult to summarize. See Section 1.4 “A brief guide to the 74K™ core implementation” for a reasonably
accurate picture of the basic pipeline, from which you will be able to foresee some effects. We hope that a
later version of this manual may be more helpful, but with a complex out-of-order CPU like this one you will
always get more insight from running code on a real CPU or a cycle-accurate simulator.
4.5.1 Cache delays and mitigating their effect
In a typical 74K CPU implementation a cache miss which has to be refilled from DRAM memory (in the very next
chip on the board) will be delayed by a period of time long enough to run 50-200 instructions. A miss or uncached
read (perhaps of a device register) may easily be several times slower. These really are important!
Because these delays are so large, there’s not a lot you can do to help a cache-missing program make progress.
But every little bit helps. The 74K core has non-blocking loads, so if you can move a load instruction well
ahead of the instruction which consumes its result, you won’t start paying for your memory delay until you try
to run the consuming instruction.
Compilers and programmers find it difficult to move fragments of an algorithm backwards like this, so the
architecture also provides prefetch instructions (which fetch designated data into the D-cache, but do nothing
else). Because they’re free of most side-effects it’s easier to issue prefetches very early. Any loop which
walks predictably through a large array is a candidate for prefetch instructions, which are conveniently placed
within one iteration to prefetch data for the next.
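The "prefetch for the next iteration" pattern can be sketched in C as follows. This is an illustration, not tuned code: it assumes a GCC- or Clang-compatible compiler (for __builtin_prefetch, which is expected but not guaranteed to become a pref instruction on a MIPS32 target), the function name sum_words is hypothetical, and the 32-byte line size behind WORDS_PER_LINE is an assumption you should check against your core's configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8u   /* assumption: 32-byte cache line, 4-byte words */

/* Hypothetical example: sum an array one cache line at a time, asking for
 * the next line before working through the current one. */
uint32_t sum_words(const uint32_t *a, size_t n)
{
    assert(n == 0 || a != NULL);

    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += WORDS_PER_LINE) {
        /* Prefetch only while the next line still lies inside the array. */
        if (i + WORDS_PER_LINE < n)
            __builtin_prefetch(&a[i + WORDS_PER_LINE], 0);  /* 0 = for reading */

        size_t end = (n - i < WORDS_PER_LINE) ? n : i + WORDS_PER_LINE;
        for (size_t j = i; j < end; j++)
            sum += a[j];
    }
    return sum;
}
```

Prefetching one line ahead is the simplest choice. With a refill delay worth 50-200 instructions, a loop body this short may need to prefetch several lines ahead, and the right distance is best found by measurement.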
The pref PrepareForStore prefetch saves a cache refill read, for cache lines which you intend to overwrite in
their entirety. Read more about prefetch in Section 4.2, "Prefetching data" above.
Tuning data-intensive common functions
Bulk operations like bcopy() and bzero() will benefit from CPU-specific tuning. To get excellent performance
for in-cache data, it’s only necessary to reorganize the software enough to cover the address-to-store and
load-to-use delays. But to get the loop to achieve the best performance when cache missing, you probably want
to use some prefetches. MIPS Technologies may have example code for such functions; ask.
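To make the two ideas concrete, here is a minimal C sketch of a word-copy loop. It is not MIPS Technologies' example code: the name copy_words is hypothetical, the buffers are assumed not to overlap, __builtin_prefetch assumes a GCC- or Clang-compatible compiler, and the unroll factor of 8 words assumes a 32-byte cache line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: copy n 32-bit words between non-overlapping buffers.
 * Each pass issues eight loads before the eight stores, so no store sits
 * directly behind the load it depends on, and prefetches the next source
 * line (assumed 32 bytes) while the current one is copied. */
void copy_words(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        if (i + 16 <= n)
            __builtin_prefetch(&src[i + 8], 0);   /* next full source line */

        uint32_t w0 = src[i],     w1 = src[i + 1];
        uint32_t w2 = src[i + 2], w3 = src[i + 3];
        uint32_t w4 = src[i + 4], w5 = src[i + 5];
        uint32_t w6 = src[i + 6], w7 = src[i + 7];

        dst[i]     = w0; dst[i + 1] = w1;
        dst[i + 2] = w2; dst[i + 3] = w3;
        dst[i + 4] = w4; dst[i + 5] = w5;
        dst[i + 6] = w6; dst[i + 7] = w7;
    }
    for (; i < n; i++)          /* leftover words, fewer than eight */
        dst[i] = src[i];
}
```

The compiler is free to reschedule these loads and stores, so treat the source order as a hint and confirm the result in the generated code or on a cycle-accurate simulator.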