Comments on Using the L2 as Memory
This section discusses some of the system design trade-offs associated with using the L2 cache as memory, and some of the questions that arise.
In normal operation the L2 cache is 512KB or 256KB and 4-way associative, and there is no on-chip memory. In general this is the best combination, since the goal of a cache is to automatically keep recently used information in fast on-chip memory and spill older data out to main memory. However, this behavior is statistical; there are no guarantees, because sometimes a full memory access will be needed. Turning a way of the cache into a memory bank gives some memory with a guaranteed (fast) access time (subject only to bus contention), but the performance of the cache is degraded, both because it is smaller and because its associativity is reduced. There is no easy way to tell what the performance impact will be for an embedded system without modeling the details (there are many studies of general computing systems, but they are less useful because most uses of the part will involve much more I/O buffer and descriptor manipulation than the studies typically include). The L2 cache is non-inclusive and non-exclusive (i.e. what is in the L2 and what is in the L1 need not be related), which mitigates the effect somewhat, but with a smaller L2, reduced performance is very likely. On the other hand, on a BCM1250 a 256KB 2-way associative cache (i.e. with half of the cache set aside as on-chip memory) may be sufficient for the control processor if the work is split such that the other CPU is running a real-time loop. The real-time processor can reduce its impact on the L2 cache by keeping its main code loop in its L1 Instruction cache and most of its data either in its L1 Data cache or fetched around the L2 from a large table in memory (using PREFetch for streaming or by mapping the page with the no-L2-allocate cache attribute).
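The fetch-around approach can be sketched in C. The fragment below is an illustration only, not code from this manual: it assumes GCC's __builtin_prefetch(), whose third argument of 0 requests a streaming prefetch (on MIPS cores this typically maps to a PREF with a streamed hint), so the table data passes through without being retained in the caches. The function name, table size, and stride are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_BYTES     32u                            /* cache line size on these parts */
    #define WORDS_PER_LINE (LINE_BYTES / sizeof(uint32_t))

    /* Stream through a large lookup table while asking the hardware not to
     * retain the fetched lines (rw = 0 for read, locality = 0 for streamed). */
    uint32_t sum_table(const uint32_t *table, size_t entries)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i < entries; i++) {
            /* Prefetch one line ahead, once per line, staying inside the table. */
            if ((i % WORDS_PER_LINE) == 0 && i + WORDS_PER_LINE < entries)
                __builtin_prefetch(&table[i + WORDS_PER_LINE], 0, 0);
            sum += table[i];
        }
        return sum;
    }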
When the L2 cache is used as memory there is a restriction that writes must always be full 32-byte cache lines. In practice this is not a problem. The space must be mapped cacheable by the CPUs (which is reasonable for data that needs a fast access time); this is not a burden for I/O devices doing DMA, since the coherence protocol will ensure the most up-to-date copy of the data is returned for a DMA read, and the latency is about the same regardless of whether the data comes from the L1 or the L2 cache. In cacheable space the CPU will only do full 32-byte block reads (to fill the L1 cache line) and full 32-byte block writes (when a line is evicted from the L1 cache), so the full-line write requirement is met. The I/O bridges (the point at which any DMA traffic enters the coherent domain) recognize the L2-as-memory addresses as coherent memory space, so on a read they will use the coherence protocol to get the latest data, and if a partial block write (i.e. less than 32 bytes or not cache-line aligned) is made, the I/O bridge will do a read (exclusive)-modify-write of the full line to coherently merge in the new data. Therefore the full-line write requirement is met in this case too. The Data Mover has the same restriction as the CPU: the space must be marked cacheable to ensure full-line writes.
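As a small illustration of laying data out for the L2-as-memory region, the sketch below aligns and pads an object to the 32-byte line size so that it never shares a line with a neighbor and every write-back covers whole lines. The structure and field names are illustrative only, and the alignment attribute assumes a GCC-style compiler.

    #include <stdint.h>

    #define L2_LINE_BYTES 32

    /* An object intended to live in the L2-as-memory region: padded and
     * aligned to the line size so each instance occupies exactly one line. */
    struct fast_state {
        uint32_t head;
        uint32_t tail;
        uint32_t credits;
        uint32_t flags;
        uint8_t  pad[L2_LINE_BYTES - 4 * sizeof(uint32_t)];
    } __attribute__((aligned(L2_LINE_BYTES)));

    /* Compile-time check that the object is an exact multiple of a line. */
    typedef char fast_state_size_check
        [(sizeof(struct fast_state) % L2_LINE_BYTES) == 0 ? 1 : -1];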
Another option that should be considered rather than using the L2 cache as memory is to make use of the L2
as a cache in which I/O devices can cache data. The ZBbus L2CA flag causes an L2 cache allocate on a miss.
This flag accompanies every bus command (see Section: “Address Phase” on page 22). The on-chip DMA engines can be configured to assert the flag to cache descriptors and/or packet headers. HyperTransport devices can assert the flag by setting the isochronous bit in a HyperTransport command or (with revision 3 of the interface, system revision ≥ PERIPH_REV3) by hitting in an address range defined by the IsocBAR and IsocMask registers. PCI devices can assert it through the mapping in the BAR0 translation registers. Using this rather than partitioning the cache allows the cache to behave as a cache and keep recently accessed data. This works well in situations where the CPUs are expected to be manipulating packet headers and descriptors shortly before (on transmit) or shortly after (on receive) the I/O device does. The drawback of this method is that occasionally the data cached by an I/O device will cause some CPU-cached data to be evicted and force the CPU to re-fetch it from memory (this is always true of the other CPU as well), and under overload, newly arrived packets will displace older ones in the cache, so as the processors try to catch up they will have to re-fetch from memory (this can easily be prevented by detecting overload and flipping the bit in the DMA controllers to disable the L2CA flag).
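The overload fallback can be sketched as follows. This is an illustration only: the register accessors, bit position, and threshold are hypothetical stand-ins, not the actual DMA controller register layout (see the DMA controller descriptions in this manual for the real configuration fields).

    #include <stdbool.h>
    #include <stdint.h>

    #define DMA_CFG_L2CA       (1u << 5)   /* hypothetical L2CA-enable bit      */
    #define OVERLOAD_THRESHOLD 128u        /* illustrative receive-backlog limit */

    /* Hypothetical accessors for a DMA channel's configuration register. */
    extern uint64_t dma_read_config(int channel);
    extern void     dma_write_config(int channel, uint64_t value);

    /* Called periodically with the current receive backlog: under overload,
     * stop asserting L2CA so newly arrived packets no longer displace older
     * ones in the L2; otherwise keep caching descriptors and packet headers. */
    void update_l2ca_policy(int channel, unsigned backlog)
    {
        uint64_t cfg = dma_read_config(channel);

        if (backlog > OVERLOAD_THRESHOLD)
            cfg &= ~(uint64_t)DMA_CFG_L2CA;
        else
            cfg |= DMA_CFG_L2CA;

        dma_write_config(channel, cfg);
    }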