Section 5: L2 Cache
Every bus request is accompanied by an L2 cacheability flag; if this flag is set and the block is not in the L2 cache, the block will be allocated. On a write miss the L2 allocates space for the block and accepts the data when it is transmitted. On a read miss the memory controller supplies the data to the requesting agent; the L2 allocates space for the block at the time the request is made and takes a copy of the data as it passes on the bus. The L2 cacheability flag only affects behavior on an L2 miss. Regardless of the flag, if a request that is L1 cacheable hits in the L2, the L2 supplies the data for a read and receives the data for a write.
The L2 cache behaves the same whatever the source of the request, so DMA masters will read data from the L2 if it is there. More interestingly, DMA writes can be done with the L2 cacheability bit (see A_L2CA in Section: “Address Phase” on page 22) set, causing the data to be written to the L2 cache (and on to memory only if it is evicted). This can be used to reduce the access latency for data the CPU will need to manipulate. However, this operation can potentially pollute the L2 cache, so it must be used with care. The internal DMA controllers have a control bit to enable this behavior; its use by external DMA masters is described in the host bridge section for the buses.
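As an illustration of the intended usage pattern, the sketch below enables the L2-allocate behavior only for the descriptor that carries a packet header, so the header lands in the L2 while the bulk payload goes straight to memory. The descriptor layout and the l2_allocate field are hypothetical placeholders, not the actual BCM1250 DMA descriptor format; the real control bit is described with the DMA controllers.

#include <stdint.h>

/* Hypothetical sketch: steer only packet headers into the L2 via DMA.
 * The structure and field names below are placeholders for illustration. */
struct dma_descriptor {
    uint64_t phys_addr;    /* physical address of the buffer              */
    uint32_t length;       /* transfer length in bytes                    */
    unsigned l2_allocate;  /* 1 = request L2 allocation (A_L2CA behavior) */
};

static void build_rx_descriptors(struct dma_descriptor *d,
                                 uint64_t buf, uint32_t len)
{
    /* Header descriptor: allocate in the L2 so the CPU reads it with
     * low latency shortly after the DMA completes.                   */
    d[0].phys_addr   = buf;
    d[0].length      = 64;        /* header bytes kept in the L2      */
    d[0].l2_allocate = 1;

    /* Payload descriptor: written straight to memory to avoid
     * polluting the L2 with data the CPU may never touch.            */
    d[1].phys_addr   = buf + 64;
    d[1].length      = len - 64;
    d[1].l2_allocate = 0;
}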
To give programs further control of the L2 cache, the CPU provides a mechanism to prevent the L2 cacheability flag from being set on accesses that are cacheable in the L1 cache. This is particularly useful for reducing L2 pollution with data that is likely to be referenced by only one of the CPUs and used for a short time. There are two ways this can be done. Pages can be marked in the TLB with one of the cacheable coherent no-L2 attributes (codes 0 and 1), which causes all blocks in the page to be accessed with the L2-allocate flag clear. Alternatively, the pages can be marked with the usual cacheable coherent attributes (codes 4 and 5) and the PREFetch instruction with a streaming hint can be used to access individual lines around the L2 cache. In the second case, because the block is marked for streaming when it is put into the L1, it is flagged as the first of the four lines at that index to be evicted (it will be evicted first regardless of the LRU information).
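The second technique can be expressed from C with a small amount of inline assembly. The sketch below assumes GNU-style inline assembly on the MIPS64 core and uses the architecturally defined PREF hints 4 (load streamed) and 5 (store streamed); the loop and buffer names are only illustrative.

#include <stdint.h>
#include <stddef.h>

/* Streaming prefetch hints defined by the MIPS architecture. */
#define PREF_LOAD_STREAMED  4
#define PREF_STORE_STREAMED 5

/* Prefetch one line with the "load streamed" hint: the line is brought
 * into the L1 marked for early eviction and is not allocated in the L2. */
static inline void prefetch_load_streamed(const void *p)
{
    __asm__ __volatile__("pref %0, 0(%1)"
                         : /* no outputs */
                         : "i"(PREF_LOAD_STREAMED), "r"(p)
                         : "memory");
}

/* Example: sum a buffer that will not be reused, prefetching ahead so the
 * transient data streams through the L1 without polluting the L2. */
uint64_t sum_once(const uint64_t *buf, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch per 32-byte line, a few lines ahead. */
        if ((i & 3) == 0 && i + 32 < n)
            prefetch_load_streamed(&buf[i + 32]);
        total += buf[i];
    }
    return total;
}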
When the L2 cache is accessed, address bits [16:5] ([15:5] on a part with or configured for a 256KB cache and [14:5] for a part with or configured for a 128KB cache) are used to form an index into the cache array, and the four ways are examined at that index. When designing for best performance, the potential for bad aliasing must be taken into consideration. For example, in an embedded system using 64K pages and a simple memory allocator, it is possible that memory is always allocated in 64K chunks starting at a page boundary. Thus (for a 512KB cache) data in the first cache block of allocated objects will be found at one of 8 locations in the L2 cache (the index will either be zero or have just bit [16] set, and there are 4 ways at each possible index). A programmer unaware of the allocation policy could see a large amount of unexpected cache thrashing, and thus get significantly lower performance than expected. A similar (though less dramatic) effect can be seen if network packet buffers are allocated 1KB long, aligned on a 1KB boundary, and the DMA engine is set to cache 64-byte headers in the L2. The packet headers will all have address bits [9:6] equal to zero, so they will use only 1/16th of the cache. If 64 bytes of padding are put between the buffers (or the buffer size is increased to 1088 bytes), then all of the cache will be used.
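The index calculation and the packet-buffer example can be checked with a few lines of arithmetic. The sketch below assumes a 512KB cache (index bits [16:5], i.e. 4096 sets of 4 ways with 32-byte lines); the addresses and names are only illustrative.

#include <stdint.h>
#include <stdio.h>

#define L2_LINE_SHIFT 5          /* 32-byte lines: offset bits [4:0]  */
#define L2_INDEX_BITS 12         /* 512KB part: index bits [16:5]     */
#define L2_SETS       (1u << L2_INDEX_BITS)

/* Set index used by a 512KB L2 for a given physical address. */
static unsigned l2_index(uint64_t paddr)
{
    return (unsigned)(paddr >> L2_LINE_SHIFT) & (L2_SETS - 1);
}

int main(void)
{
    /* 1KB-aligned packet buffers: the first line of every header maps to
     * an index with address bits [9:6] clear, so the headers can reach
     * only 1/16th of the sets. A 1088-byte stride spreads them out.     */
    for (int i = 0; i < 4; i++) {
        uint64_t plain  = 0x100000 + (uint64_t)i * 1024;   /* 1KB stride   */
        uint64_t padded = 0x100000 + (uint64_t)i * 1088;   /* 1088B stride */
        printf("buffer %d: 1KB stride index %4u, 1088B stride index %4u\n",
               i, l2_index(plain), l2_index(padded));
    }
    return 0;
}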
Statistics for the L2 cache are gathered by the debug/performance monitor unit. These are discussed in Section: “System Performance Counters” on page 61.