POWER7 and POWER7 Optimization and Tuning Guide
Optimizing for cache geometry
There are several ways to optimize for cache geometry, as described in this section.
Splitting structures into hot and cold elements
A technique for optimizing applications to take advantage of cache is to lay out data
structures so that fields that have a high rate of reference (that is, hot) are grouped, and fields
that have a relatively low rate of reference (that is, cold) are grouped. [19] The concept is to
place the hot elements into the same byte region of memory, so that when they are pulled into
the cache, they are co-located in the same cache line or lines. Additionally, because hot
elements are referenced often, they are likely to stay in the cache. Likewise, the cold
elements are in the same area of memory and end up in the same cache lines, so that
being written out to main storage and discarded causes less of a performance degradation,
because they have a much lower rate of access.

Power Systems use 128-byte cache lines. Compared to the 64-byte cache lines of Intel
processors, these larger lines have the advantage of increasing the reach possible with the
same size cache directory, and of improving cache efficiency by covering up to 128 bytes of
hot data in a single line. However, they also carry the implication of potentially bringing more
data into the cache than is needed for fine-grained accesses (that is, accesses of less than
64 bytes).
As described in Eliminate False Sharing, Stop your CPU power from invisibly going down the
drain, [20] it is also important to carefully assess the impact of this strategy, especially when it
is applied to systems with a high number of CPU cores, where a phenomenon referred to as
false sharing can occur. False sharing occurs when a cache line contains multiple data
elements that can otherwise be accessed independently. For example, if two different
hardware threads want to update (store) two different words in the same cache line, only
one of them at a time can gain exclusive access to the cache line to complete the store. This
situation results in:
- Cache line transfers between the processors where those threads run
- Stalls in other threads that are waiting for the cache line
- All but the most recent thread to update the line being left without a copy in their cache

This effect is compounded as the number of application threads that share the cache line
(that is, threads that are using different data in the cache line under contention) is scaled
upwards. [21]
The discussion about cache sharing [22] also presents techniques for
analyzing false sharing and suggestions for addressing the phenomenon.
L3 cache (capacity/associativity; bandwidth):
- POWER7: on-chip, 4 MB/core, 8-way; 16 B reads and 16 B writes per cycle
- POWER7+: on-chip, 10 MB/core, 8-way; 16 B reads and 16 B writes per cycle
[19] Splitting Data Objects to Increase Cache Utilization (Preliminary Version, 9 October 1998), available at:
http://www.ics.uci.edu/%7Efranz/Site/pubs-pdf/ICS-TR-98-34.pdf
[20] Eliminate False Sharing, Stop your CPU power from invisibly going down the drain, available at:
http://drdobbs.com/goparallel/article/showArticle.jhtml?articleID=217500206
[21] Ibid.
[22] Ibid.