POWER7 and POWER7 Optimization and Tuning Guide
Optimizing for cache geometry
There are several ways to optimize for cache geometry, as described in this section.
Splitting structures into hot and cold elements
A technique for optimizing applications to take advantage of cache is to lay out data
structures so that fields that have a high rate of reference (that is, hot) are grouped, and fields
that have a relatively low rate of reference (that is, cold) are grouped. [19] The concept is to
place the hot elements into the same byte region of memory, so that when they are pulled into
the cache, they are co-located in the same cache line or lines. Additionally, because hot
elements are referenced often, they are likely to stay in the cache. Likewise, the cold
elements are in the same area of memory and end up in the same cache lines, so that
being written out to main storage and discarded causes less of a performance degradation,
because they have a much lower rate of access.

Power Systems use 128-byte cache lines. Compared to the 64-byte cache lines of Intel
processors, these larger lines have the advantage of increasing the reach possible with the
same size cache directory, and of improving cache efficiency by covering up to 128 bytes of
hot data in a single line. However, they also carry the implication of potentially bringing more
data into the cache than is needed for fine-grained accesses (that is, accesses of less than
64 bytes).
As described in Eliminate False Sharing, Stop your CPU power from invisibly going down the
drain, [20] it is also important to carefully assess the impact of this strategy, especially when it
is applied to systems with a high number of CPU cores, where a phenomenon referred to as
false sharing can occur. False sharing occurs when a cache line contains multiple data
elements that can otherwise be accessed independently. For example, if two different
hardware threads want to update (store) two different words in the same cache line, only
one of them at a time can gain exclusive access to the cache line to complete the store. This
situation results in:
- Cache line transfers between the processors where those threads run
- Stalls in other threads that are waiting for the cache line
- All but the most recent thread to update the line being left without a copy in their cache

This effect is compounded as the number of application threads that share the cache line
(that is, threads that are using different data in the cache line under contention) is scaled
upwards. [21]
The discussion about cache sharing [22] also presents techniques for
analyzing false sharing and suggestions for addressing the phenomenon.
L3 cache (capacity/associativity; bandwidth):
- POWER7: on-chip, 4 MB/core, 8-way; 16 B reads and 16 B writes per cycle
- POWER7+: on-chip, 10 MB/core, 8-way; 16 B reads and 16 B writes per cycle
[19] Splitting Data Objects to Increase Cache Utilization (Preliminary Version, 9 October 1998), available at:
http://www.ics.uci.edu/%7Efranz/Site/pubs-pdf/ICS-TR-98-34.pdf
[20] Eliminate False Sharing, Stop your CPU power from invisibly going down the drain, available at:
http://drdobbs.com/goparallel/article/showArticle.jhtml?articleID=217500206
[21] Ibid.
[22] Ibid.