
Chapter 4. Continuous availability and manageability
Draft Document for Review September 2, 2008 5:05 pm
Memory protection
Memory and cache arrays comprise data bit lines that feed into a memory word. A memory
word is addressed by the system as a single element. Depending on the size and
addressability of the memory element, each data bit line may include thousands or even millions of
individual bits (memory cells). For example:
- A single memory module on a Dual Inline Memory Module (DIMM) can have a capacity of
  1 Gb, and supply eight bit lines of data for an ECC word. In this case, each bit line in the
  ECC word holds 128 Mb behind it, corresponding to more than 128 million memory cell
  addresses.
- A 32 KB L1 cache with a 16-byte memory word, on the other hand, would have only 2 Kb
  behind each memory bit line.
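
The arithmetic behind these two examples can be checked with a short sketch. The helper name bits_per_bit_line is hypothetical and used only for illustration; the sketch assumes binary prefixes (1 Gb = 2^30 bits, 1 KB = 2^10 bytes) and an even division of capacity across the bit lines.

```python
def bits_per_bit_line(capacity_bits, bit_lines):
    """Return the number of memory cells stacked behind one data bit line."""
    if capacity_bits % bit_lines:
        raise ValueError("capacity must divide evenly across the bit lines")
    return capacity_bits // bit_lines

GIBI = 2**30  # binary giga (assumption: capacities use binary prefixes)
KIBI = 2**10  # binary kilo

# 1 Gb memory module supplying eight bit lines of an ECC word
dram_cells = bits_per_bit_line(1 * GIBI, 8)

# 32 KB L1 cache with a 16-byte (128-bit) memory word
l1_cells = bits_per_bit_line(32 * KIBI * 8, 16 * 8)

print(dram_cells)  # 134217728, that is 128 Mb
print(l1_cells)    # 2048, that is 2 Kb
```

Under these assumptions, each bit line of the memory module has 65,536 times as many cells behind it as a bit line of the L1 cache, which is why the two arrays call for different protection methods.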
A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main store. Therefore,
a variety of different protection methods are used in POWER6 processor-based systems to
avoid uncorrectable errors in memory.
Memory protection plans must take into account many factors, including:
- Size
- Desired performance
- Memory array manufacturing characteristics
POWER6 processor-based systems have a number of protection schemes designed to
prevent, protect against, or limit the effect of errors in main memory. These capabilities include:
Hardware scrubbing
Hardware scrubbing is a method used to deal with soft errors. IBM
POWER6 processor-based systems periodically address all
memory locations, and any location with a correctable ECC error is
rewritten with the correct data.
Error correcting code
Error correcting code (ECC) allows a system to detect up to two
errors in a memory word and correct one of them. However, without
additional correction techniques, if more than one bit is corrupted, the
system will fail.
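
The correct-one, detect-two behavior can be illustrated with a textbook extended Hamming (8,4) code. This is only a minimal sketch: the ECC words in POWER6 processor-based systems are wider, the code they use is not described here, and the function names encode and decode are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) word."""
    d = [(nibble >> i) & 1 for i in range(4)]
    w = [0] * 8                      # w[0] = overall parity, w[1..7] = Hamming(7,4)
    w[3], w[5], w[6], w[7] = d       # data bits sit at the non-power-of-two positions
    w[1] = w[3] ^ w[5] ^ w[7]        # parity over positions 1, 3, 5, 7
    w[2] = w[3] ^ w[6] ^ w[7]        # parity over positions 2, 3, 6, 7
    w[4] = w[5] ^ w[6] ^ w[7]        # parity over positions 4, 5, 6, 7
    w[0] = sum(w[1:]) % 2            # overall parity makes double errors detectable
    return w

def decode(word):
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    odd_parity = sum(w) % 2          # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        w[syndrome] ^= 1             # single error; syndrome 0 means w[0] itself
        status = "corrected"
    else:
        return "uncorrectable", None  # two flips: detected but not correctable
    return status, w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)

word = encode(0b1011)
word[5] ^= 1                         # one flipped bit
print(decode(word))                  # ('corrected', 11)
word[2] ^= 1                         # a second flipped bit in the same word
print(decode(word))                  # ('uncorrectable', None)
```

The second call shows why a second error in the same word matters: the code still detects the fault, but it can no longer recover the data.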
Chipkill™
Chipkill is an enhancement to ECC that enables a system to
sustain the failure of an entire DRAM chip. Chipkill spreads the bit lines
from a DRAM over multiple ECC words, so that a catastrophic
DRAM failure would affect at most one bit in each word. Barring a
future single-bit error, the system can continue indefinitely in this
state with no performance degradation until the failed DIMM can be
replaced.
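
The effect of spreading bit lines can be sketched by comparing two ways of wiring bit lines into ECC words. The geometry below (eight DRAM chips with four bit lines each, feeding four 8-bit words) and the layout functions are hypothetical and chosen only to keep the example small; they do not describe the actual POWER6 memory organization.

```python
from collections import Counter

CHIPS = 8          # hypothetical number of DRAM chips
PINS_PER_CHIP = 4  # hypothetical data bit lines per chip
WORDS = 4          # hypothetical number of ECC words fed by these chips
BITS_PER_WORD = CHIPS * PINS_PER_CHIP // WORDS

def naive_layout(chip, pin):
    """Pack consecutive bit lines into the same word: one chip fills 4 bits of one word."""
    line = chip * PINS_PER_CHIP + pin
    return line // BITS_PER_WORD, line % BITS_PER_WORD   # (word, bit position)

def chipkill_layout(chip, pin):
    """Send each bit line of a chip to a different word."""
    return pin, chip                                      # (word, bit position)

def bits_hit_per_word(layout, failed_chip):
    """Count how many bits of each ECC word are lost when one chip fails."""
    hits = Counter(layout(failed_chip, pin)[0] for pin in range(PINS_PER_CHIP))
    return dict(hits)

print(bits_hit_per_word(naive_layout, 3))     # {1: 4}
print(bits_hit_per_word(chipkill_layout, 3))  # {0: 1, 1: 1, 2: 1, 3: 1}
```

With the naive layout, a failed chip corrupts four bits of a single word, which exceeds what a single-error-correcting code can repair. With the interleaved layout, every word loses at most one bit, so each word remains correctable.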
Redundant bit steering
IBM systems use redundant bit steering to avoid situations where
multiple single-bit errors align to create a multi-bit error. If an
IBM POWER6 processor-based system detects an
abnormal number of errors on a bit line, it can dynamically steer the
data stored at this bit line onto one of a number of spare bit lines.
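
The steering decision can be modeled as a per-line error counter with a threshold and a pool of spare lines. The sketch below is a software illustration only: the class name BitSteering, the threshold value, and the line counts are hypothetical, and in a real system the steering is performed by the memory hardware and firmware rather than by application code.

```python
class BitSteering:
    """Toy model: map logical bit lines to physical lines, with spares held in reserve."""

    def __init__(self, data_lines, spare_lines, threshold):
        # Initially every logical line uses the physical line with the same number.
        self.mapping = {line: line for line in range(data_lines)}
        # Spare physical lines are numbered after the data lines.
        self.spares = list(range(data_lines, data_lines + spare_lines))
        self.threshold = threshold   # hypothetical "abnormal number of errors"
        self.errors = {}             # physical line -> corrected-error count

    def record_error(self, logical_line):
        """Count one corrected error; return True if the line was steered to a spare."""
        physical = self.mapping[logical_line]
        self.errors[physical] = self.errors.get(physical, 0) + 1
        if self.errors[physical] >= self.threshold and self.spares:
            self.mapping[logical_line] = self.spares.pop(0)
            return True
        return False

steering = BitSteering(data_lines=8, spare_lines=2, threshold=3)
for _ in range(3):
    steered = steering.record_error(5)
print(steered)              # True: the third error reached the threshold
print(steering.mapping[5])  # 8: logical line 5 now uses the first spare line
```

Once a line has been steered, errors on the failing physical line can no longer combine with a single-bit error elsewhere in the same word. When the spare pool is empty, this model simply keeps counting, and the system would then rely on ECC alone.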