IBM Power 595 Technical Overview and Introduction

c. Deallocating faulty core.

Upon completion of the Alternate Processor Recovery operation, the POWER
Hypervisor deallocates the faulty core for deferred repair.

Processor-contained checkstop: If a specific processor-detected fault cannot be recovered
by processor instruction retry and alternate processor recovery is not an option, the
POWER Hypervisor terminates (checkstops) the partition that was using the processor
core when the fault was identified. In general, this limits the outage to a single partition.
When all the previously mentioned mechanisms fail, in almost all cases (the POWER
Hypervisor itself being the exception) the termination is contained to the single partition
using the failing processor core.

Memory protection

Memory and cache arrays comprise data bit lines that feed into a memory word. A memory
word is addressed by the system as a single element. Depending on the size and
addressability of the memory element, each data bit line can include thousands of individual
bits or memory cells. For example:

A single memory module on a dual inline memory module (DIMM) can have a capacity of
1 Gb, and supply eight bit lines of data for an error correcting code (ECC) word. In this
case, each bit line in the ECC word holds 128 Mb behind it, corresponding to more than
128 million memory cell addresses.

Alternatively, a 32 KB L1 cache with a 16-byte memory word would have only 2 Kb
(2,048 bits) behind each memory bit line.
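
The two figures above follow from dividing the array capacity by the number of bit lines. The following is a minimal sketch of that arithmetic, assuming binary prefixes (1 Gb = 2^30 bits, 1 KB = 1,024 bytes) and an even split of cells across bit lines. The helper name cells_per_bit_line is illustrative and does not come from IBM documentation.

```python
def cells_per_bit_line(capacity_bits: int, bit_lines: int) -> int:
    """Return how many memory cells sit behind each bit line.

    Assumes the cells are divided evenly among the bit lines.
    """
    if bit_lines <= 0 or capacity_bits % bit_lines:
        raise ValueError("capacity must divide evenly across the bit lines")
    return capacity_bits // bit_lines


GIBIBIT = 2**30            # 1 Gb, binary prefix (assumption)
KIBIBYTE_BITS = 1024 * 8   # bits in 1 KB

# 1 Gb memory module supplying eight bit lines to an ECC word.
dram_cells = cells_per_bit_line(1 * GIBIBIT, 8)

# 32 KB L1 cache with a 16-byte (128-bit) memory word.
l1_cells = cells_per_bit_line(32 * KIBIBYTE_BITS, 16 * 8)

print(dram_cells)  # 134217728 cells, that is, 128 Mb per bit line
print(l1_cells)    # 2048 cells, that is, 256 bytes per bit line
```

The ratio between the two results (65,536 to 1) is why a protection scheme sized for the L1 cache does not carry over to main storage.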

A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main storage.
Therefore, a variety of different protection methods are used in POWER6 processor-based
systems to avoid uncorrectable errors in memory.

Memory protection plans must take into account many factors, including:

Size

Desired performance

Memory array manufacturing characteristics

POWER6 processor-based systems have a number of protection schemes to prevent,
protect, or limit the effect of errors in main memory:

Hardware scrubbing

This method deals with soft errors. IBM POWER6 processor-based
systems periodically address all memory locations; any memory
locations with an ECC error are rewritten with the correct data.

Error correcting code

Error correcting code (ECC) allows a system to detect up to two
errors in a memory word and correct one of them. However, without
additional correction techniques, if more than one bit is corrupted, a
system can fail.
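
The "correct one, detect two" behavior can be illustrated with a textbook extended Hamming (8,4) code. This is only a sketch: the text does not state which code POWER6 memory uses, real ECC words are much wider, and the function names encode and decode are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 use the classic
    Hamming(7,4) layout with check bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall_bad = sum(code) % 2      # 1 when an odd number of bits flipped
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Single-bit error: the syndrome is the failing position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with good overall parity: two bits flipped.
        return "uncorrectable", None
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, data


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1

print(decode(word))       # ('ok', 11)
print(decode(one_flip))   # ('corrected', 11)
print(decode(two_flips))  # ('uncorrectable', None)
```

The double-bit case is detected but cannot be repaired, which is the gap that the additional techniques described in this section are meant to cover.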

Chipkill

This is an enhancement to ECC that enables a system to sustain
the failure of an entire DRAM. Chipkill spreads the bit lines from a
DRAM over multiple ECC words, so that a catastrophic DRAM
failure would affect at most one bit in each word. Barring a future
single-bit error, the system can continue indefinitely in this state
with no performance degradation until the failed DIMM can be
replaced.
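
The spreading idea can be sketched by comparing two ways of wiring DRAM bit lines to ECC words. The geometry below (18 DRAMs with 4 bit lines each) and the function names are assumptions chosen for illustration; they do not describe the actual Power 595 memory layout.

```python
from collections import Counter
from typing import Dict, Tuple

Mapping = Dict[Tuple[int, int], Tuple[int, int]]  # (dram, line) -> (word, bit)


def packed_layout(num_drams: int, lines_per_dram: int) -> Mapping:
    """All bit lines of every DRAM feed the same ECC word (no spreading)."""
    return {
        (d, l): (0, d * lines_per_dram + l)
        for d in range(num_drams)
        for l in range(lines_per_dram)
    }


def spread_layout(num_drams: int, lines_per_dram: int) -> Mapping:
    """Bit line i of every DRAM feeds ECC word i (Chipkill-style spreading)."""
    return {
        (d, l): (l, d)
        for d in range(num_drams)
        for l in range(lines_per_dram)
    }


def bits_hit_per_word(mapping: Mapping, failed_dram: int) -> Dict[int, int]:
    """Count how many bits of each ECC word are lost when one DRAM fails."""
    hits: Counter = Counter()
    for (dram, _line), (word, _bit) in mapping.items():
        if dram == failed_dram:
            hits[word] += 1
    return dict(hits)


packed_hits = bits_hit_per_word(packed_layout(18, 4), failed_dram=3)
spread_hits = bits_hit_per_word(spread_layout(18, 4), failed_dram=3)

print(packed_hits)  # {0: 4}: four bad bits in one word, beyond single-bit correction
print(spread_hits)  # {0: 1, 1: 1, 2: 1, 3: 1}: one bad bit per word, each correctable
```

With the spread layout, every affected word sees a single-bit error, which ordinary single-error-correcting ECC can repair on each access.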

Redundant bit steering

This helps to avoid a situation in which multiple single-bit errors
align to create a multi-bit error. In the event that an IBM POWER6
processor-based system detects an abnormal number of errors on a