IBM Power 720 and 740 Technical Overview and Introduction
POWER7 processor-based systems have several protection schemes designed to prevent,
protect, or limit the effect of errors in main memory:
Chipkill
Chipkill is an enhancement that enables a system to sustain the failure of an entire
DRAM chip. An ECC word uses 18 DRAM chips from two DIMM pairs, and a failure on any
of the DRAM chips can be fully recovered by the ECC algorithm. The system can continue
indefinitely in this state with no performance degradation until the failed DIMM can
be replaced.
72-byte ECC
In POWER7, an ECC word consists of 72 bytes of data. Of these, 64 bytes are used to
hold application data. The remaining eight bytes are used to hold check bits and additional
information about the ECC word.
This innovative ECC algorithm from IBM research works on DIMM pairs on a rank basis.
(A rank is a group of nine DRAM chips.) With this ECC code, the system can dynamically
recover from an entire DRAM failure (by Chipkill) but can also correct an error even if
another symbol (a byte, accessed by a 2-bit line pair) experiences a fault (an improvement
from the double error detection or single error correction ECC implementation found on
the POWER6 processor-based systems).
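The byte budget of the 72-byte ECC word described above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the figures stated in the text; the constant and function names are illustrative and do not come from IBM documentation.

```python
# Byte budget of one ECC word, using the figures stated in the text.
ECC_WORD_BYTES = 72                          # total size of one ECC word
DATA_BYTES = 64                              # bytes that hold application data
CHECK_BYTES = ECC_WORD_BYTES - DATA_BYTES    # check bits and additional information


def ecc_overhead_fraction(word_bytes: int = ECC_WORD_BYTES,
                          data_bytes: int = DATA_BYTES) -> float:
    """Return the fraction of an ECC word that does not carry application data."""
    return (word_bytes - data_bytes) / word_bytes


if __name__ == "__main__":
    print(f"check bytes per ECC word: {CHECK_BYTES}")
    print(f"overhead: {ecc_overhead_fraction():.1%}")
```

With these figures, one ninth of each ECC word is spent on check bits and related information.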
Hardware scrubbing
Hardware scrubbing is a method used to deal with intermittent errors. IBM POWER
processor-based systems periodically address all memory locations. Any memory
locations with a correctable error are rewritten with the correct data.
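The scrubbing loop can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it stands in triple redundancy with a majority vote for the real ECC code (which the text does not specify at this level), and the names `encode`, `correct`, and `scrub` are hypothetical.

```python
from typing import List, Tuple

# Three copies of one byte: a toy stand-in for a real ECC word (assumption).
Word = Tuple[int, int, int]


def encode(value: int) -> Word:
    """Store a byte as three identical copies."""
    return (value, value, value)


def correct(word: Word) -> int:
    """Bitwise majority vote; recovers the byte if at most one copy is corrupted."""
    a, b, c = word
    return (a & b) | (a & c) | (b & c)


def scrub(memory: List[Word]) -> int:
    """Visit every location and rewrite any word whose copies disagree.

    Returns the number of locations that were rewritten.
    """
    rewritten = 0
    for addr, word in enumerate(memory):
        value = correct(word)
        if word != encode(value):
            memory[addr] = encode(value)  # rewrite with the corrected data
            rewritten += 1
    return rewritten


if __name__ == "__main__":
    memory = [encode(v) for v in (0x12, 0x34, 0x56)]
    memory[1] = (0x34, 0x35, 0x34)  # inject a single-bit fault in one copy
    print(scrub(memory), memory[1])
```

Running the scrubber periodically keeps a correctable error from lingering long enough to combine with a second fault in the same word.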
Cyclic redundancy check (CRC)
The bus that is transferring data between the processor and the memory uses CRC error
detection with a failed operation-retry mechanism and the ability to dynamically retune the
bus parameters when a fault occurs. In addition, the memory bus has spare capacity to
substitute a data bit-line whenever it is determined to be faulty.
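The detect-and-retry part of this mechanism can be sketched in software. This is only an illustration under stated assumptions: it uses CRC-32 from Python's `zlib` module rather than whatever polynomial the memory bus uses, the `channel` callable and the `max_retries` parameter are hypothetical, and bus retuning and bit-line sparing are not modeled.

```python
import zlib
from typing import Callable


def send_with_crc(payload: bytes) -> bytes:
    """Append the CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive_with_crc(frame: bytes) -> bytes:
    """Return the payload if the trailer matches, otherwise raise ValueError."""
    payload, trailer = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("CRC mismatch")
    return payload


def transfer(payload: bytes,
             channel: Callable[[bytes], bytes],
             max_retries: int = 3) -> bytes:
    """Send the payload over `channel`, retrying when the CRC check fails."""
    for _ in range(max_retries + 1):
        try:
            return receive_with_crc(channel(send_with_crc(payload)))
        except ValueError:
            continue  # failed operation: retry the transfer
    raise RuntimeError("transfer failed after retries")
```

A channel that corrupts only the first transmission would succeed on the second attempt; a channel that always corrupts the frame exhausts the retries and raises `RuntimeError`.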
POWER7 memory subsystem
The POWER7 processor chip contains two memory controllers with four channels per
memory controller. Each channel connects to a single DIMM, but as the channels work in
pairs, a processor chip can address four DIMM pairs, two pairs per memory controller.
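The channel and DIMM-pair counts in the preceding paragraph follow from simple arithmetic. The sketch below restates it, using only the numbers given for the processor chip; the constant and function names are illustrative.

```python
# Topology figures for one processor chip, as stated in the text.
MEMORY_CONTROLLERS_PER_CHIP = 2
CHANNELS_PER_CONTROLLER = 4
DIMMS_PER_CHANNEL = 1
CHANNELS_PER_PAIR = 2  # channels work in pairs


def dimm_pairs_per_controller() -> int:
    """DIMM pairs served by one memory controller."""
    return CHANNELS_PER_CONTROLLER * DIMMS_PER_CHANNEL // CHANNELS_PER_PAIR


def dimm_pairs_per_chip() -> int:
    """DIMM pairs addressable by one processor chip."""
    return MEMORY_CONTROLLERS_PER_CHIP * dimm_pairs_per_controller()


if __name__ == "__main__":
    print(dimm_pairs_per_controller(), dimm_pairs_per_chip())
```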
Advanced memory buffer chips are exclusive to IBM and help to increase performance, acting
as read/write buffers. The Power 720 and the Power 740 use one memory controller.
Advanced memory buffer chips are on the memory cards and support four DIMMs each.
Memory page deallocation
Although coincident cell errors in separate memory chips are statistically rare, IBM POWER7
processor-based systems can contain these errors using a memory page deallocation
scheme for partitions running IBM AIX and IBM i operating systems, and also for memory
pages owned by the POWER Hypervisor. If a memory address experiences an uncorrectable
or repeated correctable single cell error, the service processor sends the memory page
address to the POWER Hypervisor to be marked for deallocation.
Pages used by the POWER Hypervisor are deallocated as soon as the page is released.
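The deallocation policy described above can be modeled as a small state tracker. This is a hedged sketch, not the service processor or POWER Hypervisor implementation: the `PageTracker` class, its method names, and the `CORRECTABLE_THRESHOLD` value are hypothetical (the text says only "repeated" and gives no count).

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical threshold: the text says "repeated" without giving a number.
CORRECTABLE_THRESHOLD = 3


@dataclass
class PageTracker:
    """Toy model of marking and deallocating faulty memory pages."""
    correctable_counts: Dict[int, int] = field(default_factory=dict)
    marked: Set[int] = field(default_factory=set)        # marked for deallocation
    deallocated: Set[int] = field(default_factory=set)   # removed from use

    def report_error(self, page: int, correctable: bool) -> None:
        """Record an error on a page and mark the page if the policy requires it."""
        if not correctable:
            self.marked.add(page)  # an uncorrectable error marks the page at once
            return
        count = self.correctable_counts.get(page, 0) + 1
        self.correctable_counts[page] = count
        if count >= CORRECTABLE_THRESHOLD:
            self.marked.add(page)  # repeated correctable errors also mark it

    def release(self, page: int) -> None:
        """Called when the owner releases a page; a marked page is deallocated."""
        if page in self.marked:
            self.marked.discard(page)
            self.deallocated.add(page)
```

In this model a marked page stays in use until its owner releases it, which mirrors the statement that hypervisor-owned pages are deallocated as soon as the page is released.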