OpenPower 720 Technical Overview and Introduction
The operating system cannot program or access the temperature threshold using the SP.
EPOW events can, for example, trigger the following actions:
Temperature monitoring, which increases the fan rotation speed when the ambient
temperature is above a preset operating range.
Temperature monitoring warns the system administrator of potential
environmental-related problems. It also performs an orderly system shutdown when the
operating temperature exceeds a critical level.
Voltage monitoring provides warning and an orderly system shutdown when the voltage is
out of the operational specification.
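The EPOW actions listed above can be sketched as a simple threshold policy. This is a minimal illustration only: the threshold values, the voltage tolerance, and the names `epow_actions` and `EpowAction` are hypothetical and are not taken from the OpenPower 720 firmware, where the thresholds are held by the SP and are not accessible to the operating system.

```python
from enum import Enum


class EpowAction(Enum):
    INCREASE_FAN_SPEED = "increase_fan_speed"
    WARN_ADMIN = "warn_admin"
    ORDERLY_SHUTDOWN = "orderly_shutdown"


# Hypothetical thresholds chosen for illustration; the real values are not
# documented here and are managed by the service processor.
FAN_BOOST_C = 30.0        # above the preset operating range
WARNING_C = 38.0          # warn the system administrator
CRITICAL_C = 45.0         # critical level, orderly shutdown
VOLTAGE_TOLERANCE = 0.05  # allowed relative deviation from nominal


def epow_actions(temp_c, voltage, nominal_voltage):
    """Return the list of actions triggered by one set of sensor readings."""
    actions = []
    if temp_c > FAN_BOOST_C:
        actions.append(EpowAction.INCREASE_FAN_SPEED)
    if temp_c > WARNING_C:
        actions.append(EpowAction.WARN_ADMIN)
    if temp_c > CRITICAL_C:
        actions.append(EpowAction.ORDERLY_SHUTDOWN)
    # Voltage outside the operational specification: warn, then shut down.
    if abs(voltage - nominal_voltage) / nominal_voltage > VOLTAGE_TOLERANCE:
        for action in (EpowAction.WARN_ADMIN, EpowAction.ORDERLY_SHUTDOWN):
            if action not in actions:
                actions.append(action)
    return actions
```

For example, `epow_actions(32.0, 12.0, 12.0)` yields only the fan speed increase, while a reading above the critical temperature yields all three actions.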
3.1.4 Self-healing
For a system to be self-healing, it must be able to recover from a failing component by first
detecting and isolating the failed component, taking it offline, fixing or isolating it, and
reintroducing the fixed or replacement component into service without any application
disruption. Examples include:
Bit steering to redundant memory in the event of a failed memory module to keep the
server operational.
Bit-scattering, thus allowing for error correction and continued operation in the presence
of a complete chip failure (Chipkill recovery).
Single-bit error correction using ECC without reaching error thresholds for main, L2, and
L3 cache memory.
L3 cache line deletes extended from 2 to 10 for additional self-healing.
ECC extended to inter-chip connections on fabric and processor bus.
Memory scrubbing to help prevent soft-error memory faults.
Dynamic processor deallocation: a deallocated processor can be replaced by an unused
Capacity on Demand processor to keep the system operational.
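The bit-scattering item above can be illustrated with a small model. This is a sketch under stated assumptions, not the actual memory layout of the OpenPower 720: each word is four bits wide, bit i of every word is stored on chip i, and a chip failure is modeled as every bit on that chip being inverted. The function names are hypothetical.

```python
def scatter(words):
    """Store bit i of every word on chip i (chips[i][w] is bit i of word w)."""
    n_bits = len(words[0])
    return [[word[i] for word in words] for i in range(n_bits)]


def gather(chips):
    """Reassemble the words from the per-chip bit columns."""
    n_words = len(chips[0])
    return [[chip[w] for chip in chips] for w in range(n_words)]


def fail_chip(chips, failed):
    """Return a copy in which every bit on one chip is inverted (worst case)."""
    return [
        [b ^ 1 for b in chip] if i == failed else list(chip)
        for i, chip in enumerate(chips)
    ]


def bad_bits_per_word(original, observed):
    """Count how many bits differ in each word."""
    return [sum(a != b for a, b in zip(o, r)) for o, r in zip(original, observed)]


words = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
damaged = gather(fail_chip(scatter(words), failed=2))
errors = bad_bits_per_word(words, damaged)
```

In this model a complete chip failure leaves exactly one bad bit in every word, which is the case that single-bit error correction can repair.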
Memory reliability, fault tolerance, and integrity
The OpenPower 720 uses Error Checking and Correcting (ECC) circuitry for system memory
to correct single-bit and to detect double-bit memory failures. Detection of double-bit memory
failures helps maintain data integrity. Furthermore, the memory chips are organized such that
the failure of any specific memory module only affects a single bit within a four-bit ECC word
(bit-scattering), thus allowing for error correction and continued operation in the presence of
a complete chip failure (Chipkill recovery). The memory DIMMs also use memory scrubbing
and thresholding to determine when spare memory modules within each bank of memory
should be used to replace ones that have exceeded their threshold of error count (dynamic
bit-steering). Memory scrubbing is the process of reading the contents of the memory during
idle time and checking and correcting any single-bit errors that have accumulated by passing
the data through the ECC logic. This function is a hardware function on the memory controller
chip and does not influence normal system memory performance.
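The behavior described above (correct single-bit errors, detect double-bit errors, scrub during idle time, and count errors against a threshold) can be sketched in software. The code below is an illustration only: it assumes an extended Hamming (8,4) code, which is a textbook single-error-correct, double-error-detect scheme and not necessarily the code used by the OpenPower 720 memory controller. The threshold value and the names `encode`, `decode`, and `scrub` are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    for pos in range(1, 8):                # overall parity in position 0
        code[0] ^= code[pos]
    return code


def decode(code):
    """Return (data_bits, status, repaired_codeword).

    status is "ok", "corrected" (single-bit error fixed), or
    "uncorrectable" (double-bit error detected, codeword left as read).
    """
    code = list(code)
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(code):
        overall ^= bit
        if bit and pos:
            syndrome ^= pos
    if overall == 1:
        # Odd number of flipped bits: assume one, located at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        status = "uncorrectable"
    else:
        status = "ok"
    return [code[3], code[5], code[6], code[7]], status, code


def scrub(memory, error_counts, threshold=2):
    """One scrubbing pass: repair single-bit errors in place and return the
    addresses whose corrected-error count has exceeded the threshold, as
    candidates for replacement by spare memory. Uncorrectable words are
    left untouched in this sketch."""
    over_threshold = []
    for addr, word in enumerate(memory):
        _, status, repaired = decode(word)
        if status == "corrected":
            memory[addr] = repaired
            error_counts[addr] = error_counts.get(addr, 0) + 1
            if error_counts[addr] > threshold:
                over_threshold.append(addr)
    return over_threshold
```

A caller would run `scrub` periodically over a list of codewords, keeping `error_counts` between passes so that repeated errors at one address accumulate toward the threshold. In the real system this function is performed by the memory controller hardware, not by software.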
3.1.5 N+1 redundancy
The use of redundant parts allows the OpenPower 720 to remain operational with full
resources:
Redundant spare memory bits in L1, L2, L3, and main memory
Redundant fans