Chapter 3. RAS and manageability
Redundant power supplies (optional)
3.1.6 Fault masking
If corrections and retries succeed and do not exceed threshold limits, the system remains
operational with full resources, and no client or IBM customer engineer intervention is
required. Fault masking is used in the following cases:
CEC bus retry and recovery
PCI-X bus recovery
ECC Chipkill soft error
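The retry-within-threshold idea can be illustrated with a minimal Python sketch. It is not the platform firmware: the names RecoverableError, MaskingResult, and run_with_fault_masking, and the values of max_retries and threshold, are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


class RecoverableError(Exception):
    """Hypothetical stand-in for a correctable hardware fault."""


@dataclass
class MaskingResult:
    succeeded: bool         # the operation eventually completed
    retries_used: int       # number of faults masked by retrying
    within_threshold: bool  # True if no further action is needed


def run_with_fault_masking(operation, max_retries=3, threshold=2):
    """Retry a failing operation and report whether the fault was masked.

    max_retries and threshold are illustrative values only.
    """
    retries = 0
    while True:
        try:
            operation()
            # Success: the fault is masked if retries stayed within the threshold.
            return MaskingResult(True, retries, retries <= threshold)
        except RecoverableError:
            if retries >= max_retries:
                # Retries exhausted: the fault could not be masked.
                return MaskingResult(False, retries, False)
            retries += 1
```

In this sketch, a result with succeeded and within_threshold both true corresponds to the case where the system keeps running with full resources and no intervention is needed.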
3.1.7 Resource deallocation
If recoverable errors exceed threshold limits, resources can be deallocated while the system
remains operational, allowing maintenance to be deferred to a convenient time.
Dynamic or persistent deallocation
Dynamic deallocation (available on SUSE LINUX Enterprise Server 9 for POWER) of
potentially failing components is non-disruptive, allowing the system to continue to run.
Persistent deallocation occurs when a failed component is detected, which is then
deactivated at a subsequent reboot.
Dynamic deallocation functions include:
Processor
L3 cache line delete
Partial L2 cache deallocation
PCI-X bus and slots
For dynamic processor deallocation, the service processor performs a predictive failure
analysis based on any recoverable processor errors that have been recorded. If these
transient errors exceed a defined threshold, the event is logged and the processor is
deallocated from the system while the operating system continues to run. This feature
(named CPU Guard) enables maintenance to be deferred until a suitable time. Processor
deallocation can only occur if there are sufficient functional processors (at least two).
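The threshold logic described for dynamic processor deallocation can be sketched in Python as follows. This is an illustration only, not the service processor's implementation: the ProcessorPool class, its method names, and the default error_threshold are hypothetical, and the sketch assumes that "at least two" means two functional processors must exist before one is removed.

```python
class ProcessorPool:
    """Toy model of threshold-based processor deallocation."""

    def __init__(self, cpu_ids, error_threshold=5, min_functional=2):
        self.error_counts = {cpu: 0 for cpu in cpu_ids}
        self.deallocated = set()
        self.error_threshold = error_threshold  # assumed value
        self.min_functional = min_functional    # "at least two" in the text
        self.event_log = []

    def functional(self):
        """Return the processors that are still in use."""
        return [c for c in self.error_counts if c not in self.deallocated]

    def record_recoverable_error(self, cpu):
        """Count one recoverable error; return True if cpu was deallocated."""
        if cpu in self.deallocated:
            return False
        self.error_counts[cpu] += 1
        if self.error_counts[cpu] <= self.error_threshold:
            return False  # still within the threshold, keep running
        # Threshold exceeded: deallocate only if enough processors are functional.
        can_deallocate = len(self.functional()) >= self.min_functional
        self.event_log.append((cpu, self.error_counts[cpu], can_deallocate))
        if can_deallocate:
            self.deallocated.add(cpu)
        return can_deallocate
```

For example, a pool of two processors with error_threshold=2 deallocates the first processor on its third recorded error, but refuses to deallocate the second one afterward because only one functional processor would remain to be counted.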
Cache or cache-line deallocation is aimed at performing dynamic reconfiguration to bypass
potentially failing components. This capability is provided for both L2 and L3 caches. Dynamic
run-time deconfiguration is provided if a threshold of L1 or L2 recovered errors is exceeded.
In the case of an L3 cache run-time array single-bit solid error, the spare chip resources are
used to perform a line delete on the failing line.
PCI hot-plug slot fault tracking helps prevent slot errors from causing a system machine
check interrupt and subsequent reboot. This provides superior fault isolation, and the error
affects only the single adapter. Run-time errors on the PCI bus caused by failing adapters will
result in recovery action. If this is unsuccessful, the PCI device will be gracefully shut down.
Parity errors on the PCI bus itself will result in bus retry, and if uncorrected, the bus and any
I/O adapters or devices on that bus will be deconfigured.
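The two escalation paths described above (adapter errors and bus parity errors) can be summarized in a short Python sketch. The Outcome enum, the function names, the callback parameters, and the max_retries default are all assumptions made for illustration; they do not correspond to any firmware or operating system interface.

```python
from enum import Enum


class Outcome(Enum):
    RECOVERED = "recovered"
    DEVICE_SHUT_DOWN = "device shut down"
    BUS_DECONFIGURED = "bus deconfigured"


def handle_adapter_error(recover):
    """Run-time error from a failing adapter: try recovery, else shut it down.

    recover is a hypothetical callback that returns True on success.
    """
    return Outcome.RECOVERED if recover() else Outcome.DEVICE_SHUT_DOWN


def handle_bus_parity_error(retry_bus, devices_on_bus, max_retries=1):
    """Parity error on the bus itself: retry, else deconfigure bus and devices.

    Returns the outcome and the list of devices that were deconfigured.
    """
    for _ in range(max_retries):
        if retry_bus():
            return Outcome.RECOVERED, []
    # Uncorrected: everything attached to the bus goes with it.
    return Outcome.BUS_DECONFIGURED, list(devices_on_bus)
```

The point of the sketch is the difference in scope: a failed adapter recovery affects a single device, while an uncorrected bus parity error takes the bus and every device on it out of the configuration.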
The OpenPower 720 supports PCI Extended Error Handling (EEH) if it is supported by the
PCI-X adapter. In the past, PCI bus parity errors caused a global machine check interrupt,
which eventually required a system reboot in order to continue. In the OpenPower system,