Chapter 4. Reliability, availability, and serviceability
Draft Document for Review October 14, 2014 10:19 am
4.3 Processor/Memory availability details
The more reliable a system or subsystem is, the more available it should be. Nevertheless,
considerable effort is made to design systems that can detect faults that do occur and take
steps to minimize or eliminate the outages that are associated with them. These design
capabilities extend availability beyond what can be obtained through the underlying reliability
of the hardware.
This design for availability begins with implementing an architecture for error detection and
fault isolation (ED/FI).
First-Failure Data Capture (FFDC) is the capability of IBM hardware and microcode to
continuously monitor hardware functions. Within the processor and memory subsystem,
detailed monitoring is done by circuits within the hardware components themselves. Fault
information is gathered into fault isolation registers (FIRs) and reported to the appropriate
components for handling.
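As a rough illustration of the first-failure idea, the following Python sketch models a fault isolation register that latches error bits and remembers which checker fired first. The class name, the checker names, and the methods are hypothetical and chosen only for illustration. They do not describe the actual POWER8 register layout or firmware interfaces.

```python
class FaultIsolationRegister:
    """Toy model of a FIR: sticky error bits plus a record of the first failure."""

    def __init__(self, bit_names):
        self.bit_names = list(bit_names)   # hypothetical checker names
        self.bits = 0                      # sticky error bits, one per checker
        self.first_failure = None          # name of the first checker that fired

    def report(self, name):
        """Called by a (simulated) hardware checker when it detects a fault."""
        index = self.bit_names.index(name)
        if self.bits == 0:
            # Capture the first failure when it occurs, so that later
            # secondary errors cannot obscure the original cause.
            self.first_failure = name
        self.bits |= 1 << index

    def active(self):
        """Return the names of all checkers that have fired."""
        return [n for i, n in enumerate(self.bit_names) if (self.bits >> i) & 1]


fir = FaultIsolationRegister(["l2_cache_ce", "mem_bus_crc", "core_parity"])
fir.report("mem_bus_crc")
fir.report("core_parity")   # secondary error; the first failure is unchanged
print(fir.first_failure, fir.active())
```

The point of the sketch is that the error bits are sticky and the first failure is recorded at the time of the fault, so the fault does not have to be re-created to be diagnosed.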
Processor and memory errors that are recoverable in nature are typically reported to the
dedicated service processor built into each system. The dedicated service processor then
works with the hardware to determine the course of action to be taken for each fault.
4.3.1 Correctable error introduction
Intermittent or soft errors are typically tolerated within the hardware design by using error
correction code (ECC) or advanced techniques that retry operations after a fault.
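To make the idea of an error correction code concrete, the following Python sketch implements the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word. It is an assumption-laden teaching example: the codes used in POWER8 hardware are wider and stronger, and the function names here are illustrative only.

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7


def decode(c):
    """Correct up to one flipped bit; return (data_bits, error_position)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3    # syndrome; 0 means no error detected
    if pos:
        c[pos - 1] ^= 1           # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], pos


word = encode([1, 0, 1, 1])
word[4] ^= 1                      # simulate a soft error in position 5
data, pos = decode(word)
print(data, pos)                  # [1, 0, 1, 1] 5
```

The syndrome computed from the parity checks directly gives the position of the flipped bit, which is why a single-bit error can be corrected without retrying the operation.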
Tolerating a correctable solid fault runs the risk that the fault aligns with a soft error and
causes an uncorrectable error situation. There is also the risk that a correctable error is
predictive of a fault that continues to worsen over time, resulting in an uncorrectable error
condition.
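The alignment risk can be shown with a small single-error-correct, double-error-detect (SEC-DED) code. The Python sketch below uses an extended Hamming(8,4) code and a hypothetical stuck-at-1 cell to stand in for a solid fault. The bit positions, the names, and the code itself are assumptions made for illustration, not a description of POWER8 memory ECC.

```python
def encode(data):
    """Extended Hamming(8,4): 7-bit Hamming codeword plus an overall parity bit."""
    d1, d2, d3, d4 = data
    p1, p2, p3 = d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]
    return word + [sum(word) % 2]


def check(word):
    """Classify a received word as 'ok', 'corrected', or 'uncorrectable'."""
    c = word[:7]
    syndrome = ((c[0] ^ c[2] ^ c[4] ^ c[6])
                + 2 * (c[1] ^ c[2] ^ c[5] ^ c[6])
                + 4 * (c[3] ^ c[4] ^ c[5] ^ c[6]))
    parity_error = sum(word) % 2       # 1 if the overall parity is violated
    if syndrome == 0 and parity_error == 0:
        return "ok"
    if parity_error == 1:
        return "corrected"             # exactly one bit in error
    return "uncorrectable"             # two bits in error


STUCK_BIT = 2   # hypothetical solid fault: this cell always reads 1


def read(stored, soft_error_bit=None):
    """Read a stored word through the stuck cell, with an optional soft error."""
    word = list(stored)
    if soft_error_bit is not None:
        word[soft_error_bit] ^= 1      # transient bit flip
    word[STUCK_BIT] = 1                # the solid fault is present on every read
    return check(word)


stored = encode([0, 1, 1, 0])
print(read(stored))                    # corrected
print(read(stored, soft_error_bit=5))  # uncorrectable
```

On its own, the solid fault is corrected on every read. When a soft error lands in the same word, two bits are wrong at once and the code can only detect the problem, not correct it.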
To avoid these risks, you can predictively deallocate a component so that its correctable
errors cannot align with soft errors or other hardware faults and cause uncorrectable errors.
However, unconfiguring components, such as processor cores or entire caches in memory,
can reduce the performance or capacity of a system. Unconfiguring a component in turn
typically requires that the failing hardware be replaced in the system. The resulting service
action can also temporarily impact system availability.
To avoid such situations, solid faults in POWER8 processors or memory might be
candidates for correction by using the “self-healing” features built into the hardware, such as
a spare DRAM module within a memory DIMM, a spare data lane on a processor or memory
bus, or spare capacity within a cache module.
When such self-healing is successful, the need to replace any hardware for a solid
correctable fault is avoided. The ability to predictively unconfigure a processor core is still
available for faults that cannot be repaired by self-healing techniques, or when the sparing
or self-healing capacity is exhausted.
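The order of preference described above (self-heal first, unconfigure only when no spare capacity remains) can be sketched as a simple policy function. This is a minimal Python illustration under stated assumptions: the Component type, its fields, and the action names are hypothetical, and the real decision is made by the service processor and firmware by using criteria that are not described here.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    spares_left: int          # remaining spare lanes, DRAM modules, and so on
    configured: bool = True


def handle_solid_correctable_fault(component):
    """Pick an action for a solid correctable fault (illustrative policy only)."""
    if component.spares_left > 0:
        component.spares_left -= 1
        return "self-heal"               # substitute a spare; no replacement needed
    component.configured = False
    return "predictive-deallocate"       # spares exhausted; hardware needs service


bus = Component("memory_bus", spares_left=1)
first = handle_solid_correctable_fault(bus)
second = handle_solid_correctable_fault(bus)
print(first, second, bus.configured)     # self-heal predictive-deallocate False
```

The first fault consumes the only spare and leaves the component in service. The second fault finds no spare capacity, so the component is unconfigured and a service action becomes necessary.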