Draft Document for Review October 14, 2014 10:19 am
IBM Power Systems E870 and E880 Technical Overview and Introduction
4.3.2 Uncorrectable error introduction
An uncorrectable error can be defined as a fault that can cause incorrect instruction execution
within logic functions, or an uncorrectable error in data that is stored in caches, registers, or
other data structures. In less sophisticated designs, a detected uncorrectable error nearly
always results in the termination of an entire system. More advanced system designs might,
in some cases, be able to terminate just the application that was using the hardware that
failed. Such designs might require that uncorrectable errors be detected by the hardware and
reported to software layers, which must then be responsible for determining how to
minimize the impact of faults.
The advanced RAS features that are built into POWER8 processor-based systems handle
certain “uncorrectable” errors in ways that minimize the impact of the faults, even keeping an
entire system up and running after experiencing such a failure.
Depending on the fault, such recovery may use the virtualization capabilities of PowerVM in
such a way that the operating system and any applications that are running in the system are
neither impacted nor required to participate in the recovery.
4.3.3 Processor Core/Cache correctable error handling
Level 2 (L2) and Level 3 (L3) caches and directories can correct single-bit errors and detect
double-bit errors (SEC/DED ECC). Soft errors that are detected in the Level 1 caches are also
correctable by a try again operation that is handled by the hardware. Internal and external
processor “fabric” buses have SEC/DED ECC protection as well.
SEC/DED capabilities are also included in other data arrays that are not directly visible to
customers.
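
To make the SEC/DED behavior concrete, the following minimal sketch uses an extended Hamming (8,4) code in Python. This is an assumption chosen for brevity: the text does not say which ECC code the POWER8 arrays use, and real hardware protects much wider words. The names `encode` and `decode` are illustrative only.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 form Hamming(7,4).
    """
    code = [0] * 8
    # Data bits occupy the non-power-of-two positions.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each check bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Overall parity makes the total number of 1 bits even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data): 'ok', 'corrected', or 'uncorrectable'."""
    # XOR of the positions of all set bits is zero for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_bad = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Odd number of flips: assume one, located at `syndrome`
        # (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: two bits flipped.
        return "uncorrectable", None
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data
```

In this sketch, flipping any one bit of a codeword yields a `corrected` status with the original data, and flipping any two bits yields `uncorrectable`, which is the point at which the uncorrectable error handling described in 4.3.2 would take over.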
Beyond soft error correction, the intent of the POWER8 design is to manage a solid
correctable error in an L2 or L3 cache by using techniques to delete a cache line with a
persistent issue, or to repair a column of an L3 cache dynamically by using spare capability.
Information about column and row repair operations is stored persistently for processors, so
that more permanent repairs can be made during processor reinitialization (during system
reboot, or individual Core Power On Reset using the Power On Reset Engine).
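
The policy described above can be sketched as a small Python model. Everything in it is hypothetical: the class name, the threshold of three errors, and the in-memory list that stands in for persistent storage are assumptions made for illustration, since the text does not describe how the hardware or firmware decides that a correctable error is solid.

```python
from collections import Counter
from typing import List, Set


class CacheRepairTracker:
    """Toy policy: delete a cache line once its correctable errors look solid."""

    def __init__(self, solid_threshold: int = 3) -> None:
        self.solid_threshold = solid_threshold   # hypothetical value
        self.ce_counts: Counter = Counter()      # correctable errors seen per line
        self.deleted: Set[int] = set()           # lines taken out of use at run time
        self.repair_log: List[int] = []          # stands in for persistent storage

    def report_correctable_error(self, line: int) -> str:
        """Record one corrected error on `line` and return the action taken."""
        if line in self.deleted:
            return "ignored"                     # line is no longer in use
        self.ce_counts[line] += 1
        if self.ce_counts[line] < self.solid_threshold:
            return "corrected"                   # treated as a soft error
        self.deleted.add(line)                   # persistent issue: stop using the line
        self.repair_log.append(line)             # remember it for reinitialization
        return "line-deleted"

    def repairs_for_reinit(self) -> List[int]:
        """Return the logged lines that a later reinitialization could repair."""
        return sorted(self.repair_log)
```

The design choice worth noting is the separation between the run-time action (taking the line out of use immediately) and the logged record, which mirrors the distinction in the text between dynamic handling and the more permanent repair that is made at reinitialization.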
4.3.4 Processor Instruction Retry and other try again techniques
Within the processor core, soft error events might occur that interfere with the various
computation units. When such an event can be detected before a failing instruction is
completed, the processor hardware might be able to try the operation again by using the
advanced RAS feature that is known as Processor Instruction Retry.
Processor Instruction Retry allows the system to recover from soft faults that would otherwise
result in outages of applications or the entire server.
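
As a rough software analogy, the idea of trying an operation again when a fault is caught before completion can be sketched as follows. The checkpoint-and-restore mechanism, the `SoftFault` exception, the retry limit, and the helper `make_flaky_add` are all assumptions of this sketch; the text does not describe how the processor hardware implements the feature.

```python
import copy
from typing import Callable, Dict


class SoftFault(Exception):
    """Raised by a (hypothetical) checker when a fault is detected."""


def execute_with_retry(state: Dict[str, int],
                       instruction: Callable[[Dict[str, int]], None],
                       max_retries: int = 2) -> Dict[str, int]:
    """Run `instruction` on a copy of `state`, retrying from the checkpoint."""
    checkpoint = copy.deepcopy(state)            # last known-good state
    for _ in range(max_retries + 1):
        working = copy.deepcopy(checkpoint)      # discard any partial results
        try:
            instruction(working)
            return working                       # completed without a detected fault
        except SoftFault:
            continue                             # fault caught before completion
    raise RuntimeError("fault persisted after retries; treat as uncorrectable")


def make_flaky_add(fail_times: int) -> Callable[[Dict[str, int]], None]:
    """Build an add operation that reports a fault on its first `fail_times` runs."""
    remaining = [fail_times]

    def add(regs: Dict[str, int]) -> None:
        regs["r1"] = regs["r1"] + regs["r2"]     # result is written, then checked
        if remaining[0] > 0:
            remaining[0] -= 1
            raise SoftFault("transient upset detected")

    return add
```

With `make_flaky_add(1)` the first attempt is discarded and the second succeeds, so the caller never sees the fault. If the fault persists past the retry limit, the sketch raises an error, which corresponds to the case where a fault can no longer be treated as soft.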
Try again techniques are used in other parts of the system as well. Faults that are detected on
the memory bus that connects processor memory controllers to DIMMs can be tried again. In
POWER8 systems, the memory controller is designed with a replay buffer that allows memory
transactions to be tried again after certain faults internal to the memory controller are
detected. This complements the try again abilities of the memory buffer module.
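
The replay buffer concept can be illustrated with a short Python model. It is only a sketch under stated assumptions: the class `ReplayBuffer`, its methods, and the `send` callback that reports whether a transaction completed cleanly are hypothetical and do not represent the POWER8 memory controller interface.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ReplayBuffer:
    """Hold each transaction until it completes so that it can be sent again."""

    def __init__(self, send: Callable[[int, Any], bool]) -> None:
        self.send = send                         # returns True on clean completion
        self.pending: "OrderedDict[int, Any]" = OrderedDict()
        self.next_tag = 0

    def issue(self, payload: Any) -> int:
        """Send a new transaction, keeping a copy until it completes."""
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = payload              # buffered before sending
        if self.send(tag, payload):
            del self.pending[tag]                # completed: release the entry
        return tag

    def replay(self) -> List[int]:
        """Resend buffered transactions, oldest first; return the tags that completed."""
        completed = []
        for tag, payload in list(self.pending.items()):
            if self.send(tag, payload):
                del self.pending[tag]
                completed.append(tag)
        return completed
```

A transaction whose first attempt hits a fault stays in `pending`, and a later call to `replay` sends it again from the buffered copy, so the requester does not need to reissue it.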