Chapter 4. Continuous availability and manageability
Processor instruction retry
As introduced with the POWER6 technology, the processor supports processor
instruction retry and alternate processor recovery for several core-related faults. These
functions significantly reduce exposure to both permanent and intermittent errors in the
processor core.
Intermittent errors, often caused by cosmic rays or other sources of radiation, are generally
not repeatable.
With the instruction retry function, when an error is encountered in the core, in caches, or in
certain logic functions, the processor first automatically retries the instruction. If
the source of the error was truly transient, the instruction succeeds and the system
continues as before.
Alternate processor retry
Hard failures are more difficult to handle because they are permanent errors that recur each
time the instruction is repeated. Retrying the instruction does not help in this situation
because the instruction continues to fail.
As introduced with POWER6, processors are able to extract the failing instruction
from the faulty core and retry it elsewhere in the system. The failing core is then dynamically
deconfigured and scheduled for replacement.
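
The two-stage recovery described above (retry on the same core, then move the
instruction to another core) can be illustrated with a small software model. This is
only a conceptual sketch: the real mechanism is implemented in processor hardware and
firmware, and the names Core, CoreFault, run_with_recovery, and max_retries are
hypothetical and do not correspond to any IBM interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class CoreFault(Exception):
    """Raised by this toy model when a core detects an error."""


@dataclass
class Core:
    # All fields are assumptions made for illustration only.
    name: str
    transient_faults: int = 0   # number of faults that clear after one attempt
    hard_fault: bool = False    # permanent fault: every attempt fails
    deconfigured: bool = False

    def execute(self, instruction: Callable[[], int]) -> int:
        if self.hard_fault:
            raise CoreFault(f"{self.name}: hard fault")
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise CoreFault(f"{self.name}: transient fault")
        return instruction()


def run_with_recovery(
    instruction: Callable[[], int],
    core: Core,
    alternates: List[Core],
    max_retries: int = 1,
) -> Tuple[int, Core]:
    """Return (result, core that completed the instruction)."""
    # Stage 1: instruction retry on the original core.
    for _ in range(1 + max_retries):
        try:
            return core.execute(instruction), core
        except CoreFault:
            continue

    # Stage 2: the fault persisted, so treat it as a hard failure.
    # Deconfigure the failing core and retry on an alternate core.
    core.deconfigured = True
    for alternate in alternates:
        if alternate.deconfigured:
            continue
        try:
            return alternate.execute(instruction), alternate
        except CoreFault:
            continue
    raise CoreFault("no healthy core available")
```

In this model a transient fault is absorbed by the retry in stage 1 and the original
core stays in service, while a hard fault exhausts the retries, marks the core as
deconfigured, and completes the instruction on an alternate core.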
Dynamic processor deallocation
Dynamic processor deallocation enables automatic deconfiguration of processor cores when
patterns of recoverable core-related faults are detected. Dynamic processor deallocation
prevents a recoverable error from escalating to an unrecoverable system error, which might
otherwise result in an unscheduled server outage. Dynamic processor deallocation relies on
the service processor’s ability to use FFDC-generated recoverable error information to notify
the POWER Hypervisor when a processor core reaches its predefined error limit. The
POWER Hypervisor then dynamically deconfigures the failing core and notifies the system
administrator that a replacement is needed. The entire process is transparent to the partition
owning the failing instruction.
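
The threshold logic behind dynamic processor deallocation can be sketched as a
per-core counter of recoverable errors. This is a simplified illustration, not the
service processor or POWER Hypervisor implementation: the class DeallocationMonitor,
its method names, and the error limit value are assumptions chosen for the example.

```python
from collections import Counter
from typing import List, Set


class DeallocationMonitor:
    """Toy model: deconfigure a core once it reaches its error limit."""

    def __init__(self, error_limit: int) -> None:
        # error_limit stands in for the "predefined error limit";
        # the actual value used by the firmware is not given in the text.
        self.error_limit = error_limit
        self.counts: Counter = Counter()
        self.deconfigured: Set[str] = set()
        self.notifications: List[str] = []

    def record_recoverable_error(self, core_id: str) -> bool:
        """Record one recoverable error; return True if the core
        was deconfigured as a result of this error."""
        if core_id in self.deconfigured:
            return False  # core is already out of service
        self.counts[core_id] += 1
        if self.counts[core_id] >= self.error_limit:
            # Limit reached: take the core out of service before the
            # fault can escalate, and tell the administrator.
            self.deconfigured.add(core_id)
            self.notifications.append(
                f"core {core_id} deconfigured; replacement needed"
            )
            return True
        return False


# Usage: with a hypothetical limit of 3, the third error triggers deallocation.
monitor = DeallocationMonitor(error_limit=3)
events = [monitor.record_recoverable_error("core2") for _ in range(3)]
```

The point of the design is that the core is removed while its errors are still
recoverable, so the workload never sees an unrecoverable fault from that core.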
Single processor checkstop
As in the POWER6 processor, the processor provides single-core checkstopping
for certain processor logic, command, or control errors that cannot be handled by the
availability enhancements in the preceding section.
This significantly reduces the probability of any one processor affecting total system
availability by containing most processor checkstops to the partition that was using the
processor at the time that the checkstop occurs.
Even with all these availability enhancements in place to prevent processor errors from
affecting system-wide availability, some errors can still result in a system-wide outage.
Before POWER6: On IBM systems prior to POWER6, such an error typically caused a
checkstop.