Chapter 4. Continuous availability and manageability

4.2.6 Special uncorrectable error handling

Although rare, an uncorrectable data error can occur in memory or a cache. IBM
processor-based systems attempt to minimize the disruption caused by an
uncorrectable error by using a well-defined strategy that first considers the data source.
Sometimes an uncorrectable error is temporary in nature and occurs in data that can be
recovered from another repository. Consider the following examples:

• Data in the instruction L1 cache is never modified within the cache itself. Therefore, an
  uncorrectable error that is discovered in the cache is treated like an ordinary cache miss,
  and correct data is loaded from the L2 cache.

• The L2 and L3 caches of the processor-based systems can hold an unmodified
  copy of data in a portion of main memory. In this case, an uncorrectable error simply
  triggers a reload of a cache line from main memory.
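The source-first strategy in the two examples above can be sketched as a small decision function. This is an illustrative model only, not firmware code: the names `CacheLevel`, `Recovery`, and `recover_action` are hypothetical, and the `modified` flag is an assumed stand-in for whatever state the hardware uses to know that a cache line still matches main memory.

```python
from enum import Enum, auto


class CacheLevel(Enum):
    L1_INSTRUCTION = auto()
    L2 = auto()
    L3 = auto()


class Recovery(Enum):
    RELOAD_FROM_L2 = auto()      # treat the error as an ordinary cache miss
    RELOAD_FROM_MEMORY = auto()  # refetch the cache line from main memory
    SUE_HANDLING = auto()        # no clean copy exists in another repository


def recover_action(level: CacheLevel, modified: bool) -> Recovery:
    """Pick a recovery path for an uncorrectable error in a cache line."""
    if level is CacheLevel.L1_INSTRUCTION:
        # Instruction L1 data is never modified within the cache itself.
        return Recovery.RELOAD_FROM_L2
    if not modified:
        # An unmodified L2 or L3 line is still a copy of main memory.
        return Recovery.RELOAD_FROM_MEMORY
    # The only valid copy is damaged, so the data cannot simply be reloaded.
    return Recovery.SUE_HANDLING
```

For example, `recover_action(CacheLevel.L3, modified=False)` selects a reload from main memory, while a modified line falls through to the handling described next.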

In cases where the data cannot be recovered from another source, a technique called special
uncorrectable error (SUE) handling is used to prevent an uncorrectable error in memory or
cache from immediately causing the system to terminate. Instead, the system tags the data
and determines whether it can ever be used again:

• If the error is irrelevant, it does not force a checkstop.

• If the data is used, termination can be limited to the program, kernel, or hypervisor that
  owns the data. If the data is to be transferred to an I/O device, the impact can be limited
  to a freeze of the I/O adapters that are controlled by an I/O hub controller.
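The idea of tagging data and deferring the failure until the data is actually used can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the hardware mechanism: `MemoryWord`, `mark_uncorrectable`, and `UncorrectableDataError` are hypothetical names, and treating an overwrite as the case where the error becomes irrelevant is an assumption made for illustration.

```python
from dataclasses import dataclass


class UncorrectableDataError(Exception):
    """Raised only when tagged data is actually consumed."""


@dataclass
class MemoryWord:
    value: int
    sue_tagged: bool = False  # stands in for the special marking on the data

    def mark_uncorrectable(self) -> None:
        # Detection alone stops nothing; it only tags the data.
        self.sue_tagged = True

    def overwrite(self, value: int) -> None:
        # Assumption: fresh data replaces the corrupt contents, so the
        # earlier error is irrelevant and the tag is cleared.
        self.value = value
        self.sue_tagged = False

    def read(self) -> int:
        # The error surfaces at the point of use, not at detection.
        if self.sue_tagged:
            raise UncorrectableDataError("tagged data was consumed")
        return self.value
```

In this model, a word that is tagged and then overwritten never raises an error, whereas a word that is tagged and then read fails only at the read.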

When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the standard ECC is no longer valid. The
service processor is then notified and takes appropriate actions. When running AIX V5.2 (or
later) or Linux, if a process attempts to use the data, the operating system is informed of
the error. The operating system might then terminate, or it might terminate only the specific
process that is associated with the corrupt data, depending on the operating system and
firmware level and whether the data was associated with a kernel or non-kernel process.

A reboot of the entire system is required only when the corrupt data is being used by the
POWER Hypervisor; the reboot preserves overall system integrity. If Active Memory Mirroring
is enabled, the entire system is protected and continues to run.
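The scope of impact described in the last two paragraphs can be summarized as a lookup on who owns the consumed data. The sketch below is a deliberate simplification: the names `Owner` and `impact_of_consumed_error` are hypothetical, and the real outcome for kernel and non-kernel data also depends on the operating system and firmware level, which this model ignores.

```python
from enum import Enum, auto


class Owner(Enum):
    USER_PROCESS = auto()
    KERNEL = auto()
    HYPERVISOR = auto()


def impact_of_consumed_error(owner: Owner, memory_mirroring: bool = False) -> str:
    """Return the assumed scope of termination when tagged data is used."""
    if owner is Owner.USER_PROCESS:
        # Only the process associated with the corrupt data is ended.
        return "terminate process"
    if owner is Owner.KERNEL:
        # The operating system that owns the data terminates.
        return "terminate operating system"
    if memory_mirroring:
        # With Active Memory Mirroring enabled, the system keeps running.
        return "continue running"
    # Corrupt hypervisor data without mirroring affects the whole system.
    return "reboot system"
```

The ordering of the branches mirrors the text: the narrowest scope is tried first, and a full reboot is the last resort.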

Depending on the system configuration and the source of the data, errors encountered during
I/O operations might not result in a machine check. Instead, the incorrect data is handled by
the PCI host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing data from being written to the I/O device. The PHB then enters a freeze mode,
halting normal operations. Depending on the model and type of I/O being used, the freeze
can include the entire PHB chip, or only a single bridge, resulting in the loss of all I/O
operations that use the frozen hardware until a power-on reset of the PHB. The impact to
partitions depends on how the I/O is configured for redundancy. In a server that is configured
for fail-over availability, redundant adapters spanning multiple PHB chips can enable the
system to recover transparently, without partition loss.
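The freeze behavior and the role of redundancy can be illustrated with a toy model. This is a sketch under stated assumptions rather than a description of the PHB hardware interface: the class `Phb`, its `corrupt` flag, and the helper `io_available` are hypothetical, and the model freezes a whole PHB rather than a single bridge.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Phb:
    name: str
    frozen: bool = False

    def write(self, data: bytes, corrupt: bool = False) -> bool:
        """Return True if the data was passed on to the I/O device."""
        if self.frozen:
            # All I/O through frozen hardware is lost until a reset.
            return False
        if corrupt:
            # Reject the bad data and enter freeze mode.
            self.frozen = True
            return False
        return True

    def power_on_reset(self) -> None:
        # A power-on reset of the PHB ends the freeze.
        self.frozen = False


def io_available(paths: Iterable[Phb]) -> bool:
    """A partition keeps its I/O if any redundant path is not frozen."""
    return any(not phb.frozen for phb in paths)
```

With adapters spread across two PHBs, a freeze on one still leaves `io_available` true for the partition; with a single path, the partition loses its I/O until the reset.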

4.2.7 PCI-enhanced error handling

IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. Although servers that rely on boot-time diagnostics can identify
failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a
more significant problem.