IBM Power 750 Express User Manual

Chapter 4. Continuous availability and manageability

In cases where the data cannot be recovered from another source, a technique called Special
Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in memory or
cache from immediately causing the system to terminate. Rather, the system tags the data
and determines whether it will ever be used again:

- If the error is irrelevant, SUE will not force a checkstop.

- If the data is used, termination can be limited to the program, kernel, or hypervisor that
  owns the data. If the data is about to be transferred to an I/O device, the I/O adapters
  controlled by an I/O hub controller can be frozen instead.

When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the “standard” ECC is no longer valid. The
service processor is then notified and takes appropriate actions. When the system is running
AIX 5.2 or later, or Linux, and a process attempts to use the data, the operating system is
informed of the error. The operating system might then terminate entirely, or terminate only
the specific process associated with the corrupt data, depending on the operating system and
firmware level and on whether the data was associated with a kernel or non-kernel process.

Only when the corrupt data is used by the POWER Hypervisor must the entire system be
rebooted, thereby preserving overall system integrity.
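
The decision flow described above can be summarized in a small model. This is only an illustrative sketch: the `Consumer` and `Action` enumerations and the `sue_action` function are hypothetical names, not part of any IBM firmware or operating system interface. The mapping also simplifies real behavior, which depends on the operating system and firmware level. In particular, it assumes that an error in non-kernel data always terminates only the owning process.

```python
from enum import Enum, auto


class Consumer(Enum):
    """Who, if anyone, eventually uses the data tagged by SUE handling."""
    NONE = auto()          # the tagged data is never referenced again
    USER_PROCESS = auto()  # a non-kernel process reads the data
    KERNEL = auto()        # the operating system kernel reads the data
    HYPERVISOR = auto()    # the POWER Hypervisor reads the data
    IO_DEVICE = auto()     # the data is about to be sent to an I/O device


class Action(Enum):
    """Simplified outcomes; the names are hypothetical labels for this sketch."""
    NO_ACTION = auto()
    TERMINATE_PROCESS = auto()
    TERMINATE_OS = auto()
    REBOOT_SYSTEM = auto()
    FREEZE_IO = auto()


# Assumed mapping from the consumer of the corrupt data to the resulting scope
# of termination. Real systems decide this in hardware, firmware, and the OS.
_ACTIONS = {
    Consumer.NONE: Action.NO_ACTION,               # no checkstop is forced
    Consumer.USER_PROCESS: Action.TERMINATE_PROCESS,
    Consumer.KERNEL: Action.TERMINATE_OS,
    Consumer.HYPERVISOR: Action.REBOOT_SYSTEM,     # the only full-system case
    Consumer.IO_DEVICE: Action.FREEZE_IO,          # adapters under the hub freeze
}


def sue_action(consumer: Consumer) -> Action:
    """Return the modeled outcome for data that was tagged as uncorrectable."""
    return _ACTIONS[consumer]
```

The point of the model is that the scope of the failure follows the owner of the data: only the hypervisor case maps to a full system reboot.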

Depending on system configuration and the source of the data, errors encountered during I/O
operations might not result in a machine check. Instead, the incorrect data is handled by the
processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing the data from being written to the I/O device.

The PHB then enters a freeze mode, halting normal operations. Depending on the model and
type of I/O being used, the freeze might include the entire PHB chip or only a single bridge.
All I/O operations that use the frozen hardware are lost until a power-on reset of the PHB is
performed. The impact to partitions depends on how the I/O is configured for
redundancy. In a server configured for failover availability, redundant adapters spanning
multiple PHB chips can enable the system to recover transparently, without partition loss.
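
The effect of redundancy on a PHB freeze can be illustrated with a small model. This is a sketch under stated assumptions, not a description of any real configuration tool: the `Adapter` class, the `partitions_losing_io` function, and the adapter, PHB, and partition names are all hypothetical. It assumes that the freeze covers the entire PHB chip and that every adapter owned by a partition is a redundant path to the same resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adapter:
    name: str
    phb: str        # identifier of the PHB chip the adapter sits behind
    partition: str  # partition that owns the adapter


def partitions_losing_io(adapters, frozen_phb):
    """Return the partitions left without a working adapter after a freeze."""
    owners = {a.partition for a in adapters}
    # A partition keeps I/O if at least one of its adapters is on another PHB.
    survivors = {a.partition for a in adapters if a.phb != frozen_phb}
    return sorted(owners - survivors)


if __name__ == "__main__":
    adapters = [
        Adapter("fc0", "phb0", "lpar1"),
        Adapter("fc1", "phb1", "lpar1"),  # redundant path on a second PHB
        Adapter("fc2", "phb0", "lpar2"),  # single path, no redundancy
    ]
    print(partitions_losing_io(adapters, "phb0"))  # prints ['lpar2']
```

In this example, `lpar1` has adapters behind two different PHB chips and keeps an I/O path when `phb0` freezes, whereas `lpar2` depends on `phb0` alone and loses its I/O.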

4.2.6 PCI Enhanced Error Handling

IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. Whereas servers that rely on boot-time diagnostics can identify
failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a
more significant problem.

PCI adapters are generally complex designs involving extensive on-board instruction
processing, often on embedded microcontrollers. They tend to use industry standard grade
components with an emphasis on product cost relative to high reliability. In certain cases, they
might be more likely to encounter internal microcode errors or many of the hardware errors
described for the rest of the server.

The traditional means of handling these problems is through adapter internal error reporting
and recovery techniques in combination with operating system device driver management
and diagnostics. In certain cases, an error in the adapter might cause transmission of bad
data on the PCI bus itself, resulting in a hardware-detected parity error and causing a global
machine check interrupt, eventually requiring a system reboot to continue.

Adapters enabled for PCI Enhanced Error Handling (EEH) respond to a special data packet
generated from the affected PCI slot hardware by calling system firmware, which will examine
the affected bus, allow the device driver to reset it, and continue without a system reboot. For