Chapter 4. Continuous availability and manageability
Depending on system configuration and the source of the data, errors encountered during I/O
operations might not result in a machine check. Instead, the incorrect data is handled by the
processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing the data from being written to the I/O device.
The PHB then enters a freeze mode, halting normal operations. Depending on the model and
type of I/O being used, the freeze might include the entire PHB chip or only a single bridge.
All I/O operations that use the frozen hardware are lost until a power-on reset of
the PHB is performed. The impact to partitions depends on how the I/O is configured for
redundancy. In a server configured for failover availability, redundant adapters spanning
multiple PHB chips can enable the system to recover transparently, without partition loss.
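The freeze and failover behavior described above can be illustrated with a minimal sketch.
The class HostBridge, the function redundant_write, and the parity_ok flag are hypothetical
names invented for this illustration. They model the described behavior only and do not
correspond to any firmware or operating system interface.

```python
from dataclasses import dataclass


@dataclass
class HostBridge:
    """Hypothetical model of a PHB that freezes when it detects bad data."""
    name: str
    frozen: bool = False

    def write(self, data: bytes, parity_ok: bool) -> bool:
        # A frozen bridge rejects all I/O until it is reset.
        if self.frozen:
            return False
        if not parity_ok:
            # Reject the bad data and enter freeze mode.
            self.frozen = True
            return False
        return True

    def power_on_reset(self) -> None:
        # Only a reset returns the bridge to normal operation.
        self.frozen = False


def redundant_write(bridges, data, parity_ok_by_bridge):
    """Try each redundant path in order.

    Returns the name of the bridge that accepted the write, or None
    if every path rejected it.
    """
    for bridge in bridges:
        ok = parity_ok_by_bridge.get(bridge.name, True)
        if bridge.write(data, ok):
            return bridge.name
    return None
```

With two bridges configured, an error on the first path freezes that bridge and the write
completes on the second. With a single bridge, the write is lost until the bridge is reset.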
4.2.6 PCI Enhanced Error Handling
IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. Servers that rely on boot-time diagnostics can identify
failing components, which can then be replaced by hot-swap and reconfiguration. Runtime
errors pose a more significant problem.
PCI adapters are generally complex designs involving extensive on-board instruction
processing, often on embedded microcontrollers. They tend to use industry-standard-grade
components, with an emphasis on product cost over high reliability. In certain cases, they
might be more likely to encounter internal microcode errors or many of the hardware errors
described for the rest of the server.
The traditional means of handling these problems is through adapter internal error reporting
and recovery techniques in combination with operating system device driver management
and diagnostics. In certain cases, an error in the adapter might cause transmission of bad
data on the PCI bus itself, resulting in a hardware-detected parity error and causing a global
machine check interrupt, eventually requiring a system reboot to continue.
PCI Enhanced Error Handling (EEH) enabled adapters respond to a special data packet
generated from the affected PCI slot hardware by calling system firmware. The firmware
examines the affected bus and allows the device driver to reset it, so that operation
continues without a system reboot. For Linux, EEH support extends to the majority of
frequently used devices, although various third-party PCI devices might not provide
native EEH support.
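The EEH recovery sequence can be sketched as a small state machine. This is a simplified
model under stated assumptions: the names Slot, EehDriver, SlotState, and the three callback
methods are hypothetical and chosen for illustration. The sketch shows only the order of
events (isolate the slot, notify the driver, reset, resume) and what happens when a device
has no EEH-aware driver.

```python
from enum import Enum


class SlotState(Enum):
    NORMAL = "normal"
    FROZEN = "frozen"


class EehDriver:
    """Hypothetical EEH-aware driver that records the callbacks it receives."""

    def __init__(self):
        self.events = []

    def error_detected(self):
        self.events.append("error_detected")

    def slot_reset(self) -> bool:
        self.events.append("slot_reset")
        return True  # report that the device was re-initialized

    def resume(self):
        self.events.append("resume")


class Slot:
    """Hypothetical PCI slot that is isolated when an error is reported."""

    def __init__(self, driver=None):
        self.state = SlotState.NORMAL
        self.driver = driver

    def report_error(self) -> bool:
        """Isolate the slot and attempt recovery.

        Returns True if the slot recovered without a system reboot.
        """
        self.state = SlotState.FROZEN
        if self.driver is None:
            return False  # no EEH support: the slot stays frozen
        self.driver.error_detected()
        if not self.driver.slot_reset():
            return False
        self.state = SlotState.NORMAL
        self.driver.resume()
        return True
```

A slot with an EEH-aware driver returns to the normal state after the reset. A slot without
one remains frozen, which corresponds to the third-party devices that lack native EEH support.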
To detect and correct PCIe bus errors, processor-based systems use CRC
detection and instruction retry correction, while for PCI-X they use ECC.
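The general idea of detecting an error with a CRC and correcting it by retrying can be
sketched as follows. This is a simplified stand-in and not the actual bus mechanism: it uses
the CRC-32 from the Python zlib module and a resend loop, and the functions frame, check, and
send_with_retry, as well as the channel callable, are hypothetical names for this example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Return the payload if its CRC matches, otherwise None."""
    payload = framed[:-4]
    received = int.from_bytes(framed[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Send the payload through channel, resending on a CRC mismatch.

    channel is any callable that takes the framed bytes and returns the
    bytes as seen by the receiver. Returns (payload, attempts_used) or
    raises OSError if every attempt arrives corrupted.
    """
    for attempt in range(1, max_attempts + 1):
        result = check(channel(frame(payload)))
        if result is not None:
            return result, attempt
    raise OSError("CRC mismatch persisted after retries")
```

A transient error is corrected by the resend and is invisible to the caller apart from the
attempt count. A persistent error exhausts the retries and is reported.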