IBM PS700 Technical Overview And Introduction

Chapter 4. Continuous availability and manageability

• Fault monitoring

The built-in self-test (BIST) checks the processor, cache, memory, and associated hardware
required to boot the operating system. It runs when the system is powered on at the initial
installation or after a hardware configuration change (for example, an upgrade). If a
non-critical error is detected, or if the error occurs in a resource that can be removed from
the system configuration, the boot process is designed to proceed to completion. The
errors are logged in the system nonvolatile random access memory (NVRAM). When the
operating system completes booting, the information is passed from the NVRAM to the
system error log, where it is analyzed by error log analysis (ELA) routines. Appropriate
actions are then taken to report the boot-time error for subsequent service, if required.
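The boot-time flow described above can be sketched in a few lines of Python. This is an illustrative model only, not IBM firmware: the names `BootError`, `BootResult`, `run_boot`, and `transfer_to_error_log`, and the resource labels in the usage note, are assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BootError:
    resource: str    # hypothetical resource label, e.g. "memory DIMM 3"
    critical: bool   # True if the resource is needed to finish booting
    removable: bool  # True if the resource can be removed from the configuration


@dataclass
class BootResult:
    completed: bool = True
    nvram_log: list = field(default_factory=list)     # stands in for NVRAM
    deconfigured: list = field(default_factory=list)  # resources taken offline


def run_boot(bist_findings):
    """Apply the rule from the text: boot proceeds unless an error is
    both critical and located in a resource that cannot be removed."""
    result = BootResult()
    for err in bist_findings:
        result.nvram_log.append(err)  # every detected error is logged
        if err.removable:
            result.deconfigured.append(err.resource)
        elif err.critical:
            result.completed = False  # boot cannot proceed to completion
            break
    return result


def transfer_to_error_log(result):
    """Once the OS is up, pass the NVRAM entries to the system error log,
    where ELA routines (not modeled here) would analyze them."""
    if not result.completed:
        return []
    return [f"boot-time error: {e.resource}" for e in result.nvram_log]
```

For example, `run_boot([BootError("memory DIMM 3", critical=True, removable=True)])` completes the boot with the DIMM deconfigured, and `transfer_to_error_log` then yields one entry for later analysis.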

Error checkers

IBM POWER processor-based systems contain specialized hardware detection circuitry that
is used to detect erroneous hardware operations. Error checking hardware ranges from parity
error detection coupled with processor instruction retry and bus retry, to ECC correction on
caches and system buses. All IBM hardware error checkers have distinct attributes:

• Continual monitoring of system operations to detect potential calculation errors.

• Attempts to isolate physical faults based on run-time detection of each unique failure.

• Ability to initiate a wide variety of recovery mechanisms designed to correct the problem.
The POWER processor-based systems include extensive hardware and firmware
recovery logic.
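As a simplified illustration of parity error detection coupled with retry, the following Python sketch computes an even-parity bit and repeats a read when the check fails. It models the idea in software only; the actual checkers are hardware circuits, and the function names and the `max_retries` parameter are assumptions made for this example.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Return True if the word still matches the parity stored with it."""
    return parity_bit(word) == stored_parity


def read_with_retry(read_fn, max_retries: int = 1):
    """Call read_fn() -> (word, stored_parity) and retry on a parity error.

    Returns (word, retries_used). Raises RuntimeError if every attempt
    fails the parity check.
    """
    for attempt in range(max_retries + 1):
        word, stored_parity = read_fn()
        if parity_ok(word, stored_parity):
            return word, attempt
    raise RuntimeError("parity error persisted after retry")
```

A single parity bit detects any odd number of flipped bits but cannot correct them, which is why the text pairs parity with retry and reserves ECC for correction on caches and system buses.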

Fault isolation registers

Error checker signals are captured and stored in hardware fault isolation registers (FIRs). The
associated logic circuitry is used to limit the domain of an error to the first checker that
encounters the error. In this way, run-time error diagnostics can be deterministic so that for
every check station, the unique error domain for that checker is defined and documented.
Ultimately, the error domain becomes the field-replaceable unit (FRU) call, and manual
interpretation of the data is not normally required.
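The idea of limiting an error to the first checker that sees it, and then mapping that checker to an FRU, can be modeled roughly as follows. This is a sketch under stated assumptions: the class `FaultIsolationRegister`, the bit assignments, and the FRU names are hypothetical and do not reflect the actual POWER register layout.

```python
class FaultIsolationRegister:
    """Toy model of a FIR: one bit per error checker, first error wins."""

    def __init__(self, fru_by_bit):
        # Maps a checker's bit position to the FRU that its error domain
        # resolves to (hypothetical assignments supplied by the caller).
        self.fru_by_bit = dict(fru_by_bit)
        self.value = 0

    def report(self, bit: int) -> bool:
        """Latch a checker signal. Only the first reported error is kept,
        so later downstream effects do not obscure the origin."""
        if bit not in self.fru_by_bit:
            raise ValueError(f"no checker defined for bit {bit}")
        if self.value == 0:
            self.value = 1 << bit
            return True
        return False

    def fru_call(self):
        """Return the FRU for the latched checker, or None if no error."""
        if self.value == 0:
            return None
        return self.fru_by_bit[self.value.bit_length() - 1]
```

With a mapping such as `{0: "processor module", 1: "memory DIMM", 2: "I/O bridge"}`, reporting bit 1 and then bit 2 leaves `fru_call()` returning `"memory DIMM"`, because the second signal arrives after the first checker has already captured the error.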

First-failure data capture (FFDC)

First-failure data capture (FFDC) is an error isolation technique that ensures that when a
fault is detected in a system through error checkers or other types of detection methods, the
root cause of the fault is captured without the need to recreate the problem or run an
extended tracing or diagnostics program.

For the vast majority of faults, a good FFDC design means that the root cause is detected
automatically without intervention by a service representative. Pertinent error data related to
the fault is captured and saved for analysis. In hardware, FFDC data is collected from the fault
isolation registers and from the associated logic. In firmware, this data consists of return
codes, function calls, and so forth.
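On the firmware side, capturing return codes and function calls at the moment of the first failure might look like the following minimal Python sketch. The `FFDCRecorder` class, its `watch` decorator, and the functions `init_memory` and `boot` with their return codes are all hypothetical, introduced only to illustrate the technique.

```python
import functools


class FFDCRecorder:
    """Keeps the data for the first failing call and ignores later ones."""

    def __init__(self):
        self.first_failure = None

    def watch(self, func):
        """Decorator: record the function name, arguments, and return code
        the first time any watched function returns a nonzero code."""
        @functools.wraps(func)
        def wrapper(*args):
            rc = func(*args)
            if rc != 0 and self.first_failure is None:
                self.first_failure = {
                    "function": func.__name__,
                    "args": args,
                    "return_code": rc,
                }
            return rc
        return wrapper


recorder = FFDCRecorder()


@recorder.watch
def init_memory(size_mb):
    # Hypothetical step: fails with code 22 when no memory is available.
    return 0 if size_mb > 0 else 22


@recorder.watch
def boot(size_mb):
    # The caller fails too, but only as a consequence of init_memory.
    return 1 if init_memory(size_mb) != 0 else 0
```

Calling `boot(0)` returns a nonzero code, yet `recorder.first_failure` names `init_memory` rather than `boot`: the innermost failure is recorded first, so the captured data points at the origin of the fault without rerunning the scenario.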

FFDC check stations are carefully positioned within the server logic and data paths to ensure
that potential errors can be quickly identified and accurately traced to an FRU.

This proactive diagnostic strategy is a significant improvement over the classic, less accurate
reboot-and-diagnose service approaches.