Chapter 4. Reliability, availability, and serviceability
Draft Document for Review October 14, 2014 10:19 am
The serviceability features that are delivered in this system provide a highly efficient service
environment by incorporating the following attributes:
Design for customer setup (CSU), customer installed features (CIF), and
customer-replaceable units (CRU)
Error detection and fault isolation (ED/FI) incorporating First-Failure Data Capture (FFDC)
Converged service approach across multiple IBM server platforms
Concurrent Firmware Maintenance (CFM)
This section provides an overview of how these attributes contribute to efficient service in the
progressive steps of error detection, analysis, reporting, notification, and repair found in all
POWER processor-based systems.
4.6.1 Detecting errors
The first and most crucial component of a solid serviceability strategy is the ability to detect
errors accurately and effectively when they occur.
Although not every error is a guaranteed threat to system availability, errors that go undetected
can cause problems because the system has no opportunity to evaluate them and act if necessary.
POWER processor-based systems employ IBM System z® server-inspired error detection
mechanisms, extending from processor cores and memory to power supplies and hard disk
drives (HDDs).
4.6.2 Error checkers, fault isolation registers, and First-Failure Data Capture
IBM POWER processor-based systems contain specialized hardware detection circuitry that
is used to detect erroneous hardware operations. Error-checking hardware ranges from parity
error detection that is coupled with Processor Instruction Retry and bus retry, to ECC
correction on caches and system buses.
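The difference between parity detection and ECC correction can be illustrated with a
minimal software sketch. It is an illustration only and does not represent the circuitry in
POWER processors: the function names are hypothetical, and a textbook Hamming(7,4) code
stands in for whichever ECC scheme the hardware actually uses. Parity can report that a bit
flipped but cannot say which one, so the recovery action is a retry. The Hamming code
locates a single flipped bit and corrects it in place.

```python
def parity_bit(bits):
    """Return the even-parity bit for a list of 0/1 values."""
    p = 0
    for b in bits:
        p ^= b
    return p


def parity_error(bits, stored_parity):
    """Detect (but not locate) an odd number of flipped bits."""
    return parity_bit(bits) != stored_parity


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7).

    Layout: p1 p2 d1 p4 d2 d3 d4, where each parity bit covers the
    positions whose binary index includes that parity bit's position.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based error position or 0 if clean)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome names the bad bit
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return c, position
```

In this sketch, a single flipped bit in a parity-protected word makes `parity_error` return
true with no indication of where the flip occurred, whereas `hamming74_correct` returns both
the repaired codeword and the position of the bit that was flipped.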
Within the processor and memory subsystem, error-checker signals are captured
and stored in hardware fault isolation registers (FIRs). The associated logic circuitry is used to limit the domain of an
error to the first checker that encounters the error. In this way, runtime error diagnostic tests
can be deterministic so that for every check station, the unique error domain for that checker
is defined and mapped to field-replaceable units (FRUs) that can be repaired when
necessary.
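The idea of limiting the error domain to the first checker can be sketched in software as
follows. This is a toy model under stated assumptions, not the hardware design: the class
name, the checker identifiers, and the checker-to-FRU table are all hypothetical. The point
it illustrates is that only the first checker to report is latched, so later reports caused
by the error propagating do not change the diagnosis.

```python
class FaultIsolationRegister:
    """Toy FIR: latches the first checker that reports an error."""

    def __init__(self, fru_map):
        # fru_map: checker id -> FRU name (hypothetical example data)
        self.fru_map = dict(fru_map)
        self.first_checker = None
        self.secondary = []

    def report(self, checker_id):
        """Record an error signal from a checker."""
        if self.first_checker is None:
            self.first_checker = checker_id  # the error domain is fixed here
        else:
            self.secondary.append(checker_id)  # downstream effect, kept for reference

    def isolate(self):
        """Map the latched checker to the FRU to repair, or None if no error."""
        if self.first_checker is None:
            return None
        return self.fru_map[self.first_checker]


# Hypothetical checker-to-FRU mapping used only for this example.
EXAMPLE_FRU_MAP = {
    "l2_cache_parity": "processor module",
    "memory_bus_ecc": "memory DIMM",
    "core_result_check": "processor module",
}
```

For example, if `memory_bus_ecc` reports first and `core_result_check` reports afterward
because bad data reached the core, `isolate()` in this sketch still returns the FRU that is
mapped to the memory bus checker.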
Integral to the Power Systems design is the concept of First-Failure Data Capture (FFDC).
FFDC is a technique that involves sufficient error-checking stations and coordination of
faults so that faults are detected and the root cause of the fault is isolated. FFDC also
expects that necessary fault information can be collected at the time of failure without
needing to re-create the problem or run an extended tracing or diagnostics program.
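As a rough software analogy for capturing data at the time of the first failure, consider
the following sketch. It is not how the hardware or firmware implements FFDC: the class and
method names are hypothetical, and a bounded in-memory trace stands in for the hardware
state that is latched on a fault. The sketch shows two properties described above: context
is already being kept before the failure happens, and the snapshot taken at the first
failure is not overwritten by later ones.

```python
from collections import deque


class FirstFailureCapture:
    """Keep a short rolling trace and freeze it at the first failure."""

    def __init__(self, depth=4):
        self.trace = deque(maxlen=depth)  # the oldest entries drop off automatically
        self.snapshot = None

    def record(self, event):
        """Log routine activity so that context exists before any failure."""
        self.trace.append(event)

    def fail(self, reason):
        """Capture the context once; later failures keep the first snapshot."""
        if self.snapshot is None:
            self.snapshot = {"reason": reason, "trace": list(self.trace)}
        return self.snapshot
```

Because the snapshot is taken when `fail` is first called, the sketch does not need the
problem to be reproduced or a separate tracing run to be started afterward.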
For the vast majority of faults, a good FFDC design means that the root cause is isolated at
the time of the failure without intervention by an IBM service support representative (SSR).
For all faults, good FFDC design still makes failure information available to the IBM SSR,
and this information can be used to confirm the automatic diagnosis. More detailed
information can be collected by an IBM SSR for rare cases where the automatic diagnosis is
not adequate for fault isolation.