Draft Document for Review October 14, 2014 10:19 am
IBM Power Systems E870 and E880 Technical Overview and Introduction
Run time
All Power Systems servers can monitor critical system components during run time, and they
can take corrective actions when recoverable faults occur. The IBM hardware error-check
architecture can report non-critical errors in the Central Electronics Complex in an
out-of-band communications path to the service processor without affecting system
performance.
A significant part of IBM runtime diagnostic capabilities originates with the service processor.
Extensive diagnostic and fault analysis routines were developed and improved over many
generations of POWER processor-based servers, and enable quick and accurate predefined
responses to both actual and potential system problems.
The service processor correlates and processes runtime error information by using logic that
is derived from IBM engineering expertise to count recoverable errors (called thresholding)
and predict when corrective actions must be automatically initiated by the system. These
actions can include the following items:
Requests for a part to be replaced
Dynamic invocation of built-in redundancy for automatic replacement of a failing part
Dynamic deallocation of failing components so that system availability is maintained
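The thresholding idea described above can be illustrated with a minimal sketch. This is not the service processor firmware logic, which is not described in this document. The class name, the sliding time window, the limit values, and the component label are all hypothetical and are chosen only to show how counting recoverable errors can lead to a corrective action before a hard failure occurs.

```python
from collections import defaultdict, deque


class RecoverableErrorThreshold:
    """Hypothetical sketch: count recoverable errors per component
    inside a sliding time window and flag when a limit is reached."""

    def __init__(self, limit, window_seconds):
        self.limit = limit                    # assumed error-count limit
        self.window_seconds = window_seconds  # assumed observation window
        self._events = defaultdict(deque)     # component -> error timestamps

    def record(self, component, timestamp):
        """Record one recoverable error. Return True when the number of
        errors still inside the window has reached the limit."""
        events = self._events[component]
        events.append(timestamp)
        # Discard errors that are older than the observation window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.limit


def choose_action(spare_available):
    """Map a crossed threshold to one of the actions listed in the text.
    The selection rule itself is an assumption made for illustration."""
    if spare_available:
        return "invoke built-in redundancy"
    return "deallocate component and request part replacement"


if __name__ == "__main__":
    monitor = RecoverableErrorThreshold(limit=3, window_seconds=60)
    for t in (0, 10, 20):
        crossed = monitor.record("memory-module-A", t)
    if crossed:
        print(choose_action(spare_available=True))
```

In this sketch, isolated errors that are spread far apart never reach the limit, whereas a burst of errors on the same component within the window does. That distinction is the reason for counting errors rather than reacting to each one.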
Device drivers
In certain cases, diagnostic tests are best performed by operating system-specific drivers,
most notably for adapters or I/O devices that are owned directly by a logical partition. In these
cases, the operating system device driver often works with I/O device microcode to isolate
and recover from problems. Potential problems are reported to an operating system device
driver, which logs the error. In servers that are not managed by an HMC, the OS can start the Call Home
application to report the service event to IBM. In servers that are managed by an optional HMC, the event
is reported to the HMC, which can initiate the Call Home request to IBM. I/O devices can also
include specific exercisers that can be started by the diagnostic facilities for problem
recreation (if required by service procedures).
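The two reporting paths described above can be summarized in a short sketch. The function name and the step strings are hypothetical and only restate the flow in the text. They do not correspond to any real IBM interface.

```python
def service_event_path(hmc_managed):
    """Hypothetical sketch: return the ordered reporting steps for a
    problem detected by an operating system device driver."""
    steps = ["device driver logs the error"]
    if hmc_managed:
        # HMC-managed server: the HMC receives the event and can call home.
        steps.append("event is reported to the HMC")
        steps.append("HMC can initiate the Call Home request to IBM")
    else:
        # Server without an HMC: the operating system can call home itself.
        steps.append("OS can start the Call Home application")
    return steps


if __name__ == "__main__":
    for managed in (False, True):
        print(managed, service_event_path(managed))
```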
4.6.5 Reporting
In the unlikely event that a system hardware or environmentally induced failure is diagnosed,
IBM Power Systems servers report the error through various mechanisms. The analysis
result is stored in system NVRAM. Error log analysis (ELA) can be used to display the failure
cause and the physical location of the failing hardware.
Using the Call Home infrastructure, the system can automatically send an alert through a
phone line to a pager, or call for service if there is a critical system failure. A hardware fault
also illuminates the amber system fault LED on the system unit to alert the user to
an internal hardware problem.
On POWER8 processor-based servers, hardware and software failures are recorded in the
system log. When a management console is attached, an ELA routine analyzes the error,
forwards the event to the Service Focal Point™ (SFP) application running on the
management console, and can notify the system administrator that it isolated a likely cause of
the system problem. The service processor event log also records unrecoverable checkstop
conditions, forwards them to the SFP application, and notifies the system administrator. After
the information is logged in the SFP application, if the system is correctly configured, a Call
Home service request is initiated and the pertinent failure data with service parts information
and part locations is sent to the IBM service organization. This information also contains the
client contact information as defined in the IBM Electronic Service Agent (ESA) guided setup