
Draft Document for Review October 14, 2014 10:19 am


IBM Power Systems E870 and E880 Technical Overview and Introduction

Run time

All Power Systems servers can monitor critical system components during run time, and they
can take corrective actions when recoverable faults occur. The IBM hardware error-check
architecture can report non-critical errors in the Central Electronics Complex in an
out-of-band communications path to the service processor without affecting system
performance.

A significant part of IBM runtime diagnostic capabilities originates with the service processor.
Extensive diagnostic and fault analysis routines were developed and improved over many
generations of POWER processor-based servers, and enable quick and accurate predefined
responses to both actual and potential system problems.

The service processor correlates and processes runtime error information by using logic that
is derived from IBM engineering expertise to count recoverable errors (called thresholding)
and predict when corrective actions must be automatically initiated by the system. These
actions can include the following items:

• Requests for a part to be replaced

• Dynamic invocation of built-in redundancy for automatic replacement of a failing part

• Dynamic deallocation of failing components so that system availability is maintained
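
The thresholding idea described above (counting recoverable errors and acting once a limit is crossed) can be illustrated with a minimal sketch. This is not the service processor's actual logic, which this document does not describe. The class name, the sliding-window approach, and the limit and window values are all assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorThreshold:
    """Hypothetical sliding-window counter for recoverable errors."""
    limit: int        # assumed: errors tolerated within the window
    window_s: float   # assumed: window length in seconds
    _events: deque = field(default_factory=deque)

    def record(self, timestamp: float) -> bool:
        """Record one recoverable error.

        Returns True when the number of errors inside the window
        exceeds the limit, meaning a corrective action is due.
        """
        self._events.append(timestamp)
        # Discard errors that have aged out of the window.
        while timestamp - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) > self.limit


# Usage: four errors within 60 seconds cross an assumed limit of three.
counter = ErrorThreshold(limit=3, window_s=60.0)
results = [counter.record(t) for t in (0.0, 10.0, 20.0, 30.0)]
```

In this sketch, a True result stands for the point at which one of the actions listed above (a part replacement request, invocation of redundancy, or deallocation) would be triggered. Choosing among those actions is outside what the sketch models.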

Device drivers

In certain cases, diagnostic tests are best performed by operating system-specific drivers,
most notably for adapters or I/O devices that are owned directly by a logical partition. In these
cases, the operating system device driver often works with I/O device microcode to isolate
and recover from problems. Potential problems are reported to an operating system device
driver, which logs the error. In non-HMC managed servers, the OS can start the Call Home
application to report the service event to IBM. For optional HMC managed servers, the event
is reported to the HMC, which can initiate the Call Home request to IBM. I/O devices can also
include specific exercisers that can be started by the diagnostic facilities for problem
recreation (if required by service procedures).
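
The reporting path described above depends on whether the server is HMC managed. A minimal sketch of that decision follows. The function and enum names are hypothetical and do not correspond to any IBM API; the sketch models only which component starts the Call Home request.

```python
from enum import Enum


class CallHomeInitiator(Enum):
    """Component that starts the Call Home request (illustrative names)."""
    OPERATING_SYSTEM = "operating system"
    HMC = "HMC"


def call_home_initiator(hmc_managed: bool) -> CallHomeInitiator:
    """Return which component reports the service event to IBM.

    After the device driver logs the error, an HMC-managed server
    reports the event to the HMC, which can initiate Call Home.
    Otherwise the operating system can start Call Home itself.
    """
    if hmc_managed:
        return CallHomeInitiator.HMC
    return CallHomeInitiator.OPERATING_SYSTEM
```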

4.6.5 Reporting

In the unlikely event that a system hardware or environmentally induced failure is diagnosed,
IBM Power Systems servers report the error through various mechanisms. The analysis
result is stored in system NVRAM. Error log analysis (ELA) can be used to display the failure
cause and the physical location of the failing hardware.

Using the Call Home infrastructure, the system can automatically send an alert through a
phone line to a pager, or call for service if there is a critical system failure. A hardware fault
also illuminates the amber system fault LED, which is on the system unit, to alert the user of
an internal hardware problem.

On POWER8 processor-based servers, hardware and software failures are recorded in the
system log. When a management console is attached, an ELA routine analyzes the error,
forwards the event to the Service Focal Point™ (SFP) application running on the
management console, and can notify the system administrator that it isolated a likely cause of
the system problem. The service processor event log also records unrecoverable checkstop
conditions, forwards them to the SFP application, and notifies the system administrator. After
the information is logged in the SFP application, if the system is correctly configured, a Call
Home service request is initiated and the pertinent failure data with service parts information
and part locations is sent to the IBM service organization. This information also contains the
client contact information as defined in the IBM Electronic Service Agent (ESA) guided setup