IBM Power 720 and 740 Technical Overview and Introduction
and decide whether the fault is a call-home candidate. If the fault requires support
intervention, a call will be placed with service and support, and a notification will be sent to
the contact that is defined in the ESA guided setup wizard.
Remote support
The Resource Monitoring and Control (RMC) subsystem is delivered as part of the base
operating system, including the operating system that runs on the Hardware Management
Console. RMC provides a secure transport mechanism across the LAN interface between the
operating system and the Hardware Management Console and is used by the operating
system diagnostic application for transmitting error information. RMC also performs several
other functions, but these are not used by the service infrastructure.
Service Focal Point (SFP)
A critical requirement in a logically partitioned environment is to ensure that errors are not lost
before being reported for service, and that each error is reported only once, regardless
of how many logical partitions experience the potential effect of the error. The Manage
Serviceable Events task on the management console aggregates duplicate error reports and
ensures that all errors are recorded for review and management.
When a locally or globally reported service request is made to the operating system, the
operating system diagnostic subsystem uses the RMC subsystem to relay error information to
the Hardware Management Console. For global events (platform unrecoverable errors, for
example), the service processor also forwards error notification of these events to the
Hardware Management Console, providing a redundant error-reporting path in case of errors
in the RMC subsystem network.
The first occurrence of each failure type is recorded in the Manage Serviceable Events task
on the management console. This task then filters and maintains a history of duplicate
reports from other logical partitions on the service processor. It then looks at all active service
event requests, analyzes the failure to ascertain the root cause and, if enabled, initiates a
call-home for service. This methodology ensures that every platform error is reported
through at least one functional path and ultimately results in a single notification for a single
problem.
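
The filtering behavior described above can be sketched in a few lines of Python. This is an
illustrative model only, not the management console implementation: the class names, the
error_id key, and the reporter labels are all hypothetical. It shows the first report of a failure
opening a serviceable event, later reports of the same failure being kept as history, and at
most one call-home being initiated per failure.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceableEvent:
    """One entry per distinct failure; duplicates are kept only as history."""
    error_id: str          # hypothetical key that identifies the failure
    first_reporter: str    # partition or service processor that reported first
    duplicates: list = field(default_factory=list)
    called_home: bool = False


class ServiceableEventLog:
    """Hypothetical stand-in for the event list kept on the management console."""

    def __init__(self, call_home_enabled=True):
        self.call_home_enabled = call_home_enabled
        self.events = {}        # error_id -> ServiceableEvent
        self.calls_placed = []  # error_ids for which a call-home was initiated

    def report(self, error_id, reporter):
        """Record a report; return True only if it opened a new event."""
        event = self.events.get(error_id)
        if event is not None:
            # Duplicate report: keep it as history, do not notify again.
            event.duplicates.append(reporter)
            return False
        event = ServiceableEvent(error_id, reporter)
        self.events[error_id] = event
        if self.call_home_enabled:
            event.called_home = True
            self.calls_placed.append(error_id)
        return True


log = ServiceableEventLog()
# The same platform error arrives from two partitions (over RMC) and from
# the service processor (the redundant path).
log.report("PLATFORM-ERR-1", "lpar1")
log.report("PLATFORM-ERR-1", "lpar2")
log.report("PLATFORM-ERR-1", "service-processor")
print(len(log.events), len(log.calls_placed))
```

Keying the log on the failure rather than on the reporter is what lets reports from either path
collapse into one event, whichever path delivers the error first.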
Extended error data
Extended error data (EED) is additional data that is collected either automatically at the time
of a failure or manually at a later time. The data collected depends on the invocation
method, but it includes information such as firmware levels, operating system levels,
additional fault isolation register values, recoverable error threshold register values, system
status, and any other pertinent data.
The data is formatted and prepared for transmission back to IBM either to assist the service
support organization with preparing a service action plan for the service representative or for
additional analysis.
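
As a rough illustration of how the collected data can vary with the invocation method, the
following Python sketch builds an EED record from a system snapshot. The function name, the
field names, the placeholder values, and in particular the rule that register values are
captured only on automatic collection are assumptions made for this example; the text above
does not specify which items belong to which invocation method.

```python
def collect_eed(invocation, system):
    """Build a hypothetical EED record from a snapshot of system state."""
    if invocation not in ("automatic", "manual"):
        raise ValueError(f"unknown invocation method: {invocation}")

    # Items gathered regardless of how collection was started.
    record = {
        "invocation": invocation,
        "firmware_level": system["firmware_level"],
        "os_level": system["os_level"],
        "system_status": system["status"],
    }

    if invocation == "automatic":
        # Assumption for this sketch: register contents are captured only
        # when collection runs at the time of the failure.
        record["fault_isolation_registers"] = system["fir_values"]
        record["recoverable_error_thresholds"] = system["threshold_values"]

    return record


# Placeholder snapshot; none of these values come from a real system.
snapshot = {
    "firmware_level": "FW-PLACEHOLDER",
    "os_level": "OS-PLACEHOLDER",
    "status": "degraded",
    "fir_values": [0x0, 0x1],
    "threshold_values": [3],
}

automatic_record = collect_eed("automatic", snapshot)
manual_record = collect_eed("manual", snapshot)
print(sorted(set(automatic_record) - set(manual_record)))
```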
System-dump handling
In certain circumstances, an error might require a dump to be created, either automatically or
manually. In this event, the dump is off-loaded to the management console. Specific
management console information is included as part of the information that can optionally be
sent to IBM support for analysis. If additional information relating to the dump is required, or if
viewing the dump remotely becomes necessary, the management console dump record tells
the IBM support center which management console holds the dump.