Fault management overview
The goal of fault management and monitoring is to increase server blade availability by moving
from a reactive strategy for fault detection, diagnosis, and repair to a proactive one. The
objectives are:
• To detect issues automatically, as close as possible to the time of occurrence.
• To diagnose issues automatically, at the time of detection.
• To automatically report (in understandable text) a description of the issue, the likely causes of
the issue, the recommended actions to resolve the issue, and detailed information about the
issue.
• To ensure that tools are available to repair or recover from the fault.
HP-UX Fault management
Proactive fault prediction and notification are provided on HP-UX by System Fault Management
(SFM) and WBEM indications. WBEM (Web-Based Enterprise Management) is a collection of
standards that aid large-scale systems management. WBEM allows management applications to
monitor systems in a network.
SFM and WBEM indication providers enable users to monitor the operation of a wide variety of
hardware products, and alert users immediately if any failure or other unusual event occurs. By
using hardware event monitoring, users can virtually eliminate undetected hardware failures that
could interrupt server blade operation or cause data loss.
HP SMH is the application used to query information about monitored devices and to view
indications and instances on WBEM. This WBEM-based network management application enables
you to create subscriptions and view indications.
SysMgmtPlus functionality displays the property pages of various devices and firmware on HP
SMH. SysMgmtPlus enables HP SMH to display improved property pages that contain dynamic
content, allowing the user to show and hide details of devices and firmware. Health tests are
associated with components. The health test feature provides an option to run a health test on
all the device instances of a component.
For complete information on installing, administering, and troubleshooting SFM software and its
components, see the System Fault Management Administrator's Guide
(http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02677231/c02677231.pdf?HPBCMETA::).
Errors and error logs
Event log definitions
Often the underlying root cause of a Machine Check Abort (MCA) event is captured by the server
blade or firmware in both the System Event Log (SEL) and the Forward Progress Log (FPL). These
errors are easily matched with MCA events by timestamps. For example, the loss of a processor
VRM might cause a processor fault. Decoding the MCA error logs alone would identify only the
failed processor as the most likely faulty FRU, whereas the SEL and FPL entries with matching
timestamps can point to the VRM as the underlying cause. Following are some important points to
remember about events and event logs:
• Event logs are the equivalent of the old chassis logs for status or error information output.
• Symbolic names are used in the source code; for example, MC_CACHE_CHECK.
• The hex code for each event log is 128 bits long with an architected format:
  ◦ Some enumerated fields can be mapped to defined text strings.
  ◦ All can be displayed in hex, keyword, or text mode.
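The timestamp matching between MCA events and SEL or FPL entries described in this section can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `correlate` function, the dictionary layout of a log entry, the five-second window, and the sample data are hypothetical and do not reflect the actual SEL or FPL record format or any HP tool.

```python
from datetime import datetime, timedelta

def correlate(mca_time, log_entries, window_seconds=5):
    """Return the log entries whose timestamp lies within
    window_seconds of the MCA event time (hypothetical helper)."""
    window = timedelta(seconds=window_seconds)
    return [e for e in log_entries if abs(e["time"] - mca_time) <= window]

# Hand-written sample data for illustration only; not real SEL or FPL output.
sel_entries = [
    {"time": datetime(2011, 3, 1, 10, 15, 2), "source": "SEL",
     "text": "processor VRM fault"},
    {"time": datetime(2011, 3, 1, 9, 0, 0), "source": "SEL",
     "text": "boot complete"},
]
mca_time = datetime(2011, 3, 1, 10, 15, 4)

# Entries close in time to the MCA event are candidates for the root cause.
matches = correlate(mca_time, sel_entries)
for entry in matches:
    print(entry["source"], entry["time"].isoformat(), entry["text"])
```

With this sample data, only the VRM entry falls inside the window, which mirrors the example above in which a VRM loss is the root cause behind a processor fault. The window size is a tuning choice; a wider window catches more candidates at the cost of more unrelated entries.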