Fault management overview
The goal of fault management and monitoring is to increase system availability, by moving from
a reactive fault detection, diagnosis, and repair strategy to a proactive fault detection, diagnosis,
and repair strategy. The objectives are as follows:
• To detect problems automatically, as close as possible to the time they actually occur.
• To diagnose problems automatically, at the time of detection.
• To automatically report, in understandable text, a description of the problem, its likely
causes, the recommended actions to resolve it, and detailed information about the problem.
• To ensure that tools are available to repair or recover from the fault.
HP-UX fault management
Proactive fault prediction and notification is provided on HP-UX by SysFaultMgmt WBEM indication
providers. WBEM provides frameworks for monitoring and reporting events.
SysFaultMgmt WBEM indication providers allow users to monitor the operation of a wide variety
of hardware products, and alert them immediately if any failure or other unusual event occurs. By
using hardware event monitoring, users can virtually eliminate undetected hardware failures that
could interrupt system operation or cause data loss.
WBEM indication providers
Hardware monitors, distributed free on the OE media, are available for the following
components:
• Server/fans/environment
• CPU monitor
• UPS monitor*
• FC hub monitor*
• FC switch monitor*
• Memory monitor
• Core electronics components
• Disk drives
• Ha_disk_array
NOTE: No SysFaultMgmt WBEM indication provider is currently available for components
marked with an asterisk (*).
Errors and reading error logs
Event log definitions
Often the underlying root cause of an MCA event is captured by system or BMC firmware in both
the System Event Log (SEL) and the Forward Progress Event Log (FP). These entries are
easily matched with MCA events by their timestamps. For example, the loss of a CPU VRM might
cause a CPU fault. Decoding the MCA error logs alone would identify only the failed CPU as the
most likely faulty CRU, whereas the SEL and FP entries recorded at about the same time can
reveal the VRM loss as the underlying cause. Following are some important points to remember
about events and event logs:
• Event logs are the equivalent of the old server logs for status or error information output.
• Symbolic names are used in the source code; for example, MC_CACHE_CHECK.
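The timestamp matching described above can be illustrated with a short sketch. It is not taken from any HP tool: the record layout, the sample entries, the correlate function, and the five-second window are all assumptions made for illustration, and real SEL, FP, and MCA log formats differ.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified records; real SEL/FP and MCA log formats differ.
log_events = [
    {"time": datetime(2024, 1, 5, 10, 15, 2), "source": "SEL",
     "text": "CPU 0 VRM power fault"},
    {"time": datetime(2024, 1, 5, 11, 40, 0), "source": "FP",
     "text": "Boot progress"},
]
mca_events = [
    {"time": datetime(2024, 1, 5, 10, 15, 3),
     "text": "MCA: CPU 0 reported as failed"},
]


def correlate(mca_events, log_events, window=timedelta(seconds=5)):
    """Pair each MCA event with the log entries recorded within `window` of it.

    The window size is an assumption; choose it to suit the clock
    resolution of the logs being compared.
    """
    matches = []
    for mca in mca_events:
        nearby = [e for e in log_events
                  if abs(e["time"] - mca["time"]) <= window]
        matches.append((mca, nearby))
    return matches


for mca, nearby in correlate(mca_events, log_events):
    print(mca["text"])
    for entry in nearby:
        print("  possible root cause:", entry["source"], entry["text"])
```

In this made-up data, the MCA entry names only the CPU, while the SEL entry logged one second earlier points to the VRM, which mirrors the CPU VRM example given earlier.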