5 Troubleshooting
This chapter provides strategies, procedures, and tools for troubleshooting server error and fault
conditions.
Methodology
General Troubleshooting Methodology
There are multiple entry points to the troubleshooting process, depending on your level of
troubleshooting expertise, the tools, processes, and procedures at your disposal, and
the nature of the system fault or failure.
Typically, you select from a set of symptoms, ranging from very simple (the System LED is blinking)
to the most difficult (for example, a Machine Check Abort (MCA) has occurred). The following is a
list of symptom examples:
• Front Panel LED blinking
• System Alert present on console
• System won’t power-up
• System won’t boot
• Error/Event Message received
• Machine Check Abort (MCA) occurred
Next, you narrow down the observed problem to the specific troubleshooting procedure required.
Here, you isolate the failure to a specific part of the server, so that you can perform more detailed
troubleshooting. For example:
• Problem: Front Panel LED blinking
  NOTE: The Front Panel Health LEDs will be flashing amber with a warning indication, or
  flashing red with a fault indication.
  ◦ System Alert on console?
  ◦ Analyze the alert by using the system event log (SEL) to identify the last error logged
    by the server. Use the iLO 2 MP commands to view the SEL, using either the iLO 2
    MP’s serial text interface, or telnet, SSH, or web GUI on the iLO 2 MP LAN.
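The "identify the last error logged" step can be sketched as follows. This is only an illustration, not the iLO 2 MP log format: the SelEntry record, its field names, and the ERROR_THRESHOLD value are all hypothetical, and it assumes the log entries have already been copied out of the SEL by hand or by some other means.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


# Hypothetical record layout for illustration only; the SEL shown by the
# iLO 2 MP has its own display format.
@dataclass(frozen=True)
class SelEntry:
    sequence: int   # assumed: increases with each logged event
    severity: int   # assumed: a larger value means a more severe event
    source: str     # assumed: component that reported the event
    message: str


# Assumed cutoff: entries at or above this value are treated as errors.
ERROR_THRESHOLD = 3


def last_error(entries: Iterable[SelEntry],
               threshold: int = ERROR_THRESHOLD) -> Optional[SelEntry]:
    """Return the most recently logged entry at or above the threshold."""
    errors = [e for e in entries if e.severity >= threshold]
    if not errors:
        return None
    # The highest sequence number is taken to be the most recent entry.
    return max(errors, key=lambda e: e.sequence)


if __name__ == "__main__":
    log = [
        SelEntry(1, 1, "firmware", "example progress entry"),
        SelEntry(2, 5, "power", "example fault entry"),
        SelEntry(3, 2, "firmware", "example informational entry"),
    ]
    print(last_error(log))
```

The source field of the returned entry suggests which area of the server to examine next. If no entry meets the threshold, the function returns None, which suggests the alert is not reflected in the copied portion of the log.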
At this point, you will have a good idea about which area of the system requires further analysis.
For example, if the symptom was “system won’t power-up”, the initial troubleshooting procedure
may have indicated a problem with the DC power rail not coming up after the power switch was
turned on.
You have now reached the point where the failed Customer Replaceable Unit (CRU) has been
identified and needs to be replaced. You must now perform the specific removal and replacement
procedure and verification steps. See Chapter 6: “Removing and Replacing Server Components” for
information.
NOTE: If multiple CRUs are identified as part of the solution, a fix cannot be guaranteed unless
all identified failed CRUs are replaced.
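The rule in this note amounts to a simple completeness check, sketched below. The function names and the CRU names in the example are hypothetical and used only for illustration; this is not part of any server tool.

```python
def unreplaced_crus(identified, replaced):
    """Return the identified CRUs that have not yet been replaced, sorted by name."""
    return sorted(set(identified) - set(replaced))


def repair_complete(identified, replaced):
    """True only when every identified CRU has been replaced."""
    return not unreplaced_crus(identified, replaced)


if __name__ == "__main__":
    # Hypothetical CRU names, for illustration only.
    identified = ["power supply 1", "fan 2"]
    replaced = ["power supply 1"]
    print(unreplaced_crus(identified, replaced))
    print(repair_complete(identified, replaced))
```

In this example one identified CRU is still outstanding, so the repair is not treated as complete.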
There may be specific recovery procedures you need to perform to finish the repair. For example,
if the core I/O board FRU is replaced, you will need to restore customer-specific information.