Chapter 5: Troubleshooting
Fault Isolation Methodology
QX and QXS Setup Guide
monitoring/management is often done at a management console using storage management interfaces
rather than relying on line-of-sight to LEDs of racked hardware components.
Performing Basic Steps
You can use any of the available options described above to perform the basic steps of the fault
isolation methodology.
Gather Fault Information
When a fault occurs, it is important to gather as much information as possible. Doing so will help you
determine the correct action needed to remedy the fault.
Begin by reviewing the reported fault:
• Is the fault related to an internal data path or an external data path?
• Is the fault related to a hardware component such as a drive module, controller module, or power
  supply?
By isolating the fault to one of the components within the storage system, you will be able to determine the
necessary corrective action more quickly.
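The review questions above can be captured in a simple note-taking record so that nothing is missed while gathering information. The following is a minimal sketch only: the names FaultReport, DataPath, and Component are hypothetical and are not part of the storage system or its management interfaces.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DataPath(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"
    UNKNOWN = "unknown"


class Component(Enum):
    # Component types named in the review questions above.
    DRIVE_MODULE = "drive module"
    CONTROLLER_MODULE = "controller module"
    POWER_SUPPLY = "power supply"
    UNKNOWN = "unknown"


@dataclass
class FaultReport:
    """Hypothetical record for fault notes; not a product API."""
    description: str
    data_path: DataPath = DataPath.UNKNOWN
    component: Component = Component.UNKNOWN
    observations: List[str] = field(default_factory=list)

    def open_questions(self) -> List[str]:
        """Return the review questions that are still unanswered."""
        questions = []
        if self.data_path is DataPath.UNKNOWN:
            questions.append("Internal or external data path?")
        if self.component is Component.UNKNOWN:
            questions.append("Which hardware component, if any?")
        return questions


if __name__ == "__main__":
    report = FaultReport("Host lost access to a volume")
    report.data_path = DataPath.EXTERNAL
    # Lists the questions that still need an answer.
    print(report.open_questions())
```

A record like this is only a checklist aid. The corrective action itself still comes from the chassis LEDs, the management interfaces, and the event logs.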
Determine Where the Fault is Occurring
Once you have an understanding of the reported fault, review the chassis LEDs. The chassis LEDs are
designed to immediately alert users to any system faults, and might be what alerted the user to a fault in the
first place.
When a fault occurs, the Fault ID Status LED on the chassis right ear illuminates (see the diagram pertaining
to your product’s front panel components). Check the LEDs on the rear of the chassis to narrow the fault to a
FRU, connection, or both. The LEDs also help you identify the location of a FRU reporting a fault.
Use the Disk Storage Management Utility to verify any faults found while viewing the LEDs. The Disk
Storage Management Utility is also useful in determining where the fault is occurring if the LEDs cannot be
viewed due to the location of the system. It provides you with a visual representation of the system and
where the fault is occurring. It can also provide more detailed information about FRUs, data, and faults.
Review the Event Logs
The event logs record all system events. Each event has a numeric code that identifies the type of event that
occurred, and has one of the following severities:
• Critical. A failure occurred that may cause a controller to shut down. Correct the problem
  immediately.
• Error. A failure occurred that may affect data integrity or system stability. Correct the problem as soon as
  possible.
• Warning. A problem occurred that may affect system stability, but not data integrity. Evaluate the
  problem and correct it if necessary.