Troubleshooting and Maintenance Manual
Page 4
Troubleshooting
Observing the Problem
Take a moment and think about what is being seen. Ask the following questions:
• If the SAN has been working, what has changed? Apart from a catastrophic hardware failure,
SANs typically do not just stop working; something has changed. For example, a commonly
overlooked SAN change is a switch firmware upgrade. Even something as seemingly innocuous
as an upgrade to disk drive firmware in a RAID storage system can have unexpected effects on
a SAN.
• What is the observed behavior compared to the expected behavior? For example, a
planned storage failover during system maintenance takes eight minutes to occur, but was
expected to complete in two minutes. Observe behavior on two levels:
    • The overall problem. For example, users experienced an outage of eight minutes.
    • The exact problem. For example, the storage controller is reporting an error.
• Is the expected behavior supported by storage and system providers? For example, is a
manually initiated path failover of two minutes supported by the storage and path management
software providers? A storage provider may support two-minute failovers only when its internal
Redundant Array of Independent Disks (RAID) controller cache is disabled.
• What are the exact symptoms? Make a list. Examples:
    • Mouse pointer stopped moving for 30 seconds immediately after failover was initiated,
    then changed to an hourglass until the failover completed.
    • Path management software reported errors in the system error log 60 seconds after the
    failover was initiated.
    • Adapter FC link to switch dropped and did not recover immediately after failover was
    initiated.
• Is the problem repeatable? If yes, can it be repeated on a non-production test system?
It is often necessary to collect information such as system error logs. Because production
systems are not generally set up to collect this information in normal operation, it is
important to be able to configure a non-production test system to collect data and to recreate
the problem there.
• What do the LEDs on the adapter indicate? Check the adapter and switch LEDs to determine
the status. If the adapter LEDs stop flashing or flash an error code, the adapter may need
to be returned to Emulex for repair. See “LED Reference Information” on page 18.
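The questions above ask for a list of exact symptoms and a comparison of observed and expected behavior. One way to keep such notes consistent is to record each symptom with a timestamp and express it as an offset from the moment the failover was initiated. The following Python sketch is illustrative only and is not part of any Emulex tool: the names `Symptom`, `build_timeline`, and `duration_ratio` are hypothetical, and the date and times are made-up sample values chosen to mirror the examples in the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Symptom:
    """One observed symptom (hypothetical record type for note-taking)."""
    observed_at: datetime
    description: str


def build_timeline(event_start, symptoms):
    """Return (seconds after event_start, description) pairs, earliest first."""
    ordered = sorted(symptoms, key=lambda s: s.observed_at)
    return [
        ((s.observed_at - event_start).total_seconds(), s.description)
        for s in ordered
    ]


def duration_ratio(observed, expected):
    """Return how many times longer the observed duration was than expected."""
    return observed / expected


# Sample values only: an assumed failover start time and the example symptoms.
failover_start = datetime(2024, 1, 1, 2, 0, 0)
symptoms = [
    Symptom(failover_start + timedelta(seconds=60),
            "Path management software reported errors in the system error log"),
    Symptom(failover_start,
            "Mouse pointer stopped moving"),
    Symptom(failover_start,
            "Adapter FC link to switch dropped"),
]

timeline = build_timeline(failover_start, symptoms)
ratio = duration_ratio(timedelta(minutes=8), timedelta(minutes=2))

for offset, description in timeline:
    print(f"+{offset:.0f}s  {description}")
print(f"Failover took {ratio:.1f}x the expected time")
```

Recording offsets rather than wall-clock times makes it easier to compare a production incident with a later recreation on a test system, where the absolute times differ.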
Isolating the Problem
Once the problem is observed, determine where the problem originates. Eliminate problem sources at a
coarse level first: is the problem in the SAN or in the server? Determining this reduces the size and
scope of the troubleshooting effort. If the SAN can be eliminated, only the server and its components
need to be examined. This topic discusses common problems related to the SAN and the server.
SAN Components
SAN-related problems can originate in the switch, the storage, or the cabling.
• Cables
    • Bad cables. Noise generates many cyclic redundancy check (CRC) errors or other invalid
    data on the FC link. This translates into slower performance on disks and failed backups
    on tapes.