G-6
Recovering from Drive Failure
Writer: Pamela King Project: SMART-2DH Array Controller Reference Guide Comments: 295469-002
File Name: N-APPG.DOC Last Saved On: 2/27/98 12:09 PM
Automatic Data Recovery Failure
During Automatic Data Recovery, if the online LED of the replacement drive
stops blinking and all other drives in the array are still online, the Automatic
Data Recovery process may have terminated abnormally because of a
non-correctable read error on another physical drive.
The background Auto-Reliability Monitoring process is meant to help prevent
this problem, but it cannot do anything about certain issues such as SCSI bus
signal integrity problems. Reboot the system; a POST message should
confirm the diagnosis. Retrying Automatic Data Recovery may help.
If it does not, back up all data on the system, run a surface analysis (using
User Diagnostics), and then restore the data.
During Automatic Data Recovery, if the online LED of the replacement
drive stops blinking and the replacement drive has failed (amber failure LED
illuminated or other LEDs go out), the replacement drive is producing
unrecoverable disk errors. In this case, remove the replacement drive and
install a different one.
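The two cases described in this section amount to a small decision table. The Python sketch below is illustrative only: the class, field, and function names are hypothetical and do not correspond to any Compaq utility or controller interface. It assumes that a blinking online LED means recovery is still running, and it does not try to distinguish a normal completion from an abnormal termination.

```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    RECOVERY_IN_PROGRESS = "online LED still blinking; recovery appears to be running"
    POSSIBLE_READ_ERROR = "recovery may have been terminated by a read error on another drive"
    REPLACEMENT_DRIVE_BAD = "replacement drive is producing unrecoverable disk errors"
    NOT_COVERED = "situation not described in this section"


@dataclass
class RebuildObservation:
    # Hypothetical field names for what an operator observes on the drive LEDs.
    replacement_online_led_blinking: bool
    replacement_drive_failed: bool  # amber failure LED lit, or other LEDs out
    other_drives_online: bool


def diagnose(obs: RebuildObservation) -> Diagnosis:
    """Map LED observations to the cases described in this section."""
    if obs.replacement_online_led_blinking:
        # Assumption: blinking indicates recovery is still under way.
        return Diagnosis.RECOVERY_IN_PROGRESS
    if obs.replacement_drive_failed:
        # Remove the replacement drive and install a different one.
        return Diagnosis.REPLACEMENT_DRIVE_BAD
    if obs.other_drives_online:
        # Reboot and check for a POST message to confirm, then retry recovery.
        return Diagnosis.POSSIBLE_READ_ERROR
    return Diagnosis.NOT_COVERED
```

For example, `diagnose(RebuildObservation(False, True, True))` returns `Diagnosis.REPLACEMENT_DRIVE_BAD`, which corresponds to replacing the replacement drive.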
Compromised Fault Tolerance
If fault tolerance is ever compromised due to failure of multiple drives, the
condition of the logical drive will be “failed” and “unrecoverable” errors will
be returned to the host. Data loss is probable. Insertion of replacement drives
at this time will not improve the condition of the logical drive. If this occurs,
first try turning the entire system off and then on again. In some cases, an intermittent
drive will appear to work again (perhaps long enough to make copies of
important files) after cycling power. If a 1779 POST message is displayed,
press F2 to reenable the logical drive(s). Remember that data loss has likely
occurred and any data on the logical drive is suspect.
Fault tolerance may be compromised due to non-drive problems such as a
faulty cable, faulty storage system power supply, or a user accidentally
turning off an external storage system while the host system power was on. In
such cases, the physical drives do not need to be replaced. However, data loss
can still occur in this situation, especially if the system was busy at the time
the problem developed.
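The guidance on compromised fault tolerance can be summarized as a short checklist. The following Python sketch is a hypothetical aid for an operator runbook, not part of any Compaq tool; the function name and parameters are assumptions made for illustration, and the step wording paraphrases this section.

```python
from typing import List


def recovery_checklist(post_1779_displayed: bool, non_drive_cause: bool) -> List[str]:
    """Return the steps this section suggests for a failed logical drive.

    Both parameters are hypothetical inputs supplied by the operator:
    - post_1779_displayed: a 1779 POST message appeared after power cycling
    - non_drive_cause: the fault was a cable, power supply, or an external
      storage system switched off while the host was on
    """
    steps = ["Turn the entire system off and then on again."]
    if post_1779_displayed:
        steps.append("Press F2 at the 1779 POST message to re-enable the logical drive(s).")
    # Data loss has likely occurred regardless of the cause.
    steps.append("Treat all data on the logical drive as suspect; copy important files while the drive works.")
    if non_drive_cause:
        steps.append("Physical drives do not need to be replaced.")
    return steps
```

Calling `recovery_checklist(True, False)` yields three steps: power cycle, press F2, and treat the data as suspect.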