92 | GigaStor™ (pub. 25.Apr.2014)
A disk drive has a finite life span, but as with anything electro-mechanical, that life span is impossible to
predict. A Mean Time Between Failures (MTBF) number is supplied with the drive. For one of the drives we use
(Seagate Barracuda 1T), the MTBF is 750,000 hours, or about 85 years. Studies done by Carnegie Mellon, Google,
Inc., and others suggest similar results. The bottom line is that the drives in your GigaStor probably will not
last 85 years.
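To make the arithmetic behind the 85-year figure concrete, here is a minimal Python sketch. It converts an MTBF in hours to years and, assuming a constant failure rate (an exponential model, which is a simplification and not something the drive vendor guarantees), estimates the chance that a single drive fails within one year. The function names are illustrative only.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap days ignored


def mtbf_years(mtbf_hours: float) -> float:
    """Express an MTBF given in hours as years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year.

    Assumption: failures follow an exponential distribution
    (constant failure rate) with the given MTBF.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


print(round(mtbf_years(750_000), 1))              # 85.6
print(f"{annualized_failure_rate(750_000):.2%}")  # 1.16%
```

Read this way, the MTBF is a statistical rate across a large population of drives, not a promise about how long any one drive will run.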
There are two types of drive failure:
Soft error: an error sent by the drive to the RAID controller that is serious enough for the drive to be removed
from the array.
Hard error: a physical drive failure (the drive is completely inoperable).
We recommend configuring the e-mail notification option in the RAID controller setup. This will notify the
GigaStor administrator (and up to four others) of the drive status.
For soft errors, reinserting the drive brings it back into the array most of the time. The error generally
means that the drive has mapped some bad sectors, which is fairly common. If this happens, we recommend
replacing the drive from inventory and shipping the problem drive back for testing and replacement per your
hardware maintenance arrangements. In many cases, however, the drive can run for years after being reinserted
into the system; unfortunately, there is no way to know this ahead of time. If you choose to reinsert the drive,
please log the failure with Support so that the drive can be replaced if it fails again.
If a hard error is detected, the drive does not operate and requires replacement. It is critical to have at least one
new spare drive on hand, and preferably more than one. In general, we recommend keeping one replacement drive
for every eight hard drives in your GigaStor probe.
Unit                              Recommended spare drives    Hours to rebuild array
GigaStor 4T (8 drives)            1
GigaStor 8T or 12T (8 drives)     2                           4
GigaStor 16T (16 drives)          3                           4
GigaStor 32T (32 drives)          4                           8
GigaStor 48T (48 drives)          5                           12
RAID5 tolerates one failed drive and can operate in a degraded state. When the failed drive is replaced, the array
starts to rebuild. How long the rebuild takes depends on the array size and on whether the array must also write
new data at the same time. If a second drive fails during the rebuild, the array is broken and must be recreated.
Your packet captures remain available on a GigaStor running in degraded mode, but they are lost if two drives
are bad at the same time (including when a second drive fails while the first drive is rebuilding). From the time
the first bad drive is swapped until the rebuild completes, your captured data is at risk.
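As a rough way to size that risk, the following Python sketch estimates the chance that one of the surviving drives fails during the rebuild window. It is a sketch under stated assumptions: drives fail independently at a constant rate, the 750,000-hour MTBF quoted for the Seagate Barracuda 1T applies to every drive, and the rebuild hours are the ones listed in the table for each unit. The function name is hypothetical. If failures within one array are correlated, the real risk is higher than this model shows, so treat the result as an order-of-magnitude estimate only.

```python
import math


def second_failure_probability(total_drives: int,
                               rebuild_hours: float,
                               mtbf_hours: float = 750_000) -> float:
    """Chance that any surviving drive fails before the rebuild finishes.

    Assumptions (not vendor data): independent drives, each with a
    constant failure rate of 1 / mtbf_hours.
    """
    surviving = total_drives - 1  # one drive has already failed
    return 1.0 - math.exp(-surviving * rebuild_hours / mtbf_hours)


# (drive count, rebuild hours) pairs taken from the table above
for drives, hours in [(8, 4), (16, 4), (32, 8), (48, 12)]:
    p = second_failure_probability(drives, hours)
    print(f"{drives} drives, {hours} h rebuild: {p:.6f}")
```

The estimate grows with both the number of drives and the rebuild time, which is why the larger units carry more exposure during a rebuild.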
You can set aside one of the drives in the array as a hot spare; on a 16-drive unit, for example, this leaves 15
drives available for capture. If a drive fails, the controller automatically notifies you and brings the hot spare
into the array. You lose some storage overall, because the spare drive is not available for capture, but the
drive swap is handled automatically.
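The storage trade-off can be worked through with a small Python sketch. It assumes a single RAID5 array spanning every non-spare drive and 1 TB drives; both are illustrative assumptions, not GigaStor specifications. RAID5 spends one drive's worth of capacity on parity, so the usable data capacity is one drive less than the number of array members.

```python
def raid5_layout(total_drives: int, drive_tb: float, hot_spare: bool) -> dict:
    """Array membership and usable capacity for one RAID5 array.

    Assumption: all drives except an optional hot spare belong to a
    single RAID5 array, which uses one drive's capacity for parity.
    """
    members = total_drives - (1 if hot_spare else 0)
    if members < 3:
        raise ValueError("RAID5 needs at least three member drives")
    return {"array_members": members, "usable_tb": (members - 1) * drive_tb}


print(raid5_layout(16, 1.0, hot_spare=False))  # {'array_members': 16, 'usable_tb': 15.0}
print(raid5_layout(16, 1.0, hot_spare=True))   # {'array_members': 15, 'usable_tb': 14.0}
```

Under these assumptions, the hot spare costs one drive's worth of capture space in exchange for an automatic start to the rebuild.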
The optimal redundancy is two identical GigaStor probes capturing the same set of traffic. If this is not
practical, the next best option might be a smaller GigaStor capturing the same set of traffic. Then, if the first
GigaStor has a drive failure followed quickly by another, your packet buffers would still be instantly available.
Use filters to limit the captures to only the most critical traffic, which extends the troubleshooting time
available. The point is simply to have a backup plan that addresses even this unlikely drive-failure scenario.