27. Addendum: About Storage Problems and Solutions
RAID, backup and ECC exist to reduce the probability of data loss. While the technical basics are clear, one
still has to estimate the risk of data loss and the relevance of each component involved. This includes
assessing which problems can occur and how often, and evaluating how each risk can be minimized, both on
its own and in relation to the file system.
What are the problems one has to take care of?
Broken Hardware
Broken hard drives, sectors/flash cells, RAM, controllers, cables, PSUs and so on are risks that can be managed
with real-time RAID, checksums, backups and redundant hardware, for example a second PSU.
Statistical Problems and Silent Errors
If one fills a multi-terabyte array only with binary „zeros“ and then reads it back, one may find some binary
„ones“. If one then leaves the array lying around for some years and reads it again, the number of „ones“ will
have increased. This is a massive problem for long-term storage and is called „bit rot“. The same problem of
flipping bits exists in RAM and can lead to program or system crashes and data corruption. In addition, a bad
power supply, or even bad cables or plugs in the backplane, can cause data corruption.
These problems can be addressed with real-time RAID, end-to-end data checksums, scrubs and ECC RAM. For
single-drive backups, the ZFS property „copies“ can be set to „2“, so that every block of data is written a
second time to a different location on the drive.
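The interplay of checksums, a second copy and a scrub can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not how ZFS is implemented: the class `TwoCopyStore`, the constant `BLOCK_SIZE` and the use of SHA-256 are hypothetical choices made for illustration only.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size for this sketch


def checksum(block: bytes) -> str:
    """Return a SHA-256 digest used as the block checksum."""
    return hashlib.sha256(block).hexdigest()


class TwoCopyStore:
    """Toy store keeping two copies of every block plus one checksum."""

    def __init__(self) -> None:
        self.copies = ([], [])  # two independent lists of bytearrays
        self.sums = []          # one checksum per logical block

    def write(self, block: bytes) -> int:
        """Store the block twice, record its checksum, return its index."""
        for copy in self.copies:
            copy.append(bytearray(block))
        self.sums.append(checksum(block))
        return len(self.sums) - 1

    def read(self, index: int) -> bytes:
        """Return the first copy whose checksum matches, else raise."""
        for copy in self.copies:
            data = bytes(copy[index])
            if checksum(data) == self.sums[index]:
                return data
        raise IOError(f"block {index}: all copies are corrupt")

    def scrub(self) -> int:
        """Check every copy, repair bad ones from a good one, return repairs."""
        repaired = 0
        for index in range(len(self.sums)):
            good = self.read(index)
            for copy in self.copies:
                if bytes(copy[index]) != good:
                    copy[index] = bytearray(good)
                    repaired += 1
        return repaired


store = TwoCopyStore()
idx = store.write(bytes(BLOCK_SIZE))         # a block of binary zeros
store.copies[0][idx][100] ^= 0x01            # simulate one flipped bit
assert store.read(idx) == bytes(BLOCK_SIZE)  # read falls back to the good copy
print(store.scrub())                         # number of repaired copies
```

Without the checksum, the flipped bit in the first copy would be returned to the reader unnoticed. Without the second copy, the error could be detected but not repaired.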
Transaction Problems
Disks prefer RAM-cached writes, as these offer much better write performance. Database applications cannot
tolerate this. File or dataset locking, for example for a data warehouse application, is only possible with
secure, uncached write behaviour. When the operating system is signaled that „a“ was written, „a“ really has
to be on stable storage. When „a“ and „b“ are to be written, they have to be written either together or not at
all. A solution for these problems on ZFS is secure sync writes with a ZIL device, as this allows fast write
behaviour through a RAM cache paired with the assurance that a committed write is really on disk.
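The two requirements, "written means on stable storage" and "all or nothing", can be sketched at the application level in Python. This is only an illustration of the idea using standard library calls, not a description of how ZFS sync writes or the ZIL work internally; the function name `durable_atomic_write` is hypothetical.

```python
import os
import tempfile


def durable_atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that readers see all of it or none of it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to flush its cache to storage
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory too, so the rename itself survives a crash.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "record.bin")
    durable_atomic_write(target, b"a" + b"b")  # "a" and "b" land together
    with open(target, "rb") as f:
        assert f.read() == b"ab"
```

Each `fsync` call waits for the storage stack to confirm the write, which is why a fast log device matters for workloads that issue many such writes.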
Time Problems
Accidental deletion, wanting to read older versions of a document, sabotage, Trojans that start to encrypt
data secretly in the background and the like are often only noticed when it is too late and even the backups
are already affected.
These problems can only be solved to a certain degree with simple backups. Ideally, one uses read-only
versioning with many snapshots that hold previous data states, on the main file system or on a second
backup system. ZFS replication is a near-real-time solution that can keep a backup system up to date to
within minutes.
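Why many snapshots help can be shown with a short Python sketch that picks the newest snapshot taken before the damage began. The function `last_clean_snapshot`, the 15-minute schedule and the timestamps are hypothetical and serve only as an illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def last_clean_snapshot(snapshots, damage_time):
    """Return the newest snapshot time strictly before `damage_time`, or None."""
    ordered = sorted(snapshots)
    pos = bisect_left(ordered, damage_time)
    return ordered[pos - 1] if pos > 0 else None


# Hypothetical schedule: one snapshot every 15 minutes for one day.
start = datetime(2024, 1, 1, 0, 0)
snaps = [start + timedelta(minutes=15 * i) for i in range(96)]

# Suppose a Trojan began encrypting files at 10:07.
print(last_clean_snapshot(snaps, datetime(2024, 1, 1, 10, 7)))
# prints 2024-01-01 10:00:00
```

With a single nightly backup, the same incident would cost up to a day of work. With the denser schedule assumed here, at most 15 minutes are lost, provided the damage is noticed before the clean snapshots expire.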
Disasters
Examples are fire, overvoltage, theft or a defective disk array (meaning more disks fail than the RAID level
can handle). Only an external backup helps in those cases, and it should be located at least in a different
fire compartment. For very important data it is recommended to have at least two backup systems, so that if
the first backup system fails, the working second one can take over.