Involved System Components

RAID

RAID exists mainly so that one or several disks can fail without data loss. It may include
self-healing features, meaning that corrupted or changed files can be repaired or restored on access. In contrast
to RAID-like backup solutions (such as SnapRAID and Unraid), RAID protects the data in real time because it
distributes, stripes or mirrors data blocks across different drives on every write.
In addition, the sequential performance of a real-time RAID scales with the number of drives, and the IOPS scale
with the number of RAID subsystems (for example, doubled performance on a RAID50 or RAID60). This means
that RAID improves not only availability but also performance.
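
The scaling rule above can be written down as a back-of-the-envelope estimate. The following is only a rough sketch: the function name, the per-disk figures (150 MB/s, 100 IOPS) and the linear model are assumptions for illustration, and real arrays rarely scale perfectly.

```python
def estimate_pool(data_disks_per_group, groups, disk_mb_s, disk_iops):
    """Idealized estimate, not a benchmark.

    Sequential throughput is assumed to scale with the total number of
    data disks, random IOPS with the number of RAID groups (subsystems).
    """
    sequential_mb_s = data_disks_per_group * groups * disk_mb_s
    random_iops = groups * disk_iops
    return sequential_mb_s, random_iops


# One RAID5-like group with 4 data disks (hypothetical disk figures).
single_group = estimate_pool(4, 1, disk_mb_s=150, disk_iops=100)

# Two such groups striped together, as in a RAID50.
two_groups = estimate_pool(4, 2, disk_mb_s=150, disk_iops=100)

print(single_group, two_groups)
```

In this model the second group doubles both numbers, which is the "doubled performance" case mentioned for RAID50 and RAID60.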

One major problem remains with RAID: if more disks fail than the selected RAID level allows, the
whole array is lost. With ZFS RAID-Z3, up to three disks per vdev can fail without data loss.
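
The failure rule can be expressed as a small check. This is a minimal sketch, assuming a pool built only from RAID-Z vdevs; the function names and the input format (a list of level/failed-disk pairs) are made up for illustration.

```python
# Number of disks that may fail per vdev without data loss.
ALLOWED_FAILURES = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def vdev_survives(level, failed_disks):
    """A vdev survives as long as failures do not exceed its parity level."""
    return failed_disks <= ALLOWED_FAILURES[level]


def pool_survives(vdevs):
    """The pool is lost as soon as a single vdev is lost.

    vdevs is a list of (level, failed_disks) tuples.
    """
    return all(vdev_survives(level, failed) for level, failed in vdevs)


# Three failed disks in one RAID-Z3 vdev are still tolerated ...
print(pool_survives([("raidz3", 3), ("raidz3", 0)]))
# ... a fourth failure in the same vdev is not.
print(pool_survives([("raidz3", 4), ("raidz3", 0)]))
```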

Additionally, after a power loss or crash, a RAID1/5/6 array might hold corrupted data because the data is written to
the different disks of the array sequentially. This leads to different data on each disk for RAID1, or to only partly
written data stripes on a RAID5 or RAID6. This problem is called the "write hole" and affects RAID5, RAID6,
RAID1 and other arrays. A newer copy-on-write (COW) file system with software RAID, like ZFS, avoids the write
hole problem because a data modification is either completed on all disks or cancelled entirely. Still, an SSD without
power-loss protection might introduce a problem when the firmware of its controller performs
garbage collection in the background. With RAID1, a system needs additional checksums
in order to detect bad data and restore the good version of a data block from the other disk, which may be
uncorrupted.
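
The write hole on a parity RAID can be illustrated with a toy stripe. The sketch below assumes a simplified RAID5-like layout with two data blocks and one XOR parity block; the block contents and variable names are arbitrary and no real controller is modelled.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks (simple RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# With consistent parity, a lost block can be rebuilt from the other two.
assert xor_blocks(d0, parity) == d1

# Now d0 is overwritten, but the system crashes before the parity block
# is updated. The stripe is only partly written: this is the write hole.
d0_after_crash = b"CCCC"
stale_parity = parity

# Later the disk holding d1 fails and is rebuilt from d0 and the parity.
rebuilt_d1 = xor_blocks(d0_after_crash, stale_parity)

# The rebuild silently yields wrong data for a block that was never written.
print(rebuilt_d1 == d1)
```

Note that nothing in the rebuild step reports an error. Without an additional checksum there is no way to tell that the reconstructed block is wrong.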

Another problem is posed by storage systems with a write cache. On a crash, all data in the cache is lost. The
logical conclusion would be to disable the write cache, but that would result in unbearably slow operations,
as safe sync writes without a RAM write cache can reduce write performance to as little as 10% of the cached value. If you use a
write cache, the last few seconds of writes may be lost on a crash. Older file systems may have corruption problems
even without a cache because data may be written to disk while the metadata has not yet been updated.

More recent copy-on-write file systems like ZFS use atomic operations that update data and metadata either
together in one operation or not at all. Such a file system is not left corrupted by a crash.
One problem remains. If you need file locking or transactions for a database, or if you store older file systems
on ZFS, for example on a VM datastore for ESXi, these databases may become inconsistent or the older file systems
corrupted after a crash, even when ZFS itself remains intact. At the application or guest OS level you do not have any
way to control what goes into the cache and what is written to disk directly.
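
The all-or-nothing behaviour of a copy-on-write update can be sketched in a few lines. This is a toy model and not how ZFS is implemented internally: the class `CowStore`, its fields and the `crash_before_commit` flag are hypothetical names used only to show the idea of writing new blocks first and switching a single pointer last.

```python
class CowStore:
    """Toy copy-on-write store: old blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}    # block id -> (data, metadata)
        self.root = None    # points to the currently valid block
        self.next_id = 0

    def write(self, data, metadata, crash_before_commit=False):
        # Step 1: write data and metadata to a new, unused block.
        new_id = self.next_id
        self.next_id += 1
        self.blocks[new_id] = (data, metadata)

        if crash_before_commit:
            # Simulated crash: the pointer still references the old block,
            # so the previous version stays fully valid.
            return

        # Step 2: one atomic pointer switch makes the new version current.
        self.root = new_id

    def read(self):
        return self.blocks[self.root]


store = CowStore()
store.write(b"version 1", "meta 1")
store.write(b"version 2", "meta 2", crash_before_commit=True)

# After the crash the store still shows the complete old version.
print(store.read())
```

The store itself stays consistent, but an application that needed both writes to succeed together still sees only the first one. That is the remaining problem for databases and guest file systems described above.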

There is a solution, though. One can use a controller with cache and BBU for hardware RAID on older file
systems. When the operating system receives a confirmation for a write action, the data really is on stable disk
or, in case of a crash, will be written after the next system boot. All in all, however, a RAID controller cache is quite small and slow
compared to system RAM, and the write hole problem remains, so this is not perfect.

In the case of ZFS, the problem is solved by logging the cached data that has already been confirmed
as “written to disk” to the OS. ZFS always uses a RAM-based write cache that can buffer several seconds of
data in order to write multiple small random write operations as a single large and fast sequential one. The
cached data can additionally be logged to a device called ZIL to guarantee secure write behaviour.
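
The interplay of RAM write cache and log can be sketched as follows. This is a strongly simplified model under stated assumptions, not the ZFS implementation: the class `SyncWriteStore` and its method names are invented for illustration, and Python dictionaries and lists stand in for the pool, the RAM cache and the log device.

```python
class SyncWriteStore:
    """Toy model of a RAM write cache protected by an intent log."""

    def __init__(self):
        self.pool = {}          # data safely on the pool (survives a crash)
        self.ram_cache = {}     # pending writes in RAM (lost on a crash)
        self.intent_log = []    # log on stable storage (survives a crash)

    def sync_write(self, key, value):
        # Log first, then cache, then confirm the write to the caller.
        self.intent_log.append((key, value))
        self.ram_cache[key] = value
        return "committed"

    def flush(self):
        # Collected small writes go to the pool in one larger operation.
        self.pool.update(self.ram_cache)
        self.ram_cache.clear()
        self.intent_log.clear()   # logged entries are no longer needed

    def crash_and_reboot(self):
        self.ram_cache.clear()    # RAM content is gone
        for key, value in self.intent_log:
            self.pool[key] = value   # replay confirmed writes from the log
        self.intent_log.clear()


store = SyncWriteStore()
store.sync_write("a", 1)
store.flush()
store.sync_write("b", 2)     # confirmed, but not yet flushed to the pool
store.crash_and_reboot()

# The confirmed write "b" was recovered from the log after the crash.
print(store.pool)
```

In normal operation the log is only written and never read. It is needed only after a crash, to replay writes that were confirmed but not yet on the pool.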

For performance reasons, one may choose not to use the pool itself as ZIL but rather a separate log device called Slog, which
is optimized for this kind of operation. The following hardware might be used for this kind of device: an 8GB
ZeusRAM, an NVMe device like the Intel P-Series (750, 3600, 3700) or an Intel S3700/S3710 SATA SSD. These offer power-loss
protection, ultra-low latency and high, constant write IOPS at a queue depth (QD) of 1.

Of course, none of these techniques helps when a user program such as Word is writing a file to disk at the moment of a
crash. The unsaved or partly written data will be lost. Only the user program can prevent that kind of failure by creating temp files.
Versioning and snapshots can help to recover older versions of a file.
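
How a user program can protect itself with a temp file is sketched below. This is one common pattern, not a description of what Word or any specific program does, and the function name `save_atomically` is made up for this example. The new content is written to a temporary file in the same directory and then renamed over the original in a single step.

```python
import os
import tempfile


def save_atomically(path, data):
    """Write data (bytes) to path so that a crash leaves old or new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same file system as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # ask the OS to put the data on disk
        # The rename is a single step: readers see the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as workdir:
    document = os.path.join(workdir, "document.txt")
    save_atomically(document, b"first draft")
    save_atomically(document, b"second draft")
    with open(document, "rb") as handle:
        print(handle.read())
```

If the program crashes while writing the temp file, the original document is untouched. Only changes made since the last successful save are lost.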