
This article was translated from Chinese by ChatGPT 5.

Introduction to ZFS#

Those familiar with my community know that my previous blog site disappeared because the server was migrated to ZFS.

I recall that the old blog never explained what ZFS actually is, so today I'll take a closer look at what it is and why I chose it for my main server.

Structure#

ZFS was originally developed by Sun Microsystems for Solaris as a file system and volume manager. It is now maintained as the OpenZFS project and continues to evolve on platforms like Linux.

Traditionally, we can think of a file system as a librarian who knows where each book is on the shelves but knows nothing about the contents of the books or the structure of the shelves themselves.

If the pages stick together (data corruption) or the shelf itself collapses (hardware failure), this librarian may be powerless.

However, ZFS (Zettabyte File System) completely changes this model. It is more like a combination of architect, librarian, and bookbinder—an all-in-one expert.

ZFS not only knows the location of the data but also understands the structural layout of the underlying physical storage (disks) and continuously checks the integrity of every page of data (data blocks).

Unlike the traditional storage stack, where RAID, volume management, and file systems are layered separately, ZFS integrates RAID controller, volume manager, and file system into one, providing a unified storage pool to manage data.

This integration means the operating system directly interfaces with the ZFS storage pool, avoiding inefficiencies and risks caused by information isolation between layers in traditional stacks.

The storage architecture of ZFS can be seen as a three-layer pyramid:

  1. Physical Disks: The lowest layer, i.e., actual hardware in the server, such as HDDs (hard disk drives) or SSDs (solid-state drives).
  2. vdev (Virtual Device): The basic building block of ZFS, composed of one or more physical disks. It defines how data and redundancy are organized. A vdev can be a single disk, a mirror group, or a RAID-Z array.
  3. zpool (Storage Pool): The highest-level storage container, composed of one or more vdevs. Once a zpool is created, its total storage space can be shared across all file systems (called datasets in ZFS) without needing predefined partitions.
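
To make the three layers concrete, here is a minimal sketch (in Python, driving the OpenZFS command-line tools via subprocess) that builds a pool named tank from two mirror vdevs. The pool name and the /dev/sdX device paths are placeholders for illustration only; running this requires root, an installed OpenZFS, and four spare disks, since creating a pool wipes them.

```python
# A minimal sketch of how the three layers map onto real commands, using
# Python's subprocess module to call the OpenZFS CLI. Pool name "tank" and
# the /dev/sdX device paths are placeholders -- adjust to your hardware.
# Requires root and an installed OpenZFS; creating a pool erases the disks.
import subprocess

def run(cmd: list[str]) -> None:
    """Run a command and echo it, stopping on any error."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Layers 1+2: four physical disks grouped into two mirror vdevs.
# Layer 3: both vdevs combined into a single pool called "tank".
run(["zpool", "create", "tank",
     "mirror", "/dev/sda", "/dev/sdb",
     "mirror", "/dev/sdc", "/dev/sdd"])

# Inspect the resulting pool / vdev / disk hierarchy.
run(["zpool", "status", "tank"])
```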

Safety#

All redundancy and fault tolerance are implemented at the vdev level. The chosen vdev type directly determines the storage pool’s performance, usable capacity, and data safety.

  • Stripe/Single-disk: The simplest vdev, which can be a single disk or multiple disks forming a non-redundant stripe set (similar to RAID 0). Maximizes performance and capacity but has no fault tolerance: the failure of any single disk destroys the vdev, and losing any vdev means losing the entire pool.
  • Mirror: Similar to RAID 1. Each disk in the vdev stores identical copies of the data. A two-way mirror can survive one disk failure; as long as one disk remains healthy, no data is lost. ZFS mirrors are more flexible than traditional RAID 1, supporting three-way or higher mirrors for greater redundancy. Write performance equals that of a single disk, while read performance scales with the number of disks, making it excellent for high-IOPS workloads.
  • RAID-Z: ZFS’s innovative alternative to parity RAID (e.g., RAID 5 and RAID 6). Traditional RAID 5 suffers from the “write hole” issue, where an unexpected outage during write operations may cause data and parity mismatches, potentially leading to irreparable corruption. ZFS avoids this through its Copy-on-Write (CoW) mechanism, ensuring atomic operations by always writing new stripes instead of overwriting old ones. RAID-Z has three levels:
    • RAID-Z1: Similar to RAID 5, with one disk’s worth of parity; survives one disk failure.
    • RAID-Z2: Similar to RAID 6, with two disks’ worth of parity; survives two disk failures.
    • RAID-Z3: Triple parity; survives three disk failures, offering extremely high redundancy.
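
As a rough back-of-the-envelope comparison of these vdev types, the small Python sketch below estimates usable capacity and survivable disk failures for a hypothetical 6 × 4 TB vdev. It ignores metadata overhead, RAID-Z padding, and compression, so treat the numbers as approximations only.

```python
# Back-of-the-envelope usable capacity and fault tolerance per vdev type.
# This ignores metadata overhead, RAID-Z padding/allocation quirks, and
# compression, so the numbers are rough estimates, not exact figures.

def vdev_estimate(kind: str, disks: int, disk_tb: float) -> tuple[float, int]:
    """Return (approx. usable TB, disk failures survivable) for one vdev."""
    if kind == "stripe":
        return disks * disk_tb, 0
    if kind == "mirror":
        return disk_tb, disks - 1          # an n-way mirror keeps one copy usable
    if kind.startswith("raidz"):
        parity = int(kind[-1])             # raidz1 -> 1, raidz2 -> 2, raidz3 -> 3
        return (disks - parity) * disk_tb, parity
    raise ValueError(f"unknown vdev type: {kind}")

for kind in ("stripe", "mirror", "raidz1", "raidz2", "raidz3"):
    usable, failures = vdev_estimate(kind, disks=6, disk_tb=4.0)
    print(f"{kind:7s} 6 x 4 TB -> ~{usable:.0f} TB usable, survives {failures} failure(s)")
```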

Copy-on-Write (CoW)#

In traditional file systems, when a file is modified, the system overwrites the old data blocks directly. If an interruption (e.g., power loss) occurs during the write, old data may be partially overwritten and new data incomplete, leading to corruption.

ZFS works differently. It never overwrites live data. Instead, when modifying a block:

  1. The new data is written to a new, unused location on disk.
  2. The parent metadata pointer is updated to reference the new block.
  3. This pointer update is atomic—either it succeeds completely or doesn’t happen at all.

This guarantees that the file system is always consistent. If power is lost during the process, the old data and pointers remain intact. Upon reboot, ZFS discards incomplete transactions, rolling back to the last consistent state. Thus, ZFS doesn’t need fsck (file system check) tools.
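
The same idea can be illustrated in userspace. The sketch below is a loose analogy, not how ZFS is implemented internally: it writes the new version to a fresh location first and only then flips the "pointer" with a single atomic rename, so a crash leaves either the old version or the new one, never a torn mix.

```python
# A userspace analogy for copy-on-write, not ZFS internals: new data is
# written to a fresh location first, and only then is the "pointer"
# (here a rename) flipped in one atomic step.
import os
import tempfile

def cow_update(path: str, new_data: bytes) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: write the new block to a new, unused location.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(new_data)
        tmp.flush()
        os.fsync(tmp.fileno())   # make sure the new data is on disk first
    # Steps 2+3: atomically repoint the name at the new data.
    os.replace(tmp_path, path)

cow_update("example.txt", b"second version\n")
```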

Snapshots#

Snapshots are one of the most powerful applications of CoW. A snapshot is a read-only copy of a file system or volume at a specific time.

Creating a snapshot doesn’t duplicate data; it simply freezes the metadata tree. Since CoW ensures old data blocks aren’t overwritten, snapshots only reference them. Snapshots are instant to create and initially take negligible space. Only when data is modified do snapshots begin consuming space, as the old blocks must be retained.

The space consumed by a snapshot equals the total size of the blocks that have since been modified or deleted in the active file system but are still referenced by that snapshot.
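
In day-to-day use this boils down to a couple of commands. The sketch below drives the OpenZFS CLI from Python; the dataset tank/data and the snapshot name are placeholders, and it needs root plus an existing dataset. Keep in mind that a rollback discards everything written after the snapshot.

```python
# A minimal sketch of working with snapshots through the OpenZFS CLI.
# "tank/data" and the snapshot name are placeholders; requires root and
# an existing dataset.
import subprocess

def zfs(*args: str) -> None:
    print("$ zfs", " ".join(args))
    subprocess.run(["zfs", *args], check=True)

zfs("snapshot", "tank/data@before-upgrade")       # instant, initially ~zero space
zfs("list", "-t", "snapshot", "-r", "tank/data")  # show snapshots and space used
# ...later, if something went wrong:
zfs("rollback", "tank/data@before-upgrade")       # discards changes made after the snapshot
```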

Never Trust#

The ZFS philosophy is “never trust hardware.” It assumes hardware (memory, cables, controllers, disks) may fail anytime, and thus implements end-to-end, multilayered integrity protection.

End-to-End#

ZFS’s core protection mechanism is end-to-end checksums. Each block has a checksum (default: optimized Fletcher4, optionally SHA-256). On read, ZFS recalculates and verifies it. If mismatched, data corruption is detected.

These checksums are stored in the parent block pointers, forming a Merkle tree (hash tree). This creates a “chain of trust” from the data blocks all the way up to the root “uberblock.” Unlike traditional file systems that may store checksums alongside the data (where both can be corrupted together), ZFS stores them separately, making corruption nearly impossible to hide.
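
The following toy Python example captures the Merkle-tree intuition: each parent keeps the checksums of its children, so the root hash vouches for every block below it. Real ZFS block pointers also carry addresses, sizes, and other metadata, which this omits.

```python
# A toy illustration of the Merkle-tree idea: parents store the checksums
# of their children, so the root hash covers every block beneath it.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

blocks = [b"block A", b"block B", b"block C", b"block D"]
leaf_sums = [sha256(b) for b in blocks]          # checksums live apart from the data
root = sha256("".join(leaf_sums).encode())       # the "uberblock" checksum

# Simulate silent corruption of one data block.
blocks[2] = b"block C, flipped bit"
if sha256("".join(sha256(b) for b in blocks).encode()) != root:
    print("corruption detected: recomputed root no longer matches")
```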

Self-Healing#

Beyond error detection, ZFS can self-heal. If corruption is found and redundancy exists (mirror or RAID-Z), ZFS retrieves correct data from the redundant source, transparently returns it to the application, and repairs the bad copy on disk.

For latent risks like bit rot, ZFS provides scrubbing. The zpool scrub process scans all blocks, verifies checksums, and repairs as needed. Regular scrubbing (e.g., monthly) is critical. However, pools without redundancy can only detect, not repair, corruption.
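
Conceptually, the self-healing read path looks something like the sketch below: try a copy, verify it against the checksum stored in the parent, and if verification fails, serve and repair from a surviving copy. This is only an illustration of the idea; the real work happens inside the pool layer, and the periodic full pass is simply `zpool scrub <pool>`, typically scheduled via cron or a systemd timer.

```python
# A conceptual sketch of self-healing on a two-way mirror: read one copy,
# verify it against the stored checksum, and if it fails, serve and
# repair from the other copy. Real ZFS does this inside the pool layer.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

good = b"important data"
stored_checksum = sha256(good)
mirror = {"disk0": b"important dXta", "disk1": b"important data"}  # disk0 has bit rot

def read_with_heal(block_checksum: str) -> bytes:
    for name, data in mirror.items():
        if sha256(data) == block_checksum:
            # Heal any copies that failed verification.
            for other in mirror:
                if sha256(mirror[other]) != block_checksum:
                    mirror[other] = data
                    print(f"repaired {other} from {name}")
            return data
    raise IOError("all copies corrupt; without redundancy ZFS can only report this")

print(read_with_heal(stored_checksum))
```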

Performance#

ZFS integrates sophisticated caching and logging to optimize workloads.

Reading#

Reads rely heavily on the Adaptive Replacement Cache (ARC) in memory. ARC tracks both “recently used” and “frequently used” data, dynamically balancing space for higher hit rates. By default, ZFS uses up to 50% of system RAM for ARC. Since memory greatly accelerates storage, increasing RAM is the best performance boost.

L2ARC (Level 2 ARC) adds SSD-based caching for evicted data. However, it requires ARC memory for indexing, so excessively large L2ARC can hurt performance. Rule of thumb: increase RAM first; use L2ARC only if memory maxes out and cache hit rates remain low.
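
To build intuition for why the ARC beats a plain LRU, here is a deliberately simplified toy cache that keeps one list for blocks seen once recently and another for blocks reused repeatedly. The real ARC additionally keeps "ghost" lists and adaptively resizes the two halves, which this sketch omits.

```python
# A toy cache illustrating the ARC intuition only: one list for blocks seen
# once recently, one for blocks used more than once. The real ARC also keeps
# "ghost" lists and adaptively resizes the two halves, omitted here.
from collections import OrderedDict

class TinyArcLikeCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.recent: OrderedDict[str, bytes] = OrderedDict()    # seen once
        self.frequent: OrderedDict[str, bytes] = OrderedDict()  # seen repeatedly

    def _evict_if_needed(self) -> None:
        while len(self.recent) + len(self.frequent) > self.capacity:
            victim = self.recent if self.recent else self.frequent
            victim.popitem(last=False)                           # drop the oldest entry

    def get(self, key: str) -> bytes | None:
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return self.frequent[key]
        if key in self.recent:
            self.frequent[key] = self.recent.pop(key)            # promote on reuse
            return self.frequent[key]
        return None                                              # miss: would read from disk

    def put(self, key: str, value: bytes) -> None:
        target = self.frequent if key in self.frequent else self.recent
        target[key] = value
        target.move_to_end(key)
        self._evict_if_needed()

cache = TinyArcLikeCache(capacity=2)
cache.put("a", b"1")
cache.put("b", b"2")
cache.get("a")                 # "a" is promoted to the frequent list
cache.put("c", b"3")           # evicts "b" (oldest single-use block), keeps "a"
print(cache.get("a"), cache.get("b"))
```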

Writing#

ZFS distinguishes between asynchronous and synchronous writes:

  • Asynchronous Writes: Most operations (like file copies). Data goes to ARC, and the app is immediately told “done.” Later, ZFS flushes batches to disk in transaction groups. High performance but risks losing recent data in power failure.
  • Synchronous Writes: Required by databases, NFS, etc. Data must persist before acknowledgment. ZFS uses the ZFS Intent Log (ZIL) to record synchronous writes.
  • SLOG (Separate Log device): By default, ZIL lives on pool disks. Heavy synchronous workloads may suffer since each write hits slow disks. A dedicated SLOG device (low-latency SSD/NVMe/Optane) dramatically improves performance by storing ZIL there.

Important: SLOG only benefits synchronous writes. For typical home or media workloads (asynchronous), it’s useless and a waste.
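
The gap that the ZIL and SLOG exist to absorb can be felt with a generic (not ZFS-specific) Python experiment: an asynchronous-style write returns as soon as the data is buffered, while a synchronous-style write must wait for fsync(), i.e. for the data to be durable, before acknowledging.

```python
# A general illustration (not ZFS-specific) of async vs. sync writes:
# the async path returns once the data is in a buffer, while the sync
# path waits for fsync() -- durability -- before acknowledging.
import os
import time

def timed_write(path: str, data: bytes, sync: bool) -> float:
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)              # async case: done once the page cache has it
        if sync:
            f.flush()
            os.fsync(f.fileno())   # sync case: wait until it reaches stable storage
    return time.perf_counter() - start

payload = os.urandom(1 << 20)      # 1 MiB of test data
print(f"async-style write: {timed_write('demo.bin', payload, sync=False):.4f}s")
print(f"sync-style  write: {timed_write('demo.bin', payload, sync=True):.4f}s")
```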

Alternatives#

Despite its power, ZFS isn’t the only option. Other open-source and commercial systems exist, each with trade-offs.

| Feature | ZFS | Btrfs | LVM + ext4/XFS |
| --- | --- | --- | --- |
| Architecture | Unified | Unified | Layered |
| Data Integrity | End-to-end checksums | Per-block checksums | None |
| Self-Healing | Yes (with redundancy) | Yes (with redundancy) | No |
| RAID Stability | Excellent (RAID-Z1/2/3) | Good for mirror/RAID10; RAID5/6 unstable | Excellent (via mdadm) |
| Snapshots | Efficient | Efficient | Inefficient |
| Pool Expansion | Rigid (add vdevs, not single disks) | Flexible | Flexible |
| Resource Usage | High | Medium-low | Low |
| Kernel Integration | Out-of-tree | Native | Native |

(Table by Gemini 2.5 Pro)

Or,

| Feature | ZFS (OpenZFS) | Btrfs | XFS | Ext4 + LVM |
| --- | --- | --- | --- | --- |
| Snapshots | Native, efficient | Native, subvolume-based | Not supported (needs LVM) | Not supported (needs LVM) |
| Checksums | End-to-end for data & metadata | Data/metadata | Metadata only | Metadata only |
| Built-in RAID | Yes (RAID-Z1/2/3, mirrors) | Yes (0/1/5/6/10) | No | No |
| Compression | Yes (LZ4, Zstd, etc.) | Yes | No | No |
| Deduplication | Yes (real-time, memory-intensive) | Yes (offline tools) | No | No |
| Volume Management | Integrated | Partial | None | Via LVM |
| Memory Needs | High (≥8GB recommended) | Moderate | Low | Low |
| Performance | Strong overall; small random writes slower | Good, but suffers under fragmentation | Excellent for large/concurrent IO | Excellent general-purpose |
| Data Integrity | Very high | High | Moderate | Moderate |
| Maturity | Very mature, enterprise-proven | Fairly mature | Very mature | Very mature |
| Operational Complexity | Higher (learning curve) | Medium | Low | Medium |

(Table by ChatGPT 5)

Commercial closed-source equivalents include:

  • Microsoft Storage Spaces
  • Veritas VxVM
  • Oracle ASM
  • NetApp ONTAP
  • Dell PowerScale (formerly Isilon OneFS)
  • Veritas InfoScale

Conclusion#

Since my main server has 96GB of ECC RAM, I ultimately chose the ZFS solution included in Proxmox VE.

After one surreal experience of accidentally deleting files and then restoring them from a snapshot, ZFS became my favorite. So, if you also have plenty of memory, I highly recommend trying ZFS.

Since I usually write novels, some paragraphs may feel oddly split; please bear with me. With this, we've wrapped up an in-depth yet beginner-friendly introduction to ZFS. Wish you all happy tinkering, and see you in the next article 👋

Introduction to ZFS
https://baidu.blog.icechui.com/blog/p/zfsintro
Author baidu0com
Published at August 18, 2025