
This article was translated from Chinese by ChatGPT 5.

Introduction to ZFS#

Those familiar with my community know that my previous blog site disappeared because the server was migrated to ZFS.

I recall that the old blog never explained what ZFS actually is, so today I'll take a closer look at what it is and why I chose it for my main server.

Structure#

ZFS was originally developed by Sun Microsystems for Solaris as a file system and volume manager. It is now maintained as the OpenZFS project and continues to evolve on platforms like Linux.

Traditionally, we can think of a file system as a librarian who knows where each book is on the shelves but knows nothing about the contents of the books or the structure of the shelves themselves.

If the pages stick together (data corruption) or the shelf itself collapses (hardware failure), this librarian may be powerless.

However, ZFS (Zettabyte File System) completely changes this model. It is more like a combination of architect, librarian, and bookbinder—an all-in-one expert.

ZFS not only knows the location of the data but also understands the structural layout of the underlying physical storage (disks) and continuously checks the integrity of every page of data (data blocks).

Unlike the traditional storage stack, where RAID, volume management, and file systems are layered separately, ZFS integrates RAID controller, volume manager, and file system into one, providing a unified storage pool to manage data.

This integration means the operating system directly interfaces with the ZFS storage pool, avoiding inefficiencies and risks caused by information isolation between layers in traditional stacks.

The storage architecture of ZFS can be seen as a three-layer pyramid:

  1. Physical Disks: The lowest layer, i.e., actual hardware in the server, such as HDDs (hard disk drives) or SSDs (solid-state drives).
  2. vdev (Virtual Device): The basic building block of ZFS, composed of one or more physical disks. It defines how data and redundancy are organized. A vdev can be a single disk, a mirror group, or a RAID-Z array.
  3. zpool (Storage Pool): The highest-level storage container, composed of one or more vdevs. Once a zpool is created, its total storage space can be shared across all file systems (called datasets in ZFS) without needing predefined partitions.
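
To make the three layers concrete, here is a minimal sketch (in Python, driving the OpenZFS command-line tools via subprocess) that builds a pool named tank from two mirror vdevs. The pool name and the /dev/sdX device paths are placeholders for illustration only; running this requires root, an installed OpenZFS, and four spare disks, since creating a pool wipes them.

```python
# A minimal sketch of how the three layers map onto real commands, using
# Python's subprocess module to call the OpenZFS CLI. Pool name "tank" and
# the /dev/sdX device paths are placeholders -- adjust to your hardware.
# Requires root and an installed OpenZFS; creating a pool erases the disks.
import subprocess

def run(cmd: list[str]) -> None:
    """Run a command and echo it, stopping on any error."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Layers 1+2: four physical disks grouped into two mirror vdevs.
# Layer 3: both vdevs combined into a single pool called "tank".
run(["zpool", "create", "tank",
     "mirror", "/dev/sda", "/dev/sdb",
     "mirror", "/dev/sdc", "/dev/sdd"])

# Inspect the resulting pool / vdev / disk hierarchy.
run(["zpool", "status", "tank"])
```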

Safety#

All redundancy and fault tolerance are implemented at the vdev level. The chosen vdev type directly determines the storage pool’s performance, usable capacity, and data safety.

  • Stripe/Single-disk: The simplest vdev, which can be a single disk or multiple disks forming a non-redundant stripe set (similar to RAID 0). Maximizes performance and capacity but has no fault tolerance: the failure of any single disk destroys the vdev, and losing any vdev means losing the entire pool.
  • Mirror: Similar to RAID 1. Each disk in the vdev stores identical copies of the data. A two-way mirror can survive one disk failure; as long as one disk remains healthy, no data is lost. ZFS mirrors are more flexible than traditional RAID 1, supporting three-way or higher mirrors for greater redundancy. Write performance equals that of a single disk, while read performance scales with the number of disks, making it excellent for high-IOPS workloads.
  • RAID-Z: ZFS’s innovative alternative to parity RAID (e.g., RAID 5 and RAID 6). Traditional RAID 5 suffers from the “write hole” issue, where an unexpected outage during write operations may cause data and parity mismatches, potentially leading to irreparable corruption. ZFS avoids this through its Copy-on-Write (CoW) mechanism, ensuring atomic operations by always writing new stripes instead of overwriting old ones. RAID-Z has three levels:
    • RAID-Z1: Similar to RAID 5, with one disk’s worth of parity; survives one disk failure.
    • RAID-Z2: Similar to RAID 6, with two disks’ worth of parity; survives two disk failures.
    • RAID-Z3: Triple parity; survives three disk failures, offering extremely high redundancy.
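
As a rough back-of-the-envelope comparison of these vdev types, the small Python sketch below estimates usable capacity and survivable disk failures for a hypothetical 6 × 4 TB vdev. It ignores metadata overhead, RAID-Z padding, and compression, so treat the numbers as approximations only.

```python
# Back-of-the-envelope usable capacity and fault tolerance per vdev type.
# This ignores metadata overhead, RAID-Z padding/allocation quirks, and
# compression, so the numbers are rough estimates, not exact figures.

def vdev_estimate(kind: str, disks: int, disk_tb: float) -> tuple[float, int]:
    """Return (approx. usable TB, disk failures survivable) for one vdev."""
    if kind == "stripe":
        return disks * disk_tb, 0
    if kind == "mirror":
        return disk_tb, disks - 1          # an n-way mirror keeps one copy usable
    if kind.startswith("raidz"):
        parity = int(kind[-1])             # raidz1 -> 1, raidz2 -> 2, raidz3 -> 3
        return (disks - parity) * disk_tb, parity
    raise ValueError(f"unknown vdev type: {kind}")

for kind in ("stripe", "mirror", "raidz1", "raidz2", "raidz3"):
    usable, failures = vdev_estimate(kind, disks=6, disk_tb=4.0)
    print(f"{kind:7s} 6 x 4 TB -> ~{usable:.0f} TB usable, survives {failures} failure(s)")
```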

Copy-on-Write (CoW)#

In traditional file systems, when a file is modified, the system overwrites the old data blocks directly. If an interruption (e.g., power loss) occurs during the write, old data may be partially overwritten and new data incomplete, leading to corruption.

ZFS works differently. It never overwrites live data. Instead, when modifying a block:

  1. The new data is written to a new, unused location on disk.
  2. The parent metadata pointer is updated to reference the new block.
  3. This pointer update is atomic—either it succeeds completely or doesn’t happen at all.

This guarantees that the file system is always consistent. If power is lost during the process, the old data and pointers remain intact. Upon reboot, ZFS discards incomplete transactions, rolling back to the last consistent state. Thus, ZFS doesn’t need fsck (file system check) tools.
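
The same idea can be illustrated in userspace. The sketch below is a loose analogy, not how ZFS is implemented internally: it writes the new version to a fresh location first and only then flips the "pointer" with a single atomic rename, so a crash leaves either the old version or the new one, never a torn mix.

```python
# A userspace analogy for copy-on-write, not ZFS internals: new data is
# written to a fresh location first, and only then is the "pointer"
# (here a rename) flipped in one atomic step.
import os
import tempfile

def cow_update(path: str, new_data: bytes) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: write the new block to a new, unused location.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(new_data)
        tmp.flush()
        os.fsync(tmp.fileno())   # make sure the new data is on disk first
    # Steps 2+3: atomically repoint the name at the new data.
    os.replace(tmp_path, path)

cow_update("example.txt", b"second version\n")
```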

Snapshots#

Snapshots are one of the most powerful applications of CoW. A snapshot is a read-only copy of a file system or volume at a specific time.

Creating a snapshot doesn’t duplicate data; it simply freezes the metadata tree. Since CoW ensures old data blocks aren’t overwritten, snapshots only reference them. Snapshots are instant to create and initially take negligible space. Only when data is modified do snapshots begin consuming space, as the old blocks must be retained.

The space consumed by a snapshot equals the total size of the blocks that have since been modified or deleted in the active file system but are still referenced by that snapshot.
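
In day-to-day use this boils down to a couple of commands. The sketch below drives the OpenZFS CLI from Python; the dataset tank/data and the snapshot name are placeholders, and it needs root plus an existing dataset. Keep in mind that a rollback discards everything written after the snapshot.

```python
# A minimal sketch of working with snapshots through the OpenZFS CLI.
# "tank/data" and the snapshot name are placeholders; requires root and
# an existing dataset.
import subprocess

def zfs(*args: str) -> None:
    print("$ zfs", " ".join(args))
    subprocess.run(["zfs", *args], check=True)

zfs("snapshot", "tank/data@before-upgrade")       # instant, initially ~zero space
zfs("list", "-t", "snapshot", "-r", "tank/data")  # show snapshots and space used
# ...later, if something went wrong:
zfs("rollback", "tank/data@before-upgrade")       # discards changes made after the snapshot
```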

Never Trust#

The ZFS philosophy is “never trust hardware.” It assumes hardware (memory, cables, controllers, disks) may fail anytime, and thus implements end-to-end, multilayered integrity protection.

End-to-End#

ZFS’s core protection mechanism is end-to-end checksums. Each block has a checksum (default: optimized Fletcher4, optionally SHA-256). On read, ZFS recalculates and verifies it. If mismatched, data corruption is detected.

These checksums are stored in the parent block pointers, forming a Merkle tree (hash tree). This creates a “chain of trust” from the data blocks all the way up to the root “uberblock.” Unlike traditional file systems that may store checksums alongside the data (where both can be corrupted together), ZFS stores them separately, making corruption nearly impossible to hide.
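
The following toy Python example captures the Merkle-tree intuition: each parent keeps the checksums of its children, so the root hash vouches for every block below it. Real ZFS block pointers also carry addresses, sizes, and other metadata, which this omits.

```python
# A toy illustration of the Merkle-tree idea: parents store the checksums
# of their children, so the root hash covers every block beneath it.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

blocks = [b"block A", b"block B", b"block C", b"block D"]
leaf_sums = [sha256(b) for b in blocks]          # checksums live apart from the data
root = sha256("".join(leaf_sums).encode())       # the "uberblock" checksum

# Simulate silent corruption of one data block.
blocks[2] = b"block C, flipped bit"
if sha256("".join(sha256(b) for b in blocks).encode()) != root:
    print("corruption detected: recomputed root no longer matches")
```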

Self-Healing#

Beyond error detection, ZFS can self-heal. If corruption is found and redundancy exists (mirror or RAID-Z), ZFS retrieves correct data from the redundant source, transparently returns it to the application, and repairs the bad copy on disk.

For latent risks like bit rot, ZFS provides scrubbing. The zpool scrub process scans all blocks, verifies checksums, and repairs as needed. Regular scrubbing (e.g., monthly) is critical. However, pools without redundancy can only detect, not repair, corruption.
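
Conceptually, the self-healing read path looks something like the sketch below: try a copy, verify it against the checksum stored in the parent, and if verification fails, serve and repair from a surviving copy. This is only an illustration of the idea; the real work happens inside the pool layer, and the periodic full pass is simply `zpool scrub <pool>`, typically scheduled via cron or a systemd timer.

```python
# A conceptual sketch of self-healing on a two-way mirror: read one copy,
# verify it against the stored checksum, and if it fails, serve and
# repair from the other copy. Real ZFS does this inside the pool layer.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

good = b"important data"
stored_checksum = sha256(good)
mirror = {"disk0": b"important dXta", "disk1": b"important data"}  # disk0 has bit rot

def read_with_heal(block_checksum: str) -> bytes:
    for name, data in mirror.items():
        if sha256(data) == block_checksum:
            # Heal any copies that failed verification.
            for other in mirror:
                if sha256(mirror[other]) != block_checksum:
                    mirror[other] = data
                    print(f"repaired {other} from {name}")
            return data
    raise IOError("all copies corrupt; without redundancy ZFS can only report this")

print(read_with_heal(stored_checksum))
```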

Performance#

ZFS integrates sophisticated caching and logging to optimize workloads.

Reading#

Reads rely heavily on the Adaptive Replacement Cache (ARC) in memory. ARC tracks both “recently used” and “frequently used” data, dynamically balancing space for higher hit rates. By default, ZFS uses up to 50% of system RAM for ARC. Since memory greatly accelerates storage, increasing RAM is the best performance boost.

L2ARC (Level 2 ARC) adds SSD-based caching for evicted data. However, it requires ARC memory for indexing, so excessively large L2ARC can hurt performance. Rule of thumb: increase RAM first; use L2ARC only if memory maxes out and cache hit rates remain low.
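
To build intuition for why the ARC beats a plain LRU, here is a deliberately simplified toy cache that keeps one list for blocks seen once recently and another for blocks reused repeatedly. The real ARC additionally keeps "ghost" lists and adaptively resizes the two halves, which this sketch omits.

```python
# A toy cache illustrating the ARC intuition only: one list for blocks seen
# once recently, one for blocks used more than once. The real ARC also keeps
# "ghost" lists and adaptively resizes the two halves, omitted here.
from collections import OrderedDict

class TinyArcLikeCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.recent: OrderedDict[str, bytes] = OrderedDict()    # seen once
        self.frequent: OrderedDict[str, bytes] = OrderedDict()  # seen repeatedly

    def _evict_if_needed(self) -> None:
        while len(self.recent) + len(self.frequent) > self.capacity:
            victim = self.recent if self.recent else self.frequent
            victim.popitem(last=False)                           # drop the oldest entry

    def get(self, key: str) -> bytes | None:
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return self.frequent[key]
        if key in self.recent:
            self.frequent[key] = self.recent.pop(key)            # promote on reuse
            return self.frequent[key]
        return None                                              # miss: would read from disk

    def put(self, key: str, value: bytes) -> None:
        target = self.frequent if key in self.frequent else self.recent
        target[key] = value
        target.move_to_end(key)
        self._evict_if_needed()

cache = TinyArcLikeCache(capacity=2)
cache.put("a", b"1")
cache.put("b", b"2")
cache.get("a")                 # "a" is promoted to the frequent list
cache.put("c", b"3")           # evicts "b" (oldest single-use block), keeps "a"
print(cache.get("a"), cache.get("b"))
```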

Writing#

ZFS distinguishes between asynchronous and synchronous writes:

  • Asynchronous Writes: Most operations (like file copies). Data goes to ARC, and the app is immediately told “done.” Later, ZFS flushes batches to disk in transaction groups. High performance but risks losing recent data in power failure.
  • Synchronous Writes: Required by databases, NFS, etc. Data must persist before acknowledgment. ZFS uses the ZFS Intent Log (ZIL) to record synchronous writes.
  • SLOG (Separate Log device): By default, ZIL lives on pool disks. Heavy synchronous workloads may suffer since each write hits slow disks. A dedicated SLOG device (low-latency SSD/NVMe/Optane) dramatically improves performance by storing ZIL there.

Important: SLOG only benefits synchronous writes. For typical home or media workloads (asynchronous), it’s useless and a waste.
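
The gap that the ZIL and SLOG exist to absorb can be felt with a generic (not ZFS-specific) Python experiment: an asynchronous-style write returns as soon as the data is buffered, while a synchronous-style write must wait for fsync(), i.e. for the data to be durable, before acknowledging.

```python
# A general illustration (not ZFS-specific) of async vs. sync writes:
# the async path returns once the data is in a buffer, while the sync
# path waits for fsync() -- durability -- before acknowledging.
import os
import time

def timed_write(path: str, data: bytes, sync: bool) -> float:
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)              # async case: done once the page cache has it
        if sync:
            f.flush()
            os.fsync(f.fileno())   # sync case: wait until it reaches stable storage
    return time.perf_counter() - start

payload = os.urandom(1 << 20)      # 1 MiB of test data
print(f"async-style write: {timed_write('demo.bin', payload, sync=False):.4f}s")
print(f"sync-style  write: {timed_write('demo.bin', payload, sync=True):.4f}s")
```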

Alternatives#

Despite its power, ZFS isn’t the only option. Other open-source and commercial systems exist, each with trade-offs.

| Feature | ZFS | Btrfs | LVM + ext4/XFS |
| --- | --- | --- | --- |
| Architecture | Unified | Unified | Layered |
| Data Integrity | End-to-end checksums | Per-block checksums | None |
| Self-Healing | Yes (with redundancy) | Yes (with redundancy) | No |
| RAID Stability | Excellent (RAID-Z1/2/3) | Good for mirror/RAID10; RAID5/6 unstable | Excellent (via mdadm) |
| Snapshots | Efficient | Efficient | Inefficient |
| Pool Expansion | Rigid (add vdevs, not single disks) | Flexible | Flexible |
| Resource Usage | High | Medium-low | Low |
| Kernel Integration | Out-of-tree | Native | Native |

(Table by Gemini 2.5 Pro)

Or,

| Feature | ZFS (OpenZFS) | Btrfs | XFS | Ext4 + LVM |
| --- | --- | --- | --- | --- |
| Snapshots | Native, efficient | Native, subvolume-based | Not supported (needs LVM) | Not supported (needs LVM) |
| Checksums | End-to-end for data & metadata | Data/metadata | Metadata only | Metadata only |
| Built-in RAID | Yes (RAID-Z1/2/3, mirrors) | Yes (0/1/5/6/10) | No | No |
| Compression | Yes (LZ4, Zstd, etc.) | Yes | No | No |
| Deduplication | Yes (real-time, memory-intensive) | Yes (offline tools) | No | No |
| Volume Management | Integrated | Partial | None | Via LVM |
| Memory Needs | High (≥8GB recommended) | Moderate | Low | Low |
| Performance | Strong overall; small random writes slower | Good, but suffers under fragmentation | Excellent for large/concurrent IO | Excellent general-purpose |
| Data Integrity | Very high | High | Moderate | Moderate |
| Maturity | Very mature, enterprise-proven | Fairly mature | Very mature | Very mature |
| Operational Complexity | Higher (learning curve) | Medium | Low | Medium |

(Table by ChatGPT 5)

Commercial closed-source equivalents include:

  • Microsoft Storage Spaces
  • Veritas VxVM
  • Oracle ASM
  • NetApp ONTAP
  • Dell PowerScale (formerly Isilon OneFS)
  • Veritas InfoScale

Conclusion#

Since my main server has 96GB of ECC RAM, I ultimately chose the ZFS solution included in Proxmox VE.

After one surreal experience of accidentally deleting files and then restoring them from a snapshot, ZFS became my favorite. So, if you also have plenty of memory, I highly recommend trying ZFS.

Since I usually write novels, some paragraphs may feel oddly split; please bear with me. With this, we've wrapped up an in-depth yet beginner-friendly introduction to ZFS. Wish you all happy tinkering, and see you in the next article 👋

Introduction to ZFS
https://baidu.blog.icechui.com/blog/p/zfsintro
Author baidu0com
Published at August 18, 2025