
Introduction

btrfsutils is a Rust implementation of the btrfs filesystem utilities. It provides three command-line tools: btrfs, for managing and inspecting btrfs filesystems; btrfs-mkfs, for creating new ones; and btrfs-tune, for offline superblock tuning. All three aim to be drop-in replacements for the tools provided by btrfs-progs.

Most commands are fully implemented and produce output matching the C reference; the explicit goal is drop-in compatibility with the reference implementation, plus additional features. The project is currently in beta (pre-1.0), so it should not be used in production, but the commands that are implemented are thoroughly tested.

It also provides library crates that can be used to access the kernel APIs for managing btrfs filesystems, to decode and write on-disk structures, and to parse and handle the btrfs send stream format.

Source Code

The source is available on GitHub and GitLab.

Installation

While these tools are still in their beta (pre-1.0 release) phase, you can already install them and try them out. Currently, the recommended way to install them is with Cargo; there are no binary builds to download.

Cargo

If you have cargo installed, you can install the utilities with it.

cargo install btrfs-cli
cargo install btrfs-tune
cargo install btrfs-mkfs

Nix

If you use Nix with flakes enabled, you can run the tool directly without installing it:

nix run github:rustutils/btrfsutils -- filesystem show /mnt

Or install it into your profile:

nix profile install github:rustutils/btrfsutils

From source

See Building from Source for instructions on compiling btrfsutils yourself from the repository.

Requirements

btrfsutils runs on Linux. Most commands that interact with a mounted filesystem require CAP_SYS_ADMIN (i.e. root, or a process with that capability granted). The exceptions are btrfs inspect-internal dump-super and dump-tree, which only require read access to the block device or image file.

Building from Source

Prerequisites

You need a Rust toolchain matching the version in rust-toolchain.toml — running rustup toolchain install in the project directory will pick it up automatically. You also need clang and libclang for bindgen, which generates Rust bindings from the kernel UAPI headers at build time.

On Fedora/RHEL:

sudo dnf install clang

On Debian/Ubuntu:

sudo apt install clang libclang-dev

Building with Cargo

cargo build --release

The resulting binaries are target/release/btrfs, target/release/btrfs-mkfs, and target/release/btrfs-tune.

Building with Nix

The project includes a Nix flake that provides a fully reproducible build with all dependencies pinned:

nix build

Outputs land in result/bin/btrfs, result/bin/btrfs-mkfs, result/bin/btrfs-tune, and result/share/man/man1/.

To enter a development shell with all tools available (including nightly rustfmt, cargo-insta, and cargo-llvm-cov):

nix develop

Contributors who want to run the full lint sweep (just check) on a non-Nix machine may also need a host-arch musl cross-compiler — see the “Static checks” section of the testing guide for setup instructions.

Concepts

This page defines the terms used throughout the btrfs documentation and command output.

Filesystem

A btrfs filesystem is a single logical storage pool. It has a UUID and an optional human-readable label, and it can span one or more physical block devices. All data and metadata stored in the filesystem is distributed across its devices according to the configured RAID profiles.

A filesystem is accessed by mounting it at a path. Most btrfs commands take that mount point (or any path within it) as their argument.

Device

A device is a block device — a disk partition or a whole disk — that belongs to a filesystem. Every filesystem has at least one device. Additional devices can be added or removed while the filesystem is mounted, allowing online capacity changes.

Subvolume

A subvolume is an independently managed subtree within a filesystem. It looks like a directory, but it has its own inode namespace and can be snapshotted, sent, or deleted independently from the rest of the filesystem.

When you mount a btrfs filesystem, you are mounting one of its subvolumes (the default subvolume, unless you specify otherwise). Other subvolumes appear as directories within it but can also be mounted directly with the subvol= or subvolid= mount options.

Snapshot

A snapshot is a copy-on-write copy of a subvolume taken at a point in time. It initially shares all of its data with the source subvolume; the two diverge as either copy is written. Snapshots can be read-write or read-only. Read-only snapshots are required for btrfs send.

Chunk

btrfs divides storage into chunks — large, contiguous regions of logical address space (typically 256 MiB for metadata, 1 GiB for data). Each chunk is backed by one or more physical stripes on the underlying devices, according to the RAID profile in use. The mapping from logical addresses to physical device locations is stored in the chunk tree.

Extent

An extent is a contiguous run of bytes within a chunk. File data is stored in data extents; the B-trees that make up btrfs metadata are stored in metadata extents. btrfs uses copy-on-write: modifying data creates a new extent rather than overwriting the old one, which is what makes snapshots cheap.

Generation

Every committed transaction increments the filesystem’s generation number. Subvolumes track the generation at which they were last modified (their generation) and the generation at which they were originally created (their ogeneration, or original generation). These are used by tools like btrfs subvolume find-new to identify recently changed files, and by btrfs send to select an appropriate incremental parent.

qgroup

A quota group (qgroup) tracks and optionally limits the amount of space used by a set of subvolumes. qgroups can be nested into a hierarchy, which allows shared space (space that would not be freed even if one subvolume were deleted) to be accounted at the group level. Quotas must be enabled on the filesystem before qgroups can be used.

Commands

btrfsutils implements the same command structure as the upstream btrfs tool. Commands are organized into groups:

btrfs filesystem

Manage and inspect mounted filesystems.

| Command | Description |
|---|---|
| btrfs filesystem show [path] | Show filesystem info and devices |
| btrfs filesystem df <path> | Show space usage by chunk type |
| btrfs filesystem usage <path> | Detailed space usage with per-device breakdown |
| btrfs filesystem du <path> | Show disk usage including shared extents |
| btrfs filesystem sync <path> | Sync the filesystem |
| btrfs filesystem defrag <path> | Defragment a file or directory |
| btrfs filesystem resize <size> <path> | Resize a mounted filesystem |
| btrfs filesystem label <path> [label] | Get or set the filesystem label |
| btrfs filesystem mkswapfile <path> | Create a swapfile |
| btrfs filesystem commit-stats <path> | Show commit statistics |

btrfs subvolume

Create and manage subvolumes and snapshots.

| Command | Description |
|---|---|
| btrfs subvolume create <path> | Create a subvolume |
| btrfs subvolume delete <path> | Delete a subvolume |
| btrfs subvolume snapshot <src> <dst> | Create a snapshot |
| btrfs subvolume list <path> | List subvolumes |
| btrfs subvolume show <path> | Show subvolume details |
| btrfs subvolume get-default <path> | Show the default subvolume |
| btrfs subvolume set-default <id> <path> | Set the default subvolume |
| btrfs subvolume get-flags <path> | Show subvolume flags |
| btrfs subvolume set-flags <path> | Set subvolume flags |
| btrfs subvolume find-new <path> <gen> | Find files modified since a generation |
| btrfs subvolume sync <path> | Wait for deleted subvolumes to be cleaned up |

btrfs device

Manage devices in a multi-device filesystem.

| Command | Description |
|---|---|
| btrfs device add <dev> <path> | Add a device |
| btrfs device remove <dev> <path> | Remove a device |
| btrfs device stats <path> | Show per-device error statistics |
| btrfs device scan [dev] | Scan for btrfs devices |
| btrfs device ready <dev> | Check if a multi-device filesystem is ready |
| btrfs device usage <path> | Show per-device allocation details |

btrfs balance

Rebalance data and metadata across devices or profiles.

| Command | Description |
|---|---|
| btrfs balance start <path> | Start a balance |
| btrfs balance pause <path> | Pause a running balance |
| btrfs balance resume <path> | Resume a paused balance |
| btrfs balance cancel <path> | Cancel a running or paused balance |
| btrfs balance status <path> | Show balance status |

Balance filters (-d, -m, -s) accept filter strings such as usage=50,profiles=raid1|single.

btrfs scrub

Verify data and metadata checksums.

| Command | Description |
|---|---|
| btrfs scrub start <path> | Start a scrub |
| btrfs scrub cancel <path> | Cancel a running scrub |
| btrfs scrub resume <path> | Resume a cancelled scrub |
| btrfs scrub status <path> | Show scrub status |
| btrfs scrub limit <path> | Get or set scrub throughput limit |

btrfs replace

Replace a device in a filesystem.

| Command | Description |
|---|---|
| btrfs replace start <srcdev> <tgtdev> <path> | Start a device replacement |
| btrfs replace status <path> | Show replacement status |
| btrfs replace cancel <path> | Cancel a running replacement |

btrfs send / receive

Stream filesystem data between systems.

| Command | Description |
|---|---|
| btrfs send <subvol> | Send a subvolume as a stream |
| btrfs receive <path> | Receive a stream into a directory |

btrfs send supports full sends and incremental sends (-p parent, -c clone sources). btrfs receive supports v1, v2 (compressed data), and v3 (fs-verity) stream formats.

btrfs inspect-internal

Low-level inspection tools.

| Command | Description |
|---|---|
| btrfs inspect-internal rootid <path> | Show the subvolume ID for a path |
| btrfs inspect-internal inode-resolve <ino> <path> | Resolve an inode to paths |
| btrfs inspect-internal logical-resolve <addr> <path> | Resolve a logical address to paths |
| btrfs inspect-internal subvolid-resolve <id> <path> | Resolve a subvolume ID to a path |
| btrfs inspect-internal min-dev-size <path> | Show the minimum safe device size |
| btrfs inspect-internal list-chunks <path> | List all chunk allocations |
| btrfs inspect-internal dump-super <dev> | Dump the superblock |
| btrfs inspect-internal dump-tree <dev> | Dump raw B-tree contents |
| btrfs inspect-internal tree-stats <dev> | Walk a B-tree and report node/leaf statistics |
| btrfs inspect-internal map-swapfile <path> | Show physical extent map of a swapfile |

dump-super and dump-tree read directly from a block device or image file and do not require a mounted filesystem or elevated privileges.

btrfs quota / qgroup

Manage filesystem quotas.

| Command | Description |
|---|---|
| btrfs quota enable <path> | Enable quotas |
| btrfs quota disable <path> | Disable quotas |
| btrfs quota rescan <path> | Rescan quota usage |
| btrfs quota status <path> | Show quota status |
| btrfs qgroup show <path> | Show qgroup usage |
| btrfs qgroup create <id> <path> | Create a qgroup |
| btrfs qgroup destroy <id> <path> | Destroy a qgroup |
| btrfs qgroup assign <src> <dst> <path> | Assign a qgroup to a parent |
| btrfs qgroup remove <src> <dst> <path> | Remove a qgroup assignment |
| btrfs qgroup limit <size> <id> <path> | Set a qgroup size limit |
| btrfs qgroup clear-stale <path> | Remove stale qgroups |

btrfs property

Get and set filesystem object properties.

| Command | Description |
|---|---|
| btrfs property get <path> [name] | Get a property |
| btrfs property set <path> <name> <value> | Set a property |
| btrfs property list <path> | List available properties |

Supported properties: ro (subvolumes), label (filesystem/device), compression (inodes).

btrfs restore

Recover files from a damaged or unmounted filesystem by reading on-disk structures directly.

| Command | Description |
|---|---|
| btrfs restore <dev> <path> | Restore files to a destination directory |
| btrfs restore -l <dev> | List available tree roots |

Supports regular files, directories, symlinks (-S), extended attributes (-x), metadata (owner/mode/times with -m), and compressed extents (zlib/zstd/lzo). Use --path-regex to filter restored files and -s to include snapshots.

btrfs rescue

Emergency recovery tools for damaged filesystems.

| Command | Description |
|---|---|
| btrfs rescue super-recover <dev> | Restore superblock from mirrors |
| btrfs rescue zero-log <dev> | Clear the log tree pointer |
| btrfs rescue create-control-device | Create /dev/btrfs-control if missing |
| btrfs rescue fix-device-size <dev> | Re-align device and superblock sizes |
| btrfs rescue fix-data-checksum [--readonly\|--mirror 1] <dev> | Scan and (with --mirror 1) repair data csums |
| btrfs rescue clear-uuid-tree <dev> | Drop the UUID tree so the kernel rebuilds it |
| btrfs rescue clear-space-cache <v1\|v2> <dev> | Clear the v1 or v2 free space cache |
| btrfs rescue clear-ino-cache <dev> | Remove leftover items from the deprecated inode cache |

btrfs rescue chunk-recover has argument parsing scaffolded but is not yet implemented.

btrfs-mkfs

Create a new btrfs filesystem on a block device or image file.

btrfs-mkfs [options] <device> [device...]

Supports single-device and multi-device filesystems with all RAID profiles (SINGLE, DUP, RAID0, RAID1, RAID1C3, RAID1C4, RAID10, RAID5, RAID6), all four checksum algorithms (crc32c, xxhash, sha256, blake2b), quota and simple quota setup, custom nodesize/sectorsize, labels, UUIDs, feature flags, and directory population via --rootdir.

btrfs-tune

Modify btrfs filesystem parameters on an unmounted device.

btrfs-tune [options] <device>
| Flag | Description |
|---|---|
| -r | Enable extended inode refs (extref) |
| -x | Enable skinny metadata extent refs |
| -n | Enable no-holes feature |
| -S 0 / -S 1 | Clear or set the seeding flag |
| -m | Change fsid to a random UUID (metadata_uuid mechanism) |
| -M <uuid> | Change fsid to a specific UUID (metadata_uuid mechanism) |
| -u | Rewrite fsid to a random UUID (patches all tree blocks) |
| -U <uuid> | Rewrite fsid to a specific UUID (patches all tree blocks) |

Global flags

These flags are accepted by all btrfs commands:

| Flag | Description |
|---|---|
| -v / --verbose | Increase verbosity (repeatable) |
| -q / --quiet | Suppress non-error output |
| -f / --format | Set the format, one of: text, json, modern |

Output Format

Many commands accept --format json, which causes them to output JSON-formatted data.

Differences from btrfs-progs

btrfsutils aims to be a drop-in replacement for btrfs-progs. Most commands produce identical output and accept the same flags. This page lists the known gaps and the features that go beyond what btrfs-progs offers.

What’s not yet supported

These features from btrfs-progs are not yet implemented:

  • btrfs check --repair and related write-mode flags (--init-csum-tree, --init-extent-tree, etc.). Read-only checking works.
  • btrfs check --mode lowmem (currently only the default mode is supported).
  • btrfs rescue chunk-recover. Other write-mode rescue subcommands (fix-device-size, clear-space-cache, clear-uuid-tree, clear-ino-cache, fix-data-checksum) are implemented.
  • btrfs filesystem resize --offline.
  • btrfs-mkfs zoned device support.
  • btrfs-tune --convert-to-free-space-tree and --convert-to-block-group-tree.

What’s added beyond btrfs-progs

These features are original additions not present in the C tools:

  • --format modern (or BTRFS_OUTPUT_FORMAT=modern): opt-in improved output with adaptive column widths and tree views. Supported by most tabular commands including device stats, device usage, subvolume list, inspect list-chunks, filesystem du/df/show/usage, qgroup show, quota status, scrub start/status.
  • btrfs filesystem du --depth N: limit display depth while computing full totals.
  • btrfs filesystem du --sort: sort entries by path, total, exclusive, or shared.
  • btrfs inspect list-chunks --offline: read chunks directly from an unmounted device or image file without CAP_SYS_ADMIN.
  • btrfs inspect min-dev-size --offline: compute minimum device size from an unmounted device or image file.
  • btrfs device stats --offline: read device error statistics from the on-disk device tree without requiring a mounted filesystem.

Architecture

Crate structure

The project follows a strict layering: lower crates have no knowledge of the layers above them.

Architecture diagram

btrfs-uapi wraps kernel ioctls, sysfs reads, and procfs reads into safe Rust APIs. It is Linux-only and the only crate that talks directly to the kernel.

btrfs-disk parses on-disk structures — superblocks, B-tree nodes, item payloads — from raw byte buffers. It is platform-independent and does not depend on btrfs-uapi, so it can be used to inspect filesystem images on any OS.

btrfs-stream parses the btrfs send stream wire format. The core parser is platform-independent. The optional receive feature is Linux-only and applies a parsed stream to a mounted filesystem via btrfs-uapi.

btrfs-mkfs implements the mkfs.btrfs tool. It constructs B-tree nodes as raw byte buffers and writes them directly to a block device or image file using pwrite. It does not use ioctls.

btrfs-tune implements the btrfstune tool. It modifies on-disk superblock parameters (feature flags, seeding, filesystem UUIDs) on unmounted devices. For lightweight UUID changes it only rewrites the superblock; for full fsid rewrites it traverses every tree block on disk via btrfs-disk.

btrfs-cli implements the btrfs tool. It handles argument parsing via clap, calls into btrfs-uapi and btrfs-disk as needed, and formats all output. Optionally, this tool can also embed the btrfs-tune and btrfs-mkfs tools as subcommands, for easier single-file deployment.

The two-layer model

Every feature that involves kernel communication is split across two layers. The uapi/ layer provides a safe Rust function: it takes typed arguments, calls the ioctl, and returns a typed result, with no unsafe in the public API and no knowledge of CLI concerns. The cli/ layer provides a clap subcommand that calls into uapi/ and formats the result for the user, with no ioctl calls or raw kernel types.

This rule applies to all kernel interfaces — btrfs ioctls, standard VFS ioctls like FS_IOC_FIEMAP, and block device ioctls like BLKGETSIZE64 all live in uapi/, never in cli/.

The same principle applies to disk/: it parses raw bytes into typed structs, and cli/ handles all display formatting. The disk/ crate never calls println!.

How Commands Work

Every command in btrfsutils is implemented across two layers: a safe kernel interface wrapper in btrfs-uapi, and a CLI command in btrfs-cli. This page walks through a concrete example — btrfs filesystem label — to show how the two layers fit together and why the split exists.

The uapi layer

The uapi layer lives in uapi/src/. Its job is to translate between Rust types and the raw kernel interfaces — allocating ioctl argument buffers, calling the ioctl, and converting the result into something the rest of the code can use without touching any unsafe code or bindgen types.

For btrfs filesystem label, that looks like this (from uapi/src/filesystem.rs):

#![allow(unused)]
fn main() {
pub fn label_get(fd: BorrowedFd) -> nix::Result<CString> {
    let mut buf = [0i8; BTRFS_LABEL_SIZE as usize];
    unsafe { btrfs_ioc_get_fslabel(fd.as_raw_fd(), &mut buf) }?;
    let cstr = unsafe { CStr::from_ptr(buf.as_ptr()) };
    Ok(cstr.to_owned())
}

pub fn label_set(fd: BorrowedFd, label: &CStr) -> nix::Result<()> {
    let bytes = label.to_bytes();
    if bytes.len() >= BTRFS_LABEL_SIZE as usize {
        return Err(nix::errno::Errno::EINVAL);
    }
    let mut buf = [0i8; BTRFS_LABEL_SIZE as usize];
    for (i, &b) in bytes.iter().enumerate() {
        buf[i] = b as c_char;
    }
    unsafe { btrfs_ioc_set_fslabel(fd.as_raw_fd(), &buf) }?;
    Ok(())
}
}

The function signatures use BorrowedFd rather than a raw integer, CString rather than a byte array, and nix::Result rather than checking errno manually. The caller never sees btrfs_ioctl_* types. The unsafe is contained to the ioctl call itself, with surrounding logic that is safe and testable.

The cli layer

The CLI layer lives in cli/src/. Its job is to parse arguments, call the uapi function, and format the output. It never calls ioctls directly.

The same command in cli/src/filesystem/label.rs:

#![allow(unused)]
fn main() {
#[derive(Parser, Debug)]
pub struct FilesystemLabelCommand {
    /// The device or mount point to operate on
    pub path: PathBuf,
    /// The new label to set (if omitted, the current label is printed)
    pub new_label: Option<OsString>,
}

impl Runnable for FilesystemLabelCommand {
    fn run(&self, _format: Format, _dry_run: bool) -> Result<()> {
        let file = open_path(&self.path)?;
        match &self.new_label {
            None => {
                let label = label_get(file.as_fd())
                    .with_context(|| format!("failed to get label for '{}'", self.path.display()))?;
                println!("{}", label.to_bytes().escape_ascii());
            }
            Some(new_label) => {
                let cstring = CString::new(new_label.as_bytes())
                    .context("label must not contain null bytes")?;
                label_set(file.as_fd(), &cstring)
                    .with_context(|| format!("failed to set label for '{}'", self.path.display()))?;
            }
        }
        Ok(())
    }
}
}

The struct derives Parser from clap — the field doc comments become the help text. Runnable::run handles the two cases (get and set) by opening the path, calling the appropriate uapi function, and either printing the result or reporting an error. Error messages include the path so the user knows which filesystem failed.

Why the split

The separation keeps each layer focused and independently testable. The uapi layer can be tested with unit tests that mock the ioctl, or with integration tests that operate on a real filesystem, without any CLI machinery involved. The CLI layer can be tested with argument parsing snapshot tests (no filesystem needed at all) and help text snapshot tests.

It also keeps the library crates clean. Because btrfs-uapi, btrfs-disk, and btrfs-stream contain no CLI logic and no GPL-derived code, they can be licensed MIT/Apache-2.0 and used by other projects independently of the CLI tools.

Routing

Each top-level command group has a router in cli/src/ (e.g. cli/src/filesystem.rs) that defines a FilesystemCommand enum with a variant per subcommand. The Runnable implementation for the router matches on the variant and delegates to the subcommand’s own run method. Adding a new subcommand means adding a variant to the enum, a mod declaration, and a run dispatch arm.
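The routing pattern can be sketched as follows; the variant names mirror the document's description, but this is an illustration, not the project's actual code:

```rust
// Hypothetical router sketch: one enum variant per subcommand, with a
// run method that dispatches to the subcommand (stubbed as strings).
enum FilesystemCommand {
    Show,
    Label,
    // Adding a subcommand = adding a variant here plus a dispatch arm.
}

impl FilesystemCommand {
    fn run(&self) -> &'static str {
        // The router matches on the variant and delegates to the
        // subcommand's own run method.
        match self {
            FilesystemCommand::Show => "ran filesystem show",
            FilesystemCommand::Label => "ran filesystem label",
        }
    }
}

fn main() {
    assert_eq!(FilesystemCommand::Label.run(), "ran filesystem label");
}
```

In the real code the variants carry clap-derived argument structs, so dispatch and argument parsing stay in one place per command group.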

Kernel Interfaces

All kernel communication lives in btrfs-uapi. This page describes the patterns used to wrap the three main kernel interface types: ioctls, sysfs, and tree search.

Binding ioctls

Raw bindgen output is in uapi::raw, generated from uapi/src/raw/btrfs.h and btrfs_tree.h. Ioctl wrappers are declared in uapi/src/raw.rs using nix macros:

#![allow(unused)]
fn main() {
ioctl_write_ptr!(btrfs_ioc_resize, BTRFS_IOCTL_MAGIC, 3, btrfs_ioctl_vol_args);
ioctl_read!(btrfs_ioc_fs_info, BTRFS_IOCTL_MAGIC, 31, btrfs_ioctl_fs_info_args);
ioctl_readwrite!(btrfs_ioc_balance_v2, BTRFS_IOCTL_MAGIC, 32, btrfs_ioctl_balance_args);
ioctl_none!(btrfs_ioc_scrub_cancel, BTRFS_IOCTL_MAGIC, 28);
ioctl_write_int!(btrfs_ioc_balance_ctl, BTRFS_IOCTL_MAGIC, 33);
}

The macro to use is determined by the ioctl direction in the C header:

| C macro | nix macro | Direction |
|---|---|---|
| _IOW | ioctl_write_ptr! | userspace → kernel (pointer to struct) |
| _IOR | ioctl_read! | kernel → userspace |
| _IOWR | ioctl_readwrite! | both directions |
| _IO | ioctl_none! | no data |
| _IOW (integer) | ioctl_write_int! | value passed directly in arg slot |

Flexible array member ioctls

Some ioctls return variable-length arrays (e.g. btrfs_ioctl_space_args with a trailing spaces[] field). The pattern is a two-phase call:

  1. Call with zero slots to get the count from the kernel.
  2. Allocate a Vec<u64> (for 8-byte alignment) sized to base_size + count * item_size.
  3. Cast the vec’s pointer to the struct type, set the slot count, call again.
  4. Read results via __IncompleteArrayField::as_slice(count).

See uapi/src/space.rs for a worked example.
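The sizing and casting in steps 2-4 can be sketched in plain Rust with a stand-in struct; the field names below are illustrative, not the real btrfs_ioctl_space_args layout:

```rust
use std::mem::size_of;

// Stand-in for a kernel struct with a trailing flexible array.
#[repr(C)]
struct Header {
    slots: u64,
    total: u64,
}

#[repr(C)]
struct Item {
    flags: u64,
    total_bytes: u64,
    used_bytes: u64,
}

// Step 2: allocate a Vec<u64> so the buffer is 8-byte aligned and
// large enough for the header plus `count` trailing items.
fn alloc_for(count: usize) -> Vec<u64> {
    let bytes = size_of::<Header>() + count * size_of::<Item>();
    vec![0u64; bytes.div_ceil(8)]
}

fn main() {
    // Pretend phase 1 reported 3 items.
    let count = 3;
    let mut buf = alloc_for(count);
    // Step 3: view the buffer as the struct and set the slot count
    // (the second ioctl call would happen here).
    let hdr = buf.as_mut_ptr() as *mut Header;
    unsafe { (*hdr).slots = count as u64 };
    // Step 4: the returned items start right past the header.
    let items = unsafe {
        std::slice::from_raw_parts(
            (hdr as *const u8).add(size_of::<Header>()) as *const Item,
            count,
        )
    };
    assert_eq!(items.len(), 3);
}
```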

The btrfs_ioctl_vol_args_v2 union

Several subvolume and device ioctls share btrfs_ioctl_vol_args_v2. Bindgen generates two anonymous union fields:

  • __bindgen_anon_1 — the {size, qgroup_inherit} / unused[4] union
  • __bindgen_anon_2 — the name[4040] / devid / subvolid union
#![allow(unused)]
fn main() {
// Set a name:
let name_buf: &mut [c_char] = unsafe { &mut args.__bindgen_anon_2.name };

// Set devid (no unsafe needed for plain integer writes):
args.flags = BTRFS_DEVICE_SPEC_BY_ID as u64;
args.__bindgen_anon_2.devid = devid;
}

Tree search

The tree search ioctl is the primary way to read data from btrfs B-trees from userspace. It is wrapped in uapi/src/tree_search.rs as a callback-based cursor:

#![allow(unused)]
fn main() {
tree_search(fd, SearchFilter::for_type(tree_id, item_type), |hdr, data| {
    // hdr: SearchHeader — objectid, offset, item_type, len (host byte order)
    // data: &[u8] — raw on-disk item payload (little-endian)
    Ok(())
})?;
}

Common SearchFilter constructors:

#![allow(unused)]
fn main() {
// All items of a specific type across all objectids:
SearchFilter::for_type(raw::BTRFS_CHUNK_TREE_OBJECTID as u64,
                       raw::BTRFS_CHUNK_ITEM_KEY as u32)

// Items of a specific type within an objectid range:
SearchFilter::for_objectid_range(tree_id, item_type, min_oid, max_oid)
}

For searches spanning multiple item types (e.g. the quota tree walk that reads STATUS, INFO, LIMIT, and RELATION keys in one pass), construct SearchFilter directly with start and end Key values spanning the desired type range.

Important: The start and end keys form compound bounds on the B-tree key order (objectid, item_type, offset). They are not independent per-field filters. Items with unexpected types can appear if their compound key falls between start and end. Callbacks should filter on hdr.item_type when they need a single type.
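The compound ordering can be demonstrated with Rust tuples, which also compare lexicographically:

```rust
fn main() {
    // Keys are (objectid, item_type, offset); bounds compare as whole
    // tuples, not field by field.
    let start = (256u64, 0u32, 0u64);
    let narrow_end = (256u64, 255u32, u64::MAX);
    let wide_end = (300u64, 0u32, 0u64);

    // With a single-objectid range, another objectid is excluded:
    let other = (257u64, 10u32, 0u64);
    assert!(!(other >= start && other <= narrow_end));

    // With a wider objectid range, *any* item type in between falls
    // inside the bounds, which is why callbacks should still check
    // hdr.item_type:
    assert!(other >= start && other <= wide_end);
}
```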

Bindgen type note

Tree objectid constants from btrfs_tree.h bind as u32 in Rust despite being ULL in C (e.g. BTRFS_ROOT_TREE_OBJECTID: u32 = 1). Always cast at the use site. BTRFS_LAST_FREE_OBJECTID binds as i32 = -256; cast to u64 gives 0xFFFFFFFF_FFFFFF00 as expected.
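Both casts behave as described because Rust's as operator sign-extends signed sources; a quick check:

```rust
fn main() {
    // Values as bindgen emits them, per the note above.
    const BTRFS_ROOT_TREE_OBJECTID: u32 = 1;
    const BTRFS_LAST_FREE_OBJECTID: i32 = -256;

    // u32 -> u64 zero-extends; cast at the use site.
    assert_eq!(BTRFS_ROOT_TREE_OBJECTID as u64, 1);

    // i32 -> u64 sign-extends, producing the intended 64-bit key.
    assert_eq!(BTRFS_LAST_FREE_OBJECTID as u64, 0xFFFF_FFFF_FFFF_FF00);
}
```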

Cursor advancement

This is the most common source of bugs with tree search. The kernel interprets (min_objectid, min_type, min_offset) as a compound tuple key, not three independent range filters. After each batch, all three fields must be advanced together past the last returned item:

  • Normal case (offset does not overflow u64): set min_objectid = last.objectid, min_type = last.item_type, min_offset = last.offset + 1.
  • Offset overflow: set min_offset = 0, keep min_objectid = last.objectid, set min_type = last.item_type + 1.
  • Type also overflows u32: set min_offset = 0, min_type = 0, min_objectid = last.objectid + 1.

Advancing only min_offset while leaving min_objectid unchanged causes items from lower objectids to match the new minimum on every subsequent batch, producing an infinite loop.
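The three cases can be captured in a small helper; this is a sketch of the rule, not the crate's actual API:

```rust
/// Advance a (objectid, item_type, offset) key just past the last item
/// returned by a batch. Illustrative helper, not btrfs-uapi's API.
fn advance_key(objectid: u64, item_type: u32, offset: u64) -> (u64, u32, u64) {
    if let Some(next_offset) = offset.checked_add(1) {
        // Normal case: bump only the offset.
        (objectid, item_type, next_offset)
    } else if let Some(next_type) = item_type.checked_add(1) {
        // Offset overflowed: reset it and bump the type.
        (objectid, next_type, 0)
    } else {
        // Type overflowed too: move to the next objectid.
        (objectid + 1, 0, 0)
    }
}

fn main() {
    assert_eq!(advance_key(5, 132, 10), (5, 132, 11));
    assert_eq!(advance_key(5, 132, u64::MAX), (5, 133, 0));
    assert_eq!(advance_key(5, u32::MAX, u64::MAX), (6, 0, 0));
}
```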

Sysfs

Some data is read from sysfs rather than ioctls — for example, scrub throughput limits and quota state. The SysfsBtrfs type in uapi/src/sysfs.rs provides typed access to /sys/fs/btrfs/<uuid>/. The filesystem UUID is obtained from fs_info() (BTRFS_IOC_FS_INFO).
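A minimal sketch of the path construction, using the quota_override attribute as an example (the real SysfsBtrfs type may expose this differently):

```rust
use std::path::PathBuf;

// Build the sysfs path for one attribute of a filesystem, given the
// UUID reported by BTRFS_IOC_FS_INFO. Hypothetical helper for
// illustration only.
fn sysfs_attr_path(uuid: &str, attr: &str) -> PathBuf {
    PathBuf::from("/sys/fs/btrfs").join(uuid).join(attr)
}

fn main() {
    let p = sysfs_attr_path("0f0f0f0f-0000-0000-0000-000000000001", "quota_override");
    assert_eq!(
        p.to_str().unwrap(),
        "/sys/fs/btrfs/0f0f0f0f-0000-0000-0000-000000000001/quota_override"
    );
}
```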

Send and Receive

btrfs send and btrfs receive transfer filesystem state between two btrfs filesystems as a byte stream. This page explains how the mechanism works and how to use the btrfs-stream and btrfs-uapi crates to implement receive in your own application.

How send works

btrfs send asks the kernel to generate a stream representing the contents of a read-only subvolume. The kernel traverses the subvolume’s B-trees and emits a sequence of commands describing every file, directory, symlink, and extent. For an incremental send (with -p <parent>), only the differences from the parent subvolume are emitted.

The kernel is invoked via BTRFS_IOC_SEND, which writes the stream to a file descriptor (typically the write end of a pipe). A reader thread on the other end consumes the stream and writes it to a file or stdout.

The stream format

The stream is a binary format consisting of a header followed by a sequence of commands.

The stream header identifies the format version (v1, v2, or v3) and contains a magic number (btrfs-stream\0). After the header, commands follow back-to-back until an END command signals completion.

Each command has the following structure:

u32  total_length    (length of the entire command, including this header)
u16  command_type    (BTRFS_SEND_C_* constant)
u32  crc32c          (checksum of the command, with the crc field zeroed)
     attributes...   (variable-length TLV list)

Attributes are TLV-encoded:

u16  attribute_type  (BTRFS_SEND_A_* constant)
u16  length
     data...

The CRC32C used by btrfs is the raw variant (initial seed 0, no final XOR), not the standard ISO 3309 variant (initial seed 0xFFFFFFFF). When computing or verifying a checksum, use:

#![allow(unused)]
fn main() {
let crc = !crc32c::crc32c_append(!0u32, data);
}

Parsing a stream with btrfs-stream

The btrfs-stream crate provides StreamReader, which parses commands one at a time from any Read source:

#![allow(unused)]
fn main() {
use btrfs_stream::{StreamReader, StreamCommand};

let mut reader = StreamReader::new(input)?; // reads and validates the header
while let Some(command) = reader.read_command()? {
    match command {
        StreamCommand::Subvol { path, uuid, ctransid } => { /* create subvolume */ }
        StreamCommand::MkFile { path } => { /* create file */ }
        StreamCommand::Write { path, offset, data } => { /* write data */ }
        StreamCommand::Rename { path, path_to } => { /* rename */ }
        StreamCommand::End => break,
        // ... all 22+ command types
    }
}
}

StreamReader::new reads the stream header and returns an error if the magic is wrong or the version is unsupported. read_command returns None at EOF.

Applying a stream with btrfs-uapi

To implement receive, you need to apply each command to a mounted btrfs filesystem. The relevant operations are:

Subvolume and snapshot creation (BTRFS_IOC_SUBVOL_CREATE, BTRFS_IOC_SNAP_CREATE_V2): for Subvol commands, create a new empty subvolume. For Snapshot commands, look up the source subvolume by UUID using subvolume_search_by_received_uuid or subvolume_search_by_uuid, then create a writable snapshot.

File operations: standard POSIX calls — open/create, unlink, mkdir, rmdir, symlink, link, rename. btrfs does not require any special ioctls for these.

Write (BTRFS_IOC_ENCODED_WRITE or pwrite): v2 streams may send pre-compressed data via ENCODED_WRITE. If the kernel supports it, this can be passed directly; otherwise decompress and fall back to pwrite.

Clone (BTRFS_IOC_CLONE_RANGE): shares an extent between two files without copying data. The source file is found by resolving its UUID via the UUID tree.

Subvolume finalization: once all commands for a subvolume have been processed, call BTRFS_IOC_SET_RECEIVED_SUBVOL to record the UUID and ctransid, then set the subvolume read-only with BTRFS_IOC_SUBVOL_SETFLAGS.

Using ReceiveContext

If you want a complete, ready-to-use receive implementation rather than building your own, the receive feature of btrfs-stream provides ReceiveContext:

btrfs-stream = { version = "0.5", features = ["receive"] }
#![allow(unused)]
fn main() {
use btrfs_stream::ReceiveContext;

let mut ctx = ReceiveContext::new(destination_dir)?;
ctx.receive(input_stream)?;
}

ReceiveContext handles all command types including v2 encoded writes (with decompression fallback for zlib, zstd, and lzo) and v3 fs-verity. It uses an fd cache to avoid reopening the same file for sequential writes, which is important for performance when receiving large files.
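The fd cache idea can be sketched with a plain map keyed by path; this illustrates the concept only and is not ReceiveContext's actual implementation:

```rust
use std::collections::HashMap;

// Conceptual fd cache: reuse an open handle for repeated writes to the
// same path instead of reopening per command. Handles are stubbed as
// integers; a real cache would hold File values and evict LRU-style.
struct FdCache {
    capacity: usize,
    open: HashMap<String, u64>,
    next_handle: u64,
}

impl FdCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, open: HashMap::new(), next_handle: 0 }
    }

    fn get(&mut self, path: &str) -> u64 {
        if let Some(&h) = self.open.get(path) {
            return h; // hit: sequential writes reuse the handle
        }
        if self.open.len() >= self.capacity {
            self.open.clear(); // crude eviction, for the sketch only
        }
        self.next_handle += 1;
        self.open.insert(path.to_string(), self.next_handle);
        self.next_handle
    }
}

fn main() {
    let mut cache = FdCache::new(16);
    let first = cache.get("subvol/big-file");
    // A run of Write commands against the same file hits the cache.
    assert_eq!(cache.get("subvol/big-file"), first);
    assert_ne!(cache.get("subvol/other"), first);
}
```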

Parsing

The btrfs-disk crate parses btrfs on-disk structures from raw byte buffers. It is platform-independent — it works on any OS and can be used to inspect filesystem images without a running kernel.

Reading a filesystem

The typical entry point is filesystem_open, which bootstraps from the superblock:

superblock → sys_chunk_array → chunk tree → root tree

The returned OpenFilesystem contains a BlockReader (for reading tree blocks by logical address) and a map of tree root locations. From there, tree_walk traverses any tree in BFS or DFS order, calling a visitor callback for each block:

let open = filesystem_open(file)?;
let mut reader = open.reader;
tree_walk(&mut reader, root_bytenr, Traversal::Bfs, &mut |block| {
    // block: &TreeBlock — either a Node (internal) or Leaf
    Ok(())
})?;

Item payloads

Leaf blocks contain items, each with a DiskKey (objectid, type, offset) and a raw payload. parse_item_payload dispatches to a typed parser based on the key type:

let payload = parse_item_payload(&key, data);
match payload {
    ItemPayload::InodeItem(inode) => { /* ... */ }
    ItemPayload::RootItem(root) => { /* ... */ }
    ItemPayload::FileExtentItem(extent) => { /* ... */ }
    // ...
}

Reading on-disk fields safely

On-disk structs are packed and little-endian. Casting a *const u8 pointer directly to a packed struct is undefined behaviour due to potential misalignment.

btrfs-disk: bytes::Buf / bytes::BufMut

The disk crate uses the bytes crate for all parsing and serialization. A &[u8] implements Buf, so you can read fields sequentially with methods like get_u64_le(), which advances the cursor automatically:

let mut buf = data;
let generation = buf.get_u64_le();
let size = buf.get_u64_le();
let mode = buf.get_u32_le();

For serialization, BufMut provides the inverse (put_u64_le, put_slice, etc.). This approach avoids manual offset arithmetic and makes it impossible to read past the end of the buffer (it panics instead of silently producing garbage).

btrfs-uapi: offset-based LE readers

The uapi crate parses tree search results returned by the kernel, which are raw &[u8] buffers at known offsets. It uses explicit offset-based helpers from uapi/src/util.rs:

use btrfs_uapi::util::read_le_u64;
use std::mem::offset_of;

let size = read_le_u64(data, offset_of!(raw::btrfs_inode_item, size));

Always use std::mem::offset_of! and std::mem::size_of to derive offsets and sizes from the bindgen struct definitions — never hard-code numeric byte offsets. The field_size!(T, field) macro (from crate::util) gives the size of an individual field.

Superblock mirrors

btrfs writes up to three superblock copies at fixed offsets. super_mirror_offset(n) returns the byte offset for mirror n (0, 1, or 2). read_superblock reads and validates a superblock — checking the magic number and CRC — from any seekable reader.

Display logic belongs in cli/

The disk/ crate only produces typed structs. All formatting and human-readable output lives in cli/src/inspect/. The disk/ crate never calls println! or constructs output strings.

Testing

The goal for this project is to maintain high test coverage to make sure these tools function correctly.

Running tests

Running the tests for this project is complicated by the fact that many btrfs operations talk directly to the kernel and require elevated privileges.

You can run all non-privileged tests with regular cargo test commands. This will still build the privileged tests, but they are skipped.

cargo test

In order to run privileged tests, there is a just target that builds them and then runs only the test binaries (not cargo itself) under sudo. This is the recommended way to run the full test suite on this project.

just test

You can build a coverage report (requires cargo-llvm-cov) of the full test suite similarly, using the coverage target.

just coverage
# open target/coverage/llvm-cov/html/index.html

Static checks

Before committing, run just check. This wraps the formatter check (nightly rustfmt), cargo deny, taplo for Cargo.toml formatting, cargo doc (with -Dwarnings), cargo clippy --all-features, per-libc cargo check for the host arch, the optional CLI features, and cargo msrv verify against every publishable crate’s declared rust-version.

The host-arch detection means just check works on x86_64 and aarch64 alike. The musl half (<host>-unknown-linux-musl) needs a matching C cross-compiler on PATH, since the zstd-sys and lzo-sys build scripts compile C code:

  • Nix devshell (nix develop) provides everything; you don’t need any of the steps below.

  • Fedora aarch64: dnf install musl-gcc ships musl-gcc as a thin wrapper around the host gcc plus musl specs. cc-rs looks for the target-prefixed aarch64-linux-musl-gcc name, so symlink it once:

    sudo ln -s /usr/bin/musl-gcc /usr/local/bin/aarch64-linux-musl-gcc
    

    (or set CC_aarch64_unknown_linux_musl=musl-gcc and AR_aarch64_unknown_linux_musl=ar if you prefer to avoid touching /usr/local/bin.)

  • Debian / Ubuntu: apt install musl-tools (host arch) or one of the gcc-<arch>-linux-musl-cross packages for cross builds; same target-prefix handling applies if cc-rs doesn’t pick it up automatically.

If the cross C compiler isn’t on PATH, just check prints skipping <triple> check: <prefix>-linux-musl-gcc not on PATH and keeps going — only CI is expected to fail on a missing musl toolchain.

Unit tests

Unit tests live as #[cfg(test)] mod tests blocks within the module they test. They require no privileges and run with cargo test.

Coverage spans all pure logic across the crates: LE readers, struct size assertions, tree search cursor arithmetic, stream parsing (all 22 v1 command types, CRC validation), superblock parsing, B-tree node parsing, size/time formatting, argument parsing helpers, balance filter parsing, and property classification.

When adding a new feature, add unit tests for any logic that doesn’t require a real kernel or filesystem.

Integration tests

Integration tests live in uapi/tests/ and cli/tests/commands/ and are marked:

#[ignore = "requires elevated privileges"]

They are skipped by cargo test and run only via just test.

Fixture tests (commands/fixture.rs)

Read-only snapshot tests against a pre-built filesystem image (cli/tests/commands/fixture.img.gz). The image has a fixed UUID, label, and subvolume layout, so output is fully deterministic. These tests cover all read-only commands: filesystem df/show/usage/label/du, subvolume list/show, device stats/usage, all inspect-internal commands, and property get/list.

dump-tree and dump-super tests read the image file directly and do not require mounting, so they run without elevated privileges even within the privileged test suite.

Live tests (commands/live.rs)

Tests that create and mutate real btrfs filesystems on loopback devices. These cover all mutating commands: subvolume create/delete/snapshot, send/receive, scrub, balance, device add/remove, quota, qgroup, label set, resize, defrag, replace, and more.

Test helpers

cli/tests/common.rs provides RAII helpers that clean up automatically on drop:

BackingFile → LoopbackDevice → Mount

Convenience functions:

Function                   Description
single_mount()             512 MiB single-device filesystem in a tempdir
deterministic_mount()      Same, with a fixed UUID and label
fixture_mount()            Mounts the pre-built fixture image read-only
write_test_data(path, n)   Write deterministic byte-pattern files
verify_test_data(path, n)  Verify previously written test data

Snapshot testing with insta

CLI output tests use insta for snapshot testing. Snapshots live in cli/tests/snapshots/ and are checked in to the repository.

Four snapshot categories:

Pattern                    Privileges  Description
arguments__*.snap          none        Argument parsing output
help__*.snap               none        Help text for every subcommand
commands__fixture__*.snap  root        Read-only CLI output (fixture image)
commands__live__*.snap     root        CLI output from live filesystem tests

Snapshot workflow

# Run tests; fails if any snapshot has changed:
cargo test

# Run tests and collect pending snapshot changes:
cargo insta test

# Interactively review each changed snapshot:
cargo insta review

# Accept all pending changes at once:
cargo insta accept --all

After running privileged tests via just test, the Justfile fixes ownership of any root-owned snapshot files and sets INSTA_WORKSPACE_ROOT so snapshots land in the right directory.

Adding tests for a new subcommand

  1. Argument parsing: add cases to cli/tests/arguments.rs following the existing pattern.
  2. Help text: cli/tests/help.rs auto-discovers all subcommands by walking the clap tree — no changes needed.
  3. Read-only output: if the fixture image has suitable content, add snapshot tests to commands/fixture.rs.
  4. Mutating commands: add tests to commands/live.rs using the RAII helpers.

Use the snap!("description", output) macro for snapshot tests — the description appears in the snapshot file header.

Conventions

The goal is to write idiomatic Rust code that is consistent across the whole codebase. btrfsutils spans several crates with different roles (kernel interface wrappers, on-disk parsers, CLI tools) and each has its own patterns. Following these conventions makes it easier to navigate unfamiliar code and to understand what a function or type is responsible for at a glance.

Where possible, lean on the Rust ecosystem rather than reinventing things: uuid for UUIDs, bitflags for flag sets, nix for syscalls and ioctls, anyhow for error context in the CLI. This keeps the code readable to anyone already familiar with those crates.

Naming

Module names are usually generic nouns. For example, in the uapi crate, the ioctl call wrappers are organized by the thing they operate on, and live in modules like filesystem, device, sync.

For the btrfs-cli crate, the module structure mirrors the subcommand hierarchy: the btrfs subvolume create command is implemented in cli/src/subvolume/create.rs.

Types are named with the general concept first: SysfsBtrfs, BlockGroupFlags, BalanceArgs — never BtrfsSysfs.

Functions follow a noun_verb pattern: label_get, label_set — never get_label. Ioctl wrapper functions match the lowercased C macro name: btrfs_ioc_balance_v2.

Avoid abbreviations. For example, use ChecksumType instead of CsumType.

Types

Always prefer proper typed values. For example, use Uuid from the uuid crate, never [u8; 16]. In the CLI, if there is an argument that can take one of multiple options, don’t represent it as a string, but instead create an enum and derive clap::ValueEnum.

Null-terminated kernel strings (labels, device paths) use CString/CStr. Make sure that allocation and deallocation are handled properly.

File descriptors passed to uapi functions use BorrowedFd.

Kernel flag fields use bitflags!, usually with a Display implementation so they can be formatted with {}.

Complex argument structs (BalanceArgs, DefragRangeArgs) use the builder pattern with new(), chained setters, and Default.

Never expose bindgen types (btrfs_ioctl_*) in public uapi APIs, instead create idiomatic Rust structs.

Error handling

In uapi/, almost every function just performs a single syscall, so we return the raw nix::Result<T>. Where possible, list potential error codes and their meanings in the documentation comments.

Map specific errnos to Option or a typed error at the call site where appropriate (e.g. ENODEV → None).

In cli/, mkfs/, and tune/, use anyhow::Result<T> and convert at the uapi boundary with .with_context(). Always include the relevant path or resource in the error message.

Constants

All BTRFS_* constants are available via crate::raw::* in the uapi and disk crates. Unless you have a good reason to, import from crate::raw and don’t define local copies. Size constants like SZ_1M that are not part of the btrfs UAPI headers are the exception; define those locally with a comment.

There should not be any stray constants in the code. For example, use the std::mem::offset_of!() macro or std::mem::size_of::<T>() to compute offsets and sizes, and give any remaining magic constants a name.

Don’t redefine things that are already defined in crate::raw::*.

Parsing on-disk structures

In disk/ and mkfs/, use bytes::Buf for reading and bytes::BufMut for writing on-disk fields. Sequential get_u64_le() / put_u64_le() calls advance the cursor automatically, eliminating manual offset arithmetic. See the Parsing page for details.

In uapi/, tree search results are parsed with explicit offset-based LE readers (read_le_u64, read_le_u32) from uapi/src/util.rs, since those buffers are accessed at known offsets rather than sequentially.

Style

Keep unsafe blocks as small as possible; non-trivial ones get a // SAFETY: comment. For packed structs, copy fields to locals before taking references to avoid misaligned reference UB. Use escape_ascii() when printing byte strings that may be non-UTF-8. Import symbols used more than once rather than qualifying them at every call site (single-use qualified paths are fine).

Shared CLI helpers live in cli/src/util.rs; these include utilities to format sizes, bytes, and times, and to parse various types.

Doc comments

In uapi/, module-level docs start with a # heading describing the module’s purpose. Function docs explain what the function does and why; the ioctl name is a parenthetical in the implementation, not the primary description.

In cli/, don’t put doc comments on subcommand enum variants — clap uses the variant doc in preference to the struct doc, forcing duplication. Don’t use Markdown in clap struct doc comments: wrap_help reflows all text and destroys formatting. Use plain prose paragraphs instead.

Btrfs On-Disk Format Specification

This document describes the binary layout of btrfs on-disk structures as understood from the parser in disk/src/ and the serializer in mkfs/src/. All multi-byte integer fields are little-endian. All byte offsets in this document are zero-based unless noted otherwise.

Kernel header names are referenced in parentheses where helpful (e.g. btrfs_super_block, btrfs_header). The authoritative source is the Linux kernel UAPI headers btrfs.h and btrfs_tree.h.

Conventions used in this document:

  • “LE u64” means a 64-bit unsigned integer stored in little-endian byte order.
  • Byte offsets are from the start of the enclosing structure.
  • Field sizes are in bytes unless noted otherwise.
  • “Logical address” refers to an address in btrfs’s virtual address space, which must be resolved to a physical device offset via the chunk tree.
  • “Physical address” refers to a byte offset on a specific block device.

Overview

Btrfs is a copy-on-write (COW) B-tree filesystem. All persistent data is organized into B-trees, and all B-trees share a single logical address space that is mapped to physical device locations through a chunk/stripe layer.

Architecture: trees within trees

The fundamental architecture is “trees within trees”:

  • The superblock (at fixed offsets on disk) bootstraps access to the chunk tree and root tree.
  • The chunk tree maps logical addresses to physical device locations. A small subset of the chunk tree is embedded in the superblock to bootstrap access to the full tree.
  • The root tree is the directory of all other trees: it contains a ROOT_ITEM for each tree, pointing to that tree’s root block.
  • Content trees (FS tree, extent tree, checksum tree, etc.) store the actual filesystem data and metadata.

Copy-on-write semantics

Every modification creates new copies of affected blocks (COW), from the modified leaf up through the root of the tree. The final step atomically updates the superblock to point to the new root tree root. This ensures crash consistency without a journal: at any point, the last successfully written superblock points to a fully consistent tree hierarchy.

The COW property means that tree blocks are never modified in place. Instead:

  1. The leaf containing the modified item is written to a new location.
  2. The parent node’s key-pointer is updated to reference the new leaf, and the parent is written to a new location.
  3. This propagates up to the tree root.
  4. The root tree’s ROOT_ITEM is updated with the new root block address.
  5. The root tree itself is COWed up to its root.
  6. The superblock is written with the new root tree root address.

The generation counter is incremented with each transaction. All blocks written in a transaction share the same generation number.

Shared format

All trees share the same block format (header + items or key-pointers) and the same key structure (objectid, type, offset). The block size (nodesize) is uniform across the filesystem, typically 16384 bytes. The sectorsize (typically 4096 bytes) is the minimum I/O unit for data.

Multi-device support

Btrfs supports multiple devices in a single filesystem. The chunk tree maps logical addresses to physical offsets on specific devices. RAID profiles (SINGLE, DUP, RAID0, RAID1, RAID5, RAID6, RAID10, RAID1C3, RAID1C4) determine how chunks are distributed across devices.

Bootstrap sequence

Reading a btrfs filesystem from a raw device follows this sequence:

  1. Read the superblock at offset 64 KiB (try mirrors if primary fails).
  2. Parse sys_chunk_array from the superblock to seed the chunk cache with system chunk mappings.
  3. Resolve chunk_root through the chunk cache to a physical address.
  4. Read the chunk tree root block and all chunk items to populate the full chunk cache.
  5. Resolve root (root tree root) through the chunk cache.
  6. Read the root tree to discover all other trees via ROOT_ITEM entries.
  7. Access any tree by resolving its root block address through the chunk cache.

Superblock

The superblock (btrfs_super_block) is a 4096-byte structure stored at fixed offsets on each device. It is the entry point for reading the filesystem.

Mirror locations

Three copies (mirrors) of the superblock are maintained:

Mirror  Offset        Decimal
0       0x10000       65536 (64 KiB)
1       0x4000000     67108864 (64 MiB)
2       0x4000000000  274877906944 (256 GiB)

Mirror 0 is always present. Mirrors 1 and 2 are written only if the device is large enough. The offsets are computed as:

mirror 0:  64 KiB
mirror i:  16 KiB << (12 * i)    for i > 0

On read, all mirrors present on the device are checked and the one with the highest valid generation is used.

Binary layout

Field                      Offset  Size  Notes
csum                       0       32    Checksum of bytes 32..4095
fsid                       32      16    Filesystem UUID (shared across devices)
bytenr                     48      8     Physical offset of this superblock copy
flags                      56      8     BTRFS_SUPER_FLAG_* flags
magic                      64      8     0x4D5F53665248425F (_BHRfS_M LE)
generation                 72      8     Transaction generation counter
root                       80      8     Logical bytenr of root tree root
chunk_root                 88      8     Logical bytenr of chunk tree root
log_root                   96      8     Logical bytenr of log tree root (0 if none)
__unused_log_root_transid  104     8     Reserved, formerly log_root_transid
total_bytes                112     8     Total usable bytes across all devices
bytes_used                 120     8     Total bytes used by data and metadata
root_dir_objectid          128     8     Objectid of root directory (always 6)
num_devices                136     8     Number of devices in this filesystem
sectorsize                 144     4     Minimum I/O alignment (typically 4096)
nodesize                   148     4     Tree block size in bytes (typically 16384)
__unused_leafsize          152     4     Legacy field, equal to nodesize
stripesize                 156     4     Stripe size for RAID (typically 65536)
sys_chunk_array_size       160     4     Valid bytes in sys_chunk_array
chunk_root_generation      164     8     Generation of the chunk tree root
compat_flags               172     8     Compatible feature flags
compat_ro_flags            180     8     Compatible read-only feature flags
incompat_flags             188     8     Incompatible feature flags
csum_type                  196     2     Checksum algorithm (0=CRC32C, 1=xxHash, 2=SHA256, 3=BLAKE2)
root_level                 198     1     B-tree level of root tree root
chunk_root_level           199     1     B-tree level of chunk tree root
log_root_level             200     1     B-tree level of log tree root
dev_item                   201     98    Embedded btrfs_dev_item for this device
label                      299     256   Filesystem label (NUL-terminated, max 255 chars)
cache_generation           555     8     Generation of free space cache (v1)
uuid_tree_generation       563     8     Generation of UUID tree
metadata_uuid              571     16    Metadata UUID (when METADATA_UUID incompat set)
nr_global_roots            587     8     Number of global roots (extent-tree-v2)
(reserved fields)          595     216   Zero-filled up to sys_chunk_array (u64[27])
sys_chunk_array            811     2048  Bootstrap chunk items
super_roots[4]             2859    672   Four rotating backup root entries (168 bytes each)
(padding)                  3531    565   Zero-filled to 4096 bytes

Total: 4096 bytes (BTRFS_SUPER_INFO_SIZE).

System chunk array bootstrap

The sys_chunk_array field embeds a subset of the chunk tree sufficient to locate the full chunk tree on disk. It contains a sequence of (disk_key, chunk_item) pairs:

For each entry:
  17 bytes   btrfs_disk_key     (objectid, type, offset) -- offset = logical addr
  variable   btrfs_chunk        Chunk item (see Section 8.9)

The array is parsed sequentially until sys_chunk_array_size bytes are consumed. These entries typically contain the SYSTEM chunk(s) that map the chunk tree and root tree blocks.

Backup roots

The super_roots array contains four rotating backup copies of critical tree root pointers. The kernel updates one entry per transaction, cycling through indices 0-3. Each backup root entry (btrfs_root_backup) is 168 bytes:

Field              Offset  Size  Notes
tree_root          0       8     Logical bytenr of root tree root
tree_root_gen      8       8     Generation of root tree root
chunk_root         16      8     Logical bytenr of chunk tree root
chunk_root_gen     24      8     Generation of chunk tree root
extent_root        32      8     Logical bytenr of extent tree root
extent_root_gen    40      8     Generation of extent tree root
fs_root            48      8     Logical bytenr of FS tree root
fs_root_gen        56      8     Generation of FS tree root
dev_root           64      8     Logical bytenr of device tree root
dev_root_gen       72      8     Generation of device tree root
csum_root          80      8     Logical bytenr of checksum tree root
csum_root_gen      88      8     Generation of checksum tree root
total_bytes        96      8     Total filesystem bytes at backup time
bytes_used         104     8     Bytes used at backup time
num_devices        112     8     Number of devices at backup time
(reserved)         120     32    Unused u64[4]
tree_root_level    152     1     B-tree level of root tree root
chunk_root_level   153     1     B-tree level of chunk tree root
extent_root_level  154     1     B-tree level of extent tree root
fs_root_level      155     1     B-tree level of FS tree root
dev_root_level     156     1     B-tree level of device tree root
csum_root_level    157     1     B-tree level of checksum tree root
(padding)          158     10    Unused bytes to 168 total

Superblock checksum

The checksum field (csum, bytes 0..31) covers everything from byte 32 through byte 4095 (inclusive). For CRC32C, the 4-byte result is stored little-endian at bytes 0..3 and bytes 4..31 are zeroed.

The magic number _BHRfS_M (hex 0x4D5F53665248425F) must be present at offset 64 for a valid superblock.

Superblock validity is determined by checking both magic and checksum match. When multiple valid mirrors exist, the one with the highest generation is used.

Tree Block Format

Every B-tree block (node or leaf) is exactly nodesize bytes. The block begins with a 101-byte header (btrfs_header), followed by either item descriptors (leaves) or key-pointer entries (nodes).

Field            Offset  Size  Notes
csum             0       32    Checksum of bytes 32..nodesize-1
fsid             32      16    Filesystem UUID (must match superblock)
bytenr           48      8     Logical byte offset of this block
flags            56      8     Header flags (lower 56 bits) + backref rev (upper 8 bits)
chunk_tree_uuid  64      16    UUID of the chunk tree mapping this block
generation       80      8     Transaction generation when last written
owner            88      8     Objectid of the tree owning this block
nritems          96      4     Number of items (leaf) or key-pointers (node)
level            100     1     0 = leaf, >0 = internal node

Total header size: 101 bytes.

The flags field combines two values:

  • Bits 0-55: block flags (BTRFS_HEADER_FLAG_WRITTEN = 1, BTRFS_HEADER_FLAG_RELOC = 2)
  • Bits 56-63: backref revision (BTRFS_MIXED_BACKREF_REV = 1 for modern filesystems)

The header checksum covers bytes 32 through nodesize - 1. For CRC32C, the result is stored as a 4-byte LE value at bytes 0..3 with bytes 4..31 zeroed.

Leaf vs node distinction

The level field determines the block type:

  • level == 0: leaf block, containing items
  • level > 0: internal node, containing key-pointers to child blocks

The maximum tree depth is bounded by the number of key-pointers that fit in a node. For a 16 KiB nodesize, a node holds up to:

max_ptrs = (nodesize - HEADER_SIZE) / KEY_PTR_SIZE
         = (16384 - 101) / 33
         = 493 key-pointers

With 493 children per node, a tree of depth 2 (root node + leaf) can hold 493 * 651 = ~320,000 items. A tree of depth 3 can hold 493^2 * 651 = ~158 million items. In practice, trees rarely exceed depth 3 or 4.

Leaf Format

A leaf block (level 0) contains sorted item descriptors followed by a data area. Item descriptors grow forward from the header; item data grows backward from the end of the block.

+-------------------------------------------+
| Header (101 bytes)                        |
+-------------------------------------------+
| Item descriptor 0  (25 bytes)             |
| Item descriptor 1  (25 bytes)             |
| ...                                       |
| Item descriptor N-1 (25 bytes)            |
+-------------------------------------------+
| (free space)                              |
+-------------------------------------------+
| Item data N-1                             |
| ...                                       |
| Item data 1                               |
| Item data 0                               |
+-------------------------------------------+

Item descriptor

Each item descriptor (btrfs_item) is 25 bytes:

Field        Offset  Size  Notes
objectid     0       8     Key objectid (LE u64)
type         8       1     Key type byte (u8)
offset       9       8     Key offset (LE u64)
data_offset  17      4     Byte offset of item data from end of header (LE u32)
data_size    21      4     Size of item data in bytes (LE u32)

The first 17 bytes form a btrfs_disk_key. The data_offset field is relative to the start of the leaf data area, which begins immediately after the header. To locate item data in the raw block buffer:

absolute_offset = HEADER_SIZE + data_offset

where HEADER_SIZE = 101 bytes.

Data area layout

Item data is packed from the end of the block backward. The first item pushed has its data at the highest offset; subsequent items have data at progressively lower offsets. This means:

  • Item descriptors grow forward: HEADER_SIZE + i * 25
  • Item data grows backward: starting from nodesize and moving toward the descriptor area

The free space in a leaf is the gap between the end of the last descriptor and the start of the earliest (lowest-offset) item data.

Offset bookkeeping

When building a leaf (as the mkfs LeafBuilder does), the bookkeeping works as follows:

Initial state:
  item_offset = HEADER_SIZE (101)    // next descriptor position
  data_end    = nodesize (16384)     // next data write position

For each item pushed (key, data[N bytes]):
  1. data_end -= N                   // reserve space for item data
  2. Write data at buf[data_end .. data_end + N]
  3. data_offset = data_end - HEADER_SIZE   // relative to header end
  4. Write descriptor at buf[item_offset]:
       key (17 bytes) + data_offset (LE u32) + data_size (LE u32)
  5. item_offset += 25               // advance to next descriptor slot

The available space for additional items is:

space_left = data_end - (item_offset + ITEM_SIZE)

This must accommodate both the 25-byte descriptor and the item data.

Key ordering invariant

Items within a leaf are sorted by their keys in lexicographic order: first by objectid, then by type, then by offset. This invariant is maintained by the B-tree insertion logic and verified by btrfs check.

Capacity

For a 16384-byte leaf, the maximum number of items depends on their data sizes. With zero-length data items (such as TREE_BLOCK_REF or FREE_SPACE_EXTENT), the theoretical maximum is:

max_items = (nodesize - HEADER_SIZE) / ITEM_SIZE
          = (16384 - 101) / 25
          = 651 items

In practice, most items have data payloads that reduce this number significantly.

Node Format

An internal node (level > 0) contains sorted key-pointer entries (btrfs_key_ptr). Each entry points to a child block and records the lowest key in that child’s subtree.

Key-pointer entry

Each key-pointer (btrfs_key_ptr) is 33 bytes:

Field       Offset  Size  Notes
objectid    0       8     Key objectid (LE u64)
type        8       1     Key type byte (u8)
offset      9       8     Key offset (LE u64)
blockptr    17      8     Logical byte address of child block (LE u64)
generation  25      8     Generation of the child block (LE u64)

The first 17 bytes form the btrfs_disk_key representing the lowest key in the child subtree. The generation field is used for consistency checks: when reading the child block, its header generation must match this value.

Layout

+-------------------------------------------+
| Header (101 bytes)                        |
+-------------------------------------------+
| Key-pointer 0  (33 bytes)                 |
| Key-pointer 1  (33 bytes)                 |
| ...                                       |
| Key-pointer N-1 (33 bytes)                |
+-------------------------------------------+
| (unused space to nodesize)                |
+-------------------------------------------+

Key-pointers are sorted by their key in the same lexicographic order as leaf items. The child block referenced by key-pointer i contains all items with keys >= key-pointer[i].key and < key-pointer[i+1].key (or unbounded above for the last pointer).

Key Structure

Every item and key-pointer is addressed by a three-part key (btrfs_disk_key):

Field     Offset  Size  Notes
objectid  0       8     LE u64
type      8       1     u8
offset    9       8     LE u64

Total: 17 bytes.

Lexicographic ordering

Keys are compared as a tuple (objectid, type, offset) in that order. The objectid is compared first; on a tie, type is compared; on a further tie, offset breaks the tie. All comparisons are unsigned integer comparisons.

Field semantics by tree

The meaning of the three key fields varies depending on the tree and item type:

FS tree:

  • objectid = inode number (starting at 256 = BTRFS_FIRST_FREE_OBJECTID)
  • type = item type (INODE_ITEM, DIR_ITEM, EXTENT_DATA, etc.)
  • offset = type-dependent (0 for INODE_ITEM, name hash for DIR_ITEM, file byte offset for EXTENT_DATA, parent inode for INODE_REF, etc.)

Root tree:

  • objectid = tree objectid (e.g. 5 for FS_TREE, 256+ for subvolumes)
  • type = ROOT_ITEM, ROOT_REF, or ROOT_BACKREF
  • offset = 0 for ROOT_ITEM, child/parent tree ID for refs

Extent tree:

  • objectid = logical byte address of the extent
  • type = EXTENT_ITEM, METADATA_ITEM, or backref type
  • offset = extent length (EXTENT_ITEM), level (METADATA_ITEM), or backref-specific (root objectid, parent bytenr, hash)

Chunk tree:

  • objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID (256) for CHUNK_ITEM
  • type = CHUNK_ITEM
  • offset = logical byte address of the chunk

Device tree:

  • objectid = device ID for DEV_EXTENT; BTRFS_DEV_ITEMS_OBJECTID (1) for DEV_ITEM
  • type = DEV_EXTENT or DEV_ITEM
  • offset = physical offset for DEV_EXTENT; device ID for DEV_ITEM

Checksum tree:

  • objectid = BTRFS_EXTENT_CSUM_OBJECTID
  • type = EXTENT_CSUM
  • offset = logical byte address of the first checksummed sector

Free space tree:

  • objectid = block group logical offset (for FREE_SPACE_INFO) or extent start (for FREE_SPACE_EXTENT/BITMAP)
  • type = FREE_SPACE_INFO, FREE_SPACE_EXTENT, or FREE_SPACE_BITMAP
  • offset = block group length (for INFO) or extent length (for EXTENT/BITMAP)

UUID tree:

  • objectid = upper 8 bytes of UUID interpreted as LE u64
  • type = UUID_KEY_SUBVOL or UUID_KEY_RECEIVED_SUBVOL
  • offset = lower 8 bytes of UUID interpreted as LE u64

Quota tree:

  • objectid = packed qgroupid (level << 48) | subvolid
  • type = QGROUP_STATUS, QGROUP_INFO, QGROUP_LIMIT, QGROUP_RELATION
  • offset = packed qgroupid for relations, 0 otherwise

Key type values

Value  Name                      Description
1      INODE_ITEM_KEY            Inode metadata (mode, size, timestamps, nlink)
12     INODE_REF_KEY             Link from inode to parent directory (name + index)
13     INODE_EXTREF_KEY          Extended inode ref for names exceeding INODE_REF capacity
24     XATTR_ITEM_KEY            Extended attribute (name + value, keyed by name hash)
36     VERITY_DESC_ITEM_KEY      fs-verity descriptor
37     VERITY_MERKLE_ITEM_KEY    fs-verity Merkle tree data
48     ORPHAN_ITEM_KEY           Orphan inode pending cleanup
60     DIR_LOG_ITEM_KEY          Directory log for fsync optimization
72     DIR_LOG_INDEX_KEY         Directory log index
84     DIR_ITEM_KEY              Directory entry keyed by crc32c(name) hash
96     DIR_INDEX_KEY             Directory entry keyed by sequential index
108    EXTENT_DATA_KEY           File extent (inline data or reference to disk extent)
128    EXTENT_CSUM_KEY           Data checksum covering one or more sectors
132    ROOT_ITEM_KEY             Tree root descriptor (bytenr, generation, UUID, timestamps)
144    ROOT_BACKREF_KEY          Backref from child subvolume to parent
156    ROOT_REF_KEY              Forward ref from parent subvolume to child
168    EXTENT_ITEM_KEY           Extent allocation with backrefs (non-skinny: offset = size)
169    METADATA_ITEM_KEY         Skinny metadata extent (offset = level, not size)
172    EXTENT_OWNER_REF_KEY      Simple quota owner backref
176    TREE_BLOCK_REF_KEY        Standalone backref: metadata extent → owning tree
178    EXTENT_DATA_REF_KEY       Standalone backref: data extent → (root, ino, offset)
182    SHARED_BLOCK_REF_KEY      Shared metadata backref (parent block address)
184    SHARED_DATA_REF_KEY       Shared data backref (parent block address + count)
192    BLOCK_GROUP_ITEM_KEY      Block group allocation info (used bytes, type, profile)
198    FREE_SPACE_INFO_KEY       Free space tree: per-block-group metadata
199    FREE_SPACE_EXTENT_KEY     Free space tree: free extent range
200    FREE_SPACE_BITMAP_KEY     Free space tree: bitmap of free sectors
204    DEV_EXTENT_KEY            Physical extent allocated to a chunk on a device
216    DEV_ITEM_KEY              Device descriptor (size, UUID, I/O parameters)
228    CHUNK_ITEM_KEY            Chunk mapping logical → physical with stripe info
230    RAID_STRIPE_KEY           RAID stripe tree entry (zoned devices)
240    QGROUP_STATUS_KEY         Quota group global status and generation
242    QGROUP_INFO_KEY           Per-qgroup usage counters (referenced, exclusive)
244    QGROUP_LIMIT_KEY          Per-qgroup size limits
246    QGROUP_RELATION_KEY       Parent-child relationship between qgroups
248    TEMPORARY_ITEM_KEY        Transient item; also used as BALANCE_ITEM_KEY
249    PERSISTENT_ITEM_KEY       Persistent metadata; also used as DEV_STATS_KEY
250    DEV_REPLACE_KEY           Device replace operation state
251    UUID_KEY_SUBVOL           UUID tree: maps subvolume UUID → subvolume ID
252    UUID_KEY_RECEIVED_SUBVOL  UUID tree: maps received UUID → subvolume ID
253    STRING_ITEM_KEY           Label or other string metadata

Well-known objectid values

| Value | Name | Notes |
|---|---|---|
| 1 | ROOT_TREE_OBJECTID | Root tree |
| 2 | EXTENT_TREE_OBJECTID | Extent tree |
| 3 | CHUNK_TREE_OBJECTID | Chunk tree |
| 4 | DEV_TREE_OBJECTID | Device tree |
| 5 | FS_TREE_OBJECTID | Default FS tree |
| 6 | ROOT_TREE_DIR_OBJECTID | Root tree directory |
| 7 | CSUM_TREE_OBJECTID | Checksum tree |
| 8 | QUOTA_TREE_OBJECTID | Quota tree |
| 9 | UUID_TREE_OBJECTID | UUID tree |
| 10 | FREE_SPACE_TREE_OBJECTID | Free space tree |
| 11 | BLOCK_GROUP_TREE_OBJECTID | Block group tree |
| 12 | RAID_STRIPE_TREE_OBJECTID | RAID stripe tree |
| 256 | FIRST_FREE_OBJECTID | First user inode / first subvolume ID |
| (u64)-4 | BALANCE_OBJECTID | Balance status |
| (u64)-5 | ORPHAN_OBJECTID | Orphan items |
| (u64)-6 | TREE_LOG_OBJECTID | Tree log |
| (u64)-7 | TREE_LOG_FIXUP_OBJECTID | Tree log fixup |
| (u64)-8 | TREE_RELOC_OBJECTID | Tree relocation |
| (u64)-9 | DATA_RELOC_TREE_OBJECTID | Data relocation tree |
| (u64)-10 | EXTENT_CSUM_OBJECTID | Extent checksums |
| (u64)-11 | FREE_SPACE_OBJECTID | Free space cache (v1) |
| (u64)-12 | FREE_INO_OBJECTID | Free inode number tracking |
| (u64)-255 | MULTIPLE_OBJECTIDS | Multiple-owner sentinel |

Negative objectids are stored as their unsigned 64-bit two’s complement representation. For example, BALANCE_OBJECTID = -4 is stored as 0xFFFFFFFF_FFFFFFFC.
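
The stored value is simply the signed constant cast to u64; a minimal Rust sketch (the function name is illustrative):

```rust
/// Special objectids are negative i64 constants stored as their
/// unsigned two's complement representation.
fn special_objectid(v: i64) -> u64 {
    v as u64
}

fn main() {
    // BALANCE_OBJECTID = -4
    assert_eq!(special_objectid(-4), 0xFFFF_FFFF_FFFF_FFFC);
    // MULTIPLE_OBJECTIDS = -255
    assert_eq!(special_objectid(-255), 0xFFFF_FFFF_FFFF_FF01);
}
```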

Trees

Btrfs uses multiple B-trees, each identified by a well-known objectid. The root tree stores a ROOT_ITEM for each tree, pointing to its root block.

Root tree (objectid 1)

The directory of all other trees. Contains:

  • ROOT_ITEM for each tree (objectid = tree ID, type = ROOT_ITEM, offset = 0)
  • ROOT_REF for parent-to-child subvolume links
  • ROOT_BACKREF for child-to-parent subvolume links
  • ROOT_TREE_DIR directory entry linking to the default subvolume
  • TEMPORARY_ITEM for balance status persistence
  • PERSISTENT_ITEM for device statistics and replace status

Extent tree (objectid 2)

Tracks all allocated space (data extents and metadata tree blocks) with reference counting and backreferences. Contains:

  • EXTENT_ITEM for data and non-skinny metadata extents
  • METADATA_ITEM for skinny metadata extents
  • TREE_BLOCK_REF for direct metadata backrefs
  • SHARED_BLOCK_REF for shared metadata backrefs (snapshots)
  • EXTENT_DATA_REF for direct data backrefs
  • SHARED_DATA_REF for shared data backrefs (snapshots)
  • BLOCK_GROUP_ITEM for each block group (unless block_group_tree feature)

Chunk tree (objectid 3)

Maps logical address ranges to physical device stripes. Contains:

  • CHUNK_ITEM for each chunk (logical-to-physical mapping)
  • DEV_ITEM for each device

The chunk tree is bootstrapped from the superblock’s sys_chunk_array.

Device tree (objectid 4)

Tracks per-device physical extent allocations. Contains:

  • DEV_EXTENT for each allocated physical range on each device

FS tree (objectid 5, 256+)

Holds the filesystem content for a subvolume. The default subvolume uses objectid 5; additional subvolumes and snapshots use objectids starting at 256. Contains:

  • INODE_ITEM for each inode
  • INODE_REF / INODE_EXTREF for hard links
  • DIR_ITEM for directory entries (keyed by name hash)
  • DIR_INDEX for directory entries (keyed by sequence number)
  • EXTENT_DATA for file extent descriptors
  • XATTR_ITEM for extended attributes
  • ORPHAN_ITEM for unlinked but still open inodes

Checksum tree (objectid 7)

Stores per-sector data checksums. Contains:

  • EXTENT_CSUM items: each item covers a contiguous range of data sectors, storing an array of per-sector checksums

Quota tree (objectid 8)

Tracks quota group accounting. Contains:

  • QGROUP_STATUS (one per filesystem)
  • QGROUP_INFO for each qgroup
  • QGROUP_LIMIT for each qgroup with limits
  • QGROUP_RELATION for parent-child qgroup relationships

UUID tree (objectid 9)

Provides fast UUID-to-subvolume lookups for send/receive. Contains:

  • UUID_KEY_SUBVOL mapping subvolume UUID to objectid
  • UUID_KEY_RECEIVED_SUBVOL mapping received UUID to objectid

Free space tree (objectid 10)

Tracks free space per block group, replacing the older free space cache (v1). Contains:

  • FREE_SPACE_INFO for each block group
  • FREE_SPACE_EXTENT for free ranges
  • FREE_SPACE_BITMAP for bitmap-tracked regions

Requires the free_space_tree compat_ro feature flag.

Block group tree (objectid 11)

Separates block group items from the extent tree for faster mount times. Contains:

  • BLOCK_GROUP_ITEM for each block group

Requires the block_group_tree compat_ro feature flag. When this tree is absent, block group items live in the extent tree.

Data relocation tree (objectid (u64)-9)

A temporary FS tree used during balance to hold relocated data extents. Uses the same item types as a regular FS tree.

RAID stripe tree (objectid 12)

Maps logical extents to per-device physical stripe offsets. Contains:

  • RAID_STRIPE items

Requires the raid_stripe_tree incompat feature flag.

Item Types

This section documents the key format and payload layout for each major item type.

INODE_ITEM (type 1)

Key: (inode_number, INODE_ITEM, 0)

Exactly one per inode. Stores POSIX attributes, timestamps, and btrfs-specific flags.

Payload (btrfs_inode_item, 160 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Generation when created |
| transid | 8 | 8 | Transaction ID of last modification |
| size | 16 | 8 | Logical file size in bytes |
| nbytes | 24 | 8 | On-disk bytes used (all copies) |
| block_group | 32 | 8 | Block group hint for new allocations |
| nlink | 40 | 4 | Hard link count |
| uid | 44 | 4 | Owner user ID |
| gid | 48 | 4 | Owner group ID |
| mode | 52 | 4 | POSIX file mode (type + permissions) |
| rdev | 56 | 8 | Device number (char/block device inodes) |
| flags | 64 | 8 | Inode flags (see below) |
| sequence | 72 | 8 | NFS-compatible change sequence number |
| reserved | 80 | 32 | Reserved u64[4], must be zero |
| atime | 112 | 12 | Access time (btrfs_timespec) |
| ctime | 124 | 12 | Change time (btrfs_timespec) |
| mtime | 136 | 12 | Modification time (btrfs_timespec) |
| otime | 148 | 12 | Creation time (btrfs_timespec) |

Each btrfs_timespec is 12 bytes:

| Field | Offset | Size | Notes |
|---|---|---|---|
| sec | 0 | 8 | Seconds since Unix epoch (LE u64) |
| nsec | 8 | 4 | Nanosecond component, 0..999999999 (LE u32) |

Inode flags (bitmask):

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | NODATASUM |
| 1 | 0x2 | NODATACOW |
| 2 | 0x4 | READONLY |
| 3 | 0x8 | NOCOMPRESS |
| 4 | 0x10 | PREALLOC |
| 5 | 0x20 | SYNC |
| 6 | 0x40 | IMMUTABLE |
| 7 | 0x80 | APPEND |
| 8 | 0x100 | NODUMP |
| 9 | 0x200 | NOATIME |
| 10 | 0x400 | DIRSYNC |
| 11 | 0x800 | COMPRESS |
| 20 | 0x100000 | ROOT_ITEM_INIT |

INODE_REF (type 12)

Key: (inode_number, INODE_REF, parent_dir_inode)

Hard-link reference from an inode to a directory entry. Multiple refs can be packed into a single item when an inode has several hard links in the same parent directory.

Payload (variable, packed sequence of entries):

For each ref:

| Field | Offset | Size | Notes |
|---|---|---|---|
| index | 0 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 8 | 2 | Length of name in bytes (LE u16) |
| name | 10 | name_len | Filename bytes (no NUL terminator) |

Multiple refs are concatenated without padding.
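
The packed layout above can be decoded with a simple loop; a sketch (error handling omitted, so malformed input would panic rather than return an error):

```rust
/// One decoded INODE_REF entry, per the field layout above.
#[derive(Debug, PartialEq)]
struct InodeRef {
    index: u64,
    name: Vec<u8>,
}

/// Parse a packed sequence of INODE_REF entries from item data.
/// Illustrative sketch; real code should bounds-check and return errors.
fn parse_inode_refs(mut data: &[u8]) -> Vec<InodeRef> {
    let mut refs = Vec::new();
    while data.len() >= 10 {
        let index = u64::from_le_bytes(data[0..8].try_into().unwrap());
        let name_len = u16::from_le_bytes(data[8..10].try_into().unwrap()) as usize;
        let name = data[10..10 + name_len].to_vec();
        refs.push(InodeRef { index, name });
        data = &data[10 + name_len..]; // entries are packed without padding
    }
    refs
}

fn main() {
    // Two refs packed back to back: ("a", index 2) then ("bc", index 3).
    let mut item = Vec::new();
    item.extend_from_slice(&2u64.to_le_bytes());
    item.extend_from_slice(&1u16.to_le_bytes());
    item.extend_from_slice(b"a");
    item.extend_from_slice(&3u64.to_le_bytes());
    item.extend_from_slice(&2u16.to_le_bytes());
    item.extend_from_slice(b"bc");

    let refs = parse_inode_refs(&item);
    assert_eq!(refs.len(), 2);
    assert_eq!(refs[0].index, 2);
    assert_eq!(refs[1].name, b"bc");
}
```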

INODE_EXTREF (type 13)

Key: (inode_number, INODE_EXTREF, crc32c(parent_ino, name))

Extended inode reference. Unlike INODE_REF, the parent inode is stored in the struct, allowing references from different parent directories. Requires the extended_iref incompat feature.

Payload (variable, packed sequence):

For each ref:

| Field | Offset | Size | Notes |
|---|---|---|---|
| parent | 0 | 8 | Parent directory inode number (LE u64) |
| index | 8 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 16 | 2 | Length of name (LE u16) |
| name | 18 | name_len | Filename bytes |

DIR_ITEM (type 84) / DIR_INDEX (type 96)

Key for DIR_ITEM: (dir_inode, DIR_ITEM, crc32c(name))

Key for DIR_INDEX: (dir_inode, DIR_INDEX, sequence)

Both use the same on-disk format. DIR_ITEM entries are keyed by the CRC32C hash of the filename (raw CRC32C, not standard). DIR_INDEX entries are keyed by a monotonically increasing sequence number for ordered directory iteration.

Multiple entries can be packed into a single DIR_ITEM when names hash to the same value (hash collision).

Payload (btrfs_dir_item, variable, packed sequence):

For each entry:

| Field | Offset | Size | Notes |
|---|---|---|---|
| location | 0 | 17 | Target inode key (btrfs_disk_key) |
| transid | 17 | 8 | Transaction ID (LE u64) |
| data_len | 25 | 2 | Xattr value length, 0 for dirs (LE u16) |
| name_len | 27 | 2 | Filename length (LE u16) |
| type | 29 | 1 | File type (see below) |
| name | 30 | name_len | Filename bytes |
| data | 30+name_len | data_len | Xattr value (for XATTR_ITEM only) |

The location field is a btrfs_disk_key pointing to the target. For regular directory entries, this typically has objectid = target inode, type = INODE_ITEM, offset = 0. For subvolume entries, type = ROOT_ITEM and objectid = the subvolume’s tree objectid.

File type values:

| Value | Name |
|---|---|
| 0 | FT_UNKNOWN |
| 1 | FT_REG_FILE |
| 2 | FT_DIR |
| 3 | FT_CHRDEV |
| 4 | FT_BLKDEV |
| 5 | FT_FIFO |
| 6 | FT_SOCK |
| 7 | FT_SYMLINK |
| 8 | FT_XATTR |

FILE_EXTENT_ITEM (type 108)

Key: (inode_number, EXTENT_DATA, file_byte_offset)

Describes how a range of file bytes maps to on-disk storage. Three extent types exist: inline, regular, and preallocated.

Common header (21 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Allocation generation (LE u64) |
| ram_bytes | 8 | 8 | Uncompressed size (LE u64) |
| compression | 16 | 1 | Compression type (0=none, 1=zlib, 2=lzo, 3=zstd) |
| encryption | 17 | 1 | Reserved (always 0) |
| other_encoding | 18 | 2 | Reserved (always 0) |
| type | 20 | 1 | Extent type (0=inline, 1=regular, 2=prealloc) |

Inline extent (type 0):

After the 21-byte header, the remaining bytes in the item are the file data itself. The data length is item_size - 21. For compressed inline extents, the data is compressed and ram_bytes gives the uncompressed size.

| Field | Offset | Size | Notes |
|---|---|---|---|
| header | 0 | 21 | Common header (type = 0) |
| data | 21 | item_size-21 | Inline file data |

Total item size: 21 + data_length.

Regular extent (type 1) and prealloc extent (type 2):

| Field | Offset | Size | Notes |
|---|---|---|---|
| header | 0 | 21 | Common header (type = 1 or 2) |
| disk_bytenr | 21 | 8 | Logical address of extent on disk (LE u64) |
| disk_num_bytes | 29 | 8 | Size of extent on disk (LE u64) |
| offset | 37 | 8 | Byte offset into extent (LE u64) |
| num_bytes | 45 | 8 | Number of logical file bytes covered (LE u64) |

Total item size: 53 bytes.

A disk_bytenr of 0 indicates a hole (sparse region). For compressed extents, disk_num_bytes is the compressed size on disk and ram_bytes is the uncompressed size. The offset field allows referencing into the middle of a shared extent (e.g., after COW of part of a cloned extent).

Prealloc extents (type 2) are reserved but unwritten; reads return zeroes.
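
Decoding the fixed-size regular/prealloc payload is mostly a matter of reading LE fields at the offsets above; a sketch:

```rust
/// Decoded regular or prealloc file extent (type 1 or 2).
/// Offsets follow the table above.
#[derive(Debug)]
struct FileExtent {
    disk_bytenr: u64,
    disk_num_bytes: u64,
    offset: u64,
    num_bytes: u64,
}

/// Read a LE u64 at `off` from an item buffer.
fn le64(d: &[u8], off: usize) -> u64 {
    u64::from_le_bytes(d[off..off + 8].try_into().unwrap())
}

/// Parse a 53-byte regular extent payload. Sketch only: assumes the
/// 21-byte common header has already been validated as type 1 or 2.
fn parse_regular_extent(item: &[u8; 53]) -> FileExtent {
    FileExtent {
        disk_bytenr: le64(item, 21),
        disk_num_bytes: le64(item, 29),
        offset: le64(item, 37),
        num_bytes: le64(item, 45),
    }
}

fn main() {
    let mut item = [0u8; 53];
    item[20] = 1; // type = regular
    // disk_bytenr left at 0 = hole (sparse region)
    item[45..53].copy_from_slice(&4096u64.to_le_bytes()); // num_bytes

    let ext = parse_regular_extent(&item);
    assert_eq!(ext.disk_bytenr, 0); // hole
    assert_eq!(ext.num_bytes, 4096);
}
```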

EXTENT_ITEM (type 168) / METADATA_ITEM (type 169)

Key for EXTENT_ITEM: (logical_bytenr, EXTENT_ITEM, extent_length)

Key for METADATA_ITEM: (logical_bytenr, METADATA_ITEM, level)

Tracks reference counts and backreferences for allocated space. METADATA_ITEM is the “skinny” variant (when skinny_metadata incompat flag is set): the extent length is implicit (= nodesize) and the key offset stores the tree block level instead.

Base payload (btrfs_extent_item, 24 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| refs | 0 | 8 | Number of references (LE u64) |
| generation | 8 | 8 | Allocation generation (LE u64) |
| flags | 16 | 8 | Extent flags (LE u64) |

Extent flags:

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | EXTENT_FLAG_DATA |
| 1 | 0x2 | EXTENT_FLAG_TREE_BLOCK |
| 8 | 0x100 | BLOCK_FLAG_FULL_BACKREF |

Tree block info (for non-skinny EXTENT_ITEM with TREE_BLOCK flag):

After the base extent item, non-skinny tree block extents include a btrfs_tree_block_info (18 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| key | 24 | 17 | First key in the tree block (btrfs_disk_key) |
| level | 41 | 1 | Tree block level (u8) |

This is absent for skinny metadata items (METADATA_ITEM), where the level is encoded in the key offset.

Inline backreferences:

After the extent item header (and tree_block_info if present), zero or more inline backreferences may be packed. Each starts with a 1-byte type tag followed by type-specific data:

| Type byte | Name | Data after type byte |
|---|---|---|
| 176 (0xB0) | TREE_BLOCK_REF | 8 bytes: root_objectid (LE u64) |
| 182 (0xB6) | SHARED_BLOCK_REF | 8 bytes: parent_bytenr (LE u64) |
| 178 (0xB2) | EXTENT_DATA_REF | 28 bytes: root(8) + objectid(8) + offset(8) + count(4) |
| 184 (0xB8) | SHARED_DATA_REF | 12 bytes: parent_bytenr(8) + count(4) |
| 172 (0xAC) | EXTENT_OWNER_REF | 8 bytes: root_objectid (LE u64) |

Note that for EXTENT_DATA_REF, the 8-byte offset field that normally follows the type byte is absent; the struct fields begin immediately after the type byte:

| Field | Offset | Size | Notes |
|---|---|---|---|
| type | 0 | 1 | 178 (EXTENT_DATA_REF_KEY) |
| root | 1 | 8 | Owning tree objectid (LE u64) |
| objectid | 9 | 8 | Referencing inode number (LE u64) |
| offset | 17 | 8 | File byte offset of reference (LE u64) |
| count | 25 | 4 | Number of references (LE u32) |

For other inline ref types, the format is:

| Field | Offset | Size | Notes |
|---|---|---|---|
| type | 0 | 1 | Type byte (176/182/184/172) |
| offset | 1 | 8 | Type-specific offset (LE u64) |

For SHARED_DATA_REF, an additional 4 bytes follow:

| Field | Offset | Size | Notes |
|---|---|---|---|
| count | 9 | 4 | Number of references (LE u32) |

Standalone backreference items

When backreferences do not fit inline in the extent item, they are stored as separate items in the extent tree:

TREE_BLOCK_REF (type 176): Key: (extent_bytenr, TREE_BLOCK_REF, root_objectid). No data payload; the key offset encodes the owning root.

SHARED_BLOCK_REF (type 182): Key: (extent_bytenr, SHARED_BLOCK_REF, parent_bytenr). No data payload; the key offset encodes the parent block.

EXTENT_DATA_REF (type 178): Key: (extent_bytenr, EXTENT_DATA_REF, hash). The hash is computed from (root, objectid, offset) using two CRC32C passes:

high_crc = raw_crc32c(0xFFFFFFFF, root_le_bytes)
low_crc  = raw_crc32c(0xFFFFFFFF, objectid_le_bytes)
low_crc  = raw_crc32c(low_crc,    offset_le_bytes)
hash     = (high_crc << 31) ^ low_crc
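
The two-pass hash above can be sketched in Rust with a bitwise (table-free) CRC32C where the seed is passed straight through, mirroring the kernel's crc32c_le():

```rust
/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
/// The seed goes straight into the register and the result is NOT
/// inverted -- the "raw" variant used for btrfs hash computations.
fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = (crc >> 1) ^ ((crc & 1) * 0x82F6_3B78);
        }
    }
    crc
}

/// EXTENT_DATA_REF key hash, per the two-pass scheme above.
fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high = raw_crc32c(!0, &root.to_le_bytes());
    let mut low = raw_crc32c(!0, &objectid.to_le_bytes());
    low = raw_crc32c(low, &offset.to_le_bytes());
    ((high as u64) << 31) ^ (low as u64)
}

fn main() {
    // Sanity check: adding the standard inverted-seed/inverted-output
    // convention must yield the well-known CRC32C check value.
    assert_eq!(!raw_crc32c(!0, b"123456789"), 0xE306_9283);
    // Different (root, objectid, offset) tuples hash differently.
    assert_ne!(
        extent_data_ref_hash(5, 257, 0),
        extent_data_ref_hash(5, 257, 4096)
    );
}
```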

Payload (btrfs_extent_data_ref, 28 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| root | 0 | 8 | Owning tree objectid (LE u64) |
| objectid | 8 | 8 | Referencing inode (LE u64) |
| offset | 16 | 8 | File byte offset (LE u64) |
| count | 24 | 4 | Reference count (LE u32) |

SHARED_DATA_REF (type 184): Key: (extent_bytenr, SHARED_DATA_REF, parent_bytenr). Payload (4 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| count | 0 | 4 | Reference count (LE u32) |

EXTENT_OWNER_REF (type 172): Key: (extent_bytenr, EXTENT_OWNER_REF, root_objectid). No data payload. Used with the simple_quota feature.

DEV_ITEM (type 216)

Key: (DEV_ITEMS_OBJECTID [1], DEV_ITEM, devid)

Stored in the chunk tree. Also embedded in the superblock at offset 201.

Payload (btrfs_dev_item, 98 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| total_bytes | 8 | 8 | Total device size (LE u64) |
| bytes_used | 16 | 8 | Bytes allocated on device (LE u64) |
| io_align | 24 | 4 | I/O alignment (LE u32) |
| io_width | 28 | 4 | I/O width (LE u32) |
| sector_size | 32 | 4 | Device sector size (LE u32) |
| type | 36 | 8 | Device type (reserved, 0) (LE u64) |
| generation | 44 | 8 | Generation last updated (LE u64) |
| start_offset | 52 | 8 | Allocation start offset (LE u64) |
| dev_group | 60 | 4 | Device group (reserved, 0) (LE u32) |
| seek_speed | 64 | 1 | Seek speed hint (0 = unset) |
| bandwidth | 65 | 1 | Bandwidth hint (0 = unset) |
| uuid | 66 | 16 | Device UUID |
| fsid | 82 | 16 | Filesystem UUID |

CHUNK_ITEM (type 228)

Key: (FIRST_CHUNK_TREE_OBJECTID [256], CHUNK_ITEM, logical_offset)

Maps a range of logical addresses to physical device locations. Stored in the chunk tree and (for system chunks) in the superblock’s sys_chunk_array.

Payload (btrfs_chunk + stripes, variable):

| Field | Offset | Size | Notes |
|---|---|---|---|
| length | 0 | 8 | Chunk size in bytes (LE u64) |
| owner | 8 | 8 | Owner objectid (LE u64) |
| stripe_len | 16 | 8 | Stripe length (typically 65536) (LE u64) |
| type | 24 | 8 | Chunk type + RAID profile flags (LE u64) |
| io_align | 32 | 4 | I/O alignment (LE u32) |
| io_width | 36 | 4 | I/O width (LE u32) |
| sector_size | 40 | 4 | Sector size (LE u32) |
| num_stripes | 44 | 2 | Number of stripes (LE u16) |
| sub_stripes | 46 | 2 | Sub-stripes for RAID10 (LE u16) |
| stripes[] | 48 | 32 × num_stripes | Array of num_stripes stripe entries |

Each stripe entry (btrfs_stripe, 32 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| offset | 8 | 8 | Physical byte offset on device (LE u64) |
| dev_uuid | 16 | 16 | Device UUID |

Total payload size: 48 + num_stripes * 32 bytes.
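
A stripe-array decoder that enforces the size invariant above might look like this (a sketch; real code would carry richer errors):

```rust
/// Per-stripe mapping parsed from a CHUNK_ITEM payload.
#[derive(Debug, PartialEq)]
struct Stripe {
    devid: u64,
    offset: u64,
    dev_uuid: [u8; 16],
}

/// Parse the stripe array from a CHUNK_ITEM payload.
/// Returns None if the payload size is not exactly 48 + num_stripes * 32.
fn parse_stripes(chunk: &[u8]) -> Option<Vec<Stripe>> {
    let num = u16::from_le_bytes(chunk.get(44..46)?.try_into().ok()?) as usize;
    if chunk.len() != 48 + num * 32 {
        return None; // size invariant violated
    }
    let mut stripes = Vec::with_capacity(num);
    for i in 0..num {
        let s = &chunk[48 + i * 32..48 + (i + 1) * 32];
        stripes.push(Stripe {
            devid: u64::from_le_bytes(s[0..8].try_into().unwrap()),
            offset: u64::from_le_bytes(s[8..16].try_into().unwrap()),
            dev_uuid: s[16..32].try_into().unwrap(),
        });
    }
    Some(stripes)
}

fn main() {
    let mut chunk = vec![0u8; 48 + 32];
    chunk[44..46].copy_from_slice(&1u16.to_le_bytes()); // num_stripes = 1
    chunk[48..56].copy_from_slice(&7u64.to_le_bytes()); // stripe 0: devid = 7
    let stripes = parse_stripes(&chunk).unwrap();
    assert_eq!(stripes.len(), 1);
    assert_eq!(stripes[0].devid, 7);
    // A truncated payload is rejected.
    assert!(parse_stripes(&chunk[..60]).is_none());
}
```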

Chunk type flags (bitmask, same as block group flags):

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | DATA |
| 1 | 0x2 | SYSTEM |
| 2 | 0x4 | METADATA |
| 3 | 0x8 | RAID0 |
| 4 | 0x10 | RAID1 |
| 5 | 0x20 | DUP |
| 6 | 0x40 | RAID10 |
| 7 | 0x80 | RAID5 |
| 8 | 0x100 | RAID6 |
| 9 | 0x200 | RAID1C3 |
| 10 | 0x400 | RAID1C4 |

When no RAID profile bits are set, the chunk is SINGLE profile.

DEV_EXTENT (type 204)

Key: (devid, DEV_EXTENT, physical_offset)

The inverse of a chunk stripe: maps a physical range on a device back to the owning chunk.

Payload (btrfs_dev_extent, 48 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| chunk_tree | 0 | 8 | Chunk tree objectid (always 3) (LE u64) |
| chunk_objectid | 8 | 8 | Chunk objectid (LE u64) |
| chunk_offset | 16 | 8 | Logical offset of owning chunk (LE u64) |
| length | 24 | 8 | Length of this device extent (LE u64) |
| chunk_tree_uuid | 32 | 16 | Chunk tree UUID |

BLOCK_GROUP_ITEM (type 192)

Key: (logical_offset, BLOCK_GROUP_ITEM, length)

Tracks space usage for a chunk. Stored in the extent tree (or block group tree when the block_group_tree feature is enabled).

Payload (btrfs_block_group_item, 24 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| used | 0 | 8 | Bytes used in this block group (LE u64) |
| chunk_objectid | 8 | 8 | Chunk objectid backing this group (LE u64) |
| flags | 16 | 8 | Type + RAID profile flags (LE u64) |

The flags field uses the same bitmask as chunk type flags (Section 8.9).

ROOT_ITEM (type 132)

Key: (tree_objectid, ROOT_ITEM, 0)

Stored in the root tree. Describes a tree root: its block address, generation, subvolume UUIDs, and timestamps.

Payload (btrfs_root_item, 439 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| inode | 0 | 160 | Embedded btrfs_inode_item (root dir inode) |
| generation | 160 | 8 | Generation when last modified (LE u64) |
| root_dirid | 168 | 8 | Root directory inode objectid (LE u64) |
| bytenr | 176 | 8 | Logical bytenr of root block (LE u64) |
| byte_limit | 184 | 8 | Quota byte limit, 0=unlimited (LE u64) |
| bytes_used | 192 | 8 | Bytes used by this tree (LE u64) |
| last_snapshot | 200 | 8 | Generation of last snapshot (LE u64) |
| flags | 208 | 8 | Root flags (LE u64) |
| refs | 216 | 4 | Reference count (LE u32) |
| drop_progress | 220 | 17 | Drop operation progress key (btrfs_disk_key) |
| drop_level | 237 | 1 | Drop operation tree level (u8) |
| level | 238 | 1 | B-tree level of root block (u8) |
| generation_v2 | 239 | 8 | Extended generation (v2) (LE u64) |
| uuid | 247 | 16 | Subvolume UUID |
| parent_uuid | 263 | 16 | Parent subvolume UUID (for snapshots) |
| received_uuid | 279 | 16 | Received UUID (for send/receive) |
| ctransid | 295 | 8 | Last change transaction (LE u64) |
| otransid | 303 | 8 | Creation transaction (LE u64) |
| stransid | 311 | 8 | Send transaction (LE u64) |
| rtransid | 319 | 8 | Receive transaction (LE u64) |
| ctime | 327 | 12 | Change timestamp (btrfs_timespec) |
| otime | 339 | 12 | Creation timestamp (btrfs_timespec) |
| stime | 351 | 12 | Send timestamp (btrfs_timespec) |
| rtime | 363 | 12 | Receive timestamp (btrfs_timespec) |
| reserved | 375 | 64 | Reserved u64[8] |

The embedded inode_item at the start describes the root directory inode (objectid 256 = BTRFS_FIRST_FREE_OBJECTID for FS trees).

Older filesystems may store a shorter v1 root item without the UUID, transaction, and timestamp fields. The parser handles both formats.

Root item flags:

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | SUBVOL_RDONLY (read-only snapshot) |

SUBVOL_DEAD (bit 48, value 0x1000000000000) marks a deleted subvolume pending cleanup.

ROOT_REF (type 156) / ROOT_BACKREF (type 144)

Key for ROOT_REF: (parent_tree_id, ROOT_REF, child_tree_id)

Key for ROOT_BACKREF: (child_tree_id, ROOT_BACKREF, parent_tree_id)

Forward and backward references linking subvolumes to their parent directories. Both use the same on-disk format.

Payload (btrfs_root_ref, 18 bytes + name):

| Field | Offset | Size | Notes |
|---|---|---|---|
| dirid | 0 | 8 | Directory inode containing the subvol entry (LE u64) |
| sequence | 8 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 16 | 2 | Length of name (LE u16) |
| name | 18 | name_len | Subvolume name bytes |

FREE_SPACE_INFO (type 198)

Key: (block_group_offset, FREE_SPACE_INFO, block_group_length)

Metadata about free space tracking for a block group.

Payload (btrfs_free_space_info, 8 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| extent_count | 0 | 4 | Number of free extents/bitmap entries (LE u32) |
| flags | 4 | 4 | Free space info flags (LE u32) |

Flags:

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | USING_BITMAPS |

FREE_SPACE_EXTENT (type 199)

Key: (start, FREE_SPACE_EXTENT, length)

Represents a contiguous free range within a block group. The item has no data payload; the key itself encodes the start address and length.

FREE_SPACE_BITMAP (type 200)

Key: (start, FREE_SPACE_BITMAP, length)

A bitmap covering a portion of a block group’s address range. The item data is the raw bitmap, where each bit represents one sector of space. Bit set = free, bit clear = allocated.

XATTR_ITEM (type 24)

Key: (inode_number, XATTR_ITEM, crc32c(name))

Extended attribute storage. Uses the same on-disk format as DIR_ITEM (Section 8.4), but with:

  • location = zeroed key
  • data_len = length of the xattr value
  • type = FT_XATTR (8)
  • name = xattr name (e.g. user.myattr)
  • data = xattr value

EXTENT_CSUM (type 128)

Key: (EXTENT_CSUM_OBJECTID, EXTENT_CSUM, logical_bytenr)

Stores an array of per-sector checksums for a contiguous range of data blocks. The item data is a packed array of checksums, one per sector.

For CRC32C, each checksum is 4 bytes (LE u32), so the item covers item_size / 4 sectors. The logical byte range covered is:

start = key.offset
end   = key.offset + (item_size / csum_size) * sectorsize
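
The same range computation in Rust (function name is illustrative):

```rust
/// Logical byte range [start, end) covered by one EXTENT_CSUM item,
/// per the formula above.
fn csum_item_range(key_offset: u64, item_size: u64, csum_size: u64, sectorsize: u64) -> (u64, u64) {
    let sectors = item_size / csum_size;
    (key_offset, key_offset + sectors * sectorsize)
}

fn main() {
    // A 400-byte CRC32C item (4-byte csums) covers 100 sectors of 4 KiB.
    let (start, end) = csum_item_range(1_048_576, 400, 4, 4096);
    assert_eq!(start, 1_048_576);
    assert_eq!(end, 1_048_576 + 100 * 4096);
}
```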

QGROUP_STATUS (type 240)

Key: (0, QGROUP_STATUS, 0)

One per filesystem. Tracks the overall state of quota accounting.

Payload (btrfs_qgroup_status_item, 32-40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| version | 0 | 8 | On-disk format version (LE u64) |
| generation | 8 | 8 | Last consistent generation (LE u64) |
| flags | 16 | 8 | Status flags (LE u64) |
| scan | 24 | 8 | Rescan progress objectid (LE u64) |
| enable_gen | 32 | 8 | Enable generation (kernel 6.8+, optional) (LE u64) |

QGROUP_INFO (type 242)

Key: (packed_qgroupid, QGROUP_INFO, 0)

where packed_qgroupid = (level << 48) | subvolid.
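
Packing and unpacking this key value is a pair of shifts and masks; a sketch:

```rust
/// Pack a qgroupid: (level << 48) | subvolid.
fn pack_qgroupid(level: u16, subvolid: u64) -> u64 {
    ((level as u64) << 48) | (subvolid & ((1 << 48) - 1))
}

/// Split a packed qgroupid back into (level, subvolid).
fn unpack_qgroupid(q: u64) -> (u16, u64) {
    ((q >> 48) as u16, q & ((1 << 48) - 1))
}

fn main() {
    // Level-1 qgroup "1/100" and a level-0 qgroup for subvolume 256.
    assert_eq!(pack_qgroupid(1, 100), (1u64 << 48) | 100);
    assert_eq!(unpack_qgroupid(pack_qgroupid(0, 256)), (0, 256));
}
```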

Payload (btrfs_qgroup_info_item, 40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Last update generation (LE u64) |
| referenced | 8 | 8 | Total referenced bytes (LE u64) |
| referenced_compressed | 16 | 8 | Referenced bytes (compressed) (LE u64) |
| exclusive | 24 | 8 | Exclusive bytes (LE u64) |
| exclusive_compressed | 32 | 8 | Exclusive bytes (compressed) (LE u64) |

QGROUP_LIMIT (type 244)

Key: (packed_qgroupid, QGROUP_LIMIT, 0)

Payload (btrfs_qgroup_limit_item, 40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| flags | 0 | 8 | Active limit bitmask (LE u64) |
| max_referenced | 8 | 8 | Max referenced bytes, 0=unlimited (LE u64) |
| max_exclusive | 16 | 8 | Max exclusive bytes, 0=unlimited (LE u64) |
| rsv_referenced | 24 | 8 | Reserved referenced bytes (LE u64) |
| rsv_exclusive | 32 | 8 | Reserved exclusive bytes (LE u64) |

QGROUP_RELATION (type 246)

Key: (child_qgroupid, QGROUP_RELATION, parent_qgroupid)

Defines a parent-child relationship between qgroups. No data payload; the relationship is fully encoded in the key.

UUID_KEY_SUBVOL (type 251) / UUID_KEY_RECEIVED_SUBVOL (type 252)

Key: (upper_half_uuid, UUID_KEY_SUBVOL, lower_half_uuid)

Maps a UUID to one or more subvolume objectids. The UUID is split: the upper 8 bytes are stored as a LE u64 in the objectid field, the lower 8 bytes as a LE u64 in the offset field.

Payload (variable, array of u64):

For each associated subvolume:
  8 bytes   subvolid   Subvolume tree objectid (LE u64)
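
Splitting a 16-byte UUID into the two key halves is a pair of LE reads; a sketch (the first 8 bytes are the "upper half" stored in objectid):

```rust
/// Split a 16-byte UUID into the (objectid, offset) pair used as a
/// UUID-tree key: first 8 bytes and last 8 bytes, each read as LE u64.
fn uuid_tree_key(uuid: &[u8; 16]) -> (u64, u64) {
    let objectid = u64::from_le_bytes(uuid[0..8].try_into().unwrap());
    let offset = u64::from_le_bytes(uuid[8..16].try_into().unwrap());
    (objectid, offset)
}

fn main() {
    let mut uuid = [0u8; 16];
    uuid[0] = 1; // lowest byte of the upper half
    uuid[8] = 2; // lowest byte of the lower half
    assert_eq!(uuid_tree_key(&uuid), (1, 2));
}
```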

STRING_ITEM (type 253)

Key: (BTRFS_FREE_SPACE_OBJECTID, STRING_ITEM, 0)

Raw byte string. Typically stores the filesystem label in the root tree.

Payload: Raw bytes (length = item data size).

TEMPORARY_ITEM (type 248) / BALANCE_ITEM

Key: (BALANCE_OBJECTID, TEMPORARY_ITEM, 0)

Persists in-progress balance state across reboots.

Payload: The first 8 bytes are balance flags (LE u64). The remainder contains btrfs_balance_args structures for data, metadata, and system filters.

PERSISTENT_ITEM (type 249) / DEV_STATS

Key for device stats: (DEV_STATS_OBJECTID [0], PERSISTENT_ITEM, devid)

Key for device replace: (DEV_REPLACE_OBJECTID, DEV_REPLACE, 0)

Device stats payload (40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| write_errs | 0 | 8 | Write error count (LE u64) |
| read_errs | 8 | 8 | Read error count (LE u64) |
| flush_errs | 16 | 8 | Flush error count (LE u64) |
| corruption_errs | 24 | 8 | Corruption error count (LE u64) |
| generation_errs | 32 | 8 | Generation mismatch count (LE u64) |

Device replace payload (btrfs_dev_replace_item, 72+ bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| src_devid | 0 | 8 | Source device ID (LE u64) |
| cursor_left | 8 | 8 | Left cursor position (LE u64) |
| cursor_right | 16 | 8 | Right cursor position (LE u64) |
| replace_mode | 24 | 8 | Replace mode (LE u64) |
| replace_state | 32 | 8 | Current state (LE u64) |
| time_started | 40 | 8 | Start timestamp (LE u64) |
| time_stopped | 48 | 8 | Stop timestamp (LE u64) |
| num_write_errors | 56 | 8 | Write errors (LE u64) |
| num_uncorrectable_read_errors | 64 | 8 | Uncorrectable reads (LE u64) |

ORPHAN_ITEM (type 48)

Key: (ORPHAN_OBJECTID, ORPHAN_ITEM, inode_number)

Marks an inode that has been unlinked but is still open. The item has no data payload. Orphan items are cleaned up on mount or by the kernel’s orphan cleanup thread.

RAID_STRIPE (type 230)

Key: (logical_offset, RAID_STRIPE, length)

Maps logical extents to per-device physical stripe offsets. Requires the raid_stripe_tree incompat feature.

Payload (variable):

| Field | Offset | Size | Notes |
|---|---|---|---|
| encoding | 0 | 8 | RAID encoding type (LE u64) |
| stripes[] | 8 | 16 × n | Array of stripe entries |

Each stripe entry (16 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| physical | 8 | 8 | Physical byte offset (LE u64) |

Checksums

Btrfs uses two distinct CRC32C computation modes:

Standard CRC32C (on-disk structures)

Used for all on-disk checksums: superblocks, tree block headers, and data checksums (EXTENT_CSUM items).

This is CRC32C with the Castagnoli polynomial and the conventional bit processing: seed = 0xFFFFFFFF, with the result XORed with 0xFFFFFFFF. It is equivalent to the standard crc32c() function in most libraries.

checksum = crc32c(data)    // standard CRC32C (Castagnoli)

The 4-byte LE result is stored in the checksum field. For superblocks and tree blocks, the checksum covers everything after the 32-byte csum field to the end of the structure.

Raw CRC32C (hash computations)

Used for internal hash computations where the kernel calls crc32c_le() directly:

  • Name hashes for DIR_ITEM keys (crc32c(name))
  • Name hashes for XATTR_ITEM keys
  • Name hashes for INODE_EXTREF keys
  • extent_data_ref key hash computation
  • Send stream CRC32C

The raw CRC32C passes the seed through without inversion:

raw_crc32c(seed, data) = !crc32c_append(!seed, data)

This is NOT the standard CRC32C convention. The seed is typically 0xFFFFFFFF (i.e. ~0u32), but unlike the standard convention, the output is not inverted.

Supported checksum algorithms

The csum_type field in the superblock selects the algorithm:

| Value | Name | Output size | Notes |
|---|---|---|---|
| 0 | CRC32C | 4 bytes | Default, by far the most common |
| 1 | xxHash64 | 8 bytes | Fast non-cryptographic hash |
| 2 | SHA-256 | 32 bytes | Cryptographic hash |
| 3 | BLAKE2b | 32 bytes | Cryptographic hash (BLAKE2b-256) |

The maximum checksum size is 32 bytes (BTRFS_CSUM_SIZE), which is also the size of the checksum field in headers.

Feature Flags

Feature flags are stored in three fields in the superblock. How an implementation must react to an unknown set flag depends on the field:

  • compat_flags: backward-compatible features; unknown flags can be safely ignored (no flags are currently defined)
  • compat_ro_flags: unknown flags still permit read-only mounting, but not read-write
  • incompat_flags: unknown flags make the filesystem unsafe to mount at all
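
A reader can therefore refuse to open a filesystem whose incompat field contains bits it does not understand. A sketch, where the supported-set constant is illustrative (a real implementation would list exactly the features it handles):

```rust
/// Incompat bits a hypothetical reader implementation understands:
/// MIXED_BACKREF | DEFAULT_SUBVOL | EXTENDED_IREF | SKINNY_METADATA.
const SUPPORTED_INCOMPAT: u64 = 0x1 | 0x2 | 0x40 | 0x100;

/// Returns the set of incompat bits we do not understand; a nonzero
/// result means the filesystem must not be opened, even read-only.
fn unsupported_incompat(incompat_flags: u64) -> u64 {
    incompat_flags & !SUPPORTED_INCOMPAT
}

fn main() {
    // All bits known: ok to proceed.
    assert_eq!(unsupported_incompat(0x1 | 0x100), 0);
    // ZONED (0x1000) is not in our supported set: refuse to open.
    assert_eq!(unsupported_incompat(0x1 | 0x1000), 0x1000);
}
```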

Incompatible feature flags (incompat_flags)

| Bit | Value | Name | Notes |
|---|---|---|---|
| 0 | 0x1 | MIXED_BACKREF | Mixed backref revision (always set on modern fs) |
| 1 | 0x2 | DEFAULT_SUBVOL | A non-default subvolume is the mount target |
| 2 | 0x4 | MIXED_GROUPS | Data and metadata may share block groups |
| 3 | 0x8 | COMPRESS_LZO | LZO compression used |
| 4 | 0x10 | COMPRESS_ZSTD | Zstandard compression used |
| 5 | 0x20 | BIG_METADATA | Metadata blocks > sectorsize (always set when nodesize > sectorsize) |
| 6 | 0x40 | EXTENDED_IREF | Extended inode references (INODE_EXTREF items) |
| 7 | 0x80 | RAID56 | RAID5/6 profiles used |
| 8 | 0x100 | SKINNY_METADATA | Skinny metadata extent refs (METADATA_ITEM instead of EXTENT_ITEM for tree blocks) |
| 9 | 0x200 | NO_HOLES | File extents do not need explicit hole entries |
| 10 | 0x400 | METADATA_UUID | metadata_uuid differs from fsid |
| 11 | 0x800 | RAID1C34 | RAID1C3/RAID1C4 profiles used |
| 12 | 0x1000 | ZONED | Zoned device support |
| 13 | 0x2000 | EXTENT_TREE_V2 | Extent tree v2 (experimental) |
| 14 | 0x4000 | RAID_STRIPE_TREE | RAID stripe tree for stripe mappings |
| 16 | 0x10000 | SIMPLE_QUOTA | Simple quota (per-extent ownership tracking) |
| 17 | 0x20000 | REMAP_TREE | Remap tree (reserved for future use) |

MIXED_BACKREF (bit 0): Indicates the filesystem uses mixed backref format (revision 1). All modern filesystems set this. Old filesystems without it use revision 0 backrefs.

DEFAULT_SUBVOL (bit 1): Set when a non-default subvolume has been configured as the default mount target via btrfs subvolume set-default.

MIXED_GROUPS (bit 2): Allows data and metadata to share the same block group. Unusual; typically used only on very small filesystems.

COMPRESS_LZO (bit 3): Set when any file on the filesystem uses LZO compression. Once set, it is never cleared.

COMPRESS_ZSTD (bit 4): Set when any file uses Zstandard compression.

BIG_METADATA (bit 5): Set when nodesize > sectorsize, allowing metadata blocks to span multiple sectors. Always set on modern filesystems with the typical 16384-byte nodesize and 4096-byte sectorsize.

EXTENDED_IREF (bit 6): Enables INODE_EXTREF items for inodes with hard links from multiple parent directories. Without this, only INODE_REF is used (keyed by single parent inode, limiting hard links per parent directory).

SKINNY_METADATA (bit 8): Uses METADATA_ITEM (type 169) instead of EXTENT_ITEM (type 168) for tree block extent records. The tree block level is encoded in the key offset, eliminating the separate btrfs_tree_block_info structure and saving 18 bytes per metadata extent item.

NO_HOLES (bit 9): File extents do not require explicit hole entries. Without this flag, holes in sparse files are represented by FILE_EXTENT_ITEM with disk_bytenr = 0; with it, holes are implicit (no item needed for the gap).

METADATA_UUID (bit 10): The metadata_uuid field in the superblock differs from fsid. This allows changing the user-visible filesystem UUID without rewriting every tree block header.

Compatible read-only feature flags (compat_ro_flags)

| Bit | Value | Name | Notes |
|---|---|---|---|
| 0 | 0x1 | FREE_SPACE_TREE | Free space tree exists |
| 1 | 0x2 | FREE_SPACE_TREE_VALID | Free space tree is valid and should be used |
| 2 | 0x4 | VERITY | fs-verity support enabled |
| 3 | 0x8 | BLOCK_GROUP_TREE | Block group items in separate tree |

FREE_SPACE_TREE (bit 0) + FREE_SPACE_TREE_VALID (bit 1): When both are set, the free space tree (objectid 10) is used instead of the legacy free space cache (v1). Both bits must be set for the tree to be considered valid.

VERITY (bit 2): Indicates that fs-verity has been enabled on at least one file, and the filesystem contains VERITY_DESC_ITEM and VERITY_MERKLE_ITEM entries.

BLOCK_GROUP_TREE (bit 3): Block group items are stored in a dedicated block group tree (objectid 11) instead of the extent tree. This improves mount time by avoiding a full extent tree scan to find block groups.

Appendix A: Transaction Model

Btrfs uses a generation-based transaction model. Each transaction is identified by a monotonically increasing generation counter stored in the superblock.

Transaction commit

A transaction commit involves:

  1. All modified tree blocks are written to new locations (COW). Each block’s header records the current generation.
  2. The superblock is updated with:
    • Incremented generation
    • New root (root tree root address)
    • New chunk_root (if chunk tree changed)
    • Updated bytes_used and total_bytes
    • Rotated super_roots backup entry
  3. The superblock is written to all mirrors that fit on the device.

The superblock write is the atomic commit point. If the system crashes before the superblock is fully written, the previous superblock (with the previous generation) remains valid and the filesystem rolls back to that state.

Generation consistency

The generation field appears in multiple places, all of which must be consistent:

  • Superblock generation: the current transaction counter
  • Tree block header generation: must equal the generation when the block was last COWed
  • Node key-pointer generation: must match the child block’s header generation (used for read-time validation)
  • ROOT_ITEM.generation: the generation when the tree was last modified
  • Backup root *_gen fields: generation of each tree root at backup time

When reading a tree, the kernel validates that each block’s generation matches the expected generation from its parent’s key-pointer. A mismatch indicates corruption or a torn write.
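A minimal sketch of that read-time check, assuming the standard btrfs_header layout (csum 32 bytes, fsid 16, bytenr 8, flags 8, chunk_tree_uuid 16, so generation sits at byte offset 80). The function names are illustrative, not this project's API:

```rust
/// Read the generation field from a raw tree block header.
/// Offset 80 = csum (32) + fsid (16) + bytenr (8) + flags (8) + chunk_tree_uuid (16).
fn header_generation(block: &[u8]) -> u64 {
    u64::from_le_bytes(block[80..88].try_into().unwrap())
}

/// Validate a child block against the generation recorded in its
/// parent's key-pointer, as done during tree reads.
fn check_child(block: &[u8], expected_gen: u64) -> Result<(), String> {
    let gen = header_generation(block);
    if gen == expected_gen {
        Ok(())
    } else {
        Err(format!("generation mismatch: header {gen}, parent pointer {expected_gen}"))
    }
}

fn main() {
    let mut block = vec![0u8; 101];
    block[80..88].copy_from_slice(&42u64.to_le_bytes());
    assert!(check_child(&block, 42).is_ok());
    assert!(check_child(&block, 41).is_err());
}
```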

Superblock flag: CHANGING_FSID

The BTRFS_SUPER_FLAG_CHANGING_FSID flag (bit 35 of flags) is set during an offline fsid rewrite operation. If the system crashes while this flag is set, the rewrite must be completed or rolled back on the next access. This provides crash safety for the multi-block fsid change operation.

Appendix B: Size Constants

Constant                    Size        Notes
BTRFS_SUPER_INFO_SIZE       4096 bytes
BTRFS_HEADER_SIZE           101 bytes   sizeof(btrfs_header)
BTRFS_ITEM_SIZE             25 bytes    sizeof(btrfs_item)
BTRFS_KEY_PTR_SIZE          33 bytes    sizeof(btrfs_key_ptr)
BTRFS_DISK_KEY_SIZE         17 bytes    sizeof(btrfs_disk_key)
BTRFS_CSUM_SIZE             32 bytes    Maximum checksum field width
BTRFS_STRIPE_SIZE           32 bytes    sizeof(btrfs_stripe)
BTRFS_INODE_ITEM_SIZE       160 bytes   sizeof(btrfs_inode_item)
BTRFS_ROOT_ITEM_SIZE        439 bytes   sizeof(btrfs_root_item)
BTRFS_DEV_ITEM_SIZE         98 bytes    sizeof(btrfs_dev_item)
BTRFS_TIMESPEC_SIZE         12 bytes    sizeof(btrfs_timespec)
BTRFS_BLOCK_GROUP_SIZE      24 bytes    sizeof(btrfs_block_group_item)
BTRFS_EXTENT_ITEM_SIZE      24 bytes    sizeof(btrfs_extent_item)
BTRFS_TREE_BLOCK_INFO_SIZE  18 bytes    sizeof(btrfs_tree_block_info)
BTRFS_EXTENT_DATA_REF_SIZE  28 bytes    sizeof(btrfs_extent_data_ref)
BTRFS_DEV_EXTENT_SIZE       48 bytes    sizeof(btrfs_dev_extent)
BTRFS_FREE_SPACE_INFO_SIZE  8 bytes     sizeof(btrfs_free_space_info)
BTRFS_ROOT_REF_SIZE         18 bytes    sizeof(btrfs_root_ref), without name
BTRFS_DIR_ITEM_SIZE         30 bytes    sizeof(btrfs_dir_item), without name/data
BTRFS_BACKUP_ROOT_SIZE      168 bytes   sizeof(btrfs_root_backup)
SYS_CHUNK_ARRAY_SIZE        2048 bytes

Appendix C: Logical-to-Physical Address Resolution

All tree block addresses and extent addresses in btrfs are logical addresses. To read a logical address from disk, it must be resolved to a physical device offset through the chunk tree.

The resolution process:

  1. Bootstrap: Parse the superblock’s sys_chunk_array to seed an initial chunk cache with system chunk mappings.

  2. Read the chunk tree: Using the system chunk mappings, resolve superblock.chunk_root to a physical address and read the chunk tree. Add all CHUNK_ITEM entries to the cache.

  3. Resolve: For any logical address, find the chunk whose range contains that address. The physical address is:

    physical = stripe.offset + (logical - chunk.logical)
    

    For SINGLE and DUP profiles, any stripe yields a valid copy. For RAID1, all stripes hold identical copies. For RAID0/5/6/10, stripe index calculation is needed.

  4. Read the root tree: Using the full chunk cache, resolve superblock.root to a physical address and read the root tree. From here, all other trees can be located via their ROOT_ITEM entries.

Appendix D: File Data Layout

A regular file’s on-disk data is described by a sequence of FILE_EXTENT_ITEM entries in the FS tree, keyed by (inode, EXTENT_DATA, file_offset).

Inline extents: Small files (typically < sectorsize) store their data directly in the tree leaf. No separate disk allocation is needed.

Regular extents: Larger files reference data stored in data chunks. The extent is described by disk_bytenr (logical address) and disk_num_bytes (on-disk size). The offset field allows partial references into shared extents (e.g., after COW or clone operations).

Compressed extents: When compression is enabled, the compression field is nonzero, disk_num_bytes is the compressed size, and ram_bytes is the uncompressed size. Inline compressed extents store the compressed data directly in the item.

Sparse files: With the NO_HOLES feature, gaps between extent items are implicit holes. Without it, explicit hole entries with disk_bytenr = 0 fill the gaps.

The file size is stored in INODE_ITEM.size and is authoritative even if the extent items would suggest a different range.

Extent sharing and cloning

When a file extent is cloned (via cp --reflink or BTRFS_IOC_CLONE), both the source and destination inodes reference the same on-disk extent via their FILE_EXTENT_ITEM entries. The reference count in the extent tree’s EXTENT_ITEM is incremented.

The offset field in FILE_EXTENT_ITEM allows each reference to start at a different position within the shared extent:

File A:  [--- extent X (offset=0, num_bytes=4096) ---]
File B:  [--- extent X (offset=2048, num_bytes=2048) ---]

Both reference the same disk_bytenr, but File B starts reading 2048 bytes into the extent.

Compression type encoding

The compression field in FILE_EXTENT_ITEM uses these values:

Value  Name  Notes
0      none  No compression
1      zlib  Deflate compression
2      lzo   LZO compression (btrfs per-sector format)
3      zstd  Zstandard compression

When compression is used with inline extents, the stored data is compressed and the inline data size may differ from ram_bytes.

Appendix E: Subvolume and Snapshot Model

Subvolumes

Each subvolume is an independent FS tree with its own tree objectid (5 for the default, 256+ for user-created subvolumes). The root tree stores:

  • A ROOT_ITEM for each subvolume, recording the root block address, generation, UUIDs, and timestamps.
  • ROOT_REF / ROOT_BACKREF pairs linking parent and child subvolumes.

Snapshots

A snapshot is a subvolume created by COWing the root block of another subvolume. At creation time, the snapshot shares all tree blocks with the source. As either the source or snapshot is modified, shared blocks are COWed on demand, gradually diverging.

The parent_uuid field in ROOT_ITEM links a snapshot back to its source subvolume. The received_uuid field tracks the source across send/receive operations.

Subvolume deletion

Deleted subvolumes are marked with the SUBVOL_DEAD flag in their ROOT_ITEM.flags. The kernel cleans up the tree blocks asynchronously, tracking progress via the drop_progress key and drop_level fields.

Read-only snapshots

A subvolume can be made read-only by setting the SUBVOL_RDONLY flag in ROOT_ITEM.flags. This is required for send operations (the source subvolume must be read-only).

Appendix F: Name Hashing

Directory entries (DIR_ITEM) and extended attributes (XATTR_ITEM) are keyed by a CRC32C hash of the name. The hash uses raw CRC32C (see Section 9.2) with seed ~0:

hash = raw_crc32c(0xFFFFFFFF, name_bytes)

This hash determines the key offset for the DIR_ITEM. If two names hash to the same value (collision), their DIR_ITEM entries are packed into a single item, concatenated one after another.
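A minimal sketch of the raw CRC32C defined above, assuming the Section 9.2 definition is the conventional reflected Castagnoli polynomial (0x82F63B78) with no final XOR. The function names are illustrative:

```rust
/// Bit-by-bit raw CRC32C (Castagnoli, reflected polynomial 0x82F63B78),
/// with no final XOR applied to the result.
fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

/// DIR_ITEM key offset for a name, per the formula above.
fn name_hash(name: &[u8]) -> u32 {
    raw_crc32c(0xFFFF_FFFF, name)
}

fn main() {
    // Sanity check: inverting the raw result recovers the conventional
    // CRC32C check value for "123456789".
    assert_eq!(raw_crc32c(0xFFFF_FFFF, b"123456789") ^ 0xFFFF_FFFF, 0xE306_9283);
    println!("hash of \"foo\": {:#010x}", name_hash(b"foo"));
}
```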

DIR_INDEX entries use a monotonically increasing sequence number instead of a hash, providing deterministic iteration order independent of name hashing.

For INODE_EXTREF, the hash combines the parent inode number and name:

hash = raw_crc32c(raw_crc32c(0xFFFFFFFF, parent_ino_le_bytes), name_bytes)

Appendix G: Block Group and Chunk Relationship

The relationship between chunks, block groups, and device extents forms the space allocation layer:

Chunk (chunk tree)
  |
  +-- maps logical range [L, L+length) to physical stripes
  |   on one or more devices
  |
  +-- Block Group (extent tree or block group tree)
  |     tracks used/free space within the logical range
  |     type flags must match the chunk type
  |
  +-- Device Extent(s) (device tree)
        one per stripe, maps physical range back to the chunk

Allocation order: mkfs creates chunks by:

  1. Choosing a physical region on each device (creating device extents)
  2. Assigning a logical address range (creating the chunk item)
  3. Creating a block group covering the logical range
  4. For the free space tree, creating a FREE_SPACE_INFO and initial FREE_SPACE_EXTENT entries

Consistency invariant: For every chunk, there must be:

  • Exactly one BLOCK_GROUP_ITEM with matching logical offset and length
  • One DEV_EXTENT per stripe, with chunk_offset pointing back to the chunk
  • The block group flags must match the chunk type field

These cross-references are verified by btrfs check.

Appendix H: Default Feature Set

A modern btrfs filesystem created by mkfs.btrfs (or this project’s btrfs-mkfs) typically has the following features enabled:

Incompatible features:

  • MIXED_BACKREF (bit 0) – always set
  • BIG_METADATA (bit 5) – set because nodesize (16384) > sectorsize (4096)
  • EXTENDED_IREF (bit 6) – enables extended inode references
  • SKINNY_METADATA (bit 8) – compact metadata extent records
  • NO_HOLES (bit 9) – implicit holes in sparse files

Compatible read-only features:

  • FREE_SPACE_TREE (bit 0) – free space tracking tree
  • FREE_SPACE_TREE_VALID (bit 1) – free space tree is valid

These are the extref, skinny-metadata, no-holes, and free-space-tree features referenced in mkfs output.

Default parameters:

  • nodesize = 16384 (16 KiB)
  • sectorsize = 4096 (4 KiB), matching the device sector size
  • stripesize = 65536 (64 KiB)
  • csum_type = 0 (CRC32C)
  • Metadata profile: DUP (two copies on the same device)
  • Data profile: SINGLE (no redundancy)
  • System profile: DUP (for single-device) or RAID1 (for multi-device)

Appendix I: Extent Reference Counting

Btrfs tracks references to every allocated extent (both data and metadata) in the extent tree. The reference count in EXTENT_ITEM.refs (or METADATA_ITEM.refs) records how many times the extent is referenced.

Metadata extents

A metadata extent (tree block) is referenced by key-pointers in parent nodes. When a snapshot is created, the snapshot initially shares all tree blocks with the source. Each shared block has refs >= 2. When either tree COWs a shared block, the old block’s refcount is decremented and the new copy gets refs = 1.

Backreferences track which tree(s) own each block:

  • TREE_BLOCK_REF (inline or standalone): direct ownership by a tree root
  • SHARED_BLOCK_REF (inline or standalone): ownership via a parent block that is itself shared between trees

Data extents

A data extent is referenced by FILE_EXTENT_ITEM entries in FS trees. Multiple files (or multiple positions in the same file) can reference the same data extent through reflink cloning.

Backreferences track which file inodes reference each extent:

  • EXTENT_DATA_REF (inline or standalone): records (root, inode, offset, count)
  • SHARED_DATA_REF (inline or standalone): records (parent_block, count)

Reference count invariant

The refs field must equal the sum of all backreference counts for the extent. btrfs check verifies this invariant by walking the extent tree and cross-referencing with the FS trees.

When refs reaches 0, the extent is freed and its space returned to the block group’s free space pool.

Chunks and Block Groups

This document describes the btrfs chunk and block group system: how the filesystem maps logical addresses to physical device locations, how space is organized into typed block groups, and how these structures relate to each other on disk.

All multi-byte integers in btrfs on-disk structures are little-endian.

Address Spaces

Btrfs uses two distinct address spaces:

Logical address space. Every byte of allocated space in the filesystem has a logical address. Tree node pointers, extent references, block group descriptors, and file extent records all use logical addresses. The logical address space is a flat 64-bit namespace shared across all devices in the filesystem. There is no inherent relationship between a logical address and any particular physical device.

Physical address space. Each device has its own independent physical address space, starting at byte 0. Physical addresses identify actual byte offsets on a block device.

The separation exists for several reasons:

  1. Multi-device support. A single logical address can map to stripes on multiple physical devices (RAID1, DUP, RAID0, etc.) without the upper layers of the filesystem needing to know which devices are involved.

  2. Relocation. The balance and resize operations can move data between physical locations while logical addresses remain stable. Since all internal pointers use logical addresses, no tree rewriting is needed when physical locations change.

  3. Redundancy profiles. The same logical address range can have multiple physical copies (DUP, RAID1) or be striped across devices (RAID0) — this is invisible to everything above the chunk layer.

The mapping between the two address spaces is maintained by three cooperating data structures: chunks (logical to physical), device extents (physical to logical), and block groups (space accounting).

Chunks

A chunk maps a contiguous range of logical addresses to one or more physical locations on devices. Chunks are the fundamental unit of the logical-to-physical translation.

CHUNK_ITEM On-Disk Structure

Chunks are stored in the chunk tree. Each chunk item has a key:

Key: (FIRST_CHUNK_TREE_OBJECTID, CHUNK_ITEM, logical_offset)
      objectid = 256                type = 228    offset = start of logical range

The item payload is a btrfs_chunk structure followed by an array of btrfs_stripe structures:

btrfs_chunk (48 bytes):

Field        Offset  Size  Description
length       0       8     Logical extent length in bytes
owner        8       8     Owner tree objectid (always EXTENT_TREE_OBJECTID = 2)
stripe_len   16      8     Stripe length for striped profiles (default 64 KiB)
type         24      8     Block group type + RAID profile flags
io_align     32      4     I/O alignment (STRIPE_LEN for normal chunks, sectorsize for bootstrap)
io_width     36      4     I/O width (same as io_align)
sector_size  40      4     Sector size of the underlying devices
num_stripes  44      2     Number of stripe entries following
sub_stripes  46      2     Sub-stripe count (nonzero only for RAID10)

btrfs_stripe (32 bytes each, num_stripes entries):

Field     Offset  Size  Description
devid     0       8     Device ID
offset    8       8     Physical byte offset on that device
dev_uuid  16      16    UUID of the device

The total item size is 48 + num_stripes * 32 bytes.
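A sketch of decoding a chunk item payload using the offsets from the tables above (little-endian throughout). The function and names are illustrative, not this project's API:

```rust
/// Parse the fixed part of a chunk item payload plus its stripe array.
/// Returns (logical length, [(devid, physical offset)]).
fn parse_chunk(item: &[u8]) -> (u64, Vec<(u64, u64)>) {
    let u64_at = |o: usize| u64::from_le_bytes(item[o..o + 8].try_into().unwrap());
    let length = u64_at(0);
    let num_stripes = u16::from_le_bytes(item[44..46].try_into().unwrap()) as usize;
    assert_eq!(item.len(), 48 + num_stripes * 32, "truncated chunk item");
    let stripes = (0..num_stripes)
        .map(|i| {
            let base = 48 + i * 32; // each btrfs_stripe is 32 bytes
            (u64_at(base), u64_at(base + 8))
        })
        .collect();
    (length, stripes)
}

fn main() {
    // Build a SINGLE-profile chunk item: length 4 MiB, one stripe on devid 1.
    let mut item = vec![0u8; 48 + 32];
    item[0..8].copy_from_slice(&(4u64 << 20).to_le_bytes());   // length
    item[44..46].copy_from_slice(&1u16.to_le_bytes());         // num_stripes
    item[48..56].copy_from_slice(&1u64.to_le_bytes());         // stripe 0 devid
    item[56..64].copy_from_slice(&(1u64 << 20).to_le_bytes()); // stripe 0 offset
    let (length, stripes) = parse_chunk(&item);
    assert_eq!(length, 4 << 20);
    assert_eq!(stripes, vec![(1, 1 << 20)]);
}
```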

Logical-to-Physical Resolution

To resolve a logical address to a physical location:

  1. Find the chunk whose logical range contains the address. The chunk tree is a B-tree keyed by (256, CHUNK_ITEM, logical_offset), so a lookup finds the entry with the largest logical_offset <= target.

  2. Verify the address falls within the chunk: logical_offset <= target < logical_offset + length.

  3. Compute the offset within the chunk: within = target - logical_offset.

  4. For simple profiles (SINGLE, DUP, RAID1): the physical address on stripe i is stripe[i].offset + within.

  5. For striped profiles (RAID0, RAID10, RAID5, RAID6): the stripe index and offset within the stripe are computed from within, stripe_len, and num_stripes/sub_stripes.

The ChunkTreeCache in disk/src/chunk.rs implements this as a BTreeMap keyed by logical start address, with resolve() returning the physical offset on the first stripe (sufficient for SINGLE, DUP, and RAID1 reads).
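A minimal sketch of such a cache. `ChunkCache` here is a simplified stand-in, not the actual ChunkTreeCache type: it maps each chunk's logical start to its length and stripe-0 physical offset, which is enough for SINGLE, DUP, and RAID1 reads:

```rust
use std::collections::BTreeMap;

/// Logical start -> (length, physical offset of stripe 0).
struct ChunkCache(BTreeMap<u64, (u64, u64)>);

impl ChunkCache {
    fn resolve(&self, logical: u64) -> Option<u64> {
        // Largest start <= logical, then a bounds check against the length.
        let (&start, &(len, physical)) = self.0.range(..=logical).next_back()?;
        (logical < start + len).then(|| physical + (logical - start))
    }
}

fn main() {
    let mut map = BTreeMap::new();
    // Chunk: logical 5 MiB, 32 MiB long, physical 9 MiB on stripe 0.
    map.insert(5 << 20, (32 << 20, 9 << 20));
    let cache = ChunkCache(map);
    assert_eq!(cache.resolve((5 << 20) + 4096), Some((9 << 20) + 4096));
    assert_eq!(cache.resolve(1 << 20), None);  // below the chunk
    assert_eq!(cache.resolve(40 << 20), None); // past the end
}
```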

Chunk Ownership

The owner field in the chunk item is always BTRFS_EXTENT_TREE_OBJECTID (2). This is a historical artifact — it does not mean the extent tree “owns” the chunk in any meaningful sense. The chunk tree is its own independent tree (tree objectid 3) with its root pointer stored directly in the superblock.

Block Groups

A block group is the unit of space management in btrfs. Each block group corresponds to exactly one chunk and tracks how much of that chunk’s space is used. Block groups carry type information that determines what kind of data can be stored in them.

BLOCK_GROUP_ITEM On-Disk Structure

Block group items are stored either in the extent tree (traditional layout) or in the dedicated block-group tree (when the BLOCK_GROUP_TREE compat_ro feature is enabled).

Key: (logical_offset, BLOCK_GROUP_ITEM, length)
      objectid = chunk start    type = 192    offset = chunk length

The item payload is a btrfs_block_group_item (24 bytes):

Field           Offset  Size  Description
used            0       8     Bytes currently allocated within this block group
chunk_objectid  8       8     Always FIRST_CHUNK_TREE_OBJECTID (256)
flags           16      8     Type flags + RAID profile flags

Type Flags

The flags field is a bitfield combining a chunk type (what gets stored) and a RAID profile (how it is stored):

Chunk type bits (mutually exclusive in practice):

Flag      Value  Meaning
DATA      0x001  File data extents
SYSTEM    0x002  Chunk tree blocks (needed to bootstrap reads)
METADATA  0x004  Tree node blocks (all trees except chunk)

The kernel also supports DATA|METADATA (0x005) for the mixed-bg feature, where data and metadata share block groups.

RAID profile bits:

Flag     Value  Meaning
(none)   0      SINGLE — one copy, one device
RAID0    0x008  Striped across N devices, no redundancy
RAID1    0x010  Mirrored on 2 devices
DUP      0x020  Two copies on the same device
RAID10   0x040  Striped mirrors
RAID5    0x080  Single parity
RAID6    0x100  Double parity
RAID1C3  0x200  Mirrored on 3 devices
RAID1C4  0x400  Mirrored on 4 devices

For example, a metadata block group using DUP has flags 0x024 (METADATA | DUP). A system block group with no profile bits set is SYSTEM|single (0x002).

The BlockGroupFlags type in disk/src/items.rs represents these flags as a bitflags struct with methods type_name() (returns “Data”, “Metadata”, “System”, etc.) and profile_name() (returns “RAID1”, “DUP”, “single”, etc.).
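A simplified decoding sketch in the same spirit (plain match arms rather than the actual bitflags-based BlockGroupFlags type):

```rust
/// Decode a block group flags field into (type name, profile name),
/// mirroring the flag tables above.
fn decode_flags(flags: u64) -> (&'static str, &'static str) {
    let ty = match flags & 0x7 {
        0x1 => "Data",
        0x2 => "System",
        0x4 => "Metadata",
        0x5 => "Data+Metadata", // mixed-bg
        _ => "unknown",
    };
    let profile = match flags & !0x7 {
        0x000 => "single",
        0x008 => "RAID0",
        0x010 => "RAID1",
        0x020 => "DUP",
        0x040 => "RAID10",
        0x080 => "RAID5",
        0x100 => "RAID6",
        0x200 => "RAID1C3",
        0x400 => "RAID1C4",
        _ => "unknown",
    };
    (ty, profile)
}

fn main() {
    assert_eq!(decode_flags(0x024), ("Metadata", "DUP"));
    assert_eq!(decode_flags(0x002), ("System", "single"));
}
```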

Block Group to Chunk Relationship

Every block group has a 1:1 correspondence with a chunk. The block group’s key (logical_offset, BLOCK_GROUP_ITEM, length) must match a chunk item’s (256, CHUNK_ITEM, logical_offset) with matching length. The block group’s flags must agree with the chunk item’s type field.

This invariant is verified by btrfs check (see section 8).

Device Extents

Device extents are the inverse mapping of chunks: they record which ranges of physical space on each device are allocated to which chunks.

DEV_EXTENT On-Disk Structure

Device extents are stored in the device tree (tree objectid 4).

Key: (devid, DEV_EXTENT, physical_offset)
      objectid = device ID    type = 204    offset = start byte on device

The item payload is a btrfs_dev_extent (48 bytes):

Field            Offset  Size  Description
chunk_tree       0       8     Chunk tree objectid (always 3)
chunk_objectid   8       8     FIRST_CHUNK_TREE_OBJECTID (256)
chunk_offset     16      8     Logical offset of the owning chunk
length           24      8     Physical extent length in bytes
chunk_tree_uuid  32      16    UUID of the chunk tree

Relationship to Chunks and Stripes

For each stripe in a chunk item, there is a corresponding device extent. If a chunk at logical address L has num_stripes stripes, then:

  • Stripe 0: (stripe[0].devid, DEV_EXTENT, stripe[0].offset) with chunk_offset = L and length = chunk.length (for SINGLE/DUP/RAID1).

  • Stripe 1 (for DUP/RAID1): (stripe[1].devid, DEV_EXTENT, stripe[1].offset) with chunk_offset = L and length = chunk.length.

For a DUP metadata chunk on a single device, both stripes have the same devid but different physical offsets, producing two device extents on the same device.

Device Items

Each device in the filesystem also has a DEV_ITEM in the chunk tree:

Key: (DEV_ITEMS_OBJECTID, DEV_ITEM, devid)
      objectid = 1              type = 216    offset = device ID

The item payload is a btrfs_dev_item (98 bytes):

Field         Offset  Size  Description
devid         0       8     Unique device ID
total_bytes   8       8     Total device size
bytes_used    16      8     Bytes allocated to chunks on this device
io_align      24      4     I/O alignment
io_width      28      4     I/O width
sector_size   32      4     Sector size
dev_type      36      8     Reserved (0)
generation    44      8     Last-updated generation
start_offset  52      8     Start offset for allocations
dev_group     60      4     Reserved (0)
seek_speed    64      1     Seek speed hint (0)
bandwidth     65      1     Bandwidth hint (0)
uuid          66      16    Device UUID
fsid          82      16    Filesystem UUID

The bytes_used field is the sum of the lengths of all device extents on that device. A copy of the device item for device 1 is also embedded in the superblock.

The Bootstrap Problem

Circular Dependency

To read any tree, you need to resolve logical addresses to physical offsets, which requires the chunk tree. But the chunk tree is itself stored at a logical address that needs resolution. This creates a circular dependency.

sys_chunk_array

Btrfs solves this with the sys_chunk_array — a 2048-byte buffer embedded directly in the superblock. This array contains a subset of the chunk tree: specifically, the chunk items for SYSTEM-type block groups.

The SYSTEM block group contains the chunk tree’s root block. By parsing the sys_chunk_array, the filesystem driver can locate the chunk tree on disk without needing a chunk tree to find it.

The array format is a packed sequence of (btrfs_disk_key, btrfs_chunk) pairs:

sys_chunk_array[0..sys_chunk_array_size]:
  repeat {
    btrfs_disk_key (17 bytes):
      objectid: u64_le      (always FIRST_CHUNK_TREE_OBJECTID = 256)
      type:     u8           (always CHUNK_ITEM = 228)
      offset:   u64_le       (logical offset of the chunk)
    btrfs_chunk + stripes:
      (same format as the chunk item payload described in section 2.1)
  }

The sys_chunk_array_size field in the superblock records how many bytes of the 2048-byte buffer are valid.
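The array walk can be sketched as follows. This is a simplified illustration, not the actual seed_from_sys_chunk_array (which also records the stripes); it only pulls out each entry's logical offset and stripe count:

```rust
/// Walk the packed (btrfs_disk_key, btrfs_chunk) pairs in a
/// sys_chunk_array buffer, returning (logical offset, num_stripes)
/// per entry. In real code, the buffer should be truncated to
/// sys_chunk_array_size first.
fn walk_sys_chunk_array(buf: &[u8]) -> Vec<(u64, u16)> {
    let mut entries = Vec::new();
    let mut pos = 0;
    while pos + 17 + 48 <= buf.len() {
        // btrfs_disk_key: objectid (8) + type (1) + offset (8) = 17 bytes
        let logical = u64::from_le_bytes(buf[pos + 9..pos + 17].try_into().unwrap());
        pos += 17;
        let num_stripes = u16::from_le_bytes(buf[pos + 44..pos + 46].try_into().unwrap());
        entries.push((logical, num_stripes));
        pos += 48 + num_stripes as usize * 32; // chunk header + stripe array
    }
    entries
}

fn main() {
    // One entry: key offset = 1 MiB, chunk with a single stripe.
    let mut buf = vec![0u8; 17 + 48 + 32];
    buf[9..17].copy_from_slice(&(1u64 << 20).to_le_bytes());    // key.offset
    buf[17 + 44..17 + 46].copy_from_slice(&1u16.to_le_bytes()); // num_stripes
    assert_eq!(walk_sys_chunk_array(&buf), vec![(1 << 20, 1)]);
}
```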

Bootstrap Sequence

The full bootstrap sequence for reading a btrfs filesystem is:

  1. Read the superblock at the primary offset (64 KiB). Verify the magic number, checksum, and fsid. The superblock provides:

    • sys_chunk_array + sys_chunk_array_size
    • chunk_root (logical address of the chunk tree root)
    • root (logical address of the root tree root)
    • nodesize, sectorsize, csum_type
  2. Parse the sys_chunk_array to build an initial ChunkTreeCache. This cache contains only the SYSTEM chunk(s), which is enough to resolve the chunk tree root address.

  3. Read the chunk tree starting from chunk_root. For each CHUNK_ITEM found, add the mapping to the ChunkTreeCache. After this step, the cache can resolve any logical address in the filesystem.

  4. Read the root tree starting from root. This tree contains ROOT_ITEM entries for every other tree (extent, device, FS, csum, free-space, etc.), providing their root block logical addresses.

  5. Read any other tree by looking up its ROOT_ITEM in the root tree and using the ChunkTreeCache to resolve addresses.

The seed_from_sys_chunk_array() function in disk/src/chunk.rs implements step 2. The BlockReader in disk/src/reader.rs orchestrates the full bootstrap sequence.

RAID Profiles

The RAID profile determines how a chunk’s logical space maps to physical device locations. The profile affects num_stripes, sub_stripes, and the interpretation of stripe entries.

SINGLE

num_stripes = 1
sub_stripes = 0

One stripe, one device. Logical offset maps 1:1 to a physical offset on a single device. No redundancy.

Logical:   [--------chunk------]
Physical:  [dev1: stripe 0     ]

DUP

num_stripes = 2
sub_stripes = 0

Two stripes on the same device at different physical offsets. Both stripes contain identical data. Provides protection against localized media errors but not device failure.

Logical:   [--------chunk------]
Physical:  [dev1: stripe 0     ]
           [dev1: stripe 1     ]  (different offset, same data)

DUP is the default metadata profile for single-device filesystems. The logical size of the chunk equals one stripe size. The physical space consumed is 2 * stripe_size.

In mkfs, DUP metadata stripes are laid out sequentially after the system group:

Physical layout on device 1:
  [0..1M)          reserved (superblock at 64K)
  [1M..5M)         system chunk (4 MiB)
  [5M..5M+meta)    metadata stripe 0
  [5M+meta..5M+2*meta)  metadata stripe 1
  [5M+2*meta..)    data stripe 0

RAID1

num_stripes = 2  (RAID1C3: 3, RAID1C4: 4)
sub_stripes = 0

One stripe per device, each containing identical data. RAID1 uses 2 devices, RAID1C3 uses 3, RAID1C4 uses 4.

Logical:   [--------chunk------]
Physical:  [dev1: stripe 0     ]
           [dev2: stripe 1     ]  (same data, different device)

For RAID1 metadata on a 2-device filesystem, mkfs places one stripe on each device at the same physical offset (CHUNK_START):

Device 1: [system][meta stripe 0][data stripe 0]
Device 2:        [meta stripe 1]

RAID0

num_stripes = N  (number of devices)
sub_stripes = 0

Data is striped across N devices in stripe_len-sized (64 KiB) units. No redundancy. The logical chunk size equals N * physical_stripe_size.

Logical:   [--A--][--B--][--C--][--A--][--B--][--C--]
Physical:  dev1: [--A--]       [--A--]
           dev2:        [--B--]       [--B--]
           dev3:               [--C--]       [--C--]

To resolve a logical address within a RAID0 chunk:

  1. offset = logical - chunk_start
  2. stripe_nr = offset / stripe_len
  3. stripe_index = stripe_nr % num_stripes
  4. stripe_offset = (stripe_nr / num_stripes) * stripe_len + (offset % stripe_len)
  5. Physical address = stripes[stripe_index].offset + stripe_offset
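The five steps above translate directly into code. A sketch (names are illustrative):

```rust
/// RAID0 stripe resolution, following steps 1-5 above.
/// Returns (stripe index, physical address on that stripe's device).
fn raid0_resolve(
    logical: u64,
    chunk_start: u64,
    stripe_len: u64,
    stripe_offsets: &[u64], // stripes[i].offset from the chunk item
) -> (usize, u64) {
    let offset = logical - chunk_start;
    let stripe_nr = offset / stripe_len;
    let num = stripe_offsets.len() as u64;
    let index = (stripe_nr % num) as usize;
    let stripe_offset = (stripe_nr / num) * stripe_len + offset % stripe_len;
    (index, stripe_offsets[index] + stripe_offset)
}

fn main() {
    const SL: u64 = 64 * 1024;
    let stripes = [0, 0, 0]; // three devices, all stripes at physical 0
    // The second 64K unit lands at the start of stripe 1.
    assert_eq!(raid0_resolve(SL, 0, SL, &stripes), (1, 0));
    // The fourth unit wraps back to stripe 0, one stripe_len further in.
    assert_eq!(raid0_resolve(3 * SL + 100, 0, SL, &stripes), (0, SL + 100));
}
```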

RAID10

num_stripes = N  (must be even, >= 4)
sub_stripes = 2

Striped mirrors: data is striped across N/2 mirror groups, each group having sub_stripes (2) copies. Combines RAID0 throughput with RAID1 redundancy.

RAID5 and RAID6

RAID5: num_stripes = N, sub_stripes = 0, one parity stripe
RAID6: num_stripes = N, sub_stripes = 0, two parity stripes

Data is striped with rotating parity. RAID5 tolerates one device failure; RAID6 tolerates two.

Allocation Sizing

When creating a new filesystem (mkfs), the initial chunk sizes are computed from the total device size. The formulas, implemented in mkfs/src/layout.rs (ChunkLayout::new), are:

System Block Group

Fixed size and position:

  • Offset: SYSTEM_GROUP_OFFSET = 1 MiB (0x100000)
  • Size: SYSTEM_GROUP_SIZE = 4 MiB (0x400000)
  • Profile: always SINGLE
  • Contains: the chunk tree root block

The first 1 MiB of the device is reserved. The primary superblock sits at offset 64 KiB within this reserved area.

Metadata Block Group

meta_size = clamp(total_bytes / 10, 32 MiB, 256 MiB)
meta_size = round_down(meta_size, STRIPE_LEN)

where STRIPE_LEN = 64 KiB and total_bytes is the sum across all devices.

The metadata chunk starts at logical offset CHUNK_START = 5 MiB (SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE). For DUP, two physical stripes are placed sequentially on device 1. For RAID1, one stripe is placed on each of the first two devices.

Examples:

  • 256 MiB device: clamp(25.6M, 32M, 256M) = 32 MiB
  • 1 GiB device: clamp(102.4M, 32M, 256M) = 102 MiB (rounded to 64K)
  • 10 GiB device: clamp(1G, 32M, 256M) = 256 MiB

Data Block Group

data_size = clamp(total_bytes / 10, 64 MiB, 1 GiB)
data_size = round_down(data_size, STRIPE_LEN)

The data chunk follows the metadata chunk in both logical and physical address spaces. Logical offset = CHUNK_START + meta_size.

Examples:

  • 256 MiB device: clamp(25.6M, 64M, 1G) = 64 MiB
  • 1 GiB device: clamp(102.4M, 64M, 1G) = 102 MiB (rounded to 64K)
  • 10 GiB device: clamp(1G, 64M, 1G) = 1 GiB
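Both sizing formulas share the same clamp-then-round shape, which can be sketched as (the helper name is illustrative, not the ChunkLayout API):

```rust
/// Initial chunk sizing as described above: clamp total/10 into a range,
/// then round down to a 64 KiB stripe boundary.
const STRIPE_LEN: u64 = 64 * 1024;
const MIB: u64 = 1024 * 1024;

fn chunk_size(total_bytes: u64, min: u64, max: u64) -> u64 {
    (total_bytes / 10).clamp(min, max) / STRIPE_LEN * STRIPE_LEN
}

fn main() {
    let gib = 1024 * MIB;
    // Metadata: clamp(total/10, 32M, 256M)
    assert_eq!(chunk_size(256 * MIB, 32 * MIB, 256 * MIB), 32 * MIB);
    assert_eq!(chunk_size(10 * gib, 32 * MIB, 256 * MIB), 256 * MIB);
    // 1 GiB device: ~102 MiB, rounded down to a 64K multiple.
    assert_eq!(chunk_size(gib, 32 * MIB, 256 * MIB), 107_347_968);
    // Data: clamp(total/10, 64M, 1G)
    assert_eq!(chunk_size(256 * MIB, 64 * MIB, gib), 64 * MIB);
}
```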

Minimum Device Size

For a single-device filesystem with DUP metadata and SINGLE data, the minimum physical space needed is:

1 MiB (reserved) + 4 MiB (system) + 2 * meta_size + data_size

With the minimum sizes (meta = 32 MiB, data = 64 MiB), this works out to approximately 133 MiB. A 100 MiB device will fail with “device too small”.

Physical Layout Summary

For a single-device DUP-metadata SINGLE-data filesystem:

Physical byte offset:
  [0 .. 1M)                          Reserved (superblock at 64K)
  [1M .. 5M)                         System block group (4 MiB)
  [5M .. 5M + meta_size)             Metadata stripe 0
  [5M + meta_size .. 5M + 2*meta)    Metadata stripe 1 (DUP copy)
  [5M + 2*meta .. 5M + 2*meta + data)  Data

Logical address space:
  [1M .. 5M)                         System chunk
  [5M .. 5M + meta_size)             Metadata chunk
  [5M + meta_size .. 5M + meta + data)  Data chunk

Note the physical space for DUP metadata is 2 * meta_size, but the logical address range is only meta_size. Both physical stripes map to the same logical range.

Cross-Checks

The btrfs check command (implemented in cli/src/check/chunks.rs) verifies the consistency of the chunk/block-group/device-extent triad.

Chunk-to-Block-Group Check

For every chunk in the chunk tree, there must be a matching block group item. The check walks the chunk tree cache and verifies that block_groups.contains_key(chunk.logical) for each chunk.

If a chunk has no corresponding block group, btrfs check reports:

ChunkMissingBlockGroup { logical }

Block-Group-to-Chunk Check

For every block group item (from the extent tree or block-group tree), there must be a matching chunk. The check verifies that chunk_cache.lookup(bg_logical) succeeds for each block group.

If a block group has no corresponding chunk, btrfs check reports:

BlockGroupMissingChunk { logical }

Device Extent Overlap Check

All device extents for each device are collected from the device tree, sorted by physical offset, and checked for overlaps. For consecutive extents on the same device, the check verifies:

extent[i].offset >= extent[i-1].offset + extent[i-1].length

If two device extents overlap, btrfs check reports:

DeviceExtentOverlap { devid, offset }
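The sort-and-compare pass can be sketched as follows (simplified: one device, extents as (offset, length) pairs; not the actual check code):

```rust
/// Overlap check for one device's extents: sort by physical offset,
/// then verify each extent starts at or after the previous one ends.
/// Returns the offsets of extents that overlap their predecessor.
fn find_overlaps(extents: &mut Vec<(u64, u64)>) -> Vec<u64> {
    extents.sort_by_key(|&(offset, _)| offset);
    extents
        .windows(2)
        .filter(|w| w[1].0 < w[0].0 + w[0].1)
        .map(|w| w[1].0)
        .collect()
}

fn main() {
    // The second extent starts 1 MiB before the first one ends.
    let mut extents = vec![(0, 4 << 20), (3 << 20, 4 << 20), (8 << 20, 4 << 20)];
    assert_eq!(find_overlaps(&mut extents), vec![3 << 20]);
}
```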

Block Group Source

When the BLOCK_GROUP_TREE compat_ro feature is enabled, block group items are stored in a separate tree (tree objectid 11) rather than in the extent tree. The check code handles both cases by selecting the appropriate tree root:

let bg_root = block_group_tree_root.unwrap_or(extent_root);

The Chunk Tree

The chunk tree (tree objectid 3) stores two kinds of items:

  1. DEV_ITEM entries for each device in the filesystem: (DEV_ITEMS_OBJECTID=1, DEV_ITEM=216, devid)

  2. CHUNK_ITEM entries for each chunk: (FIRST_CHUNK_TREE_OBJECTID=256, CHUNK_ITEM=228, logical_offset)

Items are sorted by key, so DEV_ITEMs (objectid 1) come before CHUNK_ITEMs (objectid 256).

The chunk tree root pointer is stored directly in the superblock’s chunk_root field — it does not go through the root tree like other trees. This is because the chunk tree is needed to read the root tree itself.

mkfs Chunk Tree Construction

When mkfs builds the chunk tree (build_chunk_tree in mkfs/src/mkfs.rs), it creates:

  1. One DEV_ITEM per device, with bytes_used set to the sum of all chunk stripes on that device.

  2. Three CHUNK_ITEM entries:

    • System chunk at SYSTEM_GROUP_OFFSET (1 MiB), size 4 MiB
    • Metadata chunk at CHUNK_START (5 MiB), with profile-dependent stripes
    • Data chunk after metadata, with profile-dependent stripes

The system chunk item uses sectorsize for io_align and io_width (matching the kernel’s bootstrap behavior), while the metadata and data chunks use STRIPE_LEN (64 KiB).

The Device Tree

The device tree (tree objectid 4) stores:

  1. DEV_STATS (PERSISTENT_ITEM) for each device: per-device I/O error counters, initialized to zero by mkfs.

  2. DEV_EXTENT entries for each physical stripe of each chunk.

Items are sorted by key: (objectid=devid, type=DEV_EXTENT, offset=physical_byte_offset).

mkfs Device Tree Construction

When mkfs builds the device tree (build_dev_tree in mkfs/src/mkfs.rs), it creates:

  1. One DEV_STATS item per device (zeroed counters).

  2. Device extents for each stripe:

    • System chunk: one DEV_EXTENT on device 1 at SYSTEM_GROUP_OFFSET
    • Metadata chunk: one DEV_EXTENT per stripe (two for DUP on device 1, or one per device for RAID1)
    • Data chunk: one DEV_EXTENT per stripe

All device tree items are collected, sorted by key, and written in order. This is necessary because items span multiple device IDs and physical offsets that are not naturally ordered by construction.

Superblock Mirrors

The superblock is written at up to three fixed physical offsets on each device:

Mirror  Offset   Size
0       64 KiB   4 KiB
1       64 MiB   4 KiB
2       256 GiB  4 KiB

The formula is: mirror 0 at 65536 bytes; mirror N (N > 0) at 16384 << (12 * N) bytes. Mirrors are only written if the device is large enough to contain them.
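As a sketch, the mirror placement rule can be written as a small function; this is illustrative, not the crate's actual code:

```rust
/// Physical byte offset of superblock mirror `n` (0..=2).
/// Mirror 0 is a special case at 64 KiB; mirrors 1 and 2 follow
/// the 16384 << (12 * n) progression.
fn superblock_mirror_offset(n: u32) -> u64 {
    if n == 0 {
        65_536
    } else {
        16_384u64 << (12 * n)
    }
}

fn main() {
    assert_eq!(superblock_mirror_offset(0), 64 * 1024);                // 64 KiB
    assert_eq!(superblock_mirror_offset(1), 64 * 1024 * 1024);         // 64 MiB
    assert_eq!(superblock_mirror_offset(2), 256 * 1024 * 1024 * 1024); // 256 GiB
}
```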

The superblock contains the sys_chunk_array bootstrap data, root pointers for the chunk tree and root tree, the embedded device item for device 1, and all filesystem-level metadata (UUID, label, feature flags, generation counter, bytes_used, etc.).

All three mirrors contain identical data for a given generation. On mount, the kernel reads all available mirrors and uses the one with the highest valid generation, providing resilience against corruption of the primary superblock.

Tree Block Placement in mkfs

During filesystem creation, tree blocks must be placed at specific logical addresses within the chunks. The BlockLayout struct in mkfs/src/layout.rs assigns addresses:

Chunk tree block: placed at SYSTEM_GROUP_OFFSET (1 MiB) in the system chunk. This is the only tree block in the system block group.

All other tree blocks (root, extent, device, FS, csum, free-space, data-reloc, and optionally block-group): placed sequentially in the metadata chunk starting at meta_logical = 5 MiB. With a 16 KiB nodesize:

Logical address       Tree
meta_logical + 0      Root tree
meta_logical + 16K    Extent tree
meta_logical + 32K    Device tree
meta_logical + 48K    FS tree
meta_logical + 64K    Csum tree
meta_logical + 80K    Free-space tree
meta_logical + 96K    Data-reloc tree
meta_logical + 112K   Block-group tree (if enabled)

For --rootdir mode, where trees may require multiple blocks, the BlockAllocator hands out sequential addresses from the system and metadata chunks, supporting trees of arbitrary size.

System chunk bytes used = nodesize (one chunk tree block). Metadata chunk bytes used = 7 * nodesize (or 8 with block-group tree).
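The per-tree addresses in the table above reduce to a base address plus a block index. A minimal sketch (the constant and function names are invented for illustration, not the crate's identifiers):

```rust
const META_LOGICAL: u64 = 5 * 1024 * 1024; // metadata chunk start (5 MiB)
const NODESIZE: u64 = 16 * 1024;           // 16 KiB tree blocks

/// Logical address of the i-th tree block in the metadata chunk
/// (root = 0, extent = 1, device = 2, FS = 3, ...).
fn tree_block_addr(index: u64) -> u64 {
    META_LOGICAL + index * NODESIZE
}

fn main() {
    assert_eq!(tree_block_addr(0), META_LOGICAL);              // root tree
    assert_eq!(tree_block_addr(2), META_LOGICAL + 32 * 1024);  // device tree
    assert_eq!(tree_block_addr(7), META_LOGICAL + 112 * 1024); // block-group tree
}
```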

Extent Tree and Backrefs

This document describes the btrfs extent tree: how every allocated byte on disk is tracked, how reference counting works, and how backreferences link extents to the trees and files that use them.

All multi-byte integers in btrfs on-disk structures are little-endian.

Purpose

The extent tree is the central allocator of the btrfs filesystem. It records every contiguous range of allocated disk space (both data extents used by files and metadata blocks used by trees) and tracks who references each extent.

The extent tree serves three purposes:

  1. Allocation tracking. The set of extent items defines which logical byte ranges are in use. The free-space tree (or free-space cache) is derived from the gaps between extent items.

  2. Reference counting. Each extent has a declared reference count. Snapshots and clones share extents by incrementing this count rather than copying data. When the count drops to zero, the extent can be freed.

  3. Backreferences. Each extent stores references back to the trees, inodes, and file offsets that use it. This enables the filesystem to find all users of an extent (for relocation during balance, for example) and to verify consistency (during btrfs check).

The extent tree is stored in tree objectid 2 (BTRFS_EXTENT_TREE_OBJECTID). Its root pointer is stored in the root tree via a ROOT_ITEM entry.

EXTENT_ITEM vs METADATA_ITEM

There are two key types used to record allocated extents:

EXTENT_ITEM (type 168)

The original extent item format, used for both data and metadata extents.

Key: (bytenr, EXTENT_ITEM, length)
      objectid = logical start    type = 168    offset = size in bytes

For data extents, length is the extent’s size on disk. For metadata extents (tree blocks), length equals the filesystem’s nodesize.

METADATA_ITEM (type 169) — Skinny Metadata

When the SKINNY_METADATA incompat feature is enabled (the default since Linux 3.10), metadata extents use a more compact key:

Key: (bytenr, METADATA_ITEM, level)
      objectid = logical start    type = 169    offset = tree level (0..7)

The extent’s length is implicitly nodesize (not stored in the key). The level field in the key offset records the B-tree level of the tree block, which is useful for verification without reading the block itself.

Skinny metadata items are called “skinny refs” because they eliminate the need for the btrfs_tree_block_info structure that non-skinny EXTENT_ITEM entries for tree blocks carry.

Key Differences

Aspect       EXTENT_ITEM (non-skinny)                                METADATA_ITEM (skinny)
Key type     168                                                     169
Key offset   nodesize                                                tree level (0..7)
Item body    extent_item + tree_block_info + inline refs             extent_item + inline refs
When used    Always for data; metadata only without skinny_metadata  Metadata only, with skinny_metadata

In mkfs, the choice is controlled by the skinny_metadata() config flag:

#![allow(unused)]
fn main() {
let (item_type, offset) = if skinny {
    (BTRFS_METADATA_ITEM_KEY, 0u64)   // level 0 for leaf blocks
} else {
    (BTRFS_EXTENT_ITEM_KEY, nodesize as u64)
};
}

The Extent Item Header

Both EXTENT_ITEM and METADATA_ITEM share the same header structure, btrfs_extent_item (24 bytes):

Field        Offset  Size  Description
refs         0       8     Total reference count
generation   8       8     Generation when allocated
flags        16      8     Extent type flags
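A minimal sketch of decoding this 24-byte header from raw bytes (illustrative only, not the parser in disk/src/items.rs):

```rust
/// Parsed btrfs_extent_item header (24 bytes, little-endian).
struct ExtentItemHeader {
    refs: u64,
    generation: u64,
    flags: u64,
}

fn parse_extent_item_header(buf: &[u8; 24]) -> ExtentItemHeader {
    let u64_at = |o: usize| u64::from_le_bytes(buf[o..o + 8].try_into().unwrap());
    ExtentItemHeader {
        refs: u64_at(0),
        generation: u64_at(8),
        flags: u64_at(16),
    }
}

fn main() {
    // refs = 2, generation = 100, flags = DATA (0x01)
    let mut raw = [0u8; 24];
    raw[0..8].copy_from_slice(&2u64.to_le_bytes());
    raw[8..16].copy_from_slice(&100u64.to_le_bytes());
    raw[16..24].copy_from_slice(&1u64.to_le_bytes());
    let h = parse_extent_item_header(&raw);
    assert_eq!((h.refs, h.generation, h.flags), (2, 100, 1));
}
```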

Extent Flags

The flags field uses these bits:

Flag          Value  Meaning
DATA          0x01   Extent holds file data
TREE_BLOCK    0x02   Extent holds a metadata tree block
FULL_BACKREF  0x80   Uses shared (parent-based) backrefs only

A data extent has flags = DATA (0x01). A metadata extent has flags = TREE_BLOCK (0x02). The FULL_BACKREF flag is set when the extent uses shared backreferences (after a snapshot) rather than normal tree backreferences.

The ExtentFlags type in disk/src/items.rs represents these flags as a bitflags struct.

Tree Block Info (Non-Skinny Only)

For non-skinny EXTENT_ITEM entries with TREE_BLOCK flag, the header is followed by btrfs_tree_block_info (25 bytes):

Field   Offset  Size  Description
key     0       17    First key in the tree block (btrfs_disk_key)
level   17      1     B-tree level of the block

This structure is omitted when using METADATA_ITEM (skinny metadata), since the level is stored in the key offset and the first key is not needed.

Full Item Layout

For a skinny metadata extent item with one inline TREE_BLOCK_REF:

Byte offset  Size  Content
0            8     refs (u64_le)
8            8     generation (u64_le)
16           8     flags = TREE_BLOCK (u64_le)
24           1     inline ref type = TREE_BLOCK_REF_KEY (176)
25           8     root objectid (u64_le)

Total: 33 bytes

For a data extent item with one inline EXTENT_DATA_REF:

Byte offset  Size  Content
0            8     refs (u64_le)
8            8     generation (u64_le)
16           8     flags = DATA (u64_le)
24           1     inline ref type = EXTENT_DATA_REF_KEY (178)
25           8     root (u64_le)
33           8     objectid (u64_le), the inode number
41           8     offset (u64_le), the file offset
49           4     count (u32_le)

Total: 53 bytes
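These layouts can be reproduced by serializing the fields in order. A hedged sketch (the function names are invented for illustration; the real builders live in the mkfs crate):

```rust
/// Encode a skinny metadata extent item with one inline TREE_BLOCK_REF.
fn encode_skinny_metadata_item(refs: u64, generation: u64, root: u64) -> Vec<u8> {
    const TREE_BLOCK: u64 = 0x02;
    const TREE_BLOCK_REF_KEY: u8 = 176;
    let mut buf = Vec::new();
    buf.extend_from_slice(&refs.to_le_bytes());
    buf.extend_from_slice(&generation.to_le_bytes());
    buf.extend_from_slice(&TREE_BLOCK.to_le_bytes());
    buf.push(TREE_BLOCK_REF_KEY);
    buf.extend_from_slice(&root.to_le_bytes());
    buf
}

/// Encode a data extent item with one inline EXTENT_DATA_REF.
fn encode_data_extent_item(
    refs: u64, generation: u64,
    root: u64, objectid: u64, offset: u64, count: u32,
) -> Vec<u8> {
    const DATA: u64 = 0x01;
    const EXTENT_DATA_REF_KEY: u8 = 178;
    let mut buf = Vec::new();
    buf.extend_from_slice(&refs.to_le_bytes());
    buf.extend_from_slice(&generation.to_le_bytes());
    buf.extend_from_slice(&DATA.to_le_bytes());
    buf.push(EXTENT_DATA_REF_KEY);
    buf.extend_from_slice(&root.to_le_bytes());
    buf.extend_from_slice(&objectid.to_le_bytes());
    buf.extend_from_slice(&offset.to_le_bytes());
    buf.extend_from_slice(&count.to_le_bytes());
    buf
}

fn main() {
    assert_eq!(encode_skinny_metadata_item(1, 100, 5).len(), 33);
    assert_eq!(encode_data_extent_item(1, 100, 5, 257, 0, 1).len(), 53);
}
```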

Inline Backrefs

After the extent item header (and tree_block_info for non-skinny metadata), zero or more inline backreferences are packed contiguously. Each inline ref starts with a 1-byte type code, followed by type-specific data.

Inline refs are the common case: they are stored directly inside the extent item, avoiding the overhead of separate B-tree items. When an extent item grows too large to fit in a leaf (due to many references), backrefs are stored as standalone items instead.

TREE_BLOCK_REF (type 176)

Direct backref from a metadata extent to the tree that owns it.

Field          Offset  Size  Description
type           0       1     176 (BTRFS_TREE_BLOCK_REF_KEY)
root objectid  1       8     u64_le

The root field identifies the tree that owns this metadata block. For example, root = 5 means the FS tree, root = 2 means the extent tree itself.

Total size: 9 bytes.

SHARED_BLOCK_REF (type 182)

Shared backref from a metadata extent to a parent tree block. Used when a tree block is shared between snapshots — the backref points to a parent node rather than a root.

Field          Offset  Size  Description
type           0       1     182 (BTRFS_SHARED_BLOCK_REF_KEY)
parent bytenr  1       8     u64_le

The parent field is the logical byte address of the tree node that contains a pointer to this extent.

Total size: 9 bytes.

EXTENT_DATA_REF (type 178)

Backref from a data extent to a specific file inode. This is the most common inline ref type for data extents.

Field     Offset  Size  Description
type      0       1     178 (BTRFS_EXTENT_DATA_REF_KEY)
root      1       8     Tree objectid owning the inode (u64_le)
objectid  9       8     Inode number (u64_le)
offset    17      8     File byte offset (u64_le)
count     25      4     Number of references (u32_le)

Note that unlike other inline ref types, EXTENT_DATA_REF does not have an 8-byte offset field between the type byte and the struct body. The struct starts immediately after the type byte. The parser in disk/src/items.rs handles this by reinterpreting the speculatively consumed offset bytes as the root field:

#![allow(unused)]
fn main() {
raw::BTRFS_EXTENT_DATA_REF_KEY => {
    let root = ref_offset; // already read as u64_le
    let oid = buf.get_u64_le();
    let off = buf.get_u64_le();
    let count = buf.get_u32_le();
    // ...
}
}

The count field represents how many times this particular (root, objectid, offset) triple references the extent. For a normal file with one reference, count = 1. For a file cloned via reflink, each clone adds a new EXTENT_DATA_REF with its own triple and count.

Total size: 29 bytes.

SHARED_DATA_REF (type 184)

Shared data backref, used when data extents are shared between snapshots.

Field          Offset  Size  Description
type           0       1     184 (BTRFS_SHARED_DATA_REF_KEY)
parent bytenr  1       8     u64_le
count          9       4     u32_le

Total size: 13 bytes.

EXTENT_OWNER_REF (type 172)

Simple ownership reference, used with the simple_quota feature. Records which tree root owns the extent without full backref details.

Field          Offset  Size  Description
type           0       1     172 (BTRFS_EXTENT_OWNER_REF_KEY)
root objectid  1       8     u64_le

Total size: 9 bytes.

Standalone Backrefs

When inline backrefs do not fit inside the extent item (because the item would exceed the available leaf space), they are stored as separate items in the extent tree. Standalone backrefs use the same type codes as inline refs but are encoded as independent key/value pairs.

Standalone TREE_BLOCK_REF

Key: (bytenr, TREE_BLOCK_REF, root_objectid)
      objectid = extent start    type = 176    offset = owning tree

Item payload: empty (zero bytes). The backref information is entirely in the key.

Standalone SHARED_BLOCK_REF

Key: (bytenr, SHARED_BLOCK_REF, parent_bytenr)
      objectid = extent start    type = 182    offset = parent block

Item payload: empty.

Standalone EXTENT_DATA_REF

Key: (bytenr, EXTENT_DATA_REF, hash)
      objectid = extent start    type = 178    offset = CRC32C hash

The key offset is a hash of (root, objectid, offset) computed by:

#![allow(unused)]
fn main() {
fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = raw_crc32c(!0u32, &root.to_le_bytes());
    let low_crc = raw_crc32c(!0u32, &objectid.to_le_bytes());
    let low_crc = raw_crc32c(low_crc, &offset.to_le_bytes());
    (u64::from(high_crc) << 31) ^ u64::from(low_crc)
}
}

This hash function uses raw CRC32C (seed = !0, i.e. 0xFFFFFFFF, without final complement) applied independently to the root (high part) and objectid+offset (low part), then combined with a shift and XOR.
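For illustration, here is a self-contained version of the hash with a bitwise CRC32C in place of the project's raw_crc32c helper. This is a sketch, not the code in disk/src/items.rs:

```rust
/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78),
/// without the final bit complement ("raw" form).
fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &b in data {
        crc ^= u32::from(b);
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = raw_crc32c(!0u32, &root.to_le_bytes());
    let low_crc = raw_crc32c(!0u32, &objectid.to_le_bytes());
    let low_crc = raw_crc32c(low_crc, &offset.to_le_bytes());
    (u64::from(high_crc) << 31) ^ u64::from(low_crc)
}

fn main() {
    // Standard CRC-32C check value for "123456789" (with the final
    // complement applied on top of the raw result).
    assert_eq!(raw_crc32c(!0u32, b"123456789") ^ !0u32, 0xE306_9283);

    let h = extent_data_ref_hash(5, 257, 0);
    assert_eq!(h, extent_data_ref_hash(5, 257, 0)); // deterministic
    assert!(h < 1u64 << 63);                        // fits in 63 bits
}
```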

Item payload (28 bytes):

Field     Offset  Size  Description
root      0       8     u64_le
objectid  8       8     u64_le
offset    16      8     u64_le
count     24      4     u32_le

Standalone SHARED_DATA_REF

Key: (bytenr, SHARED_DATA_REF, parent_bytenr)
      objectid = extent start    type = 184    offset = parent block

Item payload (4 bytes):

Field  Offset  Size  Description
count  0       4     u32_le

Reference Counting

The refs Field

The refs field in btrfs_extent_item is the declared total reference count for the extent. It equals the sum of all references from both inline and standalone backrefs.

For TREE_BLOCK_REF, SHARED_BLOCK_REF, and EXTENT_OWNER_REF, each backref contributes 1 to the total. For EXTENT_DATA_REF and SHARED_DATA_REF, each backref contributes its count field to the total.

Counting Rules

The total reference count is computed as:

total = 0
for each inline ref:
    if EXTENT_DATA_REF:  total += count
    if SHARED_DATA_REF:  total += count
    otherwise:           total += 1
for each standalone ref:
    if EXTENT_DATA_REF:  total += count  (from item payload)
    if SHARED_DATA_REF:  total += count  (from item payload)
    otherwise:           total += 1

The declared refs in the extent item header must equal this computed total. A mismatch indicates corruption.
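The counting rules above can be sketched as a single accumulation. The enum is an illustrative stand-in, not a type from the crates:

```rust
/// Backref kinds relevant to reference counting (illustrative only).
enum Backref {
    TreeBlockRef,
    SharedBlockRef,
    ExtentOwnerRef,
    ExtentDataRef { count: u32 },
    SharedDataRef { count: u32 },
}

/// Total reference count implied by a set of backrefs; this must equal
/// the `refs` field declared in the extent item header.
fn total_refs(backrefs: &[Backref]) -> u64 {
    backrefs
        .iter()
        .map(|b| match b {
            Backref::ExtentDataRef { count } | Backref::SharedDataRef { count } => {
                u64::from(*count)
            }
            // TREE_BLOCK_REF, SHARED_BLOCK_REF, EXTENT_OWNER_REF each count once.
            _ => 1,
        })
        .sum()
}

fn main() {
    // Two data refs with count = 1 each, as after taking a snapshot.
    let refs = [
        Backref::ExtentDataRef { count: 1 },
        Backref::ExtentDataRef { count: 1 },
    ];
    assert_eq!(total_refs(&refs), 2); // must match the declared refs field
}
```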

Example: Simple File

A newly created file with one 4 KiB extent in the FS tree (root 5):

Key: (bytenr, EXTENT_ITEM, 4096)
  refs = 1
  generation = 100
  flags = DATA
  inline EXTENT_DATA_REF:
    root = 5, objectid = 257, offset = 0, count = 1

Total refs: count(1) = 1. Matches declared refs.

Example: Snapshot

After taking a snapshot of the FS tree, the same extent is now referenced by both the original and the snapshot. The extent item is updated:

Key: (bytenr, EXTENT_ITEM, 4096)
  refs = 2
  generation = 100
  flags = DATA
  inline EXTENT_DATA_REF:
    root = 5, objectid = 257, offset = 0, count = 1
  inline EXTENT_DATA_REF:
    root = 260, objectid = 257, offset = 0, count = 1

Total refs: count(1) + count(1) = 2. Matches declared refs.

A reflink clone within the same tree adds another backref with a different file offset:

Key: (bytenr, EXTENT_ITEM, 4096)
  refs = 2
  generation = 100
  flags = DATA
  inline EXTENT_DATA_REF:
    root = 5, objectid = 257, offset = 0, count = 1
  inline EXTENT_DATA_REF:
    root = 5, objectid = 258, offset = 0, count = 1

Example: Metadata Block

A metadata block owned by the FS tree:

Key: (bytenr, METADATA_ITEM, 0)    // level 0 = leaf
  refs = 1
  generation = 100
  flags = TREE_BLOCK
  inline TREE_BLOCK_REF:
    root = 5

Data Extent Backrefs in Detail

The EXTENT_DATA_REF Triple

Each data extent backref identifies its user by a (root, objectid, offset) triple:

  • root: the tree objectid containing the referencing inode. For user files this is the FS tree (5) or a subvolume/snapshot tree ID.

  • objectid: the inode number of the file that references the extent. Regular file inodes start at 257 (BTRFS_FIRST_FREE_OBJECTID + 1).

  • offset: the byte offset within the file where this extent is referenced. This is the key offset of the EXTENT_DATA item in the FS tree.

The count Field

The count field records how many times the exact same (root, objectid, offset) triple references this extent. In normal operation, count = 1. It can be greater than 1 in specific scenarios involving log replay or certain reflink patterns.

Hash Computation for Standalone Keys

When an EXTENT_DATA_REF is stored as a standalone item, the key offset is not the file offset but rather a hash of the full triple. This allows multiple data refs with different triples to be stored as separate items under the same extent bytenr.

The hash function (from disk/src/items.rs) computes:

high = CRC32C(seed=0xFFFFFFFF, root_le_bytes)
low  = CRC32C(seed=0xFFFFFFFF, objectid_le_bytes)
low  = CRC32C(seed=low,        offset_le_bytes)
hash = (high << 31) ^ low

This produces a 63-bit hash: the high CRC occupies bits 31..62 (overlapping the low CRC at bit 31), so the top bit of the u64 is always zero. The hash is deterministic, and the same function is used by both the kernel and userspace tools.

Metadata Extent Backrefs in Detail

TREE_BLOCK_REF

A TREE_BLOCK_REF links a metadata block to the tree that owns it. The root field is the tree’s objectid:

  • 1 = root tree
  • 2 = extent tree
  • 3 = chunk tree
  • 4 = device tree
  • 5 = FS tree (default subvolume)
  • 6 = csum tree
  • 7 = quota tree
  • 10 = free-space tree
  • ≥ 256 = subvolume/snapshot trees

SHARED_BLOCK_REF

When a tree block is shared between a subvolume and its snapshot, the normal TREE_BLOCK_REF is replaced with a SHARED_BLOCK_REF that points to the parent node. This happens because the same physical block cannot be “owned” by two different trees simultaneously.

The parent field is the logical bytenr of the tree node whose key pointer array includes this block. When the filesystem needs to modify a shared block, it performs copy-on-write: allocating a new block, copying the data, and updating the parent’s pointer. This is how snapshots achieve their constant-time creation — they share all blocks with the source subvolume.

FULL_BACKREF Flag

The FULL_BACKREF flag in the extent item’s flags field indicates that this metadata extent uses only shared backrefs (no direct tree backrefs). This typically happens for tree blocks at levels > 0 after a snapshot, where the ownership is ambiguous until the block is CoW’d.

Cross-Referencing with Tree Ownership

btrfs check collects a map of (block_address -> owning_tree) during its tree walks. The owning tree for each block is determined by the owner field in the block’s header (btrfs_header). This map is then cross-referenced against the extent tree’s TREE_BLOCK_REF entries in both directions.

Block Group Items in the Extent Tree

Historically, BLOCK_GROUP_ITEM entries were stored directly in the extent tree alongside extent items. With the BLOCK_GROUP_TREE compat_ro feature (default since btrfs-progs 6.x), they are moved to a separate tree (objectid 11).

BLOCK_GROUP_ITEM Structure

Key: (logical_offset, BLOCK_GROUP_ITEM, length)
      objectid = group start    type = 192    offset = group size

Item payload (24 bytes):

Field           Offset  Size  Description
used            0       8     Bytes allocated within this block group
chunk_objectid  8       8     FIRST_CHUNK_TREE_OBJECTID (256)
flags           16      8     Type flag (DATA, METADATA, or SYSTEM) plus RAID profile bits

The used field tracks how many bytes of the block group are currently allocated to extents. For a new filesystem:

  • System block group: used = one nodesize (the chunk tree block)
  • Metadata block group: used = N * nodesize (all non-chunk tree blocks)
  • Data block group: used = 0 (no file data yet)

Ordering in the Extent Tree

When block group items are in the extent tree, they sort among the extent items by key. Since BLOCK_GROUP_ITEM has type 192 and EXTENT_ITEM has type 168 / METADATA_ITEM has type 169, block group items for a given logical offset sort after any extent item at the same address (because key comparison is (objectid, type, offset) and 192 > 169).

mkfs Construction

mkfs creates three block group items, one for each chunk:

#![allow(unused)]
fn main() {
add_block_group_items(extent_items, cfg, layout, chunks, data_used);
}

This adds entries for the system (SYSTEM flag), metadata (METADATA | profile flag), and data (DATA | profile flag) block groups.

When the BLOCK_GROUP_TREE feature is enabled, these items are placed in a separate tree instead (build_block_group_tree_with_used).

What btrfs check Verifies

The extent tree checker (implemented in cli/src/check/extents.rs) performs several categories of verification.

Reference Count Matching

For each extent item (EXTENT_ITEM or METADATA_ITEM) and its associated standalone backrefs, the checker computes the total reference count from inline + standalone refs and compares it to the declared refs field:

#![allow(unused)]
fn main() {
if state.pending_refs != state.pending_counted {
    results.report(CheckError::ExtentRefMismatch {
        bytenr, expected: state.pending_refs, found: state.pending_counted,
    });
}
}

The checker processes items in key order. When it encounters a new EXTENT_ITEM or METADATA_ITEM, it “flushes” the previous extent (checking its ref count) and begins accumulating refs for the new one. Standalone backref items (TREE_BLOCK_REF, SHARED_BLOCK_REF, EXTENT_DATA_REF, SHARED_DATA_REF, EXTENT_OWNER_REF) that follow an extent item with a matching objectid add to the running count.

Extent Overlap Detection

Extents in the extent tree are sorted by logical address. The checker tracks the end address of the previous extent and reports an error if the next extent starts before the previous one ends:

#![allow(unused)]
fn main() {
if length > 0 && bytenr < state.prev_end && state.prev_end > 0 {
    results.report(CheckError::OverlappingExtent {
        bytenr, length, prev_end: state.prev_end,
    });
}
}

Note that METADATA_ITEM entries store the tree level (not the length) in the key offset. Since the checker does not have access to the nodesize at this point, it uses length = 0 for metadata items and skips overlap detection for them.

Backref Owner Cross-Checks (Direction 1: Walk to Extent)

During tree walks in earlier check phases, the checker builds a map of tree_block_owners: HashMap<u64, u64> mapping each tree block’s logical address to the tree objectid that owns it (from the block header’s owner field).

After processing the extent tree, the checker verifies that every block encountered during walks has an extent item:

#![allow(unused)]
fn main() {
if !state.extent_item_addrs.contains(&addr) {
    results.report(CheckError::MissingExtentItem { bytenr: addr });
}
}

And that the extent tree’s backrefs agree with the actual owner:

#![allow(unused)]
fn main() {
if !claimed_owners.contains(&actual_owner) {
    results.report(CheckError::BackrefOwnerMismatch {
        bytenr: addr, actual_owner, claimed_owners,
    });
}
}

Backref Owner Cross-Checks (Direction 2: Extent to Walk)

The checker also verifies the reverse: every TREE_BLOCK_REF in the extent tree (both inline and standalone) must correspond to a tree block that was actually encountered during walks and is owned by the claimed tree:

#![allow(unused)]
fn main() {
let actual = tree_block_owners.get(&addr).copied();
if actual != Some(claimed) {
    results.report(CheckError::BackrefOrphan {
        bytenr: addr, claimed_owner: claimed,
    });
}
}

This catches “orphan” backrefs that point to blocks that either do not exist or are owned by a different tree than claimed.

Data Byte Accounting

The checker accumulates two statistics from data extents:

  • data_bytes_allocated: the sum of length for all data extent items. This is the total physical space reserved for data.

  • data_bytes_referenced: the sum of length * count for all data extent references. When data is shared (via snapshots or reflinks), referenced bytes exceed allocated bytes.

For inline-only data refs (no standalone ExtentDataRef items), referenced bytes are computed from the inline ref count. For standalone refs, each EXTENT_DATA_REF and SHARED_DATA_REF item contributes length * count.

Extent Item Construction in mkfs

Metadata Extent Items

For each tree block allocated during mkfs, the extent tree receives a metadata extent item with one inline TREE_BLOCK_REF:

#![allow(unused)]
fn main() {
fn metadata_extent_item(addr: u64, skinny: bool, generation: u64, owner: u64, nodesize: u64) -> (Key, Vec<u8>) {
    let (item_type, offset) = if skinny {
        (BTRFS_METADATA_ITEM_KEY, 0u64)     // offset = level 0
    } else {
        (BTRFS_EXTENT_ITEM_KEY, nodesize)    // offset = nodesize
    };
    (
        Key::new(addr, item_type, offset),
        extent_item(1, generation, skinny, owner),
    )
}
}

The extent_item() function serializes:

  1. btrfs_extent_item header: refs=1, generation, flags=TREE_BLOCK
  2. For non-skinny: zero-filled btrfs_tree_block_info (25 bytes)
  3. Inline TREE_BLOCK_REF: type byte (176) + root objectid (8 bytes)

Total item size: 33 bytes (skinny) or 58 bytes (non-skinny).

Data Extent Items

For each data extent written during --rootdir mode, the extent tree receives a data extent item with one inline EXTENT_DATA_REF:

#![allow(unused)]
fn main() {
fn data_extent_item(refs: u64, generation: u64, root: u64, objectid: u64, offset: u64, count: u32) -> Vec<u8> {
    let mut buf = Vec::new();
    // btrfs_extent_item header
    buf.put_u64_le(refs);
    buf.put_u64_le(generation);
    buf.put_u64_le(BTRFS_EXTENT_FLAG_DATA);
    // inline EXTENT_DATA_REF
    buf.put_u8(BTRFS_EXTENT_DATA_REF_KEY);
    buf.put_u64_le(root);
    buf.put_u64_le(objectid);
    buf.put_u64_le(offset);
    buf.put_u32_le(count);
    buf
}
}

Total item size: 53 bytes. The key is (extent_bytenr, EXTENT_ITEM, extent_length).

Self-Referential Convergence

The extent tree must contain entries for its own tree blocks. But the number of tree blocks needed depends on how many items the tree contains, which depends on how many extent items there are, which depends on the number of tree blocks… This creates a circular dependency.

The --rootdir code path solves this with a convergence loop (converge_extent_tree_block_count in mkfs/src/mkfs.rs):

  1. Start with extent_tree_block_count = 1.
  2. Build a trial extent tree with all items (including placeholder entries for the extent tree’s own blocks).
  3. If the trial tree’s actual block count differs from the assumed count, update the count and repeat.
  4. The loop converges quickly (usually in 1-2 iterations) because adding extent items for additional blocks only marginally increases the tree size.

After convergence, the real extent tree is built with actual logical addresses assigned by the BlockAllocator.
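The convergence loop can be sketched as a fixed-point iteration. The capacity model here (a flat items-per-leaf bound) and the function name are deliberately simplified for illustration:

```rust
/// Illustrative fixed-point loop: the extent tree must describe its own
/// blocks, so assume a count, rebuild, and repeat until it stops changing.
fn converge_block_count(other_items: u64, items_per_leaf: u64) -> u64 {
    let mut blocks = 1u64;
    loop {
        // One extent item is needed per extent-tree block itself.
        let total_items = other_items + blocks;
        let needed = ((total_items + items_per_leaf - 1) / items_per_leaf).max(1);
        if needed == blocks {
            return blocks;
        }
        blocks = needed;
    }
}

fn main() {
    assert_eq!(converge_block_count(10, 100), 1);  // converges immediately
    assert_eq!(converge_block_count(199, 100), 3); // grows once, then settles
}
```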

Extent Tree Key Ordering

Items in the extent tree are sorted by the standard btrfs key comparison (objectid, type, offset). Since objectid is the extent’s logical byte address, items are effectively sorted by logical address.

Within a single extent’s address, the ordering is:

  1. EXTENT_ITEM or METADATA_ITEM (type 168 or 169) — the extent header
  2. EXTENT_OWNER_REF (type 172) — if simple quotas are enabled
  3. TREE_BLOCK_REF (type 176) — standalone metadata backrefs
  4. EXTENT_DATA_REF (type 178) — standalone data backrefs
  5. SHARED_BLOCK_REF (type 182) — standalone shared metadata backrefs
  6. SHARED_DATA_REF (type 184) — standalone shared data backrefs
  7. BLOCK_GROUP_ITEM (type 192) — if not using block-group tree

This ordering is a natural consequence of the type field values and ensures that btrfs check can process all backrefs for an extent by reading items sequentially until the objectid (bytenr) changes.

Relationship to File Extents

The connection between the extent tree and actual file data flows through EXTENT_DATA items in FS trees:

FS tree: (inode, EXTENT_DATA, file_offset)
  -> disk_bytenr, disk_num_bytes, offset, num_bytes

Extent tree: (disk_bytenr, EXTENT_ITEM, disk_num_bytes)
  -> refs, generation, flags=DATA
  -> inline EXTENT_DATA_REF(root, inode, file_offset, count)

The disk_bytenr in the file extent item is the logical address of the data extent. The extent tree entry at that address records who references the extent and how many times.

For inline file extents (small files where data is embedded directly in the tree leaf), there is no corresponding extent tree entry — the data does not occupy a separate extent.

For hole/sparse extents (disk_bytenr = 0), there is similarly no extent tree entry. The no-holes feature eliminates explicit hole extent items entirely.

Summary of Key Formats

Item type         Key                     Payload
EXTENT_ITEM       (bytenr, 168, length)   extent_item + inline refs
METADATA_ITEM     (bytenr, 169, level)    extent_item + inline refs
EXTENT_OWNER_REF  (bytenr, 172, root)     (empty)
TREE_BLOCK_REF    (bytenr, 176, root)     (empty)
EXTENT_DATA_REF   (bytenr, 178, hash)     extent_data_ref (28 bytes)
SHARED_BLOCK_REF  (bytenr, 182, parent)   (empty)
SHARED_DATA_REF   (bytenr, 184, parent)   shared_data_ref (4 bytes)
BLOCK_GROUP_ITEM  (logical, 192, length)  block_group_item (24 bytes)

All bytenr values are logical byte addresses. The extent tree provides the complete picture of space allocation and ownership across the entire filesystem.

Btrfs Transaction Infrastructure: On-Disk Format Specification

This document is the sole reference for implementing the btrfs-transaction crate. It describes the on-disk format, invariants, and protocols needed to safely modify a btrfs filesystem from userspace.

Tree block layout

A btrfs filesystem stores its metadata in a B-tree. Each tree block (also called a node or extent buffer) is nodesize bytes (typically 16,384, but can be 4,096 to 65,536). Tree blocks are identified by their logical byte address (bytenr), which is translated to a physical device offset via the chunk tree.

Every tree block begins with a 101-byte header, followed by either leaf items (level 0) or internal node key pointers (level > 0).

Header (101 bytes)

All multi-byte integers are little-endian on disk.

Offset  Size  Field            Description
0       32    csum             Checksum of bytes 32..nodesize (header fields after csum + all payload). Algorithm determined by superblock csum_type. Zero-padded: for CRC32C only bytes 0..3 are meaningful.
32      16    fsid             Filesystem UUID. Must match superblock fsid (or metadata_uuid if METADATA_UUID incompat flag is set).
48      8     bytenr           Logical byte address of this block. Must match the address used to read/write it.
56      8     flags            Bits 0..55: header flags (currently unused by userspace). Bits 56..63: backref revision (1 = mixed backrefs, the modern format).
64      16    chunk_tree_uuid  UUID of the chunk tree that maps this block’s logical address to physical. Typically the same for all blocks on a single-device fs.
80      8     generation       Transaction generation when this block was last written. Critical for COW: a block with generation == current transaction has already been COWed and can be modified in place.
88      8     owner            Tree ID that owns this block (e.g. 1 for root tree, 2 for extent tree, 5 for default fs tree). Used for backref accounting.
96      4     nritems          Number of items (leaf) or key pointers (node).
100     1     level            B-tree level. 0 = leaf, 1..7 = internal node. Maximum level is 7 (BTRFS_MAX_LEVEL = 8 levels total, 0-indexed).
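A sketch of pulling the fixed-offset fields out of a raw block, following the table above (illustrative only, not the crate's actual parser):

```rust
/// Selected fields of a tree block header (101 bytes, little-endian).
struct Header {
    bytenr: u64,
    generation: u64,
    owner: u64,
    nritems: u32,
    level: u8,
}

fn parse_header(block: &[u8]) -> Header {
    let u64_at = |o: usize| u64::from_le_bytes(block[o..o + 8].try_into().unwrap());
    Header {
        bytenr: u64_at(48),
        generation: u64_at(80),
        owner: u64_at(88),
        nritems: u32::from_le_bytes(block[96..100].try_into().unwrap()),
        level: block[100],
    }
}

fn main() {
    let mut block = vec![0u8; 16384];
    block[48..56].copy_from_slice(&30_408_704u64.to_le_bytes()); // bytenr
    block[88..96].copy_from_slice(&5u64.to_le_bytes());          // owner = FS tree
    block[96..100].copy_from_slice(&12u32.to_le_bytes());        // nritems
    block[100] = 0;                                              // level 0 = leaf
    let h = parse_header(&block);
    assert_eq!(h.bytenr, 30_408_704);
    assert_eq!((h.owner, h.nritems, h.level), (5, 12, 0));
}
```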

Key (17 bytes)

Every item and pointer in the B-tree is identified by a three-part key. On disk this is the btrfs_disk_key (little-endian):

Offset  Size  Field     Description
0       8     objectid  Primary identifier (inode number, tree ID, extent bytenr, etc. depending on key type).
8       1     type      Key type discriminator (see section 7).
9       8     offset    Type-specific secondary value (file offset, extent size, parent ID, etc.).

Keys are compared as a tuple (objectid, type, offset) in that order, all as unsigned integers. This defines the sort order within every B-tree.
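Deriving lexicographic ordering over a struct with the fields in that order reproduces this comparison exactly; a small sketch:

```rust
/// A disk key compares as the tuple (objectid, type, offset).
/// Deriving Ord with the fields in that order gives exactly this order.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Key {
    objectid: u64,
    key_type: u8,
    offset: u64,
}

fn main() {
    let extent_item = Key { objectid: 30_408_704, key_type: 168, offset: 4096 };
    let block_group = Key { objectid: 30_408_704, key_type: 192, offset: 1 << 30 };
    let next_extent = Key { objectid: 30_412_800, key_type: 168, offset: 4096 };
    // Type breaks ties within one objectid; objectid dominates across extents.
    assert!(extent_item < block_group);
    assert!(block_group < next_extent);
}
```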

Leaf layout (level 0)

A leaf contains item descriptors that grow forward from the header, and item data payloads that grow backward from the end of the block. Free space is the gap between them.

Byte 0..100:                    Header
Byte 101..101+nritems*25-1:     Item descriptors [item0, item1, ..., itemN-1]
                                (25 bytes each, sorted by key ascending)
  ...free space...
Byte X..nodesize-1:             Item data [dataN-1, ..., data1, data0]
                                (packed from the end of the block backward)

Each item descriptor is 25 bytes:

Offset  Size  Field   Description
0       17    key     The item’s key (btrfs_disk_key).
17      4     offset  Byte offset of this item’s data payload, relative to the start of the data area (byte 101). To get the absolute position in the block: absolute = 101 + offset.
21      4     size    Size of the item’s data payload in bytes.

Invariants:

  • Items are sorted by key in ascending order.
  • Item data regions must not overlap.
  • The last item’s data starts at 101 + item[N-1].offset and extends for item[N-1].size bytes. Items with lower indices have data at higher offsets (data grows backward).
  • The first item’s data ends at 101 + item[0].offset + item[0].size, which must be <= nodesize.
  • Free space = (101 + item[N-1].offset) - (101 + nritems * 25). When this is < 25 + data_size for a new item, the leaf is full.

Data offset convention:

The offset field in btrfs_item counts from byte 101 (immediately after the header), not from the start of the block. When constructing a new leaf:

  1. Start data_end at nodesize.
  2. For each item (in key order): data_end -= data.len(), write data at data_end, store offset = data_end - 101 in the item descriptor.
  3. Item descriptors are written at 101 + i * 25.
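The steps above can be sketched as a function that computes the descriptor offsets for a new leaf (the item sizes here are made up for illustration):

```rust
/// Compute (offset, size) descriptor pairs for a freshly built leaf,
/// following the backward-growing data convention.
fn leaf_offsets(nodesize: u32, item_sizes: &[u32]) -> Vec<(u32, u32)> {
    const HEADER: u32 = 101;
    let mut data_end = nodesize;
    item_sizes
        .iter()
        .map(|&size| {
            data_end -= size;          // step 2: data grows backward from the end
            (data_end - HEADER, size)  // offset is relative to byte 101
        })
        .collect()
}

fn main() {
    // Two items of 160 and 53 bytes in a 16 KiB leaf.
    let offs = leaf_offsets(16_384, &[160, 53]);
    assert_eq!(offs[0], (16_123, 160)); // 16384 - 160 - 101
    assert_eq!(offs[1], (16_070, 53));  // 16384 - 160 - 53 - 101
}
```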

Internal node layout (level > 0)

An internal node contains key pointers that identify child subtrees.

Byte 0..100:                    Header
Byte 101..101+nritems*33-1:     Key pointers [ptr0, ptr1, ..., ptrN-1]
                                (33 bytes each, sorted by key ascending)

Each key pointer is 33 bytes:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 17 | key | Lowest key in the child subtree. |
| 17 | 8 | blockptr | Logical byte address of the child block. |
| 25 | 8 | generation | Generation of the child block (used for consistency checking during reads). |

Invariants:

  • Key pointers are sorted by key in ascending order.
  • blockptr must be a valid, allocated logical address.
  • generation must match the generation in the child block’s header.

Maximum capacities

For a given nodesize:

  • Leaf items per block: depends on item data size. The theoretical maximum number of zero-size items is (nodesize - 101) / 25 = 651 for 16 KiB.
  • Key pointers per node: (nodesize - 101) / 33 = 493 for 16 KiB.
  • Maximum tree depth: 8 levels (BTRFS_MAX_LEVEL). In practice, trees rarely exceed 3-4 levels.
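The capacity formulas are simple integer divisions; a minimal sketch using the sizes above:

```rust
const HEADER_SIZE: u32 = 101;

/// Theoretical maximum number of zero-size items in a leaf.
fn max_leaf_items(nodesize: u32) -> u32 {
    (nodesize - HEADER_SIZE) / 25
}

/// Maximum key pointers in an internal node.
fn max_node_ptrs(nodesize: u32) -> u32 {
    (nodesize - HEADER_SIZE) / 33
}
```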

Superblock

The superblock is the entry point for reading a btrfs filesystem. It is a 4,096-byte structure stored at fixed offsets on every device:

  • Mirror 0: byte 65,536 (64 KiB)
  • Mirror 1: byte 67,108,864 (64 MiB)
  • Mirror 2: byte 274,877,906,944 (256 GiB), only if device is large enough
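A mirror is usable only if the whole 4,096-byte structure fits on the device. Enumerating the valid mirror offsets can be sketched as follows (an illustrative helper, not a published API):

```rust
/// Fixed superblock mirror offsets: 64 KiB, 64 MiB, 256 GiB.
const SUPERBLOCK_OFFSETS: [u64; 3] = [64 << 10, 64 << 20, 256 << 30];

/// Mirror offsets whose full 4 KiB superblock fits on the device.
fn valid_mirrors(device_size: u64) -> Vec<u64> {
    SUPERBLOCK_OFFSETS
        .iter()
        .copied()
        .filter(|&off| off + 4096 <= device_size)
        .collect()
}
```

A 1 GiB device carries mirrors 0 and 1; mirror 2 only appears on devices larger than 256 GiB.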

Superblock layout (4,096 bytes)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 32 | csum | Checksum of bytes 32..4095. |
| 32 | 16 | fsid | Filesystem UUID. |
| 48 | 8 | bytenr | Physical byte offset of this copy. |
| 56 | 8 | flags | BTRFS_SUPER_FLAG_* bits. |
| 64 | 8 | magic | 0x4D5F53665248425F (ASCII “_BHRfS_M” stored little-endian). |
| 72 | 8 | generation | Current transaction generation. |
| 80 | 8 | root | Logical bytenr of root tree root block. |
| 88 | 8 | chunk_root | Logical bytenr of chunk tree root block. |
| 96 | 8 | log_root | Logical bytenr of log tree root (0 if none). |
| 104 | 8 | __unused_log_root_transid | Deprecated, always 0. |
| 112 | 8 | total_bytes | Total usable bytes across all devices. |
| 120 | 8 | bytes_used | Total bytes allocated to extents. |
| 128 | 8 | root_dir_objectid | Always 6 (BTRFS_ROOT_TREE_DIR_OBJECTID). |
| 136 | 8 | num_devices | Number of devices. |
| 144 | 4 | sectorsize | Minimum I/O unit (typically 4096). |
| 148 | 4 | nodesize | Tree block size (typically 16384). |
| 152 | 4 | __unused_leafsize | Legacy, always equal to nodesize. |
| 156 | 4 | stripesize | RAID stripe unit (typically 65536). |
| 160 | 4 | sys_chunk_array_size | Valid bytes in the sys_chunk_array field. |
| 164 | 8 | chunk_root_generation | Generation of the chunk tree root. |
| 172 | 8 | compat_flags | Compatible feature flags. |
| 180 | 8 | compat_ro_flags | Read-only compatible feature flags. |
| 188 | 8 | incompat_flags | Incompatible feature flags. |
| 196 | 2 | csum_type | Checksum algorithm (0=CRC32C, 1=xxhash, 2=SHA256, 3=BLAKE2). |
| 198 | 1 | root_level | B-tree level of root tree root. |
| 199 | 1 | chunk_root_level | B-tree level of chunk tree root. |
| 200 | 1 | log_root_level | B-tree level of log tree root. |
| 201 | 98 | dev_item | Embedded device item for this device (see section 6.4). |
| 299 | 256 | label | NUL-terminated filesystem label. |
| 555 | 8 | cache_generation | Free space cache v1 generation. |
| 563 | 8 | uuid_tree_generation | UUID tree last-updated generation. |
| 571 | 16 | metadata_uuid | Metadata UUID (if METADATA_UUID flag set). |
| 587 | 8 | nr_global_roots | Global root count (extent-tree-v2, rare). |
| 595 | 8 | remap_root | Remap tree bytenr. |
| 603 | 8 | remap_root_generation | Remap tree generation. |
| 611 | 1 | remap_root_level | Remap tree level. |
| 612 | 199 | reserved | Zero-filled. |
| 811 | 2048 | sys_chunk_array | Bootstrap chunk tree entries (key + chunk item pairs, packed sequentially). |
| 2859 | 672 | super_roots | 4 rotating backup root entries (168 bytes each). See section 2.3. |
| 3531 | 565 | padding | Zero-filled to 4096. |

Fields updated on every transaction commit

When committing a transaction, the following superblock fields are updated:

  1. generation — incremented by 1.
  2. root — logical bytenr of the (possibly new) root tree root block.
  3. root_level — level of the root tree root.
  4. chunk_root — logical bytenr of the chunk tree root (if chunk tree was modified).
  5. chunk_root_generation — generation of the chunk tree root.
  6. chunk_root_level — level of the chunk tree root.
  7. bytes_used — updated to reflect allocations/frees.
  8. log_root — set to 0 after log replay, or updated if log is active.
  9. super_roots — one of the 4 backup root slots is written (rotating).
  10. csum — recomputed last, covering bytes 32..4095.

The commit writes the superblock to all mirrors. The superblock write is the atomic commit point: if power is lost before the superblock is written, the previous generation’s state is intact because COW ensures old blocks are never overwritten (see section 3).

Backup roots (168 bytes each, 4 entries)

The superblock contains 4 rotating backup root entries. On each commit, one slot is overwritten (cycling 0 → 1 → 2 → 3 → 0 → …). These are used for recovery when the primary root pointers are corrupt.

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | tree_root | Root tree root bytenr. |
| 8 | 8 | tree_root_gen | Root tree generation. |
| 16 | 8 | chunk_root | Chunk tree root bytenr. |
| 24 | 8 | chunk_root_gen | Chunk tree generation. |
| 32 | 8 | extent_root | Extent tree root bytenr. |
| 40 | 8 | extent_root_gen | Extent tree generation. |
| 48 | 8 | fs_root | Default FS tree root bytenr. |
| 56 | 8 | fs_root_gen | FS tree generation. |
| 64 | 8 | dev_root | Device tree root bytenr. |
| 72 | 8 | dev_root_gen | Device tree generation. |
| 80 | 8 | csum_root | Checksum tree root bytenr. |
| 88 | 8 | csum_root_gen | Checksum tree generation. |
| 96 | 8 | total_bytes | Total filesystem bytes at this point. |
| 104 | 8 | bytes_used | Bytes used at this point. |
| 112 | 8 | num_devices | Device count at this point. |
| 120 | 32 | unused | Reserved (zero). |
| 152 | 1 | tree_root_level | Root tree level. |
| 153 | 1 | chunk_root_level | Chunk tree level. |
| 154 | 1 | extent_root_level | Extent tree level. |
| 155 | 1 | fs_root_level | FS tree level. |
| 156 | 1 | dev_root_level | Device tree level. |
| 157 | 1 | csum_root_level | Checksum tree level. |
| 158 | 10 | padding | Padding to 168 bytes. |

Superblock flags

| Bit | Name | Description |
|---|---|---|
| 2 | BTRFS_SUPER_FLAG_ERROR | Filesystem has errors. |
| 32 | BTRFS_SUPER_FLAG_SEEDING | Seed device (read-only base for cloning). |
| 33 | BTRFS_SUPER_FLAG_METADUMP | Metadump image. |
| 34 | BTRFS_SUPER_FLAG_METADUMP_V2 | Metadump v2 image. |
| 35 | BTRFS_SUPER_FLAG_CHANGING_FSID | FSID rewrite in progress. |
| 36 | BTRFS_SUPER_FLAG_CHANGING_FSID_V2 | FSID rewrite v2 in progress. |
| 38 | BTRFS_SUPER_FLAG_CHANGING_BG_TREE | Block group tree migration. |
| 39 | BTRFS_SUPER_FLAG_CHANGING_DATA_CSUM | Data csum algorithm change. |
| 40 | BTRFS_SUPER_FLAG_CHANGING_META_CSUM | Metadata csum algorithm change. |

Feature flags

Incompatible (incompat_flags):

| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | MIXED_BACKREF | 0x1 | Modern backreference format. |
| 1 | DEFAULT_SUBVOL | 0x2 | Non-default default subvolume set. |
| 2 | MIXED_GROUPS | 0x4 | Mixed data+metadata block groups. |
| 3 | COMPRESS_LZO | 0x8 | LZO compression used. |
| 4 | COMPRESS_ZSTD | 0x10 | ZSTD compression used. |
| 5 | BIG_METADATA | 0x20 | Metadata blocks > 4 KiB (always set with modern mkfs for nodesize > 4096). |
| 6 | EXTENDED_IREF | 0x40 | Extended inode references (INODE_EXTREF). |
| 7 | RAID56 | 0x80 | RAID5/RAID6 profiles in use. |
| 8 | SKINNY_METADATA | 0x100 | Skinny metadata extent refs (see 5.1). |
| 9 | NO_HOLES | 0x200 | No explicit hole extent items. |
| 10 | METADATA_UUID | 0x400 | metadata_uuid field is in use. |
| 11 | RAID1C34 | 0x800 | RAID1C3 or RAID1C4 profiles in use. |
| 12 | ZONED | 0x1000 | Zoned block device support. |
| 13 | EXTENT_TREE_V2 | 0x2000 | Extent tree v2 (experimental). |
| 14 | RAID_STRIPE_TREE | 0x4000 | RAID stripe tree. |
| 16 | SIMPLE_QUOTA | 0x10000 | Simple quota accounting. |

Read-only compatible (compat_ro_flags):

| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | FREE_SPACE_TREE | 0x1 | Free space tree present. |
| 1 | FREE_SPACE_TREE_VALID | 0x2 | Free space tree is valid/consistent. |
| 2 | VERITY | 0x4 | fs-verity enabled files present. |
| 3 | BLOCK_GROUP_TREE | 0x8 | Separate block group tree. |

Default features for modern mkfs:

  • incompat_flags: MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | NO_HOLES = 0x361
  • compat_ro_flags: FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | BLOCK_GROUP_TREE = 0xB
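The default values can be checked by OR-ing the bit constants from the tables above:

```rust
// Incompat feature bits (subset of the table above).
const MIXED_BACKREF: u64 = 0x1;
const BIG_METADATA: u64 = 0x20;
const EXTENDED_IREF: u64 = 0x40;
const SKINNY_METADATA: u64 = 0x100;
const NO_HOLES: u64 = 0x200;

// Read-only compat bits.
const FREE_SPACE_TREE: u64 = 0x1;
const FREE_SPACE_TREE_VALID: u64 = 0x2;
const BLOCK_GROUP_TREE: u64 = 0x8;

fn default_incompat() -> u64 {
    MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | NO_HOLES
}

fn default_compat_ro() -> u64 {
    FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | BLOCK_GROUP_TREE
}
```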

System chunk array

The sys_chunk_array (2,048 bytes at offset 811) contains bootstrap chunk entries needed to read the chunk tree itself. Format: packed sequence of (btrfs_disk_key, btrfs_chunk) pairs. The sys_chunk_array_size field says how many bytes are valid. Parsing: read key (17 bytes), then chunk header (48 bytes) + stripes (num_stripes * 32 bytes), repeat until consumed.
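The parsing loop can be sketched as follows. This is illustrative and does minimal validation; real code should verify the key type is 228 and bounds-check every read against sys_chunk_array_size.

```rust
/// Walk the packed (key, chunk) pairs in sys_chunk_array.
/// Returns (logical_offset, num_stripes) per bootstrap chunk.
fn parse_sys_chunks(array: &[u8], valid_len: usize) -> Vec<(u64, u16)> {
    let mut chunks = Vec::new();
    let mut pos = 0;
    while pos < valid_len {
        // btrfs_disk_key: objectid u64, type u8, offset u64 (17 bytes, LE).
        // The key offset is the chunk's logical start address.
        let logical = u64::from_le_bytes(array[pos + 9..pos + 17].try_into().unwrap());
        pos += 17;
        // btrfs_chunk header is 48 bytes; num_stripes is a u16 at offset 44.
        let num_stripes = u16::from_le_bytes(array[pos + 44..pos + 46].try_into().unwrap());
        pos += 48 + num_stripes as usize * 32; // skip header + stripe array
        chunks.push((logical, num_stripes));
    }
    chunks
}
```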

Copy-on-write (COW) protocol

Btrfs never modifies tree blocks in place (except when a block was already allocated in the current transaction). This is the fundamental mechanism that provides crash consistency.

COW a tree block

When a transaction needs to modify a tree block:

  1. Check generation. If block.generation == current_transaction_generation, the block was already COWed in this transaction. Modify it in place.

  2. Allocate a new block. Find free space in an appropriate metadata block group and allocate nodesize bytes at a new logical address.

  3. Copy. Copy the entire block contents to the new address.

  4. Update parent pointer. In the parent node, change the blockptr for the relevant slot to the new address, and set generation to the current transaction generation.

  5. Update the new block’s header. Set bytenr to the new logical address, generation to the current transaction generation.

  6. Queue old block for freeing. The old block’s extent reference is decremented. If its refcount reaches 0, the space is freed (but only after the transaction commits, to maintain crash consistency).

  7. COW cascades upward. If the parent was not yet COWed, it must be COWed first (step 1 check), then updated. This cascades up to the root.
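The per-block decision (steps 1 through 6) can be modeled with a toy in-memory block store. This is an illustrative model of the generation check and copy, not the crate's implementation; allocation and parent updates are reduced to a bump allocator and a returned address.

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Block {
    generation: u64,
    data: Vec<u8>,
}

/// Toy COW of a single block. Returns the (possibly new) address of the
/// writable block and, if a copy was made, the old address that the
/// caller must queue for freeing after commit.
fn cow_block(
    blocks: &mut HashMap<u64, Block>,
    addr: u64,
    current_gen: u64,
    next_free_addr: &mut u64,
) -> (u64, Option<u64>) {
    if blocks[&addr].generation == current_gen {
        return (addr, None); // already COWed this transaction: modify in place
    }
    let mut copy = blocks[&addr].clone(); // step 3: copy contents
    copy.generation = current_gen;        // step 5: new header generation
    let new_addr = *next_free_addr;       // step 2: allocate a new logical address
    *next_free_addr += 16384;
    blocks.insert(new_addr, copy);
    (new_addr, Some(addr)) // step 6: old address is queued for freeing
}
```

A second COW of the same block within one transaction returns the same address with nothing to free, which is what makes the cascade terminate.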

COW and the root pointer

The root of each tree is stored in a root_item in the root tree (tree ID 1). The root tree’s own root pointer is stored in the superblock (root field).

When COW reaches the root of a non-root tree:

  • Update the root_item’s bytenr and level fields in the root tree.
  • This modification to the root tree triggers COW of the root tree itself.

When COW reaches the root tree’s root:

  • The new root block address is written to the superblock’s root field at commit time.

COW and the chunk tree

The chunk tree root is special: its pointer lives directly in the superblock (chunk_root field), not in the root tree. If the chunk tree is modified, its new root address updates chunk_root at commit time.

Crash consistency

The commit point is the superblock write. Before the superblock is updated:

  • All new tree blocks have been written to new locations.
  • All old tree blocks are still intact at their original locations.
  • The old superblock still points to the old root tree root, which points to the old state of all trees.

If power is lost before the superblock write completes, the filesystem reverts to the previous generation. No fsck needed.

Transaction lifecycle

A transaction groups multiple tree modifications into a single atomic commit.

Start

  1. Read the current superblock generation G.
  2. Set the new transaction generation to G + 1.
  3. Track all blocks modified during this transaction (the “dirty set”).

Modify

All tree modifications (insert, delete, update items) go through COW:

  • search_slot descends the tree, COWing each block along the path.
  • Item operations modify the COWed leaf.
  • Reference counts are updated for allocated and freed extents.

Commit

  1. Flush pending reference updates. Process all queued extent reference changes (delayed refs, see section 5.3). This may modify the extent tree, which may COW more blocks and generate more ref updates. Repeat until stable (no more pending updates).

  2. Update root items. For every tree whose root block changed, update its root_item in the root tree (fields: bytenr, generation, level). This may COW the root tree.

  3. Write dirty blocks. Write all blocks in the dirty set to disk with correct checksums. Each block’s checksum covers bytes 32..nodesize.

  4. Prepare superblock. Update the superblock fields listed in section 2.2. Write one backup root entry (rotating through slots 0-3). Recompute the superblock checksum.

  5. Write superblock. Write the superblock to all mirrors. Issue fsync to ensure durability.

Abort

Discard all dirty blocks. Do not write the superblock. The filesystem remains at the previous generation.

Extent tree and reference counting

The extent tree (tree ID 2) tracks which logical address ranges are allocated and who references them. Every allocated extent (both data and metadata) has an entry in the extent tree.

Extent items

There are two key types for extent records:

EXTENT_ITEM (type 168): Used for data extents and (on older filesystems without SKINNY_METADATA) for tree blocks.

  • Key: (logical_bytenr, EXTENT_ITEM=168, size_in_bytes)
  • Data: extent_item header (24 bytes), optionally tree_block_info (18 bytes), then inline backreferences.

METADATA_ITEM (type 169): Used for tree blocks when SKINNY_METADATA incompat flag is set. This is the modern default.

  • Key: (logical_bytenr, METADATA_ITEM=169, tree_level)
  • Data: extent_item header (24 bytes), then inline backreferences. No tree_block_info (the level is in the key offset, and the first key is not stored).

Extent item header (24 bytes):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | refs | Total reference count for this extent. |
| 8 | 8 | generation | Transaction generation when allocated. |
| 16 | 8 | flags | EXTENT_FLAG_DATA (bit 0) for data extents, EXTENT_FLAG_TREE_BLOCK (bit 1) for metadata. BLOCK_FLAG_FULL_BACKREF (bit 8) indicates full backrefs (shared block refs use parent bytenr instead of root ID). |

Tree block info (18 bytes, only for non-skinny EXTENT_ITEM with TREE_BLOCK flag):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 17 | key | First key in the tree block (btrfs_disk_key). |
| 17 | 1 | level | Level of the tree block. |

Backreferences

Backreferences record who uses an extent. They come in two forms: inline (packed inside the extent item’s data) and standalone (separate items in the extent tree).

Inline backreferences follow the extent item header (and tree_block_info if present). Each inline ref has a 1-byte type followed by an 8-byte offset, then type-specific data:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 1 | type | One of the backref type codes below. |
| 1 | 8 | offset | Type-dependent (see below). |

The backref types:

| Type code | Name | Offset meaning | Extra data | Total inline size |
|---|---|---|---|---|
| 176 | TREE_BLOCK_REF | Root tree ID | (none) | 9 bytes |
| 182 | SHARED_BLOCK_REF | Parent block bytenr | (none) | 9 bytes |
| 178 | EXTENT_DATA_REF | (see below) | 28 bytes | 37 bytes |
| 184 | SHARED_DATA_REF | Parent block bytenr | 4-byte count | 13 bytes |
| 172 | EXTENT_OWNER_REF | Root tree ID | (none) | 9 bytes |

TREE_BLOCK_REF (type 176): A tree block is referenced by a specific tree (identified by root ID). The offset field IS the root objectid. No additional data. Each such ref contributes 1 to the extent’s refcount.

SHARED_BLOCK_REF (type 182): A tree block is referenced by another tree block (identified by its bytenr) rather than by root ID. This happens during snapshots. The offset field IS the parent block’s bytenr. Each such ref contributes 1 to the extent’s refcount.

EXTENT_DATA_REF (type 178): A data extent is referenced by a file. The inline form packs the following 28 bytes immediately after the type byte (the 8-byte offset from the generic header is actually the first field root of this struct — parse carefully):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | root | Root tree ID containing the referencing inode. |
| 8 | 8 | objectid | Inode number. |
| 16 | 8 | offset | File offset where this extent is referenced. |
| 24 | 4 | count | Number of references (typically 1, >1 for reflinked files). |

Each EXTENT_DATA_REF contributes count to the extent’s refcount.

SHARED_DATA_REF (type 184): A data extent is referenced through a shared tree block (snapshot). The offset field is the parent block bytenr. Additional 4 bytes:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | count | Reference count from this parent. |

Each SHARED_DATA_REF contributes count to the extent’s refcount.

Standalone backreferences: When inline refs don’t fit in the extent item (rare, happens with many references), they overflow to standalone items:

  • TREE_BLOCK_REF_KEY (176): key (extent_bytenr, 176, root_id), no data.
  • SHARED_BLOCK_REF_KEY (182): key (extent_bytenr, 182, parent_bytenr), no data.
  • EXTENT_DATA_REF_KEY (178): key (extent_bytenr, 178, hash), 28-byte btrfs_extent_data_ref data. The hash is computed as:
    high_crc = crc32c(seed=0xFFFFFFFF, root.to_le_bytes())
    low_crc  = crc32c(seed=0xFFFFFFFF, objectid.to_le_bytes())
    low_crc  = crc32c(seed=low_crc,    offset.to_le_bytes())
    hash     = (high_crc as u64) << 31 ^ (low_crc as u64)
    
    Note: these are raw CRC32C register values (no final inversion), not the standard finalized form.
  • SHARED_DATA_REF_KEY (184): key (extent_bytenr, 184, parent_bytenr), 4-byte count.

Delayed references

Modifying a tree generates many reference count updates (every COWed block creates a new ref and removes an old ref). Processing each one immediately would cause excessive extent tree modifications. Instead, reference updates are queued and batched:

  1. When a block is COWed, queue: +1 ref at new_bytenr, -1 ref at old_bytenr.
  2. When a block is allocated for splitting, queue +1 ref.
  3. When blocks are freed (e.g., after merging), queue -1 ref.

At commit time, process all queued refs:

  • Merge updates to the same extent (e.g., +1 and -1 cancel out).
  • For each remaining update, modify the extent item in the extent tree.
  • If a refcount drops to 0, delete the extent item and free the space.
  • Processing delayed refs modifies the extent tree, which may generate more delayed refs (from COWing extent tree blocks). Repeat until the queue is empty. This converges because each iteration processes more refs than it creates.
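The merge step can be sketched as accumulating per-extent deltas; entries that cancel to zero never touch the extent tree. This is simplified: real delayed refs also carry the backref details needed to edit the extent item.

```rust
use std::collections::BTreeMap;

/// Merge queued reference updates per extent bytenr.
/// Fully cancelled extents drop out of the result.
fn merge_delayed_refs(queue: &[(u64, i64)]) -> BTreeMap<u64, i64> {
    let mut merged = BTreeMap::new();
    for &(bytenr, delta) in queue {
        *merged.entry(bytenr).or_insert(0) += delta;
    }
    merged.retain(|_, delta| *delta != 0); // +1 and -1 cancel out
    merged
}
```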

Refcount invariant

The refs field in an extent item must always equal the sum of all its backreferences:

  • Each TREE_BLOCK_REF or SHARED_BLOCK_REF contributes 1.
  • Each EXTENT_DATA_REF contributes its count field.
  • Each SHARED_DATA_REF contributes its count field.

If refs reaches 0, the extent is freed.

Block groups, chunks, and device extents

Btrfs organizes disk space into three layers: block groups (logical allocation regions), chunks (logical-to-physical mapping), and device extents (physical device reservations).

Block group item (24 bytes)

Stored in the extent tree (or block group tree if BLOCK_GROUP_TREE compat_ro flag is set).

Key: (logical_offset, BLOCK_GROUP_ITEM=192, length)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | used | Bytes currently allocated within this group. |
| 8 | 8 | chunk_objectid | Always 256 (BTRFS_FIRST_CHUNK_TREE_OBJECTID). |
| 16 | 8 | flags | Type + RAID profile (see 6.5). |

Block groups are the allocation units: when allocating an extent, the allocator finds a block group of the right type (DATA, METADATA, or SYSTEM) with enough free space.

Chunk item (48 + num_stripes * 32 bytes)

Stored in the chunk tree (tree ID 3).

Key: (256, CHUNK_ITEM=228, logical_offset)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | length | Logical size of this chunk. |
| 8 | 8 | owner | Owner tree (always 2, extent tree). |
| 16 | 8 | stripe_len | Stripe unit for RAID (typically 65536). |
| 24 | 8 | type | Flags: same as block group flags. |
| 32 | 4 | io_align | I/O alignment (typically 65536 for non-system, sectorsize for system chunks). |
| 36 | 4 | io_width | I/O width (same as io_align). |
| 40 | 4 | sector_size | Device sector size (typically 4096). |
| 44 | 2 | num_stripes | Number of stripes. |
| 46 | 2 | sub_stripes | Sub-stripes for RAID10 (0 otherwise). |
| 48 | 32 × num_stripes | stripes | Array of stripe descriptors. |

Each stripe (32 bytes):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | devid | Device ID. |
| 8 | 8 | offset | Physical byte offset on the device. |
| 16 | 16 | dev_uuid | Device UUID. |

Chunk-to-physical resolution: For a logical address L within a chunk starting at chunk_start with a single stripe at device offset phys: physical = phys + (L - chunk_start). RAID profiles use more complex mapping.
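For the single-stripe case, resolution is one range check plus an addition; a minimal sketch:

```rust
/// Resolve a logical address through a SINGLE-profile (one-stripe) chunk.
/// Returns None when the address is not mapped by this chunk.
fn logical_to_physical(
    chunk_start: u64,
    chunk_len: u64,
    stripe_physical: u64,
    logical: u64,
) -> Option<u64> {
    if logical < chunk_start || logical >= chunk_start + chunk_len {
        return None; // outside this chunk's logical range
    }
    Some(stripe_physical + (logical - chunk_start))
}
```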

Device extent (48 bytes)

Stored in the device tree (tree ID 4).

Key: (devid, DEV_EXTENT=204, physical_offset)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | chunk_tree | Always 3 (BTRFS_CHUNK_TREE_OBJECTID). |
| 8 | 8 | chunk_objectid | Always 256. |
| 16 | 8 | chunk_offset | Logical offset of the owning chunk. |
| 24 | 8 | length | Length of this device extent. |
| 32 | 16 | chunk_tree_uuid | Chunk tree UUID. |

For each stripe in a chunk, there is one device extent on the corresponding device.

Device item (98 bytes)

Stored in the chunk tree (and embedded in the superblock for the local device).

Key: (1, DEV_ITEM=216, devid) (objectid 1 = BTRFS_DEV_ITEMS_OBJECTID)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | devid | Device ID (1, 2, 3, …). |
| 8 | 8 | total_bytes | Total device size. |
| 16 | 8 | bytes_used | Bytes allocated to chunks on this device. |
| 24 | 4 | io_align | I/O alignment. |
| 28 | 4 | io_width | I/O width. |
| 32 | 4 | sector_size | Sector size. |
| 36 | 8 | type | Reserved (0). |
| 44 | 8 | generation | Last transaction touching this device. |
| 52 | 8 | start_offset | Start offset for new allocations. |
| 60 | 4 | dev_group | Reserved (0). |
| 64 | 1 | seek_speed | Hint (0 = unset). |
| 65 | 1 | bandwidth | Hint (0 = unset). |
| 66 | 16 | uuid | Device UUID. |
| 82 | 16 | fsid | Filesystem UUID. |

Block group type flags

| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | DATA | 0x1 | Data extents. |
| 1 | SYSTEM | 0x2 | System (chunk tree) metadata. |
| 2 | METADATA | 0x4 | Metadata extents. |
| 3 | RAID0 | 0x8 | Striped. |
| 4 | RAID1 | 0x10 | Mirrored (2 copies). |
| 5 | DUP | 0x20 | Duplicated on same device. |
| 6 | RAID10 | 0x40 | Striped + mirrored. |
| 7 | RAID5 | 0x80 | RAID5. |
| 8 | RAID6 | 0x100 | RAID6. |
| 9 | RAID1C3 | 0x200 | Mirrored (3 copies). |
| 10 | RAID1C4 | 0x400 | Mirrored (4 copies). |

A block group’s flags combine exactly one type (DATA, SYSTEM, METADATA) with zero or one RAID profile. If no RAID profile bit is set, the block group is SINGLE (no replication, but the virtual SINGLE bit 48 = 0x1000000000000 is used in some display contexts only).
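The "exactly one type, at most one profile" rule can be checked with two bit masks (mask values derived from the table above; an illustrative validator):

```rust
const TYPE_MASK: u64 = 0x7;      // DATA | SYSTEM | METADATA
const PROFILE_MASK: u64 = 0x7F8; // RAID0 through RAID1C4

/// A block group must carry exactly one type bit and at most one
/// RAID profile bit; no profile bit means SINGLE.
fn validate_bg_flags(flags: u64) -> bool {
    let type_bits = flags & TYPE_MASK;
    let profile_bits = flags & PROFILE_MASK;
    type_bits.count_ones() == 1 && profile_bits.count_ones() <= 1
}
```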

Relationships between structures

For each allocated region of logical space:

  1. A block group item in the extent tree defines the logical range and tracks usage.
  2. A chunk item in the chunk tree maps the same logical range to one or more physical stripes.
  3. For each stripe, a device extent in the device tree reserves the physical space on that device.
  4. The device item in the chunk tree tracks total and used bytes per device.

All four must be consistent. When allocating a new block group (rare in rescue operations), all four structures must be created atomically within one transaction.

Tree types and key reference

Tree IDs

| ID | Name | Stored in |
|---|---|---|
| 1 | Root tree | Superblock (root field) |
| 2 | Extent tree | Root tree (ROOT_ITEM objectid=2) |
| 3 | Chunk tree | Superblock (chunk_root field) |
| 4 | Device tree | Root tree (ROOT_ITEM objectid=4) |
| 5 | Default FS tree | Root tree (ROOT_ITEM objectid=5) |
| 6 | Root tree directory | (virtual, in root tree) |
| 7 | Checksum tree | Root tree (ROOT_ITEM objectid=7) |
| 8 | Quota tree | Root tree (ROOT_ITEM objectid=8) |
| 9 | UUID tree | Root tree (ROOT_ITEM objectid=9) |
| 10 | Free space tree | Root tree (ROOT_ITEM objectid=10) |
| 11 | Block group tree | Root tree (ROOT_ITEM objectid=11) |
| 12 | RAID stripe tree | Root tree (ROOT_ITEM objectid=12) |
| 256+ | User subvolume/snapshot trees | Root tree (ROOT_ITEM objectid=N) |

The root tree is the master index. It contains a ROOT_ITEM for every other tree (except itself and the chunk tree, whose roots are in the superblock).

Root item (439 bytes)

Stored in root tree with key (tree_id, ROOT_ITEM=132, 0).

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 160 | inode | Embedded btrfs_inode_item (see 7.3). |
| 160 | 8 | generation | Transaction generation of this root. |
| 168 | 8 | root_dirid | Root directory objectid (typically 256). |
| 176 | 8 | bytenr | Logical bytenr of this tree’s root block. |
| 184 | 8 | byte_limit | Deprecated (0). |
| 192 | 8 | bytes_used | Total bytes used by this tree’s extents. |
| 200 | 8 | last_snapshot | Generation of last snapshot of this tree. |
| 208 | 8 | flags | Root flags (bit 0 = read-only subvolume). |
| 216 | 4 | refs | Reference count. |
| 220 | 17 | drop_progress | Key tracking in-progress drop operation. |
| 237 | 1 | drop_level | Level of drop progress. |
| 238 | 1 | level | Current B-tree height of this root. |
| 239 | 8 | generation_v2 | Same as generation (marks v2 format). |
| 247 | 16 | uuid | Subvolume UUID. |
| 263 | 16 | parent_uuid | Parent subvolume UUID (for snapshots). |
| 279 | 16 | received_uuid | Source UUID (for received subvolumes). |
| 295 | 8 | ctransid | Transaction of last inode change. |
| 303 | 8 | otransid | Transaction when this root was created. |
| 311 | 8 | stransid | Transaction when sent. |
| 319 | 8 | rtransid | Transaction when received. |
| 327 | 12 | ctime | Change time (8-byte sec + 4-byte nsec). |
| 339 | 12 | otime | Creation time. |
| 351 | 12 | stime | Send time. |
| 363 | 12 | rtime | Receive time. |
| 375 | 64 | reserved | Zero-filled. |

Fields updated when a tree’s root block changes (during commit):

  • bytenr — new root block address.
  • generation and generation_v2 — current transaction generation.
  • level — root block level.

Inode item (160 bytes)

Embedded in root items and stored standalone in FS trees.

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | generation | NFS generation. |
| 8 | 8 | transid | Last modifying transaction. |
| 16 | 8 | size | File size. |
| 24 | 8 | nbytes | Sum of EXTENT_DATA.num_bytes for regular/prealloc extents plus inline payload length. See File data extents. |
| 32 | 8 | block_group | Block group hint for allocation. |
| 40 | 4 | nlink | Hard link count. |
| 44 | 4 | uid | User ID. |
| 48 | 4 | gid | Group ID. |
| 52 | 4 | mode | File mode (permissions + type). |
| 56 | 8 | rdev | Device number (block/char devices). |
| 64 | 8 | flags | Inode flags. |
| 72 | 8 | sequence | NFS sequence number. |
| 80 | 32 | reserved | Zero-filled. |
| 112 | 12 | atime | Access time (8-byte sec + 4-byte nsec). |
| 124 | 12 | ctime | Change time. |
| 136 | 12 | mtime | Modification time. |
| 148 | 12 | otime | Creation time. |

Key type reference

All key types with their numeric values:

| Value | Name | Primary tree | Key semantics |
|---|---|---|---|
| 1 | INODE_ITEM | FS tree | (inode#, 1, 0) |
| 12 | INODE_REF | FS tree | (inode#, 12, parent_dir_inode#) |
| 13 | INODE_EXTREF | FS tree | (inode#, 13, hash) |
| 24 | XATTR_ITEM | FS tree | (inode#, 24, name_hash) |
| 36 | VERITY_DESC_ITEM | FS tree | (inode#, 36, 0) |
| 37 | VERITY_MERKLE_ITEM | FS tree | (inode#, 37, offset) |
| 48 | ORPHAN_ITEM | Root/FS tree | (objectid, 48, offset) |
| 60 | DIR_LOG_ITEM | Log tree | (dir_inode#, 60, hash) |
| 72 | DIR_LOG_INDEX | Log tree | (dir_inode#, 72, index) |
| 84 | DIR_ITEM | FS tree | (dir_inode#, 84, name_hash) |
| 96 | DIR_INDEX | FS tree | (dir_inode#, 96, index) |
| 108 | EXTENT_DATA | FS tree | (inode#, 108, file_offset) |
| 128 | EXTENT_CSUM | Csum tree | (-10, 128, logical_bytenr) |
| 132 | ROOT_ITEM | Root tree | (tree_id, 132, 0) |
| 144 | ROOT_BACKREF | Root tree | (child_id, 144, parent_id) |
| 156 | ROOT_REF | Root tree | (parent_id, 156, child_id) |
| 168 | EXTENT_ITEM | Extent tree | (bytenr, 168, size) |
| 169 | METADATA_ITEM | Extent tree | (bytenr, 169, level) |
| 172 | EXTENT_OWNER_REF | (inline only) | |
| 176 | TREE_BLOCK_REF | Extent tree | (bytenr, 176, root_id) |
| 178 | EXTENT_DATA_REF | Extent tree | (bytenr, 178, hash) |
| 182 | SHARED_BLOCK_REF | Extent tree | (bytenr, 182, parent_bytenr) |
| 184 | SHARED_DATA_REF | Extent tree | (bytenr, 184, parent_bytenr) |
| 192 | BLOCK_GROUP_ITEM | Extent tree\* | (logical, 192, length) |
| 198 | FREE_SPACE_INFO | Free space tree | (bg_start, 198, bg_length) |
| 199 | FREE_SPACE_EXTENT | Free space tree | (start, 199, length) |
| 200 | FREE_SPACE_BITMAP | Free space tree | (start, 200, length) |
| 204 | DEV_EXTENT | Device tree | (devid, 204, phys_offset) |
| 216 | DEV_ITEM | Chunk tree | (1, 216, devid) |
| 228 | CHUNK_ITEM | Chunk tree | (256, 228, logical) |
| 230 | RAID_STRIPE | Stripe tree | (logical, 230, length) |
| 240 | QGROUP_STATUS | Quota tree | (0, 240, 0) |
| 242 | QGROUP_INFO | Quota tree | (qgroupid, 242, 0) |
| 244 | QGROUP_LIMIT | Quota tree | (qgroupid, 244, 0) |
| 246 | QGROUP_RELATION | Quota tree | (qgroupid, 246, other_qgroupid) |
| 248 | TEMPORARY_ITEM | Root tree | (objectid, 248, offset) |
| 249 | PERSISTENT_ITEM | Root tree | (objectid, 249, offset) |
| 250 | DEV_REPLACE | Root tree | (objectid, 250, 0) |

*BLOCK_GROUP_ITEM lives in the extent tree by default. With the BLOCK_GROUP_TREE compat_ro flag, it moves to tree ID 11.

Root ref and root backref (18+ bytes)

Forward and backward links between parent and child subvolumes.

ROOT_REF key: (parent_tree_id, ROOT_REF=156, child_tree_id) ROOT_BACKREF key: (child_tree_id, ROOT_BACKREF=144, parent_tree_id)

Both use the same data format:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | dirid | Directory objectid in the parent tree that contains this subvolume. |
| 8 | 8 | sequence | Index in the directory. |
| 16 | 2 | name_len | Length of the subvolume name. |
| 18 | name_len | name | Subvolume name (not NUL-terminated). |

File data extents

Regular file content lives in EXTENT_DATA items in the FS tree, keyed (inode#, EXTENT_DATA, file_offset). Each item describes a contiguous range of the file’s logical bytes; consecutive items must cover non-overlapping ranges. Three extent types exist:

  • BTRFS_FILE_EXTENT_INLINE (0): data embedded directly in the leaf.
  • BTRFS_FILE_EXTENT_REG (1): pointer to a separate data extent on disk.
  • BTRFS_FILE_EXTENT_PREALLOC (2): reserved on disk but not yet written.

Common header (21 bytes)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | generation | Transid at extent creation. |
| 8 | 8 | ram_bytes | Uncompressed size of the extent’s data. |
| 16 | 1 | compression | 0=none, 1=zlib, 2=LZO, 3=zstd. |
| 17 | 1 | encryption | Always 0. |
| 18 | 2 | other_encoding | Always 0. |
| 20 | 1 | extent_type | 0=inline, 1=regular, 2=prealloc. |

Regular and prealloc body (32 bytes follow header)

| Offset | Size | Field | Description |
|---|---|---|---|
| 21 | 8 | disk_bytenr | Logical address of data extent (0 = hole). |
| 29 | 8 | disk_num_bytes | On-disk size, sectorsize-aligned. |
| 37 | 8 | offset | Byte offset into the on-disk extent (bookend; 0 for non-shared). |
| 45 | 8 | num_bytes | Logical file bytes covered by this item. |

Inline body

For inline extents the bytes after the 21-byte header are the (possibly compressed) file data. There is no disk_bytenr, no extent-tree entry, and no csum entry: the inline payload is covered by the FS tree leaf’s own checksum.

For LZO inline extents the embedded bytes carry an additional framing header: [4B total_len LE] [4B seg_len LE] [lzo1x compressed bytes], where total_len includes the 8-byte framing header itself.

Validation rules

These invariants are enforced by btrfs check and must hold for any EXTENT_DATA written by userspace:

  • Regular and prealloc extents: num_bytes must be sectorsize-aligned and non-zero. disk_num_bytes must also be sectorsize-aligned. num_bytes + offset <= ram_bytes.
  • Inline extents: total embedded payload (compressed or not) must fit in a leaf, capped at min(nodesize - 147, sectorsize - 1) bytes on a default filesystem. The 147 = HEADER_SIZE (101) + ITEM_SIZE (25) + 21-byte file-extent header. The sectorsize - 1 cap is btrfs’s rule that sector-or-larger files must use a regular extent.
  • INODE.nbytes: sum of num_bytes for every regular/prealloc extent (where disk_bytenr > 0) plus the inline payload length for any inline extent. For non-compressed extents num_bytes is the sector-aligned logical size, NOT the on-disk byte count. For compressed extents num_bytes is still the sector-aligned logical size — the smaller disk_num_bytes is not what gets summed.
  • INODE.size: the file’s logical size in bytes. May be smaller than the sum of num_bytes (the unwritten tail in the last extent reads as zero up to size).
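The nbytes rule can be made concrete with a small accumulator over extent summaries. The types here are illustrative, not the crate's own:

```rust
/// Minimal per-extent summary for nbytes accounting (hypothetical types).
enum Extent {
    Inline { payload_len: u64 },
    Regular { disk_bytenr: u64, num_bytes: u64 },
    Prealloc { num_bytes: u64 },
}

/// Compute INODE.nbytes per the rule above: sector-aligned logical sizes
/// for regular/prealloc extents (holes excluded), plus inline payload length.
fn compute_nbytes(extents: &[Extent]) -> u64 {
    extents
        .iter()
        .map(|e| match e {
            Extent::Inline { payload_len } => *payload_len,
            Extent::Regular { disk_bytenr: 0, .. } => 0, // hole: not counted
            Extent::Regular { num_bytes, .. } => *num_bytes,
            Extent::Prealloc { num_bytes } => *num_bytes,
        })
        .sum()
}
```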

LZO regular framing

For non-inline LZO extents, the on-disk bytes use a per-sector framed format:

```
[4B total_len LE] { [4B seg_len LE] [lzo1x compressed bytes] [zero pad] }*
```

  • Each input sector is compressed independently.
  • seg_len is the size of that sector’s compressed segment.
  • total_len is the total framed buffer size, including the 4-byte header.
  • After each segment, if fewer than 4 bytes remain in the current sector (i.e. the next 4-byte length header would cross a sector boundary), zero-pad to the next sector boundary so the next segment’s length header is sector-aligned.

This per-sector independence lets the kernel decompress individual sectors without reading neighbours.
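The padding rule above can be sketched as a helper that decides where the next segment's length header goes (illustrative):

```rust
/// After writing a compressed segment ending at `pos` bytes into the
/// framed buffer, return where the next 4-byte seg_len header starts:
/// pad to the next sector boundary if fewer than 4 bytes remain.
fn next_segment_offset(pos: usize, sectorsize: usize) -> usize {
    let room = sectorsize - (pos % sectorsize);
    if room < 4 { pos + room } else { pos }
}
```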

Holes

With the NO_HOLES incompat flag (default on modern filesystems), gaps in the file_offset sequence indicate holes — no EXTENT_DATA item is written for the unmapped range. Without NO_HOLES, hole regions are recorded as regular extents with disk_bytenr == 0 and disk_num_bytes == 0.

Checksum computation

Tree block checksums

The checksum field (bytes 0..31 of the header) covers bytes 32..nodesize. For CRC32C (type 0), the checksum is 4 bytes stored at offset 0, with bytes 4..31 zero-padded.

Computation: standard CRC32C (Castagnoli polynomial; initial seed 0xFFFFFFFF, final XOR with 0xFFFFFFFF) over the data region bytes 32..nodesize.

Superblock checksums

Same as tree block checksums: bytes 0..31 are the checksum field, covering bytes 32..4095.

Data checksums (csum tree)

Data checksums are stored in the csum tree (tree ID 7) with key (EXTENT_CSUM_OBJECTID=-10, EXTENT_CSUM=128, logical_bytenr).

The item data is a packed array of checksums, one per sector. For CRC32C, each checksum is 4 bytes. The number of sectors covered is item_size / csum_size_for_type. Sectors are consecutive starting at the key’s offset (logical_bytenr).

Computation: standard CRC32C (the same algorithm as for tree blocks; initial seed 0xFFFFFFFF, final XOR with 0xFFFFFFFF). The csum input is the on-disk bytes of the data extent — for compressed extents, that is the compressed+sector-padded payload, NOT the uncompressed original.

Note this is distinct from the raw_crc32c (no final invert) used by EXTENT_DATA_REF_KEY hashes and by the send-stream protocol. On-disk csum-tree entries always use the standard variant.

A single csum item may cover multiple consecutive sectors. The practical upper bound for a single item’s payload is roughly leaf_data_size - 2 * item_header_size - csum_size bytes, leaving room for a future split. Adjacent items at sector-contiguous logical addresses may be merged into one larger item, but btrfs check accepts either layout.

Inline extents have no csum entries — the data lives in the leaf and is covered by the leaf’s own header checksum.

NODATASUM extents (inode flag BTRFS_INODE_NODATASUM) skip csum computation entirely. btrfs check rejects csum entries for NODATASUM extents, and rejects missing csum entries for non-NODATASUM regular extents.

Extent data ref hash

The hash used in EXTENT_DATA_REF_KEY’s offset field:

high_crc = raw_crc32c(seed=0xFFFFFFFF, root.to_le_bytes())
low_crc  = raw_crc32c(seed=0xFFFFFFFF, objectid.to_le_bytes())
low_crc  = raw_crc32c(seed=low_crc,    offset.to_le_bytes())
hash     = (high_crc as u64) << 31 ^ (low_crc as u64)

Here raw_crc32c means NO final XOR — the raw CRC register value. This can be recovered from the standard API: raw = !standard_crc32c(data) when seed is !0, or equivalently raw = crc32c_with_seed(!0, data) if the API exposes the seed.
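A runnable sketch of this hash follows; the raw-CRC helper is repeated bitwise here so the snippet is self-contained (names are mine):

```rust
// Raw CRC32C: Castagnoli polynomial (reflected form 0x82F63B78),
// seedable, NO final XOR -- the raw register value.
fn crc32c_raw(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    crc
}

// EXTENT_DATA_REF offset hash, following the pseudocode above.
// Note the shift by 31 (not 32) -- a long-standing btrfs quirk that
// makes the high and low halves overlap in one bit.
fn hash_extent_data_ref(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = crc32c_raw(0xFFFF_FFFF, &root.to_le_bytes());
    let mut low_crc = crc32c_raw(0xFFFF_FFFF, &objectid.to_le_bytes());
    low_crc = crc32c_raw(low_crc, &offset.to_le_bytes());
    ((high_crc as u64) << 31) ^ (low_crc as u64)
}
```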

B-tree operations

This section describes the algorithms for searching, inserting, and deleting items in a btrfs B-tree. These are standard B-tree algorithms adapted for the btrfs leaf/node layout and COW model.

Binary search within a block

Given a block and a target key, find the slot:

In a leaf: Binary search over items[0..nritems-1] comparing keys. If found, return (true, slot). If not found, return (false, slot) where slot is the insertion point (the index of the first item with key > target).

In a node: Binary search over ptrs[0..nritems-1] comparing keys. The result is the slot of the child subtree that could contain the target key. Specifically, find the largest slot where ptrs[slot].key <= target. If the target is less than all keys, use slot 0.
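Both searches can be sketched with the standard library's binary search, assuming keys compare lexicographically as (objectid, type, offset) tuples:

```rust
// Keys order as (objectid, type, offset) tuples.
type Key = (u64, u8, u64);

// Leaf: Ok(slot) if the key exists, Err(slot) = insertion point
// (index of the first item with key > target).
fn leaf_search(keys: &[Key], target: Key) -> Result<usize, usize> {
    keys.binary_search(&target)
}

// Node: slot of the child subtree that could contain the target --
// the largest slot whose key <= target, or 0 if target < all keys.
fn node_search(keys: &[Key], target: Key) -> usize {
    match keys.binary_search(&target) {
        Ok(slot) => slot,
        Err(0) => 0,
        Err(insertion_point) => insertion_point - 1,
    }
}
```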

Search (search_slot)

search_slot(trans, root, key, path, ins_len, cow) descends from the root to a leaf:

  1. Start at the root block (level = root_level).
  2. If cow != 0 and the block hasn’t been COWed in this transaction, COW it.
  3. Binary search for the key within the block.
  4. Store (block, slot) in path.nodes[level] and path.slots[level].
  5. If level > 0: read the child at ptrs[slot].blockptr, go to step 2 with the child.
  6. If level == 0: done. If the key was found, path.slots[0] points to it. If not found, path.slots[0] is the insertion point.

When ins_len > 0 (insert operation), the search checks whether the target leaf has enough free space. If not, it triggers a leaf split before returning.

Item insertion

Given a search path pointing to the insertion slot in a leaf:

  1. If the leaf has enough free space (>= 25 + data_size):
    a. Shift items at slots [insert_slot..nritems-1] right by 25 bytes (one item descriptor).
    b. Shift all data belonging to items at [insert_slot..nritems-1] left by data_size bytes (making room at the end of the data area).
    c. Update the offset field of shifted items (subtract data_size from each).
    d. Write the new item descriptor at the insert slot.
    e. Write the new item data.
    f. Increment nritems.

  2. If the leaf is full: split the leaf (see Leaf split below), then insert.

Item deletion

Given a search path pointing to items to delete (slot, count):

  1. If deleting items in the middle: shift items at [slot+count..nritems-1] left by count * 25 bytes.
  2. Shift data: move data belonging to remaining items to fill the gap left by deleted items’ data. Update offset fields accordingly.
  3. Decrement nritems by count.
  4. If the leaf becomes empty: remove the key pointer from the parent node and free the leaf block. If the parent also becomes empty (or has only one child), rebalance upward.

Leaf split

When a leaf is too full for an insertion:

  1. Allocate a new leaf block.
  2. Find the split point: aim for roughly half the data in each leaf. The split point should be at an item boundary (never split an item).
  3. Copy items [split..nritems-1] and their data to the new leaf.
  4. Update the original leaf’s nritems.
  5. Insert a new key pointer in the parent node pointing to the new leaf. The key is the first key of the new leaf.
  6. If the parent node is full, split the parent (see Node split below).

Node split

When an internal node is too full for a new key pointer:

  1. Allocate a new node at the same level.
  2. Move roughly half the key pointers to the new node.
  3. Insert a new key pointer in the parent (one level up) for the new node. The key is the first key of the new node.
  4. If the parent is also full, split it recursively.
  5. If the root node splits, create a new root one level higher containing two key pointers (to the old and new nodes). Update the tree’s root pointer. The tree grows taller by one level.

Rebalancing (optional optimization)

Before splitting, try to redistribute items to a neighboring sibling:

  • Push left: If the left sibling has free space, move items from the start of the full leaf to the end of the left sibling. Update the parent’s key for the full leaf.
  • Push right: If the right sibling has free space, move items from the end of the full leaf to the start of the right sibling. Update the parent’s key for the right sibling.

This reduces tree height growth. It’s an optimization, not required for correctness. The same applies to nodes (push key pointers to siblings).

After deletion, if a leaf or node is less than ~25% full, consider merging with a sibling. This is also optional for correctness but prevents excessive tree bloat.

Path advancement

next_leaf(path): advance from the current leaf to the next one.

  1. Walk up the path until finding a level where slot < nritems - 1.
  2. Increment that slot.
  3. Walk back down, always taking slot 0, until reaching a leaf.
  4. Update the path at each level.

prev_leaf(path): similar but in reverse (walk up until slot > 0, decrement, walk down taking the last slot at each level).
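A sketch of next_leaf over just the slot bookkeeping. Note the assumptions: a real path also holds the block at each level, and re-reads the nritems of every freshly entered block on the way down; the struct here is illustrative only.

```rust
// One entry per level of the path; index 0 is the leaf, the last
// index is the root.
struct PathLevel {
    slot: usize,
    nritems: usize,
}

// Advance the path to the first slot of the next leaf.
// Returns false if the current leaf is already the last one.
fn next_leaf(path: &mut [PathLevel]) -> bool {
    // Walk up until some level has a slot to the right of the current one.
    let mut level = 1;
    loop {
        if level >= path.len() {
            return false; // ran past the root: no next leaf
        }
        if path[level].slot + 1 < path[level].nritems {
            break;
        }
        level += 1;
    }
    path[level].slot += 1;
    // Walk back down, taking slot 0 at every lower level.
    for lower in 0..level {
        path[lower].slot = 0;
    }
    true
}
```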

Free space management

To allocate extents, the transaction crate needs to know which logical addresses are free within each block group.

Extent tree scanning

The simplest approach: walk the extent tree within a block group’s logical range. Allocated extents are contiguous EXTENT_ITEM/METADATA_ITEM entries. Gaps between them are free space. This is O(n) in the number of extents but works without additional infrastructure.
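The gap computation itself is straightforward once the walk has collected the allocated extents (assumed here to be sorted, non-overlapping, and clipped to the block group):

```rust
// Given allocated extents (start, len) within a block group
// [bg_start, bg_start + bg_len), return the free gaps between them
// as (start, len) pairs, in bytes.
fn free_gaps(bg_start: u64, bg_len: u64, allocated: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut gaps = Vec::new();
    let mut cursor = bg_start;
    for &(start, len) in allocated {
        if start > cursor {
            gaps.push((cursor, start - cursor)); // hole before this extent
        }
        cursor = start + len;
    }
    let end = bg_start + bg_len;
    if cursor < end {
        gaps.push((cursor, end - cursor)); // tail of the block group
    }
    gaps
}
```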

Free space tree (optional optimization)

If the FREE_SPACE_TREE compat_ro flag is set, the free space tree (tree ID 10) provides pre-computed free space information per block group.

For each block group, there is a FREE_SPACE_INFO item: Key: (block_group_start, FREE_SPACE_INFO=198, block_group_length)

Offset  Size  Field         Description
0       4     extent_count  Number of free extents.
4       4     flags         Bit 0: USING_BITMAPS (bitmap mode).

If not using bitmaps, free extents are stored as: Key: (start, FREE_SPACE_EXTENT=199, length) — no item data.

If using bitmaps: Key: (start, FREE_SPACE_BITMAP=200, length) — item data is a bitmap where each bit represents one sector (1 = free).

The free space tree must be kept in sync with the extent tree during transactions. When allocating or freeing extents, update both.
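A sketch of decoding a bitmap item into free runs. One assumption to flag: the bit order here is little-endian within each byte (bit i % 8 of byte i / 8), matching how the kernel addresses these bitmaps — verify against a real filesystem before relying on it.

```rust
// Decode a FREE_SPACE_BITMAP payload into (start, length) free runs,
// in bytes. One bit per sector, 1 = free; `start` is the key offset.
fn decode_bitmap(start: u64, sectorsize: u64, bitmap: &[u8]) -> Vec<(u64, u64)> {
    let mut runs = Vec::new();
    let mut run_start: Option<u64> = None;
    let total_bits = bitmap.len() * 8;
    // Iterate one past the end so a trailing run gets closed.
    for i in 0..=total_bits {
        let free = i < total_bits && (bitmap[i / 8] >> (i % 8)) & 1 != 0;
        match (free, run_start) {
            (true, None) => run_start = Some(start + i as u64 * sectorsize),
            (false, Some(s)) => {
                runs.push((s, start + i as u64 * sectorsize - s));
                run_start = None;
            }
            _ => {}
        }
    }
    runs
}
```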

Allocation strategy

For metadata blocks:

  • Find a block group with type METADATA (or SYSTEM for chunk tree blocks).
  • Find a free region >= nodesize.
  • Prefer the block group hinted by the tree’s root item or the most recently used block group.

For data extents:

  • Find a block group with type DATA.
  • Find a free region >= requested size.

Rescue command requirements

This section maps each rescue command to the specific tree operations needed.

clear-uuid-tree

Delete all items from the UUID tree and remove its root item.

  1. Start transaction.
  2. Search for the first key in the UUID tree: search_slot(uuid_root, min_key).
  3. Delete items in batches (walk forward, delete, repeat until tree empty).
  4. Delete the ROOT_ITEM for tree ID 9 from the root tree.
  5. Free all tree blocks that belonged to the UUID tree (decrement refs).
  6. Set uuid_tree_generation = 0 in the superblock (tells the kernel to rebuild the UUID tree on next mount).
  7. Commit transaction.

clear-ino-cache

Remove leftover inode cache items (from the deprecated v1 inode cache).

  1. Start transaction.
  2. For each FS tree (tree IDs 5, 256+): search for INODE_ITEM with objectid = BTRFS_FREE_INO_OBJECTID (-12). Delete the inode item and all associated EXTENT_DATA items.
  3. Free any data extents referenced by the deleted extent data items.
  4. Commit transaction.

clear-space-cache

Two modes: v1 (free space inode cache) and v2 (free space tree).

v1: Similar to clear-ino-cache — delete free space cache inodes (objectid = BTRFS_FREE_SPACE_OBJECTID = -11) from each block group.

v2: Delete the entire free space tree (tree ID 10) like clear-uuid-tree. Clear the FREE_SPACE_TREE_VALID compat_ro flag so the kernel rebuilds it on next mount.

fix-device-size

Correct device and superblock size fields when they’re inconsistent.

  1. Start transaction.
  2. Walk the device tree to find all DEV_EXTENT items for each device.
  3. Sum the extent lengths to get the true bytes_used per device.
  4. Update each DEV_ITEM’s total_bytes and bytes_used.
  5. Update the superblock’s embedded dev_item and total_bytes.
  6. Commit transaction.

fix-data-checksum

Verify and repair data checksums using mirror redundancy.

  1. Start transaction.
  2. Walk the csum tree (EXTENT_CSUM items).
  3. For each checksummed range, read data from each available mirror.
  4. Verify each mirror’s data against the stored checksum.
  5. If a checksum mismatch is found and a good mirror exists: optionally update the csum item to match the good mirror’s data (or rewrite the data from the good mirror).
  6. Commit transaction.

Requires: extent tree walking for backref resolution (to report which files are affected), multi-device I/O for reading mirrors.

chunk-recover

Rebuild the chunk tree by scanning device surfaces for tree blocks.

  1. Scan all devices for valid tree block headers (check magic, csum).
  2. From found tree blocks, reconstruct chunk items by cross-referencing block group items and device extents.
  3. Rebuild the chunk tree with the recovered mappings.
  4. Commit.

This is the most complex rescue operation and requires extensive device scanning infrastructure beyond basic tree operations.

mkfs.btrfs: filesystem creation process

This document describes how mkfs.btrfs creates a new btrfs filesystem, covering both the empty filesystem case (make_btrfs) and the directory population case (make_btrfs_with_rootdir).

Overview

mkfs.btrfs creates a filesystem by constructing B-tree nodes as raw byte buffers and writing them directly to a block device or image file with pwrite. No kernel ioctls or mounting are involved. The process produces a valid, mountable btrfs filesystem.

The implementation spans several modules:

  • mkfs/src/mkfs.rs – orchestration: make_btrfs and make_btrfs_with_rootdir
  • mkfs/src/layout.rs – chunk layout computation and block address assignment
  • mkfs/src/tree.rs – LeafBuilder and NodeBuilder for individual blocks
  • mkfs/src/treebuilder.rs – TreeBuilder for multi-leaf trees
  • mkfs/src/items.rs – serializers for all on-disk item types
  • mkfs/src/rootdir.rs – directory walking, data writing, compression
  • mkfs/src/write.rs – checksum computation and pwrite I/O

Part 1: empty filesystem creation (make_btrfs)

Step 1: validation

Before any I/O, the configuration is validated:

  • sectorsize must be a power of 2 and >= 4096.
  • nodesize must be a power of 2, >= sectorsize, and <= 65536.
  • If the mixed-bg incompat feature is set, nodesize must equal sectorsize.

Step 2: chunk layout computation

ChunkLayout::new computes the physical placement of three block groups on disk:

System block group

  • Logical offset: 1 MiB (SYSTEM_GROUP_OFFSET).
  • Size: 4 MiB (SYSTEM_GROUP_SIZE).
  • Physical offset: same as logical (system chunk has identity mapping on device 1).
  • Profile: always SINGLE (one stripe on device 1).
  • Contains: the chunk tree block.

Metadata block group

  • Logical offset: 5 MiB (CHUNK_START = system offset + system size).
  • Size: clamp(total_bytes / 10, 32 MiB, 256 MiB), rounded down to 64 KiB (STRIPE_LEN).
  • Profile: DUP on single device (two physical stripes on device 1, sequential after the system group) or RAID1 on multi-device (one stripe per device at CHUNK_START).
  • Contains: all non-chunk tree blocks (root, extent, dev, FS, csum, free-space, data-reloc, and optionally block-group tree).

Data block group

  • Logical offset: metadata logical + metadata size.
  • Size: clamp(total_bytes / 10, 64 MiB, 1 GiB), rounded down to STRIPE_LEN.
  • Profile: SINGLE (one stripe on device 1, after the last metadata stripe).
  • Contains: file data (empty for a freshly created filesystem).

The layout validates that all stripes fit on their respective devices. If they do not, ChunkLayout::new returns None and mkfs reports “device too small”.

The minimum device size is approximately 133 MiB: 5 MiB (system) + 64 MiB (2 x 32 MiB metadata DUP) + 64 MiB (data).
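The two clamp-and-round computations can be written out directly (the constant names are mine; this is a sketch of the arithmetic, not the ChunkLayout code itself):

```rust
const MIB: u64 = 1 << 20;
const STRIPE_LEN: u64 = 64 * 1024; // 64 KiB

// Metadata chunk: clamp(total / 10, 32 MiB, 256 MiB), rounded down
// to the stripe length.
fn metadata_chunk_size(total_bytes: u64) -> u64 {
    (total_bytes / 10).clamp(32 * MIB, 256 * MIB) / STRIPE_LEN * STRIPE_LEN
}

// Data chunk: clamp(total / 10, 64 MiB, 1 GiB), same rounding.
fn data_chunk_size(total_bytes: u64) -> u64 {
    (total_bytes / 10).clamp(64 * MIB, 1024 * MIB) / STRIPE_LEN * STRIPE_LEN
}
```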

Step 3: block address assignment

BlockLayout assigns a logical address to each tree block:

  • Chunk tree: at SYSTEM_GROUP_OFFSET (1 MiB), in the system chunk.
  • Root, Extent, Dev, FS, Csum, FreeSpace, DataReloc trees: sequential in the metadata chunk starting at meta_logical, spaced by nodesize.
  • Block-group tree (if enabled): the 8th block in the metadata chunk.

For example, with nodesize = 16384 and meta_logical = 5 MiB:

Tree        Logical address
Chunk       0x100000 (1 MiB)
Root        0x500000 (5 MiB)
Extent      0x504000
Dev         0x508000
FS          0x50C000
Csum        0x510000
FreeSpace   0x514000
DataReloc   0x518000
BlockGroup  0x51C000 (optional)

Step 4: tree block construction

Each tree is built as a single leaf node using LeafBuilder. Items must be pushed in strictly ascending key order. The builder handles offset bookkeeping: item descriptors grow forward from byte 101 (after the header), item data grows backward from the end of the block.

Tree block format

Bytes 0-31:    checksum (32 bytes, computed last)
Bytes 32-47:   fsid (16 bytes)
Bytes 48-55:   bytenr (logical address, 8 bytes LE)
Bytes 56-63:   flags (8 bytes LE)
Bytes 64-79:   chunk_tree_uuid (16 bytes)
Bytes 80-87:   generation (8 bytes LE)
Bytes 88-95:   owner tree objectid (8 bytes LE)
Bytes 96-99:   nritems (4 bytes LE)
Byte 100:      level (0 for leaf, >0 for internal node)

After the 101-byte header, item descriptors occupy 25 bytes each:

Bytes 0-16:    key (objectid:8 + type:1 + offset:8)
Bytes 17-20:   data_offset (relative to end of header, 4 bytes LE)
Bytes 21-24:   data_size (4 bytes LE)

Item data payloads fill from the end of the block backward. The space between the last descriptor and the first data payload is unused.
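The bookkeeping above can be sketched as a tiny space accountant (names are mine; the offsets returned are absolute within the block, whereas the on-disk data_offset field is relative to the end of the header):

```rust
const HEADER_SIZE: usize = 101;
const ITEM_SIZE: usize = 25;

// Minimal leaf-space bookkeeping: descriptors grow forward from the
// header, item data grows backward from the end of the block.
struct LeafSpace {
    nritems: usize,
    data_end: usize, // offset where the lowest data payload starts
}

impl LeafSpace {
    fn new(nodesize: usize) -> Self {
        LeafSpace { nritems: 0, data_end: nodesize }
    }

    fn free_space(&self) -> usize {
        self.data_end - (HEADER_SIZE + self.nritems * ITEM_SIZE)
    }

    // Account for one more item; on success, returns the new item's
    // (descriptor_offset, data_offset) within the block.
    fn push(&mut self, data_size: usize) -> Option<(usize, usize)> {
        if self.free_space() < ITEM_SIZE + data_size {
            return None; // would not fit: caller must start a new leaf
        }
        let descriptor_offset = HEADER_SIZE + self.nritems * ITEM_SIZE;
        self.data_end -= data_size;
        self.nritems += 1;
        Some((descriptor_offset, self.data_end))
    }
}
```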

Root tree contents

The root tree contains a ROOT_ITEM (key type 132) for each tree that needs one. The root tree itself is excluded (it cannot reference itself), and the chunk tree is located through the superblock’s chunk_root pointer rather than the root tree — although in practice a ROOT_ITEM is still written for the chunk tree, via the ROOT_ITEM_TREES list.

Trees receiving a ROOT_ITEM: Extent, Dev, FS, Csum, FreeSpace, DataReloc, and optionally BlockGroup. Each ROOT_ITEM is 439 bytes and contains:

  • An embedded btrfs_inode_item (160 bytes) for the root directory.
  • Tree-specific fields: generation, root_dirid, bytenr (pointing to the tree’s block), byte_limit, bytes_used, refs, level.

The FS tree’s ROOT_ITEM gets additional initialization:

  • A deterministic UUID (derived by XOR-flipping the filesystem UUID).
  • BTRFS_INODE_ROOT_ITEM_INIT flag set in the embedded inode.
  • inode.size = 3, inode.nbytes = nodesize.
  • ctime and otime timestamps set to the creation time.

Extent tree contents

The extent tree contains one METADATA_ITEM (or EXTENT_ITEM if skinny metadata is disabled) for each tree block, plus BLOCK_GROUP_ITEM entries for each block group (unless the block-group tree is enabled, in which case block group items go there instead).

Each metadata extent item consists of 24 bytes (btrfs_extent_item: refs, generation, flags) plus a 9-byte inline TREE_BLOCK_REF (type byte + root objectid). With skinny metadata, the key is (bytenr, METADATA_ITEM, level). Without skinny metadata, the key is (bytenr, EXTENT_ITEM, nodesize) and an additional 18-byte btrfs_tree_block_info is included.

Block group items (24 bytes each) are keyed as (logical_addr, BLOCK_GROUP_ITEM, chunk_size) and contain the bytes used, chunk objectid, and profile flags.

All items are collected, sorted by key, then pushed to the leaf.

Chunk tree contents

The chunk tree contains:

  1. DEV_ITEM entries for each device, keyed as (DEV_ITEMS_OBJECTID, DEV_ITEM, devid). Each contains the device’s total bytes, bytes used, sector size, and UUIDs.

  2. CHUNK_ITEM entries for each block group:

    • System chunk: uses sectorsize for io_align/io_width (bootstrap convention). One stripe on device 1.
    • Metadata chunk: uses STRIPE_LEN (64 KiB) for io_align/io_width. Two stripes for DUP, one per device for RAID1.
    • Data chunk: uses STRIPE_LEN for io_align/io_width. One stripe for SINGLE.

Dev tree contents

The dev tree contains:

  1. PERSISTENT_ITEM (DEV_STATS) for each device – all five counters zeroed (40 bytes).
  2. DEV_EXTENT items for each physical allocation:
    • System chunk: device 1 at SYSTEM_GROUP_OFFSET.
    • Metadata stripes: one or two entries per device.
    • Data stripes: one entry per device.

Items are sorted by key (devid, DEV_EXTENT, physical_offset).

FS tree contents

Contains two items for the root directory inode (objectid 256):

  1. INODE_ITEM: directory mode 040755, nlink=1, nbytes=nodesize, generation=1, timestamps set to creation time.
  2. INODE_REF: index=0, name=.., parent_ino=256 (self-referencing for the root directory).

Csum tree

Empty leaf (no items). Populated later if files are written.

Free-space tree

If the free-space-tree feature is enabled, contains FREE_SPACE_INFO and FREE_SPACE_EXTENT items for each block group. Each block group gets:

  • One FREE_SPACE_INFO item with extent_count=1.
  • One FREE_SPACE_EXTENT item covering the unused portion of the block group (from used_bytes to group_size).

If the free-space-tree feature is disabled, this is an empty leaf.

Data-reloc tree

Same structure as the FS tree: root directory inode (objectid 256) with INODE_ITEM and INODE_REF.

Block-group tree (optional)

If the block-group-tree compat_ro feature is enabled, block group items are placed here instead of in the extent tree. Contains three BLOCK_GROUP_ITEM entries (system, metadata, data).

Step 5: checksum computation

After each tree block is fully constructed, btrfs_disk::util::csum_tree_block computes the checksum of bytes CSUM_SIZE..nodesize and writes the result into the first bytes of the block:

  • CRC32C: 4 bytes (standard CRC32C via crc32c::crc32c).
  • xxHash64: 8 bytes.
  • SHA-256: 32 bytes.
  • BLAKE2b-256: 32 bytes.

Remaining bytes in the 32-byte checksum field stay zero.

Step 6: writing to disk

Tree blocks

Each tree block is written to its physical location(s) using pwrite_all. The logical-to-physical mapping is provided by ChunkLayout::logical_to_physical:

  • System chunk blocks: one write at the logical address (identity mapping) on device 1.
  • Metadata chunk blocks: one write per stripe. For DUP: two writes on device 1 at different offsets. For RAID1: one write per device.
  • Data chunk blocks: one write per stripe (typically one for SINGLE).

Superblocks

The superblock is constructed with all necessary fields:

  • magic: _BHRfS_M
  • root: logical address of the root tree block
  • chunk_root: logical address of the chunk tree block
  • total_bytes: sum across all devices
  • bytes_used: system used + metadata used (no data used for empty filesystem)
  • sectorsize, nodesize, leafsize (= nodesize), stripesize (= sectorsize)
  • num_devices: device count
  • incompat_flags, compat_ro_flags: from configuration
  • csum_type: checksum algorithm
  • cache_generation: 0 if free-space-tree enabled, u64::MAX otherwise
  • sys_chunk_array: embedded copy of the system chunk (disk_key + chunk_item bytes), enabling the kernel to bootstrap chunk mapping from the superblock alone

The sys_chunk_array is the bootstrap mechanism: it contains a serialized disk key followed by the system chunk item data (including stripe info), stored in a fixed 2048-byte buffer within the superblock. The kernel reads this array first to locate the chunk tree block, then reads the chunk tree to find all other chunks.

Each device gets its own superblock with device-specific fields (devid, dev_uuid, bytes_used for that device). The superblock is written to all valid mirror locations (up to 3):

  • Mirror 0: byte offset 65536 (64 KiB) – always written.
  • Mirror 1: byte offset 67108864 (64 MiB) – written if device is large enough.
  • Mirror 2: byte offset 274877906944 (256 GiB) – written if device is large enough.

After all writes, fsync is called on all device files.

Part 2: rootdir population (make_btrfs_with_rootdir)

The --rootdir flag populates the new filesystem from a source directory on the host. This is significantly more complex than the empty filesystem case because:

  1. The FS tree may need multiple leaf blocks (and internal nodes).
  2. File data must be written to the data chunk.
  3. The extent tree must reference both metadata blocks and data extents.
  4. The csum tree must contain checksums for all data.
  5. The extent tree must contain entries for its own blocks, creating a circular dependency.

Step 1: directory walk (walk_directory)

The rootdir::walk_directory function performs a depth-first traversal of the source directory, building all FS tree items and identifying files that need data extents.

Inode assignment

Inode numbers are assigned sequentially starting at 257 (inode 256 is the root directory, handled separately). The root directory (objectid 256) gets its INODE_ITEM and INODE_REF added during the merge phase.

For files with nlink > 1, the function tracks (dev, ino) pairs from the host filesystem in a HashMap. When a subsequent directory entry refers to the same host inode:

  • No new btrfs inode number is assigned; the existing one is reused.
  • An INODE_REF is added (additional reference from the new parent).
  • No new INODE_ITEM is created.
  • The nlink counter for that btrfs inode is incremented.

After all entries are processed, fixup_inode_nlink patches the nlink field in the INODE_ITEM for all hardlinked inodes.

Per-entry processing

For each directory entry (file, directory, symlink, special file):

  1. DIR_ITEM in the parent directory, keyed by name hash (crc32c(0xFFFFFFFE, name)).
  2. DIR_INDEX in the parent directory, keyed by sequential index (starting at 2 for each directory).
  3. INODE_REF for the new inode, pointing to the parent.
  4. INODE_ITEM with metadata copied from the host filesystem (uid, gid, mode, timestamps, rdev for special files).
  5. XATTR_ITEM entries for each extended attribute on the host file (read via llistxattr/lgetxattr).

Type-specific items:

  • Directories: Push children onto the DFS stack (reversed for correct order). Initialize the dir_index counter for the new directory.
  • Symlinks: Create an inline FILE_EXTENT_ITEM containing the link target (never compressed).
  • Regular files with size > 0:
    • If size <= max_inline_data_size: read the file, optionally compress, create an inline FILE_EXTENT_ITEM.
    • If size > max_inline_data_size: defer to the data writing phase. Record a FileAllocation with the host path, btrfs inode, size, and NODATASUM flag.
  • Special files (FIFO, socket, char/block device): INODE_ITEM only, no extent.

Inline extent threshold

The maximum inline data size is min(sectorsize - 1, nodesize - 147). With the defaults (sectorsize=4096, nodesize=16384), this is 4095 bytes. Files at or below this threshold are stored directly in the tree leaf.
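As a worked check of that formula — the 147-byte overhead being the 101-byte header, the 25-byte item descriptor, and the 21-byte inline FILE_EXTENT_ITEM header:

```rust
// Largest file payload that can be stored inline in a leaf: limited
// both by one sector (minus 1) and by the leaf overhead of
// 101 (header) + 25 (item descriptor) + 21 (inline extent header).
fn max_inline_data_size(sectorsize: u32, nodesize: u32) -> u32 {
    (sectorsize - 1).min(nodesize - 147)
}
```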

Inode flags

The --inode-flags argument allows setting NODATACOW and NODATASUM flags per path. NODATACOW implies NODATASUM for regular files. These flags are set in the INODE_ITEM and affect whether checksums are generated during the data writing phase.

Directory size fixup

After the walk, fixup_inode_size patches each non-root directory’s INODE_ITEM size field to match the sum of name_len * 2 from its DIR_INDEX entries (the btrfs convention for directory sizes).

Inline nbytes fixup

fixup_inline_nbytes patches the nbytes field of INODE_ITEM entries for files with inline extents. For inline extents, nbytes equals the inline data size (the actual stored bytes, which may be compressed).

Output

walk_directory returns a RootdirPlan containing:

  • fs_items: sorted list of all FS tree items (excluding root dir inode).
  • file_extents: list of FileAllocation entries for files needing data extents.
  • data_bytes_needed: total aligned data bytes needed in the data chunk.
  • root_dir_nlink, root_dir_size: root directory metadata.

Step 2: data writing (write_file_data)

For each file in plan.file_extents, the function reads the host file in 1 MiB chunks (MAX_EXTENT_SIZE) and writes each chunk to the data block group:

Per-extent processing

  1. Read up to 1 MiB of raw data from the host file.
  2. Optionally try compression (zlib or zstd). If the compressed output is smaller than the input, use it; otherwise store uncompressed.
  3. Pad the (possibly compressed) data to sectorsize alignment.
  4. Compute the logical disk address: data_logical + current_offset.
  5. Write the padded data to all physical locations for this logical address.
  6. Compute per-sector checksums (skipped for NODATASUM files):
    • For each sector in the padded data, compute the checksum using the configured algorithm.
    • Pack all checksums into a single EXTENT_CSUM item.
  7. Create a FILE_EXTENT_ITEM (regular type) in the FS tree items: disk_bytenr, disk_num_bytes (aligned compressed size), offset=0, num_bytes (logical file extent size), ram_bytes (uncompressed size), compression type.
  8. Create an EXTENT_ITEM with inline EXTENT_DATA_REF in the extent tree items: refs=1, generation=1, flags=DATA.

After processing all files, nbytes_updates records the total disk-allocated bytes per inode, which are patched into the corresponding INODE_ITEM entries via apply_nbytes_updates.

Step 3: multi-leaf tree building (TreeBuilder)

When a tree has more items than fit in a single leaf, TreeBuilder splits them across multiple leaves and creates internal nodes to form a valid B-tree.

Leaf packing

Items are packed into leaves sequentially:

  1. Start a new leaf.
  2. For each item, check if the leaf has space for the item descriptor (25 bytes) plus the item data. If not, finalize the current leaf and start a new one.
  3. Record the first key of each leaf for parent node entries.

Internal node construction

If more than one leaf is produced:

  1. Create internal nodes at level 1, each pointing to up to (nodesize - 101) / 33 child blocks (33 bytes per key-pointer entry: 17 key + 8 blockptr + 8 generation).
  2. If more than one level-1 node is needed, create level-2 nodes, and so on.
  3. Repeat until a single root node remains.

Node balancing: if the last node at a level would have fewer than 1/4 of the maximum entries, the previous node is split more evenly to avoid a tiny remainder.
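The fanout arithmetic and the resulting tree height can be checked directly (this ignores the balancing tweak above, which changes how entries are distributed but not how many levels are needed):

```rust
// Key pointers are 33 bytes each (17-byte key + 8-byte blockptr +
// 8-byte generation); the header takes 101 bytes.
fn node_fanout(nodesize: usize) -> usize {
    (nodesize - 101) / 33
}

// Levels needed (counting the leaf level) for `leaves` leaf blocks,
// grouping upward until a single root block remains.
fn tree_levels(nodesize: usize, leaves: usize) -> usize {
    let fanout = node_fanout(nodesize);
    let mut blocks = leaves.max(1);
    let mut levels = 1;
    while blocks > 1 {
        blocks = (blocks + fanout - 1) / fanout; // ceiling division
        levels += 1;
    }
    levels
}
```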

Placeholder addresses

All blocks are initially built with bytenr = 0 in the header. After address assignment, TreeBuilder::assign_addresses patches:

  • The bytenr field in each block’s header (offset 48).
  • The blockptr fields in internal nodes (for each key-pointer entry at offset 17 relative to the entry start).

Step 4: the convergence loop

This is the solution to the bootstrapping problem.

The bootstrapping problem

The extent tree must contain a METADATA_ITEM (or EXTENT_ITEM) for every tree block in the filesystem, including the extent tree’s own blocks. But the number of extent tree blocks depends on how many items it contains, which includes its own self-referential entries. Adding more extent tree blocks requires more extent items, which might require even more blocks.

Solution: iterate until stable

The converge_extent_tree_block_count function iteratively computes the extent tree block count:

  1. Start with extent_tree_block_count = 1.
  2. Construct a trial set of all extent items:
    • One METADATA_ITEM per tree block (chunk tree, root tree, extent_tree_block_count extent tree blocks, dev tree, FS tree blocks, csum tree blocks, free-space tree block, data-reloc tree blocks, block-group tree block if applicable).
    • All data extent items from the data writing phase.
    • Block group items (if not using block-group tree).
  3. Sort all trial items by key.
  4. Build the trial extent tree using TreeBuilder::build to determine how many blocks it needs.
  5. If trial.blocks.len() == extent_tree_block_count, the count has stabilized; break.
  6. Otherwise, set extent_tree_block_count = trial.blocks.len() and repeat.

In practice, this converges in 1-3 iterations. The count is monotonically non-decreasing (adding self-referential items can only increase the block count), so convergence is guaranteed.
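The fixed point can be illustrated with a toy model in which every item has the same size; real items vary in size and the tree may grow internal nodes, but the iteration pattern is the same:

```rust
// Toy convergence loop: the extent tree needs one item per tree block
// in the filesystem, including its own blocks. Iterate the block
// count until it stabilizes.
fn converge_extent_tree_blocks(other_items: usize, items_per_leaf: usize) -> usize {
    assert!(items_per_leaf > 1);
    let mut count = 1;
    loop {
        let total_items = other_items + count; // + self-referential entries
        let needed = (total_items + items_per_leaf - 1) / items_per_leaf;
        if needed == count {
            return count; // stabilized
        }
        count = needed; // monotonically non-decreasing
    }
}
```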

Step 5: address assignment

Once the extent tree block count is known, BlockAllocator assigns real logical addresses in a fixed order:

  1. Chunk tree: allocate from the system chunk (alloc_system).
  2. Root tree: allocate from the metadata chunk (alloc_metadata).
  3. Extent tree blocks (count from convergence loop): sequential metadata allocations.
  4. Dev tree: one metadata allocation.
  5. FS tree blocks: sequential metadata allocations.
  6. Csum tree blocks: sequential metadata allocations.
  7. Free-space tree: one metadata allocation (if enabled).
  8. Data-reloc tree blocks: sequential metadata allocations.
  9. Block-group tree: one metadata allocation (if enabled).

BlockAllocator maintains separate bumping pointers for the system chunk (SYSTEM_GROUP_OFFSET to SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE) and the metadata chunk (meta_logical to meta_logical + meta_size), returning an error if either runs out of space.

Step 6: building the real extent tree

With real addresses known, the actual extent tree is built:

  1. Create METADATA_ITEM entries for every tree block using their real addresses.
  2. Include all data extent items from the data writing phase.
  3. Include block group items (in-extent-tree or separate block-group tree).
  4. Sort all items by key.
  5. Build with TreeBuilder::build.
  6. Assert that the block count matches the converged count (if it does not, the convergence loop has a bug).
  7. Assign addresses to extent tree blocks from the pre-allocated address list.

Step 7: building remaining trees

With all addresses finalized:

  1. FS tree: TreeBuilder::assign_addresses patches bytenr fields using pre-allocated addresses.
  2. Csum tree: same.
  3. Data-reloc tree: same.
  4. Chunk tree: rebuilt as a single leaf with final device bytes_used values.
  5. Dev tree: rebuilt as a single leaf with final device extent information.
  6. Free-space tree: rebuilt with final used-byte counts for each block group.
  7. Block-group tree: rebuilt with final used-byte counts.
  8. Root tree: rebuilt with final tree root addresses and levels for all trees.

The root tree is always a single leaf because the number of ROOT_ITEM entries is small (6-8 trees). It is built last because it needs the root address and level of every other tree.

Step 8: writing to disk

All tree blocks are written in order:

  1. Single-leaf trees (chunk, root, dev): compute checksum, write to all physical locations.
  2. Multi-block trees (extent, FS, csum, data-reloc): for each block, compute checksum, write to all physical locations.
  3. Optional single-leaf trees (free-space, block-group): compute checksum, write.

The write_rootdir_trees helper manages this process.

Step 9: superblock

The superblock is built with:

  • root: root tree address (from step 5).
  • chunk_root: chunk tree address (from step 5).
  • bytes_used: system_used + metadata_used + data_used.

Written to all mirror locations on all devices.

Step 10: shrink (optional)

If --shrink is specified and there is a single device:

  1. Compute the physical end of the last chunk (considering all metadata and data stripes).
  2. Round up to sectorsize alignment.
  3. Create a new config with total_bytes set to this shrunk size.
  4. Rebuild the chunk tree and superblock with the reduced total_bytes (so DEV_ITEM.total_bytes and superblock.total_bytes reflect the actual image size).
  5. After all writes, truncate the image file to the shrunk size with set_len.

This produces a minimal image file suitable for distribution or flashing.

Item serialization (items.rs)

All item serializers produce Vec<u8> suitable for LeafBuilder::push. They use the bytes::BufMut trait for little-endian encoding and derive field positions from std::mem::offset_of! and std::mem::size_of on the bindgen structs.

Key serializers and their sizes:

| Function | Item type | Approximate size |
|---|---|---|
| root_item | ROOT_ITEM | 439 bytes |
| extent_item | EXTENT_ITEM/METADATA_ITEM | 33 bytes (skinny) or 51 bytes |
| block_group_item | BLOCK_GROUP_ITEM | 24 bytes |
| dev_item | DEV_ITEM | 98 bytes |
| chunk_item | CHUNK_ITEM | 48 + 32*num_stripes bytes |
| dev_extent | DEV_EXTENT | 48 bytes |
| dev_stats_zeroed | PERSISTENT_ITEM | 40 bytes |
| free_space_info | FREE_SPACE_INFO | 8 bytes |
| inode_item_dir | INODE_ITEM | 160 bytes |
| inode_item | INODE_ITEM | 160 bytes |
| inode_ref | INODE_REF | 10 + name_len bytes |
| dir_item | DIR_ITEM/DIR_INDEX | 30 + name_len bytes |
| xattr_item | XATTR_ITEM | 30 + name_len + value_len bytes |
| file_extent_inline | FILE_EXTENT_ITEM | 21 + data_len bytes |
| file_extent_reg | FILE_EXTENT_ITEM | 53 bytes |
| data_extent_item | EXTENT_ITEM | 53 bytes |

Checksum computation (write.rs)

ChecksumType supports four algorithms, each computing checksums of the data portion (bytes 32..end) of tree blocks and superblocks:

| Algorithm | On-disk type value | Output size | Implementation |
|---|---|---|---|
| CRC32C | 0 | 4 bytes | crc32c crate |
| xxHash64 | 1 | 8 bytes | xxhash-rust crate |
| SHA-256 | 2 | 32 bytes | sha2 crate |
| BLAKE2b-256 | 3 | 32 bytes | blake2 crate |

csum_tree_block writes the computed hash into the first N bytes of the block’s checksum field (32 bytes total), zero-filling the remaining bytes.
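The digest-placement step can be sketched as follows. This is a minimal illustration of writing an N-byte digest into the fixed 32-byte checksum field and zero-filling the tail; the names are illustrative, not the actual btrfsutils API:

```rust
// Sketch: place an N-byte digest into the 32-byte checksum field at the
// start of a block, zero-filling the unused remainder (as described for
// csum_tree_block). Illustrative only.
const CSUM_FIELD_SIZE: usize = 32;

fn write_csum(block: &mut [u8], digest: &[u8]) {
    assert!(digest.len() <= CSUM_FIELD_SIZE);
    block[..digest.len()].copy_from_slice(digest);
    block[digest.len()..CSUM_FIELD_SIZE].fill(0); // unused tail must be zero
}

fn main() {
    let mut block = [0xAAu8; 64];
    write_csum(&mut block, &[1, 2, 3, 4]); // e.g. a 4-byte CRC32C digest
    assert_eq!(&block[..4], &[1, 2, 3, 4]);
    assert!(block[4..32].iter().all(|&b| b == 0));
    assert_eq!(block[32], 0xAA); // data portion untouched
}
```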

Data block checksums (in the csum tree) use the same algorithm but are computed per-sector.

The bootstrapping problem in detail

The bootstrapping problem is fundamental to mkfs and worth understanding in depth.

The circular dependency

Consider a minimal filesystem with 8 tree blocks. The extent tree must contain 8 METADATA_ITEM entries (one for each block, including itself). But what if those 8 entries do not fit in a single leaf?

With skinny metadata (METADATA_ITEM, 33-byte payload), each item uses 25 (descriptor) + 33 (data) = 58 bytes. A 16 KiB leaf has 16384 - 101 = 16283 usable bytes, fitting 16283 / 58 = 280 items (rounded down). So for an empty filesystem, the extent tree easily fits in one block.
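The capacity arithmetic above can be checked with a few lines of Rust (the constants 101, 25, and 33 come from the text; the helper name is illustrative):

```rust
// Sanity-check the leaf-capacity arithmetic: how many skinny METADATA_ITEM
// entries fit in one leaf of the given nodesize.
const HEADER_SIZE: usize = 101; // tree block header
const ITEM_DESC_SIZE: usize = 25; // per-item descriptor
const SKINNY_PAYLOAD: usize = 33; // METADATA_ITEM payload

fn metadata_items_per_leaf(nodesize: usize) -> usize {
    let usable = nodesize - HEADER_SIZE;
    usable / (ITEM_DESC_SIZE + SKINNY_PAYLOAD)
}

fn main() {
    // floor((16384 - 101) / 58) = 280 items per 16 KiB leaf.
    assert_eq!(metadata_items_per_leaf(16 * 1024), 280);
}
```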

But with --rootdir populating thousands of files, the FS tree, csum tree, and extent tree can each grow to many blocks. If the FS tree has 100 blocks and there are 500 data extents, the extent tree might need several blocks itself, and each additional extent tree block requires another METADATA_ITEM entry in the extent tree.

Why pre-computing works

The solution works because:

  1. Addresses are independent of content. Tree block addresses are assigned by sequential bump allocation, so the address of each block depends only on how many blocks precede it, not on the content of any block.

  2. Block count is monotonically non-decreasing. Adding self-referential entries can only increase (or maintain) the block count, never decrease it.

  3. The system is finite. There is a maximum number of blocks that can fit in the metadata chunk, bounding the iteration.

  4. Content depends only on addresses and counts. Once addresses are assigned, every tree block’s content is fully determined. There are no further dependencies.

The convergence loop exploits properties (1) and (2): it guesses a block count, computes trial content, checks if the trial needs the same number of blocks, and if not, tries again with the new count. Property (2) guarantees this converges (the count can only go up until it stabilizes).
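As a rough sketch (not the actual btrfsutils code), the loop is a fixed-point iteration. Here `blocks_needed` stands in for the real trial build: given a guessed extent-tree block count, it returns how many blocks the extent tree would actually need once it contains one METADATA_ITEM per tree block:

```rust
// Minimal sketch of the block-count convergence loop. `blocks_needed` is a
// stand-in for the trial TreeBuilder run; `max_blocks` bounds the iteration
// by the metadata chunk capacity (property 3).
fn converge(mut guess: u64, blocks_needed: impl Fn(u64) -> u64, max_blocks: u64) -> Option<u64> {
    loop {
        let actual = blocks_needed(guess);
        if actual == guess {
            return Some(guess); // fixed point: trial needs exactly `guess` blocks
        }
        if actual > max_blocks {
            return None; // metadata chunk exhausted
        }
        // Property (2): the count is monotonically non-decreasing, so this terminates.
        guess = actual;
    }
}

fn main() {
    // Toy model: 100 fixed tree blocks plus the extent tree's own blocks,
    // at 280 items per leaf (ceiling division).
    let needed = |extent_blocks: u64| (100 + extent_blocks + 279) / 280;
    let converged = converge(1, needed, 1000).unwrap();
    assert_eq!(needed(converged), converged);
}
```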

Implementation detail

The trial in each iteration uses placeholder addresses (sequential from meta_logical), not the final addresses. This is acceptable because the TreeBuilder only needs the item count and sizes to determine how many blocks are needed – the actual address values do not affect block count. After convergence, the real extent tree is built with the actual addresses from BlockAllocator.

Default features

The default incompat feature flags are:

  • MIXED_BACKREF – mixed backreference format
  • BIG_METADATA – larger metadata blocks
  • EXTENDED_IREF – extended inode references (INODE_EXTREF)
  • SKINNY_METADATA – skinny metadata extent refs (METADATA_ITEM key type)
  • NO_HOLES – no explicit hole extent items

The default compat_ro feature flags are:

  • FREE_SPACE_TREE – free-space tree (v2 free space tracking)
  • FREE_SPACE_TREE_VALID – marks the free-space tree as valid
  • BLOCK_GROUP_TREE – separate tree for block group items

Features can be enabled or disabled with -O feature or -O ^feature.

Multi-device support

For multi-device filesystems, chunk layout computation distributes stripes across devices:

  • RAID1 metadata: one stripe per device at CHUNK_START.
  • SINGLE data: one stripe on device 1.

Each device gets its own superblock with device-specific devid, dev_uuid, and bytes_used. The chunk tree contains a DEV_ITEM per device, and the dev tree contains DEV_EXTENT entries mapping physical allocations to chunks.

The logical_to_physical function determines write destinations: system chunk blocks go to device 1 only, metadata blocks go to all metadata stripe devices, data blocks go to all data stripe devices.

Limitations

Not yet implemented:

  • --rootdir with LZO compression (rejected at argument validation).
  • RAID0/5/6/10 profiles.
  • Zoned device support.
  • Mixed block group mode with --rootdir.

btrfs check: verification phases

This document describes the seven phases of btrfs check, as implemented in the cli/src/check/ module. The checker operates in read-only mode on an unmounted filesystem, reading the raw on-disk image through btrfs-disk’s BlockReader without requiring any kernel ioctls.

Overview

The check command opens the filesystem image and bootstraps the chunk tree (superblock -> sys_chunk_array -> chunk tree -> root tree), then runs seven sequential verification phases:

  1. Superblock mirror validation
  2. Tree structure checks (all trees)
  3. Extent tree cross-checks (reference counting and ownership)
  4. Chunk / block group / device extent cross-checks
  5. FS tree inode consistency
  6. Checksum tree verification
  7. ROOT_REF / ROOT_BACKREF consistency

Each phase accumulates errors into a CheckResults struct. Errors are printed to stderr as they are found, and a summary is printed at the end. The process exits with code 1 if any errors were detected.

Orchestration (check.rs)

The main CheckCommand::run method:

  1. Rejects unsupported flags (--repair, --init-csum-tree, --init-extent-tree, --backup, --tree-root, --chunk-root, --qgroup-report, --subvol-extents).
  2. Checks mount status (skippable with --force).
  3. Validates the superblock mirror index (0-2).
  4. Opens the filesystem via reader::filesystem_open_mirror, which bootstraps chunk mapping and discovers all tree roots.
  5. Runs phases 1-7 in order.
  6. Prints summary and exits.

Statistics tracking

Throughout all phases, CheckResults accumulates byte counts that are printed in the final summary:

  • total_tree_bytes: sum of nodesize for every tree block visited in phase 2.
  • total_fs_tree_bytes: subset of the above for FS trees (objectid 5 or >= 256).
  • total_extent_tree_bytes: subset of the above for the extent tree (objectid 2).
  • btree_space_waste: for each leaf, nodesize minus actual bytes used (header + item descriptors + item data payloads).
  • data_bytes_allocated: total length of data extents from extent items.
  • data_bytes_referenced: total referenced bytes, accounting for shared extents via ExtentDataRef and SharedDataRef count fields.
  • total_csum_bytes: total bytes of checksum data in the csum tree.

Phase 1: Superblocks

Source: cli/src/check/superblock.rs

Purpose: Validate all three superblock mirror copies.

What it checks

Btrfs stores up to three copies of the superblock at fixed byte offsets on the device:

  • Mirror 0: byte offset 65536 (64 KiB)
  • Mirror 1: byte offset 67108864 (64 MiB)
  • Mirror 2: byte offset 274877906944 (256 GiB)

For each mirror (0 through SUPER_MIRROR_MAX - 1):

  1. Read 4096 bytes from the mirror offset using read_superblock_bytes_at.
  2. Validate the superblock using superblock_is_valid, which checks:
    • The magic number matches _BHRfS_M (0x4D5F53665248425F).
    • The CRC32C checksum of bytes 32..4096 matches the stored checksum in bytes 0..4.
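The magic half of that validation can be sketched in a few lines. The constant 0x4D5F53665248425F is simply the ASCII string "_BHRfS_M" read as a little-endian u64, and the magic field sits at byte offset 64 of the superblock; the function name is illustrative (checksum validation is omitted here):

```rust
// Sketch of the magic-number half of superblock validation.
const BTRFS_MAGIC: u64 = 0x4D5F_5366_5248_425F; // "_BHRfS_M" as LE u64
const MAGIC_OFFSET: usize = 64; // magic field offset within the superblock

fn magic_is_valid(sb: &[u8; 4096]) -> bool {
    let raw: [u8; 8] = sb[MAGIC_OFFSET..MAGIC_OFFSET + 8].try_into().unwrap();
    u64::from_le_bytes(raw) == BTRFS_MAGIC
}

fn main() {
    // The constant really is the string read little-endian.
    assert_eq!(u64::from_le_bytes(*b"_BHRfS_M"), BTRFS_MAGIC);
    let mut sb = [0u8; 4096];
    sb[MAGIC_OFFSET..MAGIC_OFFSET + 8].copy_from_slice(b"_BHRfS_M");
    assert!(magic_is_valid(&sb));
}
```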

If a mirror cannot be read (I/O error), this is only reported as an error for mirror 0. Mirrors 1 and 2 may legitimately be absent on small devices where the device is shorter than the mirror offset.

Generation consistency

The current implementation validates each mirror independently (magic + checksum). The C reference additionally checks that the generation fields across valid mirrors are consistent (the primary mirror should have the highest generation). This is not yet implemented.

Error variants produced

  • SuperblockInvalid { mirror, detail } – reported when:
    • A mirror has an invalid checksum or magic number.
    • Mirror 0 cannot be read at all (I/O error).

Return value

Returns the count of valid mirrors found (0-3). This value is currently not used by the caller but could be used for repair decisions in the future.

Phase 2: Tree structure

Source: cli/src/check/tree_structure.rs

Purpose: Walk every tree in the filesystem and verify per-block structural integrity. Collect a map of all tree block addresses and their owners for use in phase 3.

Trees checked

The phase checks:

  1. Root tree – directly from superblock.root.
  2. Chunk tree – directly from superblock.chunk_root.
  3. All trees discovered in the root tree – every (tree_id, (bytenr, gen)) pair from open.tree_roots. This includes the extent tree, dev tree, FS tree, csum tree, free-space tree, data-reloc tree, block-group tree (if present), and all subvolume/snapshot trees.

Each tree is walked using reader::tree_walk_tolerant, which performs a depth-first traversal through all internal nodes and leaves, calling the visitor callback for each block. The _tolerant variant collects read errors instead of aborting, allowing the checker to report all problems rather than stopping at the first.

Per-block checks

For every tree block (leaf or internal node), the following checks are performed:

CRC32C checksum verification

The first 32 bytes of each block contain the checksum. The checker computes btrfs_csum_data(&raw[32..]) (CRC32C with the standard 0xFFFFFFFF seed and final XOR) and compares it to the stored value in raw[0..4]. This check is only performed when the superblock’s csum_type is CRC32C; other checksum types emit a warning and skip verification.

Fsid match

The block header’s fsid field (16 bytes at offset 32) must match the filesystem’s effective fsid. The effective fsid is metadata_uuid if the METADATA_UUID incompat flag is set, or fsid otherwise. This distinction matters for filesystems that have had their metadata UUID changed via btrfs-tune -m.

Generation bound

The block header’s generation field must not exceed the superblock’s generation. A block with a higher generation than the superblock indicates corruption (the block was written in a transaction that was never committed, or the block has been corrupted).

Level consistency

  • Leaf blocks (items present) must have header.level == 0.
  • Internal nodes (key-pointer entries) must have header.level > 0.

A mismatch indicates structural corruption where a block’s type disagrees with its declared level.

Key ordering

Within each block, keys must be in strictly ascending order using the compound key comparison (objectid, type, offset):

  • For leaves: consecutive items items[i-1] and items[i] must satisfy key_less(prev, cur).
  • For internal nodes: consecutive key-pointers ptrs[i-1] and ptrs[i] must satisfy key_less(prev, cur).

Strictly ascending means no duplicates are allowed. The comparison function uses the raw type byte for the type field (via key_type.to_raw()), comparing the tuple (objectid, type_raw, offset) lexicographically.
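The comparison can be sketched with a derived tuple ordering — deriving Ord on a struct with fields in (objectid, type_raw, offset) order gives exactly the lexicographic comparison described (the struct and function names are illustrative):

```rust
// Sketch of the strict key-ordering check: keys compare lexicographically
// as (objectid, raw type byte, offset); equal keys are violations too.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct Key {
    objectid: u64,
    type_raw: u8,
    offset: u64,
}

// Returns indices whose key is not strictly greater than its predecessor.
fn order_violations(keys: &[Key]) -> Vec<usize> {
    (1..keys.len()).filter(|&i| keys[i] <= keys[i - 1]).collect()
}

fn main() {
    let k = |o, t, off| Key { objectid: o, type_raw: t, offset: off };
    // Ascending, then a duplicate at index 2 and a regression at index 3.
    let keys = [k(1, 1, 0), k(1, 2, 0), k(1, 2, 0), k(0, 9, 9)];
    assert_eq!(order_violations(&keys), vec![2, 3]);
}
```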

Byte attribution

Each visited block contributes nodesize bytes to the appropriate category:

  • Extent tree blocks (objectid 2) -> total_extent_tree_bytes
  • FS tree blocks (objectid 5 or >= 256) -> total_fs_tree_bytes
  • All blocks -> total_tree_bytes

For leaf blocks, space waste is computed as:

waste = nodesize - (101 + nritems * 25 + sum(item.size for each item))

where 101 is the header size and 25 is the item descriptor size.
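Expressed as a small function (constants from the formula above; the function name is illustrative):

```rust
// The leaf space-waste formula from the text, as a function.
// 101 = header size, 25 = per-item descriptor size.
fn leaf_waste(nodesize: u64, item_sizes: &[u64]) -> u64 {
    let used = 101 + item_sizes.len() as u64 * 25 + item_sizes.iter().sum::<u64>();
    nodesize - used
}

fn main() {
    // A 16 KiB leaf holding two 160-byte items:
    // waste = 16384 - (101 + 2*25 + 320) = 15913.
    assert_eq!(leaf_waste(16384, &[160, 160]), 15913);
}
```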

Output

Returns a HashMap<u64, u64> mapping each tree block’s logical address to the objectid of the tree that owns it. This map is used by phase 3 for bidirectional ownership verification.

Tree name resolution

Tree names for error messages are derived from the objectid using ObjectId formatting (e.g., objectid 1 = “ROOT_TREE”, objectid 5 = “FS_TREE”, objectid 256+ = the numeric subvolume ID). Names are leaked as &'static str since the set of tree names is small and bounded.

Error variants produced

  • TreeBlockChecksumMismatch { tree, logical } – CRC32C does not match.
  • TreeBlockBadFsid { tree, logical } – header fsid does not match the filesystem’s effective fsid.
  • TreeBlockBadBytenr { tree, logical, header_bytenr } – the header’s bytenr field does not match the logical address where the block was read. (Note: this check is performed by the block reader during parsing, not directly in this phase, but the error is reported here if it occurs.)
  • TreeBlockBadGeneration { tree, logical, block_gen, super_gen } – block generation exceeds superblock generation.
  • TreeBlockBadLevel { tree, logical, detail } – level/type mismatch (leaf with non-zero level, or node with zero level).
  • KeyOrderViolation { tree, logical, index } – key at index is not strictly greater than the key at index - 1.
  • ReadError { logical, detail } – I/O error reading a tree block.

Phase 3: Extents

Source: cli/src/check/extents.rs

Purpose: Walk the extent tree to verify reference counts, detect overlapping extents, and cross-check tree block ownership against extent tree backrefs in both directions.

How it works

The phase walks the extent tree leaf by leaf, processing items in key order. It maintains an ExtentCheckState that tracks the “current” extent being verified and accumulates statistics.

Item processing

Items are processed based on their key type:

EXTENT_ITEM / METADATA_ITEM: Start a new extent. The previous extent (if any) is flushed first. For the new extent:

  1. Record the bytenr in extent_item_addrs (for later ownership checks).
  2. Determine the extent length:
    • EXTENT_ITEM: length = key.offset.
    • METADATA_ITEM: length = 0 (skinny refs use key.offset as level, not length, so overlap detection is skipped for metadata items).
  3. Check for overlap: if bytenr < prev_end and prev_end > 0, report an overlapping extent error.
  4. Parse the ExtentItem payload to extract:
    • The declared reference count (refs).
    • Inline backrefs and their count.
    • Whether this is a data extent (via BTRFS_EXTENT_FLAG_DATA).
    • For tree block extents: collect TreeBlockBackref inline refs into extent_backref_owners[bytenr].
  5. Initialize pending state: pending_refs = declared refs, pending_counted = inline ref count.
  6. For data extents, add length to data_bytes_allocated.

TREE_BLOCK_REF: Standalone tree block backref. Increments pending_counted by 1. Records key.offset (the root objectid) in extent_backref_owners.

SHARED_BLOCK_REF / EXTENT_OWNER_REF: Standalone backrefs. Each increments pending_counted by 1.

EXTENT_DATA_REF: Standalone data backref. Parses the item to extract the count field (number of references from this particular root/objectid/offset combination). Increments pending_counted by count. Adds length * count to data_bytes_referenced.

SHARED_DATA_REF: Same as EXTENT_DATA_REF but for shared (relocated) data references.

All other key types (e.g., BLOCK_GROUP_ITEM): ignored.

Inline reference counting

The count_inline_refs function iterates over the InlineRef variants in an ExtentItem:

  • TreeBlockBackref, SharedBlockBackref, ExtentOwnerRef: count as 1 each.
  • ExtentDataBackref, SharedDataBackref: count as their embedded count field (which may be > 1 for multiply-referenced data extents).

Flushing

When a new EXTENT_ITEM/METADATA_ITEM is encountered, or at the end of the tree walk, flush_pending is called:

  1. Skip if no extent is pending (pending_bytenr == 0).
  2. For data extents where data_bytes_referenced is still 0 (only inline refs, no standalone ExtentDataRef), compute data_bytes_referenced += pending_length * pending_counted.
  3. Compare pending_refs (declared) to pending_counted (actual). If they differ, report an ExtentRefMismatch error.
  4. Reset pending_bytenr to 0.
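The declared-vs-counted bookkeeping can be sketched as follows. The struct and field names are illustrative, not the actual ExtentCheckState definition:

```rust
// Sketch of pending-extent bookkeeping: declared refs come from the
// ExtentItem payload, counted refs accumulate from inline and standalone
// backrefs, and flushing compares the two.
struct Pending {
    bytenr: u64,       // 0 means "nothing pending"
    declared_refs: u64,
    counted_refs: u64,
}

fn flush(p: &mut Pending, errors: &mut Vec<String>) {
    if p.bytenr == 0 {
        return; // nothing pending
    }
    if p.declared_refs != p.counted_refs {
        errors.push(format!(
            "extent {}: refs mismatch, declared {} counted {}",
            p.bytenr, p.declared_refs, p.counted_refs
        ));
    }
    p.bytenr = 0; // reset for the next extent
}

fn main() {
    let mut errors = Vec::new();
    // Extent declares 3 refs; only 2 backrefs are found (1 inline + 1 standalone).
    let mut p = Pending { bytenr: 30408704, declared_refs: 3, counted_refs: 1 };
    p.counted_refs += 1; // e.g. a standalone TREE_BLOCK_REF item
    flush(&mut p, &mut errors);
    assert_eq!(errors.len(), 1);
    assert_eq!(p.bytenr, 0); // reset after flush
}
```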

Bidirectional ownership cross-check

After the extent tree walk completes, two cross-checks are performed using the tree_block_owners map from phase 2:

Direction 1: tree block -> extent tree. For every tree block address found during phase 2 tree walks:

  • If the address has no EXTENT_ITEM or METADATA_ITEM in the extent tree, report MissingExtentItem.
  • If the address has extent tree entries but none of the claimed owner roots match the actual owner (the tree that contained this block during phase 2 walks), report BackrefOwnerMismatch.

Direction 2: extent tree -> tree block. For every tree block address with backrefs in the extent tree:

  • For each claimed owner root, check if the actual owner (from the phase 2 map) matches. If the block was not found during phase 2 walks, or belongs to a different tree, report BackrefOrphan.

Both cross-checks sort addresses before iteration for deterministic error ordering.

Error variants produced

  • ExtentRefMismatch { bytenr, expected, found } – the declared reference count in the ExtentItem does not match the sum of inline and standalone backrefs.
  • MissingExtentItem { bytenr } – a tree block observed during phase 2 has no corresponding EXTENT_ITEM or METADATA_ITEM in the extent tree.
  • BackrefOwnerMismatch { bytenr, actual_owner, claimed_owners } – the tree block’s actual owner (from phase 2) does not appear in the extent tree’s list of backref owners for that address.
  • BackrefOrphan { bytenr, claimed_owner } – the extent tree claims a backref for a tree that does not actually contain a block at that address.
  • OverlappingExtent { bytenr, length, prev_end } – two data extents overlap in logical address space (the start of one extent is before the end of the previous).
  • ReadError { logical, detail } – I/O error reading the extent tree.

Phase 4: Chunks / block groups / device extents

Source: cli/src/check/chunks.rs

Purpose: Cross-check the chunk tree, block group items, and device extents for mutual consistency.

What it checks

This phase performs three categories of cross-checks:

Chunk <-> block group cross-check

Every chunk in the chunk tree’s ChunkTreeCache (built during filesystem open) should have a corresponding BLOCK_GROUP_ITEM in the extent tree (or block-group tree, if the BLOCK_GROUP_TREE compat_ro feature is enabled). And vice versa: every block group item should correspond to a chunk.

Block groups are collected by walking either:

  • The block-group tree if BTRFS_FEATURE_COMPAT_RO_BLOCK_GROUP_TREE is set in the superblock’s compat_ro_flags.
  • The extent tree otherwise (block group items historically lived in the extent tree).

The walk collects all items with key type BLOCK_GROUP_ITEM into a BTreeMap keyed by logical address.

Then:

  1. For each chunk in the chunk cache: if no block group exists at that logical address, report ChunkMissingBlockGroup.
  2. For each block group: if the chunk cache has no chunk at that logical address, report BlockGroupMissingChunk.

Device extent overlap detection

Device extents are collected from the device tree by walking all items with key type DEV_EXTENT. Each extent is recorded as (offset, length) grouped by device ID (key.objectid).

For each device, extents are sorted by physical offset. Then consecutive pairs are checked: if extents[i].offset < extents[i-1].offset + extents[i-1].length, the extents overlap and DeviceExtentOverlap is reported.
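The sort-then-compare-neighbors pattern can be sketched like this (extents as (offset, length) pairs for one device; the function name is illustrative):

```rust
// Sketch of per-device extent overlap detection: sort by physical offset,
// then compare each extent's start against the previous extent's end.
// Returns the offsets of the offending (overlapping) extents.
fn overlaps(extents: &mut Vec<(u64, u64)>) -> Vec<u64> {
    extents.sort_by_key(|&(offset, _)| offset);
    let mut bad = Vec::new();
    for pair in extents.windows(2) {
        let (prev_off, prev_len) = pair[0];
        let (cur_off, _) = pair[1];
        if cur_off < prev_off + prev_len {
            bad.push(cur_off); // starts inside the previous extent
        }
    }
    bad
}

fn main() {
    // (offset, length): the second extent starts inside the first.
    let mut dev1 = vec![(1048576u64, 8388608u64), (4194304, 1048576), (16777216, 1048576)];
    assert_eq!(overlaps(&mut dev1), vec![4194304]);
}
```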

Error variants produced

  • ChunkMissingBlockGroup { logical } – a chunk exists in the chunk tree but no block group item was found at the same logical address.
  • BlockGroupMissingChunk { logical } – a block group item exists but no chunk was found at the same logical address.
  • DeviceExtentOverlap { devid, offset } – two device extents on the same device overlap in physical address space.
  • ReadError { logical, detail } – I/O error reading the block-group tree, extent tree, or device tree.

Phase 5: FS roots

Source: cli/src/check/fs_roots.rs

Purpose: Walk every filesystem tree (the default FS tree and all subvolume trees) and verify inode-level consistency.

Which trees are checked

From the tree_roots map (populated during filesystem open), the phase selects trees whose objectid is either:

  • BTRFS_FS_TREE_OBJECTID (5) – the default filesystem tree.
  • >= BTRFS_FIRST_FREE_OBJECTID (256) – subvolume and snapshot trees.

Item collection

For each FS tree, collect_fs_items walks all leaves and groups items by objectid (inode number). Each item is stored as a (KeyType, key_offset, raw_data_bytes) tuple. Items arrive in sorted key order due to the B-tree traversal, which means within an objectid group, items are sorted by (key_type, offset).

Per-inode checks

For each objectid group (inode), the following checks are performed:

INODE_ITEM presence

The checker notes whether the objectid has an INODE_ITEM. If directory entries reference an objectid that has no INODE_ITEM, the entry is an orphan.

Parsed from INODE_ITEM: nlink, size (isize), nbytes, and mode.

The actual reference count is computed by counting entries across all INODE_REF items (via InodeRef::parse_all) and INODE_EXTREF items (via InodeExtref::parse_all) for this objectid. If the computed count differs from inode_item.nlink and the inode has at least one reference, NlinkMismatch is reported.

The root directory inode (objectid 256, BTRFS_FIRST_FREE_OBJECTID) is excluded from this check because it has special nlink handling in btrfs.

File extent overlap detection

For regular files, all EXTENT_DATA items are processed to extract (file_offset, file_offset + length) ranges:

  • Regular extents: length = num_bytes from the FileExtentBody::Regular variant.
  • Inline extents: length = inline_size from the FileExtentBody::Inline variant.

Since items are in key order and EXTENT_DATA keys use the file offset, ranges are already sorted by start offset. Consecutive ranges are checked: if ranges[i].start < ranges[i-1].end, a FileExtentOverlap is reported.

Directory inode size (isize) check

For directory inodes (mode & S_IFMT == S_IFDIR), the expected inode size is computed by summing name_len * 2 for every DIR_INDEX entry belonging to this inode. The factor of 2 matches the btrfs convention where directory inode size counts each entry’s name length twice (once for DIR_ITEM, once for DIR_INDEX).

If the inode’s stored size field differs from this computed sum, DirSizeWrong is reported.
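The expected-size computation reduces to a one-line sum (the function name is illustrative):

```rust
// Sketch of the directory isize check: expected size is the sum of
// name_len * 2 over the directory's DIR_INDEX entries.
fn expected_dir_size(dir_index_name_lens: &[u64]) -> u64 {
    dir_index_name_lens.iter().map(|n| n * 2).sum()
}

fn main() {
    // Entries with name lengths 3, 4, 3: expected isize = (3 + 4 + 3) * 2 = 20.
    assert_eq!(expected_dir_size(&[3, 4, 3]), 20);
}
```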

File nbytes check

For regular files and symlinks (mode & S_IFMT == S_IFREG or S_IFLNK), the expected nbytes is computed from extent items:

  • Inline extents: nbytes += data_len (the inline payload size).
  • Regular extents: nbytes += disk_num_bytes, but only for non-prealloc extents. Prealloc extents (preallocated but unwritten) and hole extents (disk_bytenr == 0) do not contribute.

If the inode’s stored nbytes differs from the computed total, NbytesWrong is reported.

Orphan directory entries

When processing DIR_ITEM and DIR_INDEX items, for each entry whose location key type is INODE_ITEM and whose target objectid is >= BTRFS_FIRST_FREE_OBJECTID (256): if the target inode has no INODE_ITEM anywhere in this tree, DirItemOrphan is reported. Both DIR_ITEM and DIR_INDEX entries are checked, so an orphan reference in either will be caught.

Error variants produced

  • InodeMissing { tree, ino } – an objectid is referenced but has no INODE_ITEM. (Note: this is detected indirectly through DirItemOrphan in the current implementation.)
  • NlinkMismatch { tree, ino, expected, found } – the inode’s stored nlink differs from the number of INODE_REF + INODE_EXTREF entries.
  • FileExtentOverlap { tree, ino, offset } – two file extent items for the same inode overlap in file offset space.
  • DirItemOrphan { tree, parent_ino, name } – a directory entry references an inode that has no INODE_ITEM.
  • DirSizeWrong { tree, ino, expected, found } – a directory inode’s stored size does not match the computed sum of DIR_INDEX name lengths times 2.
  • NbytesWrong { tree, ino, expected, found } – a file inode’s stored nbytes does not match the computed sum from extent items.
  • ReadError { logical, detail } – I/O error reading the FS tree.

Phase 6: Checksums

Source: cli/src/check/csums.rs

Purpose: Walk the checksum tree and optionally verify data block checksums against the actual on-disk data.

Structure of the csum tree

The csum tree contains EXTENT_CSUM items (key type 128). Each item covers a contiguous range of data sectors:

  • Key objectid: BTRFS_EXTENT_CSUM_OBJECTID (fixed constant).
  • Key offset: the logical byte address of the first sector covered.
  • Item data: packed array of checksums, one per sector. With CRC32C (4 bytes per checksum) and 4K sectors, a single item can cover many sectors.

What it checks

Phase 6a: tree walk and byte counting

Always performed. The phase walks the csum tree and for each EXTENT_CSUM item, computes num_csums = item_data_len / csum_size and adds item_data_len to total_csum_bytes. This total is reported in the final summary.

Phase 6b: data verification (optional)

Only performed when --check-data-csum is passed. Only supported for CRC32C checksums; other checksum types emit a warning and skip verification.

For each csum item, the phase iterates over every sector:

  1. Compute the logical address: item.key.offset + i * sectorsize.
  2. Read sectorsize bytes from that logical address via reader.read_data.
  3. Compute btrfs_csum_data(&data) (standard CRC32C).
  4. Compare to the stored checksum (extracted from the item data at offset i * csum_size).
  5. If they differ, or if the read fails, report CsumMismatch.

The btrfs_csum_data function uses the standard CRC32C (Castagnoli polynomial) computation with seed 0xFFFFFFFF and a final XOR, matching the kernel’s checksum for tree blocks and data. This is distinct from the raw CRC32C used in send streams.
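For reference, those parameters correspond to the following bit-at-a-time sketch (real implementations use lookup tables or the SSE4.2 CRC32 instruction; 0x82F63B78 is the reflected Castagnoli polynomial):

```rust
// Minimal bit-at-a-time CRC32C (Castagnoli) sketch with the parameters
// described in the text: seed 0xFFFFFFFF, final XOR, reflected polynomial.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc ^ 0xFFFF_FFFF
}

fn main() {
    // Standard CRC32C check value for "123456789" (RFC 3720).
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);
}
```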

Error variants produced

  • CsumMismatch { logical } – the computed CRC32C of the data at the given logical address does not match the stored checksum, or the data could not be read.
  • ReadError { logical, detail } – I/O error reading the csum tree itself.

Phase 7: Root refs

Source: cli/src/check/root_refs.rs

Purpose: Verify that ROOT_REF and ROOT_BACKREF items in the root tree are consistent with each other.

Background

In btrfs, subvolume parent-child relationships are recorded in the root tree using two item types:

  • ROOT_REF (key type 156): stored with objectid = parent_root_id, offset = child_root_id. Contains the directory ID, sequence number, and name of the directory entry that references the child subvolume.

  • ROOT_BACKREF (key type 157): stored with objectid = child_root_id, offset = parent_root_id. Contains the same fields as the corresponding ROOT_REF.

These items form a bidirectional link. For every ROOT_REF there should be a matching ROOT_BACKREF, and vice versa. The fields (dirid, sequence, name) should be identical between the pair.

What it checks

The phase walks the root tree and collects all ROOT_REF and ROOT_BACKREF items into two maps, keyed by (child_root_id, parent_root_id). Both item types are parsed using RootRef::parse (the on-disk format is identical).

Then two passes are made:

Forward check: every ROOT_REF has a matching ROOT_BACKREF

For each (child, parent) pair in the forward refs map:

  • If no entry exists in the back refs map, report RootBackrefMissing.
  • If an entry exists, compare the three fields:
    • dirid: if they differ, report RootRefMismatch with “dirid mismatch”.
    • sequence: if they differ, report RootRefMismatch with “sequence mismatch”.
    • name: if they differ, report RootRefMismatch with “name mismatch”.

Each field is checked independently, so a single pair can produce up to 3 mismatch errors.

Reverse check: every ROOT_BACKREF has a matching ROOT_REF

For each (child, parent) pair in the back refs map:

  • If no entry exists in the forward refs map, report RootRefMissing.

Field comparison is not repeated in this direction because the forward check already caught any field mismatches for pairs that exist in both maps.
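The two passes can be sketched over maps keyed by (child, parent). This is a simplified illustration using only the name field; the actual implementation also compares dirid and sequence:

```rust
use std::collections::HashMap;

// Sketch of the bidirectional ROOT_REF / ROOT_BACKREF check: both maps are
// keyed by (child_root_id, parent_root_id); missing counterparts are
// reported per direction, field mismatches only in the forward pass.
type RefKey = (u64, u64); // (child_root_id, parent_root_id)

fn cross_check(
    forward: &HashMap<RefKey, String>,  // ROOT_REF name payloads
    backward: &HashMap<RefKey, String>, // ROOT_BACKREF name payloads
) -> Vec<String> {
    let mut errors = Vec::new();
    for (key, name) in forward {
        match backward.get(key) {
            None => errors.push(format!("backref missing for {:?}", key)),
            Some(back_name) if back_name != name => {
                errors.push(format!("name mismatch for {:?}", key))
            }
            _ => {}
        }
    }
    for key in backward.keys() {
        if !forward.contains_key(key) {
            errors.push(format!("ref missing for {:?}", key));
        }
    }
    errors
}

fn main() {
    let fwd = HashMap::from([((256u64, 5u64), "snap".to_string())]);
    let back = HashMap::from([
        ((256u64, 5u64), "snap".to_string()),
        ((257, 5), "lost".to_string()), // orphan backref
    ]);
    assert_eq!(cross_check(&fwd, &back).len(), 1);
}
```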

Error variants produced

  • RootRefMissing { child, parent } – a ROOT_BACKREF exists for this child/parent pair but no corresponding ROOT_REF was found.
  • RootBackrefMissing { child, parent } – a ROOT_REF exists for this child/parent pair but no corresponding ROOT_BACKREF was found.
  • RootRefMismatch { child, parent, detail } – both ROOT_REF and ROOT_BACKREF exist but one of their fields (dirid, sequence, or name) differs. The detail string describes which field mismatched and shows both values.
  • ReadError { logical, detail } – I/O error reading the root tree.

Complete error type reference

All error variants are defined in cli/src/check/errors.rs as the CheckError enum. Each variant implements Display for human-readable error messages.

Phase 1 errors

| Variant | Fields | Description |
|---|---|---|
| SuperblockInvalid | mirror: u32, detail: String | Superblock mirror failed validation (bad magic, bad checksum, or read error) |

Phase 2 errors

| Variant | Fields | Description |
|---|---|---|
| TreeBlockChecksumMismatch | tree: &'static str, logical: u64 | CRC32C checksum does not match |
| TreeBlockBadFsid | tree: &'static str, logical: u64 | Header fsid does not match filesystem |
| TreeBlockBadBytenr | tree: &'static str, logical: u64, header_bytenr: u64 | Header bytenr disagrees with read address |
| TreeBlockBadGeneration | tree: &'static str, logical: u64, block_gen: u64, super_gen: u64 | Block generation exceeds superblock generation |
| TreeBlockBadLevel | tree: &'static str, logical: u64, detail: String | Level/type mismatch (leaf with level>0 or node with level==0) |
| KeyOrderViolation | tree: &'static str, logical: u64, index: usize | Key at index is not strictly greater than previous key |

Phase 3 errors

| Variant | Fields | Description |
|---|---|---|
| ExtentRefMismatch | bytenr: u64, expected: u64, found: u64 | Declared refs != counted refs (inline + standalone) |
| MissingExtentItem | bytenr: u64 | Tree block has no extent/metadata item in extent tree |
| BackrefOwnerMismatch | bytenr: u64, actual_owner: u64, claimed_owners: Vec<u64> | Actual tree block owner not in extent tree’s backref list |
| BackrefOrphan | bytenr: u64, claimed_owner: u64 | Extent tree claims a backref but no tree block found |
| OverlappingExtent | bytenr: u64, length: u64, prev_end: u64 | Data extent overlaps with previous extent |

Phase 4 errors

| Variant | Fields | Description |
|---|---|---|
| ChunkMissingBlockGroup | logical: u64 | Chunk has no matching block group item |
| BlockGroupMissingChunk | logical: u64 | Block group has no matching chunk |
| DeviceExtentOverlap | devid: u64, offset: u64 | Two device extents overlap on the same device |

Phase 5 errors

| Variant | Fields | Description |
|---|---|---|
| `InodeMissing` | `tree: u64, ino: u64` | Inode referenced but has no INODE_ITEM |
| `NlinkMismatch` | `tree: u64, ino: u64, expected: u32, found: u32` | Stored nlink differs from counted references |
| `FileExtentOverlap` | `tree: u64, ino: u64, offset: u64` | File extent items overlap in file offset space |
| `DirItemOrphan` | `tree: u64, parent_ino: u64, name: String` | Dir entry references non-existent inode |
| `DirSizeWrong` | `tree: u64, ino: u64, expected: u64, found: u64` | Directory inode size does not match DIR_INDEX name sum |
| `NbytesWrong` | `tree: u64, ino: u64, expected: u64, found: u64` | File inode nbytes does not match extent sum |
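The NlinkMismatch comparison can be sketched as counting references per inode and comparing against the stored nlink. This is illustrative only; the real check counts references while walking the tree rather than from a flat list:

```rust
use std::collections::HashMap;

// Hedged sketch of the NlinkMismatch comparison; not the project's actual
// code. `stored` maps inode number -> nlink from its INODE_ITEM, `refs` is
// one entry per counted reference. Returns (ino, expected, found) tuples
// matching the variant's fields.
fn nlink_mismatches(stored: &HashMap<u64, u32>, refs: &[u64]) -> Vec<(u64, u32, u32)> {
    let mut counted: HashMap<u64, u32> = HashMap::new();
    for &ino in refs {
        *counted.entry(ino).or_insert(0) += 1;
    }
    let mut out = Vec::new();
    for (&ino, &expected) in stored {
        let found = *counted.get(&ino).unwrap_or(&0);
        if expected != found {
            out.push((ino, expected, found));
        }
    }
    out.sort();
    out
}

fn main() {
    let stored = HashMap::from([(257u64, 2u32), (258, 1)]);
    // Inode 257 claims nlink 2 but only one reference was counted.
    println!("{:?}", nlink_mismatches(&stored, &[257, 258]));
}
```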

Phase 6 errors

| Variant | Fields | Description |
|---|---|---|
| `CsumMismatch` | `logical: u64` | Data checksum does not match stored value |

Phase 7 errors

| Variant | Fields | Description |
|---|---|---|
| `RootRefMissing` | `child: u64, parent: u64` | ROOT_BACKREF exists but no matching ROOT_REF |
| `RootBackrefMissing` | `child: u64, parent: u64` | ROOT_REF exists but no matching ROOT_BACKREF |
| `RootRefMismatch` | `child: u64, parent: u64, detail: String` | ROOT_REF and ROOT_BACKREF fields disagree |

Cross-phase error

| Variant | Fields | Description |
|---|---|---|
| `ReadError` | `logical: u64, detail: String` | I/O error reading any tree block (used in phases 2–7) |

Summary output

After all phases complete, CheckResults::print_summary writes to stdout:

found <bytes_used> bytes used, <error_count> error(s) found
total csum bytes: <total_csum_bytes>
total tree bytes: <total_tree_bytes>
total fs tree bytes: <total_fs_tree_bytes>
total extent tree bytes: <total_extent_tree_bytes>
btree space waste bytes: <btree_space_waste>
file data blocks allocated: <data_bytes_allocated>
 referenced <data_bytes_referenced>

If error_count > 0, the process exits with code 1.
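A hypothetical sketch of a summary printer matching the template above. The struct and field names are assumptions taken from the placeholders, not the project's real CheckResults definition:

```rust
// Hypothetical sketch reproducing the summary template; field names are
// assumptions derived from the <...> placeholders, not the real struct.
struct CheckResults {
    bytes_used: u64,
    error_count: u64,
    total_csum_bytes: u64,
    total_tree_bytes: u64,
    total_fs_tree_bytes: u64,
    total_extent_tree_bytes: u64,
    btree_space_waste: u64,
    data_bytes_allocated: u64,
    data_bytes_referenced: u64,
}

impl CheckResults {
    fn print_summary(&self) {
        println!(
            "found {} bytes used, {} error(s) found",
            self.bytes_used, self.error_count
        );
        println!("total csum bytes: {}", self.total_csum_bytes);
        println!("total tree bytes: {}", self.total_tree_bytes);
        println!("total fs tree bytes: {}", self.total_fs_tree_bytes);
        println!("total extent tree bytes: {}", self.total_extent_tree_bytes);
        println!("btree space waste bytes: {}", self.btree_space_waste);
        println!("file data blocks allocated: {}", self.data_bytes_allocated);
        println!(" referenced {}", self.data_bytes_referenced);
    }

    // Any recorded error makes the process exit non-zero.
    fn exit_code(&self) -> i32 {
        if self.error_count > 0 { 1 } else { 0 }
    }
}

fn main() {
    let r = CheckResults {
        bytes_used: 262144, error_count: 0, total_csum_bytes: 0,
        total_tree_bytes: 147456, total_fs_tree_bytes: 16384,
        total_extent_tree_bytes: 16384, btree_space_waste: 120000,
        data_bytes_allocated: 0, data_bytes_referenced: 0,
    };
    r.print_summary();
    println!("exit code: {}", r.exit_code());
}
```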

Limitations and future work

The following checks from the C reference implementation are not yet implemented:

  • --mode lowmem differentiation (the current implementation uses the “original” mode approach of collecting all items and then cross-checking them).
  • Log tree checking (the log tree is not walked in phase 2).
  • --repair (all checking is read-only).
  • --backup / --tree-root / --chunk-root (alternate root selection).
  • --init-csum-tree / --init-extent-tree (destructive reconstruction).
  • --qgroup-report (quota group consistency checking).
  • --subvol-extents (per-subvolume extent sharing analysis).
  • Superblock generation cross-checking between mirror copies.
  • Block group used-bytes verification (comparing declared used in block group items against actual allocated extents).