
Introduction

btrfsutils is a Rust implementation of the btrfs filesystem utilities. It provides three command-line tools: btrfs, for managing and inspecting btrfs filesystems; btrfs-mkfs, for creating new ones; and btrfs-tune, for offline superblock tuning. All three aim to be drop-in replacements for the tools provided by btrfs-progs.

Most commands are fully implemented and produce output matching the C reference; the explicit goal is drop-in compatibility with the reference implementation, plus additional features. The project is currently in beta (pre-1.0), so it should not be used in production, but the commands that are implemented are thoroughly tested.

It also provides library crates that can be used to access the kernel APIs for managing btrfs filesystems, to decode and write on-disk structures, and to parse and handle the btrfs send stream format.

Source Code

The source is available on GitHub and GitLab.

Installation

While these tools are still in their beta (pre-1.0 release) phase, you can already install them and try them out. Currently, the recommended way to install them is with Cargo; there are no binary builds to download.

Cargo

If you have cargo installed, you can install the utilities with it.

cargo install btrfs-cli
cargo install btrfs-tune
cargo install btrfs-mkfs

Nix

If you use Nix with flakes enabled, you can run the tool directly without installing it:

nix run github:rustutils/btrfsutils -- filesystem show /mnt

Or install it into your profile:

nix profile install github:rustutils/btrfsutils

From source

See Building from Source for instructions on compiling btrfsutils yourself from the repository.

Requirements

btrfsutils runs on Linux. Most commands that interact with a mounted filesystem require CAP_SYS_ADMIN (i.e. root, or a process with that capability granted). The exceptions are btrfs inspect-internal dump-super and dump-tree, which only require read access to the block device or image file.

Building from Source

Prerequisites

You need a Rust toolchain matching the version in rust-toolchain.toml — running rustup toolchain install in the project directory will pick it up automatically. You also need clang and libclang for bindgen, which generates Rust bindings from the kernel UAPI headers at build time.

On Fedora/RHEL:

sudo dnf install clang

On Debian/Ubuntu:

sudo apt install clang libclang-dev

Building with Cargo

cargo build --release

The resulting binaries are target/release/btrfs, target/release/btrfs-mkfs, and target/release/btrfs-tune.

Building with Nix

The project includes a Nix flake that provides a fully reproducible build with all dependencies pinned:

nix build

Outputs land in result/bin/btrfs, result/bin/btrfs-mkfs, result/bin/btrfs-tune, and result/share/man/man1/.

To enter a development shell with all tools available (including nightly rustfmt, cargo-insta, and cargo-llvm-cov):

nix develop

Contributors who want to run the full lint sweep (just check) on a non-Nix machine may also need a host-arch musl cross-compiler — see the “Static checks” section of the testing guide for setup instructions.

Concepts

This page defines the terms used throughout the btrfs documentation and command output.

Filesystem

A btrfs filesystem is a single logical storage pool. It has a UUID and an optional human-readable label, and it can span one or more physical block devices. All data and metadata stored in the filesystem is distributed across its devices according to the configured RAID profiles.

A filesystem is accessed by mounting it at a path. Most btrfs commands take that mount point (or any path within it) as their argument.

Device

A device is a block device — a disk partition or a whole disk — that belongs to a filesystem. Every filesystem has at least one device. Additional devices can be added or removed while the filesystem is mounted, allowing online capacity changes.

Subvolume

A subvolume is an independently managed subtree within a filesystem. It looks like a directory, but it has its own inode namespace and can be snapshotted, sent, or deleted independently from the rest of the filesystem.

When you mount a btrfs filesystem, you are mounting one of its subvolumes (the default subvolume, unless you specify otherwise). Other subvolumes appear as directories within it but can also be mounted directly with the subvol= or subvolid= mount options.

Snapshot

A snapshot is a copy-on-write copy of a subvolume taken at a point in time. It initially shares all of its data with the source subvolume; the two diverge as either copy is written. Snapshots can be read-write or read-only. Read-only snapshots are required for btrfs send.

Chunk

btrfs divides storage into chunks — large, contiguous regions of logical address space (typically 256 MiB for metadata, 1 GiB for data). Each chunk is backed by one or more physical stripes on the underlying devices, according to the RAID profile in use. The mapping from logical addresses to physical device locations is stored in the chunk tree.

Extent

An extent is a contiguous run of bytes within a chunk. File data is stored in data extents; the B-trees that make up btrfs metadata are stored in metadata extents. btrfs uses copy-on-write: modifying data creates a new extent rather than overwriting the old one, which is what makes snapshots cheap.

Generation

Every committed transaction increments the filesystem’s generation number. Subvolumes track the generation at which they were last modified (their generation) and the generation at which they were originally created (their ogeneration, or original generation). These are used by tools like btrfs subvolume find-new to identify recently changed files, and by btrfs send to select an appropriate incremental parent.

qgroup

A quota group (qgroup) tracks and optionally limits the amount of space used by a set of subvolumes. qgroups can be nested into a hierarchy, which allows shared space (space that would not be freed even if one subvolume were deleted) to be accounted at the group level. Quotas must be enabled on the filesystem before qgroups can be used.

Commands

btrfsutils implements the same command structure as the upstream btrfs tool. Commands are organized into groups:

btrfs filesystem

Manage and inspect mounted filesystems.

| Command | Description |
|---|---|
| btrfs filesystem show [path] | Show filesystem info and devices |
| btrfs filesystem df <path> | Show space usage by chunk type |
| btrfs filesystem usage <path> | Detailed space usage with per-device breakdown |
| btrfs filesystem du <path> | Show disk usage including shared extents |
| btrfs filesystem sync <path> | Sync the filesystem |
| btrfs filesystem defrag <path> | Defragment a file or directory |
| btrfs filesystem resize <size> <path> | Resize a mounted filesystem |
| btrfs filesystem label <path> [label] | Get or set the filesystem label |
| btrfs filesystem mkswapfile <path> | Create a swapfile |
| btrfs filesystem commit-stats <path> | Show commit statistics |

btrfs subvolume

Create and manage subvolumes and snapshots.

| Command | Description |
|---|---|
| btrfs subvolume create <path> | Create a subvolume |
| btrfs subvolume delete <path> | Delete a subvolume |
| btrfs subvolume snapshot <src> <dst> | Create a snapshot |
| btrfs subvolume list <path> | List subvolumes |
| btrfs subvolume show <path> | Show subvolume details |
| btrfs subvolume get-default <path> | Show the default subvolume |
| btrfs subvolume set-default <id> <path> | Set the default subvolume |
| btrfs subvolume get-flags <path> | Show subvolume flags |
| btrfs subvolume set-flags <path> | Set subvolume flags |
| btrfs subvolume find-new <path> <gen> | Find files modified since a generation |
| btrfs subvolume sync <path> | Wait for deleted subvolumes to be cleaned up |

btrfs device

Manage devices in a multi-device filesystem.

| Command | Description |
|---|---|
| btrfs device add <dev> <path> | Add a device |
| btrfs device remove <dev> <path> | Remove a device |
| btrfs device stats <path> | Show per-device error statistics |
| btrfs device scan [dev] | Scan for btrfs devices |
| btrfs device ready <dev> | Check if a multi-device filesystem is ready |
| btrfs device usage <path> | Show per-device allocation details |

btrfs balance

Rebalance data and metadata across devices or profiles.

| Command | Description |
|---|---|
| btrfs balance start <path> | Start a balance |
| btrfs balance pause <path> | Pause a running balance |
| btrfs balance resume <path> | Resume a paused balance |
| btrfs balance cancel <path> | Cancel a running or paused balance |
| btrfs balance status <path> | Show balance status |

Balance filters (-d, -m, -s) accept filter strings such as usage=50,profiles=raid1|single.

btrfs scrub

Verify data and metadata checksums.

| Command | Description |
|---|---|
| btrfs scrub start <path> | Start a scrub |
| btrfs scrub cancel <path> | Cancel a running scrub |
| btrfs scrub resume <path> | Resume a cancelled scrub |
| btrfs scrub status <path> | Show scrub status |
| btrfs scrub limit <path> | Get or set scrub throughput limit |

btrfs replace

Replace a device in a filesystem.

| Command | Description |
|---|---|
| btrfs replace start <srcdev> <tgtdev> <path> | Start a device replacement |
| btrfs replace status <path> | Show replacement status |
| btrfs replace cancel <path> | Cancel a running replacement |

btrfs send / receive

Stream filesystem data between systems.

| Command | Description |
|---|---|
| btrfs send <subvol> | Send a subvolume as a stream |
| btrfs receive <path> | Receive a stream into a directory |

btrfs send supports full sends and incremental sends (-p parent, -c clone sources). btrfs receive supports v1, v2 (compressed data), and v3 (fs-verity) stream formats.

btrfs inspect-internal

Low-level inspection tools.

| Command | Description |
|---|---|
| btrfs inspect-internal rootid <path> | Show the subvolume ID for a path |
| btrfs inspect-internal inode-resolve <ino> <path> | Resolve an inode to paths |
| btrfs inspect-internal logical-resolve <addr> <path> | Resolve a logical address to paths |
| btrfs inspect-internal subvolid-resolve <id> <path> | Resolve a subvolume ID to a path |
| btrfs inspect-internal min-dev-size <path> | Show the minimum safe device size |
| btrfs inspect-internal list-chunks <path> | List all chunk allocations |
| btrfs inspect-internal dump-super <dev> | Dump the superblock |
| btrfs inspect-internal dump-tree <dev> | Dump raw B-tree contents |
| btrfs inspect-internal tree-stats <dev> | Walk a B-tree and report node/leaf statistics |
| btrfs inspect-internal map-swapfile <path> | Show physical extent map of a swapfile |

dump-super and dump-tree read directly from a block device or image file and do not require a mounted filesystem or elevated privileges.

btrfs quota / qgroup

Manage filesystem quotas.

| Command | Description |
|---|---|
| btrfs quota enable <path> | Enable quotas |
| btrfs quota disable <path> | Disable quotas |
| btrfs quota rescan <path> | Rescan quota usage |
| btrfs quota status <path> | Show quota status |
| btrfs qgroup show <path> | Show qgroup usage |
| btrfs qgroup create <id> <path> | Create a qgroup |
| btrfs qgroup destroy <id> <path> | Destroy a qgroup |
| btrfs qgroup assign <src> <dst> <path> | Assign a qgroup to a parent |
| btrfs qgroup remove <src> <dst> <path> | Remove a qgroup assignment |
| btrfs qgroup limit <size> <id> <path> | Set a qgroup size limit |
| btrfs qgroup clear-stale <path> | Remove stale qgroups |

btrfs property

Get and set filesystem object properties.

| Command | Description |
|---|---|
| btrfs property get <path> [name] | Get a property |
| btrfs property set <path> <name> <value> | Set a property |
| btrfs property list <path> | List available properties |

Supported properties: ro (subvolumes), label (filesystem/device), compression (inodes).

btrfs restore

Recover files from a damaged or unmounted filesystem by reading on-disk structures directly.

| Command | Description |
|---|---|
| btrfs restore <dev> <path> | Restore files to a destination directory |
| btrfs restore -l <dev> | List available tree roots |

Supports regular files, directories, symlinks (-S), extended attributes (-x), metadata (owner/mode/times with -m), and compressed extents (zlib/zstd/lzo). Use --path-regex to filter restored files and -s to include snapshots.

btrfs rescue

Emergency recovery tools for damaged filesystems.

| Command | Description |
|---|---|
| btrfs rescue super-recover <dev> | Restore superblock from mirrors |
| btrfs rescue zero-log <dev> | Clear the log tree pointer |
| btrfs rescue create-control-device | Create /dev/btrfs-control if missing |
| btrfs rescue fix-device-size <dev> | Re-align device and superblock sizes |
| btrfs rescue fix-data-checksum [--readonly\|--mirror 1] <dev> | Scan and (with --mirror 1) repair data csums |
| btrfs rescue clear-uuid-tree <dev> | Drop the UUID tree so the kernel rebuilds it |
| btrfs rescue clear-space-cache <v1\|v2> <dev> | Clear the v1 or v2 free space cache |
| btrfs rescue clear-ino-cache <dev> | Remove leftover items from the deprecated inode cache |

btrfs rescue chunk-recover has argument parsing scaffolded but is not yet implemented.

btrfs-mkfs

Create a new btrfs filesystem on a block device or image file.

btrfs-mkfs [options] <device> [device...]

Supports single-device and multi-device filesystems with all RAID profiles (SINGLE, DUP, RAID0, RAID1, RAID1C3, RAID1C4, RAID10, RAID5, RAID6), all four checksum algorithms (crc32c, xxhash, sha256, blake2b), quota and simple quota setup, custom nodesize/sectorsize, labels, UUIDs, feature flags, and directory population via --rootdir.

btrfs-tune

Modify btrfs filesystem parameters on an unmounted device.

btrfs-tune [options] <device>
| Flag | Description |
|---|---|
| -r | Enable extended inode refs (extref) |
| -x | Enable skinny metadata extent refs |
| -n | Enable no-holes feature |
| -S 0 / -S 1 | Clear or set the seeding flag |
| -m | Change fsid to a random UUID (metadata_uuid mechanism) |
| -M <uuid> | Change fsid to a specific UUID (metadata_uuid mechanism) |
| -u | Rewrite fsid to a random UUID (patches all tree blocks) |
| -U <uuid> | Rewrite fsid to a specific UUID (patches all tree blocks) |

Global flags

These flags are accepted by all btrfs commands:

| Flag | Description |
|---|---|
| -v / --verbose | Increase verbosity (repeatable) |
| -q / --quiet | Suppress non-error output |
| -f / --format | Set the format, one of: text, json, modern |

Output Format

Many commands accept --format json, which causes them to output JSON-formatted data.

Differences from btrfs-progs

btrfsutils aims to be a drop-in replacement for btrfs-progs. Most commands produce identical output and accept the same flags. This page lists the known gaps and the features that go beyond what btrfs-progs offers.

What’s not yet supported

These features from btrfs-progs are not yet implemented:

  • btrfs check --repair and related write-mode flags (--init-csum-tree, --init-extent-tree, etc.). Read-only checking works.
  • btrfs check --mode lowmem (currently only the default mode is supported).
  • btrfs rescue chunk-recover. Other write-mode rescue subcommands (fix-device-size, clear-space-cache, clear-uuid-tree, clear-ino-cache, fix-data-checksum) are implemented.
  • btrfs filesystem resize --offline.
  • btrfs-mkfs zoned device support.
  • btrfs-tune --convert-to-free-space-tree and --convert-to-block-group-tree.

What’s added beyond btrfs-progs

These features are original additions not present in the C tools:

  • --format modern (or BTRFS_OUTPUT_FORMAT=modern): opt-in improved output with adaptive column widths and tree views. Supported by most tabular commands including device stats, device usage, subvolume list, inspect list-chunks, filesystem du/df/show/usage, qgroup show, quota status, scrub start/status.
  • btrfs filesystem du --depth N: limit display depth while computing full totals.
  • btrfs filesystem du --sort: sort entries by path, total, exclusive, or shared.
  • btrfs inspect list-chunks --offline: read chunks directly from an unmounted device or image file without CAP_SYS_ADMIN.
  • btrfs inspect min-dev-size --offline: compute minimum device size from an unmounted device or image file.
  • btrfs device stats --offline: read device error statistics from the on-disk device tree without requiring a mounted filesystem.

Architecture

Crate structure

The project follows a strict layering: lower crates have no knowledge of the layers above them.

Architecture diagram

btrfs-uapi wraps kernel ioctls, sysfs reads, and procfs reads into safe Rust APIs. It is Linux-only and the only crate that talks directly to the kernel.

btrfs-disk parses on-disk structures — superblocks, B-tree nodes, item payloads — from raw byte buffers. It is platform-independent and does not depend on btrfs-uapi, so it can be used to inspect filesystem images on any OS.

btrfs-stream parses the btrfs send stream wire format. The core parser is platform-independent. The optional receive feature is Linux-only and applies a parsed stream to a mounted filesystem via btrfs-uapi.

btrfs-mkfs implements the mkfs.btrfs tool. It constructs B-tree nodes as raw byte buffers and writes them directly to a block device or image file using pwrite. It does not use ioctls.

btrfs-tune implements the btrfstune tool. It modifies on-disk superblock parameters (feature flags, seeding, filesystem UUIDs) on unmounted devices. For lightweight UUID changes it only rewrites the superblock; for full fsid rewrites it traverses every tree block on disk via btrfs-disk.

btrfs-cli implements the btrfs tool. It handles argument parsing via clap, calls into btrfs-uapi and btrfs-disk as needed, and formats all output. Optionally, this tool can also embed the btrfs-tune and btrfs-mkfs tools as subcommands, for easier single-file deployment.

The two-layer model

Every feature that involves kernel communication is split across two layers. The uapi/ layer provides a safe Rust function: it takes typed arguments, calls the ioctl, and returns a typed result, with no unsafe in the public API and no knowledge of CLI concerns. The cli/ layer provides a clap subcommand that calls into uapi/ and formats the result for the user, with no ioctl calls or raw kernel types.

This rule applies to all kernel interfaces — btrfs ioctls, standard VFS ioctls like FS_IOC_FIEMAP, and block device ioctls like BLKGETSIZE64 all live in uapi/, never in cli/.

The same principle applies to disk/: it parses raw bytes into typed structs, and cli/ handles all display formatting. The disk/ crate never calls println!.

How Commands Work

Every command in btrfsutils is implemented across two layers: a safe kernel interface wrapper in btrfs-uapi, and a CLI command in btrfs-cli. This page walks through a concrete example — btrfs filesystem label — to show how the two layers fit together and why the split exists.

The uapi layer

The uapi layer lives in uapi/src/. Its job is to translate between Rust types and the raw kernel interfaces — allocating ioctl argument buffers, calling the ioctl, and converting the result into something the rest of the code can use without touching any unsafe code or bindgen types.

For btrfs filesystem label, that looks like this (from uapi/src/filesystem.rs):

#![allow(unused)]
fn main() {
pub fn label_get(fd: BorrowedFd) -> nix::Result<CString> {
    let mut buf = [0i8; BTRFS_LABEL_SIZE as usize];
    unsafe { btrfs_ioc_get_fslabel(fd.as_raw_fd(), &mut buf) }?;
    let cstr = unsafe { CStr::from_ptr(buf.as_ptr()) };
    Ok(cstr.to_owned())
}

pub fn label_set(fd: BorrowedFd, label: &CStr) -> nix::Result<()> {
    let bytes = label.to_bytes();
    if bytes.len() >= BTRFS_LABEL_SIZE as usize {
        return Err(nix::errno::Errno::EINVAL);
    }
    let mut buf = [0i8; BTRFS_LABEL_SIZE as usize];
    for (i, &b) in bytes.iter().enumerate() {
        buf[i] = b as c_char;
    }
    unsafe { btrfs_ioc_set_fslabel(fd.as_raw_fd(), &buf) }?;
    Ok(())
}
}

The function signatures use BorrowedFd rather than a raw integer, CString rather than a byte array, and nix::Result rather than checking errno manually. The caller never sees btrfs_ioctl_* types. The unsafe is contained to the ioctl call itself, with surrounding logic that is safe and testable.

The cli layer

The CLI layer lives in cli/src/. Its job is to parse arguments, call the uapi function, and format the output. It never calls ioctls directly.

The same command in cli/src/filesystem/label.rs:

#![allow(unused)]
fn main() {
#[derive(Parser, Debug)]
pub struct FilesystemLabelCommand {
    /// The device or mount point to operate on
    pub path: PathBuf,
    /// The new label to set (if omitted, the current label is printed)
    pub new_label: Option<OsString>,
}

impl Runnable for FilesystemLabelCommand {
    fn run(&self, _format: Format, _dry_run: bool) -> Result<()> {
        let file = open_path(&self.path)?;
        match &self.new_label {
            None => {
                let label = label_get(file.as_fd())
                    .with_context(|| format!("failed to get label for '{}'", self.path.display()))?;
                println!("{}", label.to_bytes().escape_ascii());
            }
            Some(new_label) => {
                let cstring = CString::new(new_label.as_bytes())
                    .context("label must not contain null bytes")?;
                label_set(file.as_fd(), &cstring)
                    .with_context(|| format!("failed to set label for '{}'", self.path.display()))?;
            }
        }
        Ok(())
    }
}
}

The struct derives Parser from clap — the field doc comments become the help text. Runnable::run handles the two cases (get and set) by opening the path, calling the appropriate uapi function, and either printing the result or reporting an error. Error messages include the path so the user knows which filesystem failed.

Why the split

The separation keeps each layer focused and independently testable. The uapi layer can be tested with unit tests that mock the ioctl, or with integration tests that operate on a real filesystem, without any CLI machinery involved. The CLI layer can be tested with argument parsing snapshot tests (no filesystem needed at all) and help text snapshot tests.

It also keeps the library crates clean. Because btrfs-uapi, btrfs-disk, and btrfs-stream contain no CLI logic and no GPL-derived code, they can be licensed MIT/Apache-2.0 and used by other projects independently of the CLI tools.

Routing

Each top-level command group has a router in cli/src/ (e.g. cli/src/filesystem.rs) that defines a FilesystemCommand enum with a variant per subcommand. The Runnable implementation for the router matches on the variant and delegates to the subcommand’s own run method. Adding a new subcommand means adding a variant to the enum, a mod declaration, and a run dispatch arm.
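The routing pattern can be sketched as follows; the variant names mirror the document's description, but this is an illustration, not the project's actual code:

```rust
// Hypothetical router sketch: one enum variant per subcommand, with a
// run method that dispatches to the subcommand (stubbed as strings).
enum FilesystemCommand {
    Show,
    Label,
    // Adding a subcommand = adding a variant here plus a dispatch arm.
}

impl FilesystemCommand {
    fn run(&self) -> &'static str {
        // The router matches on the variant and delegates to the
        // subcommand's own run method.
        match self {
            FilesystemCommand::Show => "ran filesystem show",
            FilesystemCommand::Label => "ran filesystem label",
        }
    }
}

fn main() {
    assert_eq!(FilesystemCommand::Label.run(), "ran filesystem label");
}
```

In the real code the variants carry clap-derived argument structs, so dispatch and argument parsing stay in one place per command group.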

Kernel Interfaces

All kernel communication lives in btrfs-uapi. This page describes the patterns used to wrap the three main kernel interface types: ioctls, sysfs, and tree search.

Binding ioctls

Raw bindgen output is in uapi::raw, generated from uapi/src/raw/btrfs.h and btrfs_tree.h. Ioctl wrappers are declared in uapi/src/raw.rs using nix macros:

#![allow(unused)]
fn main() {
ioctl_write_ptr!(btrfs_ioc_resize, BTRFS_IOCTL_MAGIC, 3, btrfs_ioctl_vol_args);
ioctl_read!(btrfs_ioc_fs_info, BTRFS_IOCTL_MAGIC, 31, btrfs_ioctl_fs_info_args);
ioctl_readwrite!(btrfs_ioc_balance_v2, BTRFS_IOCTL_MAGIC, 32, btrfs_ioctl_balance_args);
ioctl_none!(btrfs_ioc_scrub_cancel, BTRFS_IOCTL_MAGIC, 28);
ioctl_write_int!(btrfs_ioc_balance_ctl, BTRFS_IOCTL_MAGIC, 33);
}

The macro to use is determined by the ioctl direction in the C header:

| C macro | nix macro | Direction |
|---|---|---|
| _IOW | ioctl_write_ptr! | userspace → kernel (pointer to struct) |
| _IOR | ioctl_read! | kernel → userspace |
| _IOWR | ioctl_readwrite! | both directions |
| _IO | ioctl_none! | no data |
| _IOW (integer) | ioctl_write_int! | value passed directly in arg slot |

Flexible array member ioctls

Some ioctls return variable-length arrays (e.g. btrfs_ioctl_space_args with a trailing spaces[] field). The pattern is a two-phase call:

  1. Call with zero slots to get the count from the kernel.
  2. Allocate a Vec<u64> (for 8-byte alignment) sized to base_size + count * item_size.
  3. Cast the vec’s pointer to the struct type, set the slot count, call again.
  4. Read results via __IncompleteArrayField::as_slice(count).

See uapi/src/space.rs for a worked example.
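The sizing and casting in steps 2-4 can be sketched in plain Rust with a stand-in struct; the field names below are illustrative, not the real btrfs_ioctl_space_args layout:

```rust
use std::mem::size_of;

// Stand-in for a kernel struct with a trailing flexible array.
#[repr(C)]
struct Header {
    slots: u64,
    total: u64,
}

#[repr(C)]
struct Item {
    flags: u64,
    total_bytes: u64,
    used_bytes: u64,
}

// Step 2: allocate a Vec<u64> so the buffer is 8-byte aligned and
// large enough for the header plus `count` trailing items.
fn alloc_for(count: usize) -> Vec<u64> {
    let bytes = size_of::<Header>() + count * size_of::<Item>();
    vec![0u64; bytes.div_ceil(8)]
}

fn main() {
    // Pretend phase 1 reported 3 items.
    let count = 3;
    let mut buf = alloc_for(count);
    // Step 3: view the buffer as the struct and set the slot count
    // (the second ioctl call would happen here).
    let hdr = buf.as_mut_ptr() as *mut Header;
    unsafe { (*hdr).slots = count as u64 };
    // Step 4: the returned items start right past the header.
    let items = unsafe {
        std::slice::from_raw_parts(
            (hdr as *const u8).add(size_of::<Header>()) as *const Item,
            count,
        )
    };
    assert_eq!(items.len(), 3);
}
```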

The btrfs_ioctl_vol_args_v2 union

Several subvolume and device ioctls share btrfs_ioctl_vol_args_v2. Bindgen generates two anonymous union fields:

  • __bindgen_anon_1 — the {size, qgroup_inherit} / unused[4] union
  • __bindgen_anon_2 — the name[4040] / devid / subvolid union
#![allow(unused)]
fn main() {
// Set a name:
let name_buf: &mut [c_char] = unsafe { &mut args.__bindgen_anon_2.name };

// Set devid (no unsafe needed for plain integer writes):
args.flags = BTRFS_DEVICE_SPEC_BY_ID as u64;
args.__bindgen_anon_2.devid = devid;
}

Tree search

The tree search ioctl is the primary way to read data from btrfs B-trees from userspace. It is wrapped in uapi/src/tree_search.rs as a callback-based cursor:

#![allow(unused)]
fn main() {
tree_search(fd, SearchFilter::for_type(tree_id, item_type), |hdr, data| {
    // hdr: SearchHeader — objectid, offset, item_type, len (host byte order)
    // data: &[u8] — raw on-disk item payload (little-endian)
    Ok(())
})?;
}

Common SearchFilter constructors:

#![allow(unused)]
fn main() {
// All items of a specific type across all objectids:
SearchFilter::for_type(raw::BTRFS_CHUNK_TREE_OBJECTID as u64,
                       raw::BTRFS_CHUNK_ITEM_KEY as u32)

// Items of a specific type within an objectid range:
SearchFilter::for_objectid_range(tree_id, item_type, min_oid, max_oid)
}

For searches spanning multiple item types (e.g. the quota tree walk that reads STATUS, INFO, LIMIT, and RELATION keys in one pass), construct SearchFilter directly with start and end Key values spanning the desired type range.

Important: The start and end keys form compound bounds on the B-tree key order (objectid, item_type, offset). They are not independent per-field filters. Items with unexpected types can appear if their compound key falls between start and end. Callbacks should filter on hdr.item_type when they need a single type.
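The compound ordering can be demonstrated with Rust tuples, which also compare lexicographically:

```rust
fn main() {
    // Keys are (objectid, item_type, offset); bounds compare as whole
    // tuples, not field by field.
    let start = (256u64, 0u32, 0u64);
    let narrow_end = (256u64, 255u32, u64::MAX);
    let wide_end = (300u64, 0u32, 0u64);

    // With a single-objectid range, another objectid is excluded:
    let other = (257u64, 10u32, 0u64);
    assert!(!(other >= start && other <= narrow_end));

    // With a wider objectid range, *any* item type in between falls
    // inside the bounds, which is why callbacks should still check
    // hdr.item_type:
    assert!(other >= start && other <= wide_end);
}
```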

Bindgen type note

Tree objectid constants from btrfs_tree.h bind as u32 in Rust despite being ULL in C (e.g. BTRFS_ROOT_TREE_OBJECTID: u32 = 1). Always cast at the use site. BTRFS_LAST_FREE_OBJECTID binds as i32 = -256; cast to u64 gives 0xFFFFFFFF_FFFFFF00 as expected.
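Both casts behave as described because Rust's as operator sign-extends signed sources; a quick check:

```rust
fn main() {
    // Values as bindgen emits them, per the note above.
    const BTRFS_ROOT_TREE_OBJECTID: u32 = 1;
    const BTRFS_LAST_FREE_OBJECTID: i32 = -256;

    // u32 -> u64 zero-extends; cast at the use site.
    assert_eq!(BTRFS_ROOT_TREE_OBJECTID as u64, 1);

    // i32 -> u64 sign-extends, producing the intended 64-bit key.
    assert_eq!(BTRFS_LAST_FREE_OBJECTID as u64, 0xFFFF_FFFF_FFFF_FF00);
}
```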

Cursor advancement

This is the most common source of bugs with tree search. The kernel interprets (min_objectid, min_type, min_offset) as a compound tuple key, not three independent range filters. After each batch, all three fields must be advanced together past the last returned item:

  • Normal case (offset does not overflow u64): set min_objectid = last.objectid, min_type = last.item_type, min_offset = last.offset + 1.
  • Offset overflow: set min_offset = 0, keep min_objectid = last.objectid, set min_type = last.item_type + 1.
  • Type also overflows u32: set min_offset = 0, min_type = 0, min_objectid = last.objectid + 1.

Advancing only min_offset while leaving min_objectid unchanged causes items from lower objectids to match the new minimum on every subsequent batch, producing an infinite loop.
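The three cases can be captured in a small helper; this is a sketch of the rule, not the crate's actual API:

```rust
/// Advance a (objectid, item_type, offset) key just past the last item
/// returned by a batch. Illustrative helper, not btrfs-uapi's API.
fn advance_key(objectid: u64, item_type: u32, offset: u64) -> (u64, u32, u64) {
    if let Some(next_offset) = offset.checked_add(1) {
        // Normal case: bump only the offset.
        (objectid, item_type, next_offset)
    } else if let Some(next_type) = item_type.checked_add(1) {
        // Offset overflowed: reset it and bump the type.
        (objectid, next_type, 0)
    } else {
        // Type overflowed too: move to the next objectid.
        (objectid + 1, 0, 0)
    }
}

fn main() {
    assert_eq!(advance_key(5, 132, 10), (5, 132, 11));
    assert_eq!(advance_key(5, 132, u64::MAX), (5, 133, 0));
    assert_eq!(advance_key(5, u32::MAX, u64::MAX), (6, 0, 0));
}
```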

Sysfs

Some data is read from sysfs rather than ioctls — for example, scrub throughput limits and quota state. The SysfsBtrfs type in uapi/src/sysfs.rs provides typed access to /sys/fs/btrfs/<uuid>/. The filesystem UUID is obtained from fs_info() (BTRFS_IOC_FS_INFO).
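A minimal sketch of the path construction, using the quota_override attribute as an example (the real SysfsBtrfs type may expose this differently):

```rust
use std::path::PathBuf;

// Build the sysfs path for one attribute of a filesystem, given the
// UUID reported by BTRFS_IOC_FS_INFO. Hypothetical helper for
// illustration only.
fn sysfs_attr_path(uuid: &str, attr: &str) -> PathBuf {
    PathBuf::from("/sys/fs/btrfs").join(uuid).join(attr)
}

fn main() {
    let p = sysfs_attr_path("0f0f0f0f-0000-0000-0000-000000000001", "quota_override");
    assert_eq!(
        p.to_str().unwrap(),
        "/sys/fs/btrfs/0f0f0f0f-0000-0000-0000-000000000001/quota_override"
    );
}
```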

Send and Receive

btrfs send and btrfs receive transfer filesystem state between two btrfs filesystems as a byte stream. This page explains how the mechanism works and how to use the btrfs-stream and btrfs-uapi crates to implement receive in your own application.

How send works

btrfs send asks the kernel to generate a stream representing the contents of a read-only subvolume. The kernel traverses the subvolume’s B-trees and emits a sequence of commands describing every file, directory, symlink, and extent. For an incremental send (with -p <parent>), only the differences from the parent subvolume are emitted.

The kernel is invoked via BTRFS_IOC_SEND, which writes the stream to a file descriptor (typically the write end of a pipe). A reader thread on the other end consumes the stream and writes it to a file or stdout.

The stream format

The stream is a binary format consisting of a header followed by a sequence of commands.

The stream header identifies the format version (v1, v2, or v3) and contains a magic number (btrfs-stream\0). After the header, commands follow back-to-back until an END command signals completion.

Each command has the following structure:

u32  total_length    (length of the entire command, including this header)
u16  command_type    (BTRFS_SEND_C_* constant)
u32  crc32c          (checksum of the command, with the crc field zeroed)
     attributes...   (variable-length TLV list)

Attributes are TLV-encoded:

u16  attribute_type  (BTRFS_SEND_A_* constant)
u16  length
     data...

The CRC32C used by btrfs is the raw variant (initial seed 0, no final XOR), not the standard ISO 3309 variant (initial seed 0xFFFFFFFF). When computing or verifying a checksum, use:

#![allow(unused)]
fn main() {
let crc = !crc32c::crc32c_append(!0u32, data);
}

Parsing a stream with btrfs-stream

The btrfs-stream crate provides StreamReader, which parses commands one at a time from any Read source:

#![allow(unused)]
fn main() {
use btrfs_stream::{StreamReader, StreamCommand};

let mut reader = StreamReader::new(input)?; // reads and validates the header
while let Some(command) = reader.read_command()? {
    match command {
        StreamCommand::Subvol { path, uuid, ctransid } => { /* create subvolume */ }
        StreamCommand::MkFile { path } => { /* create file */ }
        StreamCommand::Write { path, offset, data } => { /* write data */ }
        StreamCommand::Rename { path, path_to } => { /* rename */ }
        StreamCommand::End => break,
        // ... all 22+ command types
    }
}
}

StreamReader::new reads the stream header and returns an error if the magic is wrong or the version is unsupported. read_command returns None at EOF.

Applying a stream with btrfs-uapi

To implement receive, you need to apply each command to a mounted btrfs filesystem. The relevant operations are:

Subvolume and snapshot creation (BTRFS_IOC_SUBVOL_CREATE, BTRFS_IOC_SNAP_CREATE_V2): for Subvol commands, create a new empty subvolume. For Snapshot commands, look up the source subvolume by UUID using subvolume_search_by_received_uuid or subvolume_search_by_uuid, then create a writable snapshot.

File operations: standard POSIX calls — open/create, unlink, mkdir, rmdir, symlink, link, rename. btrfs does not require any special ioctls for these.

Write (BTRFS_IOC_ENCODED_WRITE or pwrite): v2 streams may send pre-compressed data via ENCODED_WRITE. If the kernel supports it, this can be passed directly; otherwise decompress and fall back to pwrite.

Clone (BTRFS_IOC_CLONE_RANGE): shares an extent between two files without copying data. The source file is found by resolving its UUID via the UUID tree.

Subvolume finalization: once all commands for a subvolume have been processed, call BTRFS_IOC_SET_RECEIVED_SUBVOL to record the UUID and ctransid, then set the subvolume read-only with BTRFS_IOC_SUBVOL_SETFLAGS.

Using ReceiveContext

If you want a complete, ready-to-use receive implementation rather than building your own, the receive feature of btrfs-stream provides ReceiveContext:

btrfs-stream = { version = "0.5", features = ["receive"] }
#![allow(unused)]
fn main() {
use btrfs_stream::ReceiveContext;

let mut ctx = ReceiveContext::new(destination_dir)?;
ctx.receive(input_stream)?;
}

ReceiveContext handles all command types including v2 encoded writes (with decompression fallback for zlib, zstd, and lzo) and v3 fs-verity. It uses an fd cache to avoid reopening the same file for sequential writes, which is important for performance when receiving large files.
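The fd cache idea can be sketched with a plain map keyed by path; this illustrates the concept only and is not ReceiveContext's actual implementation:

```rust
use std::collections::HashMap;

// Conceptual fd cache: reuse an open handle for repeated writes to the
// same path instead of reopening per command. Handles are stubbed as
// integers; a real cache would hold File values and evict LRU-style.
struct FdCache {
    capacity: usize,
    open: HashMap<String, u64>,
    next_handle: u64,
}

impl FdCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, open: HashMap::new(), next_handle: 0 }
    }

    fn get(&mut self, path: &str) -> u64 {
        if let Some(&h) = self.open.get(path) {
            return h; // hit: sequential writes reuse the handle
        }
        if self.open.len() >= self.capacity {
            self.open.clear(); // crude eviction, for the sketch only
        }
        self.next_handle += 1;
        self.open.insert(path.to_string(), self.next_handle);
        self.next_handle
    }
}

fn main() {
    let mut cache = FdCache::new(16);
    let first = cache.get("subvol/big-file");
    // A run of Write commands against the same file hits the cache.
    assert_eq!(cache.get("subvol/big-file"), first);
    assert_ne!(cache.get("subvol/other"), first);
}
```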

Parsing

The btrfs-disk crate parses btrfs on-disk structures from raw byte buffers. It is platform-independent — it works on any OS and can be used to inspect filesystem images without a running kernel.

Reading a filesystem

The typical entry point is filesystem_open, which bootstraps from the superblock:

superblock → sys_chunk_array → chunk tree → root tree

The returned OpenFilesystem contains a BlockReader (for reading tree blocks by logical address) and a map of tree root locations. From there, tree_walk traverses any tree in BFS or DFS order, calling a visitor callback for each block:

let open = filesystem_open(file)?;
let mut reader = open.reader;
tree_walk(&mut reader, root_bytenr, Traversal::Bfs, &mut |block| {
    // block: &TreeBlock — either a Node (internal) or Leaf
    Ok(())
})?;

Item payloads

Leaf blocks contain items, each with a DiskKey (objectid, type, offset) and a raw payload. parse_item_payload dispatches to a typed parser based on the key type:

let payload = parse_item_payload(&key, data);
match payload {
    ItemPayload::InodeItem(inode) => { /* ... */ }
    ItemPayload::RootItem(root) => { /* ... */ }
    ItemPayload::FileExtentItem(extent) => { /* ... */ }
    // ...
}

Reading on-disk fields safely

On-disk structs are packed and little-endian. Casting a *const u8 pointer directly to a packed struct is undefined behaviour due to potential misalignment.

btrfs-disk: bytes::Buf / bytes::BufMut

The disk crate uses the bytes crate for all parsing and serialization. A &[u8] implements Buf, so you can read fields sequentially with methods like get_u64_le(), which advances the cursor automatically:

let mut buf = data;
let generation = buf.get_u64_le();
let size = buf.get_u64_le();
let mode = buf.get_u32_le();

For serialization, BufMut provides the inverse (put_u64_le, put_slice, etc.). This approach avoids manual offset arithmetic and makes it impossible to read past the end of the buffer (it panics instead of silently producing garbage).

btrfs-uapi: offset-based LE readers

The uapi crate parses tree search results returned by the kernel, which are raw &[u8] buffers at known offsets. It uses explicit offset-based helpers from uapi/src/util.rs:

use btrfs_uapi::util::read_le_u64;
use std::mem::offset_of;

let size = read_le_u64(data, offset_of!(raw::btrfs_inode_item, size));

Always use std::mem::offset_of! and std::mem::size_of to derive offsets and sizes from the bindgen struct definitions — never hard-code numeric byte offsets. The field_size!(T, field) macro (from crate::util) gives the size of an individual field.

Superblock mirrors

btrfs writes up to three superblock copies at fixed offsets. super_mirror_offset(n) returns the byte offset for mirror n (0, 1, or 2). read_superblock reads and validates a superblock — checking the magic number and CRC — from any seekable reader.

Display logic belongs in cli/

The disk/ crate only produces typed structs. All formatting and human-readable output lives in cli/src/inspect/. The disk/ crate never calls println! or constructs output strings.

Testing

The goal for this project is to maintain high test coverage to make sure these tools function correctly.

Running tests

Running the tests for this project is complicated by the fact that many btrfs operations talk directly to the kernel and require elevated privileges.

You can run all non-privileged tests with regular cargo test commands. This will still build the privileged tests, but they are skipped.

cargo test

In order to run privileged tests, there is a just target that builds them and then runs only the test binaries (not cargo itself) under sudo. This is the recommended way to run the full test suite on this project.

just test

You can build a coverage report (requires cargo-llvm-cov) of the full test suite similarly, using the coverage target.

just coverage
# open target/coverage/llvm-cov/html/index.html

Static checks

Before committing, run just check. This wraps the formatter check (nightly rustfmt), cargo deny, taplo for Cargo.toml formatting, cargo doc (with -Dwarnings), cargo clippy --all-features, per-libc cargo check for the host arch, the optional CLI features, and cargo msrv verify against every publishable crate’s declared rust-version.

The host-arch detection means just check works on x86_64 and aarch64 alike. The musl half (<host>-unknown-linux-musl) needs a matching C cross-compiler on PATH, since the zstd-sys and lzo-sys build scripts compile C code:

  • Nix devshell (nix develop) provides everything; you don’t need any of the steps below.

  • Fedora aarch64: dnf install musl-gcc ships musl-gcc as a thin wrapper around the host gcc plus musl specs. cc-rs looks for the target-prefixed aarch64-linux-musl-gcc name, so symlink it once:

    sudo ln -s /usr/bin/musl-gcc /usr/local/bin/aarch64-linux-musl-gcc
    

    (or set CC_aarch64_unknown_linux_musl=musl-gcc and AR_aarch64_unknown_linux_musl=ar if you prefer to avoid touching /usr/local/bin.)

  • Debian / Ubuntu: apt install musl-tools (host arch) or one of the gcc-<arch>-linux-musl-cross packages for cross builds; same target-prefix handling applies if cc-rs doesn’t pick it up automatically.

If the cross C compiler isn’t on PATH, just check prints skipping <triple> check: <prefix>-linux-musl-gcc not on PATH and keeps going — only CI is expected to fail on a missing musl toolchain.

Unit tests

Unit tests live as #[cfg(test)] mod tests blocks within the module they test. They require no privileges and run with cargo test.

Coverage spans all pure logic across the crates: LE readers, struct size assertions, tree search cursor arithmetic, stream parsing (all 22 v1 command types, CRC validation), superblock parsing, B-tree node parsing, size/time formatting, argument parsing helpers, balance filter parsing, and property classification.

When adding a new feature, add unit tests for any logic that doesn’t require a real kernel or filesystem.

Integration tests

Integration tests live in uapi/tests/ and cli/tests/commands/ and are marked:

#[ignore = "requires elevated privileges"]

They are skipped by cargo test and run only via just test.

Fixture tests (commands/fixture.rs)

Read-only snapshot tests against a pre-built filesystem image (cli/tests/commands/fixture.img.gz). The image has a fixed UUID, label, and subvolume layout, so output is fully deterministic. These tests cover all read-only commands: filesystem df/show/usage/label/du, subvolume list/show, device stats/usage, all inspect-internal commands, and property get/list.

dump-tree and dump-super tests read the image file directly and do not require mounting, so they run without elevated privileges even within the privileged test suite.

Live tests (commands/live.rs)

Tests that create and mutate real btrfs filesystems on loopback devices. These cover all mutating commands: subvolume create/delete/snapshot, send/receive, scrub, balance, device add/remove, quota, qgroup, label set, resize, defrag, replace, and more.

Test helpers

cli/tests/common.rs provides RAII helpers that clean up automatically on drop:

BackingFile → LoopbackDevice → Mount

Convenience functions:

Function                   Description
single_mount()             512 MiB single-device filesystem in a tempdir
deterministic_mount()      Same, with a fixed UUID and label
fixture_mount()            Mounts the pre-built fixture image read-only
write_test_data(path, n)   Write deterministic byte-pattern files
verify_test_data(path, n)  Verify previously written test data

Snapshot testing with insta

CLI output tests use insta for snapshot testing. Snapshots live in cli/tests/snapshots/ and are checked in to the repository.

Four snapshot categories:

Pattern                    Privileges  Description
arguments__*.snap          none        Argument parsing output
help__*.snap               none        Help text for every subcommand
commands__fixture__*.snap  root        Read-only CLI output (fixture image)
commands__live__*.snap     root        CLI output from live filesystem tests

Snapshot workflow

# Run tests; fails if any snapshot has changed:
cargo test

# Run tests and collect pending snapshot changes:
cargo insta test

# Interactively review each changed snapshot:
cargo insta review

# Accept all pending changes at once:
cargo insta accept --all

After running privileged tests via just test, the Justfile fixes ownership of any root-owned snapshot files and sets INSTA_WORKSPACE_ROOT so snapshots land in the right directory.

Adding tests for a new subcommand

  1. Argument parsing: add cases to cli/tests/arguments.rs following the existing pattern.
  2. Help text: cli/tests/help.rs auto-discovers all subcommands by walking the clap tree — no changes needed.
  3. Read-only output: if the fixture image has suitable content, add snapshot tests to commands/fixture.rs.
  4. Mutating commands: add tests to commands/live.rs using the RAII helpers.

Use the snap!("description", output) macro for snapshot tests — the description appears in the snapshot file header.

Conventions

The goal is to write idiomatic Rust code that is consistent across the whole codebase. btrfsutils spans several crates with different roles (kernel interface wrappers, on-disk parsers, CLI tools) and each has its own patterns. Following these conventions makes it easier to navigate unfamiliar code and to understand what a function or type is responsible for at a glance.

Where possible, lean on the Rust ecosystem rather than reinventing things: uuid for UUIDs, bitflags for flag sets, nix for syscalls and ioctls, anyhow for error context in the CLI. This keeps the code readable to anyone already familiar with those crates.

Naming

Module names are usually generic nouns. For example, in the uapi crate, the ioctl call wrappers are organized by the thing they operate on, and live in modules like filesystem, device, sync.

For the btrfs-cli crate, the module structure mirrors the subcommand hierarchy: the btrfs subvolume create command is implemented in cli/src/subvolume/create.rs.

Types are named with the general concept first: SysfsBtrfs, BlockGroupFlags, BalanceArgs — never BtrfsSysfs.

Functions follow a noun_verb pattern: label_get, label_set — never get_label. Ioctl wrapper functions match the lowercased C macro name: btrfs_ioc_balance_v2.

Avoid abbreviations. For example, use ChecksumType instead of CsumType.

Types

Always prefer proper typed values. For example, use Uuid from the uuid crate, never [u8; 16]. In the CLI, if there is an argument that can take one of multiple options, don’t represent it as a string, but instead create an enum and derive clap::ValueEnum.

Null-terminated kernel strings (labels, device paths) use CString/CStr. Make sure that allocation and deallocation are handled properly.

File descriptors passed to uapi functions use BorrowedFd.

Kernel flag fields use bitflags!, usually with a Display implementation so they can be formatted with {}.

Complex argument structs (BalanceArgs, DefragRangeArgs) use the builder pattern with new(), chained setters, and Default.

Never expose bindgen types (btrfs_ioctl_*) in public uapi APIs, instead create idiomatic Rust structs.

Error handling

In uapi/, almost every function just performs a single syscall, so we return the raw nix::Result<T>. Where possible, list potential error codes and their meanings in the documentation comments.

Map specific errnos to Option or a typed error at the call site where appropriate (e.g. ENODEV → None).

In cli/, mkfs/, and tune/, use anyhow::Result<T> and convert at the uapi boundary with .with_context(). Always include the relevant path or resource in the error message.

Constants

All BTRFS_* constants are available via crate::raw::* in the uapi and disk crates. Unless you have a good reason to, import from crate::raw and don’t define local copies. Size constants like SZ_1M that are not part of the btrfs UAPI headers are the exception; define those locally with a comment.

There should not be any stray constants in the code. For example, use the std::mem::offset_of!() macro or std::mem::size_of::<T>() to compute offsets and sizes, and give any remaining magic constants a name.

Don’t redefine things that are already defined in crate::raw::*.

Parsing on-disk structures

In disk/ and mkfs/, use bytes::Buf for reading and bytes::BufMut for writing on-disk fields. Sequential get_u64_le() / put_u64_le() calls advance the cursor automatically, eliminating manual offset arithmetic. See the Parsing page for details.

In uapi/, tree search results are parsed with explicit offset-based LE readers (read_le_u64, read_le_u32) from uapi/src/util.rs, since those buffers are accessed at known offsets rather than sequentially.

Style

Keep unsafe blocks as small as possible; non-trivial ones get a // SAFETY: comment. For packed structs, copy fields to locals before taking references to avoid misaligned reference UB. Use escape_ascii() when printing byte strings that may be non-UTF-8. Import symbols used more than once rather than qualifying them at every call site (single-use qualified paths are fine).

Shared CLI helpers live in cli/src/util.rs; these include utilities to format sizes, bytes, and times, and to parse various types.

Doc comments

In uapi/, module-level docs start with a # heading describing the module’s purpose. Function docs explain what the function does and why; the ioctl name is a parenthetical in the implementation, not the primary description.

In cli/, don’t put doc comments on subcommand enum variants — clap uses the variant doc in preference to the struct doc, forcing duplication. Don’t use Markdown in clap struct doc comments: wrap_help reflows all text and destroys formatting. Use plain prose paragraphs instead.

Btrfs On-Disk Format Specification

This document describes the binary layout of btrfs on-disk structures as understood from the parser in disk/src/ and the serializer in mkfs/src/. All multi-byte integer fields are little-endian. All byte offsets in this document are zero-based unless noted otherwise.

Kernel header names are referenced in parentheses where helpful (e.g. btrfs_super_block, btrfs_header). The authoritative source is the Linux kernel UAPI headers btrfs.h and btrfs_tree.h.

Conventions used in this document:

  • “LE u64” means a 64-bit unsigned integer stored in little-endian byte order.
  • Byte offsets are from the start of the enclosing structure.
  • Field sizes are in bytes unless noted otherwise.
  • “Logical address” refers to an address in btrfs’s virtual address space, which must be resolved to a physical device offset via the chunk tree.
  • “Physical address” refers to a byte offset on a specific block device.

Overview

Btrfs is a copy-on-write (COW) B-tree filesystem. All persistent data is organized into B-trees, and all B-trees share a single logical address space that is mapped to physical device locations through a chunk/stripe layer.

Architecture: trees within trees

The fundamental architecture is “trees within trees”:

  • The superblock (at fixed offsets on disk) bootstraps access to the chunk tree and root tree.
  • The chunk tree maps logical addresses to physical device locations. A small subset of the chunk tree is embedded in the superblock to bootstrap access to the full tree.
  • The root tree is the directory of all other trees: it contains a ROOT_ITEM for each tree, pointing to that tree’s root block.
  • Content trees (FS tree, extent tree, checksum tree, etc.) store the actual filesystem data and metadata.

Copy-on-write semantics

Every modification creates new copies of affected blocks (COW), from the modified leaf up through the root of the tree. The final step atomically updates the superblock to point to the new root tree root. This ensures crash consistency without a journal: at any point, the last successfully written superblock points to a fully consistent tree hierarchy.

The COW property means that tree blocks are never modified in place. Instead:

  1. The leaf containing the modified item is written to a new location.
  2. The parent node’s key-pointer is updated to reference the new leaf, and the parent is written to a new location.
  3. This propagates up to the tree root.
  4. The root tree’s ROOT_ITEM is updated with the new root block address.
  5. The root tree itself is COWed up to its root.
  6. The superblock is written with the new root tree root address.

The generation counter is incremented with each transaction. All blocks written in a transaction share the same generation number.

Shared format

All trees share the same block format (header + items or key-pointers) and the same key structure (objectid, type, offset). The block size (nodesize) is uniform across the filesystem, typically 16384 bytes. The sectorsize (typically 4096 bytes) is the minimum I/O unit for data.

Multi-device support

Btrfs supports multiple devices in a single filesystem. The chunk tree maps logical addresses to physical offsets on specific devices. RAID profiles (SINGLE, DUP, RAID0, RAID1, RAID5, RAID6, RAID10, RAID1C3, RAID1C4) determine how chunks are distributed across devices.

Bootstrap sequence

Reading a btrfs filesystem from a raw device follows this sequence:

  1. Read the superblock at offset 64 KiB (try mirrors if primary fails).
  2. Parse sys_chunk_array from the superblock to seed the chunk cache with system chunk mappings.
  3. Resolve chunk_root through the chunk cache to a physical address.
  4. Read the chunk tree root block and all chunk items to populate the full chunk cache.
  5. Resolve root (root tree root) through the chunk cache.
  6. Read the root tree to discover all other trees via ROOT_ITEM entries.
  7. Access any tree by resolving its root block address through the chunk cache.

Superblock

The superblock (btrfs_super_block) is a 4096-byte structure stored at fixed offsets on each device. It is the entry point for reading the filesystem.

Mirror locations

Three copies (mirrors) of the superblock are maintained:

Mirror  Offset        Decimal
0       0x10000       65536 (64 KiB)
1       0x4000000     67108864 (64 MiB)
2       0x4000000000  274877906944 (256 GiB)

Mirror 0 is always present. Mirrors 1 and 2 are written only if the device is large enough. The offsets are computed as:

mirror 0:  64 KiB
mirror i:  16 KiB << (12 * i)    for i > 0

On read, all mirrors present on the device are checked and the one with the highest valid generation is used.

Binary layout

Field                      Offset  Size  Notes
csum                       0       32    Checksum of bytes 32..4095
fsid                       32      16    Filesystem UUID (shared across devices)
bytenr                     48      8     Physical offset of this superblock copy
flags                      56      8     BTRFS_SUPER_FLAG_* flags
magic                      64      8     0x4D5F53665248425F (_BHRfS_M LE)
generation                 72      8     Transaction generation counter
root                       80      8     Logical bytenr of root tree root
chunk_root                 88      8     Logical bytenr of chunk tree root
log_root                   96      8     Logical bytenr of log tree root (0 if none)
__unused_log_root_transid  104     8     Reserved, formerly log_root_transid
total_bytes                112     8     Total usable bytes across all devices
bytes_used                 120     8     Total bytes used by data and metadata
root_dir_objectid          128     8     Objectid of root directory (always 6)
num_devices                136     8     Number of devices in this filesystem
sectorsize                 144     4     Minimum I/O alignment (typically 4096)
nodesize                   148     4     Tree block size in bytes (typically 16384)
__unused_leafsize          152     4     Legacy field, equal to nodesize
stripesize                 156     4     Stripe size for RAID (typically 65536)
sys_chunk_array_size       160     4     Valid bytes in sys_chunk_array
chunk_root_generation      164     8     Generation of the chunk tree root
compat_flags               172     8     Compatible feature flags
compat_ro_flags            180     8     Compatible read-only feature flags
incompat_flags             188     8     Incompatible feature flags
csum_type                  196     2     Checksum algorithm (0=CRC32C, 1=xxHash, 2=SHA256, 3=BLAKE2)
root_level                 198     1     B-tree level of root tree root
chunk_root_level           199     1     B-tree level of chunk tree root
log_root_level             200     1     B-tree level of log tree root
dev_item                   201     98    Embedded btrfs_dev_item for this device
label                      299     256   Filesystem label (NUL-terminated, max 255 chars)
cache_generation           555     8     Generation of free space cache (v1)
uuid_tree_generation       563     8     Generation of UUID tree
metadata_uuid              571     16    Metadata UUID (when METADATA_UUID incompat set)
nr_global_roots            587     8     Number of global roots (extent-tree-v2)
(reserved fields)          595     216   Zero-filled up to sys_chunk_array (u64[27])
sys_chunk_array            811     2048  Bootstrap chunk items
super_roots[4]             2859    672   Four rotating backup root entries (168 bytes each)
(padding)                  3531    565   Zero-filled to 4096 bytes

Total: 4096 bytes (BTRFS_SUPER_INFO_SIZE).

System chunk array bootstrap

The sys_chunk_array field embeds a subset of the chunk tree sufficient to locate the full chunk tree on disk. It contains a sequence of (disk_key, chunk_item) pairs:

For each entry:
  17 bytes   btrfs_disk_key     (objectid, type, offset) -- offset = logical addr
  variable   btrfs_chunk        Chunk item (see Section 8.9)

The array is parsed sequentially until sys_chunk_array_size bytes are consumed. These entries typically contain the SYSTEM chunk(s) that map the chunk tree and root tree blocks.

Backup roots

The super_roots array contains four rotating backup copies of critical tree root pointers. The kernel updates one entry per transaction, cycling through indices 0-3. Each backup root entry (btrfs_root_backup) is 168 bytes:

Field              Offset  Size  Notes
tree_root          0       8     Logical bytenr of root tree root
tree_root_gen      8       8     Generation of root tree root
chunk_root         16      8     Logical bytenr of chunk tree root
chunk_root_gen     24      8     Generation of chunk tree root
extent_root        32      8     Logical bytenr of extent tree root
extent_root_gen    40      8     Generation of extent tree root
fs_root            48      8     Logical bytenr of FS tree root
fs_root_gen        56      8     Generation of FS tree root
dev_root           64      8     Logical bytenr of device tree root
dev_root_gen       72      8     Generation of device tree root
csum_root          80      8     Logical bytenr of checksum tree root
csum_root_gen      88      8     Generation of checksum tree root
total_bytes        96      8     Total filesystem bytes at backup time
bytes_used         104     8     Bytes used at backup time
num_devices        112     8     Number of devices at backup time
(reserved)         120     32    Unused u64[4]
tree_root_level    152     1     B-tree level of root tree root
chunk_root_level   153     1     B-tree level of chunk tree root
extent_root_level  154     1     B-tree level of extent tree root
fs_root_level      155     1     B-tree level of FS tree root
dev_root_level     156     1     B-tree level of device tree root
csum_root_level    157     1     B-tree level of checksum tree root
(padding)          158     10    Unused bytes to 168 total

Superblock checksum

The checksum field (csum, bytes 0..31) covers everything from byte 32 through byte 4095 (inclusive). For CRC32C, the 4-byte result is stored little-endian at bytes 0..3 and bytes 4..31 are zeroed.

The magic number _BHRfS_M (hex 0x4D5F53665248425F) must be present at offset 64 for a valid superblock.

Superblock validity is determined by checking both magic and checksum match. When multiple valid mirrors exist, the one with the highest generation is used.

Tree Block Format

Every B-tree block (node or leaf) is exactly nodesize bytes. The block begins with a 101-byte header (btrfs_header), followed by either item descriptors (leaves) or key-pointer entries (nodes).

Field            Offset  Size  Notes
csum             0       32    Checksum of bytes 32..nodesize-1
fsid             32      16    Filesystem UUID (must match superblock)
bytenr           48      8     Logical byte offset of this block
flags            56      8     Header flags (lower 56 bits) + backref rev (upper 8 bits)
chunk_tree_uuid  64      16    UUID of the chunk tree mapping this block
generation       80      8     Transaction generation when last written
owner            88      8     Objectid of the tree owning this block
nritems          96      4     Number of items (leaf) or key-pointers (node)
level            100     1     0 = leaf, >0 = internal node

Total header size: 101 bytes.

The flags field combines two values:

  • Bits 0-55: block flags (BTRFS_HEADER_FLAG_WRITTEN = 1, BTRFS_HEADER_FLAG_RELOC = 2)
  • Bits 56-63: backref revision (BTRFS_MIXED_BACKREF_REV = 1 for modern filesystems)

The header checksum covers bytes 32 through nodesize - 1. For CRC32C, the result is stored as a 4-byte LE value at bytes 0..3 with bytes 4..31 zeroed.

Leaf vs node distinction

The level field determines the block type:

  • level == 0: leaf block, containing items
  • level > 0: internal node, containing key-pointers to child blocks

The maximum tree depth is bounded by the number of key-pointers that fit in a node. For a 16 KiB nodesize, a node holds up to:

max_ptrs = (nodesize - HEADER_SIZE) / KEY_PTR_SIZE
         = (16384 - 101) / 33
         = 493 key-pointers

With 493 children per node, a tree of depth 2 (root node + leaf) can hold 493 * 651 = ~320,000 items. A tree of depth 3 can hold 493^2 * 651 = ~158 million items. In practice, trees rarely exceed depth 3 or 4.

Leaf Format

A leaf block (level 0) contains sorted item descriptors followed by a data area. Item descriptors grow forward from the header; item data grows backward from the end of the block.

+-------------------------------------------+
| Header (101 bytes)                        |
+-------------------------------------------+
| Item descriptor 0  (25 bytes)             |
| Item descriptor 1  (25 bytes)             |
| ...                                       |
| Item descriptor N-1 (25 bytes)            |
+-------------------------------------------+
| (free space)                              |
+-------------------------------------------+
| Item data N-1                             |
| ...                                       |
| Item data 1                               |
| Item data 0                               |
+-------------------------------------------+

Item descriptor

Each item descriptor (btrfs_item) is 25 bytes:

Field        Offset  Size  Notes
objectid     0       8     Key objectid (LE u64)
type         8       1     Key type byte (u8)
offset       9       8     Key offset (LE u64)
data_offset  17      4     Byte offset of item data from end of header (LE u32)
data_size    21      4     Size of item data in bytes (LE u32)

The first 17 bytes form a btrfs_disk_key. The data_offset field is relative to the start of the leaf data area, which begins immediately after the header. To locate item data in the raw block buffer:

absolute_offset = HEADER_SIZE + data_offset

where HEADER_SIZE = 101 bytes.

Data area layout

Item data is packed from the end of the block backward. The first item pushed has its data at the highest offset; subsequent items have data at progressively lower offsets. This means:

  • Item descriptors grow forward: HEADER_SIZE + i * 25
  • Item data grows backward: starting from nodesize and moving toward the descriptor area

The free space in a leaf is the gap between the end of the last descriptor and the start of the earliest (lowest-offset) item data.

Offset bookkeeping

When building a leaf (as the mkfs LeafBuilder does), the bookkeeping works as follows:

Initial state:
  item_offset = HEADER_SIZE (101)    // next descriptor position
  data_end    = nodesize (16384)     // next data write position

For each item pushed (key, data[N bytes]):
  1. data_end -= N                   // reserve space for item data
  2. Write data at buf[data_end .. data_end + N]
  3. data_offset = data_end - HEADER_SIZE   // relative to header end
  4. Write descriptor at buf[item_offset]:
       key (17 bytes) + data_offset (LE u32) + data_size (LE u32)
  5. item_offset += 25               // advance to next descriptor slot

The available space for additional items is:

space_left = data_end - (item_offset + ITEM_SIZE)

This must accommodate both the 25-byte descriptor and the item data.

Key ordering invariant

Items within a leaf are sorted by their keys in lexicographic order: first by objectid, then by type, then by offset. This invariant is maintained by the B-tree insertion logic and verified by btrfs check.

Capacity

For a 16384-byte leaf, the maximum number of items depends on their data sizes. With zero-length data items (such as TREE_BLOCK_REF or FREE_SPACE_EXTENT), the theoretical maximum is:

max_items = (nodesize - HEADER_SIZE) / ITEM_SIZE
          = (16384 - 101) / 25
          = 651 items

In practice, most items have data payloads that reduce this number significantly.

Node Format

An internal node (level > 0) contains sorted key-pointer entries (btrfs_key_ptr). Each entry points to a child block and records the lowest key in that child’s subtree.

Key-pointer entry

Each key-pointer (btrfs_key_ptr) is 33 bytes:

Field       Offset  Size  Notes
objectid    0       8     Key objectid (LE u64)
type        8       1     Key type byte (u8)
offset      9       8     Key offset (LE u64)
blockptr    17      8     Logical byte address of child block (LE u64)
generation  25      8     Generation of the child block (LE u64)

The first 17 bytes form the btrfs_disk_key representing the lowest key in the child subtree. The generation field is used for consistency checks: when reading the child block, its header generation must match this value.

Layout

+-------------------------------------------+
| Header (101 bytes)                        |
+-------------------------------------------+
| Key-pointer 0  (33 bytes)                 |
| Key-pointer 1  (33 bytes)                 |
| ...                                       |
| Key-pointer N-1 (33 bytes)                |
+-------------------------------------------+
| (unused space to nodesize)                |
+-------------------------------------------+

Key-pointers are sorted by their key in the same lexicographic order as leaf items. The child block referenced by key-pointer i contains all items with keys >= key-pointer[i].key and < key-pointer[i+1].key (or unbounded above for the last pointer).

Key Structure

Every item and key-pointer is addressed by a three-part key (btrfs_disk_key):

Field     Offset  Size  Notes
objectid  0       8     LE u64
type      8       1     u8
offset    9       8     LE u64

Total: 17 bytes.

Lexicographic ordering

Keys are compared as a tuple (objectid, type, offset) in that order. The objectid is compared first; on a tie, type is compared; on a further tie, offset breaks the tie. All comparisons are unsigned integer comparisons.

Field semantics by tree

The meaning of the three key fields varies depending on the tree and item type:

FS tree:

  • objectid = inode number (starting at 256 = BTRFS_FIRST_FREE_OBJECTID)
  • type = item type (INODE_ITEM, DIR_ITEM, EXTENT_DATA, etc.)
  • offset = type-dependent (0 for INODE_ITEM, name hash for DIR_ITEM, file byte offset for EXTENT_DATA, parent inode for INODE_REF, etc.)

Root tree:

  • objectid = tree objectid (e.g. 5 for FS_TREE, 256+ for subvolumes)
  • type = ROOT_ITEM, ROOT_REF, or ROOT_BACKREF
  • offset = 0 for ROOT_ITEM, child/parent tree ID for refs

Extent tree:

  • objectid = logical byte address of the extent
  • type = EXTENT_ITEM, METADATA_ITEM, or backref type
  • offset = extent length (EXTENT_ITEM), level (METADATA_ITEM), or backref-specific (root objectid, parent bytenr, hash)

Chunk tree:

  • objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID (256) for CHUNK_ITEM
  • type = CHUNK_ITEM
  • offset = logical byte address of the chunk

Device tree:

  • objectid = device ID for DEV_EXTENT; BTRFS_DEV_ITEMS_OBJECTID (1) for DEV_ITEM
  • type = DEV_EXTENT or DEV_ITEM
  • offset = physical offset for DEV_EXTENT; device ID for DEV_ITEM

Checksum tree:

  • objectid = BTRFS_EXTENT_CSUM_OBJECTID
  • type = EXTENT_CSUM
  • offset = logical byte address of the first checksummed sector

Free space tree:

  • objectid = block group logical offset (for FREE_SPACE_INFO) or extent start (for FREE_SPACE_EXTENT/BITMAP)
  • type = FREE_SPACE_INFO, FREE_SPACE_EXTENT, or FREE_SPACE_BITMAP
  • offset = block group length (for INFO) or extent length (for EXTENT/BITMAP)

UUID tree:

  • objectid = upper 8 bytes of UUID interpreted as LE u64
  • type = UUID_KEY_SUBVOL or UUID_KEY_RECEIVED_SUBVOL
  • offset = lower 8 bytes of UUID interpreted as LE u64

Quota tree:

  • objectid = packed qgroupid (level << 48) | subvolid
  • type = QGROUP_STATUS, QGROUP_INFO, QGROUP_LIMIT, QGROUP_RELATION
  • offset = packed qgroupid for relations, 0 otherwise

Key type values

Value  Name                      Description
1      INODE_ITEM_KEY            Inode metadata (mode, size, timestamps, nlink)
12     INODE_REF_KEY             Link from inode to parent directory (name + index)
13     INODE_EXTREF_KEY          Extended inode ref for names exceeding INODE_REF capacity
24     XATTR_ITEM_KEY            Extended attribute (name + value, keyed by name hash)
36     VERITY_DESC_ITEM_KEY      fs-verity descriptor
37     VERITY_MERKLE_ITEM_KEY    fs-verity Merkle tree data
48     ORPHAN_ITEM_KEY           Orphan inode pending cleanup
60     DIR_LOG_ITEM_KEY          Directory log for fsync optimization
72     DIR_LOG_INDEX_KEY         Directory log index
84     DIR_ITEM_KEY              Directory entry keyed by crc32c(name) hash
96     DIR_INDEX_KEY             Directory entry keyed by sequential index
108    EXTENT_DATA_KEY           File extent (inline data or reference to disk extent)
128    EXTENT_CSUM_KEY           Data checksum covering one or more sectors
132    ROOT_ITEM_KEY             Tree root descriptor (bytenr, generation, UUID, timestamps)
144    ROOT_BACKREF_KEY          Backref from child subvolume to parent
156    ROOT_REF_KEY              Forward ref from parent subvolume to child
168    EXTENT_ITEM_KEY           Extent allocation with backrefs (non-skinny: offset = size)
169    METADATA_ITEM_KEY         Skinny metadata extent (offset = level, not size)
172    EXTENT_OWNER_REF_KEY      Simple quota owner backref
176    TREE_BLOCK_REF_KEY        Standalone backref: metadata extent → owning tree
178    EXTENT_DATA_REF_KEY       Standalone backref: data extent → (root, ino, offset)
182    SHARED_BLOCK_REF_KEY      Shared metadata backref (parent block address)
184    SHARED_DATA_REF_KEY       Shared data backref (parent block address + count)
192    BLOCK_GROUP_ITEM_KEY      Block group allocation info (used bytes, type, profile)
198    FREE_SPACE_INFO_KEY       Free space tree: per-block-group metadata
199    FREE_SPACE_EXTENT_KEY     Free space tree: free extent range
200    FREE_SPACE_BITMAP_KEY     Free space tree: bitmap of free sectors
204    DEV_EXTENT_KEY            Physical extent allocated to a chunk on a device
216    DEV_ITEM_KEY              Device descriptor (size, UUID, I/O parameters)
228    CHUNK_ITEM_KEY            Chunk mapping logical → physical with stripe info
230    RAID_STRIPE_KEY           RAID stripe tree entry (zoned devices)
240    QGROUP_STATUS_KEY         Quota group global status and generation
242    QGROUP_INFO_KEY           Per-qgroup usage counters (referenced, exclusive)
244    QGROUP_LIMIT_KEY          Per-qgroup size limits
246    QGROUP_RELATION_KEY       Parent-child relationship between qgroups
248    TEMPORARY_ITEM_KEY        Transient item; also used as BALANCE_ITEM_KEY
249    PERSISTENT_ITEM_KEY       Persistent metadata; also used as DEV_STATS_KEY
250    DEV_REPLACE_KEY           Device replace operation state
251    UUID_KEY_SUBVOL           UUID tree: maps subvolume UUID → subvolume ID
252    UUID_KEY_RECEIVED_SUBVOL  UUID tree: maps received UUID → subvolume ID
253    STRING_ITEM_KEY           Label or other string metadata

Well-known objectid values

| Value | Name | Notes |
|---|---|---|
| 1 | ROOT_TREE_OBJECTID | Root tree |
| 2 | EXTENT_TREE_OBJECTID | Extent tree |
| 3 | CHUNK_TREE_OBJECTID | Chunk tree |
| 4 | DEV_TREE_OBJECTID | Device tree |
| 5 | FS_TREE_OBJECTID | Default FS tree |
| 6 | ROOT_TREE_DIR_OBJECTID | Root tree directory |
| 7 | CSUM_TREE_OBJECTID | Checksum tree |
| 8 | QUOTA_TREE_OBJECTID | Quota tree |
| 9 | UUID_TREE_OBJECTID | UUID tree |
| 10 | FREE_SPACE_TREE_OBJECTID | Free space tree |
| 11 | BLOCK_GROUP_TREE_OBJECTID | Block group tree |
| 12 | RAID_STRIPE_TREE_OBJECTID | RAID stripe tree |
| 256 | FIRST_FREE_OBJECTID | First user inode / first subvolume ID |
| (u64)-4 | BALANCE_OBJECTID | Balance status |
| (u64)-5 | ORPHAN_OBJECTID | Orphan items |
| (u64)-6 | TREE_LOG_OBJECTID | Tree log |
| (u64)-7 | TREE_LOG_FIXUP_OBJECTID | Tree log fixup |
| (u64)-8 | TREE_RELOC_OBJECTID | Tree relocation |
| (u64)-9 | DATA_RELOC_TREE_OBJECTID | Data relocation tree |
| (u64)-10 | EXTENT_CSUM_OBJECTID | Extent checksums |
| (u64)-11 | FREE_SPACE_OBJECTID | Free space cache (v1) |
| (u64)-12 | FREE_INO_OBJECTID | Free inode number tracking |
| (u64)-255 | MULTIPLE_OBJECTIDS | Multiple-owner sentinel |

Negative objectids are stored as their unsigned 64-bit two’s complement representation. For example, BALANCE_OBJECTID = -4 is stored as 0xFFFFFFFF_FFFFFFFC.
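
The stored value is simply the signed constant cast to u64; a minimal Rust sketch (the function name is illustrative):

```rust
/// Special objectids are negative i64 constants stored as their
/// unsigned two's complement representation.
fn special_objectid(v: i64) -> u64 {
    v as u64
}

fn main() {
    // BALANCE_OBJECTID = -4
    assert_eq!(special_objectid(-4), 0xFFFF_FFFF_FFFF_FFFC);
    // MULTIPLE_OBJECTIDS = -255
    assert_eq!(special_objectid(-255), 0xFFFF_FFFF_FFFF_FF01);
}
```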

Trees

Btrfs uses multiple B-trees, each identified by a well-known objectid. The root tree stores a ROOT_ITEM for each tree, pointing to its root block.

Root tree (objectid 1)

The directory of all other trees. Contains:

  • ROOT_ITEM for each tree (objectid = tree ID, type = ROOT_ITEM, offset = 0)
  • ROOT_REF for parent-to-child subvolume links
  • ROOT_BACKREF for child-to-parent subvolume links
  • ROOT_TREE_DIR directory entry linking to the default subvolume
  • TEMPORARY_ITEM for balance status persistence
  • PERSISTENT_ITEM for device statistics and replace status

Extent tree (objectid 2)

Tracks all allocated space (data extents and metadata tree blocks) with reference counting and backreferences. Contains:

  • EXTENT_ITEM for data and non-skinny metadata extents
  • METADATA_ITEM for skinny metadata extents
  • TREE_BLOCK_REF for direct metadata backrefs
  • SHARED_BLOCK_REF for shared metadata backrefs (snapshots)
  • EXTENT_DATA_REF for direct data backrefs
  • SHARED_DATA_REF for shared data backrefs (snapshots)
  • BLOCK_GROUP_ITEM for each block group (unless block_group_tree feature)

Chunk tree (objectid 3)

Maps logical address ranges to physical device stripes. Contains:

  • CHUNK_ITEM for each chunk (logical-to-physical mapping)
  • DEV_ITEM for each device

The chunk tree is bootstrapped from the superblock’s sys_chunk_array.

Device tree (objectid 4)

Tracks per-device physical extent allocations. Contains:

  • DEV_EXTENT for each allocated physical range on each device

FS tree (objectid 5, 256+)

Holds the filesystem content for a subvolume. The default subvolume uses objectid 5; additional subvolumes and snapshots use objectids starting at 256. Contains:

  • INODE_ITEM for each inode
  • INODE_REF / INODE_EXTREF for hard links
  • DIR_ITEM for directory entries (keyed by name hash)
  • DIR_INDEX for directory entries (keyed by sequence number)
  • EXTENT_DATA for file extent descriptors
  • XATTR_ITEM for extended attributes
  • ORPHAN_ITEM for unlinked but still open inodes

Checksum tree (objectid 7)

Stores per-sector data checksums. Contains:

  • EXTENT_CSUM items: each item covers a contiguous range of data sectors, storing an array of per-sector checksums

Quota tree (objectid 8)

Tracks quota group accounting. Contains:

  • QGROUP_STATUS (one per filesystem)
  • QGROUP_INFO for each qgroup
  • QGROUP_LIMIT for each qgroup with limits
  • QGROUP_RELATION for parent-child qgroup relationships

UUID tree (objectid 9)

Provides fast UUID-to-subvolume lookups for send/receive. Contains:

  • UUID_KEY_SUBVOL mapping subvolume UUID to objectid
  • UUID_KEY_RECEIVED_SUBVOL mapping received UUID to objectid

Free space tree (objectid 10)

Tracks free space per block group, replacing the older free space cache (v1). Contains:

  • FREE_SPACE_INFO for each block group
  • FREE_SPACE_EXTENT for free ranges
  • FREE_SPACE_BITMAP for bitmap-tracked regions

Requires the free_space_tree compat_ro feature flag.

Block group tree (objectid 11)

Separates block group items from the extent tree for faster mount times. Contains:

  • BLOCK_GROUP_ITEM for each block group

Requires the block_group_tree compat_ro feature flag. When this tree is absent, block group items live in the extent tree.

Data relocation tree (objectid (u64)-9)

A temporary FS tree used during balance to hold relocated data extents. Uses the same item types as a regular FS tree.

RAID stripe tree (objectid 12)

Maps logical extents to per-device physical stripe offsets. Contains:

  • RAID_STRIPE items

Requires the raid_stripe_tree incompat feature flag.

Item Types

This section documents the key format and payload layout for each major item type.

INODE_ITEM (type 1)

Key: (inode_number, INODE_ITEM, 0)

Exactly one per inode. Stores POSIX attributes, timestamps, and btrfs-specific flags.

Payload (btrfs_inode_item, 160 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Generation when created |
| transid | 8 | 8 | Transaction ID of last modification |
| size | 16 | 8 | Logical file size in bytes |
| nbytes | 24 | 8 | On-disk bytes used (all copies) |
| block_group | 32 | 8 | Block group hint for new allocations |
| nlink | 40 | 4 | Hard link count |
| uid | 44 | 4 | Owner user ID |
| gid | 48 | 4 | Owner group ID |
| mode | 52 | 4 | POSIX file mode (type + permissions) |
| rdev | 56 | 8 | Device number (char/block device inodes) |
| flags | 64 | 8 | Inode flags (see below) |
| sequence | 72 | 8 | NFS-compatible change sequence number |
| reserved | 80 | 32 | Reserved u64[4], must be zero |
| atime | 112 | 12 | Access time (btrfs_timespec) |
| ctime | 124 | 12 | Change time (btrfs_timespec) |
| mtime | 136 | 12 | Modification time (btrfs_timespec) |
| otime | 148 | 12 | Creation time (btrfs_timespec) |

Each btrfs_timespec is 12 bytes:

| Field | Offset | Size | Notes |
|---|---|---|---|
| sec | 0 | 8 | Seconds since Unix epoch (LE u64) |
| nsec | 8 | 4 | Nanosecond component, 0..999999999 (LE u32) |

Inode flags (bitmask):

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | NODATASUM |
| 1 | 0x2 | NODATACOW |
| 2 | 0x4 | READONLY |
| 3 | 0x8 | NOCOMPRESS |
| 4 | 0x10 | PREALLOC |
| 5 | 0x20 | SYNC |
| 6 | 0x40 | IMMUTABLE |
| 7 | 0x80 | APPEND |
| 8 | 0x100 | NODUMP |
| 9 | 0x200 | NOATIME |
| 10 | 0x400 | DIRSYNC |
| 11 | 0x800 | COMPRESS |
| 20 | 0x100000 | ROOT_ITEM_INIT |

INODE_REF (type 12)

Key: (inode_number, INODE_REF, parent_dir_inode)

Hard-link reference from an inode to a directory entry. Multiple refs can be packed into a single item when an inode has several hard links in the same parent directory.

Payload (variable, packed sequence of entries):

For each ref:

| Field | Offset | Size | Notes |
|---|---|---|---|
| index | 0 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 8 | 2 | Length of name in bytes (LE u16) |
| name | 10 | name_len | Filename bytes (no NUL terminator) |

Multiple refs are concatenated without padding.
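
The packed layout above can be decoded with a simple loop; a sketch (error handling omitted, so malformed input would panic rather than return an error):

```rust
/// One decoded INODE_REF entry, per the field layout above.
#[derive(Debug, PartialEq)]
struct InodeRef {
    index: u64,
    name: Vec<u8>,
}

/// Parse a packed sequence of INODE_REF entries from item data.
/// Illustrative sketch; real code should bounds-check and return errors.
fn parse_inode_refs(mut data: &[u8]) -> Vec<InodeRef> {
    let mut refs = Vec::new();
    while data.len() >= 10 {
        let index = u64::from_le_bytes(data[0..8].try_into().unwrap());
        let name_len = u16::from_le_bytes(data[8..10].try_into().unwrap()) as usize;
        let name = data[10..10 + name_len].to_vec();
        refs.push(InodeRef { index, name });
        data = &data[10 + name_len..]; // entries are packed without padding
    }
    refs
}

fn main() {
    // Two refs packed back to back: ("a", index 2) then ("bc", index 3).
    let mut item = Vec::new();
    item.extend_from_slice(&2u64.to_le_bytes());
    item.extend_from_slice(&1u16.to_le_bytes());
    item.extend_from_slice(b"a");
    item.extend_from_slice(&3u64.to_le_bytes());
    item.extend_from_slice(&2u16.to_le_bytes());
    item.extend_from_slice(b"bc");

    let refs = parse_inode_refs(&item);
    assert_eq!(refs.len(), 2);
    assert_eq!(refs[0].index, 2);
    assert_eq!(refs[1].name, b"bc");
}
```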

INODE_EXTREF (type 13)

Key: (inode_number, INODE_EXTREF, crc32c(parent_ino, name))

Extended inode reference. Unlike INODE_REF, the parent inode is stored in the struct, allowing references from different parent directories. Requires the extended_iref incompat feature.

Payload (variable, packed sequence):

For each ref:

| Field | Offset | Size | Notes |
|---|---|---|---|
| parent | 0 | 8 | Parent directory inode number (LE u64) |
| index | 8 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 16 | 2 | Length of name (LE u16) |
| name | 18 | name_len | Filename bytes |

DIR_ITEM (type 84) / DIR_INDEX (type 96)

Key for DIR_ITEM: (dir_inode, DIR_ITEM, crc32c(name))

Key for DIR_INDEX: (dir_inode, DIR_INDEX, sequence)

Both use the same on-disk format. DIR_ITEM entries are keyed by the CRC32C hash of the filename (raw CRC32C, not standard). DIR_INDEX entries are keyed by a monotonically increasing sequence number for ordered directory iteration.

Multiple entries can be packed into a single DIR_ITEM when names hash to the same value (hash collision).

Payload (btrfs_dir_item, variable, packed sequence):

For each entry:

| Field | Offset | Size | Notes |
|---|---|---|---|
| location | 0 | 17 | Target inode key (btrfs_disk_key) |
| transid | 17 | 8 | Transaction ID (LE u64) |
| data_len | 25 | 2 | Xattr value length, 0 for dirs (LE u16) |
| name_len | 27 | 2 | Filename length (LE u16) |
| type | 29 | 1 | File type (see below) |
| name | 30 | name_len | Filename bytes |
| data | 30+name_len | data_len | Xattr value (for XATTR_ITEM only) |

The location field is a btrfs_disk_key pointing to the target. For regular directory entries, this typically has objectid = target inode, type = INODE_ITEM, offset = 0. For subvolume entries, type = ROOT_ITEM and objectid = the subvolume’s tree objectid.

File type values:

| Value | Name |
|---|---|
| 0 | FT_UNKNOWN |
| 1 | FT_REG_FILE |
| 2 | FT_DIR |
| 3 | FT_CHRDEV |
| 4 | FT_BLKDEV |
| 5 | FT_FIFO |
| 6 | FT_SOCK |
| 7 | FT_SYMLINK |
| 8 | FT_XATTR |

FILE_EXTENT_ITEM (type 108)

Key: (inode_number, EXTENT_DATA, file_byte_offset)

Describes how a range of file bytes maps to on-disk storage. Three extent types exist: inline, regular, and preallocated.

Common header (21 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Allocation generation (LE u64) |
| ram_bytes | 8 | 8 | Uncompressed size (LE u64) |
| compression | 16 | 1 | Compression type (0=none, 1=zlib, 2=lzo, 3=zstd) |
| encryption | 17 | 1 | Reserved (always 0) |
| other_encoding | 18 | 2 | Reserved (always 0) |
| type | 20 | 1 | Extent type (0=inline, 1=regular, 2=prealloc) |

Inline extent (type 0):

After the 21-byte header, the remaining bytes in the item are the file data itself. The data length is item_size - 21. For compressed inline extents, the data is compressed and ram_bytes gives the uncompressed size.

| Field | Offset | Size | Notes |
|---|---|---|---|
| header | 0 | 21 | Common header (type = 0) |
| data | 21 | item_size-21 | Inline file data |

Total item size: 21 + data_length.

Regular extent (type 1) and prealloc extent (type 2):

| Field | Offset | Size | Notes |
|---|---|---|---|
| header | 0 | 21 | Common header (type = 1 or 2) |
| disk_bytenr | 21 | 8 | Logical address of extent on disk (LE u64) |
| disk_num_bytes | 29 | 8 | Size of extent on disk (LE u64) |
| offset | 37 | 8 | Byte offset into extent (LE u64) |
| num_bytes | 45 | 8 | Number of logical file bytes covered (LE u64) |

Total item size: 53 bytes.

A disk_bytenr of 0 indicates a hole (sparse region). For compressed extents, disk_num_bytes is the compressed size on disk and ram_bytes is the uncompressed size. The offset field allows referencing into the middle of a shared extent (e.g., after COW of part of a cloned extent).

Prealloc extents (type 2) are reserved but unwritten; reads return zeroes.
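
Decoding the fixed-size regular/prealloc payload is mostly a matter of reading LE fields at the offsets above; a sketch:

```rust
/// Decoded regular or prealloc file extent (type 1 or 2).
/// Offsets follow the table above.
#[derive(Debug)]
struct FileExtent {
    disk_bytenr: u64,
    disk_num_bytes: u64,
    offset: u64,
    num_bytes: u64,
}

/// Read a LE u64 at `off` from an item buffer.
fn le64(d: &[u8], off: usize) -> u64 {
    u64::from_le_bytes(d[off..off + 8].try_into().unwrap())
}

/// Parse a 53-byte regular extent payload. Sketch only: assumes the
/// 21-byte common header has already been validated as type 1 or 2.
fn parse_regular_extent(item: &[u8; 53]) -> FileExtent {
    FileExtent {
        disk_bytenr: le64(item, 21),
        disk_num_bytes: le64(item, 29),
        offset: le64(item, 37),
        num_bytes: le64(item, 45),
    }
}

fn main() {
    let mut item = [0u8; 53];
    item[20] = 1; // type = regular
    // disk_bytenr left at 0 = hole (sparse region)
    item[45..53].copy_from_slice(&4096u64.to_le_bytes()); // num_bytes

    let ext = parse_regular_extent(&item);
    assert_eq!(ext.disk_bytenr, 0); // hole
    assert_eq!(ext.num_bytes, 4096);
}
```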

EXTENT_ITEM (type 168) / METADATA_ITEM (type 169)

Key for EXTENT_ITEM: (logical_bytenr, EXTENT_ITEM, extent_length)

Key for METADATA_ITEM: (logical_bytenr, METADATA_ITEM, level)

Tracks reference counts and backreferences for allocated space. METADATA_ITEM is the “skinny” variant (when skinny_metadata incompat flag is set): the extent length is implicit (= nodesize) and the key offset stores the tree block level instead.

Base payload (btrfs_extent_item, 24 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| refs | 0 | 8 | Number of references (LE u64) |
| generation | 8 | 8 | Allocation generation (LE u64) |
| flags | 16 | 8 | Extent flags (LE u64) |

Extent flags:

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | EXTENT_FLAG_DATA |
| 1 | 0x2 | EXTENT_FLAG_TREE_BLOCK |
| 8 | 0x100 | BLOCK_FLAG_FULL_BACKREF |

Tree block info (for non-skinny EXTENT_ITEM with TREE_BLOCK flag):

After the base extent item, non-skinny tree block extents include a btrfs_tree_block_info (18 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| key | 24 | 17 | First key in the tree block (btrfs_disk_key) |
| level | 41 | 1 | Tree block level (u8) |

This is absent for skinny metadata items (METADATA_ITEM), where the level is encoded in the key offset.

Inline backreferences:

After the extent item header (and tree_block_info if present), zero or more inline backreferences may be packed. Each starts with a 1-byte type tag followed by type-specific data:

| Type byte | Name | Data after type byte |
|---|---|---|
| 176 (0xB0) | TREE_BLOCK_REF | 8 bytes: root_objectid (LE u64) |
| 182 (0xB6) | SHARED_BLOCK_REF | 8 bytes: parent_bytenr (LE u64) |
| 178 (0xB2) | EXTENT_DATA_REF | 28 bytes: root(8) + objectid(8) + offset(8) + count(4) |
| 184 (0xB8) | SHARED_DATA_REF | 12 bytes: parent_bytenr(8) + count(4) |
| 172 (0xAC) | EXTENT_OWNER_REF | 8 bytes: root_objectid (LE u64) |

Note that for EXTENT_DATA_REF, the 8-byte offset field that normally follows the type byte is absent; the struct fields begin immediately after the type byte:

| Field | Offset | Size | Notes |
|---|---|---|---|
| type | 0 | 1 | 178 (EXTENT_DATA_REF_KEY) |
| root | 1 | 8 | Owning tree objectid (LE u64) |
| objectid | 9 | 8 | Referencing inode number (LE u64) |
| offset | 17 | 8 | File byte offset of reference (LE u64) |
| count | 25 | 4 | Number of references (LE u32) |

For other inline ref types, the format is:

| Field | Offset | Size | Notes |
|---|---|---|---|
| type | 0 | 1 | Type byte (176/182/184/172) |
| offset | 1 | 8 | Type-specific offset (LE u64) |

For SHARED_DATA_REF, an additional 4 bytes follow:

| Field | Offset | Size | Notes |
|---|---|---|---|
| count | 9 | 4 | Number of references (LE u32) |

Standalone backreference items

When backreferences do not fit inline in the extent item, they are stored as separate items in the extent tree:

TREE_BLOCK_REF (type 176): Key: (extent_bytenr, TREE_BLOCK_REF, root_objectid). No data payload; the key offset encodes the owning root.

SHARED_BLOCK_REF (type 182): Key: (extent_bytenr, SHARED_BLOCK_REF, parent_bytenr). No data payload; the key offset encodes the parent block.

EXTENT_DATA_REF (type 178): Key: (extent_bytenr, EXTENT_DATA_REF, hash). The hash is computed from (root, objectid, offset) using two CRC32C passes:

high_crc = raw_crc32c(0xFFFFFFFF, root_le_bytes)
low_crc  = raw_crc32c(0xFFFFFFFF, objectid_le_bytes)
low_crc  = raw_crc32c(low_crc,    offset_le_bytes)
hash     = (high_crc << 31) ^ low_crc
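
The two-pass hash above can be sketched in Rust with a bitwise (table-free) CRC32C where the seed is passed straight through, mirroring the kernel's crc32c_le():

```rust
/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
/// The seed goes straight into the register and the result is NOT
/// inverted -- the "raw" variant used for btrfs hash computations.
fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = (crc >> 1) ^ ((crc & 1) * 0x82F6_3B78);
        }
    }
    crc
}

/// EXTENT_DATA_REF key hash, per the two-pass scheme above.
fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high = raw_crc32c(!0, &root.to_le_bytes());
    let mut low = raw_crc32c(!0, &objectid.to_le_bytes());
    low = raw_crc32c(low, &offset.to_le_bytes());
    ((high as u64) << 31) ^ (low as u64)
}

fn main() {
    // Sanity check: adding the standard inverted-seed/inverted-output
    // convention must yield the well-known CRC32C check value.
    assert_eq!(!raw_crc32c(!0, b"123456789"), 0xE306_9283);
    // Different (root, objectid, offset) tuples hash differently.
    assert_ne!(
        extent_data_ref_hash(5, 257, 0),
        extent_data_ref_hash(5, 257, 4096)
    );
}
```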

Payload (btrfs_extent_data_ref, 28 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| root | 0 | 8 | Owning tree objectid (LE u64) |
| objectid | 8 | 8 | Referencing inode (LE u64) |
| offset | 16 | 8 | File byte offset (LE u64) |
| count | 24 | 4 | Reference count (LE u32) |

SHARED_DATA_REF (type 184): Key: (extent_bytenr, SHARED_DATA_REF, parent_bytenr). Payload (4 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| count | 0 | 4 | Reference count (LE u32) |

EXTENT_OWNER_REF (type 172): Key: (extent_bytenr, EXTENT_OWNER_REF, root_objectid). No data payload. Used with the simple_quota feature.

DEV_ITEM (type 216)

Key: (DEV_ITEMS_OBJECTID [1], DEV_ITEM, devid)

Stored in the chunk tree. Also embedded in the superblock at offset 201.

Payload (btrfs_dev_item, 98 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| total_bytes | 8 | 8 | Total device size (LE u64) |
| bytes_used | 16 | 8 | Bytes allocated on device (LE u64) |
| io_align | 24 | 4 | I/O alignment (LE u32) |
| io_width | 28 | 4 | I/O width (LE u32) |
| sector_size | 32 | 4 | Device sector size (LE u32) |
| type | 36 | 8 | Device type (reserved, 0) (LE u64) |
| generation | 44 | 8 | Generation last updated (LE u64) |
| start_offset | 52 | 8 | Allocation start offset (LE u64) |
| dev_group | 60 | 4 | Device group (reserved, 0) (LE u32) |
| seek_speed | 64 | 1 | Seek speed hint (0 = unset) |
| bandwidth | 65 | 1 | Bandwidth hint (0 = unset) |
| uuid | 66 | 16 | Device UUID |
| fsid | 82 | 16 | Filesystem UUID |

CHUNK_ITEM (type 228)

Key: (FIRST_CHUNK_TREE_OBJECTID [256], CHUNK_ITEM, logical_offset)

Maps a range of logical addresses to physical device locations. Stored in the chunk tree and (for system chunks) in the superblock’s sys_chunk_array.

Payload (btrfs_chunk + stripes, variable):

| Field | Offset | Size | Notes |
|---|---|---|---|
| length | 0 | 8 | Chunk size in bytes (LE u64) |
| owner | 8 | 8 | Owner objectid (LE u64) |
| stripe_len | 16 | 8 | Stripe length (typically 65536) (LE u64) |
| type | 24 | 8 | Chunk type + RAID profile flags (LE u64) |
| io_align | 32 | 4 | I/O alignment (LE u32) |
| io_width | 36 | 4 | I/O width (LE u32) |
| sector_size | 40 | 4 | Sector size (LE u32) |
| num_stripes | 44 | 2 | Number of stripes (LE u16) |
| sub_stripes | 46 | 2 | Sub-stripes for RAID10 (LE u16) |
| stripes[] | 48 | 32 × num_stripes | Array of num_stripes stripe entries |

Each stripe entry (btrfs_stripe, 32 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| offset | 8 | 8 | Physical byte offset on device (LE u64) |
| dev_uuid | 16 | 16 | Device UUID |

Total payload size: 48 + num_stripes * 32 bytes.
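
A stripe-array decoder that enforces the size invariant above might look like this (a sketch; real code would carry richer errors):

```rust
/// Per-stripe mapping parsed from a CHUNK_ITEM payload.
#[derive(Debug, PartialEq)]
struct Stripe {
    devid: u64,
    offset: u64,
    dev_uuid: [u8; 16],
}

/// Parse the stripe array from a CHUNK_ITEM payload.
/// Returns None if the payload size is not exactly 48 + num_stripes * 32.
fn parse_stripes(chunk: &[u8]) -> Option<Vec<Stripe>> {
    let num = u16::from_le_bytes(chunk.get(44..46)?.try_into().ok()?) as usize;
    if chunk.len() != 48 + num * 32 {
        return None; // size invariant violated
    }
    let mut stripes = Vec::with_capacity(num);
    for i in 0..num {
        let s = &chunk[48 + i * 32..48 + (i + 1) * 32];
        stripes.push(Stripe {
            devid: u64::from_le_bytes(s[0..8].try_into().unwrap()),
            offset: u64::from_le_bytes(s[8..16].try_into().unwrap()),
            dev_uuid: s[16..32].try_into().unwrap(),
        });
    }
    Some(stripes)
}

fn main() {
    let mut chunk = vec![0u8; 48 + 32];
    chunk[44..46].copy_from_slice(&1u16.to_le_bytes()); // num_stripes = 1
    chunk[48..56].copy_from_slice(&7u64.to_le_bytes()); // stripe 0: devid = 7
    let stripes = parse_stripes(&chunk).unwrap();
    assert_eq!(stripes.len(), 1);
    assert_eq!(stripes[0].devid, 7);
    // A truncated payload is rejected.
    assert!(parse_stripes(&chunk[..60]).is_none());
}
```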

Chunk type flags (bitmask, same as block group flags):

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | DATA |
| 1 | 0x2 | SYSTEM |
| 2 | 0x4 | METADATA |
| 3 | 0x8 | RAID0 |
| 4 | 0x10 | RAID1 |
| 5 | 0x20 | DUP |
| 6 | 0x40 | RAID10 |
| 7 | 0x80 | RAID5 |
| 8 | 0x100 | RAID6 |
| 9 | 0x200 | RAID1C3 |
| 10 | 0x400 | RAID1C4 |

When no RAID profile bits are set, the chunk is SINGLE profile.

DEV_EXTENT (type 204)

Key: (devid, DEV_EXTENT, physical_offset)

The inverse of a chunk stripe: maps a physical range on a device back to the owning chunk.

Payload (btrfs_dev_extent, 48 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| chunk_tree | 0 | 8 | Chunk tree objectid (always 3) (LE u64) |
| chunk_objectid | 8 | 8 | Chunk objectid (LE u64) |
| chunk_offset | 16 | 8 | Logical offset of owning chunk (LE u64) |
| length | 24 | 8 | Length of this device extent (LE u64) |
| chunk_tree_uuid | 32 | 16 | Chunk tree UUID |

BLOCK_GROUP_ITEM (type 192)

Key: (logical_offset, BLOCK_GROUP_ITEM, length)

Tracks space usage for a chunk. Stored in the extent tree (or block group tree when the block_group_tree feature is enabled).

Payload (btrfs_block_group_item, 24 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| used | 0 | 8 | Bytes used in this block group (LE u64) |
| chunk_objectid | 8 | 8 | Chunk objectid backing this group (LE u64) |
| flags | 16 | 8 | Type + RAID profile flags (LE u64) |

The flags field uses the same bitmask as chunk type flags (Section 8.9).

ROOT_ITEM (type 132)

Key: (tree_objectid, ROOT_ITEM, 0)

Stored in the root tree. Describes a tree root: its block address, generation, subvolume UUIDs, and timestamps.

Payload (btrfs_root_item, 439 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| inode | 0 | 160 | Embedded btrfs_inode_item (root dir inode) |
| generation | 160 | 8 | Generation when last modified (LE u64) |
| root_dirid | 168 | 8 | Root directory inode objectid (LE u64) |
| bytenr | 176 | 8 | Logical bytenr of root block (LE u64) |
| byte_limit | 184 | 8 | Quota byte limit, 0=unlimited (LE u64) |
| bytes_used | 192 | 8 | Bytes used by this tree (LE u64) |
| last_snapshot | 200 | 8 | Generation of last snapshot (LE u64) |
| flags | 208 | 8 | Root flags (LE u64) |
| refs | 216 | 4 | Reference count (LE u32) |
| drop_progress | 220 | 17 | Drop operation progress key (btrfs_disk_key) |
| drop_level | 237 | 1 | Drop operation tree level (u8) |
| level | 238 | 1 | B-tree level of root block (u8) |
| generation_v2 | 239 | 8 | Extended generation (v2) (LE u64) |
| uuid | 247 | 16 | Subvolume UUID |
| parent_uuid | 263 | 16 | Parent subvolume UUID (for snapshots) |
| received_uuid | 279 | 16 | Received UUID (for send/receive) |
| ctransid | 295 | 8 | Last change transaction (LE u64) |
| otransid | 303 | 8 | Creation transaction (LE u64) |
| stransid | 311 | 8 | Send transaction (LE u64) |
| rtransid | 319 | 8 | Receive transaction (LE u64) |
| ctime | 327 | 12 | Change timestamp (btrfs_timespec) |
| otime | 339 | 12 | Creation timestamp (btrfs_timespec) |
| stime | 351 | 12 | Send timestamp (btrfs_timespec) |
| rtime | 363 | 12 | Receive timestamp (btrfs_timespec) |
| reserved | 375 | 64 | Reserved u64[8] |

The embedded inode_item at the start describes the root directory inode (objectid 256 = BTRFS_FIRST_FREE_OBJECTID for FS trees).

Older filesystems may store a shorter v1 root item without the UUID, transaction, and timestamp fields. The parser handles both formats.

Root item flags:

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | SUBVOL_RDONLY (read-only snapshot) |

SUBVOL_DEAD (bit 48, value 0x1000000000000) marks a deleted subvolume pending cleanup.

ROOT_REF (type 156) / ROOT_BACKREF (type 144)

Key for ROOT_REF: (parent_tree_id, ROOT_REF, child_tree_id)

Key for ROOT_BACKREF: (child_tree_id, ROOT_BACKREF, parent_tree_id)

Forward and backward references linking subvolumes to their parent directories. Both use the same on-disk format.

Payload (btrfs_root_ref, 18 bytes + name):

| Field | Offset | Size | Notes |
|---|---|---|---|
| dirid | 0 | 8 | Directory inode containing the subvol entry (LE u64) |
| sequence | 8 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 16 | 2 | Length of name (LE u16) |
| name | 18 | name_len | Subvolume name bytes |

FREE_SPACE_INFO (type 198)

Key: (block_group_offset, FREE_SPACE_INFO, block_group_length)

Metadata about free space tracking for a block group.

Payload (btrfs_free_space_info, 8 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| extent_count | 0 | 4 | Number of free extents/bitmap entries (LE u32) |
| flags | 4 | 4 | Free space info flags (LE u32) |

Flags:

| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | USING_BITMAPS |

FREE_SPACE_EXTENT (type 199)

Key: (start, FREE_SPACE_EXTENT, length)

Represents a contiguous free range within a block group. The item has no data payload; the key itself encodes the start address and length.

FREE_SPACE_BITMAP (type 200)

Key: (start, FREE_SPACE_BITMAP, length)

A bitmap covering a portion of a block group’s address range. The item data is the raw bitmap, where each bit represents one sector of space. Bit set = free, bit clear = allocated.

XATTR_ITEM (type 24)

Key: (inode_number, XATTR_ITEM, crc32c(name))

Extended attribute storage. Uses the same on-disk format as DIR_ITEM (Section 8.4), but with:

  • location = zeroed key
  • data_len = length of the xattr value
  • type = FT_XATTR (8)
  • name = xattr name (e.g. user.myattr)
  • data = xattr value

EXTENT_CSUM (type 128)

Key: (EXTENT_CSUM_OBJECTID, EXTENT_CSUM, logical_bytenr)

Stores an array of per-sector checksums for a contiguous range of data blocks. The item data is a packed array of checksums, one per sector.

For CRC32C, each checksum is 4 bytes (LE u32), so the item covers item_size / 4 sectors. The logical byte range covered is:

start = key.offset
end   = key.offset + (item_size / csum_size) * sectorsize
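
The same range computation in Rust (function name is illustrative):

```rust
/// Logical byte range [start, end) covered by one EXTENT_CSUM item,
/// per the formula above.
fn csum_item_range(key_offset: u64, item_size: u64, csum_size: u64, sectorsize: u64) -> (u64, u64) {
    let sectors = item_size / csum_size;
    (key_offset, key_offset + sectors * sectorsize)
}

fn main() {
    // A 400-byte CRC32C item (4-byte csums) covers 100 sectors of 4 KiB.
    let (start, end) = csum_item_range(1_048_576, 400, 4, 4096);
    assert_eq!(start, 1_048_576);
    assert_eq!(end, 1_048_576 + 100 * 4096);
}
```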

QGROUP_STATUS (type 240)

Key: (0, QGROUP_STATUS, 0)

One per filesystem. Tracks the overall state of quota accounting.

Payload (btrfs_qgroup_status_item, 32-40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| version | 0 | 8 | On-disk format version (LE u64) |
| generation | 8 | 8 | Last consistent generation (LE u64) |
| flags | 16 | 8 | Status flags (LE u64) |
| scan | 24 | 8 | Rescan progress objectid (LE u64) |
| enable_gen | 32 | 8 | Enable generation (kernel 6.8+, optional) (LE u64) |

QGROUP_INFO (type 242)

Key: (packed_qgroupid, QGROUP_INFO, 0)

where packed_qgroupid = (level << 48) | subvolid.
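
Packing and unpacking this key value is a pair of shifts and masks; a sketch:

```rust
/// Pack a qgroupid: (level << 48) | subvolid.
fn pack_qgroupid(level: u16, subvolid: u64) -> u64 {
    ((level as u64) << 48) | (subvolid & ((1 << 48) - 1))
}

/// Split a packed qgroupid back into (level, subvolid).
fn unpack_qgroupid(q: u64) -> (u16, u64) {
    ((q >> 48) as u16, q & ((1 << 48) - 1))
}

fn main() {
    // Level-1 qgroup "1/100" and a level-0 qgroup for subvolume 256.
    assert_eq!(pack_qgroupid(1, 100), (1u64 << 48) | 100);
    assert_eq!(unpack_qgroupid(pack_qgroupid(0, 256)), (0, 256));
}
```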

Payload (btrfs_qgroup_info_item, 40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Last update generation (LE u64) |
| referenced | 8 | 8 | Total referenced bytes (LE u64) |
| referenced_compressed | 16 | 8 | Referenced bytes (compressed) (LE u64) |
| exclusive | 24 | 8 | Exclusive bytes (LE u64) |
| exclusive_compressed | 32 | 8 | Exclusive bytes (compressed) (LE u64) |

QGROUP_LIMIT (type 244)

Key: (packed_qgroupid, QGROUP_LIMIT, 0)

Payload (btrfs_qgroup_limit_item, 40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| flags | 0 | 8 | Active limit bitmask (LE u64) |
| max_referenced | 8 | 8 | Max referenced bytes, 0=unlimited (LE u64) |
| max_exclusive | 16 | 8 | Max exclusive bytes, 0=unlimited (LE u64) |
| rsv_referenced | 24 | 8 | Reserved referenced bytes (LE u64) |
| rsv_exclusive | 32 | 8 | Reserved exclusive bytes (LE u64) |

QGROUP_RELATION (type 246)

Key: (child_qgroupid, QGROUP_RELATION, parent_qgroupid)

Defines a parent-child relationship between qgroups. No data payload; the relationship is fully encoded in the key.

UUID_KEY_SUBVOL (type 251) / UUID_KEY_RECEIVED_SUBVOL (type 252)

Key: (upper_half_uuid, UUID_KEY_SUBVOL, lower_half_uuid)

Maps a UUID to one or more subvolume objectids. The UUID is split: the upper 8 bytes are stored as a LE u64 in the objectid field, the lower 8 bytes as a LE u64 in the offset field.

Payload (variable, array of u64):

For each associated subvolume:
  8 bytes   subvolid   Subvolume tree objectid (LE u64)
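
Splitting a 16-byte UUID into the two key halves is a pair of LE reads; a sketch (the first 8 bytes are the "upper half" stored in objectid):

```rust
/// Split a 16-byte UUID into the (objectid, offset) pair used as a
/// UUID-tree key: first 8 bytes and last 8 bytes, each read as LE u64.
fn uuid_tree_key(uuid: &[u8; 16]) -> (u64, u64) {
    let objectid = u64::from_le_bytes(uuid[0..8].try_into().unwrap());
    let offset = u64::from_le_bytes(uuid[8..16].try_into().unwrap());
    (objectid, offset)
}

fn main() {
    let mut uuid = [0u8; 16];
    uuid[0] = 1; // lowest byte of the upper half
    uuid[8] = 2; // lowest byte of the lower half
    assert_eq!(uuid_tree_key(&uuid), (1, 2));
}
```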

STRING_ITEM (type 253)

Key: (BTRFS_FREE_SPACE_OBJECTID, STRING_ITEM, 0)

Raw byte string. Typically stores the filesystem label in the root tree.

Payload: Raw bytes (length = item data size).

TEMPORARY_ITEM (type 248) / BALANCE_ITEM

Key: (BALANCE_OBJECTID, TEMPORARY_ITEM, 0)

Persists in-progress balance state across reboots.

Payload: The first 8 bytes are balance flags (LE u64). The remainder contains btrfs_balance_args structures for data, metadata, and system filters.

PERSISTENT_ITEM (type 249) / DEV_STATS

Key for device stats: (DEV_STATS_OBJECTID [0], PERSISTENT_ITEM, devid)

Key for device replace: (DEV_REPLACE_OBJECTID, DEV_REPLACE, 0)

Device stats payload (40 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| write_errs | 0 | 8 | Write error count (LE u64) |
| read_errs | 8 | 8 | Read error count (LE u64) |
| flush_errs | 16 | 8 | Flush error count (LE u64) |
| corruption_errs | 24 | 8 | Corruption error count (LE u64) |
| generation_errs | 32 | 8 | Generation mismatch count (LE u64) |

Device replace payload (btrfs_dev_replace_item, 72+ bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| src_devid | 0 | 8 | Source device ID (LE u64) |
| cursor_left | 8 | 8 | Left cursor position (LE u64) |
| cursor_right | 16 | 8 | Right cursor position (LE u64) |
| replace_mode | 24 | 8 | Replace mode (LE u64) |
| replace_state | 32 | 8 | Current state (LE u64) |
| time_started | 40 | 8 | Start timestamp (LE u64) |
| time_stopped | 48 | 8 | Stop timestamp (LE u64) |
| num_write_errors | 56 | 8 | Write errors (LE u64) |
| num_uncorrectable_read_errors | 64 | 8 | Uncorrectable reads (LE u64) |

ORPHAN_ITEM (type 48)

Key: (ORPHAN_OBJECTID, ORPHAN_ITEM, inode_number)

Marks an inode that has been unlinked but is still open. The item has no data payload. Orphan items are cleaned up on mount or by the kernel’s orphan cleanup thread.

RAID_STRIPE (type 230)

Key: (logical_offset, RAID_STRIPE, length)

Maps logical extents to per-device physical stripe offsets. Requires the raid_stripe_tree incompat feature.

Payload (variable):

| Field | Offset | Size | Notes |
|---|---|---|---|
| encoding | 0 | 8 | RAID encoding type (LE u64) |
| stripes[] | 8 | 16 × n | Array of stripe entries |

Each stripe entry (16 bytes):

| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| physical | 8 | 8 | Physical byte offset (LE u64) |

Checksums

Btrfs uses two distinct CRC32C computation modes:

Standard CRC32C (on-disk structures)

Used for all on-disk checksums: superblocks, tree block headers, and data checksums (EXTENT_CSUM items).

This is CRC32C with the Castagnoli polynomial and the conventional bit processing: seed = 0xFFFFFFFF, with the result XORed with 0xFFFFFFFF. It is equivalent to the standard crc32c() function in most libraries.

checksum = crc32c(data)    // standard CRC32C (Castagnoli)

The 4-byte LE result is stored in the checksum field. For superblocks and tree blocks, the checksum covers everything after the 32-byte csum field to the end of the structure.

Raw CRC32C (hash computations)

Used for internal hash computations where the kernel calls crc32c_le() directly:

  • Name hashes for DIR_ITEM keys (crc32c(name))
  • Name hashes for XATTR_ITEM keys
  • Name hashes for INODE_EXTREF keys
  • extent_data_ref key hash computation
  • Send stream CRC32C

The raw CRC32C passes the seed through without inversion:

raw_crc32c(seed, data) = !crc32c_append(!seed, data)

This is NOT the standard CRC32C convention. The seed is typically 0xFFFFFFFF (i.e. ~0u32), but unlike the standard convention, the output is not inverted.

Supported checksum algorithms

The csum_type field in the superblock selects the algorithm:

| Value | Name | Output size | Notes |
|---|---|---|---|
| 0 | CRC32C | 4 bytes | Default, by far the most common |
| 1 | xxHash64 | 8 bytes | Fast non-cryptographic hash |
| 2 | SHA-256 | 32 bytes | Cryptographic hash |
| 3 | BLAKE2b | 32 bytes | Cryptographic hash (BLAKE2b-256) |

The maximum checksum size is 32 bytes (BTRFS_CSUM_SIZE), which is also the size of the checksum field in headers.

Feature Flags

Feature flags are stored in three fields in the superblock. How an implementation must react to an unknown set flag depends on the field:

  • compat_flags: backward-compatible features; unknown flags can be safely ignored (no flags are currently defined)
  • compat_ro_flags: unknown flags still permit read-only mounting, but not read-write
  • incompat_flags: unknown flags make the filesystem unsafe to mount at all
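
A reader can therefore refuse to open a filesystem whose incompat field contains bits it does not understand. A sketch, where the supported-set constant is illustrative (a real implementation would list exactly the features it handles):

```rust
/// Incompat bits a hypothetical reader implementation understands:
/// MIXED_BACKREF | DEFAULT_SUBVOL | EXTENDED_IREF | SKINNY_METADATA.
const SUPPORTED_INCOMPAT: u64 = 0x1 | 0x2 | 0x40 | 0x100;

/// Returns the set of incompat bits we do not understand; a nonzero
/// result means the filesystem must not be opened, even read-only.
fn unsupported_incompat(incompat_flags: u64) -> u64 {
    incompat_flags & !SUPPORTED_INCOMPAT
}

fn main() {
    // All bits known: ok to proceed.
    assert_eq!(unsupported_incompat(0x1 | 0x100), 0);
    // ZONED (0x1000) is not in our supported set: refuse to open.
    assert_eq!(unsupported_incompat(0x1 | 0x1000), 0x1000);
}
```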

Incompatible feature flags (incompat_flags)

| Bit | Value | Name | Notes |
|---|---|---|---|
| 0 | 0x1 | MIXED_BACKREF | Mixed backref revision (always set on modern fs) |
| 1 | 0x2 | DEFAULT_SUBVOL | A non-default subvolume is the mount target |
| 2 | 0x4 | MIXED_GROUPS | Data and metadata may share block groups |
| 3 | 0x8 | COMPRESS_LZO | LZO compression used |
| 4 | 0x10 | COMPRESS_ZSTD | Zstandard compression used |
| 5 | 0x20 | BIG_METADATA | Metadata blocks > sectorsize (always set when nodesize > sectorsize) |
| 6 | 0x40 | EXTENDED_IREF | Extended inode references (INODE_EXTREF items) |
| 7 | 0x80 | RAID56 | RAID5/6 profiles used |
| 8 | 0x100 | SKINNY_METADATA | Skinny metadata extent refs (METADATA_ITEM instead of EXTENT_ITEM for tree blocks) |
| 9 | 0x200 | NO_HOLES | File extents do not need explicit hole entries |
| 10 | 0x400 | METADATA_UUID | metadata_uuid differs from fsid |
| 11 | 0x800 | RAID1C34 | RAID1C3/RAID1C4 profiles used |
| 12 | 0x1000 | ZONED | Zoned device support |
| 13 | 0x2000 | EXTENT_TREE_V2 | Extent tree v2 (experimental) |
| 14 | 0x4000 | RAID_STRIPE_TREE | RAID stripe tree for stripe mappings |
| 16 | 0x10000 | SIMPLE_QUOTA | Simple quota (per-extent ownership tracking) |
| 17 | 0x20000 | REMAP_TREE | Remap tree (reserved for future use) |

MIXED_BACKREF (bit 0): Indicates the filesystem uses mixed backref format (revision 1). All modern filesystems set this. Old filesystems without it use revision 0 backrefs.

DEFAULT_SUBVOL (bit 1): Set when a non-default subvolume has been configured as the default mount target via btrfs subvolume set-default.

MIXED_GROUPS (bit 2): Allows data and metadata to share the same block group. Unusual; typically used only on very small filesystems.

COMPRESS_LZO (bit 3): Set when any file on the filesystem uses LZO compression. Once set, it is never cleared.

COMPRESS_ZSTD (bit 4): Set when any file uses Zstandard compression.

BIG_METADATA (bit 5): Set when nodesize > sectorsize, allowing metadata blocks to span multiple sectors. Always set on modern filesystems with the typical 16384-byte nodesize and 4096-byte sectorsize.

EXTENDED_IREF (bit 6): Enables INODE_EXTREF items for inodes with hard links from multiple parent directories. Without this, only INODE_REF is used (keyed by single parent inode, limiting hard links per parent directory).

SKINNY_METADATA (bit 8): Uses METADATA_ITEM (type 169) instead of EXTENT_ITEM (type 168) for tree block extent records. The tree block level is encoded in the key offset, eliminating the separate btrfs_tree_block_info structure and saving 18 bytes per metadata extent item.

NO_HOLES (bit 9): File extents do not require explicit hole entries. Without this flag, holes in sparse files are represented by FILE_EXTENT_ITEM with disk_bytenr = 0; with it, holes are implicit (no item needed for the gap).

METADATA_UUID (bit 10): The metadata_uuid field in the superblock differs from fsid. This allows changing the user-visible filesystem UUID without rewriting every tree block header.

Compatible read-only feature flags (compat_ro_flags)

| Bit | Value | Name | Notes |
|---|---|---|---|
| 0 | 0x1 | FREE_SPACE_TREE | Free space tree exists |
| 1 | 0x2 | FREE_SPACE_TREE_VALID | Free space tree is valid and should be used |
| 2 | 0x4 | VERITY | fs-verity support enabled |
| 3 | 0x8 | BLOCK_GROUP_TREE | Block group items in separate tree |

FREE_SPACE_TREE (bit 0) + FREE_SPACE_TREE_VALID (bit 1): When both are set, the free space tree (objectid 10) is used instead of the legacy free space cache (v1). Both bits must be set for the tree to be considered valid.

VERITY (bit 2): Indicates that fs-verity has been enabled on at least one file, and the filesystem contains VERITY_DESC_ITEM and VERITY_MERKLE_ITEM entries.

BLOCK_GROUP_TREE (bit 3): Block group items are stored in a dedicated block group tree (objectid 11) instead of the extent tree. This improves mount time by avoiding a full extent tree scan to find block groups.

Appendix A: Transaction Model

Btrfs uses a generation-based transaction model. Each transaction is identified by a monotonically increasing generation counter stored in the superblock.

Transaction commit

A transaction commit involves:

  1. All modified tree blocks are written to new locations (COW). Each block’s header records the current generation.
  2. The superblock is updated with:
    • Incremented generation
    • New root (root tree root address)
    • New chunk_root (if chunk tree changed)
    • Updated bytes_used and total_bytes
    • Rotated super_roots backup entry
  3. The superblock is written to all mirrors that fit on the device.

The superblock write is the atomic commit point. If the system crashes before the superblock is fully written, the previous superblock (with the previous generation) remains valid and the filesystem rolls back to that state.

Generation consistency

The generation field appears in multiple places, all of which must be consistent:

  • Superblock generation: the current transaction counter
  • Tree block header generation: must equal the generation when the block was last COWed
  • Node key-pointer generation: must match the child block’s header generation (used for read-time validation)
  • ROOT_ITEM.generation: the generation when the tree was last modified
  • Backup root *_gen fields: generation of each tree root at backup time

When reading a tree, the kernel validates that each block’s generation matches the expected generation from its parent’s key-pointer. A mismatch indicates corruption or a torn write.
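A minimal sketch of that read-time check, assuming the standard btrfs_header layout (csum 32 bytes, fsid 16, bytenr 8, flags 8, chunk_tree_uuid 16, so generation sits at byte offset 80). The function names are illustrative, not this project's API:

```rust
/// Read the generation field from a raw tree block header.
/// Offset 80 = csum (32) + fsid (16) + bytenr (8) + flags (8) + chunk_tree_uuid (16).
fn header_generation(block: &[u8]) -> u64 {
    u64::from_le_bytes(block[80..88].try_into().unwrap())
}

/// Validate a child block against the generation recorded in its
/// parent's key-pointer, as done during tree reads.
fn check_child(block: &[u8], expected_gen: u64) -> Result<(), String> {
    let gen = header_generation(block);
    if gen == expected_gen {
        Ok(())
    } else {
        Err(format!("generation mismatch: header {gen}, parent pointer {expected_gen}"))
    }
}

fn main() {
    let mut block = vec![0u8; 101];
    block[80..88].copy_from_slice(&42u64.to_le_bytes());
    assert!(check_child(&block, 42).is_ok());
    assert!(check_child(&block, 41).is_err());
}
```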

Superblock flag: CHANGING_FSID

The BTRFS_SUPER_FLAG_CHANGING_FSID flag (bit 35 of flags) is set during an offline fsid rewrite operation. If the system crashes while this flag is set, the rewrite must be completed or rolled back on the next access. This provides crash safety for the multi-block fsid change operation.

Appendix B: Size Constants

Constant                    Size        Notes
BTRFS_SUPER_INFO_SIZE       4096 bytes
BTRFS_HEADER_SIZE           101 bytes   sizeof(btrfs_header)
BTRFS_ITEM_SIZE             25 bytes    sizeof(btrfs_item)
BTRFS_KEY_PTR_SIZE          33 bytes    sizeof(btrfs_key_ptr)
BTRFS_DISK_KEY_SIZE         17 bytes    sizeof(btrfs_disk_key)
BTRFS_CSUM_SIZE             32 bytes    Maximum checksum field width
BTRFS_STRIPE_SIZE           32 bytes    sizeof(btrfs_stripe)
BTRFS_INODE_ITEM_SIZE       160 bytes   sizeof(btrfs_inode_item)
BTRFS_ROOT_ITEM_SIZE        439 bytes   sizeof(btrfs_root_item)
BTRFS_DEV_ITEM_SIZE         98 bytes    sizeof(btrfs_dev_item)
BTRFS_TIMESPEC_SIZE         12 bytes    sizeof(btrfs_timespec)
BTRFS_BLOCK_GROUP_SIZE      24 bytes    sizeof(btrfs_block_group_item)
BTRFS_EXTENT_ITEM_SIZE      24 bytes    sizeof(btrfs_extent_item)
BTRFS_TREE_BLOCK_INFO_SIZE  18 bytes    sizeof(btrfs_tree_block_info)
BTRFS_EXTENT_DATA_REF_SIZE  28 bytes    sizeof(btrfs_extent_data_ref)
BTRFS_DEV_EXTENT_SIZE       48 bytes    sizeof(btrfs_dev_extent)
BTRFS_FREE_SPACE_INFO_SIZE  8 bytes     sizeof(btrfs_free_space_info)
BTRFS_ROOT_REF_SIZE         18 bytes    sizeof(btrfs_root_ref), without name
BTRFS_DIR_ITEM_SIZE         30 bytes    sizeof(btrfs_dir_item), without name/data
BTRFS_BACKUP_ROOT_SIZE      168 bytes   sizeof(btrfs_root_backup)
SYS_CHUNK_ARRAY_SIZE        2048 bytes

Appendix C: Logical-to-Physical Address Resolution

All tree block addresses and extent addresses in btrfs are logical addresses. To read a logical address from disk, it must be resolved to a physical device offset through the chunk tree.

The resolution process:

  1. Bootstrap: Parse the superblock’s sys_chunk_array to seed an initial chunk cache with system chunk mappings.

  2. Read the chunk tree: Using the system chunk mappings, resolve superblock.chunk_root to a physical address and read the chunk tree. Add all CHUNK_ITEM entries to the cache.

  3. Resolve: For any logical address, find the chunk whose range contains that address. The physical address is:

    physical = stripe.offset + (logical - chunk.logical)
    

    For SINGLE and DUP profiles, any stripe yields a valid copy. For RAID1, all stripes hold identical copies. For RAID0/5/6/10, stripe index calculation is needed.

  4. Read the root tree: Using the full chunk cache, resolve superblock.root to a physical address and read the root tree. From here, all other trees can be located via their ROOT_ITEM entries.

Appendix D: File Data Layout

A regular file’s on-disk data is described by a sequence of FILE_EXTENT_ITEM entries in the FS tree, keyed by (inode, EXTENT_DATA, file_offset).

Inline extents: Small files (typically < sectorsize) store their data directly in the tree leaf. No separate disk allocation is needed.

Regular extents: Larger files reference data stored in data chunks. The extent is described by disk_bytenr (logical address) and disk_num_bytes (on-disk size). The offset field allows partial references into shared extents (e.g., after COW or clone operations).

Compressed extents: When compression is enabled, the compression field is nonzero, disk_num_bytes is the compressed size, and ram_bytes is the uncompressed size. Inline compressed extents store the compressed data directly in the item.

Sparse files: With the NO_HOLES feature, gaps between extent items are implicit holes. Without it, explicit hole entries with disk_bytenr = 0 fill the gaps.

The file size is stored in INODE_ITEM.size and is authoritative even if the extent items would suggest a different range.

Extent sharing and cloning

When a file extent is cloned (via cp --reflink or BTRFS_IOC_CLONE), both the source and destination inodes reference the same on-disk extent via their FILE_EXTENT_ITEM entries. The reference count in the extent tree’s EXTENT_ITEM is incremented.

The offset field in FILE_EXTENT_ITEM allows each reference to start at a different position within the shared extent:

File A:  [--- extent X (offset=0, num_bytes=4096) ---]
File B:  [--- extent X (offset=2048, num_bytes=2048) ---]

Both reference the same disk_bytenr, but File B starts reading 2048 bytes into the extent.

Compression type encoding

The compression field in FILE_EXTENT_ITEM uses these values:

Value  Name  Notes
0      none  No compression
1      zlib  Deflate compression
2      lzo   LZO compression (btrfs per-sector format)
3      zstd  Zstandard compression

When compression is used with inline extents, the stored data is compressed and the inline data size may differ from ram_bytes.

Appendix E: Subvolume and Snapshot Model

Subvolumes

Each subvolume is an independent FS tree with its own tree objectid (5 for the default, 256+ for user-created subvolumes). The root tree stores:

  • A ROOT_ITEM for each subvolume, recording the root block address, generation, UUIDs, and timestamps.
  • ROOT_REF / ROOT_BACKREF pairs linking parent and child subvolumes.

Snapshots

A snapshot is a subvolume created by COWing the root block of another subvolume. At creation time, the snapshot shares all tree blocks with the source. As either the source or snapshot is modified, shared blocks are COWed on demand, gradually diverging.

The parent_uuid field in ROOT_ITEM links a snapshot back to its source subvolume. The received_uuid field tracks the source across send/receive operations.

Subvolume deletion

Deleted subvolumes are marked with the SUBVOL_DEAD flag in their ROOT_ITEM.flags. The kernel cleans up the tree blocks asynchronously, tracking progress via the drop_progress key and drop_level fields.

Read-only snapshots

A subvolume can be made read-only by setting the SUBVOL_RDONLY flag in ROOT_ITEM.flags. This is required for send operations (the source subvolume must be read-only).

Appendix F: Name Hashing

Directory entries (DIR_ITEM) and extended attributes (XATTR_ITEM) are keyed by a CRC32C hash of the name. The hash uses raw CRC32C (see Section 9.2) with seed ~0:

hash = raw_crc32c(0xFFFFFFFF, name_bytes)

This hash determines the key offset for the DIR_ITEM. If two names hash to the same value (collision), their DIR_ITEM entries are packed into a single item, concatenated one after another.
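A minimal sketch of the raw CRC32C defined above, assuming the Section 9.2 definition is the conventional reflected Castagnoli polynomial (0x82F63B78) with no final XOR. The function names are illustrative:

```rust
/// Bit-by-bit raw CRC32C (Castagnoli, reflected polynomial 0x82F63B78),
/// with no final XOR applied to the result.
fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

/// DIR_ITEM key offset for a name, per the formula above.
fn name_hash(name: &[u8]) -> u32 {
    raw_crc32c(0xFFFF_FFFF, name)
}

fn main() {
    // Sanity check: inverting the raw result recovers the conventional
    // CRC32C check value for "123456789".
    assert_eq!(raw_crc32c(0xFFFF_FFFF, b"123456789") ^ 0xFFFF_FFFF, 0xE306_9283);
    println!("hash of \"foo\": {:#010x}", name_hash(b"foo"));
}
```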

DIR_INDEX entries use a monotonically increasing sequence number instead of a hash, providing deterministic iteration order independent of name hashing.

For INODE_EXTREF, the hash combines the parent inode number and name:

hash = raw_crc32c(raw_crc32c(0xFFFFFFFF, parent_ino_le_bytes), name_bytes)

Appendix G: Block Group and Chunk Relationship

The relationship between chunks, block groups, and device extents forms the space allocation layer:

Chunk (chunk tree)
  |
  +-- maps logical range [L, L+length) to physical stripes
  |   on one or more devices
  |
  +-- Block Group (extent tree or block group tree)
  |     tracks used/free space within the logical range
  |     type flags must match the chunk type
  |
  +-- Device Extent(s) (device tree)
        one per stripe, maps physical range back to the chunk

Allocation order: mkfs creates chunks by:

  1. Choosing a physical region on each device (creating device extents)
  2. Assigning a logical address range (creating the chunk item)
  3. Creating a block group covering the logical range
  4. For the free space tree, creating a FREE_SPACE_INFO and initial FREE_SPACE_EXTENT entries

Consistency invariant: For every chunk, there must be:

  • Exactly one BLOCK_GROUP_ITEM with matching logical offset and length
  • One DEV_EXTENT per stripe, with chunk_offset pointing back to the chunk
  • The block group flags must match the chunk type field

These cross-references are verified by btrfs check.

Appendix H: Default Feature Set

A modern btrfs filesystem created by mkfs.btrfs (or this project’s btrfs-mkfs) typically has the following features enabled:

Incompatible features:

  • MIXED_BACKREF (bit 0) – always set
  • BIG_METADATA (bit 5) – set because nodesize (16384) > sectorsize (4096)
  • EXTENDED_IREF (bit 6) – enables extended inode references
  • SKINNY_METADATA (bit 8) – compact metadata extent records
  • NO_HOLES (bit 9) – implicit holes in sparse files

Compatible read-only features:

  • FREE_SPACE_TREE (bit 0) – free space tracking tree
  • FREE_SPACE_TREE_VALID (bit 1) – free space tree is valid

These are the extref, skinny-metadata, no-holes, and free-space-tree features referenced in mkfs output.

Default parameters:

  • nodesize = 16384 (16 KiB)
  • sectorsize = 4096 (4 KiB), matching the device sector size
  • stripesize = 65536 (64 KiB)
  • csum_type = 0 (CRC32C)
  • Metadata profile: DUP (two copies on the same device)
  • Data profile: SINGLE (no redundancy)
  • System profile: DUP (for single-device) or RAID1 (for multi-device)

Appendix I: Extent Reference Counting

Btrfs tracks references to every allocated extent (both data and metadata) in the extent tree. The reference count in EXTENT_ITEM.refs (or METADATA_ITEM.refs) records how many times the extent is referenced.

Metadata extents

A metadata extent (tree block) is referenced by key-pointers in parent nodes. When a snapshot is created, the snapshot initially shares all tree blocks with the source. Each shared block has refs >= 2. When either tree COWs a shared block, the old block’s refcount is decremented and the new copy gets refs = 1.

Backreferences track which tree(s) own each block:

  • TREE_BLOCK_REF (inline or standalone): direct ownership by a tree root
  • SHARED_BLOCK_REF (inline or standalone): ownership via a parent block that is itself shared between trees

Data extents

A data extent is referenced by FILE_EXTENT_ITEM entries in FS trees. Multiple files (or multiple positions in the same file) can reference the same data extent through reflink cloning.

Backreferences track which file inodes reference each extent:

  • EXTENT_DATA_REF (inline or standalone): records (root, inode, offset, count)
  • SHARED_DATA_REF (inline or standalone): records (parent_block, count)

Reference count invariant

The refs field must equal the sum of all backreference counts for the extent. btrfs check verifies this invariant by walking the extent tree and cross-referencing with the FS trees.

When refs reaches 0, the extent is freed and its space returned to the block group’s free space pool.

Chunks and Block Groups

This document describes the btrfs chunk and block group system: how the filesystem maps logical addresses to physical device locations, how space is organized into typed block groups, and how these structures relate to each other on disk.

All multi-byte integers in btrfs on-disk structures are little-endian.

Address Spaces

Btrfs uses two distinct address spaces:

Logical address space. Every byte of allocated space in the filesystem has a logical address. Tree node pointers, extent references, block group descriptors, and file extent records all use logical addresses. The logical address space is a flat 64-bit namespace shared across all devices in the filesystem. There is no inherent relationship between a logical address and any particular physical device.

Physical address space. Each device has its own independent physical address space, starting at byte 0. Physical addresses identify actual byte offsets on a block device.

The separation exists for several reasons:

  1. Multi-device support. A single logical address can map to stripes on multiple physical devices (RAID1, DUP, RAID0, etc.) without the upper layers of the filesystem needing to know which devices are involved.

  2. Relocation. The balance and resize operations can move data between physical locations while logical addresses remain stable. Since all internal pointers use logical addresses, no tree rewriting is needed when physical locations change.

  3. Redundancy profiles. The same logical address range can have multiple physical copies (DUP, RAID1) or be striped across devices (RAID0) — this is invisible to everything above the chunk layer.

The mapping between the two address spaces is maintained by three cooperating data structures: chunks (logical to physical), device extents (physical to logical), and block groups (space accounting).

Chunks

A chunk maps a contiguous range of logical addresses to one or more physical locations on devices. Chunks are the fundamental unit of the logical-to-physical translation.

CHUNK_ITEM On-Disk Structure

Chunks are stored in the chunk tree. Each chunk item has a key:

Key: (FIRST_CHUNK_TREE_OBJECTID, CHUNK_ITEM, logical_offset)
      objectid = 256                type = 228    offset = start of logical range

The item payload is a btrfs_chunk structure followed by an array of btrfs_stripe structures:

btrfs_chunk (48 bytes):

Field        Offset  Size  Description
length       0       8     Logical extent length in bytes
owner        8       8     Owner tree objectid (always EXTENT_TREE_OBJECTID = 2)
stripe_len   16      8     Stripe length for striped profiles (default 64 KiB)
type         24      8     Block group type + RAID profile flags
io_align     32      4     I/O alignment (STRIPE_LEN for normal chunks, sectorsize for bootstrap)
io_width     36      4     I/O width (same as io_align)
sector_size  40      4     Sector size of the underlying devices
num_stripes  44      2     Number of stripe entries following
sub_stripes  46      2     Sub-stripe count (nonzero only for RAID10)

btrfs_stripe (32 bytes each, num_stripes entries):

Field     Offset  Size  Description
devid     0       8     Device ID
offset    8       8     Physical byte offset on that device
dev_uuid  16      16    UUID of the device

The total item size is 48 + num_stripes * 32 bytes.
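A sketch of decoding a chunk item payload using the offsets from the tables above (little-endian throughout). The function and names are illustrative, not this project's API:

```rust
/// Parse the fixed part of a chunk item payload plus its stripe array.
/// Returns (logical length, [(devid, physical offset)]).
fn parse_chunk(item: &[u8]) -> (u64, Vec<(u64, u64)>) {
    let u64_at = |o: usize| u64::from_le_bytes(item[o..o + 8].try_into().unwrap());
    let length = u64_at(0);
    let num_stripes = u16::from_le_bytes(item[44..46].try_into().unwrap()) as usize;
    assert_eq!(item.len(), 48 + num_stripes * 32, "truncated chunk item");
    let stripes = (0..num_stripes)
        .map(|i| {
            let base = 48 + i * 32; // each btrfs_stripe is 32 bytes
            (u64_at(base), u64_at(base + 8))
        })
        .collect();
    (length, stripes)
}

fn main() {
    // Build a SINGLE-profile chunk item: length 4 MiB, one stripe on devid 1.
    let mut item = vec![0u8; 48 + 32];
    item[0..8].copy_from_slice(&(4u64 << 20).to_le_bytes());   // length
    item[44..46].copy_from_slice(&1u16.to_le_bytes());         // num_stripes
    item[48..56].copy_from_slice(&1u64.to_le_bytes());         // stripe 0 devid
    item[56..64].copy_from_slice(&(1u64 << 20).to_le_bytes()); // stripe 0 offset
    let (length, stripes) = parse_chunk(&item);
    assert_eq!(length, 4 << 20);
    assert_eq!(stripes, vec![(1, 1 << 20)]);
}
```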

Logical-to-Physical Resolution

To resolve a logical address to a physical location:

  1. Find the chunk whose logical range contains the address. The chunk tree is a B-tree keyed by (256, CHUNK_ITEM, logical_offset), so a lookup finds the entry with the largest logical_offset <= target.

  2. Verify the address falls within the chunk: logical_offset <= target < logical_offset + length.

  3. Compute the offset within the chunk: within = target - logical_offset.

  4. For simple profiles (SINGLE, DUP, RAID1): the physical address on stripe i is stripe[i].offset + within.

  5. For striped profiles (RAID0, RAID10, RAID5, RAID6): the stripe index and offset within the stripe are computed from within, stripe_len, and num_stripes/sub_stripes.

The ChunkTreeCache in disk/src/chunk.rs implements this as a BTreeMap keyed by logical start address, with resolve() returning the physical offset on the first stripe (sufficient for SINGLE, DUP, and RAID1 reads).
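A minimal sketch of such a cache. `ChunkCache` here is a simplified stand-in, not the actual ChunkTreeCache type: it maps each chunk's logical start to its length and stripe-0 physical offset, which is enough for SINGLE, DUP, and RAID1 reads:

```rust
use std::collections::BTreeMap;

/// Logical start -> (length, physical offset of stripe 0).
struct ChunkCache(BTreeMap<u64, (u64, u64)>);

impl ChunkCache {
    fn resolve(&self, logical: u64) -> Option<u64> {
        // Largest start <= logical, then a bounds check against the length.
        let (&start, &(len, physical)) = self.0.range(..=logical).next_back()?;
        (logical < start + len).then(|| physical + (logical - start))
    }
}

fn main() {
    let mut map = BTreeMap::new();
    // Chunk: logical 5 MiB, 32 MiB long, physical 9 MiB on stripe 0.
    map.insert(5 << 20, (32 << 20, 9 << 20));
    let cache = ChunkCache(map);
    assert_eq!(cache.resolve((5 << 20) + 4096), Some((9 << 20) + 4096));
    assert_eq!(cache.resolve(1 << 20), None);  // below the chunk
    assert_eq!(cache.resolve(40 << 20), None); // past the end
}
```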

Chunk Ownership

The owner field in the chunk item is always BTRFS_EXTENT_TREE_OBJECTID (2). This is a historical artifact — it does not mean the extent tree “owns” the chunk in any meaningful sense. The chunk tree is its own independent tree (tree objectid 3) with its root pointer stored directly in the superblock.

Block Groups

A block group is the unit of space management in btrfs. Each block group corresponds to exactly one chunk and tracks how much of that chunk’s space is used. Block groups carry type information that determines what kind of data can be stored in them.

BLOCK_GROUP_ITEM On-Disk Structure

Block group items are stored either in the extent tree (traditional layout) or in the dedicated block-group tree (when the BLOCK_GROUP_TREE compat_ro feature is enabled).

Key: (logical_offset, BLOCK_GROUP_ITEM, length)
      objectid = chunk start    type = 192    offset = chunk length

The item payload is a btrfs_block_group_item (24 bytes):

Field           Offset  Size  Description
used            0       8     Bytes currently allocated within this block group
chunk_objectid  8       8     Always FIRST_CHUNK_TREE_OBJECTID (256)
flags           16      8     Type flags + RAID profile flags

Type Flags

The flags field is a bitfield combining a chunk type (what gets stored) and a RAID profile (how it is stored):

Chunk type bits (mutually exclusive in practice):

Flag      Value  Meaning
DATA      0x001  File data extents
SYSTEM    0x002  Chunk tree blocks (needed to bootstrap reads)
METADATA  0x004  Tree node blocks (all trees except chunk)

The kernel also supports DATA|METADATA (0x005) for the mixed-bg feature, where data and metadata share block groups.

RAID profile bits:

Flag     Value  Meaning
(none)   0      SINGLE — one copy, one device
RAID0    0x008  Striped across N devices, no redundancy
RAID1    0x010  Mirrored on 2 devices
DUP      0x020  Two copies on the same device
RAID10   0x040  Striped mirrors
RAID5    0x080  Single parity
RAID6    0x100  Double parity
RAID1C3  0x200  Mirrored on 3 devices
RAID1C4  0x400  Mirrored on 4 devices

For example, a metadata block group using DUP has flags 0x024 (METADATA | DUP). A system block group with no profile bits set is SYSTEM|single (0x002).

The BlockGroupFlags type in disk/src/items.rs represents these flags as a bitflags struct with methods type_name() (returns “Data”, “Metadata”, “System”, etc.) and profile_name() (returns “RAID1”, “DUP”, “single”, etc.).
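A simplified decoding sketch in the same spirit (plain match arms rather than the actual bitflags-based BlockGroupFlags type):

```rust
/// Decode a block group flags field into (type name, profile name),
/// mirroring the flag tables above.
fn decode_flags(flags: u64) -> (&'static str, &'static str) {
    let ty = match flags & 0x7 {
        0x1 => "Data",
        0x2 => "System",
        0x4 => "Metadata",
        0x5 => "Data+Metadata", // mixed-bg
        _ => "unknown",
    };
    let profile = match flags & !0x7 {
        0x000 => "single",
        0x008 => "RAID0",
        0x010 => "RAID1",
        0x020 => "DUP",
        0x040 => "RAID10",
        0x080 => "RAID5",
        0x100 => "RAID6",
        0x200 => "RAID1C3",
        0x400 => "RAID1C4",
        _ => "unknown",
    };
    (ty, profile)
}

fn main() {
    assert_eq!(decode_flags(0x024), ("Metadata", "DUP"));
    assert_eq!(decode_flags(0x002), ("System", "single"));
}
```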

Block Group to Chunk Relationship

Every block group has a 1:1 correspondence with a chunk. The block group’s key (logical_offset, BLOCK_GROUP_ITEM, length) must match a chunk item’s (256, CHUNK_ITEM, logical_offset) with matching length. The block group’s flags must agree with the chunk item’s type field.

This invariant is verified by btrfs check (see section 8).

Device Extents

Device extents are the inverse mapping of chunks: they record which ranges of physical space on each device are allocated to which chunks.

DEV_EXTENT On-Disk Structure

Device extents are stored in the device tree (tree objectid 4).

Key: (devid, DEV_EXTENT, physical_offset)
      objectid = device ID    type = 204    offset = start byte on device

The item payload is a btrfs_dev_extent (48 bytes):

Field            Offset  Size  Description
chunk_tree       0       8     Chunk tree objectid (always 3)
chunk_objectid   8       8     FIRST_CHUNK_TREE_OBJECTID (256)
chunk_offset     16      8     Logical offset of the owning chunk
length           24      8     Physical extent length in bytes
chunk_tree_uuid  32      16    UUID of the chunk tree

Relationship to Chunks and Stripes

For each stripe in a chunk item, there is a corresponding device extent. If a chunk at logical address L has num_stripes stripes, then:

  • Stripe 0: (stripe[0].devid, DEV_EXTENT, stripe[0].offset) with chunk_offset = L and length = chunk.length (for SINGLE/DUP/RAID1).

  • Stripe 1 (for DUP/RAID1): (stripe[1].devid, DEV_EXTENT, stripe[1].offset) with chunk_offset = L and length = chunk.length.

For a DUP metadata chunk on a single device, both stripes have the same devid but different physical offsets, producing two device extents on the same device.

Device Items

Each device in the filesystem also has a DEV_ITEM in the chunk tree:

Key: (DEV_ITEMS_OBJECTID, DEV_ITEM, devid)
      objectid = 1              type = 216    offset = device ID

The item payload is a btrfs_dev_item (98 bytes):

Field         Offset  Size  Description
devid         0       8     Unique device ID
total_bytes   8       8     Total device size
bytes_used    16      8     Bytes allocated to chunks on this device
io_align      24      4     I/O alignment
io_width      28      4     I/O width
sector_size   32      4     Sector size
dev_type      36      8     Reserved (0)
generation    44      8     Last-updated generation
start_offset  52      8     Start offset for allocations
dev_group     60      4     Reserved (0)
seek_speed    64      1     Seek speed hint (0)
bandwidth     65      1     Bandwidth hint (0)
uuid          66      16    Device UUID
fsid          82      16    Filesystem UUID

The bytes_used field is the sum of the lengths of all device extents on that device. A copy of the device item for device 1 is also embedded in the superblock.

The Bootstrap Problem

Circular Dependency

To read any tree, you need to resolve logical addresses to physical offsets, which requires the chunk tree. But the chunk tree is itself stored at a logical address that needs resolution. This creates a circular dependency.

sys_chunk_array

Btrfs solves this with the sys_chunk_array — a 2048-byte buffer embedded directly in the superblock. This array contains a subset of the chunk tree: specifically, the chunk items for SYSTEM-type block groups.

The SYSTEM block group contains the chunk tree’s root block. By parsing the sys_chunk_array, the filesystem driver can locate the chunk tree on disk without needing a chunk tree to find it.

The array format is a packed sequence of (btrfs_disk_key, btrfs_chunk) pairs:

sys_chunk_array[0..sys_chunk_array_size]:
  repeat {
    btrfs_disk_key (17 bytes):
      objectid: u64_le      (always FIRST_CHUNK_TREE_OBJECTID = 256)
      type:     u8           (always CHUNK_ITEM = 228)
      offset:   u64_le       (logical offset of the chunk)
    btrfs_chunk + stripes:
      (same format as the chunk item payload described in section 2.1)
  }

The sys_chunk_array_size field in the superblock records how many bytes of the 2048-byte buffer are valid.
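The array walk can be sketched as follows. This is a simplified illustration, not the actual seed_from_sys_chunk_array (which also records the stripes); it only pulls out each entry's logical offset and stripe count:

```rust
/// Walk the packed (btrfs_disk_key, btrfs_chunk) pairs in a
/// sys_chunk_array buffer, returning (logical offset, num_stripes)
/// per entry. In real code, the buffer should be truncated to
/// sys_chunk_array_size first.
fn walk_sys_chunk_array(buf: &[u8]) -> Vec<(u64, u16)> {
    let mut entries = Vec::new();
    let mut pos = 0;
    while pos + 17 + 48 <= buf.len() {
        // btrfs_disk_key: objectid (8) + type (1) + offset (8) = 17 bytes
        let logical = u64::from_le_bytes(buf[pos + 9..pos + 17].try_into().unwrap());
        pos += 17;
        let num_stripes = u16::from_le_bytes(buf[pos + 44..pos + 46].try_into().unwrap());
        entries.push((logical, num_stripes));
        pos += 48 + num_stripes as usize * 32; // chunk header + stripe array
    }
    entries
}

fn main() {
    // One entry: key offset = 1 MiB, chunk with a single stripe.
    let mut buf = vec![0u8; 17 + 48 + 32];
    buf[9..17].copy_from_slice(&(1u64 << 20).to_le_bytes());    // key.offset
    buf[17 + 44..17 + 46].copy_from_slice(&1u16.to_le_bytes()); // num_stripes
    assert_eq!(walk_sys_chunk_array(&buf), vec![(1 << 20, 1)]);
}
```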

Bootstrap Sequence

The full bootstrap sequence for reading a btrfs filesystem is:

  1. Read the superblock at the primary offset (64 KiB). Verify the magic number, checksum, and fsid. The superblock provides:

    • sys_chunk_array + sys_chunk_array_size
    • chunk_root (logical address of the chunk tree root)
    • root (logical address of the root tree root)
    • nodesize, sectorsize, csum_type
  2. Parse the sys_chunk_array to build an initial ChunkTreeCache. This cache contains only the SYSTEM chunk(s), which is enough to resolve the chunk tree root address.

  3. Read the chunk tree starting from chunk_root. For each CHUNK_ITEM found, add the mapping to the ChunkTreeCache. After this step, the cache can resolve any logical address in the filesystem.

  4. Read the root tree starting from root. This tree contains ROOT_ITEM entries for every other tree (extent, device, FS, csum, free-space, etc.), providing their root block logical addresses.

  5. Read any other tree by looking up its ROOT_ITEM in the root tree and using the ChunkTreeCache to resolve addresses.

The seed_from_sys_chunk_array() function in disk/src/chunk.rs implements step 2. The BlockReader in disk/src/reader.rs orchestrates the full bootstrap sequence.

RAID Profiles

The RAID profile determines how a chunk’s logical space maps to physical device locations. The profile affects num_stripes, sub_stripes, and the interpretation of stripe entries.

SINGLE

num_stripes = 1
sub_stripes = 0

One stripe, one device. Logical offset maps 1:1 to a physical offset on a single device. No redundancy.

Logical:   [--------chunk------]
Physical:  [dev1: stripe 0     ]

DUP

num_stripes = 2
sub_stripes = 0

Two stripes on the same device at different physical offsets. Both stripes contain identical data. Provides protection against localized media errors but not device failure.

Logical:   [--------chunk------]
Physical:  [dev1: stripe 0     ]
           [dev1: stripe 1     ]  (different offset, same data)

DUP is the default metadata profile for single-device filesystems. The logical size of the chunk equals one stripe size. The physical space consumed is 2 * stripe_size.

In mkfs, DUP metadata stripes are laid out sequentially after the system group:

Physical layout on device 1:
  [0..1M)          reserved (superblock at 64K)
  [1M..5M)         system chunk (4 MiB)
  [5M..5M+meta)    metadata stripe 0
  [5M+meta..5M+2*meta)  metadata stripe 1
  [5M+2*meta..)    data stripe 0

RAID1

num_stripes = 2  (RAID1C3: 3, RAID1C4: 4)
sub_stripes = 0

One stripe per device, each containing identical data. RAID1 uses 2 devices, RAID1C3 uses 3, RAID1C4 uses 4.

Logical:   [--------chunk------]
Physical:  [dev1: stripe 0     ]
           [dev2: stripe 1     ]  (same data, different device)

For RAID1 metadata on a 2-device filesystem, mkfs places one stripe on each device at the same physical offset (CHUNK_START):

Device 1: [system][meta stripe 0][data stripe 0]
Device 2:        [meta stripe 1]

RAID0

num_stripes = N  (number of devices)
sub_stripes = 0

Data is striped across N devices in stripe_len-sized (64 KiB) units. No redundancy. The logical chunk size equals N * physical_stripe_size.

Logical:   [--A--][--B--][--C--][--A--][--B--][--C--]
Physical:  dev1: [--A--]       [--A--]
           dev2:        [--B--]       [--B--]
           dev3:               [--C--]       [--C--]

To resolve a logical address within a RAID0 chunk:

  1. offset = logical - chunk_start
  2. stripe_nr = offset / stripe_len
  3. stripe_index = stripe_nr % num_stripes
  4. stripe_offset = (stripe_nr / num_stripes) * stripe_len + (offset % stripe_len)
  5. Physical address = stripes[stripe_index].offset + stripe_offset
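The five steps above translate directly into code. A sketch (names are illustrative):

```rust
/// RAID0 stripe resolution, following steps 1-5 above.
/// Returns (stripe index, physical address on that stripe's device).
fn raid0_resolve(
    logical: u64,
    chunk_start: u64,
    stripe_len: u64,
    stripe_offsets: &[u64], // stripes[i].offset from the chunk item
) -> (usize, u64) {
    let offset = logical - chunk_start;
    let stripe_nr = offset / stripe_len;
    let num = stripe_offsets.len() as u64;
    let index = (stripe_nr % num) as usize;
    let stripe_offset = (stripe_nr / num) * stripe_len + offset % stripe_len;
    (index, stripe_offsets[index] + stripe_offset)
}

fn main() {
    const SL: u64 = 64 * 1024;
    let stripes = [0, 0, 0]; // three devices, all stripes at physical 0
    // The second 64K unit lands at the start of stripe 1.
    assert_eq!(raid0_resolve(SL, 0, SL, &stripes), (1, 0));
    // The fourth unit wraps back to stripe 0, one stripe_len further in.
    assert_eq!(raid0_resolve(3 * SL + 100, 0, SL, &stripes), (0, SL + 100));
}
```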

RAID10

num_stripes = N  (must be even, >= 4)
sub_stripes = 2

Striped mirrors: data is striped across N/2 mirror groups, each group having sub_stripes (2) copies. Combines RAID0 throughput with RAID1 redundancy.

RAID5 and RAID6

RAID5: num_stripes = N, sub_stripes = 0, one parity stripe
RAID6: num_stripes = N, sub_stripes = 0, two parity stripes

Data is striped with rotating parity. RAID5 tolerates one device failure; RAID6 tolerates two.

Allocation Sizing

When creating a new filesystem (mkfs), the initial chunk sizes are computed from the total device size. The formulas, implemented in mkfs/src/layout.rs (ChunkLayout::new), are:

System Block Group

Fixed size and position:

  • Offset: SYSTEM_GROUP_OFFSET = 1 MiB (0x100000)
  • Size: SYSTEM_GROUP_SIZE = 4 MiB (0x400000)
  • Profile: always SINGLE
  • Contains: the chunk tree root block

The first 1 MiB of the device is reserved. The primary superblock sits at offset 64 KiB within this reserved area.

Metadata Block Group

meta_size = clamp(total_bytes / 10, 32 MiB, 256 MiB)
meta_size = round_down(meta_size, STRIPE_LEN)

where STRIPE_LEN = 64 KiB and total_bytes is the sum across all devices.

The metadata chunk starts at logical offset CHUNK_START = 5 MiB (SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE). For DUP, two physical stripes are placed sequentially on device 1. For RAID1, one stripe is placed on each of the first two devices.

Examples:

  • 256 MiB device: clamp(25.6M, 32M, 256M) = 32 MiB
  • 1 GiB device: clamp(102.4M, 32M, 256M) = 102 MiB (rounded to 64K)
  • 10 GiB device: clamp(1G, 32M, 256M) = 256 MiB

Data Block Group

data_size = clamp(total_bytes / 10, 64 MiB, 1 GiB)
data_size = round_down(data_size, STRIPE_LEN)

The data chunk follows the metadata chunk in both logical and physical address spaces. Logical offset = CHUNK_START + meta_size.

Examples:

  • 256 MiB device: clamp(25.6M, 64M, 1G) = 64 MiB
  • 1 GiB device: clamp(102.4M, 64M, 1G) = 102 MiB (rounded to 64K)
  • 10 GiB device: clamp(1G, 64M, 1G) = 1 GiB
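Both sizing formulas share the same clamp-then-round shape, which can be sketched as (the helper name is illustrative, not the ChunkLayout API):

```rust
/// Initial chunk sizing as described above: clamp total/10 into a range,
/// then round down to a 64 KiB stripe boundary.
const STRIPE_LEN: u64 = 64 * 1024;
const MIB: u64 = 1024 * 1024;

fn chunk_size(total_bytes: u64, min: u64, max: u64) -> u64 {
    (total_bytes / 10).clamp(min, max) / STRIPE_LEN * STRIPE_LEN
}

fn main() {
    let gib = 1024 * MIB;
    // Metadata: clamp(total/10, 32M, 256M)
    assert_eq!(chunk_size(256 * MIB, 32 * MIB, 256 * MIB), 32 * MIB);
    assert_eq!(chunk_size(10 * gib, 32 * MIB, 256 * MIB), 256 * MIB);
    // 1 GiB device: ~102 MiB, rounded down to a 64K multiple.
    assert_eq!(chunk_size(gib, 32 * MIB, 256 * MIB), 107_347_968);
    // Data: clamp(total/10, 64M, 1G)
    assert_eq!(chunk_size(256 * MIB, 64 * MIB, gib), 64 * MIB);
}
```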

Minimum Device Size

For a single-device filesystem with DUP metadata and SINGLE data, the minimum physical space needed is:

1 MiB (reserved) + 4 MiB (system) + 2 * meta_size + data_size

With the minimum sizes (meta = 32 MiB, data = 64 MiB), this works out to approximately 133 MiB. A 100 MiB device will fail with “device too small”.

Physical Layout Summary

For a single-device DUP-metadata SINGLE-data filesystem:

Physical byte offset:
  [0 .. 1M)                          Reserved (superblock at 64K)
  [1M .. 5M)                         System block group (4 MiB)
  [5M .. 5M + meta_size)             Metadata stripe 0
  [5M + meta_size .. 5M + 2*meta)    Metadata stripe 1 (DUP copy)
  [5M + 2*meta .. 5M + 2*meta + data)  Data

Logical address space:
  [1M .. 5M)                         System chunk
  [5M .. 5M + meta_size)             Metadata chunk
  [5M + meta_size .. 5M + meta + data)  Data chunk

Note the physical space for DUP metadata is 2 * meta_size, but the logical address range is only meta_size. Both physical stripes map to the same logical range.

Cross-Checks

The btrfs check command (implemented in cli/src/check/chunks.rs) verifies the consistency of the chunk/block-group/device-extent triad.

Chunk-to-Block-Group Check

For every chunk in the chunk tree, there must be a matching block group item. The check walks the chunk tree cache and verifies that block_groups.contains_key(chunk.logical) for each chunk.

If a chunk has no corresponding block group, btrfs check reports:

ChunkMissingBlockGroup { logical }

Block-Group-to-Chunk Check

For every block group item (from the extent tree or block-group tree), there must be a matching chunk. The check verifies that chunk_cache.lookup(bg_logical) succeeds for each block group.

If a block group has no corresponding chunk, btrfs check reports:

BlockGroupMissingChunk { logical }

Device Extent Overlap Check

All device extents for each device are collected from the device tree, sorted by physical offset, and checked for overlaps. For consecutive extents on the same device, the check verifies:

extent[i].offset >= extent[i-1].offset + extent[i-1].length

If two device extents overlap, btrfs check reports:

DeviceExtentOverlap { devid, offset }
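The sort-and-compare pass can be sketched as follows (simplified: one device, extents as (offset, length) pairs; not the actual check code):

```rust
/// Overlap check for one device's extents: sort by physical offset,
/// then verify each extent starts at or after the previous one ends.
/// Returns the offsets of extents that overlap their predecessor.
fn find_overlaps(extents: &mut Vec<(u64, u64)>) -> Vec<u64> {
    extents.sort_by_key(|&(offset, _)| offset);
    extents
        .windows(2)
        .filter(|w| w[1].0 < w[0].0 + w[0].1)
        .map(|w| w[1].0)
        .collect()
}

fn main() {
    // The second extent starts 1 MiB before the first one ends.
    let mut extents = vec![(0, 4 << 20), (3 << 20, 4 << 20), (8 << 20, 4 << 20)];
    assert_eq!(find_overlaps(&mut extents), vec![3 << 20]);
}
```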

Block Group Source

When the BLOCK_GROUP_TREE compat_ro feature is enabled, block group items are stored in a separate tree (tree objectid 11) rather than in the extent tree. The check code handles both cases by selecting the appropriate tree root:

let bg_root = block_group_tree_root.unwrap_or(extent_root);

The Chunk Tree

The chunk tree (tree objectid 3) stores two kinds of items:

  1. DEV_ITEM entries for each device in the filesystem: (DEV_ITEMS_OBJECTID=1, DEV_ITEM=216, devid)

  2. CHUNK_ITEM entries for each chunk: (FIRST_CHUNK_TREE_OBJECTID=256, CHUNK_ITEM=228, logical_offset)

Items are sorted by key, so DEV_ITEMs (objectid 1) come before CHUNK_ITEMs (objectid 256).

The chunk tree root pointer is stored directly in the superblock’s chunk_root field — it does not go through the root tree like other trees. This is because the chunk tree is needed to read the root tree itself.

mkfs Chunk Tree Construction

When mkfs builds the chunk tree (build_chunk_tree in mkfs/src/mkfs.rs), it creates:

  1. One DEV_ITEM per device, with bytes_used set to the sum of all chunk stripes on that device.

  2. Three CHUNK_ITEM entries:

    • System chunk at SYSTEM_GROUP_OFFSET (1 MiB), size 4 MiB
    • Metadata chunk at CHUNK_START (5 MiB), with profile-dependent stripes
    • Data chunk after metadata, with profile-dependent stripes

The system chunk item uses sectorsize for io_align and io_width (matching the kernel’s bootstrap behavior), while the metadata and data chunks use STRIPE_LEN (64 KiB).

The Device Tree

The device tree (tree objectid 4) stores:

  1. DEV_STATS (PERSISTENT_ITEM) for each device: per-device I/O error counters, initialized to zero by mkfs.

  2. DEV_EXTENT entries for each physical stripe of each chunk.

Items are sorted by key: (objectid=devid, type=DEV_EXTENT, offset=physical_byte_offset).

mkfs Device Tree Construction

When mkfs builds the device tree (build_dev_tree in mkfs/src/mkfs.rs), it creates:

  1. One DEV_STATS item per device (zeroed counters).

  2. Device extents for each stripe:

    • System chunk: one DEV_EXTENT on device 1 at SYSTEM_GROUP_OFFSET
    • Metadata chunk: one DEV_EXTENT per stripe (two for DUP on device 1, or one per device for RAID1)
    • Data chunk: one DEV_EXTENT per stripe

All device tree items are collected, sorted by key, and written in order. This is necessary because items span multiple device IDs and physical offsets that are not naturally ordered by construction.

Superblock Mirrors

The superblock is written at up to three fixed physical offsets on each device:

Mirror  Offset   Size
0       64 KiB   4 KiB
1       64 MiB   4 KiB
2       256 GiB  4 KiB

The formula is: mirror 0 at 65536 bytes; mirror N (N > 0) at 16384 << (12 * N) bytes. Mirrors are only written if the device is large enough to contain them.
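As a sketch, the mirror placement rule can be written as a small function; this is illustrative, not the crate's actual code:

```rust
/// Physical byte offset of superblock mirror `n` (0..=2).
/// Mirror 0 is a special case at 64 KiB; mirrors 1 and 2 follow
/// the 16384 << (12 * n) progression.
fn superblock_mirror_offset(n: u32) -> u64 {
    if n == 0 {
        65_536
    } else {
        16_384u64 << (12 * n)
    }
}

fn main() {
    assert_eq!(superblock_mirror_offset(0), 64 * 1024);                // 64 KiB
    assert_eq!(superblock_mirror_offset(1), 64 * 1024 * 1024);         // 64 MiB
    assert_eq!(superblock_mirror_offset(2), 256 * 1024 * 1024 * 1024); // 256 GiB
}
```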

The superblock contains the sys_chunk_array bootstrap data, root pointers for the chunk tree and root tree, the embedded device item for device 1, and all filesystem-level metadata (UUID, label, feature flags, generation counter, bytes_used, etc.).

All three mirrors contain identical data for a given generation. On mount, the kernel reads all available mirrors and uses the one with the highest valid generation, providing resilience against corruption of the primary superblock.

Tree Block Placement in mkfs

During filesystem creation, tree blocks must be placed at specific logical addresses within the chunks. The BlockLayout struct in mkfs/src/layout.rs assigns addresses:

Chunk tree block: placed at SYSTEM_GROUP_OFFSET (1 MiB) in the system chunk. This is the only tree block in the system block group.

All other tree blocks (root, extent, device, FS, csum, free-space, data-reloc, and optionally block-group): placed sequentially in the metadata chunk starting at meta_logical = 5 MiB. With a 16 KiB nodesize:

Logical address       Tree
meta_logical + 0      Root tree
meta_logical + 16K    Extent tree
meta_logical + 32K    Device tree
meta_logical + 48K    FS tree
meta_logical + 64K    Csum tree
meta_logical + 80K    Free-space tree
meta_logical + 96K    Data-reloc tree
meta_logical + 112K   Block-group tree (if enabled)

For --rootdir mode, where trees may require multiple blocks, the BlockAllocator hands out sequential addresses from the system and metadata chunks, supporting trees of arbitrary size.

System chunk bytes used = nodesize (one chunk tree block). Metadata chunk bytes used = 7 * nodesize (or 8 with block-group tree).
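The per-tree addresses in the table above reduce to a base address plus a block index. A minimal sketch (the constant and function names are invented for illustration, not the crate's identifiers):

```rust
const META_LOGICAL: u64 = 5 * 1024 * 1024; // metadata chunk start (5 MiB)
const NODESIZE: u64 = 16 * 1024;           // 16 KiB tree blocks

/// Logical address of the i-th tree block in the metadata chunk
/// (root = 0, extent = 1, device = 2, FS = 3, ...).
fn tree_block_addr(index: u64) -> u64 {
    META_LOGICAL + index * NODESIZE
}

fn main() {
    assert_eq!(tree_block_addr(0), META_LOGICAL);              // root tree
    assert_eq!(tree_block_addr(2), META_LOGICAL + 32 * 1024);  // device tree
    assert_eq!(tree_block_addr(7), META_LOGICAL + 112 * 1024); // block-group tree
}
```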

Extent Tree and Backrefs

This document describes the btrfs extent tree: how every allocated byte on disk is tracked, how reference counting works, and how backreferences link extents to the trees and files that use them.

All multi-byte integers in btrfs on-disk structures are little-endian.

Purpose

The extent tree is the central allocator of the btrfs filesystem. It records every contiguous range of allocated disk space (both data extents used by files and metadata blocks used by trees) and tracks who references each extent.

The extent tree serves three purposes:

  1. Allocation tracking. The set of extent items defines which logical byte ranges are in use. The free-space tree (or free-space cache) is derived from the gaps between extent items.

  2. Reference counting. Each extent has a declared reference count. Snapshots and clones share extents by incrementing this count rather than copying data. When the count drops to zero, the extent can be freed.

  3. Backreferences. Each extent stores references back to the trees, inodes, and file offsets that use it. This enables the filesystem to find all users of an extent (for relocation during balance, for example) and to verify consistency (during btrfs check).

The extent tree is stored in tree objectid 2 (BTRFS_EXTENT_TREE_OBJECTID). Its root pointer is stored in the root tree via a ROOT_ITEM entry.

EXTENT_ITEM vs METADATA_ITEM

There are two key types used to record allocated extents:

EXTENT_ITEM (type 168)

The original extent item format, used for both data and metadata extents.

Key: (bytenr, EXTENT_ITEM, length)
      objectid = logical start    type = 168    offset = size in bytes

For data extents, length is the extent’s size on disk. For metadata extents (tree blocks), length equals the filesystem’s nodesize.

METADATA_ITEM (type 169) — Skinny Metadata

When the SKINNY_METADATA incompat feature is enabled (the default since Linux 3.10), metadata extents use a more compact key:

Key: (bytenr, METADATA_ITEM, level)
      objectid = logical start    type = 169    offset = tree level (0..7)

The extent’s length is implicitly nodesize (not stored in the key). The level field in the key offset records the B-tree level of the tree block, which is useful for verification without reading the block itself.

Skinny metadata items are called “skinny refs” because they eliminate the need for the btrfs_tree_block_info structure that non-skinny EXTENT_ITEM entries for tree blocks carry.

Key Differences

Aspect       EXTENT_ITEM (non-skinny)                                METADATA_ITEM (skinny)
Key type     168                                                     169
Key offset   nodesize                                                tree level (0..7)
Item body    extent_item + tree_block_info + inline refs             extent_item + inline refs
When used    Always for data; metadata only without skinny_metadata  Metadata only, with skinny_metadata

In mkfs, the choice is controlled by the skinny_metadata() config flag:

#![allow(unused)]
fn main() {
let (item_type, offset) = if skinny {
    (BTRFS_METADATA_ITEM_KEY, 0u64)   // level 0 for leaf blocks
} else {
    (BTRFS_EXTENT_ITEM_KEY, nodesize as u64)
};
}

The Extent Item Header

Both EXTENT_ITEM and METADATA_ITEM share the same header structure, btrfs_extent_item (24 bytes):

Field        Offset  Size  Description
refs         0       8     Total reference count
generation   8       8     Generation when allocated
flags        16      8     Extent type flags
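A minimal sketch of decoding this 24-byte header from raw bytes (illustrative only, not the parser in disk/src/items.rs):

```rust
/// Parsed btrfs_extent_item header (24 bytes, little-endian).
struct ExtentItemHeader {
    refs: u64,
    generation: u64,
    flags: u64,
}

fn parse_extent_item_header(buf: &[u8; 24]) -> ExtentItemHeader {
    let u64_at = |o: usize| u64::from_le_bytes(buf[o..o + 8].try_into().unwrap());
    ExtentItemHeader {
        refs: u64_at(0),
        generation: u64_at(8),
        flags: u64_at(16),
    }
}

fn main() {
    // refs = 2, generation = 100, flags = DATA (0x01)
    let mut raw = [0u8; 24];
    raw[0..8].copy_from_slice(&2u64.to_le_bytes());
    raw[8..16].copy_from_slice(&100u64.to_le_bytes());
    raw[16..24].copy_from_slice(&1u64.to_le_bytes());
    let h = parse_extent_item_header(&raw);
    assert_eq!((h.refs, h.generation, h.flags), (2, 100, 1));
}
```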

Extent Flags

The flags field uses these bits:

Flag          Value  Meaning
DATA          0x01   Extent holds file data
TREE_BLOCK    0x02   Extent holds a metadata tree block
FULL_BACKREF  0x80   Uses shared (parent-based) backrefs only

A data extent has flags = DATA (0x01). A metadata extent has flags = TREE_BLOCK (0x02). The FULL_BACKREF flag is set when the extent uses shared backreferences (after a snapshot) rather than normal tree backreferences.

The ExtentFlags type in disk/src/items.rs represents these flags as a bitflags struct.

Tree Block Info (Non-Skinny Only)

For non-skinny EXTENT_ITEM entries with TREE_BLOCK flag, the header is followed by btrfs_tree_block_info (25 bytes):

Field   Offset  Size  Description
key     0       17    First key in the tree block (btrfs_disk_key)
level   17      1     B-tree level of the block

This structure is omitted when using METADATA_ITEM (skinny metadata), since the level is stored in the key offset and the first key is not needed.

Full Item Layout

For a skinny metadata extent item with one inline TREE_BLOCK_REF:

Byte offset  Size  Content
0            8     refs (u64_le)
8            8     generation (u64_le)
16           8     flags = TREE_BLOCK (u64_le)
24           1     inline ref type = TREE_BLOCK_REF_KEY (176)
25           8     root objectid (u64_le)

Total: 33 bytes

For a data extent item with one inline EXTENT_DATA_REF:

Byte offset  Size  Content
0            8     refs (u64_le)
8            8     generation (u64_le)
16           8     flags = DATA (u64_le)
24           1     inline ref type = EXTENT_DATA_REF_KEY (178)
25           8     root (u64_le)
33           8     objectid (u64_le), the inode number
41           8     offset (u64_le), the file offset
49           4     count (u32_le)

Total: 53 bytes
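These layouts can be reproduced by serializing the fields in order. A hedged sketch (the function names are invented for illustration; the real builders live in the mkfs crate):

```rust
/// Encode a skinny metadata extent item with one inline TREE_BLOCK_REF.
fn encode_skinny_metadata_item(refs: u64, generation: u64, root: u64) -> Vec<u8> {
    const TREE_BLOCK: u64 = 0x02;
    const TREE_BLOCK_REF_KEY: u8 = 176;
    let mut buf = Vec::new();
    buf.extend_from_slice(&refs.to_le_bytes());
    buf.extend_from_slice(&generation.to_le_bytes());
    buf.extend_from_slice(&TREE_BLOCK.to_le_bytes());
    buf.push(TREE_BLOCK_REF_KEY);
    buf.extend_from_slice(&root.to_le_bytes());
    buf
}

/// Encode a data extent item with one inline EXTENT_DATA_REF.
fn encode_data_extent_item(
    refs: u64, generation: u64,
    root: u64, objectid: u64, offset: u64, count: u32,
) -> Vec<u8> {
    const DATA: u64 = 0x01;
    const EXTENT_DATA_REF_KEY: u8 = 178;
    let mut buf = Vec::new();
    buf.extend_from_slice(&refs.to_le_bytes());
    buf.extend_from_slice(&generation.to_le_bytes());
    buf.extend_from_slice(&DATA.to_le_bytes());
    buf.push(EXTENT_DATA_REF_KEY);
    buf.extend_from_slice(&root.to_le_bytes());
    buf.extend_from_slice(&objectid.to_le_bytes());
    buf.extend_from_slice(&offset.to_le_bytes());
    buf.extend_from_slice(&count.to_le_bytes());
    buf
}

fn main() {
    assert_eq!(encode_skinny_metadata_item(1, 100, 5).len(), 33);
    assert_eq!(encode_data_extent_item(1, 100, 5, 257, 0, 1).len(), 53);
}
```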

Inline Backrefs

After the extent item header (and tree_block_info for non-skinny metadata), zero or more inline backreferences are packed contiguously. Each inline ref starts with a 1-byte type code, followed by type-specific data.

Inline refs are the common case: they are stored directly inside the extent item, avoiding the overhead of separate B-tree items. When an extent item grows too large to fit in a leaf (due to many references), backrefs are stored as standalone items instead.

TREE_BLOCK_REF (type 176)

Direct backref from a metadata extent to the tree that owns it.

Field          Offset  Size  Description
type           0       1     176 (BTRFS_TREE_BLOCK_REF_KEY)
root objectid  1       8     u64_le

The root field identifies the tree that owns this metadata block. For example, root = 5 means the FS tree, root = 2 means the extent tree itself.

Total size: 9 bytes.

SHARED_BLOCK_REF (type 182)

Shared backref from a metadata extent to a parent tree block. Used when a tree block is shared between snapshots — the backref points to a parent node rather than a root.

Field          Offset  Size  Description
type           0       1     182 (BTRFS_SHARED_BLOCK_REF_KEY)
parent bytenr  1       8     u64_le

The parent field is the logical byte address of the tree node that contains a pointer to this extent.

Total size: 9 bytes.

EXTENT_DATA_REF (type 178)

Backref from a data extent to a specific file inode. This is the most common inline ref type for data extents.

Field     Offset  Size  Description
type      0       1     178 (BTRFS_EXTENT_DATA_REF_KEY)
root      1       8     Tree objectid owning the inode (u64_le)
objectid  9       8     Inode number (u64_le)
offset    17      8     File byte offset (u64_le)
count     25      4     Number of references (u32_le)

Note that unlike other inline ref types, EXTENT_DATA_REF does not have an 8-byte offset field between the type byte and the struct body. The struct starts immediately after the type byte. The parser in disk/src/items.rs handles this by reinterpreting the speculatively consumed offset bytes as the root field:

#![allow(unused)]
fn main() {
raw::BTRFS_EXTENT_DATA_REF_KEY => {
    let root = ref_offset; // already read as u64_le
    let oid = buf.get_u64_le();
    let off = buf.get_u64_le();
    let count = buf.get_u32_le();
    // ...
}
}

The count field represents how many times this particular (root, objectid, offset) triple references the extent. For a normal file with one reference, count = 1. For a file cloned via reflink, each clone adds a new EXTENT_DATA_REF with its own triple and count.

Total size: 29 bytes.

SHARED_DATA_REF (type 184)

Shared data backref, used when data extents are shared between snapshots.

Field          Offset  Size  Description
type           0       1     184 (BTRFS_SHARED_DATA_REF_KEY)
parent bytenr  1       8     u64_le
count          9       4     u32_le

Total size: 13 bytes.

EXTENT_OWNER_REF (type 172)

Simple ownership reference, used with the simple_quota feature. Records which tree root owns the extent without full backref details.

Field          Offset  Size  Description
type           0       1     172 (BTRFS_EXTENT_OWNER_REF_KEY)
root objectid  1       8     u64_le

Total size: 9 bytes.

Standalone Backrefs

When inline backrefs do not fit inside the extent item (because the item would exceed the available leaf space), they are stored as separate items in the extent tree. Standalone backrefs use the same type codes as inline refs but are encoded as independent key/value pairs.

Standalone TREE_BLOCK_REF

Key: (bytenr, TREE_BLOCK_REF, root_objectid)
      objectid = extent start    type = 176    offset = owning tree

Item payload: empty (zero bytes). The backref information is entirely in the key.

Standalone SHARED_BLOCK_REF

Key: (bytenr, SHARED_BLOCK_REF, parent_bytenr)
      objectid = extent start    type = 182    offset = parent block

Item payload: empty.

Standalone EXTENT_DATA_REF

Key: (bytenr, EXTENT_DATA_REF, hash)
      objectid = extent start    type = 178    offset = CRC32C hash

The key offset is a hash of (root, objectid, offset) computed by:

#![allow(unused)]
fn main() {
fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = raw_crc32c(!0u32, &root.to_le_bytes());
    let low_crc = raw_crc32c(!0u32, &objectid.to_le_bytes());
    let low_crc = raw_crc32c(low_crc, &offset.to_le_bytes());
    (u64::from(high_crc) << 31) ^ u64::from(low_crc)
}
}

This hash function uses raw CRC32C (seed = !0, i.e. 0xFFFFFFFF, without final complement) applied independently to the root (high part) and objectid+offset (low part), then combined with a shift and XOR.
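For illustration, here is a self-contained version of the hash with a bitwise CRC32C in place of the project's raw_crc32c helper. This is a sketch, not the code in disk/src/items.rs:

```rust
/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78),
/// without the final bit complement ("raw" form).
fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &b in data {
        crc ^= u32::from(b);
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = raw_crc32c(!0u32, &root.to_le_bytes());
    let low_crc = raw_crc32c(!0u32, &objectid.to_le_bytes());
    let low_crc = raw_crc32c(low_crc, &offset.to_le_bytes());
    (u64::from(high_crc) << 31) ^ u64::from(low_crc)
}

fn main() {
    // Standard CRC-32C check value for "123456789" (with the final
    // complement applied on top of the raw result).
    assert_eq!(raw_crc32c(!0u32, b"123456789") ^ !0u32, 0xE306_9283);

    let h = extent_data_ref_hash(5, 257, 0);
    assert_eq!(h, extent_data_ref_hash(5, 257, 0)); // deterministic
    assert!(h < 1u64 << 63);                        // fits in 63 bits
}
```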

Item payload (28 bytes):

Field     Offset  Size  Description
root      0       8     u64_le
objectid  8       8     u64_le
offset    16      8     u64_le
count     24      4     u32_le

Standalone SHARED_DATA_REF

Key: (bytenr, SHARED_DATA_REF, parent_bytenr)
      objectid = extent start    type = 184    offset = parent block

Item payload (4 bytes):

Field  Offset  Size  Description
count  0       4     u32_le

Reference Counting

The refs Field

The refs field in btrfs_extent_item is the declared total reference count for the extent. It equals the sum of all references from both inline and standalone backrefs.

For TREE_BLOCK_REF, SHARED_BLOCK_REF, and EXTENT_OWNER_REF, each backref contributes 1 to the total. For EXTENT_DATA_REF and SHARED_DATA_REF, each backref contributes its count field to the total.

Counting Rules

The total reference count is computed as:

total = 0
for each inline ref:
    if EXTENT_DATA_REF:  total += count
    if SHARED_DATA_REF:  total += count
    otherwise:           total += 1
for each standalone ref:
    if EXTENT_DATA_REF:  total += count  (from item payload)
    if SHARED_DATA_REF:  total += count  (from item payload)
    otherwise:           total += 1

The declared refs in the extent item header must equal this computed total. A mismatch indicates corruption.
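The counting rules above can be sketched as a single accumulation. The enum is an illustrative stand-in, not a type from the crates:

```rust
/// Backref kinds relevant to reference counting (illustrative only).
enum Backref {
    TreeBlockRef,
    SharedBlockRef,
    ExtentOwnerRef,
    ExtentDataRef { count: u32 },
    SharedDataRef { count: u32 },
}

/// Total reference count implied by a set of backrefs; this must equal
/// the `refs` field declared in the extent item header.
fn total_refs(backrefs: &[Backref]) -> u64 {
    backrefs
        .iter()
        .map(|b| match b {
            Backref::ExtentDataRef { count } | Backref::SharedDataRef { count } => {
                u64::from(*count)
            }
            // TREE_BLOCK_REF, SHARED_BLOCK_REF, EXTENT_OWNER_REF each count once.
            _ => 1,
        })
        .sum()
}

fn main() {
    // Two data refs with count = 1 each, as after taking a snapshot.
    let refs = [
        Backref::ExtentDataRef { count: 1 },
        Backref::ExtentDataRef { count: 1 },
    ];
    assert_eq!(total_refs(&refs), 2); // must match the declared refs field
}
```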

Example: Simple File

A newly created file with one 4 KiB extent in the FS tree (root 5):

Key: (bytenr, EXTENT_ITEM, 4096)
  refs = 1
  generation = 100
  flags = DATA
  inline EXTENT_DATA_REF:
    root = 5, objectid = 257, offset = 0, count = 1

Total refs: count(1) = 1. Matches declared refs.

Example: Snapshot

After taking a snapshot of the FS tree, the same extent is now referenced by both the original and the snapshot. The extent item is updated:

Key: (bytenr, EXTENT_ITEM, 4096)
  refs = 2
  generation = 100
  flags = DATA
  inline EXTENT_DATA_REF:
    root = 5, objectid = 257, offset = 0, count = 1
  inline EXTENT_DATA_REF:
    root = 260, objectid = 257, offset = 0, count = 1

Total refs: count(1) + count(1) = 2. Matches declared refs.

A reflink clone within the same tree adds another backref with a different file offset:

Key: (bytenr, EXTENT_ITEM, 4096)
  refs = 2
  generation = 100
  flags = DATA
  inline EXTENT_DATA_REF:
    root = 5, objectid = 257, offset = 0, count = 1
  inline EXTENT_DATA_REF:
    root = 5, objectid = 258, offset = 0, count = 1

Example: Metadata Block

A metadata block owned by the FS tree:

Key: (bytenr, METADATA_ITEM, 0)    // level 0 = leaf
  refs = 1
  generation = 100
  flags = TREE_BLOCK
  inline TREE_BLOCK_REF:
    root = 5

Data Extent Backrefs in Detail

The EXTENT_DATA_REF Triple

Each data extent backref identifies its user by a (root, objectid, offset) triple:

  • root: the tree objectid containing the referencing inode. For user files this is the FS tree (5) or a subvolume/snapshot tree ID.

  • objectid: the inode number of the file that references the extent. Regular file inodes start at 257 (BTRFS_FIRST_FREE_OBJECTID + 1).

  • offset: the byte offset within the file where this extent is referenced. This is the key offset of the EXTENT_DATA item in the FS tree.

The count Field

The count field records how many times the exact same (root, objectid, offset) triple references this extent. In normal operation, count = 1. It can be greater than 1 in specific scenarios involving log replay or certain reflink patterns.

Hash Computation for Standalone Keys

When an EXTENT_DATA_REF is stored as a standalone item, the key offset is not the file offset but rather a hash of the full triple. This allows multiple data refs with different triples to be stored as separate items under the same extent bytenr.

The hash function (from disk/src/items.rs) computes:

high = CRC32C(seed=0xFFFFFFFF, root_le_bytes)
low  = CRC32C(seed=0xFFFFFFFF, objectid_le_bytes)
low  = CRC32C(seed=low,        offset_le_bytes)
hash = (high << 31) ^ low

This produces a 63-bit hash: the high CRC occupies bits 31..62 (overlapping the low CRC at bit 31), so the top bit of the u64 is always zero. The hash is deterministic, and the same function is used by both the kernel and userspace tools.

Metadata Extent Backrefs in Detail

TREE_BLOCK_REF

A TREE_BLOCK_REF links a metadata block to the tree that owns it. The root field is the tree’s objectid:

  • 1 = root tree
  • 2 = extent tree
  • 3 = chunk tree
  • 4 = device tree
  • 5 = FS tree (default subvolume)
  • 6 = csum tree
  • 7 = quota tree
  • 10 = free-space tree
  • ≥ 256 = subvolume/snapshot trees

SHARED_BLOCK_REF

When a tree block is shared between a subvolume and its snapshot, the normal TREE_BLOCK_REF is replaced with a SHARED_BLOCK_REF that points to the parent node. This happens because the same physical block cannot be “owned” by two different trees simultaneously.

The parent field is the logical bytenr of the tree node whose key pointer array includes this block. When the filesystem needs to modify a shared block, it performs copy-on-write: allocating a new block, copying the data, and updating the parent’s pointer. This is how snapshots achieve their constant-time creation — they share all blocks with the source subvolume.

FULL_BACKREF Flag

The FULL_BACKREF flag in the extent item’s flags field indicates that this metadata extent uses only shared backrefs (no direct tree backrefs). This typically happens for tree blocks at levels > 0 after a snapshot, where the ownership is ambiguous until the block is CoW’d.

Cross-Referencing with Tree Ownership

btrfs check collects a map of (block_address -> owning_tree) during its tree walks. The owning tree for each block is determined by the owner field in the block’s header (btrfs_header). This map is then cross-referenced against the extent tree’s TREE_BLOCK_REF entries in both directions.

Block Group Items in the Extent Tree

Historically, BLOCK_GROUP_ITEM entries were stored directly in the extent tree alongside extent items. With the BLOCK_GROUP_TREE compat_ro feature (default since btrfs-progs 6.x), they are moved to a separate tree (objectid 11).

BLOCK_GROUP_ITEM Structure

Key: (logical_offset, BLOCK_GROUP_ITEM, length)
      objectid = group start    type = 192    offset = group size

Item payload (24 bytes):

Field           Offset  Size  Description
used            0       8     Bytes allocated within this block group
chunk_objectid  8       8     FIRST_CHUNK_TREE_OBJECTID (256)
flags           16      8     Type flag (DATA, METADATA, or SYSTEM) plus RAID profile bits

The used field tracks how many bytes of the block group are currently allocated to extents. For a new filesystem:

  • System block group: used = one nodesize (the chunk tree block)
  • Metadata block group: used = N * nodesize (all non-chunk tree blocks)
  • Data block group: used = 0 (no file data yet)

Ordering in the Extent Tree

When block group items are in the extent tree, they sort among the extent items by key. Since BLOCK_GROUP_ITEM has type 192 and EXTENT_ITEM has type 168 / METADATA_ITEM has type 169, block group items for a given logical offset sort after any extent item at the same address (because key comparison is (objectid, type, offset) and 192 > 169).

mkfs Construction

mkfs creates three block group items, one for each chunk:

#![allow(unused)]
fn main() {
add_block_group_items(extent_items, cfg, layout, chunks, data_used);
}

This adds entries for the system (SYSTEM flag), metadata (METADATA | profile flag), and data (DATA | profile flag) block groups.

When the BLOCK_GROUP_TREE feature is enabled, these items are placed in a separate tree instead (build_block_group_tree_with_used).

What btrfs check Verifies

The extent tree checker (implemented in cli/src/check/extents.rs) performs several categories of verification.

Reference Count Matching

For each extent item (EXTENT_ITEM or METADATA_ITEM) and its associated standalone backrefs, the checker computes the total reference count from inline + standalone refs and compares it to the declared refs field:

#![allow(unused)]
fn main() {
if state.pending_refs != state.pending_counted {
    results.report(CheckError::ExtentRefMismatch {
        bytenr, expected: state.pending_refs, found: state.pending_counted,
    });
}
}

The checker processes items in key order. When it encounters a new EXTENT_ITEM or METADATA_ITEM, it “flushes” the previous extent (checking its ref count) and begins accumulating refs for the new one. Standalone backref items (TREE_BLOCK_REF, SHARED_BLOCK_REF, EXTENT_DATA_REF, SHARED_DATA_REF, EXTENT_OWNER_REF) that follow an extent item with a matching objectid add to the running count.

Extent Overlap Detection

Extents in the extent tree are sorted by logical address. The checker tracks the end address of the previous extent and reports an error if the next extent starts before the previous one ends:

#![allow(unused)]
fn main() {
if length > 0 && bytenr < state.prev_end && state.prev_end > 0 {
    results.report(CheckError::OverlappingExtent {
        bytenr, length, prev_end: state.prev_end,
    });
}
}

Note that METADATA_ITEM entries store the tree level (not the length) in the key offset. Since the checker does not have access to the nodesize at this point, it uses length = 0 for metadata items and skips overlap detection for them.

Backref Owner Cross-Checks (Direction 1: Walk to Extent)

During tree walks in earlier check phases, the checker builds a map of tree_block_owners: HashMap<u64, u64> mapping each tree block’s logical address to the tree objectid that owns it (from the block header’s owner field).

After processing the extent tree, the checker verifies that every block encountered during walks has an extent item:

#![allow(unused)]
fn main() {
if !state.extent_item_addrs.contains(&addr) {
    results.report(CheckError::MissingExtentItem { bytenr: addr });
}
}

And that the extent tree’s backrefs agree with the actual owner:

#![allow(unused)]
fn main() {
if !claimed_owners.contains(&actual_owner) {
    results.report(CheckError::BackrefOwnerMismatch {
        bytenr: addr, actual_owner, claimed_owners,
    });
}
}

Backref Owner Cross-Checks (Direction 2: Extent to Walk)

The checker also verifies the reverse: every TREE_BLOCK_REF in the extent tree (both inline and standalone) must correspond to a tree block that was actually encountered during walks and is owned by the claimed tree:

#![allow(unused)]
fn main() {
let actual = tree_block_owners.get(&addr).copied();
if actual != Some(claimed) {
    results.report(CheckError::BackrefOrphan {
        bytenr: addr, claimed_owner: claimed,
    });
}
}

This catches “orphan” backrefs that point to blocks that either do not exist or are owned by a different tree than claimed.

Data Byte Accounting

The checker accumulates two statistics from data extents:

  • data_bytes_allocated: the sum of length for all data extent items. This is the total physical space reserved for data.

  • data_bytes_referenced: the sum of length * count for all data extent references. When data is shared (via snapshots or reflinks), referenced bytes exceed allocated bytes.

For inline-only data refs (no standalone ExtentDataRef items), referenced bytes are computed from the inline ref count. For standalone refs, each EXTENT_DATA_REF and SHARED_DATA_REF item contributes length * count.

Extent Item Construction in mkfs

Metadata Extent Items

For each tree block allocated during mkfs, the extent tree receives a metadata extent item with one inline TREE_BLOCK_REF:

#![allow(unused)]
fn main() {
fn metadata_extent_item(addr: u64, skinny: bool, generation: u64, owner: u64, nodesize: u64) -> (Key, Vec<u8>) {
    let (item_type, offset) = if skinny {
        (BTRFS_METADATA_ITEM_KEY, 0u64)     // offset = level 0
    } else {
        (BTRFS_EXTENT_ITEM_KEY, nodesize)    // offset = nodesize
    };
    (
        Key::new(addr, item_type, offset),
        extent_item(1, generation, skinny, owner),
    )
}
}

The extent_item() function serializes:

  1. btrfs_extent_item header: refs=1, generation, flags=TREE_BLOCK
  2. For non-skinny: zero-filled btrfs_tree_block_info (25 bytes)
  3. Inline TREE_BLOCK_REF: type byte (176) + root objectid (8 bytes)

Total item size: 33 bytes (skinny) or 58 bytes (non-skinny).

Data Extent Items

For each data extent written during --rootdir mode, the extent tree receives a data extent item with one inline EXTENT_DATA_REF:

#![allow(unused)]
fn main() {
fn data_extent_item(refs: u64, generation: u64, root: u64, objectid: u64, offset: u64, count: u32) -> Vec<u8> {
    let mut buf = Vec::new();
    // btrfs_extent_item header
    buf.put_u64_le(refs);
    buf.put_u64_le(generation);
    buf.put_u64_le(BTRFS_EXTENT_FLAG_DATA);
    // inline EXTENT_DATA_REF
    buf.put_u8(BTRFS_EXTENT_DATA_REF_KEY);
    buf.put_u64_le(root);
    buf.put_u64_le(objectid);
    buf.put_u64_le(offset);
    buf.put_u32_le(count);
    buf
}
}

Total item size: 53 bytes. The key is (extent_bytenr, EXTENT_ITEM, extent_length).

Self-Referential Convergence

The extent tree must contain entries for its own tree blocks. But the number of tree blocks needed depends on how many items the tree contains, which depends on how many extent items there are, which depends on the number of tree blocks… This creates a circular dependency.

The --rootdir code path solves this with a convergence loop (converge_extent_tree_block_count in mkfs/src/mkfs.rs):

  1. Start with extent_tree_block_count = 1.
  2. Build a trial extent tree with all items (including placeholder entries for the extent tree’s own blocks).
  3. If the trial tree’s actual block count differs from the assumed count, update the count and repeat.
  4. The loop converges quickly (usually in 1-2 iterations) because adding extent items for additional blocks only marginally increases the tree size.

After convergence, the real extent tree is built with actual logical addresses assigned by the BlockAllocator.
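The convergence loop can be sketched as a fixed-point iteration. The capacity model here (a flat items-per-leaf bound) and the function name are deliberately simplified for illustration:

```rust
/// Illustrative fixed-point loop: the extent tree must describe its own
/// blocks, so assume a count, rebuild, and repeat until it stops changing.
fn converge_block_count(other_items: u64, items_per_leaf: u64) -> u64 {
    let mut blocks = 1u64;
    loop {
        // One extent item is needed per extent-tree block itself.
        let total_items = other_items + blocks;
        let needed = ((total_items + items_per_leaf - 1) / items_per_leaf).max(1);
        if needed == blocks {
            return blocks;
        }
        blocks = needed;
    }
}

fn main() {
    assert_eq!(converge_block_count(10, 100), 1);  // converges immediately
    assert_eq!(converge_block_count(199, 100), 3); // grows once, then settles
}
```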

Extent Tree Key Ordering

Items in the extent tree are sorted by the standard btrfs key comparison (objectid, type, offset). Since objectid is the extent’s logical byte address, items are effectively sorted by logical address.

Within a single extent’s address, the ordering is:

  1. EXTENT_ITEM or METADATA_ITEM (type 168 or 169) — the extent header
  2. EXTENT_OWNER_REF (type 172) — if simple quotas are enabled
  3. TREE_BLOCK_REF (type 176) — standalone metadata backrefs
  4. EXTENT_DATA_REF (type 178) — standalone data backrefs
  5. SHARED_BLOCK_REF (type 182) — standalone shared metadata backrefs
  6. SHARED_DATA_REF (type 184) — standalone shared data backrefs
  7. BLOCK_GROUP_ITEM (type 192) — if not using block-group tree

This ordering is a natural consequence of the type field values and ensures that btrfs check can process all backrefs for an extent by reading items sequentially until the objectid (bytenr) changes.

Relationship to File Extents

The connection between the extent tree and actual file data flows through EXTENT_DATA items in FS trees:

FS tree: (inode, EXTENT_DATA, file_offset)
  -> disk_bytenr, disk_num_bytes, offset, num_bytes

Extent tree: (disk_bytenr, EXTENT_ITEM, disk_num_bytes)
  -> refs, generation, flags=DATA
  -> inline EXTENT_DATA_REF(root, inode, file_offset, count)

The disk_bytenr in the file extent item is the logical address of the data extent. The extent tree entry at that address records who references the extent and how many times.

For inline file extents (small files where data is embedded directly in the tree leaf), there is no corresponding extent tree entry — the data does not occupy a separate extent.

For hole/sparse extents (disk_bytenr = 0), there is similarly no extent tree entry. The no-holes feature eliminates explicit hole extent items entirely.

Summary of Key Formats

Item type         Key                     Payload
EXTENT_ITEM       (bytenr, 168, length)   extent_item + inline refs
METADATA_ITEM     (bytenr, 169, level)    extent_item + inline refs
EXTENT_OWNER_REF  (bytenr, 172, root)     (empty)
TREE_BLOCK_REF    (bytenr, 176, root)     (empty)
EXTENT_DATA_REF   (bytenr, 178, hash)     extent_data_ref (28 bytes)
SHARED_BLOCK_REF  (bytenr, 182, parent)   (empty)
SHARED_DATA_REF   (bytenr, 184, parent)   shared_data_ref (4 bytes)
BLOCK_GROUP_ITEM  (logical, 192, length)  block_group_item (24 bytes)

All bytenr values are logical byte addresses. The extent tree provides the complete picture of space allocation and ownership across the entire filesystem.

Btrfs Transaction Infrastructure: On-Disk Format Specification

This document is the sole reference for implementing the btrfs-transaction crate. It describes the on-disk format, invariants, and protocols needed to safely modify a btrfs filesystem from userspace.

Tree block layout

A btrfs filesystem stores its metadata in a B-tree. Each tree block (also called a node or extent buffer) is nodesize bytes (typically 16,384, but can be 4,096 to 65,536). Tree blocks are identified by their logical byte address (bytenr), which is translated to a physical device offset via the chunk tree.

Every tree block begins with a 101-byte header, followed by either leaf items (level 0) or internal node key pointers (level > 0).

Header (101 bytes)

All multi-byte integers are little-endian on disk.

Offset  Size  Field            Description
0       32    csum             Checksum of bytes 32..nodesize (header fields after csum + all payload). Algorithm determined by superblock csum_type. Zero-padded: for CRC32C only bytes 0..3 are meaningful.
32      16    fsid             Filesystem UUID. Must match superblock fsid (or metadata_uuid if METADATA_UUID incompat flag is set).
48      8     bytenr           Logical byte address of this block. Must match the address used to read/write it.
56      8     flags            Bits 0..55: header flags (currently unused by userspace). Bits 56..63: backref revision (1 = mixed backrefs, the modern format).
64      16    chunk_tree_uuid  UUID of the chunk tree that maps this block’s logical address to physical. Typically the same for all blocks on a single-device fs.
80      8     generation       Transaction generation when this block was last written. Critical for COW: a block with generation == current transaction has already been COWed and can be modified in place.
88      8     owner            Tree ID that owns this block (e.g. 1 for root tree, 2 for extent tree, 5 for default fs tree). Used for backref accounting.
96      4     nritems          Number of items (leaf) or key pointers (node).
100     1     level            B-tree level. 0 = leaf, 1..7 = internal node. Maximum level is 7 (BTRFS_MAX_LEVEL = 8 levels total, 0-indexed).
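A sketch of pulling the fixed-offset fields out of a raw block, following the table above (illustrative only, not the crate's actual parser):

```rust
/// Selected fields of a tree block header (101 bytes, little-endian).
struct Header {
    bytenr: u64,
    generation: u64,
    owner: u64,
    nritems: u32,
    level: u8,
}

fn parse_header(block: &[u8]) -> Header {
    let u64_at = |o: usize| u64::from_le_bytes(block[o..o + 8].try_into().unwrap());
    Header {
        bytenr: u64_at(48),
        generation: u64_at(80),
        owner: u64_at(88),
        nritems: u32::from_le_bytes(block[96..100].try_into().unwrap()),
        level: block[100],
    }
}

fn main() {
    let mut block = vec![0u8; 16384];
    block[48..56].copy_from_slice(&30_408_704u64.to_le_bytes()); // bytenr
    block[88..96].copy_from_slice(&5u64.to_le_bytes());          // owner = FS tree
    block[96..100].copy_from_slice(&12u32.to_le_bytes());        // nritems
    block[100] = 0;                                              // level 0 = leaf
    let h = parse_header(&block);
    assert_eq!(h.bytenr, 30_408_704);
    assert_eq!((h.owner, h.nritems, h.level), (5, 12, 0));
}
```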

Key (17 bytes)

Every item and pointer in the B-tree is identified by a three-part key. On disk this is the btrfs_disk_key (little-endian):

Offset  Size  Field     Description
0       8     objectid  Primary identifier (inode number, tree ID, extent bytenr, etc. depending on key type).
8       1     type      Key type discriminator (see section 7).
9       8     offset    Type-specific secondary value (file offset, extent size, parent ID, etc.).

Keys are compared as a tuple (objectid, type, offset) in that order, all as unsigned integers. This defines the sort order within every B-tree.
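Deriving lexicographic ordering over a struct with the fields in that order reproduces this comparison exactly; a small sketch:

```rust
/// A disk key compares as the tuple (objectid, type, offset).
/// Deriving Ord with the fields in that order gives exactly this order.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Key {
    objectid: u64,
    key_type: u8,
    offset: u64,
}

fn main() {
    let extent_item = Key { objectid: 30_408_704, key_type: 168, offset: 4096 };
    let block_group = Key { objectid: 30_408_704, key_type: 192, offset: 1 << 30 };
    let next_extent = Key { objectid: 30_412_800, key_type: 168, offset: 4096 };
    // Type breaks ties within one objectid; objectid dominates across extents.
    assert!(extent_item < block_group);
    assert!(block_group < next_extent);
}
```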

Leaf layout (level 0)

A leaf contains item descriptors that grow forward from the header, and item data payloads that grow backward from the end of the block. Free space is the gap between them.

Byte 0..100:                    Header
Byte 101..101+nritems*25-1:     Item descriptors [item0, item1, ..., itemN-1]
                                (25 bytes each, sorted by key ascending)
  ...free space...
Byte X..nodesize-1:             Item data [dataN-1, ..., data1, data0]
                                (packed from the end of the block backward)

Each item descriptor is 25 bytes:

Offset  Size  Field   Description
0       17    key     The item’s key (btrfs_disk_key).
17      4     offset  Byte offset of this item’s data payload, relative to the start of the data area (byte 101). To get the absolute position in the block: absolute = 101 + offset.
21      4     size    Size of the item’s data payload in bytes.

Invariants:

  • Items are sorted by key in ascending order.
  • Item data regions must not overlap.
  • The last item’s data starts at 101 + item[N-1].offset and extends for item[N-1].size bytes. Items with lower indices have data at higher offsets (data grows backward).
  • The first item’s data ends at 101 + item[0].offset + item[0].size, which must be <= nodesize.
  • Free space = (101 + item[N-1].offset) - (101 + nritems * 25). When this is < 25 + data_size for a new item, the leaf is full.

Data offset convention:

The offset field in btrfs_item counts from byte 101 (immediately after the header), not from the start of the block. When constructing a new leaf:

  1. Start data_end at nodesize.
  2. For each item (in key order): data_end -= data.len(), write data at data_end, store offset = data_end - 101 in the item descriptor.
  3. Item descriptors are written at 101 + i * 25.
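The steps above can be sketched as a function that computes the descriptor offsets for a new leaf (the item sizes here are made up for illustration):

```rust
/// Compute (offset, size) descriptor pairs for a freshly built leaf,
/// following the backward-growing data convention.
fn leaf_offsets(nodesize: u32, item_sizes: &[u32]) -> Vec<(u32, u32)> {
    const HEADER: u32 = 101;
    let mut data_end = nodesize;
    item_sizes
        .iter()
        .map(|&size| {
            data_end -= size;          // step 2: data grows backward from the end
            (data_end - HEADER, size)  // offset is relative to byte 101
        })
        .collect()
}

fn main() {
    // Two items of 160 and 53 bytes in a 16 KiB leaf.
    let offs = leaf_offsets(16_384, &[160, 53]);
    assert_eq!(offs[0], (16_123, 160)); // 16384 - 160 - 101
    assert_eq!(offs[1], (16_070, 53));  // 16384 - 160 - 53 - 101
}
```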

Internal node layout (level > 0)

An internal node contains key pointers that identify child subtrees.

Byte 0..100:                    Header
Byte 101..101+nritems*33-1:     Key pointers [ptr0, ptr1, ..., ptrN-1]
                                (33 bytes each, sorted by key ascending)

Each key pointer is 33 bytes:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 17 | key | Lowest key in the child subtree. |
| 17 | 8 | blockptr | Logical byte address of the child block. |
| 25 | 8 | generation | Generation of the child block (used for consistency checking during reads). |

Invariants:

  • Key pointers are sorted by key in ascending order.
  • blockptr must be a valid, allocated logical address.
  • generation must match the generation in the child block’s header.

Maximum capacities

For a given nodesize:

  • Leaf items per block: depends on item data size. The theoretical maximum number of zero-size items is (nodesize - 101) / 25 = 651 for 16 KiB.
  • Key pointers per node: (nodesize - 101) / 33 = 493 for 16 KiB.
  • Maximum tree depth: 8 levels (BTRFS_MAX_LEVEL). In practice, trees rarely exceed 3-4 levels.
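The capacity formulas are simple integer divisions; a minimal sketch using the sizes above:

```rust
const HEADER_SIZE: u32 = 101;

/// Theoretical maximum number of zero-size items in a leaf.
fn max_leaf_items(nodesize: u32) -> u32 {
    (nodesize - HEADER_SIZE) / 25
}

/// Maximum key pointers in an internal node.
fn max_node_ptrs(nodesize: u32) -> u32 {
    (nodesize - HEADER_SIZE) / 33
}
```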

Superblock

The superblock is the entry point for reading a btrfs filesystem. It is a 4,096-byte structure stored at fixed offsets on every device:

  • Mirror 0: byte 65,536 (64 KiB)
  • Mirror 1: byte 67,108,864 (64 MiB)
  • Mirror 2: byte 274,877,906,944 (256 GiB), only if device is large enough
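A mirror is usable only if the whole 4,096-byte structure fits on the device. Enumerating the valid mirror offsets can be sketched as follows (an illustrative helper, not a published API):

```rust
/// Fixed superblock mirror offsets: 64 KiB, 64 MiB, 256 GiB.
const SUPERBLOCK_OFFSETS: [u64; 3] = [64 << 10, 64 << 20, 256 << 30];

/// Mirror offsets whose full 4 KiB superblock fits on the device.
fn valid_mirrors(device_size: u64) -> Vec<u64> {
    SUPERBLOCK_OFFSETS
        .iter()
        .copied()
        .filter(|&off| off + 4096 <= device_size)
        .collect()
}
```

A 1 GiB device carries mirrors 0 and 1; mirror 2 only appears on devices larger than 256 GiB.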

Superblock layout (4,096 bytes)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 32 | csum | Checksum of bytes 32..4095. |
| 32 | 16 | fsid | Filesystem UUID. |
| 48 | 8 | bytenr | Physical byte offset of this copy. |
| 56 | 8 | flags | BTRFS_SUPER_FLAG_* bits. |
| 64 | 8 | magic | 0x4D5F53665248425F (ASCII “_BHRfS_M” stored little-endian). |
| 72 | 8 | generation | Current transaction generation. |
| 80 | 8 | root | Logical bytenr of root tree root block. |
| 88 | 8 | chunk_root | Logical bytenr of chunk tree root block. |
| 96 | 8 | log_root | Logical bytenr of log tree root (0 if none). |
| 104 | 8 | __unused_log_root_transid | Deprecated, always 0. |
| 112 | 8 | total_bytes | Total usable bytes across all devices. |
| 120 | 8 | bytes_used | Total bytes allocated to extents. |
| 128 | 8 | root_dir_objectid | Always 6 (BTRFS_ROOT_TREE_DIR_OBJECTID). |
| 136 | 8 | num_devices | Number of devices. |
| 144 | 4 | sectorsize | Minimum I/O unit (typically 4096). |
| 148 | 4 | nodesize | Tree block size (typically 16384). |
| 152 | 4 | __unused_leafsize | Legacy, always equal to nodesize. |
| 156 | 4 | stripesize | RAID stripe unit (typically 65536). |
| 160 | 4 | sys_chunk_array_size | Valid bytes in the sys_chunk_array field. |
| 164 | 8 | chunk_root_generation | Generation of the chunk tree root. |
| 172 | 8 | compat_flags | Compatible feature flags. |
| 180 | 8 | compat_ro_flags | Read-only compatible feature flags. |
| 188 | 8 | incompat_flags | Incompatible feature flags. |
| 196 | 2 | csum_type | Checksum algorithm (0=CRC32C, 1=xxhash, 2=SHA256, 3=BLAKE2). |
| 198 | 1 | root_level | B-tree level of root tree root. |
| 199 | 1 | chunk_root_level | B-tree level of chunk tree root. |
| 200 | 1 | log_root_level | B-tree level of log tree root. |
| 201 | 98 | dev_item | Embedded device item for this device (see section 6.4). |
| 299 | 256 | label | NUL-terminated filesystem label. |
| 555 | 8 | cache_generation | Free space cache v1 generation. |
| 563 | 8 | uuid_tree_generation | UUID tree last-updated generation. |
| 571 | 16 | metadata_uuid | Metadata UUID (if METADATA_UUID flag set). |
| 587 | 8 | nr_global_roots | Global root count (extent-tree-v2, rare). |
| 595 | 8 | remap_root | Remap tree bytenr. |
| 603 | 8 | remap_root_generation | Remap tree generation. |
| 611 | 1 | remap_root_level | Remap tree level. |
| 612 | 199 | reserved | Zero-filled. |
| 811 | 2048 | sys_chunk_array | Bootstrap chunk tree entries (key + chunk item pairs, packed sequentially). |
| 2859 | 672 | super_roots | 4 rotating backup root entries (168 bytes each). See section 2.3. |
| 3531 | 565 | padding | Zero-filled to 4096. |

Fields updated on every transaction commit

When committing a transaction, the following superblock fields are updated:

  1. generation — incremented by 1.
  2. root — logical bytenr of the (possibly new) root tree root block.
  3. root_level — level of the root tree root.
  4. chunk_root — logical bytenr of the chunk tree root (if chunk tree was modified).
  5. chunk_root_generation — generation of the chunk tree root.
  6. chunk_root_level — level of the chunk tree root.
  7. bytes_used — updated to reflect allocations/frees.
  8. log_root — set to 0 after log replay, or updated if log is active.
  9. super_roots — one of the 4 backup root slots is written (rotating).
  10. csum — recomputed last, covering bytes 32..4095.

The commit writes the superblock to all mirrors. The superblock write is the atomic commit point: if power is lost before the superblock is written, the previous generation’s state is intact because COW ensures old blocks are never overwritten (see section 3).

Backup roots (168 bytes each, 4 entries)

The superblock contains 4 rotating backup root entries. On each commit, one slot is overwritten (cycling 0 → 1 → 2 → 3 → 0 → …). These are used for recovery when the primary root pointers are corrupt.

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | tree_root | Root tree root bytenr. |
| 8 | 8 | tree_root_gen | Root tree generation. |
| 16 | 8 | chunk_root | Chunk tree root bytenr. |
| 24 | 8 | chunk_root_gen | Chunk tree generation. |
| 32 | 8 | extent_root | Extent tree root bytenr. |
| 40 | 8 | extent_root_gen | Extent tree generation. |
| 48 | 8 | fs_root | Default FS tree root bytenr. |
| 56 | 8 | fs_root_gen | FS tree generation. |
| 64 | 8 | dev_root | Device tree root bytenr. |
| 72 | 8 | dev_root_gen | Device tree generation. |
| 80 | 8 | csum_root | Checksum tree root bytenr. |
| 88 | 8 | csum_root_gen | Checksum tree generation. |
| 96 | 8 | total_bytes | Total filesystem bytes at this point. |
| 104 | 8 | bytes_used | Bytes used at this point. |
| 112 | 8 | num_devices | Device count at this point. |
| 120 | 32 | unused | Reserved (zero). |
| 152 | 1 | tree_root_level | Root tree level. |
| 153 | 1 | chunk_root_level | Chunk tree level. |
| 154 | 1 | extent_root_level | Extent tree level. |
| 155 | 1 | fs_root_level | FS tree level. |
| 156 | 1 | dev_root_level | Device tree level. |
| 157 | 1 | csum_root_level | Checksum tree level. |
| 158 | 10 | padding | Padding to 168 bytes. |

Superblock flags

| Bit | Name | Description |
|---|---|---|
| 2 | BTRFS_SUPER_FLAG_ERROR | Filesystem has errors. |
| 32 | BTRFS_SUPER_FLAG_SEEDING | Seed device (read-only base for cloning). |
| 33 | BTRFS_SUPER_FLAG_METADUMP | Metadump image. |
| 34 | BTRFS_SUPER_FLAG_METADUMP_V2 | Metadump v2 image. |
| 35 | BTRFS_SUPER_FLAG_CHANGING_FSID | FSID rewrite in progress. |
| 36 | BTRFS_SUPER_FLAG_CHANGING_FSID_V2 | FSID rewrite v2 in progress. |
| 38 | BTRFS_SUPER_FLAG_CHANGING_BG_TREE | Block group tree migration. |
| 39 | BTRFS_SUPER_FLAG_CHANGING_DATA_CSUM | Data csum algorithm change. |
| 40 | BTRFS_SUPER_FLAG_CHANGING_META_CSUM | Metadata csum algorithm change. |

Feature flags

Incompatible (incompat_flags):

| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | MIXED_BACKREF | 0x1 | Modern backreference format. |
| 1 | DEFAULT_SUBVOL | 0x2 | Non-default default subvolume set. |
| 2 | MIXED_GROUPS | 0x4 | Mixed data+metadata block groups. |
| 3 | COMPRESS_LZO | 0x8 | LZO compression used. |
| 4 | COMPRESS_ZSTD | 0x10 | ZSTD compression used. |
| 5 | BIG_METADATA | 0x20 | Metadata blocks > 4 KiB (always set with modern mkfs for nodesize > 4096). |
| 6 | EXTENDED_IREF | 0x40 | Extended inode references (INODE_EXTREF). |
| 7 | RAID56 | 0x80 | RAID5/RAID6 profiles in use. |
| 8 | SKINNY_METADATA | 0x100 | Skinny metadata extent refs (see 5.1). |
| 9 | NO_HOLES | 0x200 | No explicit hole extent items. |
| 10 | METADATA_UUID | 0x400 | metadata_uuid field is in use. |
| 11 | RAID1C34 | 0x800 | RAID1C3 or RAID1C4 profiles in use. |
| 12 | ZONED | 0x1000 | Zoned block device support. |
| 13 | EXTENT_TREE_V2 | 0x2000 | Extent tree v2 (experimental). |
| 14 | RAID_STRIPE_TREE | 0x4000 | RAID stripe tree. |
| 16 | SIMPLE_QUOTA | 0x10000 | Simple quota accounting. |

Read-only compatible (compat_ro_flags):

| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | FREE_SPACE_TREE | 0x1 | Free space tree present. |
| 1 | FREE_SPACE_TREE_VALID | 0x2 | Free space tree is valid/consistent. |
| 2 | VERITY | 0x4 | fs-verity enabled files present. |
| 3 | BLOCK_GROUP_TREE | 0x8 | Separate block group tree. |

Default features for modern mkfs:

  • incompat_flags: MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | NO_HOLES = 0x361
  • compat_ro_flags: FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | BLOCK_GROUP_TREE = 0xB
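The default values can be checked by OR-ing the bit constants from the tables above:

```rust
// Incompat feature bits (subset of the table above).
const MIXED_BACKREF: u64 = 0x1;
const BIG_METADATA: u64 = 0x20;
const EXTENDED_IREF: u64 = 0x40;
const SKINNY_METADATA: u64 = 0x100;
const NO_HOLES: u64 = 0x200;

// Read-only compat bits.
const FREE_SPACE_TREE: u64 = 0x1;
const FREE_SPACE_TREE_VALID: u64 = 0x2;
const BLOCK_GROUP_TREE: u64 = 0x8;

fn default_incompat() -> u64 {
    MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | NO_HOLES
}

fn default_compat_ro() -> u64 {
    FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | BLOCK_GROUP_TREE
}
```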

System chunk array

The sys_chunk_array (2,048 bytes at offset 811) contains bootstrap chunk entries needed to read the chunk tree itself. Format: packed sequence of (btrfs_disk_key, btrfs_chunk) pairs. The sys_chunk_array_size field says how many bytes are valid. Parsing: read key (17 bytes), then chunk header (48 bytes) + stripes (num_stripes * 32 bytes), repeat until consumed.
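The parsing loop can be sketched as follows. This is illustrative and does minimal validation; real code should verify the key type is 228 and bounds-check every read against sys_chunk_array_size.

```rust
/// Walk the packed (key, chunk) pairs in sys_chunk_array.
/// Returns (logical_offset, num_stripes) per bootstrap chunk.
fn parse_sys_chunks(array: &[u8], valid_len: usize) -> Vec<(u64, u16)> {
    let mut chunks = Vec::new();
    let mut pos = 0;
    while pos < valid_len {
        // btrfs_disk_key: objectid u64, type u8, offset u64 (17 bytes, LE).
        // The key offset is the chunk's logical start address.
        let logical = u64::from_le_bytes(array[pos + 9..pos + 17].try_into().unwrap());
        pos += 17;
        // btrfs_chunk header is 48 bytes; num_stripes is a u16 at offset 44.
        let num_stripes = u16::from_le_bytes(array[pos + 44..pos + 46].try_into().unwrap());
        pos += 48 + num_stripes as usize * 32; // skip header + stripe array
        chunks.push((logical, num_stripes));
    }
    chunks
}
```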

Copy-on-write (COW) protocol

Btrfs never modifies tree blocks in place (except when a block was already allocated in the current transaction). This is the fundamental mechanism that provides crash consistency.

COW a tree block

When a transaction needs to modify a tree block:

  1. Check generation. If block.generation == current_transaction_generation, the block was already COWed in this transaction. Modify it in place.

  2. Allocate a new block. Find free space in an appropriate metadata block group and allocate nodesize bytes at a new logical address.

  3. Copy. Copy the entire block contents to the new address.

  4. Update parent pointer. In the parent node, change the blockptr for the relevant slot to the new address, and set generation to the current transaction generation.

  5. Update the new block’s header. Set bytenr to the new logical address, generation to the current transaction generation.

  6. Queue old block for freeing. The old block’s extent reference is decremented. If its refcount reaches 0, the space is freed (but only after the transaction commits, to maintain crash consistency).

  7. COW cascades upward. If the parent was not yet COWed, it must be COWed first (step 1 check), then updated. This cascades up to the root.
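The per-block decision (steps 1 through 6) can be modeled with a toy in-memory block store. This is an illustrative model of the generation check and copy, not the crate's implementation; allocation and parent updates are reduced to a bump allocator and a returned address.

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Block {
    generation: u64,
    data: Vec<u8>,
}

/// Toy COW of a single block. Returns the (possibly new) address of the
/// writable block and, if a copy was made, the old address that the
/// caller must queue for freeing after commit.
fn cow_block(
    blocks: &mut HashMap<u64, Block>,
    addr: u64,
    current_gen: u64,
    next_free_addr: &mut u64,
) -> (u64, Option<u64>) {
    if blocks[&addr].generation == current_gen {
        return (addr, None); // already COWed this transaction: modify in place
    }
    let mut copy = blocks[&addr].clone(); // step 3: copy contents
    copy.generation = current_gen;        // step 5: new header generation
    let new_addr = *next_free_addr;       // step 2: allocate a new logical address
    *next_free_addr += 16384;
    blocks.insert(new_addr, copy);
    (new_addr, Some(addr)) // step 6: old address is queued for freeing
}
```

A second COW of the same block within one transaction returns the same address with nothing to free, which is what makes the cascade terminate.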

COW and the root pointer

The root of each tree is stored in a root_item in the root tree (tree ID 1). The root tree’s own root pointer is stored in the superblock (root field).

When COW reaches the root of a non-root tree:

  • Update the root_item’s bytenr and level fields in the root tree.
  • This modification to the root tree triggers COW of the root tree itself.

When COW reaches the root tree’s root:

  • The new root block address is written to the superblock’s root field at commit time.

COW and the chunk tree

The chunk tree root is special: its pointer lives directly in the superblock (chunk_root field), not in the root tree. If the chunk tree is modified, its new root address updates chunk_root at commit time.

Crash consistency

The commit point is the superblock write. Before the superblock is updated:

  • All new tree blocks have been written to new locations.
  • All old tree blocks are still intact at their original locations.
  • The old superblock still points to the old root tree root, which points to the old state of all trees.

If power is lost before the superblock write completes, the filesystem reverts to the previous generation. No fsck needed.

Transaction lifecycle

A transaction groups multiple tree modifications into a single atomic commit.

Start

  1. Read the current superblock generation G.
  2. Set the new transaction generation to G + 1.
  3. Track all blocks modified during this transaction (the “dirty set”).

Modify

All tree modifications (insert, delete, update items) go through COW:

  • search_slot descends the tree, COWing each block along the path.
  • Item operations modify the COWed leaf.
  • Reference counts are updated for allocated and freed extents.

Commit

  1. Flush pending reference updates. Process all queued extent reference changes (delayed refs, see section 5.3). This may modify the extent tree, which may COW more blocks and generate more ref updates. Repeat until stable (no more pending updates).

  2. Update root items. For every tree whose root block changed, update its root_item in the root tree (fields: bytenr, generation, level). This may COW the root tree.

  3. Write dirty blocks. Write all blocks in the dirty set to disk with correct checksums. Each block’s checksum covers bytes 32..nodesize.

  4. Prepare superblock. Update the superblock fields listed in section 2.2. Write one backup root entry (rotating through slots 0-3). Recompute the superblock checksum.

  5. Write superblock. Write the superblock to all mirrors. Issue fsync to ensure durability.

Abort

Discard all dirty blocks. Do not write the superblock. The filesystem remains at the previous generation.

Extent tree and reference counting

The extent tree (tree ID 2) tracks which logical address ranges are allocated and who references them. Every allocated extent (both data and metadata) has an entry in the extent tree.

Extent items

There are two key types for extent records:

EXTENT_ITEM (type 168): Used for data extents and (on older filesystems without SKINNY_METADATA) for tree blocks.

  • Key: (logical_bytenr, EXTENT_ITEM=168, size_in_bytes)
  • Data: extent_item header (24 bytes), optionally tree_block_info (18 bytes), then inline backreferences.

METADATA_ITEM (type 169): Used for tree blocks when SKINNY_METADATA incompat flag is set. This is the modern default.

  • Key: (logical_bytenr, METADATA_ITEM=169, tree_level)
  • Data: extent_item header (24 bytes), then inline backreferences. No tree_block_info (the level is in the key offset, and the first key is not stored).

Extent item header (24 bytes):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | refs | Total reference count for this extent. |
| 8 | 8 | generation | Transaction generation when allocated. |
| 16 | 8 | flags | EXTENT_FLAG_DATA (bit 0) for data extents, EXTENT_FLAG_TREE_BLOCK (bit 1) for metadata. BLOCK_FLAG_FULL_BACKREF (bit 8) indicates full backrefs (shared block refs use parent bytenr instead of root ID). |

Tree block info (18 bytes, only for non-skinny EXTENT_ITEM with TREE_BLOCK flag):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 17 | key | First key in the tree block (btrfs_disk_key). |
| 17 | 1 | level | Level of the tree block. |

Backreferences

Backreferences record who uses an extent. They come in two forms: inline (packed inside the extent item’s data) and standalone (separate items in the extent tree).

Inline backreferences follow the extent item header (and tree_block_info if present). Each inline ref has a 1-byte type followed by an 8-byte offset, then type-specific data:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 1 | type | One of the backref type codes below. |
| 1 | 8 | offset | Type-dependent (see below). |

The backref types:

| Type code | Name | Offset meaning | Extra data | Total inline size |
|---|---|---|---|---|
| 176 | TREE_BLOCK_REF | Root tree ID | (none) | 9 bytes |
| 182 | SHARED_BLOCK_REF | Parent block bytenr | (none) | 9 bytes |
| 178 | EXTENT_DATA_REF | (see below) | 28 bytes | 37 bytes |
| 184 | SHARED_DATA_REF | Parent block bytenr | 4-byte count | 13 bytes |
| 172 | EXTENT_OWNER_REF | Root tree ID | (none) | 9 bytes |

TREE_BLOCK_REF (type 176): A tree block is referenced by a specific tree (identified by root ID). The offset field IS the root objectid. No additional data. Each such ref contributes 1 to the extent’s refcount.

SHARED_BLOCK_REF (type 182): A tree block is referenced by another tree block (identified by its bytenr) rather than by root ID. This happens during snapshots. The offset field IS the parent block’s bytenr. Each such ref contributes 1 to the extent’s refcount.

EXTENT_DATA_REF (type 178): A data extent is referenced by a file. The inline form packs the following 28 bytes immediately after the type byte (the 8-byte offset from the generic header is actually the first field root of this struct — parse carefully):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | root | Root tree ID containing the referencing inode. |
| 8 | 8 | objectid | Inode number. |
| 16 | 8 | offset | File offset where this extent is referenced. |
| 24 | 4 | count | Number of references (typically 1, >1 for reflinked files). |

Each EXTENT_DATA_REF contributes count to the extent’s refcount.

SHARED_DATA_REF (type 184): A data extent is referenced through a shared tree block (snapshot). The offset field is the parent block bytenr. Additional 4 bytes:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | count | Reference count from this parent. |

Each SHARED_DATA_REF contributes count to the extent’s refcount.

Standalone backreferences: When inline refs don’t fit in the extent item (rare, happens with many references), they overflow to standalone items:

  • TREE_BLOCK_REF_KEY (176): key (extent_bytenr, 176, root_id), no data.
  • SHARED_BLOCK_REF_KEY (182): key (extent_bytenr, 182, parent_bytenr), no data.
  • EXTENT_DATA_REF_KEY (178): key (extent_bytenr, 178, hash), 28-byte btrfs_extent_data_ref data. The hash is computed as:
    high_crc = crc32c(seed=0xFFFFFFFF, root.to_le_bytes())
    low_crc  = crc32c(seed=0xFFFFFFFF, objectid.to_le_bytes())
    low_crc  = crc32c(seed=low_crc,    offset.to_le_bytes())
    hash     = (high_crc as u64) << 31 ^ (low_crc as u64)
    
    Note: these are raw CRC32C register values (no final inversion), not the standard finalized form.
  • SHARED_DATA_REF_KEY (184): key (extent_bytenr, 184, parent_bytenr), 4-byte count.

Delayed references

Modifying a tree generates many reference count updates (every COWed block creates a new ref and removes an old ref). Processing each one immediately would cause excessive extent tree modifications. Instead, reference updates are queued and batched:

  1. When a block is COWed, queue: +1 ref at new_bytenr, -1 ref at old_bytenr.
  2. When a block is allocated for splitting, queue +1 ref.
  3. When blocks are freed (e.g., after merging), queue -1 ref.

At commit time, process all queued refs:

  • Merge updates to the same extent (e.g., +1 and -1 cancel out).
  • For each remaining update, modify the extent item in the extent tree.
  • If a refcount drops to 0, delete the extent item and free the space.
  • Processing delayed refs modifies the extent tree, which may generate more delayed refs (from COWing extent tree blocks). Repeat until the queue is empty. This converges because each iteration processes more refs than it creates.
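The merge step can be sketched as accumulating per-extent deltas; entries that cancel to zero never touch the extent tree. This is simplified: real delayed refs also carry the backref details needed to edit the extent item.

```rust
use std::collections::BTreeMap;

/// Merge queued reference updates per extent bytenr.
/// Fully cancelled extents drop out of the result.
fn merge_delayed_refs(queue: &[(u64, i64)]) -> BTreeMap<u64, i64> {
    let mut merged = BTreeMap::new();
    for &(bytenr, delta) in queue {
        *merged.entry(bytenr).or_insert(0) += delta;
    }
    merged.retain(|_, delta| *delta != 0); // +1 and -1 cancel out
    merged
}
```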

Refcount invariant

The refs field in an extent item must always equal the sum of all its backreferences:

  • Each TREE_BLOCK_REF or SHARED_BLOCK_REF contributes 1.
  • Each EXTENT_DATA_REF contributes its count field.
  • Each SHARED_DATA_REF contributes its count field.

If refs reaches 0, the extent is freed.

Block groups, chunks, and device extents

Btrfs organizes disk space into three layers: block groups (logical allocation regions), chunks (logical-to-physical mapping), and device extents (physical device reservations).

Block group item (24 bytes)

Stored in the extent tree (or block group tree if BLOCK_GROUP_TREE compat_ro flag is set).

Key: (logical_offset, BLOCK_GROUP_ITEM=192, length)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | used | Bytes currently allocated within this group. |
| 8 | 8 | chunk_objectid | Always 256 (BTRFS_FIRST_CHUNK_TREE_OBJECTID). |
| 16 | 8 | flags | Type + RAID profile (see 6.5). |

Block groups are the allocation units: when allocating an extent, the allocator finds a block group of the right type (DATA, METADATA, or SYSTEM) with enough free space.

Chunk item (48 + num_stripes * 32 bytes)

Stored in the chunk tree (tree ID 3).

Key: (256, CHUNK_ITEM=228, logical_offset)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | length | Logical size of this chunk. |
| 8 | 8 | owner | Owner tree (always 2, extent tree). |
| 16 | 8 | stripe_len | Stripe unit for RAID (typically 65536). |
| 24 | 8 | type | Flags: same as block group flags. |
| 32 | 4 | io_align | I/O alignment (typically 65536 for non-system, sectorsize for system chunks). |
| 36 | 4 | io_width | I/O width (same as io_align). |
| 40 | 4 | sector_size | Device sector size (typically 4096). |
| 44 | 2 | num_stripes | Number of stripes. |
| 46 | 2 | sub_stripes | Sub-stripes for RAID10 (0 otherwise). |
| 48 | 32 × num_stripes | stripes | Array of stripe descriptors. |

Each stripe (32 bytes):

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | devid | Device ID. |
| 8 | 8 | offset | Physical byte offset on the device. |
| 16 | 16 | dev_uuid | Device UUID. |

Chunk-to-physical resolution: For a logical address L within a chunk starting at chunk_start with a single stripe at device offset phys: physical = phys + (L - chunk_start). RAID profiles use more complex mapping.
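For the single-stripe case, resolution is one range check plus an addition; a minimal sketch:

```rust
/// Resolve a logical address through a SINGLE-profile (one-stripe) chunk.
/// Returns None when the address is not mapped by this chunk.
fn logical_to_physical(
    chunk_start: u64,
    chunk_len: u64,
    stripe_physical: u64,
    logical: u64,
) -> Option<u64> {
    if logical < chunk_start || logical >= chunk_start + chunk_len {
        return None; // outside this chunk's logical range
    }
    Some(stripe_physical + (logical - chunk_start))
}
```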

Device extent (48 bytes)

Stored in the device tree (tree ID 4).

Key: (devid, DEV_EXTENT=204, physical_offset)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | chunk_tree | Always 3 (BTRFS_CHUNK_TREE_OBJECTID). |
| 8 | 8 | chunk_objectid | Always 256. |
| 16 | 8 | chunk_offset | Logical offset of the owning chunk. |
| 24 | 8 | length | Length of this device extent. |
| 32 | 16 | chunk_tree_uuid | Chunk tree UUID. |

For each stripe in a chunk, there is one device extent on the corresponding device.

Device item (98 bytes)

Stored in the chunk tree (and embedded in the superblock for the local device).

Key: (1, DEV_ITEM=216, devid) (objectid 1 = BTRFS_DEV_ITEMS_OBJECTID)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | devid | Device ID (1, 2, 3, …). |
| 8 | 8 | total_bytes | Total device size. |
| 16 | 8 | bytes_used | Bytes allocated to chunks on this device. |
| 24 | 4 | io_align | I/O alignment. |
| 28 | 4 | io_width | I/O width. |
| 32 | 4 | sector_size | Sector size. |
| 36 | 8 | type | Reserved (0). |
| 44 | 8 | generation | Last transaction touching this device. |
| 52 | 8 | start_offset | Start offset for new allocations. |
| 60 | 4 | dev_group | Reserved (0). |
| 64 | 1 | seek_speed | Hint (0 = unset). |
| 65 | 1 | bandwidth | Hint (0 = unset). |
| 66 | 16 | uuid | Device UUID. |
| 82 | 16 | fsid | Filesystem UUID. |

Block group type flags

| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | DATA | 0x1 | Data extents. |
| 1 | SYSTEM | 0x2 | System (chunk tree) metadata. |
| 2 | METADATA | 0x4 | Metadata extents. |
| 3 | RAID0 | 0x8 | Striped. |
| 4 | RAID1 | 0x10 | Mirrored (2 copies). |
| 5 | DUP | 0x20 | Duplicated on same device. |
| 6 | RAID10 | 0x40 | Striped + mirrored. |
| 7 | RAID5 | 0x80 | RAID5. |
| 8 | RAID6 | 0x100 | RAID6. |
| 9 | RAID1C3 | 0x200 | Mirrored (3 copies). |
| 10 | RAID1C4 | 0x400 | Mirrored (4 copies). |

A block group’s flags combine exactly one type (DATA, SYSTEM, METADATA) with zero or one RAID profile. If no RAID profile bit is set, the block group is SINGLE (no replication, but the virtual SINGLE bit 48 = 0x1000000000000 is used in some display contexts only).
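The "exactly one type, at most one profile" rule can be checked with two bit masks (mask values derived from the table above; an illustrative validator):

```rust
const TYPE_MASK: u64 = 0x7;      // DATA | SYSTEM | METADATA
const PROFILE_MASK: u64 = 0x7F8; // RAID0 through RAID1C4

/// A block group must carry exactly one type bit and at most one
/// RAID profile bit; no profile bit means SINGLE.
fn validate_bg_flags(flags: u64) -> bool {
    let type_bits = flags & TYPE_MASK;
    let profile_bits = flags & PROFILE_MASK;
    type_bits.count_ones() == 1 && profile_bits.count_ones() <= 1
}
```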

Relationships between structures

For each allocated region of logical space:

  1. A block group item in the extent tree defines the logical range and tracks usage.
  2. A chunk item in the chunk tree maps the same logical range to one or more physical stripes.
  3. For each stripe, a device extent in the device tree reserves the physical space on that device.
  4. The device item in the chunk tree tracks total and used bytes per device.

All four must be consistent. When allocating a new block group (rare in rescue operations), all four structures must be created atomically within one transaction.

Tree types and key reference

Tree IDs

| ID | Name | Stored in |
|---|---|---|
| 1 | Root tree | Superblock (root field) |
| 2 | Extent tree | Root tree (ROOT_ITEM objectid=2) |
| 3 | Chunk tree | Superblock (chunk_root field) |
| 4 | Device tree | Root tree (ROOT_ITEM objectid=4) |
| 5 | Default FS tree | Root tree (ROOT_ITEM objectid=5) |
| 6 | Root tree directory | (virtual, in root tree) |
| 7 | Checksum tree | Root tree (ROOT_ITEM objectid=7) |
| 8 | Quota tree | Root tree (ROOT_ITEM objectid=8) |
| 9 | UUID tree | Root tree (ROOT_ITEM objectid=9) |
| 10 | Free space tree | Root tree (ROOT_ITEM objectid=10) |
| 11 | Block group tree | Root tree (ROOT_ITEM objectid=11) |
| 12 | RAID stripe tree | Root tree (ROOT_ITEM objectid=12) |
| 256+ | User subvolume/snapshot trees | Root tree (ROOT_ITEM objectid=N) |

The root tree is the master index. It contains a ROOT_ITEM for every other tree (except itself and the chunk tree, whose roots are in the superblock).

Root item (439 bytes)

Stored in root tree with key (tree_id, ROOT_ITEM=132, 0).

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 160 | inode | Embedded btrfs_inode_item (see 7.3). |
| 160 | 8 | generation | Transaction generation of this root. |
| 168 | 8 | root_dirid | Root directory objectid (typically 256). |
| 176 | 8 | bytenr | Logical bytenr of this tree’s root block. |
| 184 | 8 | byte_limit | Deprecated (0). |
| 192 | 8 | bytes_used | Total bytes used by this tree’s extents. |
| 200 | 8 | last_snapshot | Generation of last snapshot of this tree. |
| 208 | 8 | flags | Root flags (bit 0 = read-only subvolume). |
| 216 | 4 | refs | Reference count. |
| 220 | 17 | drop_progress | Key tracking in-progress drop operation. |
| 237 | 1 | drop_level | Level of drop progress. |
| 238 | 1 | level | Current B-tree height of this root. |
| 239 | 8 | generation_v2 | Same as generation (marks v2 format). |
| 247 | 16 | uuid | Subvolume UUID. |
| 263 | 16 | parent_uuid | Parent subvolume UUID (for snapshots). |
| 279 | 16 | received_uuid | Source UUID (for received subvolumes). |
| 295 | 8 | ctransid | Transaction of last inode change. |
| 303 | 8 | otransid | Transaction when this root was created. |
| 311 | 8 | stransid | Transaction when sent. |
| 319 | 8 | rtransid | Transaction when received. |
| 327 | 12 | ctime | Change time (8-byte sec + 4-byte nsec). |
| 339 | 12 | otime | Creation time. |
| 351 | 12 | stime | Send time. |
| 363 | 12 | rtime | Receive time. |
| 375 | 64 | reserved | Zero-filled. |

Fields updated when a tree’s root block changes (during commit):

  • bytenr — new root block address.
  • generation and generation_v2 — current transaction generation.
  • level — root block level.

Inode item (160 bytes)

Embedded in root items and stored standalone in FS trees.

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | generation | NFS generation. |
| 8 | 8 | transid | Last modifying transaction. |
| 16 | 8 | size | File size. |
| 24 | 8 | nbytes | Sum of EXTENT_DATA.num_bytes for regular/prealloc extents plus inline payload length. See File data extents. |
| 32 | 8 | block_group | Block group hint for allocation. |
| 40 | 4 | nlink | Hard link count. |
| 44 | 4 | uid | User ID. |
| 48 | 4 | gid | Group ID. |
| 52 | 4 | mode | File mode (permissions + type). |
| 56 | 8 | rdev | Device number (block/char devices). |
| 64 | 8 | flags | Inode flags. |
| 72 | 8 | sequence | NFS sequence number. |
| 80 | 32 | reserved | Zero-filled. |
| 112 | 12 | atime | Access time (8-byte sec + 4-byte nsec). |
| 124 | 12 | ctime | Change time. |
| 136 | 12 | mtime | Modification time. |
| 148 | 12 | otime | Creation time. |

Key type reference

All key types with their numeric values:

| Value | Name | Primary tree | Key semantics |
|---|---|---|---|
| 1 | INODE_ITEM | FS tree | (inode#, 1, 0) |
| 12 | INODE_REF | FS tree | (inode#, 12, parent_dir_inode#) |
| 13 | INODE_EXTREF | FS tree | (inode#, 13, hash) |
| 24 | XATTR_ITEM | FS tree | (inode#, 24, name_hash) |
| 36 | VERITY_DESC_ITEM | FS tree | (inode#, 36, 0) |
| 37 | VERITY_MERKLE_ITEM | FS tree | (inode#, 37, offset) |
| 48 | ORPHAN_ITEM | Root/FS tree | (objectid, 48, offset) |
| 60 | DIR_LOG_ITEM | Log tree | (dir_inode#, 60, hash) |
| 72 | DIR_LOG_INDEX | Log tree | (dir_inode#, 72, index) |
| 84 | DIR_ITEM | FS tree | (dir_inode#, 84, name_hash) |
| 96 | DIR_INDEX | FS tree | (dir_inode#, 96, index) |
| 108 | EXTENT_DATA | FS tree | (inode#, 108, file_offset) |
| 128 | EXTENT_CSUM | Csum tree | (-10, 128, logical_bytenr) |
| 132 | ROOT_ITEM | Root tree | (tree_id, 132, 0) |
| 144 | ROOT_BACKREF | Root tree | (child_id, 144, parent_id) |
| 156 | ROOT_REF | Root tree | (parent_id, 156, child_id) |
| 168 | EXTENT_ITEM | Extent tree | (bytenr, 168, size) |
| 169 | METADATA_ITEM | Extent tree | (bytenr, 169, level) |
| 172 | EXTENT_OWNER_REF | (inline only) | |
| 176 | TREE_BLOCK_REF | Extent tree | (bytenr, 176, root_id) |
| 178 | EXTENT_DATA_REF | Extent tree | (bytenr, 178, hash) |
| 182 | SHARED_BLOCK_REF | Extent tree | (bytenr, 182, parent_bytenr) |
| 184 | SHARED_DATA_REF | Extent tree | (bytenr, 184, parent_bytenr) |
| 192 | BLOCK_GROUP_ITEM | Extent tree\* | (logical, 192, length) |
| 198 | FREE_SPACE_INFO | Free space tree | (bg_start, 198, bg_length) |
| 199 | FREE_SPACE_EXTENT | Free space tree | (start, 199, length) |
| 200 | FREE_SPACE_BITMAP | Free space tree | (start, 200, length) |
| 204 | DEV_EXTENT | Device tree | (devid, 204, phys_offset) |
| 216 | DEV_ITEM | Chunk tree | (1, 216, devid) |
| 228 | CHUNK_ITEM | Chunk tree | (256, 228, logical) |
| 230 | RAID_STRIPE | Stripe tree | (logical, 230, length) |
| 240 | QGROUP_STATUS | Quota tree | (0, 240, 0) |
| 242 | QGROUP_INFO | Quota tree | (qgroupid, 242, 0) |
| 244 | QGROUP_LIMIT | Quota tree | (qgroupid, 244, 0) |
| 246 | QGROUP_RELATION | Quota tree | (qgroupid, 246, other_qgroupid) |
| 248 | TEMPORARY_ITEM | Root tree | (objectid, 248, offset) |
| 249 | PERSISTENT_ITEM | Root tree | (objectid, 249, offset) |
| 250 | DEV_REPLACE | Root tree | (objectid, 250, 0) |

*BLOCK_GROUP_ITEM lives in the extent tree by default. With the BLOCK_GROUP_TREE compat_ro flag, it moves to tree ID 11.

Root ref and root backref (18+ bytes)

Forward and backward links between parent and child subvolumes.

ROOT_REF key: (parent_tree_id, ROOT_REF=156, child_tree_id) ROOT_BACKREF key: (child_tree_id, ROOT_BACKREF=144, parent_tree_id)

Both use the same data format:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | dirid | Directory objectid in the parent tree that contains this subvolume. |
| 8 | 8 | sequence | Index in the directory. |
| 16 | 2 | name_len | Length of the subvolume name. |
| 18 | name_len | name | Subvolume name (not NUL-terminated). |

File data extents

Regular file content lives in EXTENT_DATA items in the FS tree, keyed (inode#, EXTENT_DATA, file_offset). Each item describes a contiguous range of the file’s logical bytes; consecutive items must cover non-overlapping ranges. Three extent types exist:

  • BTRFS_FILE_EXTENT_INLINE (0): data embedded directly in the leaf.
  • BTRFS_FILE_EXTENT_REG (1): pointer to a separate data extent on disk.
  • BTRFS_FILE_EXTENT_PREALLOC (2): reserved on disk but not yet written.

Common header (21 bytes)

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | generation | Transid at extent creation. |
| 8 | 8 | ram_bytes | Uncompressed size of the extent’s data. |
| 16 | 1 | compression | 0=none, 1=zlib, 2=LZO, 3=zstd. |
| 17 | 1 | encryption | Always 0. |
| 18 | 2 | other_encoding | Always 0. |
| 20 | 1 | extent_type | 0=inline, 1=regular, 2=prealloc. |

Regular and prealloc body (32 bytes follow header)

| Offset | Size | Field | Description |
|---|---|---|---|
| 21 | 8 | disk_bytenr | Logical address of data extent (0 = hole). |
| 29 | 8 | disk_num_bytes | On-disk size, sectorsize-aligned. |
| 37 | 8 | offset | Byte offset into the on-disk extent (bookend; 0 for non-shared). |
| 45 | 8 | num_bytes | Logical file bytes covered by this item. |

Inline body

For inline extents the bytes after the 21-byte header are the (possibly compressed) file data. There is no disk_bytenr, no extent-tree entry, and no csum entry: the inline payload is covered by the FS tree leaf’s own checksum.

For LZO inline extents the embedded bytes carry an additional framing header: [4B total_len LE] [4B seg_len LE] [lzo1x compressed bytes], where total_len includes the 8-byte framing header itself.

Validation rules

These invariants are enforced by btrfs check and must hold for any EXTENT_DATA written by userspace:

  • Regular and prealloc extents: num_bytes must be sectorsize-aligned and non-zero. disk_num_bytes must also be sectorsize-aligned. num_bytes + offset <= ram_bytes.
  • Inline extents: total embedded payload (compressed or not) must fit in a leaf, capped at min(nodesize - 147, sectorsize - 1) bytes on a default filesystem. The 147 = HEADER_SIZE (101) + ITEM_SIZE (25) + 21-byte file-extent header. The sectorsize - 1 cap is btrfs’s rule that sector-or-larger files must use a regular extent.
  • INODE.nbytes: sum of num_bytes for every regular/prealloc extent (where disk_bytenr > 0) plus the inline payload length for any inline extent. For non-compressed extents num_bytes is the sector-aligned logical size, NOT the on-disk byte count. For compressed extents num_bytes is still the sector-aligned logical size — the smaller disk_num_bytes is not what gets summed.
  • INODE.size: the file’s logical size in bytes. May be smaller than the sum of num_bytes (the unwritten tail in the last extent reads as zero up to size).
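The nbytes rule can be made concrete with a small accumulator over extent summaries. The types here are illustrative, not the crate's own:

```rust
/// Minimal per-extent summary for nbytes accounting (hypothetical types).
enum Extent {
    Inline { payload_len: u64 },
    Regular { disk_bytenr: u64, num_bytes: u64 },
    Prealloc { num_bytes: u64 },
}

/// Compute INODE.nbytes per the rule above: sector-aligned logical sizes
/// for regular/prealloc extents (holes excluded), plus inline payload length.
fn compute_nbytes(extents: &[Extent]) -> u64 {
    extents
        .iter()
        .map(|e| match e {
            Extent::Inline { payload_len } => *payload_len,
            Extent::Regular { disk_bytenr: 0, .. } => 0, // hole: not counted
            Extent::Regular { num_bytes, .. } => *num_bytes,
            Extent::Prealloc { num_bytes } => *num_bytes,
        })
        .sum()
}
```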

LZO regular framing

For non-inline LZO extents, the on-disk bytes use a per-sector framed format:

```
[4B total_len LE] { [4B seg_len LE] [lzo1x compressed bytes] [zero pad] }*
```

  • Each input sector is compressed independently.
  • seg_len is the size of that sector’s compressed segment.
  • total_len is the total framed buffer size, including the 4-byte header.
  • After each segment, if fewer than 4 bytes remain in the current sector (i.e. the next 4-byte length header would cross a sector boundary), zero-pad to the next sector boundary so the next segment’s length header is sector-aligned.

This per-sector independence lets the kernel decompress individual sectors without reading neighbours.
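The padding rule above can be sketched as a helper that decides where the next segment's length header goes (illustrative):

```rust
/// After writing a compressed segment ending at `pos` bytes into the
/// framed buffer, return where the next 4-byte seg_len header starts:
/// pad to the next sector boundary if fewer than 4 bytes remain.
fn next_segment_offset(pos: usize, sectorsize: usize) -> usize {
    let room = sectorsize - (pos % sectorsize);
    if room < 4 { pos + room } else { pos }
}
```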

Holes

With the NO_HOLES incompat flag (default on modern filesystems), gaps in the file_offset sequence indicate holes — no EXTENT_DATA item is written for the unmapped range. Without NO_HOLES, hole regions are recorded as regular extents with disk_bytenr == 0 and disk_num_bytes == 0.

Checksum computation

Tree block checksums

The checksum field (bytes 0..31 of the header) covers bytes 32..nodesize. For CRC32C (type 0), the checksum is 4 bytes stored at offset 0, with bytes 4..31 zero-padded.

Computation: standard CRC32C (Castagnoli polynomial; initial seed 0xFFFFFFFF, final XOR with 0xFFFFFFFF) over the data region bytes 32..nodesize.

Superblock checksums

Same as tree block checksums: bytes 0..31 are the checksum field, covering bytes 32..4095.

Data checksums (csum tree)

Data checksums are stored in the csum tree (tree ID 7) with key (EXTENT_CSUM_OBJECTID=-10, EXTENT_CSUM=128, logical_bytenr).

The item data is a packed array of checksums, one per sector. For CRC32C, each checksum is 4 bytes. The number of sectors covered is item_size / csum_size_for_type. Sectors are consecutive starting at the key’s offset (logical_bytenr).

Computation: standard CRC32C (the same algorithm as for tree blocks; initial seed 0xFFFFFFFF, final XOR with 0xFFFFFFFF). The csum input is the on-disk bytes of the data extent — for compressed extents, that is the compressed+sector-padded payload, NOT the uncompressed original.

Note this is distinct from the raw_crc32c (no final invert) used by EXTENT_DATA_REF_KEY hashes and by the send-stream protocol. On-disk csum-tree entries always use the standard variant.

A single csum item may cover multiple consecutive sectors. The practical upper bound for a single item’s payload is roughly leaf_data_size - 2 * item_header_size - csum_size bytes, leaving room for a future split. Adjacent items at sector-contiguous logical addresses may be merged into one larger item, but btrfs check accepts either layout.

Inline extents have no csum entries — the data lives in the leaf and is covered by the leaf’s own header checksum.

NODATASUM extents (inode flag BTRFS_INODE_NODATASUM) skip csum computation entirely. btrfs check rejects csum entries for NODATASUM extents, and rejects missing csum entries for non-NODATASUM regular extents.

Extent data ref hash

The hash used in EXTENT_DATA_REF_KEY’s offset field:

high_crc = raw_crc32c(seed=0xFFFFFFFF, root.to_le_bytes())
low_crc  = raw_crc32c(seed=0xFFFFFFFF, objectid.to_le_bytes())
low_crc  = raw_crc32c(seed=low_crc,    offset.to_le_bytes())
hash     = (high_crc as u64) << 31 ^ (low_crc as u64)

Here raw_crc32c means NO final XOR — the raw CRC register value. This can be recovered from the standard API: raw = !standard_crc32c(data) when seed is !0, or equivalently raw = crc32c_with_seed(!0, data) if the API exposes the seed.
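A runnable sketch of this hash follows; the raw-CRC helper is repeated bitwise here so the snippet is self-contained (names are mine):

```rust
// Raw CRC32C: Castagnoli polynomial (reflected form 0x82F63B78),
// seedable, NO final XOR -- the raw register value.
fn crc32c_raw(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    crc
}

// EXTENT_DATA_REF offset hash, following the pseudocode above.
// Note the shift by 31 (not 32) -- a long-standing btrfs quirk that
// makes the high and low halves overlap in one bit.
fn hash_extent_data_ref(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = crc32c_raw(0xFFFF_FFFF, &root.to_le_bytes());
    let mut low_crc = crc32c_raw(0xFFFF_FFFF, &objectid.to_le_bytes());
    low_crc = crc32c_raw(low_crc, &offset.to_le_bytes());
    ((high_crc as u64) << 31) ^ (low_crc as u64)
}
```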

B-tree operations

This section describes the algorithms for searching, inserting, and deleting items in a btrfs B-tree. These are standard B-tree algorithms adapted for the btrfs leaf/node layout and COW model.

Binary search within a block

Given a block and a target key, find the slot:

In a leaf: Binary search over items[0..nritems-1] comparing keys. If found, return (true, slot). If not found, return (false, slot) where slot is the insertion point (the index of the first item with key > target).

In a node: Binary search over ptrs[0..nritems-1] comparing keys. The result is the slot of the child subtree that could contain the target key. Specifically, find the largest slot where ptrs[slot].key <= target. If the target is less than all keys, use slot 0.
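Both searches can be sketched with the standard library's binary search, assuming keys compare lexicographically as (objectid, type, offset) tuples:

```rust
// Keys order as (objectid, type, offset) tuples.
type Key = (u64, u8, u64);

// Leaf: Ok(slot) if the key exists, Err(slot) = insertion point
// (index of the first item with key > target).
fn leaf_search(keys: &[Key], target: Key) -> Result<usize, usize> {
    keys.binary_search(&target)
}

// Node: slot of the child subtree that could contain the target --
// the largest slot whose key <= target, or 0 if target < all keys.
fn node_search(keys: &[Key], target: Key) -> usize {
    match keys.binary_search(&target) {
        Ok(slot) => slot,
        Err(0) => 0,
        Err(insertion_point) => insertion_point - 1,
    }
}
```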

Search (search_slot)

search_slot(trans, root, key, path, ins_len, cow) descends from the root to a leaf:

  1. Start at the root block (level = root_level).
  2. If cow != 0 and the block hasn’t been COWed in this transaction, COW it.
  3. Binary search for the key within the block.
  4. Store (block, slot) in path.nodes[level] and path.slots[level].
  5. If level > 0: read the child at ptrs[slot].blockptr, go to step 2 with the child.
  6. If level == 0: done. If the key was found, path.slots[0] points to it. If not found, path.slots[0] is the insertion point.

When ins_len > 0 (insert operation), the search checks whether the target leaf has enough free space. If not, it triggers a leaf split before returning.

Item insertion

Given a search path pointing to the insertion slot in a leaf:

  1. If the leaf has enough free space (>= 25 + data_size):
    a. Shift items at slots [insert_slot..nritems-1] right by 25 bytes (one item descriptor).
    b. Shift all data belonging to items at [insert_slot..nritems-1] left by data_size bytes (making room at the end of the data area).
    c. Update the offset field of shifted items (subtract data_size from each).
    d. Write the new item descriptor at the insert slot.
    e. Write the new item data.
    f. Increment nritems.

  2. If the leaf is full: split the leaf (see Leaf split below), then insert.

Item deletion

Given a search path pointing to items to delete (slot, count):

  1. If deleting items in the middle: shift items at [slot+count..nritems-1] left by count * 25 bytes.
  2. Shift data: move data belonging to remaining items to fill the gap left by deleted items’ data. Update offset fields accordingly.
  3. Decrement nritems by count.
  4. If the leaf becomes empty: remove the key pointer from the parent node and free the leaf block. If the parent also becomes empty (or has only one child), rebalance upward.

Leaf split

When a leaf is too full for an insertion:

  1. Allocate a new leaf block.
  2. Find the split point: aim for roughly half the data in each leaf. The split point should be at an item boundary (never split an item).
  3. Copy items [split..nritems-1] and their data to the new leaf.
  4. Update the original leaf’s nritems.
  5. Insert a new key pointer in the parent node pointing to the new leaf. The key is the first key of the new leaf.
  6. If the parent node is full, split the parent (see Node split below).

Node split

When an internal node is too full for a new key pointer:

  1. Allocate a new node at the same level.
  2. Move roughly half the key pointers to the new node.
  3. Insert a new key pointer in the parent (one level up) for the new node. The key is the first key of the new node.
  4. If the parent is also full, split it recursively.
  5. If the root node splits, create a new root one level higher containing two key pointers (to the old and new nodes). Update the tree’s root pointer. The tree grows taller by one level.

Rebalancing (optional optimization)

Before splitting, try to redistribute items to a neighboring sibling:

  • Push left: If the left sibling has free space, move items from the start of the full leaf to the end of the left sibling. Update the parent’s key for the full leaf.
  • Push right: If the right sibling has free space, move items from the end of the full leaf to the start of the right sibling. Update the parent’s key for the right sibling.

This reduces tree height growth. It’s an optimization, not required for correctness. The same applies to nodes (push key pointers to siblings).

After deletion, if a leaf or node is less than ~25% full, consider merging with a sibling. This is also optional for correctness but prevents excessive tree bloat.

Path advancement

next_leaf(path): advance from the current leaf to the next one.

  1. Walk up the path until finding a level where slot < nritems - 1.
  2. Increment that slot.
  3. Walk back down, always taking slot 0, until reaching a leaf.
  4. Update the path at each level.

prev_leaf(path): similar but in reverse (walk up until slot > 0, decrement, walk down taking the last slot at each level).
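A sketch of next_leaf over just the slot bookkeeping. Note the assumptions: a real path also holds the block at each level, and re-reads the nritems of every freshly entered block on the way down; the struct here is illustrative only.

```rust
// One entry per level of the path; index 0 is the leaf, the last
// index is the root.
struct PathLevel {
    slot: usize,
    nritems: usize,
}

// Advance the path to the first slot of the next leaf.
// Returns false if the current leaf is already the last one.
fn next_leaf(path: &mut [PathLevel]) -> bool {
    // Walk up until some level has a slot to the right of the current one.
    let mut level = 1;
    loop {
        if level >= path.len() {
            return false; // ran past the root: no next leaf
        }
        if path[level].slot + 1 < path[level].nritems {
            break;
        }
        level += 1;
    }
    path[level].slot += 1;
    // Walk back down, taking slot 0 at every lower level.
    for lower in 0..level {
        path[lower].slot = 0;
    }
    true
}
```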

Free space management

To allocate extents, the transaction crate needs to know which logical addresses are free within each block group.

Extent tree scanning

The simplest approach: walk the extent tree within a block group’s logical range. Allocated extents are contiguous EXTENT_ITEM/METADATA_ITEM entries. Gaps between them are free space. This is O(n) in the number of extents but works without additional infrastructure.
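The gap computation itself is straightforward once the walk has collected the allocated extents (assumed here to be sorted, non-overlapping, and clipped to the block group):

```rust
// Given allocated extents (start, len) within a block group
// [bg_start, bg_start + bg_len), return the free gaps between them
// as (start, len) pairs, in bytes.
fn free_gaps(bg_start: u64, bg_len: u64, allocated: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut gaps = Vec::new();
    let mut cursor = bg_start;
    for &(start, len) in allocated {
        if start > cursor {
            gaps.push((cursor, start - cursor)); // hole before this extent
        }
        cursor = start + len;
    }
    let end = bg_start + bg_len;
    if cursor < end {
        gaps.push((cursor, end - cursor)); // tail of the block group
    }
    gaps
}
```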

Free space tree (optional optimization)

If the FREE_SPACE_TREE compat_ro flag is set, the free space tree (tree ID 10) provides pre-computed free space information per block group.

For each block group, there is a FREE_SPACE_INFO item: Key: (block_group_start, FREE_SPACE_INFO=198, block_group_length)

Offset  Size  Field         Description
0       4     extent_count  Number of free extents.
4       4     flags         Bit 0: USING_BITMAPS (bitmap mode).

If not using bitmaps, free extents are stored as: Key: (start, FREE_SPACE_EXTENT=199, length) — no item data.

If using bitmaps: Key: (start, FREE_SPACE_BITMAP=200, length) — item data is a bitmap where each bit represents one sector (1 = free).

The free space tree must be kept in sync with the extent tree during transactions. When allocating or freeing extents, update both.
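A sketch of decoding a bitmap item into free runs. One assumption to flag: the bit order here is little-endian within each byte (bit i % 8 of byte i / 8), matching how the kernel addresses these bitmaps — verify against a real filesystem before relying on it.

```rust
// Decode a FREE_SPACE_BITMAP payload into (start, length) free runs,
// in bytes. One bit per sector, 1 = free; `start` is the key offset.
fn decode_bitmap(start: u64, sectorsize: u64, bitmap: &[u8]) -> Vec<(u64, u64)> {
    let mut runs = Vec::new();
    let mut run_start: Option<u64> = None;
    let total_bits = bitmap.len() * 8;
    // Iterate one past the end so a trailing run gets closed.
    for i in 0..=total_bits {
        let free = i < total_bits && (bitmap[i / 8] >> (i % 8)) & 1 != 0;
        match (free, run_start) {
            (true, None) => run_start = Some(start + i as u64 * sectorsize),
            (false, Some(s)) => {
                runs.push((s, start + i as u64 * sectorsize - s));
                run_start = None;
            }
            _ => {}
        }
    }
    runs
}
```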

Allocation strategy

For metadata blocks:

  • Find a block group with type METADATA (or SYSTEM for chunk tree blocks).
  • Find a free region >= nodesize.
  • Prefer the block group hinted by the tree’s root item or the most recently used block group.

For data extents:

  • Find a block group with type DATA.
  • Find a free region >= requested size.

Rescue command requirements

This section maps each rescue command to the specific tree operations needed.

clear-uuid-tree

Delete all items from the UUID tree and remove its root item.

  1. Start transaction.
  2. Search for the first key in the UUID tree: search_slot(uuid_root, min_key).
  3. Delete items in batches (walk forward, delete, repeat until tree empty).
  4. Delete the ROOT_ITEM for tree ID 9 from the root tree.
  5. Free all tree blocks that belonged to the UUID tree (decrement refs).
  6. Set uuid_tree_generation = 0 in the superblock (tells the kernel to rebuild the UUID tree on next mount).
  7. Commit transaction.

clear-ino-cache

Remove leftover inode cache items (from the deprecated v1 inode cache).

  1. Start transaction.
  2. For each FS tree (tree IDs 5, 256+): search for INODE_ITEM with objectid = BTRFS_FREE_INO_OBJECTID (-12). Delete the inode item and all associated EXTENT_DATA items.
  3. Free any data extents referenced by the deleted extent data items.
  4. Commit transaction.

clear-space-cache

Two modes: v1 (free space inode cache) and v2 (free space tree).

v1: Similar to clear-ino-cache — delete free space cache inodes (objectid = BTRFS_FREE_SPACE_OBJECTID = -11) from each block group.

v2: Delete the entire free space tree (tree ID 10) like clear-uuid-tree. Clear the FREE_SPACE_TREE_VALID compat_ro flag so the kernel rebuilds it on next mount.

fix-device-size

Correct device and superblock size fields when they’re inconsistent.

  1. Start transaction.
  2. Walk the device tree to find all DEV_EXTENT items for each device.
  3. Sum the extent lengths to get the true bytes_used per device.
  4. Update each DEV_ITEM’s total_bytes and bytes_used.
  5. Update the superblock’s embedded dev_item and total_bytes.
  6. Commit transaction.

fix-data-checksum

Verify and repair data checksums using mirror redundancy.

  1. Start transaction.
  2. Walk the csum tree (EXTENT_CSUM items).
  3. For each checksummed range, read data from each available mirror.
  4. Verify each mirror’s data against the stored checksum.
  5. If a checksum mismatch is found and a good mirror exists: optionally update the csum item to match the good mirror’s data (or rewrite the data from the good mirror).
  6. Commit transaction.

Requires: extent tree walking for backref resolution (to report which files are affected), multi-device I/O for reading mirrors.

chunk-recover

Rebuild the chunk tree by scanning device surfaces for tree blocks.

  1. Scan all devices for valid tree block headers (check magic, csum).
  2. From found tree blocks, reconstruct chunk items by cross-referencing block group items and device extents.
  3. Rebuild the chunk tree with the recovered mappings.
  4. Commit.

This is the most complex rescue operation and requires extensive device scanning infrastructure beyond basic tree operations.

mkfs.btrfs: filesystem creation process

This document describes how mkfs.btrfs creates a new btrfs filesystem, covering both the empty filesystem case (make_btrfs) and the directory population case (make_btrfs_with_rootdir).

Overview

mkfs.btrfs creates a filesystem by constructing B-tree nodes as raw byte buffers and writing them directly to a block device or image file with pwrite. No kernel ioctls or mounting are involved. The process produces a valid, mountable btrfs filesystem.

The implementation spans several modules:

  • mkfs/src/mkfs.rs – orchestration: make_btrfs and make_btrfs_with_rootdir
  • mkfs/src/layout.rs – chunk layout computation and block address assignment
  • mkfs/src/tree.rs – LeafBuilder and NodeBuilder for individual blocks
  • mkfs/src/treebuilder.rs – TreeBuilder for multi-leaf trees
  • mkfs/src/items.rs – serializers for all on-disk item types
  • mkfs/src/rootdir.rs – directory walking, data writing, compression
  • mkfs/src/write.rs – checksum computation and pwrite I/O

Part 1: empty filesystem creation (make_btrfs)

Step 1: validation

Before any I/O, the configuration is validated:

  • sectorsize must be a power of 2 and >= 4096.
  • nodesize must be a power of 2, >= sectorsize, and <= 65536.
  • If the mixed-bg incompat feature is set, nodesize must equal sectorsize.

Step 2: chunk layout computation

ChunkLayout::new computes the physical placement of three block groups on disk:

System block group

  • Logical offset: 1 MiB (SYSTEM_GROUP_OFFSET).
  • Size: 4 MiB (SYSTEM_GROUP_SIZE).
  • Physical offset: same as logical (system chunk has identity mapping on device 1).
  • Profile: always SINGLE (one stripe on device 1).
  • Contains: the chunk tree block.

Metadata block group

  • Logical offset: 5 MiB (CHUNK_START = system offset + system size).
  • Size: clamp(total_bytes / 10, 32 MiB, 256 MiB), rounded down to 64 KiB (STRIPE_LEN).
  • Profile: DUP on single device (two physical stripes on device 1, sequential after the system group) or RAID1 on multi-device (one stripe per device at CHUNK_START).
  • Contains: all non-chunk tree blocks (root, extent, dev, FS, csum, free-space, data-reloc, and optionally block-group tree).

Data block group

  • Logical offset: metadata logical + metadata size.
  • Size: clamp(total_bytes / 10, 64 MiB, 1 GiB), rounded down to STRIPE_LEN.
  • Profile: SINGLE (one stripe on device 1, after the last metadata stripe).
  • Contains: file data (empty for a freshly created filesystem).

The layout validates that all stripes fit on their respective devices. If they do not, ChunkLayout::new returns None and mkfs reports “device too small”.

The minimum device size is approximately 133 MiB: 5 MiB (system) + 64 MiB (2 x 32 MiB metadata DUP) + 64 MiB (data).
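The two clamp-and-round computations can be written out directly (the constant names are mine; this is a sketch of the arithmetic, not the ChunkLayout code itself):

```rust
const MIB: u64 = 1 << 20;
const STRIPE_LEN: u64 = 64 * 1024; // 64 KiB

// Metadata chunk: clamp(total / 10, 32 MiB, 256 MiB), rounded down
// to the stripe length.
fn metadata_chunk_size(total_bytes: u64) -> u64 {
    (total_bytes / 10).clamp(32 * MIB, 256 * MIB) / STRIPE_LEN * STRIPE_LEN
}

// Data chunk: clamp(total / 10, 64 MiB, 1 GiB), same rounding.
fn data_chunk_size(total_bytes: u64) -> u64 {
    (total_bytes / 10).clamp(64 * MIB, 1024 * MIB) / STRIPE_LEN * STRIPE_LEN
}
```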

Step 3: block address assignment

BlockLayout assigns a logical address to each tree block:

  • Chunk tree: at SYSTEM_GROUP_OFFSET (1 MiB), in the system chunk.
  • Root, Extent, Dev, FS, Csum, FreeSpace, DataReloc trees: sequential in the metadata chunk starting at meta_logical, spaced by nodesize.
  • Block-group tree (if enabled): the 8th block in the metadata chunk.

For example, with nodesize = 16384 and meta_logical = 5 MiB:

Tree        Logical address
Chunk       0x100000 (1 MiB)
Root        0x500000 (5 MiB)
Extent      0x504000
Dev         0x508000
FS          0x50C000
Csum        0x510000
FreeSpace   0x514000
DataReloc   0x518000
BlockGroup  0x51C000 (optional)

Step 4: tree block construction

Each tree is built as a single leaf node using LeafBuilder. Items must be pushed in strictly ascending key order. The builder handles offset bookkeeping: item descriptors grow forward from byte 101 (after the header), item data grows backward from the end of the block.

Tree block format

Bytes 0-31:    checksum (32 bytes, computed last)
Bytes 32-47:   fsid (16 bytes)
Bytes 48-55:   bytenr (logical address, 8 bytes LE)
Bytes 56-63:   flags (8 bytes LE)
Bytes 64-79:   chunk_tree_uuid (16 bytes)
Bytes 80-87:   generation (8 bytes LE)
Bytes 88-95:   owner tree objectid (8 bytes LE)
Bytes 96-99:   nritems (4 bytes LE)
Byte 100:      level (0 for leaf, >0 for internal node)

After the 101-byte header, item descriptors occupy 25 bytes each:

Bytes 0-16:    key (objectid:8 + type:1 + offset:8)
Bytes 17-20:   data_offset (relative to end of header, 4 bytes LE)
Bytes 21-24:   data_size (4 bytes LE)

Item data payloads fill from the end of the block backward. The space between the last descriptor and the first data payload is unused.
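The bookkeeping above can be sketched as a tiny space accountant (names are mine; the offsets returned are absolute within the block, whereas the on-disk data_offset field is relative to the end of the header):

```rust
const HEADER_SIZE: usize = 101;
const ITEM_SIZE: usize = 25;

// Minimal leaf-space bookkeeping: descriptors grow forward from the
// header, item data grows backward from the end of the block.
struct LeafSpace {
    nritems: usize,
    data_end: usize, // offset where the lowest data payload starts
}

impl LeafSpace {
    fn new(nodesize: usize) -> Self {
        LeafSpace { nritems: 0, data_end: nodesize }
    }

    fn free_space(&self) -> usize {
        self.data_end - (HEADER_SIZE + self.nritems * ITEM_SIZE)
    }

    // Account for one more item; on success, returns the new item's
    // (descriptor_offset, data_offset) within the block.
    fn push(&mut self, data_size: usize) -> Option<(usize, usize)> {
        if self.free_space() < ITEM_SIZE + data_size {
            return None; // would not fit: caller must start a new leaf
        }
        let descriptor_offset = HEADER_SIZE + self.nritems * ITEM_SIZE;
        self.data_end -= data_size;
        self.nritems += 1;
        Some((descriptor_offset, self.data_end))
    }
}
```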

Root tree contents

The root tree contains a ROOT_ITEM (key type 132) for each tree that needs one. The root tree itself is excluded (it cannot reference itself), and the chunk tree is located through the superblock’s chunk_root pointer rather than the root tree — although in practice a ROOT_ITEM is still written for the chunk tree, via the ROOT_ITEM_TREES list.

Trees receiving a ROOT_ITEM: Extent, Dev, FS, Csum, FreeSpace, DataReloc, and optionally BlockGroup. Each ROOT_ITEM is 439 bytes and contains:

  • An embedded btrfs_inode_item (160 bytes) for the root directory.
  • Tree-specific fields: generation, root_dirid, bytenr (pointing to the tree’s block), byte_limit, bytes_used, refs, level.

The FS tree’s ROOT_ITEM gets additional initialization:

  • A deterministic UUID (derived by XOR-flipping the filesystem UUID).
  • BTRFS_INODE_ROOT_ITEM_INIT flag set in the embedded inode.
  • inode.size = 3, inode.nbytes = nodesize.
  • ctime and otime timestamps set to the creation time.

Extent tree contents

The extent tree contains one METADATA_ITEM (or EXTENT_ITEM if skinny metadata is disabled) for each tree block, plus BLOCK_GROUP_ITEM entries for each block group (unless the block-group tree is enabled, in which case block group items go there instead).

Each metadata extent item consists of 24 bytes (btrfs_extent_item: refs, generation, flags) plus a 9-byte inline TREE_BLOCK_REF (type byte + root objectid). With skinny metadata, the key is (bytenr, METADATA_ITEM, level). Without skinny metadata, the key is (bytenr, EXTENT_ITEM, nodesize) and an additional 18-byte btrfs_tree_block_info is included.

Block group items (24 bytes each) are keyed as (logical_addr, BLOCK_GROUP_ITEM, chunk_size) and contain the bytes used, chunk objectid, and profile flags.

All items are collected, sorted by key, then pushed to the leaf.

Chunk tree contents

The chunk tree contains:

  1. DEV_ITEM entries for each device, keyed as (DEV_ITEMS_OBJECTID, DEV_ITEM, devid). Each contains the device’s total bytes, bytes used, sector size, and UUIDs.

  2. CHUNK_ITEM entries for each block group:

    • System chunk: uses sectorsize for io_align/io_width (bootstrap convention). One stripe on device 1.
    • Metadata chunk: uses STRIPE_LEN (64 KiB) for io_align/io_width. Two stripes for DUP, one per device for RAID1.
    • Data chunk: uses STRIPE_LEN for io_align/io_width. One stripe for SINGLE.

Dev tree contents

The dev tree contains:

  1. PERSISTENT_ITEM (DEV_STATS) for each device – all five counters zeroed (40 bytes).
  2. DEV_EXTENT items for each physical allocation:
    • System chunk: device 1 at SYSTEM_GROUP_OFFSET.
    • Metadata stripes: one or two entries per device.
    • Data stripes: one entry per device.

Items are sorted by key (devid, DEV_EXTENT, physical_offset).

FS tree contents

Contains two items for the root directory inode (objectid 256):

  1. INODE_ITEM: directory mode 040755, nlink=1, nbytes=nodesize, generation=1, timestamps set to creation time.
  2. INODE_REF: index=0, name=.., parent_ino=256 (self-referencing for the root directory).

Csum tree

Empty leaf (no items). Populated later if files are written.

Free-space tree

If the free-space-tree feature is enabled, contains FREE_SPACE_INFO and FREE_SPACE_EXTENT items for each block group. Each block group gets:

  • One FREE_SPACE_INFO item with extent_count=1.
  • One FREE_SPACE_EXTENT item covering the unused portion of the block group (from used_bytes to group_size).

If the free-space-tree feature is disabled, this is an empty leaf.

Data-reloc tree

Same structure as the FS tree: root directory inode (objectid 256) with INODE_ITEM and INODE_REF.

Block-group tree (optional)

If the block-group-tree compat_ro feature is enabled, block group items are placed here instead of in the extent tree. Contains three BLOCK_GROUP_ITEM entries (system, metadata, data).

Step 5: checksum computation

After each tree block is fully constructed, btrfs_disk::util::csum_tree_block computes the checksum of bytes CSUM_SIZE..nodesize and writes the result into the first bytes of the block:

  • CRC32C: 4 bytes (standard CRC32C via crc32c::crc32c).
  • xxHash64: 8 bytes.
  • SHA-256: 32 bytes.
  • BLAKE2b-256: 32 bytes.

Remaining bytes in the 32-byte checksum field stay zero.

Step 6: writing to disk

Tree blocks

Each tree block is written to its physical location(s) using pwrite_all. The logical-to-physical mapping is provided by ChunkLayout::logical_to_physical:

  • System chunk blocks: one write at the logical address (identity mapping) on device 1.
  • Metadata chunk blocks: one write per stripe. For DUP: two writes on device 1 at different offsets. For RAID1: one write per device.
  • Data chunk blocks: one write per stripe (typically one for SINGLE).

Superblocks

The superblock is constructed with all necessary fields:

  • magic: _BHRfS_M
  • root: logical address of the root tree block
  • chunk_root: logical address of the chunk tree block
  • total_bytes: sum across all devices
  • bytes_used: system used + metadata used (no data used for empty filesystem)
  • sectorsize, nodesize, leafsize (= nodesize), stripesize (= sectorsize)
  • num_devices: device count
  • incompat_flags, compat_ro_flags: from configuration
  • csum_type: checksum algorithm
  • cache_generation: 0 if free-space-tree enabled, u64::MAX otherwise
  • sys_chunk_array: embedded copy of the system chunk (disk_key + chunk_item bytes), enabling the kernel to bootstrap chunk mapping from the superblock alone

The sys_chunk_array is the bootstrap mechanism: it contains a serialized disk key followed by the system chunk item data (including stripe info), stored in a fixed 2048-byte buffer within the superblock. The kernel reads this array first to locate the chunk tree block, then reads the chunk tree to find all other chunks.

Each device gets its own superblock with device-specific fields (devid, dev_uuid, bytes_used for that device). The superblock is written to all valid mirror locations (up to 3):

  • Mirror 0: byte offset 65536 (64 KiB) – always written.
  • Mirror 1: byte offset 67108864 (64 MiB) – written if device is large enough.
  • Mirror 2: byte offset 274877906944 (256 GiB) – written if device is large enough.

After all writes, fsync is called on all device files.

Part 2: rootdir population (make_btrfs_with_rootdir)

The --rootdir flag populates the new filesystem from a source directory on the host. This is significantly more complex than the empty filesystem case because:

  1. The FS tree may need multiple leaf blocks (and internal nodes).
  2. File data must be written to the data chunk.
  3. The extent tree must reference both metadata blocks and data extents.
  4. The csum tree must contain checksums for all data.
  5. The extent tree must contain entries for its own blocks, creating a circular dependency.

Step 1: directory walk (walk_directory)

The rootdir::walk_directory function performs a depth-first traversal of the source directory, building all FS tree items and identifying files that need data extents.

Inode assignment

Inode numbers are assigned sequentially starting at 257 (inode 256 is the root directory, handled separately). The root directory (objectid 256) gets its INODE_ITEM and INODE_REF added during the merge phase.

For files with nlink > 1, the function tracks (dev, ino) pairs from the host filesystem in a HashMap. When a subsequent directory entry refers to the same host inode:

  • No new btrfs inode number is assigned; the existing one is reused.
  • An INODE_REF is added (additional reference from the new parent).
  • No new INODE_ITEM is created.
  • The nlink counter for that btrfs inode is incremented.

After all entries are processed, fixup_inode_nlink patches the nlink field in the INODE_ITEM for all hardlinked inodes.

Per-entry processing

For each directory entry (file, directory, symlink, special file):

  1. DIR_ITEM in the parent directory, keyed by name hash (crc32c(0xFFFFFFFE, name)).
  2. DIR_INDEX in the parent directory, keyed by sequential index (starting at 2 for each directory).
  3. INODE_REF for the new inode, pointing to the parent.
  4. INODE_ITEM with metadata copied from the host filesystem (uid, gid, mode, timestamps, rdev for special files).
  5. XATTR_ITEM entries for each extended attribute on the host file (read via llistxattr/lgetxattr).

Type-specific items:

  • Directories: Push children onto the DFS stack (reversed for correct order). Initialize the dir_index counter for the new directory.
  • Symlinks: Create an inline FILE_EXTENT_ITEM containing the link target (never compressed).
  • Regular files with size > 0:
    • If size <= max_inline_data_size: read the file, optionally compress, create an inline FILE_EXTENT_ITEM.
    • If size > max_inline_data_size: defer to the data writing phase. Record a FileAllocation with the host path, btrfs inode, size, and NODATASUM flag.
  • Special files (FIFO, socket, char/block device): INODE_ITEM only, no extent.

Inline extent threshold

The maximum inline data size is min(sectorsize - 1, nodesize - 147). With the defaults (sectorsize=4096, nodesize=16384), this is 4095 bytes. Files at or below this threshold are stored directly in the tree leaf.
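As a worked check of that formula — the 147-byte overhead being the 101-byte header, the 25-byte item descriptor, and the 21-byte inline FILE_EXTENT_ITEM header:

```rust
// Largest file payload that can be stored inline in a leaf: limited
// both by one sector (minus 1) and by the leaf overhead of
// 101 (header) + 25 (item descriptor) + 21 (inline extent header).
fn max_inline_data_size(sectorsize: u32, nodesize: u32) -> u32 {
    (sectorsize - 1).min(nodesize - 147)
}
```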

Inode flags

The --inode-flags argument allows setting NODATACOW and NODATASUM flags per path. NODATACOW implies NODATASUM for regular files. These flags are set in the INODE_ITEM and affect whether checksums are generated during the data writing phase.

Directory size fixup

After the walk, fixup_inode_size patches each non-root directory’s INODE_ITEM size field to match the sum of name_len * 2 from its DIR_INDEX entries (the btrfs convention for directory sizes).

Inline nbytes fixup

fixup_inline_nbytes patches the nbytes field of INODE_ITEM entries for files with inline extents. For inline extents, nbytes equals the inline data size (the actual stored bytes, which may be compressed).

Output

walk_directory returns a RootdirPlan containing:

  • fs_items: sorted list of all FS tree items (excluding root dir inode).
  • file_extents: list of FileAllocation entries for files needing data extents.
  • data_bytes_needed: total aligned data bytes needed in the data chunk.
  • root_dir_nlink, root_dir_size: root directory metadata.

Step 2: data writing (write_file_data)

For each file in plan.file_extents, the function reads the host file in 1 MiB chunks (MAX_EXTENT_SIZE) and writes each chunk to the data block group:

Per-extent processing

  1. Read up to 1 MiB of raw data from the host file.
  2. Optionally try compression (zlib or zstd). If the compressed output is smaller than the input, use it; otherwise store uncompressed.
  3. Pad the (possibly compressed) data to sectorsize alignment.
  4. Compute the logical disk address: data_logical + current_offset.
  5. Write the padded data to all physical locations for this logical address.
  6. Compute per-sector checksums (skipped for NODATASUM files):
    • For each sector in the padded data, compute the checksum using the configured algorithm.
    • Pack all checksums into a single EXTENT_CSUM item.
  7. Create a FILE_EXTENT_ITEM (regular type) in the FS tree items: disk_bytenr, disk_num_bytes (aligned compressed size), offset=0, num_bytes (logical file extent size), ram_bytes (uncompressed size), compression type.
  8. Create an EXTENT_ITEM with inline EXTENT_DATA_REF in the extent tree items: refs=1, generation=1, flags=DATA.

After processing all files, nbytes_updates records the total disk-allocated bytes per inode, which are patched into the corresponding INODE_ITEM entries via apply_nbytes_updates.

Step 3: multi-leaf tree building (TreeBuilder)

When a tree has more items than fit in a single leaf, TreeBuilder splits them across multiple leaves and creates internal nodes to form a valid B-tree.

Leaf packing

Items are packed into leaves sequentially:

  1. Start a new leaf.
  2. For each item, check if the leaf has space for the item descriptor (25 bytes) plus the item data. If not, finalize the current leaf and start a new one.
  3. Record the first key of each leaf for parent node entries.

Internal node construction

If more than one leaf is produced:

  1. Create internal nodes at level 1, each pointing to up to (nodesize - 101) / 33 child blocks (33 bytes per key-pointer entry: 17 key + 8 blockptr + 8 generation).
  2. If more than one level-1 node is needed, create level-2 nodes, and so on.
  3. Repeat until a single root node remains.

Node balancing: if the last node at a level would have fewer than 1/4 of the maximum entries, the previous node is split more evenly to avoid a tiny remainder.
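The fanout arithmetic and the resulting tree height can be checked directly (this ignores the balancing tweak above, which changes how entries are distributed but not how many levels are needed):

```rust
// Key pointers are 33 bytes each (17-byte key + 8-byte blockptr +
// 8-byte generation); the header takes 101 bytes.
fn node_fanout(nodesize: usize) -> usize {
    (nodesize - 101) / 33
}

// Levels needed (counting the leaf level) for `leaves` leaf blocks,
// grouping upward until a single root block remains.
fn tree_levels(nodesize: usize, leaves: usize) -> usize {
    let fanout = node_fanout(nodesize);
    let mut blocks = leaves.max(1);
    let mut levels = 1;
    while blocks > 1 {
        blocks = (blocks + fanout - 1) / fanout; // ceiling division
        levels += 1;
    }
    levels
}
```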

Placeholder addresses

All blocks are initially built with bytenr = 0 in the header. After address assignment, TreeBuilder::assign_addresses patches:

  • The bytenr field in each block’s header (offset 48).
  • The blockptr fields in internal nodes (for each key-pointer entry at offset 17 relative to the entry start).

Step 4: the convergence loop

This is the solution to the bootstrapping problem.

The bootstrapping problem

The extent tree must contain a METADATA_ITEM (or EXTENT_ITEM) for every tree block in the filesystem, including the extent tree’s own blocks. But the number of extent tree blocks depends on how many items it contains, which includes its own self-referential entries. Adding more extent tree blocks requires more extent items, which might require even more blocks.

Solution: iterate until stable

The converge_extent_tree_block_count function iteratively computes the extent tree block count:

  1. Start with extent_tree_block_count = 1.
  2. Construct a trial set of all extent items:
    • One METADATA_ITEM per tree block (chunk tree, root tree, extent_tree_block_count extent tree blocks, dev tree, FS tree blocks, csum tree blocks, free-space tree block, data-reloc tree blocks, block-group tree block if applicable).
    • All data extent items from the data writing phase.
    • Block group items (if not using block-group tree).
  3. Sort all trial items by key.
  4. Build the trial extent tree using TreeBuilder::build to determine how many blocks it needs.
  5. If trial.blocks.len() == extent_tree_block_count, the count has stabilized; break.
  6. Otherwise, set extent_tree_block_count = trial.blocks.len() and repeat.

In practice, this converges in 1-3 iterations. The count is monotonically non-decreasing (adding self-referential items can only increase the block count), so convergence is guaranteed.
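The fixed point can be illustrated with a toy model in which every item has the same size; real items vary in size and the tree may grow internal nodes, but the iteration pattern is the same:

```rust
// Toy convergence loop: the extent tree needs one item per tree block
// in the filesystem, including its own blocks. Iterate the block
// count until it stabilizes.
fn converge_extent_tree_blocks(other_items: usize, items_per_leaf: usize) -> usize {
    assert!(items_per_leaf > 1);
    let mut count = 1;
    loop {
        let total_items = other_items + count; // + self-referential entries
        let needed = (total_items + items_per_leaf - 1) / items_per_leaf;
        if needed == count {
            return count; // stabilized
        }
        count = needed; // monotonically non-decreasing
    }
}
```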

Step 5: address assignment

Once the extent tree block count is known, BlockAllocator assigns real logical addresses in a fixed order:

  1. Chunk tree: allocate from the system chunk (alloc_system).
  2. Root tree: allocate from the metadata chunk (alloc_metadata).
  3. Extent tree blocks (count from convergence loop): sequential metadata allocations.
  4. Dev tree: one metadata allocation.
  5. FS tree blocks: sequential metadata allocations.
  6. Csum tree blocks: sequential metadata allocations.
  7. Free-space tree: one metadata allocation (if enabled).
  8. Data-reloc tree blocks: sequential metadata allocations.
  9. Block-group tree: one metadata allocation (if enabled).

BlockAllocator maintains separate bumping pointers for the system chunk (SYSTEM_GROUP_OFFSET to SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE) and the metadata chunk (meta_logical to meta_logical + meta_size), returning an error if either runs out of space.

Step 6: building the real extent tree

With real addresses known, the actual extent tree is built:

  1. Create METADATA_ITEM entries for every tree block using their real addresses.
  2. Include all data extent items from the data writing phase.
  3. Include block group items (in-extent-tree or separate block-group tree).
  4. Sort all items by key.
  5. Build with TreeBuilder::build.
  6. Assert that the block count matches the converged count (if it does not, the convergence loop has a bug).
  7. Assign addresses to extent tree blocks from the pre-allocated address list.

Step 7: building remaining trees

With all addresses finalized:

  1. FS tree: TreeBuilder::assign_addresses patches bytenr fields using pre-allocated addresses.
  2. Csum tree: same.
  3. Data-reloc tree: same.
  4. Chunk tree: rebuilt as a single leaf with final device bytes_used values.
  5. Dev tree: rebuilt as a single leaf with final device extent information.
  6. Free-space tree: rebuilt with final used-byte counts for each block group.
  7. Block-group tree: rebuilt with final used-byte counts.
  8. Root tree: rebuilt with final tree root addresses and levels for all trees.

The root tree is always a single leaf because the number of ROOT_ITEM entries is small (6-8 trees). It is built last because it needs the root address and level of every other tree.

Step 8: writing to disk

All tree blocks are written in order:

  1. Single-leaf trees (chunk, root, dev): compute checksum, write to all physical locations.
  2. Multi-block trees (extent, FS, csum, data-reloc): for each block, compute checksum, write to all physical locations.
  3. Optional single-leaf trees (free-space, block-group): compute checksum, write.

The write_rootdir_trees helper manages this process.

Step 9: superblock

The superblock is built with:

  • root: root tree address (from step 5).
  • chunk_root: chunk tree address (from step 5).
  • bytes_used: system_used + metadata_used + data_used.

Written to all mirror locations on all devices.

Step 10: shrink (optional)

If --shrink is specified and there is a single device:

  1. Compute the physical end of the last chunk (considering all metadata and data stripes).
  2. Round up to sectorsize alignment.
  3. Create a new config with total_bytes set to this shrunk size.
  4. Rebuild the chunk tree and superblock with the reduced total_bytes (so DEV_ITEM.total_bytes and superblock.total_bytes reflect the actual image size).
  5. After all writes, truncate the image file to the shrunk size with set_len.

This produces a minimal image file suitable for distribution or flashing.

Item serialization (items.rs)

All item serializers produce Vec<u8> suitable for LeafBuilder::push. They use the bytes::BufMut trait for little-endian encoding and derive field positions from std::mem::offset_of! and std::mem::size_of on the bindgen structs.

Key serializers and their sizes:

| Function | Item type | Approximate size |
|---|---|---|
| root_item | ROOT_ITEM | 439 bytes |
| extent_item | EXTENT_ITEM/METADATA_ITEM | 33 bytes (skinny) or 51 bytes |
| block_group_item | BLOCK_GROUP_ITEM | 24 bytes |
| dev_item | DEV_ITEM | 98 bytes |
| chunk_item | CHUNK_ITEM | 48 + 32*num_stripes bytes |
| dev_extent | DEV_EXTENT | 48 bytes |
| dev_stats_zeroed | PERSISTENT_ITEM | 40 bytes |
| free_space_info | FREE_SPACE_INFO | 8 bytes |
| inode_item_dir | INODE_ITEM | 160 bytes |
| inode_item | INODE_ITEM | 160 bytes |
| inode_ref | INODE_REF | 10 + name_len bytes |
| dir_item | DIR_ITEM/DIR_INDEX | 30 + name_len bytes |
| xattr_item | XATTR_ITEM | 30 + name_len + value_len bytes |
| file_extent_inline | FILE_EXTENT_ITEM | 21 + data_len bytes |
| file_extent_reg | FILE_EXTENT_ITEM | 53 bytes |
| data_extent_item | EXTENT_ITEM | 53 bytes |

Checksum computation (write.rs)

ChecksumType supports four algorithms, each computing checksums of the data portion (bytes 32..end) of tree blocks and superblocks:

| Algorithm | On-disk type value | Output size | Implementation |
|---|---|---|---|
| CRC32C | 0 | 4 bytes | crc32c crate |
| xxHash64 | 1 | 8 bytes | xxhash-rust crate |
| SHA-256 | 2 | 32 bytes | sha2 crate |
| BLAKE2b-256 | 3 | 32 bytes | blake2 crate |

csum_tree_block writes the computed hash into the first N bytes of the block’s checksum field (32 bytes total), zero-filling the remaining bytes.
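The digest-placement step can be sketched as follows. This is a minimal illustration of writing an N-byte digest into the fixed 32-byte checksum field and zero-filling the tail; the names are illustrative, not the actual btrfsutils API:

```rust
// Sketch: place an N-byte digest into the 32-byte checksum field at the
// start of a block, zero-filling the unused remainder (as described for
// csum_tree_block). Illustrative only.
const CSUM_FIELD_SIZE: usize = 32;

fn write_csum(block: &mut [u8], digest: &[u8]) {
    assert!(digest.len() <= CSUM_FIELD_SIZE);
    block[..digest.len()].copy_from_slice(digest);
    block[digest.len()..CSUM_FIELD_SIZE].fill(0); // unused tail must be zero
}

fn main() {
    let mut block = [0xAAu8; 64];
    write_csum(&mut block, &[1, 2, 3, 4]); // e.g. a 4-byte CRC32C digest
    assert_eq!(&block[..4], &[1, 2, 3, 4]);
    assert!(block[4..32].iter().all(|&b| b == 0));
    assert_eq!(block[32], 0xAA); // data portion untouched
}
```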

Data block checksums (in the csum tree) use the same algorithm but are computed per-sector.

The bootstrapping problem in detail

The bootstrapping problem is fundamental to mkfs and worth understanding in depth.

The circular dependency

Consider a minimal filesystem with 8 tree blocks. The extent tree must contain 8 METADATA_ITEM entries (one for each block, including itself). But what if those 8 entries do not fit in a single leaf?

With skinny metadata (METADATA_ITEM, 33-byte payload), each item uses 25 (descriptor) + 33 (data) = 58 bytes. A 16 KiB leaf has 16384 - 101 = 16283 usable bytes, fitting 16283 / 58 = 280 items (rounded down). So for an empty filesystem, the extent tree easily fits in one block.
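The capacity arithmetic above can be checked with a few lines of Rust (the constants 101, 25, and 33 come from the text; the helper name is illustrative):

```rust
// Sanity-check the leaf-capacity arithmetic: how many skinny METADATA_ITEM
// entries fit in one leaf of the given nodesize.
const HEADER_SIZE: usize = 101; // tree block header
const ITEM_DESC_SIZE: usize = 25; // per-item descriptor
const SKINNY_PAYLOAD: usize = 33; // METADATA_ITEM payload

fn metadata_items_per_leaf(nodesize: usize) -> usize {
    let usable = nodesize - HEADER_SIZE;
    usable / (ITEM_DESC_SIZE + SKINNY_PAYLOAD)
}

fn main() {
    // floor((16384 - 101) / 58) = 280 items per 16 KiB leaf.
    assert_eq!(metadata_items_per_leaf(16 * 1024), 280);
}
```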

But with --rootdir populating thousands of files, the FS tree, csum tree, and extent tree can each grow to many blocks. If the FS tree has 100 blocks and there are 500 data extents, the extent tree might need several blocks itself, and each additional extent tree block requires another METADATA_ITEM entry in the extent tree.

Why pre-computing works

The solution works because:

  1. Addresses are independent of content. Tree block addresses are assigned by sequential bump allocation, so the address of each block depends only on how many blocks precede it, not on the content of any block.

  2. Block count is monotonically non-decreasing. Adding self-referential entries can only increase (or maintain) the block count, never decrease it.

  3. The system is finite. There is a maximum number of blocks that can fit in the metadata chunk, bounding the iteration.

  4. Content depends only on addresses and counts. Once addresses are assigned, every tree block’s content is fully determined. There are no further dependencies.

The convergence loop exploits properties (1) and (2): it guesses a block count, computes trial content, checks if the trial needs the same number of blocks, and if not, tries again with the new count. Property (2) guarantees this converges (the count can only go up until it stabilizes).
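As a rough sketch (not the actual btrfsutils code), the loop is a fixed-point iteration. Here `blocks_needed` stands in for the real trial build: given a guessed extent-tree block count, it returns how many blocks the extent tree would actually need once it contains one METADATA_ITEM per tree block:

```rust
// Minimal sketch of the block-count convergence loop. `blocks_needed` is a
// stand-in for the trial TreeBuilder run; `max_blocks` bounds the iteration
// by the metadata chunk capacity (property 3).
fn converge(mut guess: u64, blocks_needed: impl Fn(u64) -> u64, max_blocks: u64) -> Option<u64> {
    loop {
        let actual = blocks_needed(guess);
        if actual == guess {
            return Some(guess); // fixed point: trial needs exactly `guess` blocks
        }
        if actual > max_blocks {
            return None; // metadata chunk exhausted
        }
        // Property (2): the count is monotonically non-decreasing, so this terminates.
        guess = actual;
    }
}

fn main() {
    // Toy model: 100 fixed tree blocks plus the extent tree's own blocks,
    // at 280 items per leaf (ceiling division).
    let needed = |extent_blocks: u64| (100 + extent_blocks + 279) / 280;
    let converged = converge(1, needed, 1000).unwrap();
    assert_eq!(needed(converged), converged);
}
```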

Implementation detail

The trial in each iteration uses placeholder addresses (sequential from meta_logical), not the final addresses. This is acceptable because the TreeBuilder only needs the item count and sizes to determine how many blocks are needed – the actual address values do not affect block count. After convergence, the real extent tree is built with the actual addresses from BlockAllocator.

Default features

The default incompat feature flags are:

  • MIXED_BACKREF – mixed backreference format
  • BIG_METADATA – larger metadata blocks
  • EXTENDED_IREF – extended inode references (INODE_EXTREF)
  • SKINNY_METADATA – skinny metadata extent refs (METADATA_ITEM key type)
  • NO_HOLES – no explicit hole extent items

The default compat_ro feature flags are:

  • FREE_SPACE_TREE – free-space tree (v2 free space tracking)
  • FREE_SPACE_TREE_VALID – marks the free-space tree as valid
  • BLOCK_GROUP_TREE – separate tree for block group items

Features can be enabled or disabled with -O feature or -O ^feature.

Multi-device support

For multi-device filesystems, chunk layout computation distributes stripes across devices:

  • RAID1 metadata: one stripe per device at CHUNK_START.
  • SINGLE data: one stripe on device 1.

Each device gets its own superblock with device-specific devid, dev_uuid, and bytes_used. The chunk tree contains a DEV_ITEM per device, and the dev tree contains DEV_EXTENT entries mapping physical allocations to chunks.

The logical_to_physical function determines write destinations: system chunk blocks go to device 1 only, metadata blocks go to all metadata stripe devices, data blocks go to all data stripe devices.

Limitations

Not yet implemented:

  • --rootdir with LZO compression (rejected at argument validation).
  • RAID0/5/6/10 profiles.
  • Zoned device support.
  • Mixed block group mode with --rootdir.

btrfs check: verification phases

This document describes the seven phases of btrfs check, as implemented in the cli/src/check/ module. The checker operates in read-only mode on an unmounted filesystem, reading the raw on-disk image through btrfs-disk’s BlockReader without requiring any kernel ioctls.

Overview

The check command opens the filesystem image and bootstraps the chunk tree (superblock -> sys_chunk_array -> chunk tree -> root tree), then runs seven sequential verification phases:

  1. Superblock mirror validation
  2. Tree structure checks (all trees)
  3. Extent tree cross-checks (reference counting and ownership)
  4. Chunk / block group / device extent cross-checks
  5. FS tree inode consistency
  6. Checksum tree verification
  7. ROOT_REF / ROOT_BACKREF consistency

Each phase accumulates errors into a CheckResults struct. Errors are printed to stderr as they are found, and a summary is printed at the end. The process exits with code 1 if any errors were detected.

Orchestration (check.rs)

The main CheckCommand::run method:

  1. Rejects unsupported flags (--repair, --init-csum-tree, --init-extent-tree, --backup, --tree-root, --chunk-root, --qgroup-report, --subvol-extents).
  2. Checks mount status (skippable with --force).
  3. Validates the superblock mirror index (0-2).
  4. Opens the filesystem via reader::filesystem_open_mirror, which bootstraps chunk mapping and discovers all tree roots.
  5. Runs phases 1-7 in order.
  6. Prints summary and exits.

Statistics tracking

Throughout all phases, CheckResults accumulates byte counts that are printed in the final summary:

  • total_tree_bytes: sum of nodesize for every tree block visited in phase 2.
  • total_fs_tree_bytes: subset of the above for FS trees (objectid 5 or >= 256).
  • total_extent_tree_bytes: subset of the above for the extent tree (objectid 2).
  • btree_space_waste: for each leaf, nodesize minus actual bytes used (header + item descriptors + item data payloads).
  • data_bytes_allocated: total length of data extents from extent items.
  • data_bytes_referenced: total referenced bytes, accounting for shared extents via ExtentDataRef and SharedDataRef count fields.
  • total_csum_bytes: total bytes of checksum data in the csum tree.

Phase 1: Superblocks

Source: cli/src/check/superblock.rs

Purpose: Validate all three superblock mirror copies.

What it checks

Btrfs stores up to three copies of the superblock at fixed byte offsets on the device:

  • Mirror 0: byte offset 65536 (64 KiB)
  • Mirror 1: byte offset 67108864 (64 MiB)
  • Mirror 2: byte offset 274877906944 (256 GiB)

For each mirror (0 through SUPER_MIRROR_MAX - 1):

  1. Read 4096 bytes from the mirror offset using read_superblock_bytes_at.
  2. Validate the superblock using superblock_is_valid, which checks:
    • The magic number matches _BHRfS_M (0x4D5F53665248425F).
    • The CRC32C checksum of bytes 32..4096 matches the stored checksum in bytes 0..4.
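The magic half of that validation can be sketched in a few lines. The constant 0x4D5F53665248425F is simply the ASCII string "_BHRfS_M" read as a little-endian u64, and the magic field sits at byte offset 64 of the superblock; the function name is illustrative (checksum validation is omitted here):

```rust
// Sketch of the magic-number half of superblock validation.
const BTRFS_MAGIC: u64 = 0x4D5F_5366_5248_425F; // "_BHRfS_M" as LE u64
const MAGIC_OFFSET: usize = 64; // magic field offset within the superblock

fn magic_is_valid(sb: &[u8; 4096]) -> bool {
    let raw: [u8; 8] = sb[MAGIC_OFFSET..MAGIC_OFFSET + 8].try_into().unwrap();
    u64::from_le_bytes(raw) == BTRFS_MAGIC
}

fn main() {
    // The constant really is the string read little-endian.
    assert_eq!(u64::from_le_bytes(*b"_BHRfS_M"), BTRFS_MAGIC);
    let mut sb = [0u8; 4096];
    sb[MAGIC_OFFSET..MAGIC_OFFSET + 8].copy_from_slice(b"_BHRfS_M");
    assert!(magic_is_valid(&sb));
}
```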

If a mirror cannot be read (I/O error), this is only reported as an error for mirror 0. Mirrors 1 and 2 may legitimately be absent on small devices where the device is shorter than the mirror offset.

Generation consistency

The current implementation validates each mirror independently (magic + checksum). The C reference additionally checks that the generation fields across valid mirrors are consistent (the primary mirror should have the highest generation). This is not yet implemented.

Error variants produced

  • SuperblockInvalid { mirror, detail } – reported when:
    • A mirror has an invalid checksum or magic number.
    • Mirror 0 cannot be read at all (I/O error).

Return value

Returns the count of valid mirrors found (0-3). This value is currently not used by the caller but could be used for repair decisions in the future.

Phase 2: Tree structure

Source: cli/src/check/tree_structure.rs

Purpose: Walk every tree in the filesystem and verify per-block structural integrity. Collect a map of all tree block addresses and their owners for use in phase 3.

Trees checked

The phase checks:

  1. Root tree – directly from superblock.root.
  2. Chunk tree – directly from superblock.chunk_root.
  3. All trees discovered in the root tree – every (tree_id, (bytenr, gen)) pair from open.tree_roots. This includes the extent tree, dev tree, FS tree, csum tree, free-space tree, data-reloc tree, block-group tree (if present), and all subvolume/snapshot trees.

Each tree is walked using reader::tree_walk_tolerant, which performs a depth-first traversal through all internal nodes and leaves, calling the visitor callback for each block. The _tolerant variant collects read errors instead of aborting, allowing the checker to report all problems rather than stopping at the first.

Per-block checks

For every tree block (leaf or internal node), the following checks are performed:

CRC32C checksum verification

The first 32 bytes of each block contain the checksum. The checker computes btrfs_csum_data(&raw[32..]) (CRC32C with the standard 0xFFFFFFFF seed and final XOR) and compares it to the stored value in raw[0..4]. This check is only performed when the superblock’s csum_type is CRC32C; other checksum types emit a warning and skip verification.

Fsid match

The block header’s fsid field (16 bytes at offset 32) must match the filesystem’s effective fsid. The effective fsid is metadata_uuid if the METADATA_UUID incompat flag is set, or fsid otherwise. This distinction matters for filesystems that have had their metadata UUID changed via btrfs-tune -m.

Generation bound

The block header’s generation field must not exceed the superblock’s generation. A block with a higher generation than the superblock indicates corruption (the block was written in a transaction that was never committed, or the block has been corrupted).

Level consistency

  • Leaf blocks (items present) must have header.level == 0.
  • Internal nodes (key-pointer entries) must have header.level > 0.

A mismatch indicates structural corruption where a block’s type disagrees with its declared level.

Key ordering

Within each block, keys must be in strictly ascending order using the compound key comparison (objectid, type, offset):

  • For leaves: consecutive items items[i-1] and items[i] must satisfy key_less(prev, cur).
  • For internal nodes: consecutive key-pointers ptrs[i-1] and ptrs[i] must satisfy key_less(prev, cur).

Strictly ascending means no duplicates are allowed. The comparison function uses the raw type byte for the type field (via key_type.to_raw()), comparing the tuple (objectid, type_raw, offset) lexicographically.
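The comparison can be sketched with a derived tuple ordering — deriving Ord on a struct with fields in (objectid, type_raw, offset) order gives exactly the lexicographic comparison described (the struct and function names are illustrative):

```rust
// Sketch of the strict key-ordering check: keys compare lexicographically
// as (objectid, raw type byte, offset); equal keys are violations too.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct Key {
    objectid: u64,
    type_raw: u8,
    offset: u64,
}

// Returns indices whose key is not strictly greater than its predecessor.
fn order_violations(keys: &[Key]) -> Vec<usize> {
    (1..keys.len()).filter(|&i| keys[i] <= keys[i - 1]).collect()
}

fn main() {
    let k = |o, t, off| Key { objectid: o, type_raw: t, offset: off };
    // Ascending, then a duplicate at index 2 and a regression at index 3.
    let keys = [k(1, 1, 0), k(1, 2, 0), k(1, 2, 0), k(0, 9, 9)];
    assert_eq!(order_violations(&keys), vec![2, 3]);
}
```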

Byte attribution

Each visited block contributes nodesize bytes to the appropriate category:

  • Extent tree blocks (objectid 2) -> total_extent_tree_bytes
  • FS tree blocks (objectid 5 or >= 256) -> total_fs_tree_bytes
  • All blocks -> total_tree_bytes

For leaf blocks, space waste is computed as:

waste = nodesize - (101 + nritems * 25 + sum(item.size for each item))

where 101 is the header size and 25 is the item descriptor size.
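Expressed as a small function (constants from the formula above; the function name is illustrative):

```rust
// The leaf space-waste formula from the text, as a function.
// 101 = header size, 25 = per-item descriptor size.
fn leaf_waste(nodesize: u64, item_sizes: &[u64]) -> u64 {
    let used = 101 + item_sizes.len() as u64 * 25 + item_sizes.iter().sum::<u64>();
    nodesize - used
}

fn main() {
    // A 16 KiB leaf holding two 160-byte items:
    // waste = 16384 - (101 + 2*25 + 320) = 15913.
    assert_eq!(leaf_waste(16384, &[160, 160]), 15913);
}
```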

Output

Returns a HashMap<u64, u64> mapping each tree block’s logical address to the objectid of the tree that owns it. This map is used by phase 3 for bidirectional ownership verification.

Tree name resolution

Tree names for error messages are derived from the objectid using ObjectId formatting (e.g., objectid 1 = “ROOT_TREE”, objectid 5 = “FS_TREE”, objectid 256+ = the numeric subvolume ID). Names are leaked as &'static str since the set of tree names is small and bounded.

Error variants produced

  • TreeBlockChecksumMismatch { tree, logical } – CRC32C does not match.
  • TreeBlockBadFsid { tree, logical } – header fsid does not match the filesystem’s effective fsid.
  • TreeBlockBadBytenr { tree, logical, header_bytenr } – the header’s bytenr field does not match the logical address where the block was read. (Note: this check is performed by the block reader during parsing, not directly in this phase, but the error is reported here if it occurs.)
  • TreeBlockBadGeneration { tree, logical, block_gen, super_gen } – block generation exceeds superblock generation.
  • TreeBlockBadLevel { tree, logical, detail } – level/type mismatch (leaf with non-zero level, or node with zero level).
  • KeyOrderViolation { tree, logical, index } – key at index is not strictly greater than the key at index - 1.
  • ReadError { logical, detail } – I/O error reading a tree block.

Phase 3: Extents

Source: cli/src/check/extents.rs

Purpose: Walk the extent tree to verify reference counts, detect overlapping extents, and cross-check tree block ownership against extent tree backrefs in both directions.

How it works

The phase walks the extent tree leaf by leaf, processing items in key order. It maintains an ExtentCheckState that tracks the “current” extent being verified and accumulates statistics.

Item processing

Items are processed based on their key type:

EXTENT_ITEM / METADATA_ITEM: Start a new extent. The previous extent (if any) is flushed first. For the new extent:

  1. Record the bytenr in extent_item_addrs (for later ownership checks).
  2. Determine the extent length:
    • EXTENT_ITEM: length = key.offset.
    • METADATA_ITEM: length = 0 (skinny refs use key.offset as level, not length, so overlap detection is skipped for metadata items).
  3. Check for overlap: if bytenr < prev_end and prev_end > 0, report an overlapping extent error.
  4. Parse the ExtentItem payload to extract:
    • The declared reference count (refs).
    • Inline backrefs and their count.
    • Whether this is a data extent (via BTRFS_EXTENT_FLAG_DATA).
    • For tree block extents: collect TreeBlockBackref inline refs into extent_backref_owners[bytenr].
  5. Initialize pending state: pending_refs = declared refs, pending_counted = inline ref count.
  6. For data extents, add length to data_bytes_allocated.

TREE_BLOCK_REF: Standalone tree block backref. Increments pending_counted by 1. Records key.offset (the root objectid) in extent_backref_owners.

SHARED_BLOCK_REF / EXTENT_OWNER_REF: Standalone backrefs. Each increments pending_counted by 1.

EXTENT_DATA_REF: Standalone data backref. Parses the item to extract the count field (number of references from this particular root/objectid/offset combination). Increments pending_counted by count. Adds length * count to data_bytes_referenced.

SHARED_DATA_REF: Same as EXTENT_DATA_REF but for shared (relocated) data references.

All other key types (e.g., BLOCK_GROUP_ITEM): ignored.

Inline reference counting

The count_inline_refs function iterates over the InlineRef variants in an ExtentItem:

  • TreeBlockBackref, SharedBlockBackref, ExtentOwnerRef: count as 1 each.
  • ExtentDataBackref, SharedDataBackref: count as their embedded count field (which may be > 1 for multiply-referenced data extents).

Flushing

When a new EXTENT_ITEM/METADATA_ITEM is encountered, or at the end of the tree walk, flush_pending is called:

  1. Skip if no extent is pending (pending_bytenr == 0).
  2. For data extents where data_bytes_referenced is still 0 (only inline refs, no standalone ExtentDataRef), compute data_bytes_referenced += pending_length * pending_counted.
  3. Compare pending_refs (declared) to pending_counted (actual). If they differ, report an ExtentRefMismatch error.
  4. Reset pending_bytenr to 0.
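The declared-vs-counted bookkeeping can be sketched as follows. The struct and field names are illustrative, not the actual ExtentCheckState definition:

```rust
// Sketch of pending-extent bookkeeping: declared refs come from the
// ExtentItem payload, counted refs accumulate from inline and standalone
// backrefs, and flushing compares the two.
struct Pending {
    bytenr: u64,       // 0 means "nothing pending"
    declared_refs: u64,
    counted_refs: u64,
}

fn flush(p: &mut Pending, errors: &mut Vec<String>) {
    if p.bytenr == 0 {
        return; // nothing pending
    }
    if p.declared_refs != p.counted_refs {
        errors.push(format!(
            "extent {}: refs mismatch, declared {} counted {}",
            p.bytenr, p.declared_refs, p.counted_refs
        ));
    }
    p.bytenr = 0; // reset for the next extent
}

fn main() {
    let mut errors = Vec::new();
    // Extent declares 3 refs; only 2 backrefs are found (1 inline + 1 standalone).
    let mut p = Pending { bytenr: 30408704, declared_refs: 3, counted_refs: 1 };
    p.counted_refs += 1; // e.g. a standalone TREE_BLOCK_REF item
    flush(&mut p, &mut errors);
    assert_eq!(errors.len(), 1);
    assert_eq!(p.bytenr, 0); // reset after flush
}
```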

Bidirectional ownership cross-check

After the extent tree walk completes, two cross-checks are performed using the tree_block_owners map from phase 2:

Direction 1: tree block -> extent tree. For every tree block address found during phase 2 tree walks:

  • If the address has no EXTENT_ITEM or METADATA_ITEM in the extent tree, report MissingExtentItem.
  • If the address has extent tree entries but none of the claimed owner roots match the actual owner (the tree that contained this block during phase 2 walks), report BackrefOwnerMismatch.

Direction 2: extent tree -> tree block. For every tree block address with backrefs in the extent tree:

  • For each claimed owner root, check if the actual owner (from the phase 2 map) matches. If the block was not found during phase 2 walks, or belongs to a different tree, report BackrefOrphan.

Both cross-checks sort addresses before iteration for deterministic error ordering.

Error variants produced

  • ExtentRefMismatch { bytenr, expected, found } – the declared reference count in the ExtentItem does not match the sum of inline and standalone backrefs.
  • MissingExtentItem { bytenr } – a tree block observed during phase 2 has no corresponding EXTENT_ITEM or METADATA_ITEM in the extent tree.
  • BackrefOwnerMismatch { bytenr, actual_owner, claimed_owners } – the tree block’s actual owner (from phase 2) does not appear in the extent tree’s list of backref owners for that address.
  • BackrefOrphan { bytenr, claimed_owner } – the extent tree claims a backref for a tree that does not actually contain a block at that address.
  • OverlappingExtent { bytenr, length, prev_end } – two data extents overlap in logical address space (the start of one extent is before the end of the previous).
  • ReadError { logical, detail } – I/O error reading the extent tree.

Phase 4: Chunks / block groups / device extents

Source: cli/src/check/chunks.rs

Purpose: Cross-check the chunk tree, block group items, and device extents for mutual consistency.

What it checks

This phase performs three categories of cross-checks:

Chunk <-> block group cross-check

Every chunk in the chunk tree’s ChunkTreeCache (built during filesystem open) should have a corresponding BLOCK_GROUP_ITEM in the extent tree (or block-group tree, if the BLOCK_GROUP_TREE compat_ro feature is enabled). And vice versa: every block group item should correspond to a chunk.

Block groups are collected by walking either:

  • The block-group tree if BTRFS_FEATURE_COMPAT_RO_BLOCK_GROUP_TREE is set in the superblock’s compat_ro_flags.
  • The extent tree otherwise (block group items historically lived in the extent tree).

The walk collects all items with key type BLOCK_GROUP_ITEM into a BTreeMap keyed by logical address.

Then:

  1. For each chunk in the chunk cache: if no block group exists at that logical address, report ChunkMissingBlockGroup.
  2. For each block group: if the chunk cache has no chunk at that logical address, report BlockGroupMissingChunk.

Device extent overlap detection

Device extents are collected from the device tree by walking all items with key type DEV_EXTENT. Each extent is recorded as (offset, length) grouped by device ID (key.objectid).

For each device, extents are sorted by physical offset. Then consecutive pairs are checked: if extents[i].offset < extents[i-1].offset + extents[i-1].length, the extents overlap and DeviceExtentOverlap is reported.
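The sort-then-compare-neighbors pattern can be sketched like this (extents as (offset, length) pairs for one device; the function name is illustrative):

```rust
// Sketch of per-device extent overlap detection: sort by physical offset,
// then compare each extent's start against the previous extent's end.
// Returns the offsets of the offending (overlapping) extents.
fn overlaps(extents: &mut Vec<(u64, u64)>) -> Vec<u64> {
    extents.sort_by_key(|&(offset, _)| offset);
    let mut bad = Vec::new();
    for pair in extents.windows(2) {
        let (prev_off, prev_len) = pair[0];
        let (cur_off, _) = pair[1];
        if cur_off < prev_off + prev_len {
            bad.push(cur_off); // starts inside the previous extent
        }
    }
    bad
}

fn main() {
    // (offset, length): the second extent starts inside the first.
    let mut dev1 = vec![(1048576u64, 8388608u64), (4194304, 1048576), (16777216, 1048576)];
    assert_eq!(overlaps(&mut dev1), vec![4194304]);
}
```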

Error variants produced

  • ChunkMissingBlockGroup { logical } – a chunk exists in the chunk tree but no block group item was found at the same logical address.
  • BlockGroupMissingChunk { logical } – a block group item exists but no chunk was found at the same logical address.
  • DeviceExtentOverlap { devid, offset } – two device extents on the same device overlap in physical address space.
  • ReadError { logical, detail } – I/O error reading the block-group tree, extent tree, or device tree.

Phase 5: FS roots

Source: cli/src/check/fs_roots.rs

Purpose: Walk every filesystem tree (the default FS tree and all subvolume trees) and verify inode-level consistency.

Which trees are checked

From the tree_roots map (populated during filesystem open), the phase selects trees whose objectid is either:

  • BTRFS_FS_TREE_OBJECTID (5) – the default filesystem tree.
  • >= BTRFS_FIRST_FREE_OBJECTID (256) – subvolume and snapshot trees.

Item collection

For each FS tree, collect_fs_items walks all leaves and groups items by objectid (inode number). Each item is stored as a (KeyType, key_offset, raw_data_bytes) tuple. Items arrive in sorted key order due to the B-tree traversal, which means within an objectid group, items are sorted by (key_type, offset).

Per-inode checks

For each objectid group (inode), the following checks are performed:

INODE_ITEM presence

The checker notes whether the objectid has an INODE_ITEM. If directory entries reference an objectid that has no INODE_ITEM, the entry is an orphan.

Parsed from INODE_ITEM: nlink, size (isize), nbytes, and mode.

The actual reference count is computed by counting entries across all INODE_REF items (via InodeRef::parse_all) and INODE_EXTREF items (via InodeExtref::parse_all) for this objectid. If the computed count differs from inode_item.nlink and the inode has at least one reference, NlinkMismatch is reported.

The root directory inode (objectid 256, BTRFS_FIRST_FREE_OBJECTID) is excluded from this check because it has special nlink handling in btrfs.

File extent overlap detection

For regular files, all EXTENT_DATA items are processed to extract (file_offset, file_offset + length) ranges:

  • Regular extents: length = num_bytes from the FileExtentBody::Regular variant.
  • Inline extents: length = inline_size from the FileExtentBody::Inline variant.

Since items are in key order and EXTENT_DATA keys use the file offset, ranges are already sorted by start offset. Consecutive ranges are checked: if ranges[i].start < ranges[i-1].end, a FileExtentOverlap is reported.

Directory inode size (isize) check

For directory inodes (mode & S_IFMT == S_IFDIR), the expected inode size is computed by summing name_len * 2 for every DIR_INDEX entry belonging to this inode. The factor of 2 matches the btrfs convention where directory inode size counts each entry’s name length twice (once for DIR_ITEM, once for DIR_INDEX).

If the inode’s stored size field differs from this computed sum, DirSizeWrong is reported.
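The expected-size computation reduces to a one-line sum (the function name is illustrative):

```rust
// Sketch of the directory isize check: expected size is the sum of
// name_len * 2 over the directory's DIR_INDEX entries.
fn expected_dir_size(dir_index_name_lens: &[u64]) -> u64 {
    dir_index_name_lens.iter().map(|n| n * 2).sum()
}

fn main() {
    // Entries with name lengths 3, 4, 3: expected isize = (3 + 4 + 3) * 2 = 20.
    assert_eq!(expected_dir_size(&[3, 4, 3]), 20);
}
```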

File nbytes check

For regular files and symlinks (mode & S_IFMT == S_IFREG or S_IFLNK), the expected nbytes is computed from extent items:

  • Inline extents: nbytes += data_len (the inline payload size).
  • Regular extents: nbytes += disk_num_bytes, but only for non-prealloc extents. Prealloc extents (preallocated but unwritten) and hole extents (disk_bytenr == 0) do not contribute.

If the inode’s stored nbytes differs from the computed total, NbytesWrong is reported.

Orphan directory entries

When processing DIR_ITEM and DIR_INDEX items, for each entry whose location key type is INODE_ITEM and whose target objectid is >= BTRFS_FIRST_FREE_OBJECTID (256): if the target inode has no INODE_ITEM anywhere in this tree, DirItemOrphan is reported. Both DIR_ITEM and DIR_INDEX entries are checked, so an orphan reference in either will be caught.

Error variants produced

  • InodeMissing { tree, ino } – an objectid is referenced but has no INODE_ITEM. (Note: this is detected indirectly through DirItemOrphan in the current implementation.)
  • NlinkMismatch { tree, ino, expected, found } – the inode’s stored nlink differs from the number of INODE_REF + INODE_EXTREF entries.
  • FileExtentOverlap { tree, ino, offset } – two file extent items for the same inode overlap in file offset space.
  • DirItemOrphan { tree, parent_ino, name } – a directory entry references an inode that has no INODE_ITEM.
  • DirSizeWrong { tree, ino, expected, found } – a directory inode’s stored size does not match the computed sum of DIR_INDEX name lengths times 2.
  • NbytesWrong { tree, ino, expected, found } – a file inode’s stored nbytes does not match the computed sum from extent items.
  • ReadError { logical, detail } – I/O error reading the FS tree.

Phase 6: Checksums

Source: cli/src/check/csums.rs

Purpose: Walk the checksum tree and optionally verify data block checksums against the actual on-disk data.

Structure of the csum tree

The csum tree contains EXTENT_CSUM items (key type 128). Each item covers a contiguous range of data sectors:

  • Key objectid: BTRFS_EXTENT_CSUM_OBJECTID (fixed constant).
  • Key offset: the logical byte address of the first sector covered.
  • Item data: packed array of checksums, one per sector. With CRC32C (4 bytes per checksum) and 4K sectors, a single item can cover many sectors.

What it checks

Phase 6a: tree walk and byte counting

Always performed. The phase walks the csum tree and for each EXTENT_CSUM item, computes num_csums = item_data_len / csum_size and adds item_data_len to total_csum_bytes. This total is reported in the final summary.

Phase 6b: data verification (optional)

Only performed when --check-data-csum is passed. Only supported for CRC32C checksums; other checksum types emit a warning and skip verification.

For each csum item, the phase iterates over every sector:

  1. Compute the logical address: item.key.offset + i * sectorsize.
  2. Read sectorsize bytes from that logical address via reader.read_data.
  3. Compute btrfs_csum_data(&data) (standard CRC32C).
  4. Compare to the stored checksum (extracted from the item data at offset i * csum_size).
  5. If they differ, or if the read fails, report CsumMismatch.

The btrfs_csum_data function uses the standard CRC32C (Castagnoli polynomial) computation with seed 0xFFFFFFFF and a final XOR, matching the kernel’s checksum for tree blocks and data. This is distinct from the raw CRC32C used in send streams.
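For reference, those parameters correspond to the following bit-at-a-time sketch (real implementations use lookup tables or the SSE4.2 CRC32 instruction; 0x82F63B78 is the reflected Castagnoli polynomial):

```rust
// Minimal bit-at-a-time CRC32C (Castagnoli) sketch with the parameters
// described in the text: seed 0xFFFFFFFF, final XOR, reflected polynomial.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc ^ 0xFFFF_FFFF
}

fn main() {
    // Standard CRC32C check value for "123456789" (RFC 3720).
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);
}
```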

Error variants produced

  • CsumMismatch { logical } – the computed CRC32C of the data at the given logical address does not match the stored checksum, or the data could not be read.
  • ReadError { logical, detail } – I/O error reading the csum tree itself.

Phase 7: Root refs

Source: cli/src/check/root_refs.rs

Purpose: Verify that ROOT_REF and ROOT_BACKREF items in the root tree are consistent with each other.

Background

In btrfs, subvolume parent-child relationships are recorded in the root tree using two item types:

  • ROOT_REF (key type 156): stored with objectid = parent_root_id, offset = child_root_id. Contains the directory ID, sequence number, and name of the directory entry that references the child subvolume.

  • ROOT_BACKREF (key type 157): stored with objectid = child_root_id, offset = parent_root_id. Contains the same fields as the corresponding ROOT_REF.

These items form a bidirectional link. For every ROOT_REF there should be a matching ROOT_BACKREF, and vice versa. The fields (dirid, sequence, name) should be identical between the pair.

What it checks

The phase walks the root tree and collects all ROOT_REF and ROOT_BACKREF items into two maps, keyed by (child_root_id, parent_root_id). Both item types are parsed using RootRef::parse (the on-disk format is identical).

Then two passes are made:

Forward check: every ROOT_REF has a matching ROOT_BACKREF

For each (child, parent) pair in the forward refs map:

  • If no entry exists in the back refs map, report RootBackrefMissing.
  • If an entry exists, compare the three fields:
    • dirid: if they differ, report RootRefMismatch with “dirid mismatch”.
    • sequence: if they differ, report RootRefMismatch with “sequence mismatch”.
    • name: if they differ, report RootRefMismatch with “name mismatch”.

Each field is checked independently, so a single pair can produce up to 3 mismatch errors.

Reverse check: every ROOT_BACKREF has a matching ROOT_REF

For each (child, parent) pair in the back refs map:

  • If no entry exists in the forward refs map, report RootRefMissing.

Field comparison is not repeated in this direction because the forward check already caught any field mismatches for pairs that exist in both maps.
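The two passes can be sketched over maps keyed by (child, parent). This is a simplified illustration using only the name field; the actual implementation also compares dirid and sequence:

```rust
use std::collections::HashMap;

// Sketch of the bidirectional ROOT_REF / ROOT_BACKREF check: both maps are
// keyed by (child_root_id, parent_root_id); missing counterparts are
// reported per direction, field mismatches only in the forward pass.
type RefKey = (u64, u64); // (child_root_id, parent_root_id)

fn cross_check(
    forward: &HashMap<RefKey, String>,  // ROOT_REF name payloads
    backward: &HashMap<RefKey, String>, // ROOT_BACKREF name payloads
) -> Vec<String> {
    let mut errors = Vec::new();
    for (key, name) in forward {
        match backward.get(key) {
            None => errors.push(format!("backref missing for {:?}", key)),
            Some(back_name) if back_name != name => {
                errors.push(format!("name mismatch for {:?}", key))
            }
            _ => {}
        }
    }
    for key in backward.keys() {
        if !forward.contains_key(key) {
            errors.push(format!("ref missing for {:?}", key));
        }
    }
    errors
}

fn main() {
    let fwd = HashMap::from([((256u64, 5u64), "snap".to_string())]);
    let back = HashMap::from([
        ((256u64, 5u64), "snap".to_string()),
        ((257, 5), "lost".to_string()), // orphan backref
    ]);
    assert_eq!(cross_check(&fwd, &back).len(), 1);
}
```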

Error variants produced

  • RootRefMissing { child, parent } – a ROOT_BACKREF exists for this child/parent pair but no corresponding ROOT_REF was found.
  • RootBackrefMissing { child, parent } – a ROOT_REF exists for this child/parent pair but no corresponding ROOT_BACKREF was found.
  • RootRefMismatch { child, parent, detail } – both ROOT_REF and ROOT_BACKREF exist but one of their fields (dirid, sequence, or name) differs. The detail string describes which field mismatched and shows both values.
  • ReadError { logical, detail } – I/O error reading the root tree.

Complete error type reference

All error variants are defined in cli/src/check/errors.rs as the CheckError enum. Each variant implements Display for human-readable error messages.

Phase 1 errors

| Variant | Fields | Description |
|---|---|---|
| SuperblockInvalid | mirror: u32, detail: String | Superblock mirror failed validation (bad magic, bad checksum, or read error) |

Phase 2 errors

| Variant | Fields | Description |
|---|---|---|
| TreeBlockChecksumMismatch | tree: &'static str, logical: u64 | CRC32C checksum does not match |
| TreeBlockBadFsid | tree: &'static str, logical: u64 | Header fsid does not match filesystem |
| TreeBlockBadBytenr | tree: &'static str, logical: u64, header_bytenr: u64 | Header bytenr disagrees with read address |
| TreeBlockBadGeneration | tree: &'static str, logical: u64, block_gen: u64, super_gen: u64 | Block generation exceeds superblock generation |
| TreeBlockBadLevel | tree: &'static str, logical: u64, detail: String | Level/type mismatch (leaf with level>0 or node with level==0) |
| KeyOrderViolation | tree: &'static str, logical: u64, index: usize | Key at index is not strictly greater than previous key |

Phase 3 errors

| Variant | Fields | Description |
|---|---|---|
| ExtentRefMismatch | bytenr: u64, expected: u64, found: u64 | Declared refs != counted refs (inline + standalone) |
| MissingExtentItem | bytenr: u64 | Tree block has no extent/metadata item in extent tree |
| BackrefOwnerMismatch | bytenr: u64, actual_owner: u64, claimed_owners: Vec<u64> | Actual tree block owner not in extent tree’s backref list |
| BackrefOrphan | bytenr: u64, claimed_owner: u64 | Extent tree claims a backref but no tree block found |
| OverlappingExtent | bytenr: u64, length: u64, prev_end: u64 | Data extent overlaps with previous extent |

Phase 4 errors

| Variant | Fields | Description |
|---|---|---|
| ChunkMissingBlockGroup | logical: u64 | Chunk has no matching block group item |
| BlockGroupMissingChunk | logical: u64 | Block group has no matching chunk |
| DeviceExtentOverlap | devid: u64, offset: u64 | Two device extents overlap on the same device |

Phase 5 errors

| Variant | Fields | Description |
|---|---|---|
| `InodeMissing` | `tree: u64, ino: u64` | Inode referenced but has no INODE_ITEM |
| `NlinkMismatch` | `tree: u64, ino: u64, expected: u32, found: u32` | Stored nlink differs from counted references |
| `FileExtentOverlap` | `tree: u64, ino: u64, offset: u64` | File extent items overlap in file offset space |
| `DirItemOrphan` | `tree: u64, parent_ino: u64, name: String` | Dir entry references non-existent inode |
| `DirSizeWrong` | `tree: u64, ino: u64, expected: u64, found: u64` | Directory inode size does not match DIR_INDEX name sum |
| `NbytesWrong` | `tree: u64, ino: u64, expected: u64, found: u64` | File inode nbytes does not match extent sum |
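The NlinkMismatch comparison can be sketched as counting references per inode and comparing against the stored nlink. This is illustrative only; the real check counts references while walking the tree rather than from a flat list:

```rust
use std::collections::HashMap;

// Hedged sketch of the NlinkMismatch comparison; not the project's actual
// code. `stored` maps inode number -> nlink from its INODE_ITEM, `refs` is
// one entry per counted reference. Returns (ino, expected, found) tuples
// matching the variant's fields.
fn nlink_mismatches(stored: &HashMap<u64, u32>, refs: &[u64]) -> Vec<(u64, u32, u32)> {
    let mut counted: HashMap<u64, u32> = HashMap::new();
    for &ino in refs {
        *counted.entry(ino).or_insert(0) += 1;
    }
    let mut out = Vec::new();
    for (&ino, &expected) in stored {
        let found = *counted.get(&ino).unwrap_or(&0);
        if expected != found {
            out.push((ino, expected, found));
        }
    }
    out.sort();
    out
}

fn main() {
    let stored = HashMap::from([(257u64, 2u32), (258, 1)]);
    // Inode 257 claims nlink 2 but only one reference was counted.
    println!("{:?}", nlink_mismatches(&stored, &[257, 258]));
}
```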

Phase 6 errors

| Variant | Fields | Description |
|---|---|---|
| `CsumMismatch` | `logical: u64` | Data checksum does not match stored value |

Phase 7 errors

| Variant | Fields | Description |
|---|---|---|
| `RootRefMissing` | `child: u64, parent: u64` | ROOT_BACKREF exists but no matching ROOT_REF |
| `RootBackrefMissing` | `child: u64, parent: u64` | ROOT_REF exists but no matching ROOT_BACKREF |
| `RootRefMismatch` | `child: u64, parent: u64, detail: String` | ROOT_REF and ROOT_BACKREF fields disagree |

Cross-phase error

| Variant | Fields | Description |
|---|---|---|
| `ReadError` | `logical: u64, detail: String` | I/O error reading any tree block (used in phases 2–7) |

Summary output

After all phases complete, CheckResults::print_summary writes to stdout:

found <bytes_used> bytes used, <error_count> error(s) found
total csum bytes: <total_csum_bytes>
total tree bytes: <total_tree_bytes>
total fs tree bytes: <total_fs_tree_bytes>
total extent tree bytes: <total_extent_tree_bytes>
btree space waste bytes: <btree_space_waste>
file data blocks allocated: <data_bytes_allocated>
 referenced <data_bytes_referenced>

If error_count > 0, the process exits with code 1.
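A hypothetical sketch of a summary printer matching the template above. The struct and field names are assumptions taken from the placeholders, not the project's real CheckResults definition:

```rust
// Hypothetical sketch reproducing the summary template; field names are
// assumptions derived from the <...> placeholders, not the real struct.
struct CheckResults {
    bytes_used: u64,
    error_count: u64,
    total_csum_bytes: u64,
    total_tree_bytes: u64,
    total_fs_tree_bytes: u64,
    total_extent_tree_bytes: u64,
    btree_space_waste: u64,
    data_bytes_allocated: u64,
    data_bytes_referenced: u64,
}

impl CheckResults {
    fn print_summary(&self) {
        println!(
            "found {} bytes used, {} error(s) found",
            self.bytes_used, self.error_count
        );
        println!("total csum bytes: {}", self.total_csum_bytes);
        println!("total tree bytes: {}", self.total_tree_bytes);
        println!("total fs tree bytes: {}", self.total_fs_tree_bytes);
        println!("total extent tree bytes: {}", self.total_extent_tree_bytes);
        println!("btree space waste bytes: {}", self.btree_space_waste);
        println!("file data blocks allocated: {}", self.data_bytes_allocated);
        println!(" referenced {}", self.data_bytes_referenced);
    }

    // Any recorded error makes the process exit non-zero.
    fn exit_code(&self) -> i32 {
        if self.error_count > 0 { 1 } else { 0 }
    }
}

fn main() {
    let r = CheckResults {
        bytes_used: 262144, error_count: 0, total_csum_bytes: 0,
        total_tree_bytes: 147456, total_fs_tree_bytes: 16384,
        total_extent_tree_bytes: 16384, btree_space_waste: 120000,
        data_bytes_allocated: 0, data_bytes_referenced: 0,
    };
    r.print_summary();
    println!("exit code: {}", r.exit_code());
}
```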

Limitations and future work

The following checks from the C reference implementation are not yet implemented:

  • --mode lowmem differentiation (the current implementation uses the “original” mode approach of collecting all items and then cross-checking them).
  • Log tree checking (the log tree is not walked in phase 2).
  • --repair (all checking is read-only).
  • --backup / --tree-root / --chunk-root (alternate root selection).
  • --init-csum-tree / --init-extent-tree (destructive reconstruction).
  • --qgroup-report (quota group consistency checking).
  • --subvol-extents (per-subvolume extent sharing analysis).
  • Superblock generation cross-checking between mirror copies.
  • Block group used-bytes verification (comparing declared used in block group items against actual allocated extents).