Introduction
btrfsutils is a Rust implementation of the btrfs filesystem utilities. It
provides three command-line tools: btrfs, for managing and inspecting btrfs
filesystems; btrfs-mkfs, for creating new ones; and btrfs-tune, for offline
superblock tuning. All three aim to be drop-in replacements for the tools
provided by btrfs-progs.
Most commands are fully implemented and produce output matching the C reference implementation, and some offer additional features. The project is currently in beta (pre-1.0) and should not be used in production yet, although the commands that are implemented are thoroughly tested.
It also provides library crates that can be used to access kernel APIs to
manage btrfs filesystems, decode and write on-disk structures and decode and
handle the btrfs send format.
Source Code
The source is available on GitHub and GitLab.
Installation
While these tools are still in their beta (pre-1.0 release) phase, you can already install them and try them out. The recommended way to install them is with Cargo; there are no binary builds to download.
Cargo
If you have cargo installed, you can install the utilities with it.
cargo install btrfs-cli
cargo install btrfs-tune
cargo install btrfs-mkfs
Nix
If you use Nix with flakes enabled, you can run the tool directly without installing it:
nix run github:rustutils/btrfsutils -- filesystem show /mnt
Or install it into your profile:
nix profile install github:rustutils/btrfsutils
From source
See Building from Source for instructions on compiling btrfsutils yourself from the repository.
Requirements
btrfsutils runs on Linux. Most commands that interact with a mounted filesystem
require CAP_SYS_ADMIN (i.e. root, or a process with that capability granted).
The exceptions are btrfs inspect-internal dump-super and dump-tree, which
only require read access to the block device or image file.
Building from Source
Prerequisites
You need a Rust toolchain matching the version in rust-toolchain.toml — running
rustup toolchain install in the project directory will pick it up automatically.
You also need clang and libclang for bindgen, which generates Rust bindings
from the kernel UAPI headers at build time.
On Fedora/RHEL:
sudo dnf install clang
On Debian/Ubuntu:
sudo apt install clang libclang-dev
Building with Cargo
cargo build --release
The resulting binaries are target/release/btrfs, target/release/btrfs-mkfs,
and target/release/btrfs-tune.
Building with Nix
The project includes a Nix flake that provides a fully reproducible build with all dependencies pinned:
nix build
Outputs land in result/bin/btrfs, result/bin/btrfs-mkfs,
result/bin/btrfs-tune, and result/share/man/man1/.
To enter a development shell with all tools available (including nightly rustfmt, cargo-insta, and cargo-llvm-cov):
nix develop
Contributors who want to run the full lint sweep (just check) on a
non-Nix machine may also need a host-arch musl cross-compiler — see
the “Static checks” section of the
testing guide for setup instructions.
Concepts
This page defines the terms used throughout the btrfs documentation and command output.
Filesystem
A btrfs filesystem is a single logical storage pool. It has a UUID and an optional human-readable label, and it can span one or more physical block devices. All data and metadata stored in the filesystem is distributed across its devices according to the configured RAID profiles.
A filesystem is accessed by mounting it at a path. Most btrfs commands take that
mount point (or any path within it) as their argument.
Device
A device is a block device — a disk partition or a whole disk — that belongs to a filesystem. Every filesystem has at least one device. Additional devices can be added or removed while the filesystem is mounted, allowing online capacity changes.
Subvolume
A subvolume is an independently managed subtree within a filesystem. It looks like a directory, but it has its own inode namespace and can be snapshotted, sent, or deleted independently from the rest of the filesystem.
When you mount a btrfs filesystem, you are mounting one of its subvolumes (the
default subvolume, unless you specify otherwise). Other subvolumes appear as
directories within it but can also be mounted directly with the subvol= or
subvolid= mount options.
Snapshot
A snapshot is a copy-on-write copy of a subvolume taken at a point in time. It
initially shares all of its data with the source subvolume; pages diverge as
either copy is written. Snapshots can be read-write or read-only. Read-only
snapshots are required for btrfs send.
Chunk
btrfs divides storage into chunks — large, contiguous regions of logical address space (typically 256 MiB for metadata, 1 GiB for data). Each chunk is backed by one or more physical stripes on the underlying devices, according to the RAID profile in use. The mapping from logical addresses to physical device locations is stored in the chunk tree.
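As a rough illustration of what that mapping does, the sketch below resolves a logical address to a device offset for the simplest case of a single-stripe chunk. The `Chunk` struct, its field names, and `resolve` are hypothetical and do not correspond to types in btrfsutils; real chunks can have several stripes and RAID-specific layouts.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified chunk record: one stripe on one device.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Chunk {
    /// Size of the chunk in bytes.
    pub length: u64,
    /// Device holding the single stripe.
    pub devid: u64,
    /// Byte offset of the stripe on that device.
    pub physical_start: u64,
}

/// Resolve a logical address to (devid, physical offset), if it is mapped.
/// `chunks` is keyed by the logical start address of each chunk.
pub fn resolve(chunks: &BTreeMap<u64, Chunk>, logical: u64) -> Option<(u64, u64)> {
    // Find the last chunk whose logical start is <= the address.
    let (&start, chunk) = chunks.range(..=logical).next_back()?;
    let offset_in_chunk = logical - start;
    if offset_in_chunk >= chunk.length {
        return None; // the address falls in a hole between chunks
    }
    Some((chunk.devid, chunk.physical_start + offset_in_chunk))
}
```

An ordered map is used here because the lookup is a "greatest start not above the address" query, which mirrors how a sorted tree of chunk items would be searched.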
Extent
An extent is a contiguous run of bytes within a chunk. File data is stored in data extents; the B-trees that make up btrfs metadata are stored in metadata extents. btrfs uses copy-on-write: modifying data creates a new extent rather than overwriting the old one, which is what makes snapshots cheap.
Generation
Every committed transaction increments the filesystem’s generation number.
Subvolumes track the generation at which they were last modified (their
generation) and the generation at which they were originally created (their
ogeneration, or original generation). These are used by tools like
btrfs subvolume find-new to identify recently changed files, and by btrfs send
to select an appropriate incremental parent.
qgroup
A quota group (qgroup) tracks and optionally limits the amount of space used by a set of subvolumes. qgroups can be nested into a hierarchy, which allows shared space (space that would not be freed even if one subvolume were deleted) to be accounted at the group level. Quotas must be enabled on the filesystem before qgroups can be used.
Commands
btrfsutils implements the same command structure as the upstream btrfs tool.
Commands are organized into groups:
btrfs filesystem
Manage and inspect mounted filesystems.
| Command | Description |
|---|---|
| btrfs filesystem show [path] | Show filesystem info and devices |
| btrfs filesystem df <path> | Show space usage by chunk type |
| btrfs filesystem usage <path> | Detailed space usage with per-device breakdown |
| btrfs filesystem du <path> | Show disk usage including shared extents |
| btrfs filesystem sync <path> | Sync the filesystem |
| btrfs filesystem defrag <path> | Defragment a file or directory |
| btrfs filesystem resize <size> <path> | Resize a mounted filesystem |
| btrfs filesystem label <path> [label] | Get or set the filesystem label |
| btrfs filesystem mkswapfile <path> | Create a swapfile |
| btrfs filesystem commit-stats <path> | Show commit statistics |
btrfs subvolume
Create and manage subvolumes and snapshots.
| Command | Description |
|---|---|
| btrfs subvolume create <path> | Create a subvolume |
| btrfs subvolume delete <path> | Delete a subvolume |
| btrfs subvolume snapshot <src> <dst> | Create a snapshot |
| btrfs subvolume list <path> | List subvolumes |
| btrfs subvolume show <path> | Show subvolume details |
| btrfs subvolume get-default <path> | Show the default subvolume |
| btrfs subvolume set-default <id> <path> | Set the default subvolume |
| btrfs subvolume get-flags <path> | Show subvolume flags |
| btrfs subvolume set-flags <path> | Set subvolume flags |
| btrfs subvolume find-new <path> <gen> | Find files modified since a generation |
| btrfs subvolume sync <path> | Wait for deleted subvolumes to be cleaned up |
btrfs device
Manage devices in a multi-device filesystem.
| Command | Description |
|---|---|
| btrfs device add <dev> <path> | Add a device |
| btrfs device remove <dev> <path> | Remove a device |
| btrfs device stats <path> | Show per-device error statistics |
| btrfs device scan [dev] | Scan for btrfs devices |
| btrfs device ready <dev> | Check if a multi-device filesystem is ready |
| btrfs device usage <path> | Show per-device allocation details |
btrfs balance
Rebalance data and metadata across devices or profiles.
| Command | Description |
|---|---|
| btrfs balance start <path> | Start a balance |
| btrfs balance pause <path> | Pause a running balance |
| btrfs balance resume <path> | Resume a paused balance |
| btrfs balance cancel <path> | Cancel a running or paused balance |
| btrfs balance status <path> | Show balance status |
Balance filters (-d, -m, -s) accept filter strings such as
usage=50,profiles=raid1|single.
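To show how such a filter string breaks down, here is a minimal parsing sketch. It is not the parser used by btrfsutils: `BalanceFilter` and `parse_filter` are hypothetical names, and only the two keys from the example above are handled (real filters accept more keys and value forms).

```rust
/// Hypothetical parsed form of one balance filter string.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct BalanceFilter {
    pub usage: Option<u32>,
    pub profiles: Vec<String>,
}

/// Parse a comma-separated list of `key=value` filters.
/// Only `usage` and `profiles` are recognised in this sketch.
pub fn parse_filter(s: &str) -> Result<BalanceFilter, String> {
    let mut filter = BalanceFilter::default();
    for part in s.split(',').filter(|p| !p.is_empty()) {
        let (key, value) = part
            .split_once('=')
            .ok_or_else(|| format!("expected key=value, got '{}'", part))?;
        match key {
            "usage" => {
                let percent = value
                    .parse::<u32>()
                    .map_err(|e| format!("bad usage '{}': {}", value, e))?;
                filter.usage = Some(percent);
            }
            // Profiles are separated by '|' inside a single filter value.
            "profiles" => {
                filter.profiles = value.split('|').map(str::to_owned).collect();
            }
            other => return Err(format!("filter '{}' is not handled by this sketch", other)),
        }
    }
    Ok(filter)
}
```

Because the value contains a `|`, the filter string usually needs quoting when it is typed into a shell.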
btrfs scrub
Verify data and metadata checksums.
| Command | Description |
|---|---|
| btrfs scrub start <path> | Start a scrub |
| btrfs scrub cancel <path> | Cancel a running scrub |
| btrfs scrub resume <path> | Resume a cancelled scrub |
| btrfs scrub status <path> | Show scrub status |
| btrfs scrub limit <path> | Get or set scrub throughput limit |
btrfs replace
Replace a device in a filesystem.
| Command | Description |
|---|---|
| btrfs replace start <srcdev> <tgtdev> <path> | Start a device replacement |
| btrfs replace status <path> | Show replacement status |
| btrfs replace cancel <path> | Cancel a running replacement |
btrfs send / receive
Stream filesystem data between systems.
| Command | Description |
|---|---|
| btrfs send <subvol> | Send a subvolume as a stream |
| btrfs receive <path> | Receive a stream into a directory |
btrfs send supports full sends and incremental sends (-p parent, -c clone
sources). btrfs receive supports v1, v2 (compressed data), and v3 (fs-verity)
stream formats.
btrfs inspect-internal
Low-level inspection tools.
| Command | Description |
|---|---|
| btrfs inspect-internal rootid <path> | Show the subvolume ID for a path |
| btrfs inspect-internal inode-resolve <ino> <path> | Resolve an inode to paths |
| btrfs inspect-internal logical-resolve <addr> <path> | Resolve a logical address to paths |
| btrfs inspect-internal subvolid-resolve <id> <path> | Resolve a subvolume ID to a path |
| btrfs inspect-internal min-dev-size <path> | Show the minimum safe device size |
| btrfs inspect-internal list-chunks <path> | List all chunk allocations |
| btrfs inspect-internal dump-super <dev> | Dump the superblock |
| btrfs inspect-internal dump-tree <dev> | Dump raw B-tree contents |
| btrfs inspect-internal tree-stats <dev> | Walk a B-tree and report node/leaf statistics |
| btrfs inspect-internal map-swapfile <path> | Show physical extent map of a swapfile |
dump-super and dump-tree read directly from a block device or image file and
do not require a mounted filesystem or elevated privileges.
btrfs quota / qgroup
Manage filesystem quotas.
| Command | Description |
|---|---|
| btrfs quota enable <path> | Enable quotas |
| btrfs quota disable <path> | Disable quotas |
| btrfs quota rescan <path> | Rescan quota usage |
| btrfs quota status <path> | Show quota status |
| btrfs qgroup show <path> | Show qgroup usage |
| btrfs qgroup create <id> <path> | Create a qgroup |
| btrfs qgroup destroy <id> <path> | Destroy a qgroup |
| btrfs qgroup assign <src> <dst> <path> | Assign a qgroup to a parent |
| btrfs qgroup remove <src> <dst> <path> | Remove a qgroup assignment |
| btrfs qgroup limit <size> <id> <path> | Set a qgroup size limit |
| btrfs qgroup clear-stale <path> | Remove stale qgroups |
btrfs property
Get and set filesystem object properties.
| Command | Description |
|---|---|
| btrfs property get <path> [name] | Get a property |
| btrfs property set <path> <name> <value> | Set a property |
| btrfs property list <path> | List available properties |
Supported properties: ro (subvolumes), label (filesystem/device),
compression (inodes).
btrfs restore
Recover files from a damaged or unmounted filesystem by reading on-disk structures directly.
| Command | Description |
|---|---|
| btrfs restore <dev> <path> | Restore files to a destination directory |
| btrfs restore -l <dev> | List available tree roots |
Supports regular files, directories, symlinks (-S), extended attributes (-x),
metadata (owner/mode/times with -m), and compressed extents (zlib/zstd/lzo).
Use --path-regex to filter restored files and -s to include snapshots.
btrfs rescue
Emergency recovery tools for damaged filesystems.
| Command | Description |
|---|---|
| btrfs rescue super-recover <dev> | Restore superblock from mirrors |
| btrfs rescue zero-log <dev> | Clear the log tree pointer |
| btrfs rescue create-control-device | Create /dev/btrfs-control if missing |
| btrfs rescue fix-device-size <dev> | Re-align device and superblock sizes |
| btrfs rescue fix-data-checksum [--readonly\|--mirror 1] <dev> | Scan and (with --mirror 1) repair data csums |
| btrfs rescue clear-uuid-tree <dev> | Drop the UUID tree so the kernel rebuilds it |
| btrfs rescue clear-space-cache <v1\|v2> <dev> | Clear the v1 or v2 free space cache |
| btrfs rescue clear-ino-cache <dev> | Remove leftover items from the deprecated inode cache |
btrfs rescue chunk-recover has argument parsing scaffolded but is
not yet implemented.
btrfs-mkfs
Create a new btrfs filesystem on a block device or image file.
btrfs-mkfs [options] <device> [device...]
Supports single-device and multi-device filesystems with all RAID profiles
(SINGLE, DUP, RAID0, RAID1, RAID1C3, RAID1C4, RAID10, RAID5, RAID6), all
four checksum algorithms (crc32c, xxhash, sha256, blake2b), quota and simple
quota setup, custom nodesize/sectorsize, labels, UUIDs, feature flags, and
directory population via --rootdir.
btrfs-tune
Modify btrfs filesystem parameters on an unmounted device.
btrfs-tune [options] <device>
| Flag | Description |
|---|---|
| -r | Enable extended inode refs (extref) |
| -x | Enable skinny metadata extent refs |
| -n | Enable no-holes feature |
| -S 0 / -S 1 | Clear or set the seeding flag |
| -m | Change fsid to a random UUID (metadata_uuid mechanism) |
| -M <uuid> | Change fsid to a specific UUID (metadata_uuid mechanism) |
| -u | Rewrite fsid to a random UUID (patches all tree blocks) |
| -U <uuid> | Rewrite fsid to a specific UUID (patches all tree blocks) |
Global flags
These flags are accepted by all btrfs commands:
| Flag | Description |
|---|---|
| -v / --verbose | Increase verbosity (repeatable) |
| -q / --quiet | Suppress non-error output |
| -f / --format | Set the output format, one of: text, json, modern |
Output Format
Many commands accept --format json, which causes them to output
JSON-formatted data.
Differences from btrfs-progs
btrfsutils aims to be a drop-in replacement for btrfs-progs. Most commands produce identical output and accept the same flags. This page lists the known gaps and the features that go beyond what btrfs-progs offers.
What’s not yet supported
These features from btrfs-progs are not yet implemented:
- btrfs check --repair and related write-mode flags (--init-csum-tree, --init-extent-tree, etc.). Read-only checking works.
- btrfs check --mode lowmem (currently only the default mode is supported).
- btrfs rescue chunk-recover. Other write-mode rescue subcommands (fix-device-size, clear-space-cache, clear-uuid-tree, clear-ino-cache, fix-data-checksum) are implemented.
- btrfs filesystem resize --offline.
- btrfs-mkfs zoned device support.
- btrfs-tune --convert-to-free-space-tree and --convert-to-block-group-tree.
What’s added beyond btrfs-progs
These features are original additions not present in the C tools:
- --format modern (or BTRFS_OUTPUT_FORMAT=modern): opt-in improved output with adaptive column widths and tree views. Supported by most tabular commands, including device stats, device usage, subvolume list, inspect list-chunks, filesystem du/df/show/usage, qgroup show, quota status, scrub start/status.
- btrfs filesystem du --depth N: limit display depth while computing full totals.
- btrfs filesystem du --sort: sort entries by path, total, exclusive, or shared.
- btrfs inspect list-chunks --offline: read chunks directly from an unmounted device or image file without CAP_SYS_ADMIN.
- btrfs inspect min-dev-size --offline: compute minimum device size from an unmounted device or image file.
- btrfs device stats --offline: read device error statistics from the on-disk device tree without requiring a mounted filesystem.
Architecture
Crate structure
The project follows a strict layering: lower crates have no knowledge of the layers above them.
btrfs-uapi wraps kernel ioctls, sysfs reads, and procfs reads into safe
Rust APIs. It is Linux-only and the only crate that talks directly to the
kernel.
btrfs-disk parses on-disk structures — superblocks, B-tree nodes, item
payloads — from raw byte buffers. It is platform-independent and does not
depend on btrfs-uapi, so it can be used to inspect filesystem images on
any OS.
btrfs-stream parses the btrfs send stream wire format. The core parser is
platform-independent. The optional receive feature is Linux-only and
applies a parsed stream to a mounted filesystem via btrfs-uapi.
btrfs-mkfs implements the mkfs.btrfs tool. It constructs B-tree nodes as
raw byte buffers and writes them directly to a block device or image file
using pwrite. It does not use ioctls.
btrfs-tune implements the btrfstune tool. It modifies on-disk superblock
parameters (feature flags, seeding, filesystem UUIDs) on unmounted devices.
For lightweight UUID changes it only rewrites the superblock; for full fsid
rewrites it traverses every tree block on disk via btrfs-disk.
btrfs-cli implements the btrfs tool. It handles argument parsing via
clap, calls into btrfs-uapi and btrfs-disk as needed, and formats all
output. Optionally, this tool can also embed the btrfs-tune and
btrfs-mkfs tools as subcommands, for easier single-file deployment.
The two-layer model
Every feature that involves kernel communication is split across two layers.
The uapi/ layer provides a safe Rust function: it takes typed arguments,
calls the ioctl, and returns a typed result, with no unsafe in the public
API and no knowledge of CLI concerns. The cli/ layer provides a clap
subcommand that calls into uapi/ and formats the result for the user, with
no ioctl calls or raw kernel types.
This rule applies to all kernel interfaces — btrfs ioctls, standard VFS
ioctls like FS_IOC_FIEMAP, and block device ioctls like BLKGETSIZE64 all
live in uapi/, never in cli/.
The same principle applies to disk/: it parses raw bytes into typed
structs, and cli/ handles all display formatting. The disk/ crate never
calls println!.
How Commands Work
Every command in btrfsutils is implemented across two layers: a safe kernel
interface wrapper in btrfs-uapi, and a CLI command in btrfs-cli. This page
walks through a concrete example — btrfs filesystem label — to show how the
two layers fit together and why the split exists.
The uapi layer
The uapi layer lives in uapi/src/. Its job is to translate between Rust types
and the raw kernel interfaces — allocating ioctl argument buffers, calling the
ioctl, and converting the result into something the rest of the code can use
without touching any unsafe code or bindgen types.
For btrfs filesystem label, that looks like this (from uapi/src/filesystem.rs):
pub fn label_get(fd: BorrowedFd) -> nix::Result<CString> {
    let mut buf = [0i8; BTRFS_LABEL_SIZE as usize];
    unsafe { btrfs_ioc_get_fslabel(fd.as_raw_fd(), &mut buf) }?;
    let cstr = unsafe { CStr::from_ptr(buf.as_ptr()) };
    Ok(cstr.to_owned())
}

pub fn label_set(fd: BorrowedFd, label: &CStr) -> nix::Result<()> {
    let bytes = label.to_bytes();
    if bytes.len() >= BTRFS_LABEL_SIZE as usize {
        return Err(nix::errno::Errno::EINVAL);
    }
    let mut buf = [0i8; BTRFS_LABEL_SIZE as usize];
    for (i, &b) in bytes.iter().enumerate() {
        buf[i] = b as c_char;
    }
    unsafe { btrfs_ioc_set_fslabel(fd.as_raw_fd(), &buf) }?;
    Ok(())
}
The function signatures use BorrowedFd rather than a raw integer, CString
rather than a byte array, and nix::Result rather than checking errno manually.
The caller never sees btrfs_ioctl_* types. The unsafe is contained to the
ioctl call itself, with surrounding logic that is safe and testable.
The cli layer
The CLI layer lives in cli/src/. Its job is to parse arguments, call the uapi
function, and format the output. It never calls ioctls directly.
The same command in cli/src/filesystem/label.rs:
#[derive(Parser, Debug)]
pub struct FilesystemLabelCommand {
    /// The device or mount point to operate on
    pub path: PathBuf,
    /// The new label to set (if omitted, the current label is printed)
    pub new_label: Option<OsString>,
}

impl Runnable for FilesystemLabelCommand {
    fn run(&self, _format: Format, _dry_run: bool) -> Result<()> {
        let file = open_path(&self.path)?;
        match &self.new_label {
            None => {
                let label = label_get(file.as_fd())
                    .with_context(|| format!("failed to get label for '{}'", self.path.display()))?;
                println!("{}", label.to_bytes().escape_ascii());
            }
            Some(new_label) => {
                let cstring = CString::new(new_label.as_bytes())
                    .context("label must not contain null bytes")?;
                label_set(file.as_fd(), &cstring)
                    .with_context(|| format!("failed to set label for '{}'", self.path.display()))?;
            }
        }
        Ok(())
    }
}
The struct derives Parser from clap — the field doc comments become the help
text. Runnable::run handles the two cases (get and set) by opening the path,
calling the appropriate uapi function, and either printing the result or reporting
an error. Error messages include the path so the user knows which filesystem
failed.
Why the split
The separation keeps each layer focused and independently testable. The uapi layer can be tested with unit tests that mock the ioctl, or with integration tests that operate on a real filesystem, without any CLI machinery involved. The CLI layer can be tested with argument parsing snapshot tests (no filesystem needed at all) and help text snapshot tests.
It also keeps the library crates clean. Because btrfs-uapi, btrfs-disk, and
btrfs-stream contain no CLI logic and no GPL-derived code, they can be licensed
MIT/Apache-2.0 and used by other projects independently of the CLI tools.
Routing
Each top-level command group has a router in cli/src/ (e.g.
cli/src/filesystem.rs) that defines a FilesystemCommand enum with a variant
per subcommand. The Runnable implementation for the router matches on the
variant and delegates to the subcommand’s own run method. Adding a new
subcommand means adding a variant to the enum, a mod declaration, and a run
dispatch arm.
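The routing pattern can be sketched without clap as follows. This is a simplified illustration, not the project's code: the `Runnable` trait here is a stand-in that returns a `String` instead of printing, and `LabelCommand` and `SyncCommand` are hypothetical subcommand structs.

```rust
/// Stand-in for the CLI's `Runnable` trait, reduced to a single method
/// that returns its output so the dispatch is easy to test.
pub trait Runnable {
    fn run(&self) -> Result<String, String>;
}

/// Hypothetical subcommand structs; real ones would hold parsed arguments.
pub struct LabelCommand {
    pub path: String,
}

pub struct SyncCommand {
    pub path: String,
}

impl Runnable for LabelCommand {
    fn run(&self) -> Result<String, String> {
        Ok(format!("label {}", self.path))
    }
}

impl Runnable for SyncCommand {
    fn run(&self) -> Result<String, String> {
        Ok(format!("sync {}", self.path))
    }
}

/// The router: one variant per subcommand.
pub enum FilesystemCommand {
    Label(LabelCommand),
    Sync(SyncCommand),
}

impl Runnable for FilesystemCommand {
    fn run(&self) -> Result<String, String> {
        // Each arm only delegates; no command logic lives in the router.
        match self {
            FilesystemCommand::Label(cmd) => cmd.run(),
            FilesystemCommand::Sync(cmd) => cmd.run(),
        }
    }
}
```

Since the `match` is exhaustive, forgetting the dispatch arm for a newly added variant is a compile error rather than a silent omission.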
Kernel Interfaces
All kernel communication lives in btrfs-uapi. This page describes the patterns
used to wrap the three main kernel interface types: ioctls, sysfs, and tree search.
Binding ioctls
Raw bindgen output is in uapi::raw, generated from uapi/src/raw/btrfs.h and
btrfs_tree.h. Ioctl wrappers are declared in uapi/src/raw.rs using nix macros:
ioctl_write_ptr!(btrfs_ioc_resize, BTRFS_IOCTL_MAGIC, 3, btrfs_ioctl_vol_args);
ioctl_read!(btrfs_ioc_fs_info, BTRFS_IOCTL_MAGIC, 31, btrfs_ioctl_fs_info_args);
ioctl_readwrite!(btrfs_ioc_balance_v2, BTRFS_IOCTL_MAGIC, 32, btrfs_ioctl_balance_args);
ioctl_none!(btrfs_ioc_scrub_cancel, BTRFS_IOCTL_MAGIC, 28);
ioctl_write_int!(btrfs_ioc_balance_ctl, BTRFS_IOCTL_MAGIC, 33);
The macro to use is determined by the ioctl direction in the C header:
| C macro | nix macro | Direction |
|---|---|---|
| _IOW | ioctl_write_ptr! | userspace → kernel (pointer to struct) |
| _IOR | ioctl_read! | kernel → userspace |
| _IOWR | ioctl_readwrite! | both directions |
| _IO | ioctl_none! | no data |
| _IOW (integer) | ioctl_write_int! | value passed directly in arg slot |
Flexible array member ioctls
Some ioctls return variable-length arrays (e.g. btrfs_ioctl_space_args with a
trailing spaces[] field). The pattern is a two-phase call:
- Call with zero slots to get the count from the kernel.
- Allocate a Vec<u64> (for 8-byte alignment) sized to base_size + count * item_size.
- Cast the vec’s pointer to the struct type, set the slot count, call again.
- Read results via __IncompleteArrayField::as_slice(count).
See uapi/src/space.rs for a worked example.
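The allocation and cast steps can be sketched in isolation, without making any ioctl call. The two structs below are mocks written for this example (modelled loosely on a header followed by fixed-size items), and `words_needed` and `alloc_args` are hypothetical helper names rather than functions from btrfs-uapi.

```rust
use std::mem::{align_of, size_of};

/// Mock header standing in for a bindgen struct with a trailing flexible array.
#[repr(C)]
pub struct SpaceArgsHeader {
    pub space_slots: u64,
    pub total_spaces: u64,
}

/// Mock of one trailing array item.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct SpaceInfo {
    pub flags: u64,
    pub total_bytes: u64,
    pub used_bytes: u64,
}

/// Number of u64 words needed for the header plus `count` trailing items.
pub fn words_needed(count: usize) -> usize {
    let bytes = size_of::<SpaceArgsHeader>() + count * size_of::<SpaceInfo>();
    bytes.div_ceil(size_of::<u64>())
}

/// Allocate an 8-byte-aligned, zeroed buffer and record the slot count in
/// the header, ready to be passed to the second call.
pub fn alloc_args(count: usize) -> Vec<u64> {
    debug_assert!(align_of::<SpaceArgsHeader>() <= align_of::<u64>());
    let mut buf = vec![0u64; words_needed(count)];
    // SAFETY: the buffer is 8-byte aligned, zero-initialised, and at least
    // as large as the header, and all-zero is a valid header value.
    let header = unsafe { &mut *(buf.as_mut_ptr() as *mut SpaceArgsHeader) };
    header.space_slots = count as u64;
    buf
}
```

Backing the buffer with `Vec<u64>` rather than `Vec<u8>` is what guarantees the alignment the cast relies on.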
The btrfs_ioctl_vol_args_v2 union
Several subvolume and device ioctls share btrfs_ioctl_vol_args_v2. Bindgen
generates two anonymous union fields:
- __bindgen_anon_1: the {size, qgroup_inherit} / unused[4] union
- __bindgen_anon_2: the name[4040] / devid / subvolid union
// Set a name:
let name_buf: &mut [c_char] = unsafe { &mut args.__bindgen_anon_2.name };
// Set devid (no unsafe needed for plain integer writes):
args.flags = BTRFS_DEVICE_SPEC_BY_ID as u64;
args.__bindgen_anon_2.devid = devid;
Tree search (BTRFS_IOC_TREE_SEARCH)
The tree search ioctl is the primary way to read data from btrfs B-trees from
userspace. It is wrapped in uapi/src/tree_search.rs as a callback-based cursor:
tree_search(fd, SearchFilter::for_type(tree_id, item_type), |hdr, data| {
    // hdr: SearchHeader — objectid, offset, item_type, len (host byte order)
    // data: &[u8] — raw on-disk item payload (little-endian)
    Ok(())
})?;
Common SearchFilter constructors:
// All items of a specific type across all objectids:
SearchFilter::for_type(raw::BTRFS_CHUNK_TREE_OBJECTID as u64,
                       raw::BTRFS_CHUNK_ITEM_KEY as u32)
// Items of a specific type within an objectid range:
SearchFilter::for_objectid_range(tree_id, item_type, min_oid, max_oid)
For searches spanning multiple item types (e.g. the quota tree walk that reads
STATUS, INFO, LIMIT, and RELATION keys in one pass), construct SearchFilter
directly with start and end Key values spanning the desired type range.
Important: The start and end keys form compound bounds on the B-tree
key order (objectid, item_type, offset). They are not independent per-field
filters. Items with unexpected types can appear if their compound key falls
between start and end. Callbacks should filter on hdr.item_type when
they need a single type.
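The effect of compound bounds can be reproduced with plain Rust tuples, which also compare lexicographically. This is an illustrative model only: `Key` and `in_bounds` are names made up for the example, and the bounds are treated as inclusive.

```rust
/// (objectid, item_type, offset). Rust tuples compare field by field from
/// left to right, which mirrors the B-tree key order described above.
pub type Key = (u64, u32, u64);

/// True if `key` lies between `start` and `end` in compound key order.
pub fn in_bounds(key: Key, start: Key, end: Key) -> bool {
    start <= key && key <= end
}
```

With start = (256, 1, 0) and end = (300, 1, u64::MAX), the key (270, 84, 5) is in bounds even though its type is 84 rather than 1, because its objectid alone already places it between the two bounds.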
Bindgen type note
Tree objectid constants from btrfs_tree.h bind as u32 in Rust despite being
ULL in C (e.g. BTRFS_ROOT_TREE_OBJECTID: u32 = 1). Always cast at the use
site. BTRFS_LAST_FREE_OBJECTID binds as i32 = -256; cast to u64 gives
0xFFFFFFFF_FFFFFF00 as expected.
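A small sketch of the casts, using local stand-ins for the two bindgen constants (their types and values are taken from the note above; the function names are made up for the example):

```rust
// Stand-ins for the bindgen-generated constants mentioned above.
pub const BTRFS_ROOT_TREE_OBJECTID: u32 = 1;
pub const BTRFS_LAST_FREE_OBJECTID: i32 = -256;

/// Widening an unsigned constant is lossless.
pub fn root_tree_objectid() -> u64 {
    u64::from(BTRFS_ROOT_TREE_OBJECTID)
}

/// Casting i32 directly to u64 sign-extends, giving the intended value.
pub fn last_free_objectid() -> u64 {
    BTRFS_LAST_FREE_OBJECTID as u64
}

/// Pitfall: going through u32 first drops the sign extension.
pub fn last_free_objectid_via_u32() -> u64 {
    (BTRFS_LAST_FREE_OBJECTID as u32) as u64
}
```

The first two functions return 1 and 0xFFFFFFFF_FFFFFF00; the last returns 0x00000000_FFFFFF00, which is why the intermediate type of a cast chain matters for the signed constant.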
Cursor advancement
This is the most common source of bugs with tree search. The kernel interprets
(min_objectid, min_type, min_offset) as a compound tuple key, not three
independent range filters. After each batch, all three fields must be advanced
together past the last returned item:
- Normal case (offset does not overflow u64): set min_objectid = last.objectid, min_type = last.item_type, min_offset = last.offset + 1.
- Offset overflow: set min_offset = 0, keep min_objectid = last.objectid, set min_type = last.item_type + 1.
- Type also overflows u32: set min_offset = 0, min_type = 0, min_objectid = last.objectid + 1.
Advancing only min_offset while leaving min_objectid unchanged causes items
from lower objectids to match the new minimum on every subsequent batch, producing
an infinite loop.
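The three cases translate directly into checked arithmetic. The sketch below is a standalone illustration, not the cursor code from uapi/src/tree_search.rs; `MinKey` and `next_min_key` are hypothetical names.

```rust
/// The compound minimum key of a search, as (objectid, item_type, offset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MinKey {
    pub objectid: u64,
    pub item_type: u32,
    pub offset: u64,
}

/// Compute the smallest key strictly greater than `last`.
/// Returns `None` when the key space is exhausted.
pub fn next_min_key(last: MinKey) -> Option<MinKey> {
    // Normal case: bump the offset.
    if let Some(offset) = last.offset.checked_add(1) {
        return Some(MinKey { offset, ..last });
    }
    // Offset overflowed: reset it and bump the type.
    if let Some(item_type) = last.item_type.checked_add(1) {
        return Some(MinKey { item_type, offset: 0, ..last });
    }
    // Type overflowed as well: reset both and bump the objectid.
    last.objectid
        .checked_add(1)
        .map(|objectid| MinKey { objectid, item_type: 0, offset: 0 })
}
```

Returning an `Option` gives the search loop a natural termination condition once the last possible key has been seen.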
Sysfs
Some data is read from sysfs rather than ioctls — for example, scrub throughput
limits and quota state. The SysfsBtrfs type in uapi/src/sysfs.rs provides
typed access to /sys/fs/btrfs/<uuid>/. The filesystem UUID is obtained from
fs_info() (BTRFS_IOC_FS_INFO).
Send and Receive
btrfs send and btrfs receive transfer filesystem state between two btrfs
filesystems as a byte stream. This page explains how the mechanism works and how
to use the btrfs-stream and btrfs-uapi crates to implement receive in your
own application.
How send works
btrfs send asks the kernel to generate a stream representing the contents of a
read-only subvolume. The kernel traverses the subvolume’s B-trees and emits a
sequence of commands describing every file, directory, symlink, and extent. For
an incremental send (with -p <parent>), only the differences from the parent
subvolume are emitted.
The kernel is invoked via BTRFS_IOC_SEND, which writes the stream to a file
descriptor (typically the write end of a pipe). A reader thread on the other end
consumes the stream and writes it to a file or stdout.
The stream format
The stream is a binary format consisting of a header followed by a sequence of commands.
The stream header identifies the format version (v1, v2, or v3) and contains a
magic number (btrfs-stream\0). After the header, commands follow
back-to-back until an END command signals completion.
Each command has the following structure:
u32 length (length of the attribute data that follows, excluding this 10-byte header)
u16 command_type (BTRFS_SEND_C_* constant)
u32 crc32c (checksum of the whole command, header included, with the crc field zeroed)
attributes... (variable-length TLV list)
Attributes are TLV-encoded:
u16 attribute_type (BTRFS_SEND_A_* constant)
u16 length
data...
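For illustration, the two layouts above can be decoded with nothing but the standard library. This is a sketch and not the btrfs-stream parser: `CommandHeader`, `parse_header`, and `parse_attributes` are hypothetical names, all integers are assumed to be little-endian, and only the plain type/length/data attribute layout shown above is handled.

```rust
/// Decoded fixed-size command header (10 bytes on the wire).
#[derive(Debug, PartialEq, Eq)]
pub struct CommandHeader {
    pub length: u32,
    pub command_type: u16,
    pub crc32c: u32,
}

pub const HEADER_LEN: usize = 10;

/// Decode the header fields; returns `None` if fewer than 10 bytes are given.
pub fn parse_header(buf: &[u8]) -> Option<CommandHeader> {
    let b = buf.get(..HEADER_LEN)?;
    Some(CommandHeader {
        length: u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        command_type: u16::from_le_bytes([b[4], b[5]]),
        crc32c: u32::from_le_bytes([b[6], b[7], b[8], b[9]]),
    })
}

/// Split an attribute region into (type, data) pairs.
/// Returns `None` if an attribute is truncated.
pub fn parse_attributes(mut buf: &[u8]) -> Option<Vec<(u16, &[u8])>> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        let head = buf.get(..4)?;
        let attr_type = u16::from_le_bytes([head[0], head[1]]);
        let len = u16::from_le_bytes([head[2], head[3]]) as usize;
        let data = buf.get(4..4 + len)?;
        out.push((attr_type, data));
        buf = &buf[4 + len..];
    }
    Some(out)
}
```

Using `get` instead of direct indexing means a truncated stream surfaces as `None` rather than a panic.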
The CRC32C used by btrfs is the raw variant (initial seed 0, no final XOR),
not the standard ISO 3309 variant (initial seed 0xFFFFFFFF). When computing
or verifying a checksum, use:
let crc = !crc32c::crc32c_append(!0u32, data);
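To make the difference between the two variants concrete, here is a dependency-free bitwise sketch of the raw computation. `raw_crc32c` is a name chosen for this example; it is slow and meant only for illustration, and real code would use a table-driven or hardware-accelerated implementation.

```rust
/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78) with no
/// implicit seed inversion and no final XOR: the caller supplies the
/// starting state and receives the raw register value back.
pub fn raw_crc32c(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    crc
}
```

Calling `raw_crc32c(0, data)` gives the raw variant described above, while the standard variant is `!raw_crc32c(!0, data)`; for the ASCII string "123456789" the standard variant yields the well-known check value 0xE3069283.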
Parsing a stream with btrfs-stream
The btrfs-stream crate provides StreamReader, which parses commands one at a
time from any Read source:
use btrfs_stream::{StreamReader, StreamCommand};

let mut reader = StreamReader::new(input)?; // reads and validates the header
while let Some(command) = reader.read_command()? {
    match command {
        StreamCommand::Subvol { path, uuid, ctransid } => { /* create subvolume */ }
        StreamCommand::MkFile { path } => { /* create file */ }
        StreamCommand::Write { path, offset, data } => { /* write data */ }
        StreamCommand::Rename { path, path_to } => { /* rename */ }
        StreamCommand::End => break,
        // Placeholder arm standing in for the remaining command types
        // (22+ in total); a real receiver would handle each explicitly.
        _ => {}
    }
}
StreamReader::new reads the stream header and returns an error if the magic is
wrong or the version is unsupported. read_command returns None at EOF.
Applying a stream with btrfs-uapi
To implement receive, you need to apply each command to a mounted btrfs filesystem. The relevant operations are:
Subvolume and snapshot creation (BTRFS_IOC_SUBVOL_CREATE,
BTRFS_IOC_SNAP_CREATE_V2): for Subvol commands, create a new empty
subvolume. For Snapshot commands, look up the source subvolume by UUID using
subvolume_search_by_received_uuid or subvolume_search_by_uuid, then create a
writable snapshot.
File operations: standard POSIX calls — open/create, unlink, mkdir,
rmdir, symlink, link, rename. btrfs does not require any special ioctls
for these.
Write (BTRFS_IOC_ENCODED_WRITE or pwrite): v2 streams may send
pre-compressed data via ENCODED_WRITE. If the kernel supports it, this can be
passed directly; otherwise decompress and fall back to pwrite.
Clone (BTRFS_IOC_CLONE_RANGE): shares an extent between two files without
copying data. The source file is found by resolving its UUID via the UUID tree.
Subvolume finalization: once all commands for a subvolume have been processed,
call BTRFS_IOC_SET_RECEIVED_SUBVOL to record the UUID and ctransid, then set
the subvolume read-only with BTRFS_IOC_SUBVOL_SETFLAGS.
Using ReceiveContext
If you want a complete, ready-to-use receive implementation rather than building
your own, the receive feature of btrfs-stream provides ReceiveContext:
btrfs-stream = { version = "0.5", features = ["receive"] }
use btrfs_stream::ReceiveContext;

let mut ctx = ReceiveContext::new(destination_dir)?;
ctx.receive(input_stream)?;
ReceiveContext handles all command types including v2 encoded writes (with
decompression fallback for zlib, zstd, and lzo) and v3 fs-verity. It uses an fd
cache to avoid reopening the same file for sequential writes, which is important
for performance when receiving large files.
Parsing
The btrfs-disk crate parses btrfs on-disk structures from raw byte buffers.
It is platform-independent — it works on any OS and can be used to inspect
filesystem images without a running kernel.
Reading a filesystem
The typical entry point is filesystem_open, which bootstraps from the
superblock:
superblock → sys_chunk_array → chunk tree → root tree
The returned OpenFilesystem contains a BlockReader (for reading tree blocks
by logical address) and a map of tree root locations. From there, tree_walk
traverses any tree in BFS or DFS order, calling a visitor callback for each
block:
let open = filesystem_open(file)?;
let mut reader = open.reader;
tree_walk(&mut reader, root_bytenr, Traversal::Bfs, &mut |block| {
    // block: &TreeBlock, either a Node (internal) or a Leaf
    Ok(())
})?;
Item payloads
Leaf blocks contain items, each with a DiskKey (objectid, type, offset) and a
raw payload. parse_item_payload dispatches to a typed parser based on the key
type:
let payload = parse_item_payload(&key, data);
match payload {
    ItemPayload::InodeItem(inode) => { /* ... */ }
    ItemPayload::RootItem(root) => { /* ... */ }
    ItemPayload::FileExtentItem(extent) => { /* ... */ }
    // ...
}
Reading on-disk fields safely
On-disk structs are packed and little-endian. Casting a *const u8 pointer
directly to a packed struct is undefined behaviour due to potential misalignment.
btrfs-disk: bytes::Buf / bytes::BufMut
The disk crate uses the bytes crate for all parsing and serialization. A
&[u8] implements Buf, so you can read fields sequentially with methods like
get_u64_le(), which advances the cursor automatically:
use bytes::Buf;
// `data` is a &[u8]; each get_* call consumes bytes from the front of the slice.
let mut buf = data;
let generation = buf.get_u64_le();
let size = buf.get_u64_le();
let mode = buf.get_u32_le();
For serialization, BufMut provides the inverse (put_u64_le, put_slice,
etc.). This approach avoids manual offset arithmetic and makes it impossible to
read past the end of the buffer (it panics instead of silently producing
garbage).
btrfs-uapi: offset-based LE readers
The uapi crate parses tree search results returned by the kernel, which are
raw &[u8] buffers at known offsets. It uses explicit offset-based helpers from
uapi/src/util.rs:
use btrfs_uapi::util::read_le_u64;
use std::mem::offset_of;
let size = read_le_u64(data, offset_of!(raw::btrfs_inode_item, size));
Always use std::mem::offset_of! and std::mem::size_of to derive offsets and
sizes from the bindgen struct definitions — never hard-code numeric byte offsets.
The field_size!(T, field) macro (from crate::util) gives the size of an
individual field.
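As a rough illustration of the offset-based style, the sketch below reimplements readers with the same shape as read_le_u64 using only the standard library. DemoItem and the local read_le_u64 / read_le_u32 are stand-ins invented for this example; they are not the definitions in uapi/src/util.rs or the bindgen output.

```rust
use std::mem::{offset_of, size_of};

/// Hypothetical stand-in for a bindgen struct; NOT the real btrfs layout.
#[repr(C, packed)]
#[allow(dead_code)]
struct DemoItem {
    generation: u64,
    size: u64,
    mode: u32,
}

/// Read a little-endian u64 at `offset`; panics if the buffer is too short.
fn read_le_u64(data: &[u8], offset: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&data[offset..offset + 8]);
    u64::from_le_bytes(bytes)
}

/// Read a little-endian u32 at `offset`; panics if the buffer is too short.
fn read_le_u32(data: &[u8], offset: usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&data[offset..offset + 4]);
    u32::from_le_bytes(bytes)
}

/// Build a demo buffer with known values placed at the struct's field offsets.
fn demo_buffer() -> Vec<u8> {
    let mut buf = vec![0u8; size_of::<DemoItem>()];
    let size_at = offset_of!(DemoItem, size);
    let mode_at = offset_of!(DemoItem, mode);
    buf[size_at..size_at + 8].copy_from_slice(&4096u64.to_le_bytes());
    buf[mode_at..mode_at + 4].copy_from_slice(&0o100644u32.to_le_bytes());
    buf
}
```

Because the offsets come from offset_of!, the reader keeps working if a field is added or reordered in the struct definition, which is the point of the rule above.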
Superblock mirrors
btrfs writes up to three superblock copies at fixed offsets.
super_mirror_offset(n) returns the byte offset for mirror n (0, 1, or 2).
read_superblock reads and validates a superblock — checking the magic number
and CRC — from any seekable reader.
Display logic belongs in cli/
The disk/ crate only produces typed structs. All formatting and human-readable
output lives in cli/src/inspect/. The disk/ crate never calls println! or
constructs output strings.
Testing
The goal for this project is to maintain high test coverage, to make sure that these tools function correctly.
Running tests
Running the tests for this project is complicated by the fact that many btrfs operations talk directly to the kernel and require elevated privileges.
You can run all non-privileged tests with regular cargo test commands. This
will still build the privileged tests, but they are skipped.
cargo test
To run the privileged tests, there is a just target that builds them and
then runs only the test binaries (not cargo itself) under sudo.
This is the recommended way to run the full test suite on this project.
just test
You can build a coverage report (requires cargo-llvm-cov) of the full test
suite similarly, using the coverage target.
just coverage
# open target/coverage/llvm-cov/html/index.html
Static checks
Before committing, run just check. This wraps the formatter check
(nightly rustfmt), cargo deny, taplo for Cargo.toml formatting,
cargo doc (with -Dwarnings), cargo clippy --all-features,
per-libc cargo check for the host arch, the optional CLI features,
and cargo msrv verify against every publishable crate’s declared
rust-version.
The host-arch detection means just check works on x86_64 and
aarch64 alike. The musl half (<host>-unknown-linux-musl) needs a
matching C cross-compiler on PATH, since the zstd-sys and
lzo-sys build scripts compile C code:
- Nix devshell (nix develop) provides everything; you don’t need any of the steps below.
- Fedora aarch64: dnf install musl-gcc ships musl-gcc as a thin wrapper around the host gcc plus musl specs. cc-rs looks for the target-prefixed aarch64-linux-musl-gcc name, so symlink it once: sudo ln -s /usr/bin/musl-gcc /usr/local/bin/aarch64-linux-musl-gcc (or set CC_aarch64_unknown_linux_musl=musl-gcc and AR_aarch64_unknown_linux_musl=ar if you prefer to avoid touching /usr/local/bin.)
- Debian / Ubuntu: apt install musl-tools (host arch) or one of the gcc-<arch>-linux-musl-cross packages for cross builds; the same target-prefix handling applies if cc-rs doesn’t pick it up automatically.
If the cross C compiler isn’t on PATH, just check prints
skipping <triple> check: <prefix>-linux-musl-gcc not on PATH and
keeps going — only CI is expected to fail on a missing musl
toolchain.
Unit tests
Unit tests live as #[cfg(test)] mod tests blocks within the module they test.
They require no privileges and run with cargo test.
Coverage spans all pure logic across the crates: LE readers, struct size assertions, tree search cursor arithmetic, stream parsing (all 22 v1 command types, CRC validation), superblock parsing, B-tree node parsing, size/time formatting, argument parsing helpers, balance filter parsing, and property classification.
When adding a new feature, add unit tests for any logic that doesn’t require a real kernel or filesystem.
Integration tests
Integration tests live in uapi/tests/ and cli/tests/commands/ and are marked:
#[ignore = "requires elevated privileges"]
They are skipped by cargo test and run only via just test.
Fixture tests (commands/fixture.rs)
Read-only snapshot tests against a pre-built filesystem image
(cli/tests/commands/fixture.img.gz). The image has a fixed UUID, label, and
subvolume layout, so output is fully deterministic. These tests cover all
read-only commands: filesystem df/show/usage/label/du, subvolume list/show,
device stats/usage, all inspect-internal commands, and property get/list.
dump-tree and dump-super tests read the image file directly and do not
require mounting, so they run without elevated privileges even within the
privileged test suite.
Live tests (commands/live.rs)
Tests that create and mutate real btrfs filesystems on loopback devices. These cover all mutating commands: subvolume create/delete/snapshot, send/receive, scrub, balance, device add/remove, quota, qgroup, label set, resize, defrag, replace, and more.
Test helpers
cli/tests/common.rs provides RAII helpers that clean up automatically on drop:
BackingFile → LoopbackDevice → Mount
Convenience functions:
| Function | Description |
|---|---|
| single_mount() | 512 MiB single-device filesystem in a tempdir |
| deterministic_mount() | Same, with a fixed UUID and label |
| fixture_mount() | Mounts the pre-built fixture image read-only |
| write_test_data(path, n) | Write deterministic byte-pattern files |
| verify_test_data(path, n) | Verify previously written test data |
Snapshot testing with insta
CLI output tests use insta for snapshot testing. Snapshots
live in cli/tests/snapshots/ and are checked in to the repository.
Four snapshot categories:
| Pattern | Privileges | Description |
|---|---|---|
| arguments__*.snap | none | Argument parsing output |
| help__*.snap | none | Help text for every subcommand |
| commands__fixture__*.snap | root | Read-only CLI output (fixture image) |
| commands__live__*.snap | root | CLI output from live filesystem tests |
Snapshot workflow
# Run tests; fails if any snapshot has changed:
cargo test
# Run tests and collect pending snapshot changes:
cargo insta test
# Interactively review each changed snapshot:
cargo insta review
# Accept all pending changes at once:
cargo insta accept --all
After running privileged tests via just test, the Justfile fixes ownership of
any root-owned snapshot files and sets INSTA_WORKSPACE_ROOT so snapshots land
in the right directory.
Adding tests for a new subcommand
- Argument parsing: add cases to cli/tests/arguments.rs following the existing pattern.
- Help text: cli/tests/help.rs auto-discovers all subcommands by walking the clap tree, so no changes are needed.
- Read-only output: if the fixture image has suitable content, add snapshot tests to commands/fixture.rs.
- Mutating commands: add tests to commands/live.rs using the RAII helpers.
Use the snap!("description", output) macro for snapshot tests — the description
appears in the snapshot file header.
Conventions
The goal is to write idiomatic Rust code that is consistent across the whole codebase. btrfsutils spans several crates with different roles (kernel interface wrappers, on-disk parsers, CLI tools) and each has its own patterns. Following these conventions makes it easier to navigate unfamiliar code and to understand what a function or type is responsible for at a glance.
Where possible, lean on the Rust ecosystem rather than reinventing things:
uuid for UUIDs, bitflags for flag sets, nix for syscalls and ioctls,
anyhow for error context in the CLI. This keeps the code readable to anyone
already familiar with those crates.
Naming
Module names are usually generic nouns. For example, in the uapi crate,
the ioctl call wrappers are organized by the thing they operate on, and
live in modules like filesystem, device, sync.
For the btrfs-cli crate, the module naming structure matches the subcommand
hierarchy. Meaning: the btrfs subvolume create command is implemented in
cli/src/subvolume/create.rs.
Types are named with the general concept first: SysfsBtrfs,
BlockGroupFlags, BalanceArgs — never BtrfsSysfs.
Functions follow a noun_verb pattern: label_get, label_set — never
get_label. Ioctl wrapper functions match the lowercased C macro name:
btrfs_ioc_balance_v2.
Avoid abbreviations. For example, use ChecksumType instead of CsumType.
Types
Always prefer proper typed values. For example, use Uuid from the uuid
crate, never [u8; 16]. In the CLI, if there is an argument that can take
one of multiple options, don’t represent it as a string, but instead create
an enum and derive clap::ValueEnum.
Null-terminated kernel strings (labels, device paths) use CString/CStr.
Make sure that allocation and deallocation is handled properly.
File descriptors passed to uapi functions use BorrowedFd.
Kernel flag fields use bitflags!, usually with a Display implementation so
they can be formatted with {}.
Complex argument structs (BalanceArgs, DefragRangeArgs) use the builder
pattern with new(), chained setters, and Default.
Never expose bindgen types (btrfs_ioctl_*) in public uapi APIs, instead
create idiomatic Rust structs.
Error handling
In uapi/, almost every function just performs a single syscall, so we return
the raw nix::Result<T>. Where possible, list potential error codes and their
meanings in the documentation comments.
Map specific errnos to Option or a typed
error at the call site where appropriate (ENODEV → None, etc.).
In cli/, mkfs/, and tune/, use anyhow::Result<T> and convert at the
uapi boundary with .with_context(). Always include the relevant path or
resource in the error message.
Constants
All BTRFS_* constants are available via crate::raw::* in the uapi and
disk crates. Unless you have a good reason to, import from crate::raw and
don’t define local copies. Size constants like SZ_1M that are not part of the
btrfs UAPI headers are the exception; define those locally with a comment.
There should not be any stray constants in the code. For example, use the
std::mem::offset_of!() macro and the std::mem::size_of::<T>() function to
compute offsets and sizes, and if there are any magic constants, give them a name.
Don’t redefine things that are already defined in crate::raw::*.
Parsing on-disk structures
In disk/ and mkfs/, use bytes::Buf for reading and bytes::BufMut for
writing on-disk fields. Sequential get_u64_le() / put_u64_le() calls
advance the cursor automatically, eliminating manual offset arithmetic. See the
Parsing page for details.
In uapi/, tree search results are parsed with explicit offset-based LE
readers (read_le_u64, read_le_u32) from uapi/src/util.rs, since those
buffers are accessed at known offsets rather than sequentially.
Style
Keep unsafe blocks as small as possible; non-trivial ones get a // SAFETY:
comment. For packed structs, copy fields to locals before taking references to
avoid misaligned reference UB. Use escape_ascii() when printing byte strings
that may be non-UTF-8. Import symbols used more than once rather than
qualifying them at every call site (single-use qualified paths are fine).
Shared CLI helpers live in cli/src/util.rs. These include utilities to format
sizes, bytes, and times, and to parse various types.
Doc comments
In uapi/, module-level docs start with a # heading describing the module’s
purpose. Function docs explain what the function does and why; the ioctl name is
a parenthetical in the implementation, not the primary description.
In cli/, don’t put doc comments on subcommand enum variants — clap uses the
variant doc in preference to the struct doc, forcing duplication. Don’t use
Markdown in clap struct doc comments: wrap_help reflows all text and destroys
formatting. Use plain prose paragraphs instead.
Btrfs On-Disk Format Specification
This document describes the binary layout of btrfs on-disk structures as
understood from the parser in disk/src/ and the serializer in mkfs/src/.
All multi-byte integer fields are little-endian. All byte offsets in this
document are zero-based unless noted otherwise.
Kernel header names are referenced in parentheses where helpful (e.g.
btrfs_super_block, btrfs_header). The authoritative source is the Linux
kernel UAPI headers btrfs.h and btrfs_tree.h.
Conventions used in this document:
- “LE u64” means a 64-bit unsigned integer stored in little-endian byte order.
- Byte offsets are from the start of the enclosing structure.
- Field sizes are in bytes unless noted otherwise.
- “Logical address” refers to an address in btrfs’s virtual address space, which must be resolved to a physical device offset via the chunk tree.
- “Physical address” refers to a byte offset on a specific block device.
Overview
Btrfs is a copy-on-write (COW) B-tree filesystem. All persistent data is organized into B-trees, and all B-trees share a single logical address space that is mapped to physical device locations through a chunk/stripe layer.
Architecture: trees within trees
The fundamental architecture is “trees within trees”:
- The superblock (at fixed offsets on disk) bootstraps access to the chunk tree and root tree.
- The chunk tree maps logical addresses to physical device locations. A small subset of the chunk tree is embedded in the superblock to bootstrap access to the full tree.
- The root tree is the directory of all other trees: it contains a ROOT_ITEM for each tree, pointing to that tree’s root block.
- Content trees (FS tree, extent tree, checksum tree, etc.) store the actual filesystem data and metadata.
Copy-on-write semantics
Every modification creates new copies of affected blocks (COW), from the modified leaf up through the root of the tree. The final step atomically updates the superblock to point to the new root tree root. This ensures crash consistency without a journal: at any point, the last successfully written superblock points to a fully consistent tree hierarchy.
The COW property means that tree blocks are never modified in place. Instead:
- The leaf containing the modified item is written to a new location.
- The parent node’s key-pointer is updated to reference the new leaf, and the parent is written to a new location.
- This propagates up to the tree root.
- The root tree’s ROOT_ITEM is updated with the new root block address.
- The root tree itself is COWed up to its root.
- The superblock is written with the new root tree root address.
The generation counter is incremented with each transaction. All
blocks written in a transaction share the same generation number.
Shared format
All trees share the same block format (header + items or key-pointers)
and the same key structure (objectid, type, offset). The block size
(nodesize) is uniform across the filesystem, typically 16384 bytes.
The sectorsize (typically 4096 bytes) is the minimum I/O unit for data.
Multi-device support
Btrfs supports multiple devices in a single filesystem. The chunk tree maps logical addresses to physical offsets on specific devices. RAID profiles (SINGLE, DUP, RAID0, RAID1, RAID5, RAID6, RAID10, RAID1C3, RAID1C4) determine how chunks are distributed across devices.
Bootstrap sequence
Reading a btrfs filesystem from a raw device follows this sequence:
- Read the superblock at offset 64 KiB (try mirrors if primary fails).
- Parse sys_chunk_array from the superblock to seed the chunk cache with system chunk mappings.
- Resolve chunk_root through the chunk cache to a physical address.
- Read the chunk tree root block and all chunk items to populate the full chunk cache.
- Resolve root (root tree root) through the chunk cache.
- Read the root tree to discover all other trees via ROOT_ITEM entries.
- Access any tree by resolving its root block address through the chunk cache.
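The "resolve through the chunk cache" steps can be pictured with the following minimal sketch. It assumes each chunk is reduced to a single linear stripe, which only matches simple profiles; ChunkMapping and ChunkCache are names invented for this example and are not the types used by btrfs-disk.

```rust
use std::collections::BTreeMap;

/// One chunk mapping reduced to its first stripe (illustrative assumption).
struct ChunkMapping {
    length: u64,   // length of the chunk in the logical address space
    devid: u64,    // device holding the stripe
    physical: u64, // physical byte offset of the stripe on that device
}

/// Hypothetical chunk cache keyed by the chunk's logical start address.
struct ChunkCache {
    chunks: BTreeMap<u64, ChunkMapping>,
}

impl ChunkCache {
    /// Map a logical address to (devid, physical offset), or None if unmapped.
    fn resolve(&self, logical: u64) -> Option<(u64, u64)> {
        // Find the chunk with the greatest start address <= logical.
        let (&start, chunk) = self.chunks.range(..=logical).next_back()?;
        let delta = logical - start;
        if delta < chunk.length {
            Some((chunk.devid, chunk.physical + delta))
        } else {
            None
        }
    }
}
```

Seeding such a map from sys_chunk_array first and then extending it from the chunk tree mirrors the order of the steps above; striped profiles would need additional stripe arithmetic that this sketch leaves out.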
Superblock
The superblock (btrfs_super_block) is a 4096-byte structure stored at
fixed offsets on each device. It is the entry point for reading the
filesystem.
Mirror locations
Three copies (mirrors) of the superblock are maintained:
| Mirror | Offset | Decimal |
|---|---|---|
| 0 | 0x10000 | 65536 (64 KiB) |
| 1 | 0x4000000 | 67108864 (64 MiB) |
| 2 | 0x4000000000 | 274877906944 (256 GiB) |
Mirror 0 is always present. Mirrors 1 and 2 are written only if the device is large enough. The offsets are computed as:
mirror 0: 64 KiB
mirror i: 16 KiB << (12 * i) for i > 0
On read, all mirrors present on the device are checked and the one with the highest valid generation is used.
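The offset formula can be checked with a few lines of Rust. This is a local sketch rather than the crate's super_mirror_offset; the helper names are invented here, and the "fits on the device" test in mirrors_present is an assumption made for illustration.

```rust
/// Byte offset of superblock mirror `i` (0, 1, or 2), following the formula above.
fn mirror_offset(i: u32) -> u64 {
    const SZ_16K: u64 = 16 * 1024;
    const SZ_64K: u64 = 64 * 1024;
    const MIRROR_SHIFT: u32 = 12;
    if i == 0 {
        SZ_64K
    } else {
        SZ_16K << (MIRROR_SHIFT * i)
    }
}

/// Mirrors whose full 4096-byte superblock fits on a device of `device_size` bytes.
/// The exact size rule is an assumption of this sketch.
fn mirrors_present(device_size: u64) -> Vec<u32> {
    (0..3)
        .filter(|&i| mirror_offset(i) + 4096 <= device_size)
        .collect()
}
```

With these definitions a 1 GiB device carries mirrors 0 and 1 only, since mirror 2 sits at 256 GiB.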
Binary layout
| Field | Offset | Size | Notes |
|---|---|---|---|
| csum | 0 | 32 | Checksum of bytes 32..4095 |
| fsid | 32 | 16 | Filesystem UUID (shared across devices) |
| bytenr | 48 | 8 | Physical offset of this superblock copy |
| flags | 56 | 8 | BTRFS_SUPER_FLAG_* flags |
| magic | 64 | 8 | 0x4D5F53665248425F (_BHRfS_M LE) |
| generation | 72 | 8 | Transaction generation counter |
| root | 80 | 8 | Logical bytenr of root tree root |
| chunk_root | 88 | 8 | Logical bytenr of chunk tree root |
| log_root | 96 | 8 | Logical bytenr of log tree root (0 if none) |
| __unused_log_root_transid | 104 | 8 | Reserved, formerly log_root_transid |
| total_bytes | 112 | 8 | Total usable bytes across all devices |
| bytes_used | 120 | 8 | Total bytes used by data and metadata |
| root_dir_objectid | 128 | 8 | Objectid of root directory (always 6) |
| num_devices | 136 | 8 | Number of devices in this filesystem |
| sectorsize | 144 | 4 | Minimum I/O alignment (typically 4096) |
| nodesize | 148 | 4 | Tree block size in bytes (typically 16384) |
| __unused_leafsize | 152 | 4 | Legacy field, equal to nodesize |
| stripesize | 156 | 4 | Stripe size for RAID (typically 65536) |
| sys_chunk_array_size | 160 | 4 | Valid bytes in sys_chunk_array |
| chunk_root_generation | 164 | 8 | Generation of the chunk tree root |
| compat_flags | 172 | 8 | Compatible feature flags |
| compat_ro_flags | 180 | 8 | Compatible read-only feature flags |
| incompat_flags | 188 | 8 | Incompatible feature flags |
| csum_type | 196 | 2 | Checksum algorithm (0=CRC32C, 1=xxHash, 2=SHA256, 3=BLAKE2) |
| root_level | 198 | 1 | B-tree level of root tree root |
| chunk_root_level | 199 | 1 | B-tree level of chunk tree root |
| log_root_level | 200 | 1 | B-tree level of log tree root |
| dev_item | 201 | 98 | Embedded btrfs_dev_item for this device |
| label | 299 | 256 | Filesystem label (NUL-terminated, max 255 chars) |
| cache_generation | 555 | 8 | Generation of free space cache (v1) |
| uuid_tree_generation | 563 | 8 | Generation of UUID tree |
| metadata_uuid | 571 | 16 | Metadata UUID (when METADATA_UUID incompat set) |
| nr_global_roots | 587 | 8 | Number of global roots (extent-tree-v2) |
| (reserved fields) | 595 | … | Zero-filled up to sys_chunk_array |
| sys_chunk_array | 811 | 2048 | Bootstrap chunk items |
| super_roots[4] | 2859 | 672 | Four rotating backup root entries (168 bytes each) |
| (padding) | 3531 | 565 | Zero-filled to 4096 bytes |
Total: 4096 bytes (BTRFS_SUPER_INFO_SIZE).
System chunk array bootstrap
The sys_chunk_array field embeds a subset of the chunk tree sufficient
to locate the full chunk tree on disk. It contains a sequence of
(disk_key, chunk_item) pairs:
For each entry:
17 bytes btrfs_disk_key (objectid, type, offset) -- offset = logical addr
variable btrfs_chunk Chunk item (see Section 8.9)
The array is parsed sequentially until sys_chunk_array_size bytes are
consumed. These entries typically contain the SYSTEM chunk(s) that map
the chunk tree and root tree blocks.
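A sequential walk over the array might look like the sketch below. The chunk layout constants (48 fixed bytes, num_stripes at offset 44, 32 bytes per stripe) are assumptions taken from the kernel's btrfs_chunk and btrfs_stripe definitions rather than from this section, and parse_sys_chunk_array is a name invented for this example.

```rust
const DISK_KEY_SIZE: usize = 17;
const CHUNK_FIXED_SIZE: usize = 48; // assumed: btrfs_chunk fields before the stripe array
const NUM_STRIPES_OFFSET: usize = 44; // assumed: offset of num_stripes inside btrfs_chunk
const STRIPE_SIZE: usize = 32; // assumed: devid + offset + dev_uuid

fn le_u64(buf: &[u8], offset: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[offset..offset + 8]);
    u64::from_le_bytes(bytes)
}

/// Returns (logical address, chunk length, stripe count) per entry,
/// or None if an entry runs past the `valid` bytes of the array.
fn parse_sys_chunk_array(array: &[u8], valid: usize) -> Option<Vec<(u64, u64, u16)>> {
    let data = array.get(..valid)?;
    let mut entries = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        // Key plus the fixed part of the chunk item must be in bounds.
        let fixed = data.get(pos..pos + DISK_KEY_SIZE + CHUNK_FIXED_SIZE)?;
        let logical = le_u64(fixed, 9); // key.offset = logical address
        let length = le_u64(fixed, DISK_KEY_SIZE); // first chunk field
        let at = DISK_KEY_SIZE + NUM_STRIPES_OFFSET;
        let num_stripes = u16::from_le_bytes([fixed[at], fixed[at + 1]]);
        let entry_size = fixed.len() + num_stripes as usize * STRIPE_SIZE;
        if pos + entry_size > data.len() {
            return None; // stripe array is truncated
        }
        entries.push((logical, length, num_stripes));
        pos += entry_size;
    }
    Some(entries)
}
```

The entry size depends on num_stripes, which is why the array cannot be indexed directly and has to be consumed front to back.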
Backup roots
The super_roots array contains four rotating backup copies of critical
tree root pointers. The kernel updates one entry per transaction, cycling
through indices 0-3. Each backup root entry (btrfs_root_backup) is
168 bytes:
| Field | Offset | Size | Notes |
|---|---|---|---|
| tree_root | 0 | 8 | Logical bytenr of root tree root |
| tree_root_gen | 8 | 8 | Generation of root tree root |
| chunk_root | 16 | 8 | Logical bytenr of chunk tree root |
| chunk_root_gen | 24 | 8 | Generation of chunk tree root |
| extent_root | 32 | 8 | Logical bytenr of extent tree root |
| extent_root_gen | 40 | 8 | Generation of extent tree root |
| fs_root | 48 | 8 | Logical bytenr of FS tree root |
| fs_root_gen | 56 | 8 | Generation of FS tree root |
| dev_root | 64 | 8 | Logical bytenr of device tree root |
| dev_root_gen | 72 | 8 | Generation of device tree root |
| csum_root | 80 | 8 | Logical bytenr of checksum tree root |
| csum_root_gen | 88 | 8 | Generation of checksum tree root |
| total_bytes | 96 | 8 | Total filesystem bytes at backup time |
| bytes_used | 104 | 8 | Bytes used at backup time |
| num_devices | 112 | 8 | Number of devices at backup time |
| (reserved) | 120 | 32 | Unused u64[4] |
| tree_root_level | 152 | 1 | B-tree level of root tree root |
| chunk_root_level | 153 | 1 | B-tree level of chunk tree root |
| extent_root_level | 154 | 1 | B-tree level of extent tree root |
| fs_root_level | 155 | 1 | B-tree level of FS tree root |
| dev_root_level | 156 | 1 | B-tree level of device tree root |
| csum_root_level | 157 | 1 | B-tree level of checksum tree root |
| (padding) | 158 | 10 | Unused bytes to 168 total |
Superblock checksum
The checksum field (csum, bytes 0..31) covers everything from byte 32
through byte 4095 (inclusive). For CRC32C, the 4-byte result is stored
little-endian at bytes 0..3 and bytes 4..31 are zeroed.
The magic number _BHRfS_M (hex 0x4D5F53665248425F) must be present
at offset 64 for a valid superblock.
Superblock validity is determined by checking both magic and checksum
match. When multiple valid mirrors exist, the one with the highest
generation is used.
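The validity rule can be sketched as follows. The example assumes csum_type 0 (CRC32C) and the standard CRC-32C convention of an all-ones initial value with a final inversion; valid_generation and best_mirror are hypothetical names, not the read_superblock API.

```rust
const SUPER_INFO_SIZE: usize = 4096;
const MAGIC: u64 = 0x4D5F_5366_5248_425F; // "_BHRfS_M" read as an LE u64

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

fn le_u64(buf: &[u8], offset: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[offset..offset + 8]);
    u64::from_le_bytes(bytes)
}

/// Returns the generation if both the magic and the CRC32C checksum match.
fn valid_generation(sb: &[u8]) -> Option<u64> {
    if sb.len() != SUPER_INFO_SIZE || le_u64(sb, 64) != MAGIC {
        return None;
    }
    let stored = u32::from_le_bytes([sb[0], sb[1], sb[2], sb[3]]);
    if stored != crc32c(&sb[32..]) {
        return None; // checksum covers bytes 32..4095
    }
    Some(le_u64(sb, 72))
}

/// Pick the valid mirror with the highest generation, if any.
fn best_mirror<'a>(mirrors: &[&'a [u8]]) -> Option<&'a [u8]> {
    mirrors
        .iter()
        .copied()
        .filter_map(|sb| valid_generation(sb).map(|generation| (generation, sb)))
        .max_by_key(|&(generation, _)| generation)
        .map(|(_, sb)| sb)
}
```

A production implementation would dispatch on csum_type and most likely use a table-driven or hardware CRC; the bitwise loop is used here only to keep the sketch dependency-free.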
Tree Block Format
Every B-tree block (node or leaf) is exactly nodesize bytes. The block
begins with a 101-byte header (btrfs_header), followed by either item
descriptors (leaves) or key-pointer entries (nodes).
Header
| Field | Offset | Size | Notes |
|---|---|---|---|
| csum | 0 | 32 | Checksum of bytes 32..nodesize-1 |
| fsid | 32 | 16 | Filesystem UUID (must match superblock) |
| bytenr | 48 | 8 | Logical byte offset of this block |
| flags | 56 | 8 | Header flags (lower 56 bits) + backref rev (upper 8 bits) |
| chunk_tree_uuid | 64 | 16 | UUID of the chunk tree mapping this block |
| generation | 80 | 8 | Transaction generation when last written |
| owner | 88 | 8 | Objectid of the tree owning this block |
| nritems | 96 | 4 | Number of items (leaf) or key-pointers (node) |
| level | 100 | 1 | 0 = leaf, >0 = internal node |
Total header size: 101 bytes.
The flags field combines two values:
- Bits 0-55: block flags (BTRFS_HEADER_FLAG_WRITTEN = 1, BTRFS_HEADER_FLAG_RELOC = 2)
- Bits 56-63: backref revision (BTRFS_MIXED_BACKREF_REV = 1 for modern filesystems)
The header checksum covers bytes 32 through nodesize - 1. For CRC32C,
the result is stored as a 4-byte LE value at bytes 0..3 with bytes 4..31
zeroed.
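The split of the header flags field into block flags and backref revision amounts to a mask and a shift. The helper names below are invented for this sketch; the constant values are the ones listed above.

```rust
const FLAG_BITS: u32 = 56;
const FLAG_MASK: u64 = (1u64 << FLAG_BITS) - 1;
const HEADER_FLAG_WRITTEN: u64 = 1;
const HEADER_FLAG_RELOC: u64 = 2;
const MIXED_BACKREF_REV: u8 = 1;

/// Split the raw 64-bit field into (block flags, backref revision).
fn split_header_flags(raw: u64) -> (u64, u8) {
    (raw & FLAG_MASK, (raw >> FLAG_BITS) as u8)
}

/// Combine block flags and a backref revision into the raw on-disk value.
fn join_header_flags(flags: u64, backref_rev: u8) -> u64 {
    (flags & FLAG_MASK) | ((backref_rev as u64) << FLAG_BITS)
}
```

A written block on a modern filesystem therefore carries the raw value 0x0100000000000001 under these definitions.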
Leaf vs node distinction
The level field determines the block type:
- level == 0: leaf block, containing items
- level > 0: internal node, containing key-pointers to child blocks
The maximum tree depth is bounded by the number of key-pointers that fit in a node. For a 16 KiB nodesize, a node holds up to:
max_ptrs = (nodesize - HEADER_SIZE) / KEY_PTR_SIZE
= (16384 - 101) / 33
= 493 key-pointers
With 493 children per node, a tree of depth 2 (root node + leaf) can
hold 493 * 651 = ~320,000 items. A tree of depth 3 can hold
493^2 * 651 = ~158 million items. In practice, trees rarely exceed
depth 3 or 4.
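The capacity figures above can be reproduced with integer arithmetic. The function names are made up for this sketch, and the item bound assumes zero-length item data, so real trees hold fewer items.

```rust
const HEADER_SIZE: u64 = 101;
const KEY_PTR_SIZE: u64 = 33;
const ITEM_SIZE: u64 = 25;

/// Key-pointers that fit in one internal node.
fn max_key_ptrs(nodesize: u64) -> u64 {
    (nodesize - HEADER_SIZE) / KEY_PTR_SIZE
}

/// Item descriptors that fit in one leaf when every item has zero-length data.
fn max_empty_items(nodesize: u64) -> u64 {
    (nodesize - HEADER_SIZE) / ITEM_SIZE
}

/// Upper bound on items for a tree with `depth` levels (depth 1 = a single leaf).
/// `depth` must be at least 1.
fn max_items_for_depth(nodesize: u64, depth: u32) -> u64 {
    max_key_ptrs(nodesize).pow(depth - 1) * max_empty_items(nodesize)
}
```

For nodesize 16384 this gives 493 key-pointers per node, 320,943 items at depth 2, and 158,224,899 items at depth 3.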
Leaf Format
A leaf block (level 0) contains sorted item descriptors followed by a data area. Item descriptors grow forward from the header; item data grows backward from the end of the block.
+-------------------------------------------+
| Header (101 bytes) |
+-------------------------------------------+
| Item descriptor 0 (25 bytes) |
| Item descriptor 1 (25 bytes) |
| ... |
| Item descriptor N-1 (25 bytes) |
+-------------------------------------------+
| (free space) |
+-------------------------------------------+
| Item data N-1 |
| ... |
| Item data 1 |
| Item data 0 |
+-------------------------------------------+
Item descriptor
Each item descriptor (btrfs_item) is 25 bytes:
| Field | Offset | Size | Notes |
|---|---|---|---|
| objectid | 0 | 8 | Key objectid (LE u64) |
| type | 8 | 1 | Key type byte (u8) |
| offset | 9 | 8 | Key offset (LE u64) |
| data_offset | 17 | 4 | Byte offset of item data from end of header (LE u32) |
| data_size | 21 | 4 | Size of item data in bytes (LE u32) |
The first 17 bytes form a btrfs_disk_key. The data_offset field is
relative to the start of the leaf data area, which begins immediately
after the header. To locate item data in the raw block buffer:
absolute_offset = HEADER_SIZE + data_offset
where HEADER_SIZE = 101 bytes.
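Reading one descriptor and locating its payload could look like the sketch below. ItemDescriptor, read_descriptor, and item_data are illustrative names and not the types exposed by btrfs-disk.

```rust
const HEADER_SIZE: usize = 101;
const ITEM_SIZE: usize = 25;

#[derive(Debug, PartialEq)]
struct ItemDescriptor {
    objectid: u64,
    item_type: u8,
    offset: u64,
    data_offset: u32,
    data_size: u32,
}

fn le_u64(buf: &[u8], at: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(bytes)
}

fn le_u32(buf: &[u8], at: usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(bytes)
}

/// Parse descriptor `index` of a raw leaf block, or None if it is out of bounds.
fn read_descriptor(block: &[u8], index: usize) -> Option<ItemDescriptor> {
    let start = HEADER_SIZE + index * ITEM_SIZE;
    let d = block.get(start..start + ITEM_SIZE)?;
    Some(ItemDescriptor {
        objectid: le_u64(d, 0),
        item_type: d[8],
        offset: le_u64(d, 9),
        data_offset: le_u32(d, 17),
        data_size: le_u32(d, 21),
    })
}

/// Slice of the block holding the item's payload: HEADER_SIZE + data_offset.
fn item_data<'a>(block: &'a [u8], desc: &ItemDescriptor) -> Option<&'a [u8]> {
    let start = HEADER_SIZE + desc.data_offset as usize;
    block.get(start..start + desc.data_size as usize)
}
```

Both functions return None instead of panicking when a descriptor points outside the block, which is useful when inspecting damaged images.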
Data area layout
Item data is packed from the end of the block backward. The first item pushed has its data at the highest offset; subsequent items have data at progressively lower offsets. This means:
- Item descriptors grow forward: HEADER_SIZE + i * 25
- Item data grows backward: starting from nodesize and moving toward the descriptor area
The free space in a leaf is the gap between the end of the last descriptor and the start of the earliest (lowest-offset) item data.
Offset bookkeeping
When building a leaf (as the mkfs LeafBuilder does), the bookkeeping
works as follows:
Initial state:
item_offset = HEADER_SIZE (101) // next descriptor position
data_end = nodesize (16384) // next data write position
For each item pushed (key, data[N bytes]):
1. data_end -= N // reserve space for item data
2. Write data at buf[data_end .. data_end + N]
3. data_offset = data_end - HEADER_SIZE // relative to header end
4. Write descriptor at buf[item_offset]:
key (17 bytes) + data_offset (LE u32) + data_size (LE u32)
5. item_offset += 25 // advance to next descriptor slot
The available space for additional items is:
space_left = data_end - (item_offset + ITEM_SIZE)
This must accommodate both the 25-byte descriptor and the item data.
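The bookkeeping steps translate into a small builder. LeafSketch is a simplified stand-in written for this example; it is not the mkfs LeafBuilder, it leaves the header bytes zeroed, and its fit check (descriptor plus data must fit between item_offset and data_end) is one plausible reading of the rule above.

```rust
const HEADER_SIZE: usize = 101;
const ITEM_SIZE: usize = 25;

struct LeafSketch {
    buf: Vec<u8>,
    item_offset: usize, // next descriptor position
    data_end: usize,    // start of the lowest item data written so far
}

impl LeafSketch {
    fn new(nodesize: usize) -> Self {
        LeafSketch { buf: vec![0; nodesize], item_offset: HEADER_SIZE, data_end: nodesize }
    }

    /// Append one item. Returns false and leaves the leaf unchanged if the
    /// descriptor and data do not both fit in the remaining gap.
    fn push(&mut self, key: &[u8; 17], data: &[u8]) -> bool {
        if self.item_offset + ITEM_SIZE + data.len() > self.data_end {
            return false;
        }
        // Steps 1-2: reserve space at the end and copy the payload.
        self.data_end -= data.len();
        let data_start = self.data_end;
        self.buf[data_start..data_start + data.len()].copy_from_slice(data);
        // Step 3: offset is stored relative to the end of the header.
        let data_offset = (data_start - HEADER_SIZE) as u32;
        // Step 4: write key + data_offset + data_size.
        let start = self.item_offset;
        let d = &mut self.buf[start..start + ITEM_SIZE];
        d[..17].copy_from_slice(key);
        d[17..21].copy_from_slice(&data_offset.to_le_bytes());
        d[21..25].copy_from_slice(&(data.len() as u32).to_le_bytes());
        // Step 5: advance to the next descriptor slot.
        self.item_offset += ITEM_SIZE;
        true
    }
}
```

Callers are expected to push items in key order; the sketch does not verify the ordering invariant.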
Key ordering invariant
Items within a leaf are sorted by their keys in lexicographic order:
first by objectid, then by type, then by offset. This invariant
is maintained by the B-tree insertion logic and verified by btrfs check.
Capacity
For a 16384-byte leaf, the maximum number of items depends on their data
sizes. With zero-length data items (such as TREE_BLOCK_REF or
FREE_SPACE_EXTENT), the theoretical maximum is:
max_items = (nodesize - HEADER_SIZE) / ITEM_SIZE
= (16384 - 101) / 25
= 651 items
In practice, most items have data payloads that reduce this number significantly.
Node Format
An internal node (level > 0) contains sorted key-pointer entries
(btrfs_key_ptr). Each entry points to a child block and records the
lowest key in that child’s subtree.
Key-pointer entry
Each key-pointer (btrfs_key_ptr) is 33 bytes:
| Field | Offset | Size | Notes |
|---|---|---|---|
| objectid | 0 | 8 | Key objectid (LE u64) |
| type | 8 | 1 | Key type byte (u8) |
| offset | 9 | 8 | Key offset (LE u64) |
| blockptr | 17 | 8 | Logical byte address of child block (LE u64) |
| generation | 25 | 8 | Generation of the child block (LE u64) |
The first 17 bytes form the btrfs_disk_key representing the lowest key
in the child subtree. The generation field is used for consistency
checks: when reading the child block, its header generation must match
this value.
Layout
+-------------------------------------------+
| Header (101 bytes) |
+-------------------------------------------+
| Key-pointer 0 (33 bytes) |
| Key-pointer 1 (33 bytes) |
| ... |
| Key-pointer N-1 (33 bytes) |
+-------------------------------------------+
| (unused space to nodesize) |
+-------------------------------------------+
Key-pointers are sorted by their key in the same lexicographic order as
leaf items. The child block referenced by key-pointer i contains all
items with keys >= key-pointer[i].key and < key-pointer[i+1].key (or
unbounded above for the last pointer).
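Given that range rule, choosing the child to descend into is a binary search for the last key-pointer whose key is not greater than the target. The sketch models keys as tuples and uses an invented child_index helper; how a real search handles targets below the first key is left out.

```rust
/// Key modelled as (objectid, type, offset); tuple comparison is lexicographic.
type Key = (u64, u8, u64);

/// Index of the key-pointer whose subtree may contain `target`, i.e. the last
/// pointer with key <= target. Returns None if `target` sorts before key 0.
/// `ptr_keys` must already be sorted, as required by the node format.
fn child_index(ptr_keys: &[Key], target: Key) -> Option<usize> {
    let upper = ptr_keys.partition_point(|k| *k <= target);
    upper.checked_sub(1)
}
```

For a node with keys (256, 1, 0), (300, 1, 0), and (300, 108, 4096), a lookup of (300, 108, 0) descends into the second child, because that key sorts before the third pointer's key.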
Key Structure
Every item and key-pointer is addressed by a three-part key
(btrfs_disk_key):
| Field | Offset | Size | Notes |
|---|---|---|---|
| objectid | 0 | 8 | LE u64 |
| type | 8 | 1 | u8 |
| offset | 9 | 8 | LE u64 |
Total: 17 bytes.
Lexicographic ordering
Keys are compared as a tuple (objectid, type, offset) in that order.
The objectid is compared first; on a tie, type is compared; on a
further tie, offset breaks the tie. All comparisons are unsigned
integer comparisons.
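In Rust, this ordering falls out of a derived Ord as long as the fields are declared in comparison order. The Key struct and parse_key below are written for this example and are not the crate's DiskKey.

```rust
/// Derived Ord compares fields in declaration order: objectid, type, offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    objectid: u64,
    item_type: u8,
    offset: u64,
}

fn le_u64(buf: &[u8], at: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(bytes)
}

/// Decode a 17-byte on-disk key into native integers.
fn parse_key(raw: &[u8; 17]) -> Key {
    Key {
        objectid: le_u64(raw, 0),
        item_type: raw[8],
        offset: le_u64(raw, 9),
    }
}
```

Keys have to be decoded before comparing: the raw 17 bytes are little-endian, so a plain byte-wise comparison would rank objectid 256 (bytes 00 01 ...) below objectid 1 (bytes 01 00 ...).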
Field semantics by tree
The meaning of the three key fields varies depending on the tree and item type:
FS tree:
- objectid = inode number (starting at 256 = BTRFS_FIRST_FREE_OBJECTID)
- type = item type (INODE_ITEM, DIR_ITEM, EXTENT_DATA, etc.)
- offset = type-dependent (0 for INODE_ITEM, name hash for DIR_ITEM, file byte offset for EXTENT_DATA, parent inode for INODE_REF, etc.)
Root tree:
- objectid = tree objectid (e.g. 5 for FS_TREE, 256+ for subvolumes)
- type = ROOT_ITEM, ROOT_REF, or ROOT_BACKREF
- offset = 0 for ROOT_ITEM, child/parent tree ID for refs
Extent tree:
- objectid = logical byte address of the extent
- type = EXTENT_ITEM, METADATA_ITEM, or backref type
- offset = extent length (EXTENT_ITEM), level (METADATA_ITEM), or backref-specific (root objectid, parent bytenr, hash)
Chunk tree:
- objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID (256) for CHUNK_ITEM
- type = CHUNK_ITEM
- offset = logical byte address of the chunk
Device tree:
- objectid = device ID for DEV_EXTENT; BTRFS_DEV_ITEMS_OBJECTID (1) for DEV_ITEM
- type = DEV_EXTENT or DEV_ITEM
- offset = physical offset for DEV_EXTENT; device ID for DEV_ITEM
Checksum tree:
- objectid = BTRFS_EXTENT_CSUM_OBJECTID
- type = EXTENT_CSUM
- offset = logical byte address of the first checksummed sector
Free space tree:
- objectid = block group logical offset (for FREE_SPACE_INFO) or extent start (for FREE_SPACE_EXTENT/BITMAP)
- type = FREE_SPACE_INFO, FREE_SPACE_EXTENT, or FREE_SPACE_BITMAP
- offset = block group length (for INFO) or extent length (for EXTENT/BITMAP)
UUID tree:
- objectid = upper 8 bytes of UUID interpreted as LE u64
- type = UUID_KEY_SUBVOL or UUID_KEY_RECEIVED_SUBVOL
- offset = lower 8 bytes of UUID interpreted as LE u64
Quota tree:
- objectid = packed qgroupid (level << 48) | subvolid
- type = QGROUP_STATUS, QGROUP_INFO, QGROUP_LIMIT, QGROUP_RELATION
- offset = packed qgroupid for relations, 0 otherwise
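As one concrete case of these encodings, the packed qgroupid can be built and taken apart with a shift and a mask. The function names are invented for this sketch, and the level/subvolid text form is the notation the btrfs tools commonly print, assumed here rather than defined by this section.

```rust
const QGROUP_LEVEL_SHIFT: u32 = 48;
const QGROUP_SUBVOLID_MASK: u64 = (1u64 << QGROUP_LEVEL_SHIFT) - 1;

/// Pack a qgroup level and subvolume ID into a single 64-bit qgroupid.
fn qgroupid_pack(level: u16, subvolid: u64) -> u64 {
    ((level as u64) << QGROUP_LEVEL_SHIFT) | (subvolid & QGROUP_SUBVOLID_MASK)
}

/// Split a packed qgroupid back into (level, subvolid).
fn qgroupid_unpack(qgroupid: u64) -> (u16, u64) {
    ((qgroupid >> QGROUP_LEVEL_SHIFT) as u16, qgroupid & QGROUP_SUBVOLID_MASK)
}

/// Render a qgroupid as "level/subvolid".
fn qgroupid_display(qgroupid: u64) -> String {
    let (level, subvolid) = qgroupid_unpack(qgroupid);
    format!("{}/{}", level, subvolid)
}
```

Level-0 qgroups therefore have a qgroupid numerically equal to the subvolume ID they track.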
Key type values
| Value | Name | Description |
|---|---|---|
| 1 | INODE_ITEM_KEY | Inode metadata (mode, size, timestamps, nlink) |
| 12 | INODE_REF_KEY | Link from inode to parent directory (name + index) |
| 13 | INODE_EXTREF_KEY | Extended inode ref for names exceeding INODE_REF capacity |
| 24 | XATTR_ITEM_KEY | Extended attribute (name + value, keyed by name hash) |
| 36 | VERITY_DESC_ITEM_KEY | fs-verity descriptor |
| 37 | VERITY_MERKLE_ITEM_KEY | fs-verity Merkle tree data |
| 48 | ORPHAN_ITEM_KEY | Orphan inode pending cleanup |
| 60 | DIR_LOG_ITEM_KEY | Directory log for fsync optimization |
| 72 | DIR_LOG_INDEX_KEY | Directory log index |
| 84 | DIR_ITEM_KEY | Directory entry keyed by crc32c(name) hash |
| 96 | DIR_INDEX_KEY | Directory entry keyed by sequential index |
| 108 | EXTENT_DATA_KEY | File extent (inline data or reference to disk extent) |
| 128 | EXTENT_CSUM_KEY | Data checksum covering one or more sectors |
| 132 | ROOT_ITEM_KEY | Tree root descriptor (bytenr, generation, UUID, timestamps) |
| 144 | ROOT_BACKREF_KEY | Backref from child subvolume to parent |
| 156 | ROOT_REF_KEY | Forward ref from parent subvolume to child |
| 168 | EXTENT_ITEM_KEY | Extent allocation with backrefs (non-skinny: offset = size) |
| 169 | METADATA_ITEM_KEY | Skinny metadata extent (offset = level, not size) |
| 172 | EXTENT_OWNER_REF_KEY | Simple quota owner backref |
| 176 | TREE_BLOCK_REF_KEY | Standalone backref: metadata extent → owning tree |
| 178 | EXTENT_DATA_REF_KEY | Standalone backref: data extent → (root, ino, offset) |
| 182 | SHARED_BLOCK_REF_KEY | Shared metadata backref (parent block address) |
| 184 | SHARED_DATA_REF_KEY | Shared data backref (parent block address + count) |
| 192 | BLOCK_GROUP_ITEM_KEY | Block group allocation info (used bytes, type, profile) |
| 198 | FREE_SPACE_INFO_KEY | Free space tree: per-block-group metadata |
| 199 | FREE_SPACE_EXTENT_KEY | Free space tree: free extent range |
| 200 | FREE_SPACE_BITMAP_KEY | Free space tree: bitmap of free sectors |
| 204 | DEV_EXTENT_KEY | Physical extent allocated to a chunk on a device |
| 216 | DEV_ITEM_KEY | Device descriptor (size, UUID, I/O parameters) |
| 228 | CHUNK_ITEM_KEY | Chunk mapping logical → physical with stripe info |
| 230 | RAID_STRIPE_KEY | RAID stripe tree entry (zoned devices) |
| 240 | QGROUP_STATUS_KEY | Quota group global status and generation |
| 242 | QGROUP_INFO_KEY | Per-qgroup usage counters (referenced, exclusive) |
| 244 | QGROUP_LIMIT_KEY | Per-qgroup size limits |
| 246 | QGROUP_RELATION_KEY | Parent-child relationship between qgroups |
| 248 | TEMPORARY_ITEM_KEY | Transient item; also used as BALANCE_ITEM_KEY |
| 249 | PERSISTENT_ITEM_KEY | Persistent metadata; also used as DEV_STATS_KEY |
| 250 | DEV_REPLACE_KEY | Device replace operation state |
| 251 | UUID_KEY_SUBVOL | UUID tree: maps subvolume UUID → subvolume ID |
| 252 | UUID_KEY_RECEIVED_SUBVOL | UUID tree: maps received UUID → subvolume ID |
| 253 | STRING_ITEM_KEY | Label or other string metadata |
Well-known objectid values
| Value | Name | Notes |
|---|---|---|
| 1 | ROOT_TREE_OBJECTID | Root tree |
| 2 | EXTENT_TREE_OBJECTID | Extent tree |
| 3 | CHUNK_TREE_OBJECTID | Chunk tree |
| 4 | DEV_TREE_OBJECTID | Device tree |
| 5 | FS_TREE_OBJECTID | Default FS tree |
| 6 | ROOT_TREE_DIR_OBJECTID | Root tree directory |
| 7 | CSUM_TREE_OBJECTID | Checksum tree |
| 8 | QUOTA_TREE_OBJECTID | Quota tree |
| 9 | UUID_TREE_OBJECTID | UUID tree |
| 10 | FREE_SPACE_TREE_OBJECTID | Free space tree |
| 11 | BLOCK_GROUP_TREE_OBJECTID | Block group tree |
| 12 | RAID_STRIPE_TREE_OBJECTID | RAID stripe tree |
| 256 | FIRST_FREE_OBJECTID | First user inode / first subvolume ID |
| (u64)-4 | BALANCE_OBJECTID | Balance status |
| (u64)-5 | ORPHAN_OBJECTID | Orphan items |
| (u64)-6 | TREE_LOG_OBJECTID | Tree log |
| (u64)-7 | TREE_LOG_FIXUP_OBJECTID | Tree log fixup |
| (u64)-8 | TREE_RELOC_OBJECTID | Tree relocation |
| (u64)-9 | DATA_RELOC_TREE_OBJECTID | Data relocation tree |
| (u64)-10 | EXTENT_CSUM_OBJECTID | Extent checksums |
| (u64)-11 | FREE_SPACE_OBJECTID | Free space cache (v1) |
| (u64)-12 | FREE_INO_OBJECTID | Free inode number tracking |
| (u64)-255 | MULTIPLE_OBJECTIDS | Multiple-owner sentinel |
Negative objectids are stored as their unsigned 64-bit two’s complement
representation. For example, BALANCE_OBJECTID = -4 is stored as
0xFFFFFFFF_FFFFFFFC.
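As a small illustration of the two's complement rule above, the following sketch derives a few of the stored values from the signed numbers in the table. The helper name `objectid` is hypothetical and not part of btrfsutils; only the numeric values come from the table.

```rust
/// Reinterpret a signed objectid such as `-4` as the unsigned 64-bit value
/// stored in a key. `objectid` is an illustrative helper name.
pub const fn objectid(signed: i64) -> u64 {
    signed as u64 // two's complement reinterpretation, no value change in bits
}

pub const BALANCE_OBJECTID: u64 = objectid(-4);
pub const DATA_RELOC_TREE_OBJECTID: u64 = objectid(-9);
pub const EXTENT_CSUM_OBJECTID: u64 = objectid(-10);
pub const MULTIPLE_OBJECTIDS: u64 = objectid(-255);
```

Because the values are unsigned on disk, these objectids sort after every regular objectid in key order.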
Trees
Btrfs uses multiple B-trees, each identified by a well-known objectid.
The root tree stores a ROOT_ITEM for each tree, pointing to its root
block.
Root tree (objectid 1)
The directory of all other trees. Contains:
- ROOT_ITEM for each tree (objectid = tree ID, type = ROOT_ITEM, offset = 0)
- ROOT_REF for parent-to-child subvolume links
- ROOT_BACKREF for child-to-parent subvolume links
- ROOT_TREE_DIR directory entry linking to the default subvolume
- TEMPORARY_ITEM for balance status persistence
- PERSISTENT_ITEM for device statistics and replace status
Extent tree (objectid 2)
Tracks all allocated space (data extents and metadata tree blocks) with reference counting and backreferences. Contains:
- EXTENT_ITEM for data and non-skinny metadata extents
- METADATA_ITEM for skinny metadata extents
- TREE_BLOCK_REF for direct metadata backrefs
- SHARED_BLOCK_REF for shared metadata backrefs (snapshots)
- EXTENT_DATA_REF for direct data backrefs
- SHARED_DATA_REF for shared data backrefs (snapshots)
- BLOCK_GROUP_ITEM for each block group (unless the block_group_tree feature is enabled)
Chunk tree (objectid 3)
Maps logical address ranges to physical device stripes. Contains:
- CHUNK_ITEM for each chunk (logical-to-physical mapping)
- DEV_ITEM for each device
The chunk tree is bootstrapped from the superblock’s sys_chunk_array.
Device tree (objectid 4)
Tracks per-device physical extent allocations. Contains:
- DEV_EXTENT for each allocated physical range on each device
FS tree (objectid 5, 256+)
Holds the filesystem content for a subvolume. The default subvolume uses objectid 5; additional subvolumes and snapshots use objectids starting at 256. Contains:
- INODE_ITEM for each inode
- INODE_REF / INODE_EXTREF for hard links
- DIR_ITEM for directory entries (keyed by name hash)
- DIR_INDEX for directory entries (keyed by sequence number)
- EXTENT_DATA for file extent descriptors
- XATTR_ITEM for extended attributes
- ORPHAN_ITEM for unlinked but still open inodes
Checksum tree (objectid 7)
Stores per-sector data checksums. Contains:
- EXTENT_CSUM items: each item covers a contiguous range of data sectors, storing an array of per-sector checksums
Quota tree (objectid 8)
Tracks quota group accounting. Contains:
- QGROUP_STATUS (one per filesystem)
- QGROUP_INFO for each qgroup
- QGROUP_LIMIT for each qgroup with limits
- QGROUP_RELATION for parent-child qgroup relationships
UUID tree (objectid 9)
Provides fast UUID-to-subvolume lookups for send/receive. Contains:
- UUID_KEY_SUBVOL mapping subvolume UUID to objectid
- UUID_KEY_RECEIVED_SUBVOL mapping received UUID to objectid
Free space tree (objectid 10)
Tracks free space per block group, replacing the older free space cache (v1). Contains:
- FREE_SPACE_INFO for each block group
- FREE_SPACE_EXTENT for free ranges
- FREE_SPACE_BITMAP for bitmap-tracked regions
Requires the free_space_tree compat_ro feature flag.
Block group tree (objectid 11)
Separates block group items from the extent tree for faster mount times. Contains:
- BLOCK_GROUP_ITEM for each block group
Requires the block_group_tree compat_ro feature flag. When this tree
is absent, block group items live in the extent tree.
Data relocation tree (objectid (u64)-9)
A temporary FS tree used during balance to hold relocated data extents. Uses the same item types as a regular FS tree.
RAID stripe tree (objectid 12)
Maps logical extents to per-device physical stripe offsets. Contains:
- RAID_STRIPE items
Requires the raid_stripe_tree incompat feature flag.
Item Types
This section documents the key format and payload layout for each major item type.
INODE_ITEM (type 1)
Key: (inode_number, INODE_ITEM, 0)
Exactly one per inode. Stores POSIX attributes, timestamps, and btrfs-specific flags.
Payload (btrfs_inode_item, 160 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Generation when created |
| transid | 8 | 8 | Transaction ID of last modification |
| size | 16 | 8 | Logical file size in bytes |
| nbytes | 24 | 8 | On-disk bytes used (all copies) |
| block_group | 32 | 8 | Block group hint for new allocations |
| nlink | 40 | 4 | Hard link count |
| uid | 44 | 4 | Owner user ID |
| gid | 48 | 4 | Owner group ID |
| mode | 52 | 4 | POSIX file mode (type + permissions) |
| rdev | 56 | 8 | Device number (char/block device inodes) |
| flags | 64 | 8 | Inode flags (see below) |
| sequence | 72 | 8 | NFS-compatible change sequence number |
| reserved | 80 | 32 | Reserved u64[4], must be zero |
| atime | 112 | 12 | Access time (btrfs_timespec) |
| ctime | 124 | 12 | Change time (btrfs_timespec) |
| mtime | 136 | 12 | Modification time (btrfs_timespec) |
| otime | 148 | 12 | Creation time (btrfs_timespec) |
Each btrfs_timespec is 12 bytes:
| Field | Offset | Size | Notes |
|---|---|---|---|
| sec | 0 | 8 | Seconds since Unix epoch (LE u64) |
| nsec | 8 | 4 | Nanosecond component, 0..999999999 (LE u32) |
Inode flags (bitmask):
| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | NODATASUM |
| 1 | 0x2 | NODATACOW |
| 2 | 0x4 | READONLY |
| 3 | 0x8 | NOCOMPRESS |
| 4 | 0x10 | PREALLOC |
| 5 | 0x20 | SYNC |
| 6 | 0x40 | IMMUTABLE |
| 7 | 0x80 | APPEND |
| 8 | 0x100 | NODUMP |
| 9 | 0x200 | NOATIME |
| 10 | 0x400 | DIRSYNC |
| 11 | 0x800 | COMPRESS |
| 31 | 0x80000000 | ROOT_ITEM_INIT |
INODE_REF (type 12)
Key: (inode_number, INODE_REF, parent_dir_inode)
Hard-link reference from an inode to a directory entry. Multiple refs can be packed into a single item when an inode has several hard links in the same parent directory.
Payload (variable, packed sequence of entries):
For each ref:
| Field | Offset | Size | Notes |
|---|---|---|---|
| index | 0 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 8 | 2 | Length of name in bytes (LE u16) |
| name | 10 | name_len | Filename bytes (no NUL terminator) |
Multiple refs are concatenated without padding.
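The packed layout above can be walked with a simple loop. The following is a minimal sketch, not the parser used by btrfsutils; the `InodeRef` type and `parse_inode_refs` function are hypothetical names introduced for illustration.

```rust
/// One decoded INODE_REF entry (illustrative type, not a btrfsutils API).
#[derive(Debug, PartialEq)]
pub struct InodeRef {
    pub index: u64,
    pub name: Vec<u8>,
}

/// Walk a packed INODE_REF payload: index (8) + name_len (2) + name, repeated.
/// Returns `None` if an entry runs past the end of the buffer.
pub fn parse_inode_refs(mut buf: &[u8]) -> Option<Vec<InodeRef>> {
    let mut refs = Vec::new();
    while !buf.is_empty() {
        if buf.len() < 10 {
            return None; // truncated fixed-size part
        }
        let mut idx = [0u8; 8];
        idx.copy_from_slice(&buf[0..8]);
        let index = u64::from_le_bytes(idx);
        let name_len = u16::from_le_bytes([buf[8], buf[9]]) as usize;
        let end = 10 + name_len;
        if buf.len() < end {
            return None; // truncated name
        }
        refs.push(InodeRef { index, name: buf[10..end].to_vec() });
        buf = &buf[end..]; // next entry follows without padding
    }
    Some(refs)
}
```

Names are kept as raw bytes because the format does not guarantee valid UTF-8.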
INODE_EXTREF (type 13)
Key: (inode_number, INODE_EXTREF, crc32c(parent_ino, name))
Extended inode reference. Unlike INODE_REF, the parent inode is stored
in the struct, allowing references from different parent directories.
Requires the extended_iref incompat feature.
Payload (variable, packed sequence):
For each ref:
| Field | Offset | Size | Notes |
|---|---|---|---|
| parent | 0 | 8 | Parent directory inode number (LE u64) |
| index | 8 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 16 | 2 | Length of name (LE u16) |
| name | 18 | name_len | Filename bytes |
DIR_ITEM (type 84) / DIR_INDEX (type 96)
Key for DIR_ITEM: (dir_inode, DIR_ITEM, crc32c(name))
Key for DIR_INDEX: (dir_inode, DIR_INDEX, sequence)
Both use the same on-disk format. DIR_ITEM entries are keyed by the
CRC32C hash of the filename (raw CRC32C, not standard). DIR_INDEX
entries are keyed by a monotonically increasing sequence number for
ordered directory iteration.
Multiple entries can be packed into a single DIR_ITEM when names hash
to the same value (hash collision).
Payload (btrfs_dir_item, variable, packed sequence):
For each entry:
| Field | Offset | Size | Notes |
|---|---|---|---|
| location | 0 | 17 | Target inode key (btrfs_disk_key) |
| transid | 17 | 8 | Transaction ID (LE u64) |
| data_len | 25 | 2 | Xattr value length, 0 for dirs (LE u16) |
| name_len | 27 | 2 | Filename length (LE u16) |
| type | 29 | 1 | File type (see below) |
| name | 30 | name_len | Filename bytes |
| data | 30 + name_len | data_len | Xattr value (for XATTR_ITEM only) |
The location field is a btrfs_disk_key pointing to the target. For
regular directory entries, this typically has objectid = target inode,
type = INODE_ITEM, offset = 0. For subvolume entries, type = ROOT_ITEM
and objectid = the subvolume’s tree objectid.
File type values:
| Value | Name |
|---|---|
| 0 | FT_UNKNOWN |
| 1 | FT_REG_FILE |
| 2 | FT_DIR |
| 3 | FT_CHRDEV |
| 4 | FT_BLKDEV |
| 5 | FT_FIFO |
| 6 | FT_SOCK |
| 7 | FT_SYMLINK |
| 8 | FT_XATTR |
FILE_EXTENT_ITEM (type 108)
Key: (inode_number, EXTENT_DATA, file_byte_offset)
Describes how a range of file bytes maps to on-disk storage. Three extent types exist: inline, regular, and preallocated.
Common header (21 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Allocation generation (LE u64) |
| ram_bytes | 8 | 8 | Uncompressed size (LE u64) |
| compression | 16 | 1 | Compression type (0=none, 1=zlib, 2=lzo, 3=zstd) |
| encryption | 17 | 1 | Reserved (always 0) |
| other_encoding | 18 | 2 | Reserved (always 0) |
| type | 20 | 1 | Extent type (0=inline, 1=regular, 2=prealloc) |
Inline extent (type 0):
After the 21-byte header, the remaining bytes in the item are the file
data itself. The data length is item_size - 21. For compressed inline
extents, the data is compressed and ram_bytes gives the uncompressed
size.
| Field | Offset | Size | Notes |
|---|---|---|---|
| header | 0 | 21 | Common header (type = 0) |
| data | 21 | item_size-21 | Inline file data |
Total item size: 21 + data_length.
Regular extent (type 1) and prealloc extent (type 2):
| Field | Offset | Size | Notes |
|---|---|---|---|
| header | 0 | 21 | Common header (type = 1 or 2) |
| disk_bytenr | 21 | 8 | Logical address of extent on disk (LE u64) |
| disk_num_bytes | 29 | 8 | Size of extent on disk (LE u64) |
| offset | 37 | 8 | Byte offset into extent (LE u64) |
| num_bytes | 45 | 8 | Number of logical file bytes covered (LE u64) |
Total item size: 53 bytes.
A disk_bytenr of 0 indicates a hole (sparse region). For compressed
extents, disk_num_bytes is the compressed size on disk and ram_bytes
is the uncompressed size. The offset field allows referencing into the
middle of a shared extent (e.g., after COW of part of a cloned extent).
Prealloc extents (type 2) are reserved but unwritten; reads return zeroes.
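To make the two layouts concrete, here is a hedged sketch of a decoder that follows the offsets in the tables above. The `FileExtent` enum and `parse_file_extent` function are illustrative names, not the types btrfsutils exposes.

```rust
/// Decoded EXTENT_DATA payload (illustrative type).
#[derive(Debug, PartialEq)]
pub enum FileExtent {
    /// Type 0: file data follows the 21-byte header.
    Inline { ram_bytes: u64, compression: u8, data: Vec<u8> },
    /// Type 1 (regular) or type 2 (prealloc).
    OnDisk {
        prealloc: bool,
        disk_bytenr: u64,
        disk_num_bytes: u64,
        offset: u64,
        num_bytes: u64,
    },
}

/// Read a little-endian u64 at `off`; the caller has checked the length.
fn le_u64(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

pub fn parse_file_extent(item: &[u8]) -> Option<FileExtent> {
    if item.len() < 21 {
        return None; // shorter than the common header
    }
    let ram_bytes = le_u64(item, 8);
    let compression = item[16];
    match item[20] {
        0 => Some(FileExtent::Inline {
            ram_bytes,
            compression,
            data: item[21..].to_vec(), // item_size - 21 bytes of data
        }),
        ty @ 1..=2 if item.len() >= 53 => Some(FileExtent::OnDisk {
            prealloc: ty == 2,
            disk_bytenr: le_u64(item, 21),
            disk_num_bytes: le_u64(item, 29),
            offset: le_u64(item, 37),
            num_bytes: le_u64(item, 45),
        }),
        _ => None, // unknown type or truncated on-disk extent
    }
}
```

A caller would treat an `OnDisk` result with `disk_bytenr == 0` as a hole and return zeroes for that range.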
EXTENT_ITEM (type 168) / METADATA_ITEM (type 169)
Key for EXTENT_ITEM: (logical_bytenr, EXTENT_ITEM, extent_length)
Key for METADATA_ITEM: (logical_bytenr, METADATA_ITEM, level)
Tracks reference counts and backreferences for allocated space.
METADATA_ITEM is the “skinny” variant (when skinny_metadata incompat
flag is set): the extent length is implicit (= nodesize) and the key
offset stores the tree block level instead.
Base payload (btrfs_extent_item, 24 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| refs | 0 | 8 | Number of references (LE u64) |
| generation | 8 | 8 | Allocation generation (LE u64) |
| flags | 16 | 8 | Extent flags (LE u64) |
Extent flags:
| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | EXTENT_FLAG_DATA |
| 1 | 0x2 | EXTENT_FLAG_TREE_BLOCK |
| 8 | 0x100 | BLOCK_FLAG_FULL_BACKREF |
Tree block info (for non-skinny EXTENT_ITEM with TREE_BLOCK flag):
After the base extent item, non-skinny tree block extents include a
btrfs_tree_block_info (18 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| key | 24 | 17 | First key in the tree block (btrfs_disk_key) |
| level | 41 | 1 | Tree block level (u8) |
This is absent for skinny metadata items (METADATA_ITEM), where the
level is encoded in the key offset.
Inline backreferences:
After the extent item header (and tree_block_info if present), zero or more inline backreferences may be packed. Each starts with a 1-byte type tag followed by type-specific data:
| Type byte | Name | Data after type byte |
|---|---|---|
| 176 (0xB0) | TREE_BLOCK_REF | 8 bytes: root_objectid (LE u64) |
| 182 (0xB6) | SHARED_BLOCK_REF | 8 bytes: parent_bytenr (LE u64) |
| 178 (0xB2) | EXTENT_DATA_REF | 28 bytes: root(8) + objectid(8) + offset(8) + count(4) |
| 184 (0xB8) | SHARED_DATA_REF | 12 bytes: parent_bytenr(8) + count(4) |
| 172 (0xAC) | EXTENT_OWNER_REF | 8 bytes: root_objectid (LE u64) |
Note that for EXTENT_DATA_REF, the 8-byte offset field that normally
follows the type byte is absent; the struct fields begin immediately
after the type byte:
| Field | Offset | Size | Notes |
|---|---|---|---|
| type | 0 | 1 | 178 (EXTENT_DATA_REF_KEY) |
| root | 1 | 8 | Owning tree objectid (LE u64) |
| objectid | 9 | 8 | Referencing inode number (LE u64) |
| offset | 17 | 8 | File byte offset of reference (LE u64) |
| count | 25 | 4 | Number of references (LE u32) |
For other inline ref types, the format is:
| Field | Offset | Size | Notes |
|---|---|---|---|
| type | 0 | 1 | Type byte (176/182/184/172) |
| offset | 1 | 8 | Type-specific offset (LE u64) |
For SHARED_DATA_REF, an additional 4 bytes follow:
| Field | Offset | Size | Notes |
|---|---|---|---|
| count | 9 | 4 | Number of references (LE u32) |
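Since each inline backreference has a size determined by its type byte, the inline area can be split without decoding the bodies. The sketch below assumes the sizes listed in the table above (type byte plus data); `inline_ref_len` and `split_inline_refs` are hypothetical helper names.

```rust
/// Total encoded length (type byte included) of one inline backref.
pub fn inline_ref_len(type_byte: u8) -> Option<usize> {
    match type_byte {
        176 | 182 | 172 => Some(1 + 8), // TREE_BLOCK_REF, SHARED_BLOCK_REF, EXTENT_OWNER_REF
        178 => Some(1 + 28),            // EXTENT_DATA_REF
        184 => Some(1 + 12),            // SHARED_DATA_REF
        _ => None,
    }
}

/// Split the bytes after the extent item header (and tree_block_info, if any)
/// into (type, body) pairs. Returns `None` on an unknown type or truncation.
pub fn split_inline_refs(mut area: &[u8]) -> Option<Vec<(u8, &[u8])>> {
    let mut out = Vec::new();
    while let Some(&ty) = area.first() {
        let len = inline_ref_len(ty)?;
        if area.len() < len {
            return None;
        }
        out.push((ty, &area[1..len]));
        area = &area[len..];
    }
    Some(out)
}
```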
Standalone backreference items
When backreferences do not fit inline in the extent item, they are stored as separate items in the extent tree:
TREE_BLOCK_REF (type 176):
Key: (extent_bytenr, TREE_BLOCK_REF, root_objectid).
No data payload; the key offset encodes the owning root.
SHARED_BLOCK_REF (type 182):
Key: (extent_bytenr, SHARED_BLOCK_REF, parent_bytenr).
No data payload; the key offset encodes the parent block.
EXTENT_DATA_REF (type 178):
Key: (extent_bytenr, EXTENT_DATA_REF, hash).
The hash is computed from (root, objectid, offset) using two CRC32C
passes:
high_crc = raw_crc32c(0xFFFFFFFF, root_le_bytes)
low_crc = raw_crc32c(0xFFFFFFFF, objectid_le_bytes)
low_crc = raw_crc32c(low_crc, offset_le_bytes)
hash = (high_crc << 31) ^ low_crc
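The pseudocode above translates fairly directly into Rust. This is a sketch under two assumptions: `raw_crc32c` is implemented here as a plain bitwise CRC32C update (reflected Castagnoli polynomial 0x82F63B78) with no input or output inversion, and both CRC values are widened to 64 bits before the shift. Neither function name is taken from the btrfsutils crates.

```rust
/// Bitwise CRC32C update with no pre- or post-inversion
/// (the "raw" mode; the seed is used as given).
pub fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

/// Key offset for a standalone EXTENT_DATA_REF item.
pub fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = raw_crc32c(0xFFFF_FFFF, &root.to_le_bytes());
    let mut low_crc = raw_crc32c(0xFFFF_FFFF, &objectid.to_le_bytes());
    low_crc = raw_crc32c(low_crc, &offset.to_le_bytes());
    // Widen before shifting so the high bits are not lost.
    ((high_crc as u64) << 31) ^ (low_crc as u64)
}
```

A table-driven or hardware-accelerated CRC would be the usual choice in practice; the bitwise loop is used here only to keep the example dependency-free.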
Payload (btrfs_extent_data_ref, 28 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| root | 0 | 8 | Owning tree objectid (LE u64) |
| objectid | 8 | 8 | Referencing inode (LE u64) |
| offset | 16 | 8 | File byte offset (LE u64) |
| count | 24 | 4 | Reference count (LE u32) |
SHARED_DATA_REF (type 184):
Key: (extent_bytenr, SHARED_DATA_REF, parent_bytenr).
Payload (4 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| count | 0 | 4 | Reference count (LE u32) |
EXTENT_OWNER_REF (type 172):
Key: (extent_bytenr, EXTENT_OWNER_REF, root_objectid).
No data payload. Used with the simple_quota feature.
DEV_ITEM (type 216)
Key: (DEV_ITEMS_OBJECTID [1], DEV_ITEM, devid)
Stored in the chunk tree. Also embedded in the superblock at offset 201.
Payload (btrfs_dev_item, 98 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| total_bytes | 8 | 8 | Total device size (LE u64) |
| bytes_used | 16 | 8 | Bytes allocated on device (LE u64) |
| io_align | 24 | 4 | I/O alignment (LE u32) |
| io_width | 28 | 4 | I/O width (LE u32) |
| sector_size | 32 | 4 | Device sector size (LE u32) |
| type | 36 | 8 | Device type (reserved, 0) (LE u64) |
| generation | 44 | 8 | Generation last updated (LE u64) |
| start_offset | 52 | 8 | Allocation start offset (LE u64) |
| dev_group | 60 | 4 | Device group (reserved, 0) (LE u32) |
| seek_speed | 64 | 1 | Seek speed hint (0 = unset) |
| bandwidth | 65 | 1 | Bandwidth hint (0 = unset) |
| uuid | 66 | 16 | Device UUID |
| fsid | 82 | 16 | Filesystem UUID |
CHUNK_ITEM (type 228)
Key: (FIRST_CHUNK_TREE_OBJECTID [256], CHUNK_ITEM, logical_offset)
Maps a range of logical addresses to physical device locations. Stored
in the chunk tree and (for system chunks) in the superblock’s
sys_chunk_array.
Payload (btrfs_chunk + stripes, variable):
| Field | Offset | Size | Notes |
|---|---|---|---|
| length | 0 | 8 | Chunk size in bytes (LE u64) |
| owner | 8 | 8 | Owner objectid (LE u64) |
| stripe_len | 16 | 8 | Stripe length (typically 65536) (LE u64) |
| type | 24 | 8 | Chunk type + RAID profile flags (LE u64) |
| io_align | 32 | 4 | I/O alignment (LE u32) |
| io_width | 36 | 4 | I/O width (LE u32) |
| sector_size | 40 | 4 | Sector size (LE u32) |
| num_stripes | 44 | 2 | Number of stripes (LE u16) |
| sub_stripes | 46 | 2 | Sub-stripes for RAID10 (LE u16) |
| stripes[] | 48 | … | Array of num_stripes stripe entries |
Each stripe entry (btrfs_stripe, 32 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| offset | 8 | 8 | Physical byte offset on device (LE u64) |
| dev_uuid | 16 | 16 | Device UUID |
Total payload size: 48 + num_stripes * 32 bytes.
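The size formula and stripe layout can be exercised with a short sketch. It reads `num_stripes` at offset 44 and then each 32-byte stripe starting at offset 48, as laid out in the tables above. The `Stripe` struct and both function names are hypothetical, chosen for this example only.

```rust
/// One decoded stripe entry (illustrative type).
#[derive(Debug, PartialEq)]
pub struct Stripe {
    pub devid: u64,
    pub offset: u64,
    pub dev_uuid: [u8; 16],
}

/// Expected payload size for a chunk with `num_stripes` stripes.
pub fn chunk_payload_len(num_stripes: u16) -> usize {
    48 + num_stripes as usize * 32
}

fn le_u64(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

/// Extract the stripe array from a CHUNK_ITEM payload.
pub fn parse_stripes(chunk: &[u8]) -> Option<Vec<Stripe>> {
    if chunk.len() < 48 {
        return None; // fixed part is missing
    }
    let num_stripes = u16::from_le_bytes([chunk[44], chunk[45]]);
    if chunk.len() < chunk_payload_len(num_stripes) {
        return None; // stripe array is truncated
    }
    let mut stripes = Vec::new();
    for i in 0..num_stripes as usize {
        let s = &chunk[48 + i * 32..48 + (i + 1) * 32];
        let mut dev_uuid = [0u8; 16];
        dev_uuid.copy_from_slice(&s[16..32]);
        stripes.push(Stripe { devid: le_u64(s, 0), offset: le_u64(s, 8), dev_uuid });
    }
    Some(stripes)
}
```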
Chunk type flags (bitmask, same as block group flags):
| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | DATA |
| 1 | 0x2 | SYSTEM |
| 2 | 0x4 | METADATA |
| 3 | 0x8 | RAID0 |
| 4 | 0x10 | RAID1 |
| 5 | 0x20 | DUP |
| 6 | 0x40 | RAID10 |
| 7 | 0x80 | RAID5 |
| 8 | 0x100 | RAID6 |
| 9 | 0x200 | RAID1C3 |
| 10 | 0x400 | RAID1C4 |
When no RAID profile bits are set, the chunk is SINGLE profile.
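A minimal sketch of how the profile could be derived from the flags value, using only the bit values in the table above; `profile_name` is a hypothetical helper, not a btrfsutils function.

```rust
/// Return the RAID profile encoded in chunk or block group flags.
/// Falls back to "SINGLE" when no profile bit is set.
pub fn profile_name(flags: u64) -> &'static str {
    const PROFILES: [(u64, &str); 8] = [
        (0x8, "RAID0"),
        (0x10, "RAID1"),
        (0x20, "DUP"),
        (0x40, "RAID10"),
        (0x80, "RAID5"),
        (0x100, "RAID6"),
        (0x200, "RAID1C3"),
        (0x400, "RAID1C4"),
    ];
    for &(bit, name) in PROFILES.iter() {
        if flags & bit != 0 {
            return name;
        }
    }
    "SINGLE"
}
```

The low three bits (DATA, SYSTEM, METADATA) are ignored here because they describe the chunk type rather than the profile.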
DEV_EXTENT (type 204)
Key: (devid, DEV_EXTENT, physical_offset)
The inverse of a chunk stripe: maps a physical range on a device back to the owning chunk.
Payload (btrfs_dev_extent, 48 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| chunk_tree | 0 | 8 | Chunk tree objectid (always 3) (LE u64) |
| chunk_objectid | 8 | 8 | Chunk objectid (LE u64) |
| chunk_offset | 16 | 8 | Logical offset of owning chunk (LE u64) |
| length | 24 | 8 | Length of this device extent (LE u64) |
| chunk_tree_uuid | 32 | 16 | Chunk tree UUID |
BLOCK_GROUP_ITEM (type 192)
Key: (logical_offset, BLOCK_GROUP_ITEM, length)
Tracks space usage for a chunk. Stored in the extent tree (or block
group tree when the block_group_tree feature is enabled).
Payload (btrfs_block_group_item, 24 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| used | 0 | 8 | Bytes used in this block group (LE u64) |
| chunk_objectid | 8 | 8 | Chunk objectid backing this group (LE u64) |
| flags | 16 | 8 | Type + RAID profile flags (LE u64) |
The flags field uses the same bitmask as the chunk type flags (see CHUNK_ITEM).
ROOT_ITEM (type 132)
Key: (tree_objectid, ROOT_ITEM, 0)
Stored in the root tree. Describes a tree root: its block address, generation, subvolume UUIDs, and timestamps.
Payload (btrfs_root_item, 439 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| inode | 0 | 160 | Embedded btrfs_inode_item (root dir inode) |
| generation | 160 | 8 | Generation when last modified (LE u64) |
| root_dirid | 168 | 8 | Root directory inode objectid (LE u64) |
| bytenr | 176 | 8 | Logical bytenr of root block (LE u64) |
| byte_limit | 184 | 8 | Quota byte limit, 0=unlimited (LE u64) |
| bytes_used | 192 | 8 | Bytes used by this tree (LE u64) |
| last_snapshot | 200 | 8 | Generation of last snapshot (LE u64) |
| flags | 208 | 8 | Root flags (LE u64) |
| refs | 216 | 4 | Reference count (LE u32) |
| drop_progress | 220 | 17 | Drop operation progress key (btrfs_disk_key) |
| drop_level | 237 | 1 | Drop operation tree level (u8) |
| level | 238 | 1 | B-tree level of root block (u8) |
| generation_v2 | 239 | 8 | Extended generation (v2) (LE u64) |
| uuid | 247 | 16 | Subvolume UUID |
| parent_uuid | 263 | 16 | Parent subvolume UUID (for snapshots) |
| received_uuid | 279 | 16 | Received UUID (for send/receive) |
| ctransid | 295 | 8 | Last change transaction (LE u64) |
| otransid | 303 | 8 | Creation transaction (LE u64) |
| stransid | 311 | 8 | Send transaction (LE u64) |
| rtransid | 319 | 8 | Receive transaction (LE u64) |
| ctime | 327 | 12 | Change timestamp (btrfs_timespec) |
| otime | 339 | 12 | Creation timestamp (btrfs_timespec) |
| stime | 351 | 12 | Send timestamp (btrfs_timespec) |
| rtime | 363 | 12 | Receive timestamp (btrfs_timespec) |
| reserved | 375 | 64 | Reserved u64[8] |
The embedded inode_item at the start describes the root directory
inode (objectid 256 = BTRFS_FIRST_FREE_OBJECTID for FS trees).
Older filesystems may store a shorter v1 root item without the UUID, transaction, and timestamp fields. The parser handles both formats.
Root item flags:
| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | SUBVOL_RDONLY (read-only snapshot) |
SUBVOL_DEAD (bit 48, value 0x1000000000000) marks a deleted
subvolume pending cleanup.
ROOT_REF (type 156) / ROOT_BACKREF (type 144)
Key for ROOT_REF: (parent_tree_id, ROOT_REF, child_tree_id)
Key for ROOT_BACKREF: (child_tree_id, ROOT_BACKREF, parent_tree_id)
Forward and backward references linking subvolumes to their parent directories. Both use the same on-disk format.
Payload (btrfs_root_ref, 18 bytes + name):
| Field | Offset | Size | Notes |
|---|---|---|---|
| dirid | 0 | 8 | Directory inode containing the subvol entry (LE u64) |
| sequence | 8 | 8 | DIR_INDEX sequence number (LE u64) |
| name_len | 16 | 2 | Length of name (LE u16) |
| name | 18 | name_len | Subvolume name bytes |
FREE_SPACE_INFO (type 198)
Key: (block_group_offset, FREE_SPACE_INFO, block_group_length)
Metadata about free space tracking for a block group.
Payload (btrfs_free_space_info, 8 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| extent_count | 0 | 4 | Number of free extents/bitmap entries (LE u32) |
| flags | 4 | 4 | Free space info flags (LE u32) |
Flags:
| Bit | Value | Name |
|---|---|---|
| 0 | 0x1 | USING_BITMAPS |
FREE_SPACE_EXTENT (type 199)
Key: (start, FREE_SPACE_EXTENT, length)
Represents a contiguous free range within a block group. The item has no data payload; the key itself encodes the start address and length.
FREE_SPACE_BITMAP (type 200)
Key: (start, FREE_SPACE_BITMAP, length)
A bitmap covering a portion of a block group’s address range. The item data is the raw bitmap, where each bit represents one sector of space. Bit set = free, bit clear = allocated.
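As a rough sketch of how such a bitmap could be turned back into free ranges: the code below assumes bits are numbered least-significant-bit first within each byte, which the text above does not state, so treat that ordering as an assumption. `free_ranges` is a hypothetical helper; `start` is the key objectid and `sectorsize` comes from the superblock.

```rust
/// Convert a FREE_SPACE_BITMAP payload into (start, length) free ranges.
/// ASSUMPTION: bit i lives in byte i / 8 at bit position i % 8 (LSB first).
pub fn free_ranges(start: u64, sectorsize: u64, bitmap: &[u8]) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    let mut run_start: Option<u64> = None;
    let nbits = bitmap.len() as u64 * 8;
    for bit in 0..nbits {
        let set = (bitmap[(bit / 8) as usize] & (1u8 << (bit % 8))) != 0;
        let addr = start + bit * sectorsize;
        match (set, run_start) {
            (true, None) => run_start = Some(addr), // a free run begins
            (false, Some(s)) => {
                ranges.push((s, addr - s)); // the run ended at this sector
                run_start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = run_start {
        ranges.push((s, start + nbits * sectorsize - s)); // run reaches the end
    }
    ranges
}
```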
XATTR_ITEM (type 24)
Key: (inode_number, XATTR_ITEM, crc32c(name))
Extended attribute storage. Uses the same on-disk format as DIR_ITEM
(see DIR_ITEM / DIR_INDEX), but with:
- location = zeroed key
- data_len = length of the xattr value
- type = FT_XATTR (8)
- name = xattr name (e.g. user.myattr)
- data = xattr value
EXTENT_CSUM (type 128)
Key: (EXTENT_CSUM_OBJECTID, EXTENT_CSUM, logical_bytenr)
Stores an array of per-sector checksums for a contiguous range of data blocks. The item data is a packed array of checksums, one per sector.
For CRC32C, each checksum is 4 bytes (LE u32), so the item covers
item_size / 4 sectors. The logical byte range covered is:
start = key.offset
end = key.offset + (item_size / csum_size) * sectorsize
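The range formula above, together with a lookup of the checksum for one sector, can be sketched as follows. Both function names are hypothetical; `csum_size` and `sectorsize` are assumed to come from the superblock.

```rust
/// Logical byte range [start, end) covered by one EXTENT_CSUM item.
pub fn csum_item_range(key_offset: u64, item_size: u64, csum_size: u64, sectorsize: u64) -> (u64, u64) {
    let sectors = item_size / csum_size;
    (key_offset, key_offset + sectors * sectorsize)
}

/// Return the stored checksum bytes for the sector containing `logical`,
/// or `None` if the item does not cover that address.
pub fn csum_for(item: &[u8], key_offset: u64, csum_size: usize, sectorsize: u64, logical: u64) -> Option<&[u8]> {
    if logical < key_offset {
        return None;
    }
    let idx = ((logical - key_offset) / sectorsize) as usize;
    item.get(idx * csum_size..(idx + 1) * csum_size)
}
```

For example, a 16-byte item with 4-byte CRC32C checksums and 4096-byte sectors covers four sectors, i.e. 16384 bytes starting at `key.offset`.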
QGROUP_STATUS (type 240)
Key: (0, QGROUP_STATUS, 0)
One per filesystem. Tracks the overall state of quota accounting.
Payload (btrfs_qgroup_status_item, 32-40 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| version | 0 | 8 | On-disk format version (LE u64) |
| generation | 8 | 8 | Last consistent generation (LE u64) |
| flags | 16 | 8 | Status flags (LE u64) |
| scan | 24 | 8 | Rescan progress objectid (LE u64) |
| enable_gen | 32 | 8 | Enable generation (kernel 6.8+, optional) (LE u64) |
QGROUP_INFO (type 242)
Key: (packed_qgroupid, QGROUP_INFO, 0)
where packed_qgroupid = (level << 48) | subvolid.
Payload (btrfs_qgroup_info_item, 40 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| generation | 0 | 8 | Last update generation (LE u64) |
| referenced | 8 | 8 | Total referenced bytes (LE u64) |
| referenced_compressed | 16 | 8 | Referenced bytes (compressed) (LE u64) |
| exclusive | 24 | 8 | Exclusive bytes (LE u64) |
| exclusive_compressed | 32 | 8 | Exclusive bytes (compressed) (LE u64) |
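The packed qgroupid used as the key objectid can be built and split with two small helpers. This is only a sketch of the `(level << 48) | subvolid` packing described above; the function names are hypothetical.

```rust
const QGROUP_LEVEL_SHIFT: u32 = 48;
const QGROUP_ID_MASK: u64 = (1u64 << QGROUP_LEVEL_SHIFT) - 1;

/// Pack a qgroup level and subvolume ID into one u64 key objectid.
pub fn pack_qgroupid(level: u16, subvolid: u64) -> u64 {
    ((level as u64) << QGROUP_LEVEL_SHIFT) | (subvolid & QGROUP_ID_MASK)
}

/// Split a packed qgroupid back into (level, subvolid).
pub fn unpack_qgroupid(qgroupid: u64) -> (u16, u64) {
    ((qgroupid >> QGROUP_LEVEL_SHIFT) as u16, qgroupid & QGROUP_ID_MASK)
}
```

With this packing, the qgroup usually written as `1/100` becomes `0x0001_0000_0000_0064`, and a level-0 qgroup has the same numeric value as its subvolume ID.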
QGROUP_LIMIT (type 244)
Key: (packed_qgroupid, QGROUP_LIMIT, 0)
Payload (btrfs_qgroup_limit_item, 40 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| flags | 0 | 8 | Active limit bitmask (LE u64) |
| max_referenced | 8 | 8 | Max referenced bytes, 0=unlimited (LE u64) |
| max_exclusive | 16 | 8 | Max exclusive bytes, 0=unlimited (LE u64) |
| rsv_referenced | 24 | 8 | Reserved referenced bytes (LE u64) |
| rsv_exclusive | 32 | 8 | Reserved exclusive bytes (LE u64) |
QGROUP_RELATION (type 246)
Key: (child_qgroupid, QGROUP_RELATION, parent_qgroupid)
Defines a parent-child relationship between qgroups. No data payload; the relationship is fully encoded in the key.
UUID_KEY_SUBVOL (type 251) / UUID_KEY_RECEIVED_SUBVOL (type 252)
Key: (upper_half_uuid, UUID_KEY_SUBVOL, lower_half_uuid)
Maps a UUID to one or more subvolume objectids. The UUID is split: the upper 8 bytes are stored as a LE u64 in the objectid field, the lower 8 bytes as a LE u64 in the offset field.
Payload (variable, array of u64):
For each associated subvolume:
8 bytes subvolid Subvolume tree objectid (LE u64)
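A hedged sketch of the key split and payload decoding follows. It assumes that "upper 8 bytes" means the first 8 bytes of the UUID as stored in memory and "lower 8 bytes" the last 8; if the intended meaning differs, the two halves swap. `uuid_to_key` and `parse_subvolids` are illustrative names.

```rust
/// Split a 16-byte UUID into the (objectid, offset) pair of a UUID tree key.
/// ASSUMPTION: bytes 0..8 go to objectid and bytes 8..16 to offset, both LE.
pub fn uuid_to_key(uuid: &[u8; 16]) -> (u64, u64) {
    let mut first = [0u8; 8];
    let mut second = [0u8; 8];
    first.copy_from_slice(&uuid[..8]);
    second.copy_from_slice(&uuid[8..]);
    (u64::from_le_bytes(first), u64::from_le_bytes(second))
}

/// Decode the payload: a packed array of LE u64 subvolume objectids.
/// Trailing bytes that do not form a full u64 are ignored.
pub fn parse_subvolids(payload: &[u8]) -> Vec<u64> {
    payload
        .chunks_exact(8)
        .map(|c| {
            let mut b = [0u8; 8];
            b.copy_from_slice(c);
            u64::from_le_bytes(b)
        })
        .collect()
}
```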
STRING_ITEM (type 253)
Key: (BTRFS_FREE_SPACE_OBJECTID, STRING_ITEM, 0)
Raw byte string. Typically stores the filesystem label in the root tree.
Payload: Raw bytes (length = item data size).
TEMPORARY_ITEM (type 248) / BALANCE_ITEM
Key: (BALANCE_OBJECTID, TEMPORARY_ITEM, 0)
Persists in-progress balance state across reboots.
Payload: The first 8 bytes are balance flags (LE u64). The remainder
contains btrfs_balance_args structures for data, metadata, and system
filters.
PERSISTENT_ITEM (type 249) / DEV_STATS
Key for device stats: (DEV_STATS_OBJECTID [0], PERSISTENT_ITEM, devid)
Key for device replace: (DEV_REPLACE_OBJECTID, DEV_REPLACE, 0)
Device stats payload (40 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| write_errs | 0 | 8 | Write error count (LE u64) |
| read_errs | 8 | 8 | Read error count (LE u64) |
| flush_errs | 16 | 8 | Flush error count (LE u64) |
| corruption_errs | 24 | 8 | Corruption error count (LE u64) |
| generation_errs | 32 | 8 | Generation mismatch count (LE u64) |
Device replace payload (btrfs_dev_replace_item, 72+ bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| src_devid | 0 | 8 | Source device ID (LE u64) |
| cursor_left | 8 | 8 | Left cursor position (LE u64) |
| cursor_right | 16 | 8 | Right cursor position (LE u64) |
| replace_mode | 24 | 8 | Replace mode (LE u64) |
| replace_state | 32 | 8 | Current state (LE u64) |
| time_started | 40 | 8 | Start timestamp (LE u64) |
| time_stopped | 48 | 8 | Stop timestamp (LE u64) |
| num_write_errors | 56 | 8 | Write errors (LE u64) |
| num_uncorrectable_read_errors | 64 | 8 | Uncorrectable reads (LE u64) |
ORPHAN_ITEM (type 48)
Key: (ORPHAN_OBJECTID, ORPHAN_ITEM, inode_number)
Marks an inode that has been unlinked but is still open. The item has no data payload. Orphan items are cleaned up on mount or by the kernel’s orphan cleanup thread.
RAID_STRIPE (type 230)
Key: (logical_offset, RAID_STRIPE, length)
Maps logical extents to per-device physical stripe offsets. Requires the
raid_stripe_tree incompat feature.
Payload (variable):
| Field | Offset | Size | Notes |
|---|---|---|---|
| encoding | 0 | 8 | RAID encoding type (LE u64) |
| stripes[] | 8 | … | Array of stripe entries |
Each stripe entry (16 bytes):
| Field | Offset | Size | Notes |
|---|---|---|---|
| devid | 0 | 8 | Device ID (LE u64) |
| physical | 8 | 8 | Physical byte offset (LE u64) |
Checksums
Btrfs uses two distinct CRC32C computation modes:
Standard CRC32C (on-disk structures)
Used for all on-disk checksums: superblocks, tree block headers, and
data checksums (EXTENT_CSUM items).
This is ISO 3309 / Castagnoli CRC32C: seed = 0xFFFFFFFF, result is
XORed with 0xFFFFFFFF. Equivalent to the standard crc32c() function
in most libraries.
checksum = crc32c(data) // standard ISO 3309 CRC32C
The 4-byte LE result is stored in the checksum field. For superblocks and tree blocks, the checksum covers everything after the 32-byte csum field to the end of the structure.
Raw CRC32C (hash computations)
Used for internal hash computations where the kernel calls crc32c_le()
directly:
- Name hashes for DIR_ITEM keys (crc32c(name))
- Name hashes for XATTR_ITEM keys
- Name hashes for INODE_EXTREF keys
- extent_data_ref key hash computation
- Send stream CRC32C
The raw CRC32C passes the seed through without inversion:
raw_crc32c(seed, data) = !crc32c_append(!seed, data)
This is NOT the standard ISO 3309 algorithm. The seed is typically
0xFFFFFFFF (which is ~0u32), but unlike the standard algorithm, the
output is not inverted.
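The relationship between the two modes can be checked with a dependency-free sketch. Here `raw_crc32c` is a bitwise implementation written for this example, and `crc32c_append` is a local stand-in that mimics the usual library behaviour of inverting the running value on entry and exit; neither is the function btrfsutils actually links against.

```rust
/// Raw mode: plain CRC32C register update, seed and result used as-is.
pub fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // 0x82F63B78 is the reflected Castagnoli polynomial.
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

/// Library-style append (local stand-in): inverts on the way in and out.
pub fn crc32c_append(crc: u32, data: &[u8]) -> u32 {
    !raw_crc32c(!crc, data)
}

/// Standard mode: seed 0xFFFFFFFF internally, final XOR with 0xFFFFFFFF.
pub fn crc32c(data: &[u8]) -> u32 {
    crc32c_append(0, data)
}
```

With these definitions, `raw_crc32c(seed, data)` equals `!crc32c_append(!seed, data)` for any seed, and `crc32c(b"123456789")` yields the well-known CRC32C check value `0xE3069283`.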
Supported checksum algorithms
The csum_type field in the superblock selects the algorithm:
| Value | Name | Output size | Notes |
|---|---|---|---|
| 0 | CRC32C | 4 bytes | Default, by far the most common |
| 1 | xxHash64 | 8 bytes | Fast non-cryptographic hash |
| 2 | SHA-256 | 32 bytes | Cryptographic hash |
| 3 | BLAKE2b | 32 bytes | Cryptographic hash (BLAKE2b-256) |
The maximum checksum size is 32 bytes (BTRFS_CSUM_SIZE), which is also
the size of the checksum field in headers.
Feature Flags
Feature flags are stored in three fields in the superblock. A filesystem implementation must understand all set flags to correctly operate:
- compat_flags: features that are backward-compatible (no known flags currently defined)
- compat_ro_flags: features compatible for read-only mounting
- incompat_flags: features that are fully incompatible
Incompatible feature flags (incompat_flags)
| Bit | Value | Name | Notes |
|---|---|---|---|
| 0 | 0x1 | MIXED_BACKREF | Mixed backref revision (always set on modern fs) |
| 1 | 0x2 | DEFAULT_SUBVOL | A non-default subvolume is the mount target |
| 2 | 0x4 | MIXED_GROUPS | Data and metadata may share block groups |
| 3 | 0x8 | COMPRESS_LZO | LZO compression used |
| 4 | 0x10 | COMPRESS_ZSTD | Zstandard compression used |
| 5 | 0x20 | BIG_METADATA | Metadata blocks > sectorsize (always set when nodesize > sectorsize) |
| 6 | 0x40 | EXTENDED_IREF | Extended inode references (INODE_EXTREF items) |
| 7 | 0x80 | RAID56 | RAID5/6 profiles used |
| 8 | 0x100 | SKINNY_METADATA | Skinny metadata extent refs (METADATA_ITEM instead of EXTENT_ITEM for tree blocks) |
| 9 | 0x200 | NO_HOLES | File extents do not need explicit hole entries |
| 10 | 0x400 | METADATA_UUID | metadata_uuid differs from fsid |
| 11 | 0x800 | RAID1C34 | RAID1C3/RAID1C4 profiles used |
| 12 | 0x1000 | ZONED | Zoned device support |
| 13 | 0x2000 | EXTENT_TREE_V2 | Extent tree v2 (experimental) |
| 14 | 0x4000 | RAID_STRIPE_TREE | RAID stripe tree for stripe mappings |
| 16 | 0x10000 | SIMPLE_QUOTA | Simple quota (per-extent ownership tracking) |
| 17 | 0x20000 | REMAP_TREE | Remap tree (reserved for future use) |
MIXED_BACKREF (bit 0): Indicates the filesystem uses mixed backref format (revision 1). All modern filesystems set this. Old filesystems without it use revision 0 backrefs.
DEFAULT_SUBVOL (bit 1): Set when a non-default subvolume has been
configured as the default mount target via btrfs subvolume set-default.
MIXED_GROUPS (bit 2): Allows data and metadata to share the same block group. Unusual; typically used only on very small filesystems.
COMPRESS_LZO (bit 3): Set when any file on the filesystem uses LZO compression. Once set, it is never cleared.
COMPRESS_ZSTD (bit 4): Set when any file uses Zstandard compression.
BIG_METADATA (bit 5): Set when nodesize > sectorsize, allowing metadata blocks to span multiple sectors. Always set on modern filesystems with the typical 16384-byte nodesize and 4096-byte sectorsize.
EXTENDED_IREF (bit 6): Enables INODE_EXTREF items, which hold additional
names once the INODE_REF item for a given (inode, parent directory) pair
has no room left. Without this, only INODE_REF is used (keyed by a single
parent inode), which limits the number of hard links per parent directory.
SKINNY_METADATA (bit 8): Uses METADATA_ITEM (type 169) instead of
EXTENT_ITEM (type 168) for tree block extent records. The tree block
level is encoded in the key offset, eliminating the separate
btrfs_tree_block_info structure and saving 18 bytes per metadata
extent item.
NO_HOLES (bit 9): File extents do not require explicit hole entries.
Without this flag, holes in sparse files are represented by
FILE_EXTENT_ITEM with disk_bytenr = 0; with it, holes are implicit
(no item needed for the gap).
METADATA_UUID (bit 10): The metadata_uuid field in the superblock
differs from fsid. This allows changing the user-visible filesystem
UUID without rewriting every tree block header.
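A reader typically decodes the bitmask against the table above and refuses to proceed if any unknown bit is set. The following is a minimal sketch of that check; the table constant and function names are hypothetical, and only the bit positions and names are taken from the table.

```rust
/// (bit value, name) pairs from the incompat_flags table.
const INCOMPAT_FLAGS: &[(u64, &str)] = &[
    (1 << 0, "MIXED_BACKREF"),
    (1 << 1, "DEFAULT_SUBVOL"),
    (1 << 2, "MIXED_GROUPS"),
    (1 << 3, "COMPRESS_LZO"),
    (1 << 4, "COMPRESS_ZSTD"),
    (1 << 5, "BIG_METADATA"),
    (1 << 6, "EXTENDED_IREF"),
    (1 << 7, "RAID56"),
    (1 << 8, "SKINNY_METADATA"),
    (1 << 9, "NO_HOLES"),
    (1 << 10, "METADATA_UUID"),
    (1 << 11, "RAID1C34"),
    (1 << 12, "ZONED"),
    (1 << 13, "EXTENT_TREE_V2"),
    (1 << 14, "RAID_STRIPE_TREE"),
    (1 << 16, "SIMPLE_QUOTA"),
    (1 << 17, "REMAP_TREE"),
];

/// Names of all known flags set in `flags`, in bit order.
fn incompat_names(flags: u64) -> Vec<&'static str> {
    INCOMPAT_FLAGS
        .iter()
        .filter(|&&(bit, _)| flags & bit != 0)
        .map(|&(_, name)| name)
        .collect()
}

/// Bits set in `flags` that the caller does not support.
/// A nonzero result means the filesystem must not be opened.
fn unsupported_incompat(flags: u64, supported: u64) -> u64 {
    flags & !supported
}
```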
Compatible read-only feature flags (compat_ro_flags)
| Bit | Value | Name | Notes |
|---|---|---|---|
| 0 | 0x1 | FREE_SPACE_TREE | Free space tree exists |
| 1 | 0x2 | FREE_SPACE_TREE_VALID | Free space tree is valid and should be used |
| 2 | 0x4 | VERITY | fs-verity support enabled |
| 3 | 0x8 | BLOCK_GROUP_TREE | Block group items in separate tree |
FREE_SPACE_TREE (bit 0) + FREE_SPACE_TREE_VALID (bit 1): When both are set, the free space tree (objectid 10) is used instead of the legacy free space cache (v1). Both bits must be set for the tree to be considered valid.
VERITY (bit 2): Indicates that fs-verity has been enabled on at
least one file, and the filesystem contains VERITY_DESC_ITEM and
VERITY_MERKLE_ITEM entries.
BLOCK_GROUP_TREE (bit 3): Block group items are stored in a dedicated block group tree (objectid 11) instead of the extent tree. This improves mount time by avoiding a full extent tree scan to find block groups.
Appendix A: Transaction Model
Btrfs uses a generation-based transaction model. Each transaction is
identified by a monotonically increasing generation counter stored in
the superblock.
Transaction commit
A transaction commit involves:
- All modified tree blocks are written to new locations (COW). Each block’s header records the current generation.
- The superblock is updated with:
  - Incremented generation
  - New root (root tree root address)
  - New chunk_root (if chunk tree changed)
  - Updated bytes_used and total_bytes
  - Rotated super_roots backup entry
- The superblock is written to all mirrors that fit on the device.
The superblock write is the atomic commit point. If the system crashes before the superblock is fully written, the previous superblock (with the previous generation) remains valid and the filesystem rolls back to that state.
Generation consistency
The generation field appears in multiple places, all of which must be consistent:
- Superblock generation: the current transaction counter
- Tree block header generation: must equal the generation when the block was last COWed
- Node key-pointer generation: must match the child block’s header generation (used for read-time validation)
- ROOT_ITEM.generation: the generation when the tree was last modified
- Backup root *_gen fields: generation of each tree root at backup time
When reading a tree, the kernel validates that each block’s generation matches the expected generation from its parent’s key-pointer. A mismatch indicates corruption or a torn write.
Superblock flag: CHANGING_FSID
The BTRFS_SUPER_FLAG_CHANGING_FSID flag (bit 35 of flags, value
1 << 35) is set during an offline fsid rewrite operation. If the system
crashes while this flag is set, the rewrite must be completed or rolled
back on the next access. This provides crash safety for the multi-block
fsid change operation.
Appendix B: Size Constants
| Constant | Size | Notes |
|---|---|---|
| BTRFS_SUPER_INFO_SIZE | 4096 bytes | |
| BTRFS_HEADER_SIZE | 101 bytes | sizeof(btrfs_header) |
| BTRFS_ITEM_SIZE | 25 bytes | sizeof(btrfs_item) |
| BTRFS_KEY_PTR_SIZE | 33 bytes | sizeof(btrfs_key_ptr) |
| BTRFS_DISK_KEY_SIZE | 17 bytes | sizeof(btrfs_disk_key) |
| BTRFS_CSUM_SIZE | 32 bytes | Maximum checksum field width |
| BTRFS_STRIPE_SIZE | 32 bytes | sizeof(btrfs_stripe) |
| BTRFS_INODE_ITEM_SIZE | 160 bytes | sizeof(btrfs_inode_item) |
| BTRFS_ROOT_ITEM_SIZE | 439 bytes | sizeof(btrfs_root_item) |
| BTRFS_DEV_ITEM_SIZE | 98 bytes | sizeof(btrfs_dev_item) |
| BTRFS_TIMESPEC_SIZE | 12 bytes | sizeof(btrfs_timespec) |
| BTRFS_BLOCK_GROUP_SIZE | 24 bytes | sizeof(btrfs_block_group_item) |
| BTRFS_EXTENT_ITEM_SIZE | 24 bytes | sizeof(btrfs_extent_item) |
| BTRFS_TREE_BLOCK_INFO_SIZE | 18 bytes | sizeof(btrfs_tree_block_info) |
| BTRFS_EXTENT_DATA_REF_SIZE | 28 bytes | sizeof(btrfs_extent_data_ref) |
| BTRFS_DEV_EXTENT_SIZE | 48 bytes | sizeof(btrfs_dev_extent) |
| BTRFS_FREE_SPACE_INFO_SIZE | 8 bytes | sizeof(btrfs_free_space_info) |
| BTRFS_ROOT_REF_SIZE | 18 bytes | sizeof(btrfs_root_ref), without name |
| BTRFS_DIR_ITEM_SIZE | 30 bytes | sizeof(btrfs_dir_item), without name/data |
| BTRFS_BACKUP_ROOT_SIZE | 168 bytes | sizeof(btrfs_root_backup) |
| SYS_CHUNK_ARRAY_SIZE | 2048 bytes | |
Appendix C: Logical-to-Physical Address Resolution
All tree block addresses and extent addresses in btrfs are logical addresses. To read a logical address from disk, it must be resolved to a physical device offset through the chunk tree.
The resolution process:
1. Bootstrap: Parse the superblock’s sys_chunk_array to seed an initial chunk cache with system chunk mappings.
2. Read the chunk tree: Using the system chunk mappings, resolve superblock.chunk_root to a physical address and read the chunk tree. Add all CHUNK_ITEM entries to the cache.
3. Resolve: For any logical address, find the chunk whose range contains that address. The physical address is:
   physical = stripe.offset + (logical - chunk.logical)
   For SINGLE and DUP profiles, any stripe yields a valid copy. For RAID1, all stripes hold identical copies. For RAID0/5/6/10, stripe index calculation is needed.
4. Read the root tree: Using the full chunk cache, resolve superblock.root to a physical address and read the root tree. From here, all other trees can be located via their ROOT_ITEM entries.
Appendix D: File Data Layout
A regular file’s on-disk data is described by a sequence of
FILE_EXTENT_ITEM entries in the FS tree, keyed by (inode, EXTENT_DATA, file_offset).
Inline extents: Small files (typically < sectorsize) store their data directly in the tree leaf. No separate disk allocation is needed.
Regular extents: Larger files reference data stored in data chunks.
The extent is described by disk_bytenr (logical address) and
disk_num_bytes (on-disk size). The offset field allows partial
references into shared extents (e.g., after COW or clone operations).
Compressed extents: When compression is enabled, the compression
field is nonzero, disk_num_bytes is the compressed size, and
ram_bytes is the uncompressed size. Inline compressed extents store
the compressed data directly in the item.
Sparse files: With the NO_HOLES feature, gaps between extent items
are implicit holes. Without it, explicit hole entries with
disk_bytenr = 0 fill the gaps.
The file size is stored in INODE_ITEM.size and is authoritative even
if the extent items would suggest a different range.
Extent sharing and cloning
When a file extent is cloned (via cp --reflink or BTRFS_IOC_CLONE),
both the source and destination inodes reference the same on-disk extent
via their FILE_EXTENT_ITEM entries. The reference count in the extent
tree’s EXTENT_ITEM is incremented.
The offset field in FILE_EXTENT_ITEM allows each reference to start
at a different position within the shared extent:
File A: [--- extent X (offset=0, num_bytes=4096) ---]
File B: [--- extent X (offset=2048, num_bytes=2048) ---]
Both reference the same disk_bytenr, but File B starts reading 2048
bytes into the extent.
Compression type encoding
The compression field in FILE_EXTENT_ITEM uses these values:
| Value | Name | Notes |
|---|---|---|
| 0 | none | No compression |
| 1 | zlib | Deflate compression |
| 2 | lzo | LZO compression (btrfs per-sector format) |
| 3 | zstd | Zstandard compression |
When compression is used with inline extents, the stored data is
compressed and the inline data size may differ from ram_bytes.
Appendix E: Subvolume and Snapshot Model
Subvolumes
Each subvolume is an independent FS tree with its own tree objectid (5 for the default, 256+ for user-created subvolumes). The root tree stores:
- A ROOT_ITEM for each subvolume, recording the root block address, generation, UUIDs, and timestamps.
- ROOT_REF/ROOT_BACKREF pairs linking parent and child subvolumes.
Snapshots
A snapshot is a subvolume created by COWing the root block of another subvolume. At creation time, the snapshot shares all tree blocks with the source. As either the source or snapshot is modified, shared blocks are COWed on demand, gradually diverging.
The parent_uuid field in ROOT_ITEM links a snapshot back to its
source subvolume. The received_uuid field tracks the source across
send/receive operations.
Subvolume deletion
Deleted subvolumes are marked with the SUBVOL_DEAD flag in their
ROOT_ITEM.flags. The kernel cleans up the tree blocks asynchronously,
tracking progress via the drop_progress key and drop_level fields.
Read-only snapshots
A subvolume can be made read-only by setting the SUBVOL_RDONLY flag
in ROOT_ITEM.flags. This is required for send operations (the source
subvolume must be read-only).
Appendix F: Name Hashing
Directory entries (DIR_ITEM) and extended attributes (XATTR_ITEM)
are keyed by a CRC32C hash of the name. The hash uses raw CRC32C (see
Section 9.2) with seed ~1 (0xFFFFFFFE), as in the kernel's btrfs_name_hash():
hash = raw_crc32c(0xFFFFFFFE, name_bytes)
This hash determines the key offset for the DIR_ITEM. If two names
hash to the same value (collision), their DIR_ITEM entries are packed
into a single item, concatenated one after another.
DIR_INDEX entries use a monotonically increasing sequence number
instead of a hash, providing deterministic iteration order independent
of name hashing.
For INODE_EXTREF, the hash combines the parent inode number and name by using the parent inode number, truncated to 32 bits, as the seed (as in the kernel's btrfs_extref_hash()):
hash = raw_crc32c(parent_ino as u32, name_bytes)
Appendix G: Block Group and Chunk Relationship
The relationship between chunks, block groups, and device extents forms the space allocation layer:
Chunk (chunk tree)
|
+-- maps logical range [L, L+length) to physical stripes
| on one or more devices
|
+-- Block Group (extent tree or block group tree)
| tracks used/free space within the logical range
| type flags must match the chunk type
|
+-- Device Extent(s) (device tree)
one per stripe, maps physical range back to the chunk
Allocation order: mkfs creates chunks by:
- Choosing a physical region on each device (creating device extents)
- Assigning a logical address range (creating the chunk item)
- Creating a block group covering the logical range
- For the free space tree, creating a FREE_SPACE_INFO and initial FREE_SPACE_EXTENT entries
Consistency invariant: For every chunk, there must be:
- Exactly one BLOCK_GROUP_ITEM with matching logical offset and length
- One DEV_EXTENT per stripe, with chunk_offset pointing back to the chunk
- The block group flags must match the chunk type field
These cross-references are verified by btrfs check.
Appendix H: Default Feature Set
A modern btrfs filesystem created by mkfs.btrfs (or this project’s
btrfs-mkfs) typically has the following features enabled:
Incompatible features:
- MIXED_BACKREF (bit 0) – always set
- BIG_METADATA (bit 5) – set because nodesize (16384) > sectorsize (4096)
- EXTENDED_IREF (bit 6) – enables extended inode references
- SKINNY_METADATA (bit 8) – compact metadata extent records
- NO_HOLES (bit 9) – implicit holes in sparse files
Compatible read-only features:
- FREE_SPACE_TREE (bit 0) – free space tracking tree
- FREE_SPACE_TREE_VALID (bit 1) – free space tree is valid
These are the extref, skinny-metadata, no-holes, and
free-space-tree features referenced in mkfs output.
Default parameters:
- nodesize = 16384 (16 KiB)
- sectorsize = 4096 (4 KiB), matching the device sector size
- stripesize = 65536 (64 KiB)
- csum_type = 0 (CRC32C)
- Metadata profile: DUP (two copies on the same device)
- Data profile: SINGLE (no redundancy)
- System profile: DUP (for single-device) or RAID1 (for multi-device)
Appendix I: Extent Reference Counting
Btrfs tracks references to every allocated extent (both data and
metadata) in the extent tree. The reference count in EXTENT_ITEM.refs
(or METADATA_ITEM.refs) records how many times the extent is
referenced.
Metadata extents
A metadata extent (tree block) is referenced by key-pointers in parent
nodes. When a snapshot is created, the snapshot initially shares all
tree blocks with the source. Each shared block has refs >= 2. When
either tree COWs a shared block, the old block’s refcount is
decremented and the new copy gets refs = 1.
Backreferences track which tree(s) own each block:
- TREE_BLOCK_REF (inline or standalone): direct ownership by a tree root
- SHARED_BLOCK_REF (inline or standalone): ownership via a parent block that is itself shared between trees
Data extents
A data extent is referenced by FILE_EXTENT_ITEM entries in FS trees.
Multiple files (or multiple positions in the same file) can reference
the same data extent through reflink cloning.
Backreferences track which file inodes reference each extent:
- EXTENT_DATA_REF (inline or standalone): records (root, inode, offset, count)
- SHARED_DATA_REF (inline or standalone): records (parent_block, count)
Reference count invariant
The refs field must equal the sum of all backreference counts for the
extent. btrfs check verifies this invariant by walking the extent tree
and cross-referencing with the FS trees.
When refs reaches 0, the extent is freed and its space returned to
the block group’s free space pool.
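The invariant can be sketched as a simple sum over an extent's backreferences. The types below are hypothetical and only model the fields named above. The sketch also assumes that each tree block backreference counts as exactly one reference, since those items carry no explicit count.

```rust
/// Hypothetical in-memory model of the four backreference kinds.
enum Backref {
    TreeBlockRef { root: u64 },
    SharedBlockRef { parent: u64 },
    ExtentDataRef { root: u64, inode: u64, offset: u64, count: u32 },
    SharedDataRef { parent: u64, count: u32 },
}

/// Total number of references described by a set of backrefs.
fn backref_total(backrefs: &[Backref]) -> u64 {
    backrefs
        .iter()
        .map(|b| match b {
            // Assumption: metadata backrefs have an implicit count of one.
            Backref::TreeBlockRef { .. } | Backref::SharedBlockRef { .. } => 1,
            // Data backrefs carry an explicit count.
            Backref::ExtentDataRef { count, .. } | Backref::SharedDataRef { count, .. } => {
                u64::from(*count)
            }
        })
        .sum()
}

/// True when the refs field agrees with the backreferences.
fn refs_consistent(refs: u64, backrefs: &[Backref]) -> bool {
    refs == backref_total(backrefs)
}
```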
Chunks and Block Groups
This document describes the btrfs chunk and block group system: how the filesystem maps logical addresses to physical device locations, how space is organized into typed block groups, and how these structures relate to each other on disk.
All multi-byte integers in btrfs on-disk structures are little-endian.
Address Spaces
Btrfs uses two distinct address spaces:
Logical address space. Every byte of allocated space in the filesystem has a logical address. Tree node pointers, extent references, block group descriptors, and file extent records all use logical addresses. The logical address space is a flat 64-bit namespace shared across all devices in the filesystem. There is no inherent relationship between a logical address and any particular physical device.
Physical address space. Each device has its own independent physical address space, starting at byte 0. Physical addresses identify actual byte offsets on a block device.
The separation exists for several reasons:
- Multi-device support. A single logical address can map to stripes on multiple physical devices (RAID1, DUP, RAID0, etc.) without the upper layers of the filesystem needing to know which devices are involved.
- Relocation. The balance and resize operations can move data between physical locations while logical addresses remain stable. Since all internal pointers use logical addresses, no tree rewriting is needed when physical locations change.
- Redundancy profiles. The same logical address range can have multiple physical copies (DUP, RAID1) or be striped across devices (RAID0); this is invisible to everything above the chunk layer.
The mapping between the two address spaces is maintained by three cooperating data structures: chunks (logical to physical), device extents (physical to logical), and block groups (space accounting).
Chunks
A chunk maps a contiguous range of logical addresses to one or more physical locations on devices. Chunks are the fundamental unit of the logical-to-physical translation.
CHUNK_ITEM On-Disk Structure
Chunks are stored in the chunk tree. Each chunk item has a key:
Key: (FIRST_CHUNK_TREE_OBJECTID, CHUNK_ITEM, logical_offset)
objectid = 256 type = 228 offset = start of logical range
The item payload is a btrfs_chunk structure followed by an array of
btrfs_stripe structures:
btrfs_chunk (48 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| length | 0 | 8 | Logical extent length in bytes |
| owner | 8 | 8 | Owner tree objectid (always EXTENT_TREE_OBJECTID = 2) |
| stripe_len | 16 | 8 | Stripe length for striped profiles (default 64 KiB) |
| type | 24 | 8 | Block group type + RAID profile flags |
| io_align | 32 | 4 | I/O alignment (STRIPE_LEN for normal chunks, sectorsize for bootstrap) |
| io_width | 36 | 4 | I/O width (same as io_align) |
| sector_size | 40 | 4 | Sector size of the underlying devices |
| num_stripes | 44 | 2 | Number of stripe entries following |
| sub_stripes | 46 | 2 | Sub-stripe count (nonzero only for RAID10) |
btrfs_stripe (32 bytes each, num_stripes entries):
| Field | Offset | Size | Description |
|---|---|---|---|
| devid | 0 | 8 | Device ID |
| offset | 8 | 8 | Physical byte offset on that device |
| dev_uuid | 16 | 16 | UUID of the device |
The total item size is 48 + num_stripes * 32 bytes.
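The two tables translate directly into a little-endian decoder. The sketch below is one possible way to parse a chunk item payload from a byte slice; the struct and function names are hypothetical and are not taken from the project's crates.

```rust
const CHUNK_HEADER_SIZE: usize = 48;
const STRIPE_SIZE: usize = 32;

#[derive(Debug, PartialEq)]
struct Stripe {
    devid: u64,
    offset: u64,
    dev_uuid: [u8; 16],
}

#[derive(Debug, PartialEq)]
struct Chunk {
    length: u64,
    owner: u64,
    stripe_len: u64,
    chunk_type: u64,
    io_align: u32,
    io_width: u32,
    sector_size: u32,
    num_stripes: u16,
    sub_stripes: u16,
    stripes: Vec<Stripe>,
}

fn le_u16(buf: &[u8], off: usize) -> u16 {
    let mut b = [0u8; 2];
    b.copy_from_slice(&buf[off..off + 2]);
    u16::from_le_bytes(b)
}

fn le_u32(buf: &[u8], off: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[off..off + 4]);
    u32::from_le_bytes(b)
}

fn le_u64(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

/// Total payload size for a chunk with `num_stripes` stripes.
fn chunk_item_size(num_stripes: u16) -> usize {
    CHUNK_HEADER_SIZE + usize::from(num_stripes) * STRIPE_SIZE
}

/// Decode a chunk item payload; returns None if the buffer is too short.
fn parse_chunk(buf: &[u8]) -> Option<Chunk> {
    if buf.len() < CHUNK_HEADER_SIZE {
        return None;
    }
    let num_stripes = le_u16(buf, 44);
    if buf.len() < chunk_item_size(num_stripes) {
        return None;
    }
    let stripes = (0..usize::from(num_stripes))
        .map(|i| {
            let base = CHUNK_HEADER_SIZE + i * STRIPE_SIZE;
            let mut dev_uuid = [0u8; 16];
            dev_uuid.copy_from_slice(&buf[base + 16..base + 32]);
            Stripe { devid: le_u64(buf, base), offset: le_u64(buf, base + 8), dev_uuid }
        })
        .collect();
    Some(Chunk {
        length: le_u64(buf, 0),
        owner: le_u64(buf, 8),
        stripe_len: le_u64(buf, 16),
        chunk_type: le_u64(buf, 24),
        io_align: le_u32(buf, 32),
        io_width: le_u32(buf, 36),
        sector_size: le_u32(buf, 40),
        num_stripes,
        sub_stripes: le_u16(buf, 46),
        stripes,
    })
}
```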
Logical-to-Physical Resolution
To resolve a logical address to a physical location:
1. Find the chunk whose logical range contains the address. The chunk tree is a B-tree keyed by (256, CHUNK_ITEM, logical_offset), so a lookup finds the entry with the largest logical_offset <= target.
2. Verify the address falls within the chunk: logical_offset <= target < logical_offset + length.
3. Compute the offset within the chunk: within = target - logical_offset.
4. For simple profiles (SINGLE, DUP, RAID1): the physical address on stripe i is stripe[i].offset + within.
5. For striped profiles (RAID0, RAID10, RAID5, RAID6): the stripe index and offset within the stripe are computed from within, stripe_len, and num_stripes/sub_stripes.
The ChunkTreeCache in disk/src/chunk.rs implements this as a BTreeMap
keyed by logical start address, with resolve() returning the physical
offset on the first stripe (sufficient for SINGLE, DUP, and RAID1 reads).
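The lookup described above can be sketched with a BTreeMap range query. The SimpleChunkCache type below is a hypothetical, simplified stand-in written for illustration; it is not the project's ChunkTreeCache and only handles the first-stripe case.

```rust
use std::collections::BTreeMap;

/// Simplified chunk mapping: logical length plus the physical
/// offset of each stripe.
struct ChunkMapping {
    length: u64,
    stripe_offsets: Vec<u64>,
}

/// Chunk mappings keyed by logical start address.
#[derive(Default)]
struct SimpleChunkCache {
    chunks: BTreeMap<u64, ChunkMapping>,
}

impl SimpleChunkCache {
    fn insert(&mut self, logical: u64, length: u64, stripe_offsets: Vec<u64>) {
        self.chunks.insert(logical, ChunkMapping { length, stripe_offsets });
    }

    /// Resolve a logical address to a physical offset on the first stripe.
    fn resolve(&self, logical: u64) -> Option<u64> {
        // Entry with the largest start address <= logical.
        let (&start, chunk) = self.chunks.range(..=logical).next_back()?;
        let within = logical - start;
        if within >= chunk.length {
            return None; // falls in a gap after the chunk
        }
        chunk.stripe_offsets.first().map(|&physical| physical + within)
    }
}
```

The range query does the "largest logical_offset <= target" search in one step, and the explicit length check rejects addresses that fall between chunks.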
Chunk Ownership
The owner field in the chunk item is always BTRFS_EXTENT_TREE_OBJECTID
(2). This is a historical artifact — it does not mean the extent tree
“owns” the chunk in any meaningful sense. The chunk tree is its own
independent tree (tree objectid 3) with its root pointer stored directly
in the superblock.
Block Groups
A block group is the unit of space management in btrfs. Each block group corresponds to exactly one chunk and tracks how much of that chunk’s space is used. Block groups carry type information that determines what kind of data can be stored in them.
BLOCK_GROUP_ITEM On-Disk Structure
Block group items are stored either in the extent tree (traditional
layout) or in the dedicated block-group tree (when the
BLOCK_GROUP_TREE compat_ro feature is enabled).
Key: (logical_offset, BLOCK_GROUP_ITEM, length)
objectid = chunk start type = 192 offset = chunk length
The item payload is a btrfs_block_group_item (24 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| used | 0 | 8 | Bytes currently allocated within this block group |
| chunk_objectid | 8 | 8 | Always FIRST_CHUNK_TREE_OBJECTID (256) |
| flags | 16 | 8 | Type flags + RAID profile flags |
Type Flags
The flags field is a bitfield combining a chunk type (what gets stored)
and a RAID profile (how it is stored):
Chunk type bits (mutually exclusive in practice):
| Flag | Value | Meaning |
|---|---|---|
| DATA | 0x001 | File data extents |
| SYSTEM | 0x002 | Chunk tree blocks (needed to bootstrap reads) |
| METADATA | 0x004 | Tree node blocks (all trees except chunk) |
The kernel also supports DATA|METADATA (0x005) for the mixed-bg
feature, where data and metadata share block groups.
RAID profile bits:
| Flag | Value | Meaning |
|---|---|---|
| (none) | 0 | SINGLE — one copy, one device |
| RAID0 | 0x008 | Striped across N devices, no redundancy |
| RAID1 | 0x010 | Mirrored on 2 devices |
| DUP | 0x020 | Two copies on the same device |
| RAID10 | 0x040 | Striped mirrors |
| RAID5 | 0x080 | Single parity |
| RAID6 | 0x100 | Double parity |
| RAID1C3 | 0x200 | Mirrored on 3 devices |
| RAID1C4 | 0x400 | Mirrored on 4 devices |
For example, a metadata block group using DUP has flags 0x024
(METADATA | DUP). A system block group with no profile bits set is
SYSTEM|single (0x002).
The BlockGroupFlags type in disk/src/items.rs represents these
flags as a bitflags struct with methods type_name() (returns
“Data”, “Metadata”, “System”, etc.) and profile_name() (returns
“RAID1”, “DUP”, “single”, etc.).
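To illustrate how the two groups of bits are separated, the sketch below decodes a raw flags value using plain constants. It does not use the bitflags crate or the project's BlockGroupFlags type; the free functions and the "Data+Metadata" and "unknown" labels are assumptions made for this example.

```rust
const DATA: u64 = 0x001;
const SYSTEM: u64 = 0x002;
const METADATA: u64 = 0x004;
const MIXED: u64 = DATA | METADATA;
const TYPE_MASK: u64 = DATA | SYSTEM | METADATA;

const RAID0: u64 = 0x008;
const RAID1: u64 = 0x010;
const DUP: u64 = 0x020;
const RAID10: u64 = 0x040;
const RAID5: u64 = 0x080;
const RAID6: u64 = 0x100;
const RAID1C3: u64 = 0x200;
const RAID1C4: u64 = 0x400;
const PROFILE_MASK: u64 =
    RAID0 | RAID1 | DUP | RAID10 | RAID5 | RAID6 | RAID1C3 | RAID1C4;

/// Name of the chunk type encoded in the low three bits.
fn type_name(flags: u64) -> &'static str {
    match flags & TYPE_MASK {
        DATA => "Data",
        SYSTEM => "System",
        METADATA => "Metadata",
        MIXED => "Data+Metadata",
        _ => "unknown",
    }
}

/// Name of the RAID profile; no profile bit means SINGLE.
fn profile_name(flags: u64) -> &'static str {
    match flags & PROFILE_MASK {
        0 => "single",
        RAID0 => "RAID0",
        RAID1 => "RAID1",
        DUP => "DUP",
        RAID10 => "RAID10",
        RAID5 => "RAID5",
        RAID6 => "RAID6",
        RAID1C3 => "RAID1C3",
        RAID1C4 => "RAID1C4",
        _ => "unknown",
    }
}
```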
Block Group to Chunk Relationship
Every block group has a 1:1 correspondence with a chunk. The block
group’s key (logical_offset, BLOCK_GROUP_ITEM, length) must match
a chunk item’s (256, CHUNK_ITEM, logical_offset) with matching
length. The block group’s flags must agree with the chunk item’s
type field.
This invariant is verified by btrfs check (see section 8).
Device Extents
Device extents are the inverse mapping of chunks: they record which ranges of physical space on each device are allocated to which chunks.
DEV_EXTENT On-Disk Structure
Device extents are stored in the device tree (tree objectid 4).
Key: (devid, DEV_EXTENT, physical_offset)
objectid = device ID type = 204 offset = start byte on device
The item payload is a btrfs_dev_extent (48 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| chunk_tree | 0 | 8 | Chunk tree objectid (always 3) |
| chunk_objectid | 8 | 8 | FIRST_CHUNK_TREE_OBJECTID (256) |
| chunk_offset | 16 | 8 | Logical offset of the owning chunk |
| length | 24 | 8 | Physical extent length in bytes |
| chunk_tree_uuid | 32 | 16 | UUID of the chunk tree |
Relationship to Chunks and Stripes
For each stripe in a chunk item, there is a corresponding device extent.
If a chunk at logical address L has num_stripes stripes, then:
- Stripe 0: (stripe[0].devid, DEV_EXTENT, stripe[0].offset) with chunk_offset = L and length = chunk.length (for SINGLE/DUP/RAID1).
- Stripe 1 (for DUP/RAID1): (stripe[1].devid, DEV_EXTENT, stripe[1].offset) with chunk_offset = L and length = chunk.length.
For a DUP metadata chunk on a single device, both stripes have the same
devid but different physical offsets, producing two device extents on
the same device.
Device Items
Each device in the filesystem also has a DEV_ITEM in the chunk tree:
Key: (DEV_ITEMS_OBJECTID, DEV_ITEM, devid)
objectid = 1 type = 216 offset = device ID
The item payload is a btrfs_dev_item (98 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| devid | 0 | 8 | Unique device ID |
| total_bytes | 8 | 8 | Total device size |
| bytes_used | 16 | 8 | Bytes allocated to chunks on this device |
| io_align | 24 | 4 | I/O alignment |
| io_width | 28 | 4 | I/O width |
| sector_size | 32 | 4 | Sector size |
| dev_type | 36 | 8 | Reserved (0) |
| generation | 44 | 8 | Last-updated generation |
| start_offset | 52 | 8 | Start offset for allocations |
| dev_group | 60 | 4 | Reserved (0) |
| seek_speed | 64 | 1 | Seek speed hint (0) |
| bandwidth | 65 | 1 | Bandwidth hint (0) |
| uuid | 66 | 16 | Device UUID |
| fsid | 82 | 16 | Filesystem UUID |
The bytes_used field is the sum of the lengths of all device extents
on that device. Each superblock also embeds a copy of the device item
for the device it is stored on (device 1 on a single-device filesystem).
The Bootstrap Problem
Circular Dependency
To read any tree, you need to resolve logical addresses to physical offsets, which requires the chunk tree. But the chunk tree is itself stored at a logical address that needs resolution. This creates a circular dependency.
sys_chunk_array
Btrfs solves this with the sys_chunk_array — a 2048-byte buffer
embedded directly in the superblock. This array contains a subset of the
chunk tree: specifically, the chunk items for SYSTEM-type block groups.
The SYSTEM block group contains the chunk tree’s root block. By parsing the sys_chunk_array, the filesystem driver can locate the chunk tree on disk without needing a chunk tree to find it.
The array format is a packed sequence of (btrfs_disk_key, btrfs_chunk)
pairs:
sys_chunk_array[0..sys_chunk_array_size]:
repeat {
btrfs_disk_key (17 bytes):
objectid: u64_le (always FIRST_CHUNK_TREE_OBJECTID = 256)
type: u8 (always CHUNK_ITEM = 228)
offset: u64_le (logical offset of the chunk)
btrfs_chunk + stripes:
(same format as the chunk item payload described in section 2.1)
}
The sys_chunk_array_size field in the superblock records how many bytes
of the 2048-byte buffer are valid.
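Walking the array amounts to reading a key, reading num_stripes from the chunk header, and skipping ahead by the computed entry size. The sketch below assumes hypothetical names (SysChunk, parse_sys_chunk_array) and keeps only a few fields per entry; it is an illustration of the format, not the project's seed_from_sys_chunk_array().

```rust
const DISK_KEY_SIZE: usize = 17;
const CHUNK_HEADER_SIZE: usize = 48;
const STRIPE_SIZE: usize = 32;
const FIRST_CHUNK_TREE_OBJECTID: u64 = 256;
const CHUNK_ITEM: u8 = 228;

#[derive(Debug, PartialEq)]
struct SysChunk {
    logical: u64,
    length: u64,
    num_stripes: u16,
    first_devid: u64,
    first_offset: u64,
}

fn read_u16(buf: &[u8], off: usize) -> u16 {
    let mut b = [0u8; 2];
    b.copy_from_slice(&buf[off..off + 2]);
    u16::from_le_bytes(b)
}

fn read_u64(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

/// Parse the first `valid_size` bytes of the array.
/// Returns None on a truncated or malformed entry.
fn parse_sys_chunk_array(array: &[u8], valid_size: usize) -> Option<Vec<SysChunk>> {
    let array = array.get(..valid_size)?;
    let mut chunks = Vec::new();
    let mut pos = 0;
    while pos < array.len() {
        let chunk_pos = pos + DISK_KEY_SIZE;
        let stripes_pos = chunk_pos + CHUNK_HEADER_SIZE;
        if stripes_pos > array.len() {
            return None;
        }
        // Every key must be (256, CHUNK_ITEM, logical).
        if read_u64(array, pos) != FIRST_CHUNK_TREE_OBJECTID || array[pos + 8] != CHUNK_ITEM {
            return None;
        }
        let num_stripes = read_u16(array, chunk_pos + 44);
        let end = stripes_pos + usize::from(num_stripes) * STRIPE_SIZE;
        if num_stripes == 0 || end > array.len() {
            return None;
        }
        chunks.push(SysChunk {
            logical: read_u64(array, pos + 9),
            length: read_u64(array, chunk_pos),
            num_stripes,
            first_devid: read_u64(array, stripes_pos),
            first_offset: read_u64(array, stripes_pos + 8),
        });
        pos = end;
    }
    Some(chunks)
}
```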
Bootstrap Sequence
The full bootstrap sequence for reading a btrfs filesystem is:
1. Read the superblock at the primary offset (64 KiB). Verify the magic number, checksum, and fsid. The superblock provides:
   - sys_chunk_array + sys_chunk_array_size
   - chunk_root (logical address of the chunk tree root)
   - root (logical address of the root tree root)
   - nodesize, sectorsize, csum_type
2. Parse the sys_chunk_array to build an initial ChunkTreeCache. This cache contains only the SYSTEM chunk(s), which is enough to resolve the chunk tree root address.
3. Read the chunk tree starting from chunk_root. For each CHUNK_ITEM found, add the mapping to the ChunkTreeCache. After this step, the cache can resolve any logical address in the filesystem.
4. Read the root tree starting from root. This tree contains ROOT_ITEM entries for every other tree (extent, device, FS, csum, free-space, etc.), providing their root block logical addresses.
5. Read any other tree by looking up its ROOT_ITEM in the root tree and using the ChunkTreeCache to resolve addresses.
The seed_from_sys_chunk_array() function in disk/src/chunk.rs
implements step 2. The BlockReader in disk/src/reader.rs orchestrates
the full bootstrap sequence.
RAID Profiles
The RAID profile determines how a chunk’s logical space maps to physical
device locations. The profile affects num_stripes, sub_stripes, and
the interpretation of stripe entries.
SINGLE
num_stripes = 1
sub_stripes = 0
One stripe, one device. Logical offset maps 1:1 to a physical offset on a single device. No redundancy.
Logical: [--------chunk------]
Physical: [dev1: stripe 0 ]
DUP
num_stripes = 2
sub_stripes = 0
Two stripes on the same device at different physical offsets. Both stripes contain identical data. Provides protection against localized media errors but not device failure.
Logical: [--------chunk------]
Physical: [dev1: stripe 0 ]
[dev1: stripe 1 ] (different offset, same data)
DUP is the default metadata profile for single-device filesystems.
The logical size of the chunk equals one stripe size. The physical space
consumed is 2 * stripe_size.
In mkfs, DUP metadata stripes are laid out sequentially after the system group:
Physical layout on device 1:
[0..1M) reserved (superblock at 64K)
[1M..5M) system chunk (4 MiB)
[5M..5M+meta) metadata stripe 0
[5M+meta..5M+2*meta) metadata stripe 1
[5M+2*meta..) data stripe 0
RAID1
num_stripes = 2 (RAID1C3: 3, RAID1C4: 4)
sub_stripes = 0
One stripe per device, each containing identical data. RAID1 uses 2 devices, RAID1C3 uses 3, RAID1C4 uses 4.
Logical: [--------chunk------]
Physical: [dev1: stripe 0 ]
[dev2: stripe 1 ] (same data, different device)
For RAID1 metadata on a 2-device filesystem, mkfs places one stripe
on each device at the same physical offset (CHUNK_START):
Device 1: [system][meta stripe 0][data stripe 0]
Device 2: [meta stripe 1]
RAID0
num_stripes = N (number of devices)
sub_stripes = 0
Data is striped across N devices in stripe_len-sized (64 KiB) units.
No redundancy. The logical chunk size equals N * physical_stripe_size.
Logical: [--A--][--B--][--C--][--A--][--B--][--C--]
Physical: dev1: [--A--] [--A--]
dev2: [--B--] [--B--]
dev3: [--C--] [--C--]
To resolve a logical address within a RAID0 chunk:
- offset = logical - chunk_start
- stripe_nr = offset / stripe_len
- stripe_index = stripe_nr % num_stripes
- stripe_offset = (stripe_nr / num_stripes) * stripe_len + (offset % stripe_len)
- Physical address = stripes[stripe_index].offset + stripe_offset
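The steps above can be sketched as a single function. The Raid0Chunk struct and resolve_raid0 name are hypothetical; the sketch assumes the caller has already checked that the address lies within the chunk's logical length.

```rust
/// Minimal description of a RAID0 chunk for address resolution.
struct Raid0Chunk {
    logical_start: u64,
    stripe_len: u64,
    /// Physical start offset of each stripe, one per device.
    stripe_offsets: Vec<u64>,
}

/// Returns (stripe index, physical address on that stripe's device).
fn resolve_raid0(chunk: &Raid0Chunk, logical: u64) -> Option<(usize, u64)> {
    let num_stripes = chunk.stripe_offsets.len() as u64;
    if num_stripes == 0 || chunk.stripe_len == 0 {
        return None;
    }
    // Offset of the address within the chunk's logical range.
    let offset = logical.checked_sub(chunk.logical_start)?;
    // Which stripe_len-sized unit the address falls into.
    let stripe_nr = offset / chunk.stripe_len;
    // Units are distributed round-robin across the stripes.
    let stripe_index = (stripe_nr % num_stripes) as usize;
    // Full rows before this unit, plus the position inside the unit.
    let stripe_offset =
        (stripe_nr / num_stripes) * chunk.stripe_len + offset % chunk.stripe_len;
    Some((stripe_index, chunk.stripe_offsets[stripe_index] + stripe_offset))
}
```

For example, with three stripes and a 64 KiB stripe length, the fifth unit (stripe_nr = 4) lands on stripe index 1, in the second row of that device.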
RAID10
num_stripes = N (must be even, >= 4)
sub_stripes = 2
Striped mirrors: data is striped across N/2 mirror groups, each group
having sub_stripes (2) copies. Combines RAID0 throughput with RAID1
redundancy.
RAID5 and RAID6
RAID5: num_stripes = N, sub_stripes = 0, one parity stripe
RAID6: num_stripes = N, sub_stripes = 0, two parity stripes
Data is striped with rotating parity. RAID5 tolerates one device failure; RAID6 tolerates two.
Allocation Sizing
When creating a new filesystem (mkfs), the initial chunk sizes are
computed from the total device size. The formulas, implemented in
mkfs/src/layout.rs (ChunkLayout::new), are:
System Block Group
Fixed size and position:
- Offset: SYSTEM_GROUP_OFFSET = 1 MiB (0x100000)
- Size: SYSTEM_GROUP_SIZE = 4 MiB (0x400000)
- Profile: always SINGLE
- Contains: the chunk tree root block
The first 1 MiB of the device is reserved. The primary superblock sits at offset 64 KiB within this reserved area.
Metadata Block Group
meta_size = clamp(total_bytes / 10, 32 MiB, 256 MiB)
meta_size = round_down(meta_size, STRIPE_LEN)
where STRIPE_LEN = 64 KiB and total_bytes is the sum across all
devices.
The metadata chunk starts at logical offset CHUNK_START = 5 MiB
(SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE). For DUP, two physical
stripes are placed sequentially on device 1. For RAID1, one stripe is
placed on each of the first two devices.
Examples:
- 256 MiB device: clamp(25.6M, 32M, 256M) = 32 MiB
- 1 GiB device: clamp(102.4M, 32M, 256M) = 102.375 MiB (102.4 MiB rounded down to a multiple of 64 KiB)
- 10 GiB device: clamp(1G, 32M, 256M) = 256 MiB
Data Block Group
data_size = clamp(total_bytes / 10, 64 MiB, 1 GiB)
data_size = round_down(data_size, STRIPE_LEN)
The data chunk follows the metadata chunk in both logical and physical
address spaces. Logical offset = CHUNK_START + meta_size.
Examples:
- 256 MiB device: clamp(25.6M, 64M, 1G) = 64 MiB
- 1 GiB device: clamp(102.4M, 64M, 1G) = 102.375 MiB (102.4 MiB rounded down to a multiple of 64 KiB)
- 10 GiB device: clamp(1G, 64M, 1G) = 1 GiB
Minimum Device Size
For a single-device filesystem with DUP metadata and SINGLE data, the minimum physical space needed is:
1 MiB (reserved) + 4 MiB (system) + 2 * meta_size + data_size
With the minimum sizes (meta = 32 MiB, data = 64 MiB), this works out to 1 + 4 + 2 * 32 + 64 = 133 MiB. A 100 MiB device will fail with “device too small”.
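The sizing formulas and the minimum-size calculation can be checked with a few lines of integer arithmetic. This is a sketch that restates the formulas above with hypothetical function names; it is not the code in ChunkLayout::new.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;
const STRIPE_LEN: u64 = 64 * KIB;
const SYSTEM_GROUP_OFFSET: u64 = MIB;
const SYSTEM_GROUP_SIZE: u64 = 4 * MIB;

fn round_down(value: u64, align: u64) -> u64 {
    value - value % align
}

/// Logical size of the initial metadata chunk.
fn meta_size(total_bytes: u64) -> u64 {
    round_down((total_bytes / 10).clamp(32 * MIB, 256 * MIB), STRIPE_LEN)
}

/// Logical size of the initial data chunk.
fn data_size(total_bytes: u64) -> u64 {
    round_down((total_bytes / 10).clamp(64 * MIB, GIB), STRIPE_LEN)
}

/// Physical bytes needed on a single device with DUP metadata
/// and SINGLE data: reserved area, system group, two metadata
/// stripes, and one data stripe.
fn required_bytes(total_bytes: u64) -> u64 {
    SYSTEM_GROUP_OFFSET
        + SYSTEM_GROUP_SIZE
        + 2 * meta_size(total_bytes)
        + data_size(total_bytes)
}

/// True when the initial layout fits on the device.
fn device_large_enough(total_bytes: u64) -> bool {
    required_bytes(total_bytes) <= total_bytes
}
```

Integer division is used for total_bytes / 10, so the 1 GiB case yields 107347968 bytes (102.375 MiB) after rounding down to a multiple of STRIPE_LEN.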
Physical Layout Summary
For a single-device DUP-metadata SINGLE-data filesystem:
Physical byte offset:
[0 .. 1M) Reserved (superblock at 64K)
[1M .. 5M) System block group (4 MiB)
[5M .. 5M + meta_size) Metadata stripe 0
[5M + meta_size .. 5M + 2*meta) Metadata stripe 1 (DUP copy)
[5M + 2*meta .. 5M + 2*meta + data) Data
Logical address space:
[1M .. 5M) System chunk
[5M .. 5M + meta_size) Metadata chunk
[5M + meta_size .. 5M + meta + data) Data chunk
Note the physical space for DUP metadata is 2 * meta_size, but the
logical address range is only meta_size. Both physical stripes map to
the same logical range.
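The layout summary can be expressed as a small computation. The sketch below assumes the single-device DUP/SINGLE case; the `Layout` struct and `layout` function are illustrative names, not types from the mkfs crate.

```rust
const MIB: u64 = 1024 * 1024;

/// Half-open byte ranges `[start, end)` for a single-device DUP/SINGLE layout.
#[derive(Debug, PartialEq)]
struct Layout {
    meta_stripe0: (u64, u64),
    meta_stripe1: (u64, u64),
    data_physical: (u64, u64),
    meta_logical: (u64, u64),
    data_logical: (u64, u64),
}

fn layout(meta: u64, data: u64) -> Layout {
    // CHUNK_START = SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE = 5 MiB
    let start = 5 * MIB;
    Layout {
        meta_stripe0: (start, start + meta),
        meta_stripe1: (start + meta, start + 2 * meta),
        data_physical: (start + 2 * meta, start + 2 * meta + data),
        // The logical space holds the metadata range only once.
        meta_logical: (start, start + meta),
        data_logical: (start + meta, start + meta + data),
    }
}
```

With the minimum sizes, the data chunk occupies physical bytes 69 MiB to 133 MiB but logical bytes 37 MiB to 101 MiB, which shows how the DUP copy shifts only the physical addresses.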
Cross-Checks
The btrfs check command (implemented in cli/src/check/chunks.rs)
verifies the consistency of the chunk/block-group/device-extent triad.
Chunk-to-Block-Group Check
For every chunk in the chunk tree, there must be a matching block group
item. The check walks the chunk tree cache and verifies that
block_groups.contains_key(chunk.logical) for each chunk.
If a chunk has no corresponding block group, btrfs check reports:
ChunkMissingBlockGroup { logical }
Block-Group-to-Chunk Check
For every block group item (from the extent tree or block-group tree),
there must be a matching chunk. The check verifies that
chunk_cache.lookup(bg_logical) succeeds for each block group.
If a block group has no corresponding chunk, btrfs check reports:
BlockGroupMissingChunk { logical }
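Both directions of this check amount to a set difference over logical start addresses. The following is a simplified sketch: it models chunks and block groups as plain sets of start addresses, whereas the real checker works on its own cache types, and the `TriadError` enum is a stand-in for the actual error type.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum TriadError {
    ChunkMissingBlockGroup { logical: u64 },
    BlockGroupMissingChunk { logical: u64 },
}

/// Compare chunk start addresses against block group start addresses.
fn cross_check(chunks: &BTreeSet<u64>, block_groups: &BTreeSet<u64>) -> Vec<TriadError> {
    let mut errors = Vec::new();
    // Chunks with no block group at the same logical address.
    for &logical in chunks.difference(block_groups) {
        errors.push(TriadError::ChunkMissingBlockGroup { logical });
    }
    // Block groups with no chunk at the same logical address.
    for &logical in block_groups.difference(chunks) {
        errors.push(TriadError::BlockGroupMissingChunk { logical });
    }
    errors
}
```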
Device Extent Overlap Check
All device extents for each device are collected from the device tree, sorted by physical offset, and checked for overlaps. For consecutive extents on the same device, the check verifies:
extent[i].offset >= extent[i-1].offset + extent[i-1].length
If two device extents overlap, btrfs check reports:
DeviceExtentOverlap { devid, offset }
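The sort-then-scan approach can be sketched as follows. The `DevExtent` struct and `find_overlaps` function are illustrative names under the assumptions stated in the text (sort by device and physical offset, compare neighbours); they are not the checker's actual types.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct DevExtent {
    devid: u64,
    offset: u64,
    length: u64,
}

/// Returns `(devid, offset)` for every extent that starts before the
/// previous extent on the same device has ended.
fn find_overlaps(mut extents: Vec<DevExtent>) -> Vec<(u64, u64)> {
    extents.sort_by_key(|e| (e.devid, e.offset));
    let mut overlaps = Vec::new();
    for pair in extents.windows(2) {
        let (prev, cur) = (pair[0], pair[1]);
        let prev_end = prev.offset.saturating_add(prev.length);
        if prev.devid == cur.devid && cur.offset < prev_end {
            overlaps.push((cur.devid, cur.offset));
        }
    }
    overlaps
}
```

Sorting by `(devid, offset)` means extents on different devices are never compared with each other, so the same physical offset on two devices is not reported.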
Block Group Source
When the BLOCK_GROUP_TREE compat_ro feature is enabled, block group
items are stored in a separate tree (tree objectid 11) rather than in the
extent tree. The check code handles both cases by selecting the
appropriate tree root:
let bg_root = block_group_tree_root.unwrap_or(extent_root);
The Chunk Tree
The chunk tree (tree objectid 3) stores two kinds of items:
- DEV_ITEM entries for each device in the filesystem: (DEV_ITEMS_OBJECTID=1, DEV_ITEM=216, devid)
- CHUNK_ITEM entries for each chunk: (FIRST_CHUNK_TREE_OBJECTID=256, CHUNK_ITEM=228, logical_offset)
Items are sorted by key, so DEV_ITEMs (objectid 1) come before CHUNK_ITEMs (objectid 256).
The chunk tree root pointer is stored directly in the superblock’s
chunk_root field — it does not go through the root tree like other
trees. This is because the chunk tree is needed to read the root tree
itself.
mkfs Chunk Tree Construction
When mkfs builds the chunk tree (build_chunk_tree in
mkfs/src/mkfs.rs), it creates:
- One DEV_ITEM per device, with bytes_used set to the sum of all chunk stripes on that device.
- Three CHUNK_ITEM entries:
  - System chunk at SYSTEM_GROUP_OFFSET (1 MiB), size 4 MiB
  - Metadata chunk at CHUNK_START (5 MiB), with profile-dependent stripes
  - Data chunk after metadata, with profile-dependent stripes
The system chunk item uses sectorsize for io_align and io_width
(matching the kernel’s bootstrap behavior), while the metadata and data
chunks use STRIPE_LEN (64 KiB).
The Device Tree
The device tree (tree objectid 4) stores:
- DEV_STATS (PERSISTENT_ITEM) for each device: per-device I/O error counters, initialized to zero by mkfs.
- DEV_EXTENT entries for each physical stripe of each chunk.
Items are sorted by key: (objectid=devid, type=DEV_EXTENT, offset=physical_byte_offset).
mkfs Device Tree Construction
When mkfs builds the device tree (build_dev_tree in
mkfs/src/mkfs.rs), it creates:
- One DEV_STATS item per device (zeroed counters).
- Device extents for each stripe:
  - System chunk: one DEV_EXTENT on device 1 at SYSTEM_GROUP_OFFSET
  - Metadata chunk: one DEV_EXTENT per stripe (two for DUP on device 1, or one per device for RAID1)
  - Data chunk: one DEV_EXTENT per stripe
All device tree items are collected, sorted by key, and written in order. This is necessary because items span multiple device IDs and physical offsets that are not naturally ordered by construction.
Superblock Mirrors
The superblock is written at up to three fixed physical offsets on each device:
| Mirror | Offset | Size |
|---|---|---|
| 0 | 64 KiB | 4 KiB |
| 1 | 64 MiB | 4 KiB |
| 2 | 256 GiB | 4 KiB |
The formula is: mirror 0 at 65536 bytes; mirror N (N > 0) at
16384 << (12 * N) bytes. Mirrors are only written if the device is
large enough to contain them.
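The offset formula is easy to check in code. This sketch uses hypothetical function names, and it assumes that "large enough" means the whole 4 KiB copy fits on the device.

```rust
const SUPER_MIRROR_MAX: u32 = 3;
const SUPER_INFO_SIZE: u64 = 4096;

/// Physical offset of superblock mirror `mirror`, or `None` past mirror 2.
fn superblock_offset(mirror: u32) -> Option<u64> {
    match mirror {
        0 => Some(64 * 1024),
        n if n < SUPER_MIRROR_MAX => Some(16_384u64 << (12 * n)),
        _ => None,
    }
}

/// Offsets of the mirrors that fit entirely on a device of `device_size` bytes.
fn mirrors_on_device(device_size: u64) -> Vec<u64> {
    (0..SUPER_MIRROR_MAX)
        .filter_map(superblock_offset)
        .filter(|&off| off + SUPER_INFO_SIZE <= device_size)
        .collect()
}
```

A 1 GiB device therefore carries mirrors 0 and 1 only; mirror 2 at 256 GiB is skipped.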
The superblock contains the sys_chunk_array bootstrap data, root
pointers for the chunk tree and root tree, the embedded device item for
device 1, and all filesystem-level metadata (UUID, label, feature flags,
generation counter, bytes_used, etc.).
All three mirrors contain identical data for a given generation. The extra copies provide resilience against corruption of the primary superblock: a reader that scans the available mirrors can select the valid copy with the highest generation.
Tree Block Placement in mkfs
During filesystem creation, tree blocks must be placed at specific logical
addresses within the chunks. The BlockLayout struct in
mkfs/src/layout.rs assigns addresses:
Chunk tree block: placed at SYSTEM_GROUP_OFFSET (1 MiB) in the
system chunk. This is the only tree block in the system block group.
All other tree blocks (root, extent, device, FS, csum, free-space,
data-reloc, and optionally block-group): placed sequentially in the
metadata chunk starting at meta_logical = 5 MiB. With a 16 KiB
nodesize:
| Logical address | Tree |
|---|---|
| meta_logical + 0 | Root tree |
| meta_logical + 16K | Extent tree |
| meta_logical + 32K | Device tree |
| meta_logical + 48K | FS tree |
| meta_logical + 64K | Csum tree |
| meta_logical + 80K | Free-space tree |
| meta_logical + 96K | Data-reloc tree |
| meta_logical + 112K | Block-group tree (if enabled) |
For --rootdir mode, where trees may require multiple blocks, the
BlockAllocator hands out sequential addresses from the system and
metadata chunks, supporting trees of arbitrary size.
System chunk bytes used = nodesize (one chunk tree block).
Metadata chunk bytes used = 7 * nodesize (or 8 with block-group tree).
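The sequential placement in the table above reduces to `meta_logical + index * nodesize`. The sketch below reproduces it; the `TREES` array and `tree_block_addresses` function are hypothetical names, not the BlockLayout API.

```rust
const META_LOGICAL: u64 = 5 * 1024 * 1024;

/// Placement order of the single-block trees in the metadata chunk.
const TREES: [&str; 8] = [
    "root", "extent", "device", "fs", "csum", "free-space", "data-reloc", "block-group",
];

/// Logical address of each tree block, one block per tree.
fn tree_block_addresses(nodesize: u64, block_group_tree: bool) -> Vec<(&'static str, u64)> {
    let count = if block_group_tree { 8 } else { 7 };
    TREES
        .iter()
        .take(count)
        .enumerate()
        .map(|(i, name)| (*name, META_LOGICAL + i as u64 * nodesize))
        .collect()
}
```

The number of returned entries times `nodesize` gives the metadata chunk's bytes used.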
Extent Tree and Backrefs
This document describes the btrfs extent tree: how every allocated byte on disk is tracked, how reference counting works, and how backreferences link extents to the trees and files that use them.
All multi-byte integers in btrfs on-disk structures are little-endian.
Purpose
The extent tree is the central allocator of the btrfs filesystem. It records every contiguous range of allocated disk space (both data extents used by files and metadata blocks used by trees) and tracks who references each extent.
The extent tree serves three purposes:
- Allocation tracking. The set of extent items defines which logical byte ranges are in use. The free-space tree (or free-space cache) is derived from the gaps between extent items.
- Reference counting. Each extent has a declared reference count. Snapshots and clones share extents by incrementing this count rather than copying data. When the count drops to zero, the extent can be freed.
- Backreferences. Each extent stores references back to the trees, inodes, and file offsets that use it. This enables the filesystem to find all users of an extent (for relocation during balance, for example) and to verify consistency (during btrfs check).
The extent tree is stored in tree objectid 2
(BTRFS_EXTENT_TREE_OBJECTID). Its root pointer is stored in the root
tree via a ROOT_ITEM entry.
EXTENT_ITEM vs METADATA_ITEM
There are two key types used to record allocated extents:
EXTENT_ITEM (type 168)
The original extent item format, used for both data and metadata extents.
Key: (bytenr, EXTENT_ITEM, length)
objectid = logical start, type = 168, offset = size in bytes
For data extents, length is the extent’s size on disk. For metadata
extents (tree blocks), length equals the filesystem’s nodesize.
METADATA_ITEM (type 169) — Skinny Metadata
When the SKINNY_METADATA incompat feature is enabled (supported since
Linux 3.10 and the mkfs default since btrfs-progs 3.18), metadata extents
use a more compact key:
Key: (bytenr, METADATA_ITEM, level)
objectid = logical start, type = 169, offset = tree level (0..7)
The extent’s length is implicitly nodesize (not stored in the key).
The level field in the key offset records the B-tree level of the tree
block, which is useful for verification without reading the block itself.
Skinny metadata items are called “skinny refs” because they eliminate the
need for the btrfs_tree_block_info structure that non-skinny
EXTENT_ITEM entries for tree blocks carry.
Key Differences
| Aspect | EXTENT_ITEM (non-skinny) | METADATA_ITEM (skinny) |
|---|---|---|
| Key type | 168 | 169 |
| Key offset | nodesize | tree level (0..7) |
| Item body | extent_item + tree_block_info + inline refs | extent_item + inline refs |
| When used | Always for data; metadata only without skinny_metadata | Metadata only, with skinny_metadata |
In mkfs, the choice is controlled by the skinny_metadata() config flag:
let (item_type, offset) = if skinny {
    (BTRFS_METADATA_ITEM_KEY, 0u64) // level 0 for leaf blocks
} else {
    (BTRFS_EXTENT_ITEM_KEY, nodesize as u64)
};
The Extent Item Header
Both EXTENT_ITEM and METADATA_ITEM share the same header structure,
btrfs_extent_item (24 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| refs | 0 | 8 | Total reference count |
| generation | 8 | 8 | Generation when allocated |
| flags | 16 | 8 | Extent type flags |
Extent Flags
The flags field uses these bits:
| Flag | Value | Meaning |
|---|---|---|
| DATA | 0x01 | Extent holds file data |
| TREE_BLOCK | 0x02 | Extent holds a metadata tree block |
| FULL_BACKREF | 0x100 | Uses shared (parent-based) backrefs only |
A data extent has flags = DATA (0x01). A metadata extent has
flags = TREE_BLOCK (0x02). The FULL_BACKREF flag is set when the
extent uses shared backreferences (after a snapshot) rather than normal
tree backreferences.
The ExtentFlags type in disk/src/items.rs represents these flags as a
bitflags struct.
Tree Block Info (Non-Skinny Only)
For non-skinny EXTENT_ITEM entries with the TREE_BLOCK flag, the header
is followed by btrfs_tree_block_info (18 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| key | 0 | 17 | First key in the tree block (btrfs_disk_key) |
| level | 17 | 1 | B-tree level of the block |
This structure is omitted when using METADATA_ITEM (skinny metadata),
since the level is stored in the key offset and the first key is not
needed.
Full Item Layout
For a skinny metadata extent item with one inline TREE_BLOCK_REF:
| Byte offset | Size | Content |
|---|---|---|
| 0 | 8 | refs (u64_le) |
| 8 | 8 | generation (u64_le) |
| 16 | 8 | flags = TREE_BLOCK (u64_le) |
| 24 | 1 | inline ref type = TREE_BLOCK_REF_KEY (176) |
| 25 | 8 | root objectid (u64_le) |
Total: 33 bytes.
For a data extent item with one inline EXTENT_DATA_REF:
| Byte offset | Size | Content |
|---|---|---|
| 0 | 8 | refs (u64_le) |
| 8 | 8 | generation (u64_le) |
| 16 | 8 | flags = DATA (u64_le) |
| 24 | 1 | inline ref type = EXTENT_DATA_REF_KEY (178) |
| 25 | 8 | root (u64_le) |
| 33 | 8 | objectid (u64_le) – inode number |
| 41 | 8 | offset (u64_le) – file offset |
| 49 | 4 | count (u32_le) |
Total: 53 bytes.
Inline Backrefs
After the extent item header (and tree_block_info for non-skinny metadata), zero or more inline backreferences are packed contiguously. Each inline ref starts with a 1-byte type code, followed by type-specific data.
Inline refs are the common case: they are stored directly inside the extent item, avoiding the overhead of separate B-tree items. When an extent item grows too large to fit in a leaf (due to many references), backrefs are stored as standalone items instead.
TREE_BLOCK_REF (type 176)
Direct backref from a metadata extent to the tree that owns it.
| Field | Offset | Size | Description |
|---|---|---|---|
| type | 0 | 1 | 176 (BTRFS_TREE_BLOCK_REF_KEY) |
| root objectid | 1 | 8 | u64_le |
The root field identifies the tree that owns this metadata block. For
example, root = 5 means the FS tree, root = 2 means the extent tree
itself.
Total size: 9 bytes.
SHARED_BLOCK_REF (type 182)
Shared backref from a metadata extent to a parent tree block. Used when a tree block is shared between snapshots — the backref points to a parent node rather than a root.
| Field | Offset | Size | Description |
|---|---|---|---|
| type | 0 | 1 | 182 (BTRFS_SHARED_BLOCK_REF_KEY) |
| parent bytenr | 1 | 8 | u64_le |
The parent field is the logical byte address of the tree node that
contains a pointer to this extent.
Total size: 9 bytes.
EXTENT_DATA_REF (type 178)
Backref from a data extent to a specific file inode. This is the most common inline ref type for data extents.
| Field | Offset | Size | Description |
|---|---|---|---|
| type | 0 | 1 | 178 (BTRFS_EXTENT_DATA_REF_KEY) |
| root | 1 | 8 | Tree objectid owning the inode (u64_le) |
| objectid | 9 | 8 | Inode number (u64_le) |
| offset | 17 | 8 | File byte offset (u64_le) |
| count | 25 | 4 | Number of references (u32_le) |
Note that unlike other inline ref types, EXTENT_DATA_REF does not
have an 8-byte offset field between the type byte and the struct body.
The struct starts immediately after the type byte. The parser in
disk/src/items.rs handles this by reinterpreting the speculatively
consumed offset bytes as the root field:
raw::BTRFS_EXTENT_DATA_REF_KEY => {
    let root = ref_offset; // already read as u64_le
    let oid = buf.get_u64_le();
    let off = buf.get_u64_le();
    let count = buf.get_u32_le();
    // ...
}
The count field represents how many times this particular
(root, objectid, offset) triple references the extent. For a normal
file with one reference, count = 1. For a file cloned via reflink,
each clone adds a new EXTENT_DATA_REF with its own triple and count.
Total size: 29 bytes.
SHARED_DATA_REF (type 184)
Shared data backref, used when data extents are shared between snapshots.
| Field | Offset | Size | Description |
|---|---|---|---|
| type | 0 | 1 | 184 (BTRFS_SHARED_DATA_REF_KEY) |
| parent bytenr | 1 | 8 | u64_le |
| count | 9 | 4 | u32_le |
Total size: 13 bytes.
EXTENT_OWNER_REF (type 172)
Simple ownership reference, used with the simple_quota feature. Records
which tree root owns the extent without full backref details.
| Field | Offset | Size | Description |
|---|---|---|---|
| type | 0 | 1 | 172 (BTRFS_EXTENT_OWNER_REF_KEY) |
| root objectid | 1 | 8 | u64_le |
Total size: 9 bytes.
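To tie the five layouts together, here is a minimal sketch of a parser that walks a packed run of inline refs. It is not the parser in disk/src/items.rs: the `InlineRef` enum and helper names are invented for this example, and it only relies on the type codes and sizes given in the tables above.

```rust
#[derive(Debug, PartialEq)]
enum InlineRef {
    TreeBlock { root: u64 },
    SharedBlock { parent: u64 },
    ExtentData { root: u64, objectid: u64, offset: u64, count: u32 },
    SharedData { parent: u64, count: u32 },
    ExtentOwner { root: u64 },
}

fn read_u64(buf: &[u8], at: usize) -> Option<u64> {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(buf.get(at..at + 8)?);
    Some(u64::from_le_bytes(raw))
}

fn read_u32(buf: &[u8], at: usize) -> Option<u32> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(buf.get(at..at + 4)?);
    Some(u32::from_le_bytes(raw))
}

/// Parse the bytes that follow the extent item header (and tree_block_info,
/// if present). Returns `None` on an unknown type code or truncated input.
fn parse_inline_refs(mut buf: &[u8]) -> Option<Vec<InlineRef>> {
    let mut refs = Vec::new();
    while let Some((&ty, _)) = buf.split_first() {
        let (parsed, size) = match ty {
            176 => (InlineRef::TreeBlock { root: read_u64(buf, 1)? }, 9),
            182 => (InlineRef::SharedBlock { parent: read_u64(buf, 1)? }, 9),
            178 => (
                InlineRef::ExtentData {
                    root: read_u64(buf, 1)?,
                    objectid: read_u64(buf, 9)?,
                    offset: read_u64(buf, 17)?,
                    count: read_u32(buf, 25)?,
                },
                29,
            ),
            184 => (
                InlineRef::SharedData { parent: read_u64(buf, 1)?, count: read_u32(buf, 9)? },
                13,
            ),
            172 => (InlineRef::ExtentOwner { root: read_u64(buf, 1)? }, 9),
            _ => return None,
        };
        refs.push(parsed);
        buf = &buf[size..];
    }
    Some(refs)
}
```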
Standalone Backrefs
When inline backrefs do not fit inside the extent item (because the item would exceed the available leaf space), they are stored as separate items in the extent tree. Standalone backrefs use the same type codes as inline refs but are encoded as independent key/value pairs.
Standalone TREE_BLOCK_REF
Key: (bytenr, TREE_BLOCK_REF, root_objectid)
objectid = extent start, type = 176, offset = owning tree
Item payload: empty (zero bytes). The backref information is entirely in the key.
Standalone SHARED_BLOCK_REF
Key: (bytenr, SHARED_BLOCK_REF, parent_bytenr)
objectid = extent start, type = 182, offset = parent block
Item payload: empty.
Standalone EXTENT_DATA_REF
Key: (bytenr, EXTENT_DATA_REF, hash)
objectid = extent start, type = 178, offset = CRC32C hash
The key offset is a hash of (root, objectid, offset) computed by:
fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = raw_crc32c(!0u32, &root.to_le_bytes());
    let low_crc = raw_crc32c(!0u32, &objectid.to_le_bytes());
    let low_crc = raw_crc32c(low_crc, &offset.to_le_bytes());
    (u64::from(high_crc) << 31) ^ u64::from(low_crc)
}
This hash function uses raw CRC32C (seed = !0, i.e. 0xFFFFFFFF,
without final complement) applied independently to the root (high part)
and objectid+offset (low part), then combined with a shift and XOR.
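For readers who want to experiment with the hash outside the crate, the sketch below pairs it with a plain bitwise CRC32C. The `raw_crc32c` here is a stand-in written for this example (reflected Castagnoli polynomial 0x82F63B78, caller-supplied seed, no final complement); the crate's own helper may be implemented differently, for instance with a lookup table or hardware instructions.

```rust
/// Bitwise CRC32C with a caller-supplied seed and no final complement.
fn raw_crc32c(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // Reflected Castagnoli polynomial.
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    crc
}

/// Key offset for a standalone EXTENT_DATA_REF item.
fn extent_data_ref_hash(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = raw_crc32c(!0u32, &root.to_le_bytes());
    let low_crc = raw_crc32c(!0u32, &objectid.to_le_bytes());
    let low_crc = raw_crc32c(low_crc, &offset.to_le_bytes());
    (u64::from(high_crc) << 31) ^ u64::from(low_crc)
}
```

Complementing the raw result for the ASCII string "123456789" yields the standard CRC32C check value 0xE3069283, which is a convenient sanity test for any replacement `raw_crc32c`.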
Item payload (28 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| root | 0 | 8 | u64_le |
| objectid | 8 | 8 | u64_le |
| offset | 16 | 8 | u64_le |
| count | 24 | 4 | u32_le |
Standalone SHARED_DATA_REF
Key: (bytenr, SHARED_DATA_REF, parent_bytenr)
objectid = extent start, type = 184, offset = parent block
Item payload (4 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| count | 0 | 4 | u32_le |
Reference Counting
The refs Field
The refs field in btrfs_extent_item is the declared total reference
count for the extent. It equals the sum of all references from both
inline and standalone backrefs.
For TREE_BLOCK_REF, SHARED_BLOCK_REF, and EXTENT_OWNER_REF, each
backref contributes 1 to the total. For EXTENT_DATA_REF and
SHARED_DATA_REF, each backref contributes its count field to the
total.
Counting Rules
The total reference count is computed as:
total = 0
for each inline ref:
if EXTENT_DATA_REF: total += count
if SHARED_DATA_REF: total += count
otherwise: total += 1
for each standalone ref:
if EXTENT_DATA_REF: total += count (from item payload)
if SHARED_DATA_REF: total += count (from item payload)
otherwise: total += 1
The declared refs in the extent item header must equal this computed
total. A mismatch indicates corruption.
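The counting rules translate directly into a fold over the backrefs. This is a minimal sketch with a hypothetical `Backref` enum that keeps only the information needed for counting.

```rust
enum Backref {
    TreeBlock,
    SharedBlock,
    ExtentOwner,
    ExtentData { count: u32 },
    SharedData { count: u32 },
}

/// Sum the contribution of each backref: `count` for data refs, 1 otherwise.
fn counted_refs(refs: &[Backref]) -> u64 {
    refs.iter()
        .map(|r| match r {
            Backref::ExtentData { count } | Backref::SharedData { count } => u64::from(*count),
            _ => 1,
        })
        .sum()
}

/// True when the declared `refs` equals the inline plus standalone total.
fn refs_match(declared: u64, inline: &[Backref], standalone: &[Backref]) -> bool {
    declared == counted_refs(inline) + counted_refs(standalone)
}
```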
Example: Simple File
A newly created file with one 4 KiB extent in the FS tree (root 5):
Key: (bytenr, EXTENT_ITEM, 4096)
refs = 1
generation = 100
flags = DATA
inline EXTENT_DATA_REF:
root = 5, objectid = 257, offset = 0, count = 1
Total refs: count(1) = 1. Matches declared refs.
Example: Snapshot
After taking a snapshot of the FS tree, the same extent is now referenced by both the original and the snapshot. The extent item is updated:
Key: (bytenr, EXTENT_ITEM, 4096)
refs = 2
generation = 100
flags = DATA
inline EXTENT_DATA_REF:
root = 5, objectid = 257, offset = 0, count = 1
inline EXTENT_DATA_REF:
root = 260, objectid = 257, offset = 0, count = 1
Total refs: count(1) + count(1) = 2. Matches declared refs.
Example: Reflink Clone
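A reflink clone within the same tree adds another backref with its own (root, objectid, offset) triple. Here the clone is a second inode (258) that references the extent at file offset 0: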
Key: (bytenr, EXTENT_ITEM, 4096)
refs = 2
generation = 100
flags = DATA
inline EXTENT_DATA_REF:
root = 5, objectid = 257, offset = 0, count = 1
inline EXTENT_DATA_REF:
root = 5, objectid = 258, offset = 0, count = 1
Example: Metadata Block
A metadata block owned by the FS tree:
Key: (bytenr, METADATA_ITEM, 0) // level 0 = leaf
refs = 1
generation = 100
flags = TREE_BLOCK
inline TREE_BLOCK_REF:
root = 5
Data Extent Backrefs in Detail
The EXTENT_DATA_REF Triple
Each data extent backref identifies its user by a
(root, objectid, offset) triple:
- root: the tree objectid containing the referencing inode. For user files this is the FS tree (5) or a subvolume/snapshot tree ID.
- objectid: the inode number of the file that references the extent. Regular file inodes start at 257 (BTRFS_FIRST_FREE_OBJECTID + 1).
- offset: the byte offset within the file where this extent is referenced. This is the key offset of the EXTENT_DATA item in the FS tree.
The count Field
The count field records how many times the exact same
(root, objectid, offset) triple references this extent. In normal
operation, count = 1. It can be greater than 1 in specific scenarios
involving log replay or certain reflink patterns.
Hash Computation for Standalone Keys
When an EXTENT_DATA_REF is stored as a standalone item, the key offset
is not the file offset but rather a hash of the full triple. This allows
multiple data refs with different triples to be stored as separate items
under the same extent bytenr.
The hash function (from disk/src/items.rs) computes:
high = CRC32C(seed=0xFFFFFFFF, root_le_bytes)
low = CRC32C(seed=0xFFFFFFFF, objectid_le_bytes)
low = CRC32C(seed=low, offset_le_bytes)
hash = (high << 31) ^ low
This produces a 63-bit hash: the high CRC occupies bits 31 to 62, so bit 63 is always zero, and bit 31 is the XOR of the high CRC’s lowest bit with the low CRC’s highest bit. The hash is deterministic and the same function is used in both the kernel and userspace tools.
Metadata Extent Backrefs in Detail
TREE_BLOCK_REF
A TREE_BLOCK_REF links a metadata block to the tree that owns it.
The root field is the tree’s objectid:
- 1 = root tree
- 2 = extent tree
- 3 = chunk tree
- 4 = device tree
- 5 = FS tree (default subvolume)
- 7 = csum tree
- 8 = quota tree
- 10 = free-space tree
- >= 256 = subvolume/snapshot trees
SHARED_BLOCK_REF
When a tree block is shared between a subvolume and its snapshot, the
normal TREE_BLOCK_REF is replaced with a SHARED_BLOCK_REF that
points to the parent node. This happens because the same physical block
cannot be “owned” by two different trees simultaneously.
The parent field is the logical bytenr of the tree node whose key
pointer array includes this block. When the filesystem needs to modify
a shared block, it performs copy-on-write: allocating a new block, copying
the data, and updating the parent’s pointer. This is how snapshots achieve
their constant-time creation — they share all blocks with the source
subvolume.
FULL_BACKREF Flag
The FULL_BACKREF flag in the extent item’s flags field indicates that
this metadata extent uses only shared backrefs (no direct tree backrefs).
This typically happens for tree blocks at levels > 0 after a snapshot,
where the ownership is ambiguous until the block is CoW’d.
Cross-Referencing with Tree Ownership
btrfs check collects a map of (block_address -> owning_tree) during
its tree walks. The owning tree for each block is determined by the
owner field in the block’s header (btrfs_header). This map is then
cross-referenced against the extent tree’s TREE_BLOCK_REF entries in
both directions.
Block Group Items in the Extent Tree
Historically, BLOCK_GROUP_ITEM entries were stored directly in the
extent tree alongside extent items. With the BLOCK_GROUP_TREE
compat_ro feature (default since btrfs-progs 6.x), they are moved to a
separate tree (objectid 11).
BLOCK_GROUP_ITEM Structure
Key: (logical_offset, BLOCK_GROUP_ITEM, length)
objectid = group start, type = 192, offset = group size
Item payload (24 bytes):
| Field | Offset | Size | Description |
|---|---|---|---|
| used | 0 | 8 | Bytes allocated within this block group |
| chunk_objectid | 8 | 8 | FIRST_CHUNK_TREE_OBJECTID (256) |
| flags | 16 | 8 | Type + profile flags (for example DATA combined with a profile flag) |
The used field tracks how many bytes of the block group are currently
allocated to extents. For a new filesystem:
- System block group: used = one nodesize (the chunk tree block)
- Metadata block group: used = N * nodesize (all non-chunk tree blocks)
- Data block group: used = 0 (no file data yet)
Ordering in the Extent Tree
When block group items are in the extent tree, they sort among the extent
items by key. Since BLOCK_GROUP_ITEM has type 192 and EXTENT_ITEM has
type 168 / METADATA_ITEM has type 169, block group items for a given
logical offset sort after any extent item at the same address (because
key comparison is (objectid, type, offset) and 192 > 169).
mkfs Construction
mkfs creates three block group items, one for each chunk:
add_block_group_items(extent_items, cfg, layout, chunks, data_used);
This adds entries for the system (SYSTEM flag), metadata (METADATA | profile flag), and data (DATA | profile flag) block groups.
When the BLOCK_GROUP_TREE feature is enabled, these items are placed
in a separate tree instead (build_block_group_tree_with_used).
What btrfs check Verifies
The extent tree checker (implemented in cli/src/check/extents.rs)
performs several categories of verification.
Reference Count Matching
For each extent item (EXTENT_ITEM or METADATA_ITEM) and its associated
standalone backrefs, the checker computes the total reference count from
inline + standalone refs and compares it to the declared refs field:
if state.pending_refs != state.pending_counted {
    results.report(CheckError::ExtentRefMismatch {
        bytenr, expected: state.pending_refs, found: state.pending_counted,
    });
}
The checker processes items in key order. When it encounters a new EXTENT_ITEM or METADATA_ITEM, it “flushes” the previous extent (checking its ref count) and begins accumulating refs for the new one. Standalone backref items (TREE_BLOCK_REF, SHARED_BLOCK_REF, EXTENT_DATA_REF, SHARED_DATA_REF, EXTENT_OWNER_REF) that follow an extent item with a matching objectid add to the running count.
Extent Overlap Detection
Extents in the extent tree are sorted by logical address. The checker tracks the end address of the previous extent and reports an error if the next extent starts before the previous one ends:
if length > 0 && bytenr < state.prev_end && state.prev_end > 0 {
    results.report(CheckError::OverlappingExtent {
        bytenr, length, prev_end: state.prev_end,
    });
}
Note that METADATA_ITEM entries store the tree level (not the length) in the key offset. Since the checker does not have access to the nodesize at this point, it uses length = 0 for metadata items and skips overlap detection for them.
Backref Owner Cross-Checks (Direction 1: Walk to Extent)
During tree walks in earlier check phases, the checker builds a map of
tree_block_owners: HashMap<u64, u64> mapping each tree block’s logical
address to the tree objectid that owns it (from the block header’s
owner field).
After processing the extent tree, the checker verifies that every block encountered during walks has an extent item:
if !state.extent_item_addrs.contains(&addr) {
    results.report(CheckError::MissingExtentItem { bytenr: addr });
}
And that the extent tree’s backrefs agree with the actual owner:
if !claimed_owners.contains(&actual_owner) {
    results.report(CheckError::BackrefOwnerMismatch {
        bytenr: addr, actual_owner, claimed_owners,
    });
}
Backref Owner Cross-Checks (Direction 2: Extent to Walk)
The checker also verifies the reverse: every TREE_BLOCK_REF in the
extent tree (both inline and standalone) must correspond to a tree block
that was actually encountered during walks and is owned by the claimed
tree:
let actual = tree_block_owners.get(&addr).copied();
if actual != Some(claimed) {
    results.report(CheckError::BackrefOrphan {
        bytenr: addr, claimed_owner: claimed,
    });
}
This catches “orphan” backrefs that point to blocks that either do not exist or are owned by a different tree than claimed.
Data Byte Accounting
The checker accumulates two statistics from data extents:
- data_bytes_allocated: the sum of length for all data extent items. This is the total physical space reserved for data.
- data_bytes_referenced: the sum of length * count for all data extent references. When data is shared (via snapshots or reflinks), referenced bytes exceed allocated bytes.
For inline-only data refs (no standalone ExtentDataRef items),
referenced bytes are computed from the inline ref count. For standalone
refs, each EXTENT_DATA_REF and SHARED_DATA_REF item contributes
length * count.
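The two totals can be illustrated with a short sketch. The `DataExtent` struct below is hypothetical: it models each extent as a length plus the `count` values of all its data refs, regardless of whether they are inline or standalone.

```rust
struct DataExtent {
    length: u64,
    /// `count` of every EXTENT_DATA_REF / SHARED_DATA_REF on this extent.
    ref_counts: Vec<u32>,
}

/// Returns `(data_bytes_allocated, data_bytes_referenced)`.
fn account(extents: &[DataExtent]) -> (u64, u64) {
    let mut allocated = 0u64;
    let mut referenced = 0u64;
    for extent in extents {
        allocated += extent.length;
        let total_refs: u64 = extent.ref_counts.iter().map(|&c| u64::from(c)).sum();
        referenced += extent.length * total_refs;
    }
    (allocated, referenced)
}
```

An 8 KiB extent shared by two files counts once toward allocated bytes and twice toward referenced bytes.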
Extent Item Construction in mkfs
Metadata Extent Items
For each tree block allocated during mkfs, the extent tree receives a
metadata extent item with one inline TREE_BLOCK_REF:
fn metadata_extent_item(addr: u64, skinny: bool, generation: u64, owner: u64, nodesize: u64) -> (Key, Vec<u8>) {
    // Skinny metadata stores the tree level in the key offset;
    // the legacy format stores the block length (nodesize).
    let (item_type, offset) = if skinny {
        (BTRFS_METADATA_ITEM_KEY, 0u64) // offset = level 0
    } else {
        (BTRFS_EXTENT_ITEM_KEY, nodesize) // offset = nodesize
    };
    (
        Key::new(addr, item_type, offset),
        extent_item(1, generation, skinny, owner),
    )
}
The extent_item() function serializes:
- btrfs_extent_item header: refs=1, generation, flags=TREE_BLOCK
- For non-skinny: zero-filled btrfs_tree_block_info (18 bytes)
- Inline TREE_BLOCK_REF: type byte (176) + root objectid (8 bytes)
Total item size: 33 bytes (skinny) or 51 bytes (non-skinny).
Data Extent Items
For each data extent written during --rootdir mode, the extent tree
receives a data extent item with one inline EXTENT_DATA_REF:
fn data_extent_item(refs: u64, generation: u64, root: u64, objectid: u64, offset: u64, count: u32) -> Vec<u8> {
    // 24-byte header + 29-byte inline ref = 53 bytes
    let mut buf = Vec::with_capacity(53);
    // btrfs_extent_item header
    buf.put_u64_le(refs);
    buf.put_u64_le(generation);
    buf.put_u64_le(BTRFS_EXTENT_FLAG_DATA);
    // inline EXTENT_DATA_REF
    buf.put_u8(BTRFS_EXTENT_DATA_REF_KEY);
    buf.put_u64_le(root);
    buf.put_u64_le(objectid);
    buf.put_u64_le(offset);
    buf.put_u32_le(count);
    buf
}
Total item size: 53 bytes. The key is
(extent_bytenr, EXTENT_ITEM, extent_length).
Self-Referential Convergence
The extent tree must contain entries for its own tree blocks. But the number of tree blocks needed depends on how many items the tree contains, which depends on how many extent items there are, which depends on the number of tree blocks… This creates a circular dependency.
The --rootdir code path solves this with a convergence loop
(converge_extent_tree_block_count in mkfs/src/mkfs.rs):
- Start with extent_tree_block_count = 1.
- Build a trial extent tree with all items (including placeholder entries for the extent tree’s own blocks).
- If the trial tree’s actual block count differs from the assumed count, update the count and repeat.
- The loop converges quickly (usually in 1-2 iterations) because adding extent items for additional blocks only marginally increases the tree size.
After convergence, the real extent tree is built with actual logical
addresses assigned by the BlockAllocator.
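The convergence loop is a fixed-point iteration, which the following sketch illustrates. It is not `converge_extent_tree_block_count` itself: the block-count model (`toy_blocks_needed`, 100 items per leaf, one root node once there is more than one leaf) and the iteration cap are assumptions made purely for the example.

```rust
/// Toy model: how many tree blocks a tree with `items` items needs.
fn toy_blocks_needed(items: u64) -> u64 {
    const ITEMS_PER_LEAF: u64 = 100;
    let leaves = (items + ITEMS_PER_LEAF - 1) / ITEMS_PER_LEAF;
    // A single leaf is its own root; otherwise add one internal node.
    if leaves <= 1 { 1 } else { leaves + 1 }
}

/// Iterate until the assumed block count equals the count the trial tree needs.
/// `other_items` is the number of extent tree items not describing the extent
/// tree's own blocks. Returns `None` if no fixed point is found in 32 rounds.
fn converge_block_count(other_items: u64, blocks_needed: impl Fn(u64) -> u64) -> Option<u64> {
    let mut assumed = 1u64;
    for _ in 0..32 {
        // The tree carries one extent item per block of its own.
        let actual = blocks_needed(other_items + assumed);
        if actual == assumed {
            return Some(assumed);
        }
        assumed = actual;
    }
    None
}
```

In this model, 198 other items first need 3 blocks, then 4 once the extra self-referential items are added, and the count stays at 4 on the next pass.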
Extent Tree Key Ordering
Items in the extent tree are sorted by the standard btrfs key comparison
(objectid, type, offset). Since objectid is the extent’s logical
byte address, items are effectively sorted by logical address.
Within a single extent’s address, the ordering is:
- EXTENT_ITEM or METADATA_ITEM (type 168 or 169): the extent header
- EXTENT_OWNER_REF (type 172): if simple quotas are enabled
- TREE_BLOCK_REF (type 176): standalone metadata backrefs
- EXTENT_DATA_REF (type 178): standalone data backrefs
- SHARED_BLOCK_REF (type 182): standalone shared metadata backrefs
- SHARED_DATA_REF (type 184): standalone shared data backrefs
- BLOCK_GROUP_ITEM (type 192): if not using block-group tree
This ordering is a natural consequence of the type field values and
ensures that btrfs check can process all backrefs for an extent by
reading items sequentially until the objectid (bytenr) changes.
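In Rust, this ordering falls out of a derived `Ord` when the fields are declared in (objectid, type, offset) order. The `Key` struct below is a stand-in for illustration, not the crate's key type.

```rust
/// Field order matters: derived `Ord` compares objectid, then type, then offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    objectid: u64,
    item_type: u8,
    offset: u64,
}

/// Sort keys and return the item types in the resulting order.
fn sorted_types(mut keys: Vec<Key>) -> Vec<u8> {
    keys.sort();
    keys.iter().map(|k| k.item_type).collect()
}
```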
Relationship to File Extents
The connection between the extent tree and actual file data flows through
EXTENT_DATA items in FS trees:
FS tree: (inode, EXTENT_DATA, file_offset)
-> disk_bytenr, disk_num_bytes, offset, num_bytes
Extent tree: (disk_bytenr, EXTENT_ITEM, disk_num_bytes)
-> refs, generation, flags=DATA
-> inline EXTENT_DATA_REF(root, inode, file_offset, count)
The disk_bytenr in the file extent item is the logical address of the
data extent. The extent tree entry at that address records who references
the extent and how many times.
For inline file extents (small files where data is embedded directly in the tree leaf), there is no corresponding extent tree entry — the data does not occupy a separate extent.
For hole/sparse extents (disk_bytenr = 0), there is similarly no extent
tree entry. The no-holes feature eliminates explicit hole extent items
entirely.
Summary of Key Formats
| Item type | Key | Payload |
|---|---|---|
| EXTENT_ITEM | (bytenr, 168, length) | extent_item + inline refs |
| METADATA_ITEM | (bytenr, 169, level) | extent_item + inline refs |
| EXTENT_OWNER_REF | (bytenr, 172, root) | (empty) |
| TREE_BLOCK_REF | (bytenr, 176, root) | (empty) |
| EXTENT_DATA_REF | (bytenr, 178, hash) | extent_data_ref (28 bytes) |
| SHARED_BLOCK_REF | (bytenr, 182, parent) | (empty) |
| SHARED_DATA_REF | (bytenr, 184, parent) | shared_data_ref (4 bytes) |
| BLOCK_GROUP_ITEM | (logical, 192, length) | block_group_item (24 bytes) |
All bytenr values are logical byte addresses. The extent tree provides
the complete picture of space allocation and ownership across the entire
filesystem.
Btrfs Transaction Infrastructure: On-Disk Format Specification
This document is the sole reference for implementing the btrfs-transaction
crate. It describes the on-disk format, invariants, and protocols needed to
safely modify a btrfs filesystem from userspace.
Tree block layout
A btrfs filesystem stores its metadata in a B-tree. Each tree block (also
called a node or extent buffer) is nodesize bytes (typically 16,384, but
can be 4,096 to 65,536). Tree blocks are identified by their logical byte
address (bytenr), which is translated to a physical device offset via the
chunk tree.
Every tree block begins with a 101-byte header, followed by either leaf items (level 0) or internal node key pointers (level > 0).
Header (101 bytes)
All multi-byte integers are little-endian on disk.
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 32 | csum | Checksum of bytes 32..nodesize (header fields after csum + all payload). Algorithm determined by superblock csum_type. Zero-padded: for CRC32C only bytes 0..3 are meaningful. |
| 32 | 16 | fsid | Filesystem UUID. Must match superblock fsid (or metadata_uuid if METADATA_UUID incompat flag is set). |
| 48 | 8 | bytenr | Logical byte address of this block. Must match the address used to read/write it. |
| 56 | 8 | flags | Bits 0..55: header flags (currently unused by userspace). Bits 56..63: backref revision (1 = mixed backrefs, the modern format). |
| 64 | 16 | chunk_tree_uuid | UUID of the chunk tree that maps this block’s logical address to physical. Typically the same for all blocks on a single-device fs. |
| 80 | 8 | generation | Transaction generation when this block was last written. Critical for COW: a block with generation == current transaction has already been COWed and can be modified in place. |
| 88 | 8 | owner | Tree ID that owns this block (e.g. 1 for root tree, 2 for extent tree, 5 for default fs tree). Used for backref accounting. |
| 96 | 4 | nritems | Number of items (leaf) or key pointers (node). |
| 100 | 1 | level | B-tree level. 0 = leaf, 1..7 = internal node. Maximum level is 7 (BTRFS_MAX_LEVEL = 8 levels total, 0-indexed). |
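The header table above can be decoded with a few fixed-offset reads. The following is a minimal sketch, not the crate's actual API: the names Header, parse_header, bytes and backref_rev are hypothetical, and the only validation performed is the buffer length and the level bound.

```rust
/// Size of the tree block header in bytes.
const HEADER_SIZE: usize = 101;

/// Decoded tree block header; field names follow the table above.
#[derive(Debug, PartialEq)]
struct Header {
    csum: [u8; 32],
    fsid: [u8; 16],
    bytenr: u64,
    flags: u64,
    chunk_tree_uuid: [u8; 16],
    generation: u64,
    owner: u64,
    nritems: u32,
    level: u8,
}

/// Copy `N` bytes starting at `off` into a fixed-size array.
fn bytes<const N: usize>(buf: &[u8], off: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&buf[off..off + N]);
    out
}

/// Hypothetical helper: returns None if the buffer is shorter than the
/// header or the level exceeds 7.
fn parse_header(block: &[u8]) -> Option<Header> {
    if block.len() < HEADER_SIZE || block[100] > 7 {
        return None;
    }
    Some(Header {
        csum: bytes(block, 0),
        fsid: bytes(block, 32),
        bytenr: u64::from_le_bytes(bytes(block, 48)),
        flags: u64::from_le_bytes(bytes(block, 56)),
        chunk_tree_uuid: bytes(block, 64),
        generation: u64::from_le_bytes(bytes(block, 80)),
        owner: u64::from_le_bytes(bytes(block, 88)),
        nritems: u32::from_le_bytes(bytes(block, 96)),
        level: block[100],
    })
}

impl Header {
    /// Backref revision stored in bits 56..63 of `flags`.
    fn backref_rev(&self) -> u8 {
        (self.flags >> 56) as u8
    }
}
```

A real reader would additionally verify the checksum, fsid and bytenr as described in the table.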
Key (17 bytes)
Every item and pointer in the B-tree is identified by a three-part key.
On disk this is the btrfs_disk_key (little-endian):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | objectid | Primary identifier (inode number, tree ID, extent bytenr, etc. depending on key type). |
| 8 | 1 | type | Key type discriminator (see section 7). |
| 9 | 8 | offset | Type-specific secondary value (file offset, extent size, parent ID, etc.). |
Keys are compared as a tuple (objectid, type, offset) in that order, all
as unsigned integers. This defines the sort order within every B-tree.
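As an illustration of the comparison rule, the sketch below decodes a key into a struct and lets a derived Ord do the tuple comparison. The Key type and its methods are hypothetical names; the point is that keys must be compared after decoding, because a byte-wise comparison of the little-endian encoding gives a different order.

```rust
/// In-memory form of a btrfs key. Deriving `Ord` compares fields in
/// declaration order, which matches the (objectid, type, offset) tuple order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    objectid: u64,
    key_type: u8,
    offset: u64,
}

impl Key {
    /// Decode the 17-byte on-disk form (little-endian).
    fn from_bytes(raw: &[u8; 17]) -> Key {
        let mut objectid = [0u8; 8];
        let mut offset = [0u8; 8];
        objectid.copy_from_slice(&raw[0..8]);
        offset.copy_from_slice(&raw[9..17]);
        Key {
            objectid: u64::from_le_bytes(objectid),
            key_type: raw[8],
            offset: u64::from_le_bytes(offset),
        }
    }

    /// Encode into the 17-byte on-disk form.
    fn to_bytes(&self) -> [u8; 17] {
        let mut raw = [0u8; 17];
        raw[0..8].copy_from_slice(&self.objectid.to_le_bytes());
        raw[8] = self.key_type;
        raw[9..17].copy_from_slice(&self.offset.to_le_bytes());
        raw
    }
}
```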
Leaf layout (level 0)
A leaf contains item descriptors that grow forward from the header, and item data payloads that grow backward from the end of the block. Free space is the gap between them.
Byte 0..100: Header
Byte 101..101+nritems*25-1: Item descriptors [item0, item1, ..., itemN-1]
(25 bytes each, sorted by key ascending)
...free space...
Byte X..nodesize-1: Item data [dataN-1, ..., data1, data0]
(packed from the end of the block backward)
Each item descriptor is 25 bytes:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 17 | key | The item’s key (btrfs_disk_key). |
| 17 | 4 | offset | Byte offset of this item’s data payload, relative to the start of the data area (byte 101). To get the absolute position in the block: absolute = 101 + offset. |
| 21 | 4 | size | Size of the item’s data payload in bytes. |
Invariants:
- Items are sorted by key in ascending order.
- Item data regions must not overlap.
- The last item’s data starts at 101 + item[N-1].offset and extends for item[N-1].size bytes. Items with lower indices have data at higher offsets (data grows backward).
- The first item’s data ends at 101 + item[0].offset + item[0].size, which must be <= nodesize.
- Free space = (101 + item[N-1].offset) - (101 + nritems * 25). When this is < 25 + data_size for a new item, the leaf is full.
Data offset convention:
The offset field in btrfs_item counts from byte 101 (immediately after
the header), not from the start of the block. When constructing a new leaf:
- Start data_end at nodesize.
- For each item (in key order): data_end -= data.len(), write data at data_end, store offset = data_end - 101 in the item descriptor.
- Item descriptors are written at 101 + i * 25.
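The construction steps and the free-space invariant can be sketched as follows. This is an illustrative model that only computes offsets and does not write any bytes; ItemSlot, layout_leaf and leaf_free_space are hypothetical names.

```rust
const HEADER_SIZE: u32 = 101;
const ITEM_SIZE: u32 = 25;

/// The (offset, size) pair stored in an item descriptor.
/// `offset` is relative to byte 101, as described above.
#[derive(Debug, PartialEq)]
struct ItemSlot {
    offset: u32,
    size: u32,
}

/// Lay out payloads of the given sizes (already in key order).
/// Returns None if descriptors and payloads do not fit in one block.
fn layout_leaf(nodesize: u32, sizes: &[u32]) -> Option<Vec<ItemSlot>> {
    let descriptors_end = HEADER_SIZE + sizes.len() as u32 * ITEM_SIZE;
    let mut data_end = nodesize;
    let mut slots = Vec::with_capacity(sizes.len());
    for &size in sizes {
        // Payloads are packed backward from the end of the block.
        data_end = data_end.checked_sub(size)?;
        if data_end < descriptors_end {
            return None;
        }
        slots.push(ItemSlot { offset: data_end - HEADER_SIZE, size });
    }
    Some(slots)
}

/// Gap between the end of the descriptor array and the lowest payload.
fn leaf_free_space(nodesize: u32, slots: &[ItemSlot]) -> u32 {
    let data_start = match slots.last() {
        Some(last) => HEADER_SIZE + last.offset,
        None => nodesize,
    };
    data_start - (HEADER_SIZE + slots.len() as u32 * ITEM_SIZE)
}
```

For a 16 KiB leaf holding a 160-byte and a 12-byte payload, the first payload ends exactly at nodesize and the second sits immediately below it.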
Internal node layout (level > 0)
An internal node contains key pointers that identify child subtrees.
Byte 0..100: Header
Byte 101..101+nritems*33-1: Key pointers [ptr0, ptr1, ..., ptrN-1]
(33 bytes each, sorted by key ascending)
Each key pointer is 33 bytes:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 17 | key | Lowest key in the child subtree. |
| 17 | 8 | blockptr | Logical byte address of the child block. |
| 25 | 8 | generation | Generation of the child block (used for consistency checking during reads). |
Invariants:
- Key pointers are sorted by key in ascending order.
- blockptr must be a valid, allocated logical address.
- generation must match the generation in the child block’s header.
Maximum capacities
For a given nodesize:
- Leaf items per block: depends on item data size. The theoretical maximum number of zero-size items is (nodesize - 101) / 25 = 651 for 16 KiB.
- Key pointers per node: (nodesize - 101) / 33 = 493 for 16 KiB.
- Maximum tree depth: 8 levels (BTRFS_MAX_LEVEL). In practice, trees rarely exceed 3-4 levels.
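The capacity formulas are plain integer divisions. A small sketch, with hypothetical function names, that reproduces the 16 KiB figures and can be evaluated for other node sizes:

```rust
const HEADER_SIZE: u32 = 101;
const ITEM_SIZE: u32 = 25;
const KEY_PTR_SIZE: u32 = 33;

/// Upper bound on items in a leaf (reached only with zero-size payloads).
fn max_leaf_items(nodesize: u32) -> u32 {
    (nodesize - HEADER_SIZE) / ITEM_SIZE
}

/// Number of key pointers that fit in an internal node.
fn max_key_ptrs(nodesize: u32) -> u32 {
    (nodesize - HEADER_SIZE) / KEY_PTR_SIZE
}
```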
Superblock
The superblock is the entry point for reading a btrfs filesystem. It is a 4,096-byte structure stored at fixed offsets on every device:
- Mirror 0: byte 65,536 (64 KiB)
- Mirror 1: byte 67,108,864 (64 MiB)
- Mirror 2: byte 274,877,906,944 (256 GiB), only if device is large enough
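A sketch of selecting the mirrors that exist on a given device. The function name is hypothetical, and "large enough" is interpreted here as the whole 4,096-byte copy fitting within the device, which is an assumption rather than something stated above.

```rust
const SUPER_SIZE: u64 = 4096;

/// Fixed superblock mirror offsets: 64 KiB, 64 MiB, 256 GiB.
const SUPER_MIRROR_OFFSETS: [u64; 3] = [
    64 * 1024,
    64 * 1024 * 1024,
    256 * 1024 * 1024 * 1024,
];

/// Offsets of the superblock copies that fit on a device of the given size.
fn superblock_offsets(device_size: u64) -> Vec<u64> {
    SUPER_MIRROR_OFFSETS
        .iter()
        .copied()
        .filter(|off| *off + SUPER_SIZE <= device_size)
        .collect()
}
```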
Superblock layout (4,096 bytes)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 32 | csum | Checksum of bytes 32..4095. |
| 32 | 16 | fsid | Filesystem UUID. |
| 48 | 8 | bytenr | Physical byte offset of this copy. |
| 56 | 8 | flags | BTRFS_SUPER_FLAG_* bits. |
| 64 | 8 | magic | 0x4D5F53665248425F when read as a little-endian integer; the bytes on disk are the ASCII string “_BHRfS_M”. |
| 72 | 8 | generation | Current transaction generation. |
| 80 | 8 | root | Logical bytenr of root tree root block. |
| 88 | 8 | chunk_root | Logical bytenr of chunk tree root block. |
| 96 | 8 | log_root | Logical bytenr of log tree root (0 if none). |
| 104 | 8 | __unused_log_root_transid | Deprecated, always 0. |
| 112 | 8 | total_bytes | Total usable bytes across all devices. |
| 120 | 8 | bytes_used | Total bytes allocated to extents. |
| 128 | 8 | root_dir_objectid | Always 6 (BTRFS_ROOT_TREE_DIR_OBJECTID). |
| 136 | 8 | num_devices | Number of devices. |
| 144 | 4 | sectorsize | Minimum I/O unit (typically 4096). |
| 148 | 4 | nodesize | Tree block size (typically 16384). |
| 152 | 4 | __unused_leafsize | Legacy, always equal to nodesize. |
| 156 | 4 | stripesize | RAID stripe unit (typically 65536). |
| 160 | 4 | sys_chunk_array_size | Valid bytes in the sys_chunk_array field. |
| 164 | 8 | chunk_root_generation | Generation of the chunk tree root. |
| 172 | 8 | compat_flags | Compatible feature flags. |
| 180 | 8 | compat_ro_flags | Read-only compatible feature flags. |
| 188 | 8 | incompat_flags | Incompatible feature flags. |
| 196 | 2 | csum_type | Checksum algorithm (0=CRC32C, 1=xxhash, 2=SHA256, 3=BLAKE2). |
| 198 | 1 | root_level | B-tree level of root tree root. |
| 199 | 1 | chunk_root_level | B-tree level of chunk tree root. |
| 200 | 1 | log_root_level | B-tree level of log tree root. |
| 201 | 98 | dev_item | Embedded device item for this device (see section 6.4). |
| 299 | 256 | label | NUL-terminated filesystem label. |
| 555 | 8 | cache_generation | Free space cache v1 generation. |
| 563 | 8 | uuid_tree_generation | UUID tree last-updated generation. |
| 571 | 16 | metadata_uuid | Metadata UUID (if METADATA_UUID flag set). |
| 587 | 8 | nr_global_roots | Global root count (extent-tree-v2, rare). |
| 595 | 8 | remap_root | Remap tree bytenr. |
| 603 | 8 | remap_root_generation | Remap tree generation. |
| 611 | 1 | remap_root_level | Remap tree level. |
| 612 | 199 | reserved | Zero-filled. |
| 811 | 2048 | sys_chunk_array | Bootstrap chunk tree entries (key + chunk item pairs, packed sequentially). |
| 2859 | 672 | super_roots | 4 rotating backup root entries (168 bytes each). See section 2.3. |
| 3531 | 565 | padding | Zero-filled to 4096. |
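A minimal sketch of reading a handful of the fields above from a 4,096-byte buffer, checking the magic first. SuperSummary, parse_super and bytes are hypothetical names; checksum verification is deliberately omitted.

```rust
const SUPER_SIZE: usize = 4096;

/// The ASCII bytes "_BHRfS_M" interpreted as a little-endian u64.
const BTRFS_MAGIC: u64 = 0x4D5F_5366_5248_425F;

/// A few superblock fields, named after the layout table.
#[derive(Debug)]
struct SuperSummary {
    generation: u64,
    root: u64,
    chunk_root: u64,
    sectorsize: u32,
    nodesize: u32,
    csum_type: u16,
    root_level: u8,
}

/// Copy `N` bytes starting at `off` into a fixed-size array.
fn bytes<const N: usize>(buf: &[u8], off: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&buf[off..off + N]);
    out
}

/// Hypothetical helper: validates length and magic only.
fn parse_super(buf: &[u8]) -> Result<SuperSummary, &'static str> {
    if buf.len() < SUPER_SIZE {
        return Err("buffer shorter than a superblock");
    }
    if u64::from_le_bytes(bytes(buf, 64)) != BTRFS_MAGIC {
        return Err("bad magic");
    }
    Ok(SuperSummary {
        generation: u64::from_le_bytes(bytes(buf, 72)),
        root: u64::from_le_bytes(bytes(buf, 80)),
        chunk_root: u64::from_le_bytes(bytes(buf, 88)),
        sectorsize: u32::from_le_bytes(bytes(buf, 144)),
        nodesize: u32::from_le_bytes(bytes(buf, 148)),
        csum_type: u16::from_le_bytes(bytes(buf, 196)),
        root_level: buf[198],
    })
}
```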
Fields updated on every transaction commit
When committing a transaction, the following superblock fields are updated:
- generation: incremented by 1.
- root: logical bytenr of the (possibly new) root tree root block.
- root_level: level of the root tree root.
- chunk_root: logical bytenr of the chunk tree root (if chunk tree was modified).
- chunk_root_generation: generation of the chunk tree root.
- chunk_root_level: level of the chunk tree root.
- bytes_used: updated to reflect allocations/frees.
- log_root: set to 0 after log replay, or updated if log is active.
- super_roots: one of the 4 backup root slots is written (rotating).
- csum: recomputed last, covering bytes 32..4095.
The commit writes the superblock to all mirrors. The superblock write is the atomic commit point: if power is lost before the superblock is written, the previous generation’s state is intact because COW ensures old blocks are never overwritten (see section 3).
Backup roots (168 bytes each, 4 entries)
The superblock contains 4 rotating backup root entries. On each commit, one slot is overwritten (cycling 0 → 1 → 2 → 3 → 0 → …). These are used for recovery when the primary root pointers are corrupt.
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | tree_root | Root tree root bytenr. |
| 8 | 8 | tree_root_gen | Root tree generation. |
| 16 | 8 | chunk_root | Chunk tree root bytenr. |
| 24 | 8 | chunk_root_gen | Chunk tree generation. |
| 32 | 8 | extent_root | Extent tree root bytenr. |
| 40 | 8 | extent_root_gen | Extent tree generation. |
| 48 | 8 | fs_root | Default FS tree root bytenr. |
| 56 | 8 | fs_root_gen | FS tree generation. |
| 64 | 8 | dev_root | Device tree root bytenr. |
| 72 | 8 | dev_root_gen | Device tree generation. |
| 80 | 8 | csum_root | Checksum tree root bytenr. |
| 88 | 8 | csum_root_gen | Checksum tree generation. |
| 96 | 8 | total_bytes | Total filesystem bytes at this point. |
| 104 | 8 | bytes_used | Bytes used at this point. |
| 112 | 8 | num_devices | Device count at this point. |
| 120 | 32 | unused | Reserved (zero). |
| 152 | 1 | tree_root_level | Root tree level. |
| 153 | 1 | chunk_root_level | Chunk tree level. |
| 154 | 1 | extent_root_level | Extent tree level. |
| 155 | 1 | fs_root_level | FS tree level. |
| 156 | 1 | dev_root_level | Device tree level. |
| 157 | 1 | csum_root_level | Checksum tree level. |
| 158 | 10 | padding | Padding to 168 bytes. |
Superblock flags
| Bit | Name | Description |
|---|---|---|
| 2 | BTRFS_SUPER_FLAG_ERROR | Filesystem has errors. |
| 32 | BTRFS_SUPER_FLAG_SEEDING | Seed device (read-only base for cloning). |
| 33 | BTRFS_SUPER_FLAG_METADUMP | Metadump image. |
| 34 | BTRFS_SUPER_FLAG_METADUMP_V2 | Metadump v2 image. |
| 35 | BTRFS_SUPER_FLAG_CHANGING_FSID | FSID rewrite in progress. |
| 36 | BTRFS_SUPER_FLAG_CHANGING_FSID_V2 | FSID rewrite v2 in progress. |
| 38 | BTRFS_SUPER_FLAG_CHANGING_BG_TREE | Block group tree migration. |
| 39 | BTRFS_SUPER_FLAG_CHANGING_DATA_CSUM | Data csum algorithm change. |
| 40 | BTRFS_SUPER_FLAG_CHANGING_META_CSUM | Metadata csum algorithm change. |
Feature flags
Incompatible (incompat_flags):
| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | MIXED_BACKREF | 0x1 | Modern backreference format. |
| 1 | DEFAULT_SUBVOL | 0x2 | A default subvolume other than the top-level one has been set. |
| 2 | MIXED_GROUPS | 0x4 | Mixed data+metadata block groups. |
| 3 | COMPRESS_LZO | 0x8 | LZO compression used. |
| 4 | COMPRESS_ZSTD | 0x10 | ZSTD compression used. |
| 5 | BIG_METADATA | 0x20 | Metadata blocks > 4 KiB (always set with modern mkfs for nodesize > 4096). |
| 6 | EXTENDED_IREF | 0x40 | Extended inode references (INODE_EXTREF). |
| 7 | RAID56 | 0x80 | RAID5/RAID6 profiles in use. |
| 8 | SKINNY_METADATA | 0x100 | Skinny metadata extent refs (see 5.1). |
| 9 | NO_HOLES | 0x200 | No explicit hole extent items. |
| 10 | METADATA_UUID | 0x400 | metadata_uuid field is in use. |
| 11 | RAID1C34 | 0x800 | RAID1C3 or RAID1C4 profiles in use. |
| 12 | ZONED | 0x1000 | Zoned block device support. |
| 13 | EXTENT_TREE_V2 | 0x2000 | Extent tree v2 (experimental). |
| 14 | RAID_STRIPE_TREE | 0x4000 | RAID stripe tree. |
| 16 | SIMPLE_QUOTA | 0x10000 | Simple quota accounting. |
Read-only compatible (compat_ro_flags):
| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | FREE_SPACE_TREE | 0x1 | Free space tree present. |
| 1 | FREE_SPACE_TREE_VALID | 0x2 | Free space tree is valid/consistent. |
| 2 | VERITY | 0x4 | fs-verity enabled files present. |
| 3 | BLOCK_GROUP_TREE | 0x8 | Separate block group tree. |
Default features for modern mkfs:
- incompat_flags: MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | NO_HOLES = 0x361
- compat_ro_flags: FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | BLOCK_GROUP_TREE = 0xB
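The default masks can be derived from the bit positions in the tables above. The sketch below also shows one way a tool might detect feature bits it does not understand; unsupported_bits is a hypothetical helper, not part of any existing API.

```rust
// Incompat bits used by the modern defaults (positions from the table above).
const MIXED_BACKREF: u64 = 1 << 0;
const BIG_METADATA: u64 = 1 << 5;
const EXTENDED_IREF: u64 = 1 << 6;
const SKINNY_METADATA: u64 = 1 << 8;
const NO_HOLES: u64 = 1 << 9;

// Read-only compatible bits used by the modern defaults.
const FREE_SPACE_TREE: u64 = 1 << 0;
const FREE_SPACE_TREE_VALID: u64 = 1 << 1;
const BLOCK_GROUP_TREE: u64 = 1 << 3;

const DEFAULT_INCOMPAT: u64 =
    MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | NO_HOLES;
const DEFAULT_COMPAT_RO: u64 =
    FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | BLOCK_GROUP_TREE;

/// Bits set on disk that the caller does not support.
/// A non-zero result for incompat_flags means the filesystem must not be opened.
fn unsupported_bits(on_disk: u64, supported: u64) -> u64 {
    on_disk & !supported
}
```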
System chunk array
The sys_chunk_array (2,048 bytes at offset 811) contains bootstrap chunk
entries needed to read the chunk tree itself. Format: packed sequence of
(btrfs_disk_key, btrfs_chunk) pairs. The sys_chunk_array_size field says
how many bytes are valid. Parsing: read key (17 bytes), then chunk header
(48 bytes) + stripes (num_stripes * 32 bytes), repeat until consumed.
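The parsing loop described above might look like the following sketch. SysChunk and parse_sys_chunk_array are hypothetical names; only the fields needed for bootstrap address translation are kept, and the key's objectid and type are not validated.

```rust
const KEY_SIZE: usize = 17;
const CHUNK_HEADER_SIZE: usize = 48;
const STRIPE_SIZE: usize = 32;

/// One bootstrap chunk entry from the superblock.
#[derive(Debug, PartialEq)]
struct SysChunk {
    logical: u64,
    length: u64,
    chunk_type: u64,
    /// (devid, physical offset) for each stripe.
    stripes: Vec<(u64, u64)>,
}

fn le_u64(buf: &[u8], off: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(raw)
}

/// Walk the packed (key, chunk) pairs. Returns None on truncated input.
fn parse_sys_chunk_array(array: &[u8], valid_bytes: usize) -> Option<Vec<SysChunk>> {
    let data = array.get(..valid_bytes)?;
    let mut pos = 0;
    let mut chunks = Vec::new();
    while pos < data.len() {
        let key = data.get(pos..pos + KEY_SIZE)?;
        pos += KEY_SIZE;
        let header = data.get(pos..pos + CHUNK_HEADER_SIZE)?;
        pos += CHUNK_HEADER_SIZE;
        let num_stripes = u16::from_le_bytes([header[44], header[45]]) as usize;
        let mut stripes = Vec::with_capacity(num_stripes);
        for _ in 0..num_stripes {
            let stripe = data.get(pos..pos + STRIPE_SIZE)?;
            pos += STRIPE_SIZE;
            stripes.push((le_u64(stripe, 0), le_u64(stripe, 8)));
        }
        chunks.push(SysChunk {
            // The key offset of a CHUNK_ITEM is the chunk's logical start.
            logical: le_u64(key, 9),
            length: le_u64(header, 0),
            chunk_type: le_u64(header, 24),
            stripes,
        });
    }
    Some(chunks)
}
```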
Copy-on-write (COW) protocol
Btrfs never modifies tree blocks in place (except when a block was already allocated in the current transaction). This is the fundamental mechanism that provides crash consistency.
COW a tree block
When a transaction needs to modify a tree block:
1. Check generation. If block.generation == current_transaction_generation, the block was already COWed in this transaction. Modify it in place.
2. Allocate a new block. Find free space in an appropriate metadata block group and allocate nodesize bytes at a new logical address.
3. Copy. Copy the entire block contents to the new address.
4. Update parent pointer. In the parent node, change the blockptr for the relevant slot to the new address, and set generation to the current transaction generation.
5. Update the new block’s header. Set bytenr to the new logical address, generation to the current transaction generation.
6. Queue old block for freeing. The old block’s extent reference is decremented. If its refcount reaches 0, the space is freed (but only after the transaction commits, to maintain crash consistency).
7. COW cascades upward. If the parent was not yet COWed, it must be COWed first (step 1 check), then updated. This cascades up to the root.
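The steps above can be sketched with a toy in-memory model. Everything here is hypothetical: blocks live in a HashMap instead of on disk, allocation is a bump pointer rather than a block group search, and a block carries only its key pointers. The path is COWed from the root downward, so each parent is already writable when its slot is updated.

```rust
use std::collections::HashMap;

const NODESIZE: u64 = 16384;

#[derive(Clone, Debug)]
struct Block {
    bytenr: u64,
    generation: u64,
    /// (blockptr, generation) pairs; empty for a leaf in this toy model.
    ptrs: Vec<(u64, u64)>,
}

struct Txn {
    generation: u64,
    blocks: HashMap<u64, Block>,
    next_free: u64,
    /// Old blocks whose reference is dropped once the transaction commits.
    pending_free: Vec<u64>,
}

impl Txn {
    /// Returns the address at which the block may be modified.
    fn cow(&mut self, bytenr: u64) -> u64 {
        let mut copy = {
            let old = &self.blocks[&bytenr];
            if old.generation == self.generation {
                return bytenr; // already COWed in this transaction
            }
            old.clone()
        };
        let new_bytenr = self.next_free; // toy bump allocator
        self.next_free += NODESIZE;
        copy.bytenr = new_bytenr;
        copy.generation = self.generation;
        self.blocks.insert(new_bytenr, copy);
        self.pending_free.push(bytenr);
        new_bytenr
    }

    /// COW the path from `root` down through `slots` (one slot per level).
    /// Returns (new root address, new leaf address).
    fn cow_path(&mut self, root: u64, slots: &[usize]) -> (u64, u64) {
        let new_root = self.cow(root);
        let mut parent = new_root;
        for &slot in slots {
            let child = self.blocks[&parent].ptrs[slot].0;
            let new_child = self.cow(child);
            let generation = self.generation;
            self.blocks.get_mut(&parent).unwrap().ptrs[slot] = (new_child, generation);
            parent = new_child;
        }
        (new_root, parent)
    }
}
```

The old root and leaf remain in the map untouched, which is what allows the previous generation to stay readable until the superblock is rewritten.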
COW and the root pointer
The root of each tree is stored in a root_item in the root tree (tree ID 1).
The root tree’s own root pointer is stored in the superblock (root field).
When COW reaches the root of a non-root tree:
- Update the root_item’s bytenr and level fields in the root tree.
- This modification to the root tree triggers COW of the root tree itself.
When COW reaches the root tree’s root:
- The new root block address is written to the superblock’s root field at commit time.
COW and the chunk tree
The chunk tree root is special: its pointer lives directly in the superblock
(chunk_root field), not in the root tree. If the chunk tree is modified,
its new root address updates chunk_root at commit time.
Crash consistency
The commit point is the superblock write. Before the superblock is updated:
- All new tree blocks have been written to new locations.
- All old tree blocks are still intact at their original locations.
- The old superblock still points to the old root tree root, which points to the old state of all trees.
If power is lost before the superblock write completes, the filesystem reverts to the previous generation. No fsck needed.
Transaction lifecycle
A transaction groups multiple tree modifications into a single atomic commit.
Start
- Read the current superblock generation G.
- Set the new transaction generation to G + 1.
- Track all blocks modified during this transaction (the “dirty set”).
Modify
All tree modifications (insert, delete, update items) go through COW:
- search_slot descends the tree, COWing each block along the path.
- Item operations modify the COWed leaf.
- Reference counts are updated for allocated and freed extents.
Commit
1. Flush pending reference updates. Process all queued extent reference changes (delayed refs, see section 5.3). This may modify the extent tree, which may COW more blocks and generate more ref updates. Repeat until stable (no more pending updates).
2. Update root items. For every tree whose root block changed, update its root_item in the root tree (fields: bytenr, generation, level). This may COW the root tree.
3. Write dirty blocks. Write all blocks in the dirty set to disk with correct checksums. Each block’s checksum covers bytes 32..nodesize.
4. Prepare superblock. Update the superblock fields listed in section 2.2. Write one backup root entry (rotating through slots 0-3). Recompute the superblock checksum.
5. Write superblock. Write the superblock to all mirrors. Issue fsync to ensure durability.
Abort
Discard all dirty blocks. Do not write the superblock. The filesystem remains at the previous generation.
Extent tree and reference counting
The extent tree (tree ID 2) tracks which logical address ranges are allocated and who references them. Every allocated extent (both data and metadata) has an entry in the extent tree.
Extent items
There are two key types for extent records:
EXTENT_ITEM (type 168): Used for data extents and (on older filesystems
without SKINNY_METADATA) for tree blocks.
- Key: (logical_bytenr, EXTENT_ITEM=168, size_in_bytes)
- Data: extent_item header (24 bytes), optionally tree_block_info (18 bytes), then inline backreferences.
METADATA_ITEM (type 169): Used for tree blocks when SKINNY_METADATA
incompat flag is set. This is the modern default.
- Key: (logical_bytenr, METADATA_ITEM=169, tree_level)
- Data: extent_item header (24 bytes), then inline backreferences. No tree_block_info (the level is in the key offset, and the first key is not stored).
Extent item header (24 bytes):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | refs | Total reference count for this extent. |
| 8 | 8 | generation | Transaction generation when allocated. |
| 16 | 8 | flags | EXTENT_FLAG_DATA (bit 0) for data extents, EXTENT_FLAG_TREE_BLOCK (bit 1) for metadata. BLOCK_FLAG_FULL_BACKREF (bit 8) indicates full backrefs (shared block refs use parent bytenr instead of root ID). |
Tree block info (18 bytes, only for non-skinny EXTENT_ITEM with TREE_BLOCK flag):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 17 | key | First key in the tree block (btrfs_disk_key). |
| 17 | 1 | level | Level of the tree block. |
Backreferences
Backreferences record who uses an extent. They come in two forms: inline (packed inside the extent item’s data) and standalone (separate items in the extent tree).
Inline backreferences follow the extent item header (and tree_block_info if present). Each inline ref has a 1-byte type followed by an 8-byte offset, then type-specific data:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 1 | type | One of the backref type codes below. |
| 1 | 8 | offset | Type-dependent (see below). |
The backref types:
| Type code | Name | Offset meaning | Extra data | Total inline size |
|---|---|---|---|---|
| 176 | TREE_BLOCK_REF | Root tree ID | (none) | 9 bytes |
| 182 | SHARED_BLOCK_REF | Parent block bytenr | (none) | 9 bytes |
| 178 | EXTENT_DATA_REF | (overlaid by the struct, see below) | 28 bytes | 29 bytes |
| 184 | SHARED_DATA_REF | Parent block bytenr | 4-byte count | 13 bytes |
| 172 | EXTENT_OWNER_REF | Root tree ID | (none) | 9 bytes |
TREE_BLOCK_REF (type 176): A tree block is referenced by a specific
tree (identified by root ID). The offset field IS the root objectid.
No additional data. Each such ref contributes 1 to the extent’s refcount.
SHARED_BLOCK_REF (type 182): A tree block is referenced by another tree
block (identified by its bytenr) rather than by root ID. This happens
during snapshots. The offset field IS the parent block’s bytenr. Each
such ref contributes 1 to the extent’s refcount.
EXTENT_DATA_REF (type 178): A data extent is referenced by a file. The
inline form packs the following 28 bytes immediately after the type byte.
There is no separate 8-byte offset in this case: the slot that the generic
header uses for offset is occupied by root, the first field of this struct,
so the inline ref is 1 + 28 = 29 bytes long. Parse carefully:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | root | Root tree ID containing the referencing inode. |
| 8 | 8 | objectid | Inode number. |
| 16 | 8 | offset | File offset where this extent is referenced. |
| 24 | 4 | count | Number of references (typically 1, >1 for reflinked files). |
Each EXTENT_DATA_REF contributes count to the extent’s refcount.
SHARED_DATA_REF (type 184): A data extent is referenced through a shared
tree block (snapshot). The offset field is the parent block bytenr.
Additional 4 bytes:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | count | Reference count from this parent. |
Each SHARED_DATA_REF contributes count to the extent’s refcount.
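A sketch of walking the inline backreference area using the per-type sizes described above (9 bytes for the block refs and the owner ref, 13 for SHARED_DATA_REF, and 29 for EXTENT_DATA_REF, whose struct begins right after the type byte). InlineRef and parse_inline_refs are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum InlineRef {
    TreeBlock { root: u64 },
    SharedBlock { parent: u64 },
    ExtentData { root: u64, objectid: u64, offset: u64, count: u32 },
    SharedData { parent: u64, count: u32 },
    ExtentOwner { root: u64 },
}

fn le_u64(buf: &[u8], off: usize) -> Option<u64> {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(buf.get(off..off + 8)?);
    Some(u64::from_le_bytes(raw))
}

fn le_u32(buf: &[u8], off: usize) -> Option<u32> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(buf.get(off..off + 4)?);
    Some(u32::from_le_bytes(raw))
}

/// Parse the bytes that follow the extent item header (and tree_block_info,
/// if present). Returns None on an unknown type or truncated data.
fn parse_inline_refs(mut buf: &[u8]) -> Option<Vec<InlineRef>> {
    let mut refs = Vec::new();
    while let Some(&ty) = buf.first() {
        let (parsed, len) = match ty {
            176 => (InlineRef::TreeBlock { root: le_u64(buf, 1)? }, 9),
            182 => (InlineRef::SharedBlock { parent: le_u64(buf, 1)? }, 9),
            172 => (InlineRef::ExtentOwner { root: le_u64(buf, 1)? }, 9),
            184 => (
                InlineRef::SharedData { parent: le_u64(buf, 1)?, count: le_u32(buf, 9)? },
                13,
            ),
            // The 28-byte struct starts at byte 1; `root` sits in the offset slot.
            178 => (
                InlineRef::ExtentData {
                    root: le_u64(buf, 1)?,
                    objectid: le_u64(buf, 9)?,
                    offset: le_u64(buf, 17)?,
                    count: le_u32(buf, 25)?,
                },
                29,
            ),
            _ => return None,
        };
        refs.push(parsed);
        buf = &buf[len..];
    }
    Some(refs)
}
```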
Standalone backreferences: When inline refs don’t fit in the extent item (rare, happens with many references), they overflow to standalone items:
- TREE_BLOCK_REF_KEY (176): key (extent_bytenr, 176, root_id), no data.
- SHARED_BLOCK_REF_KEY (182): key (extent_bytenr, 182, parent_bytenr), no data.
- EXTENT_DATA_REF_KEY (178): key (extent_bytenr, 178, hash), 28-byte btrfs_extent_data_ref data. The hash is computed as:

      high_crc = crc32c(seed=0xFFFFFFFF, root.to_le_bytes())
      low_crc = crc32c(seed=0xFFFFFFFF, objectid.to_le_bytes())
      low_crc = crc32c(seed=low_crc, offset.to_le_bytes())
      hash = ((high_crc as u64) << 31) ^ (low_crc as u64)

  Note: these are raw CRC32C (no final inversion), not the standard ISO 3309 form.
- SHARED_DATA_REF_KEY (184): key (extent_bytenr, 184, parent_bytenr), 4-byte count.
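The EXTENT_DATA_REF hash can be written out in full as follows. This sketch uses a slow bit-at-a-time CRC32C so that it has no dependencies; a real implementation would more likely use a table-driven or hardware-accelerated routine. The function names are hypothetical.

```rust
/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78) with a
/// caller-supplied seed and no final inversion.
fn crc32c_raw(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    crc
}

/// Key offset for a standalone EXTENT_DATA_REF item.
fn hash_extent_data_ref(root: u64, objectid: u64, offset: u64) -> u64 {
    let high_crc = crc32c_raw(0xFFFF_FFFF, &root.to_le_bytes());
    let mut low_crc = crc32c_raw(0xFFFF_FFFF, &objectid.to_le_bytes());
    low_crc = crc32c_raw(low_crc, &offset.to_le_bytes());
    ((high_crc as u64) << 31) ^ (low_crc as u64)
}
```

Inverting the raw result for the input "123456789" yields the usual CRC-32C check value 0xE3069283, which is a convenient way to confirm the polynomial and bit order.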
Delayed references
Modifying a tree generates many reference count updates (every COWed block creates a new ref and removes an old ref). Processing each one immediately would cause excessive extent tree modifications. Instead, reference updates are queued and batched:
- When a block is COWed, queue: +1 ref at new_bytenr, -1 ref at old_bytenr.
- When a block is allocated for splitting, queue +1 ref.
- When blocks are freed (e.g., after merging), queue -1 ref.
At commit time, process all queued refs:
- Merge updates to the same extent (e.g., +1 and -1 cancel out).
- For each remaining update, modify the extent item in the extent tree.
- If a refcount drops to 0, delete the extent item and free the space.
- Processing delayed refs modifies the extent tree, which may generate more delayed refs (from COWing extent tree blocks). Repeat until the queue is empty. This converges because each iteration processes more refs than it creates.
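A minimal sketch of the queue-and-merge idea, assuming a toy extent table that maps bytenr to refcount. DelayedRefs and apply are hypothetical names. The feedback loop, where applying updates COWs extent tree blocks and queues further refs, is not modeled here; a real commit would call drain and apply repeatedly until drain returns nothing.

```rust
use std::collections::BTreeMap;

/// Pending reference count deltas, merged per extent as they are queued.
#[derive(Default)]
struct DelayedRefs {
    pending: BTreeMap<u64, i64>,
}

impl DelayedRefs {
    fn queue(&mut self, bytenr: u64, delta: i64) {
        *self.pending.entry(bytenr).or_insert(0) += delta;
    }

    /// COW of a block: +1 at the new address, -1 at the old one.
    fn queue_cow(&mut self, old_bytenr: u64, new_bytenr: u64) {
        self.queue(new_bytenr, 1);
        self.queue(old_bytenr, -1);
    }

    /// Take all queued updates, dropping those that cancelled out.
    fn drain(&mut self) -> Vec<(u64, i64)> {
        let pending = std::mem::replace(&mut self.pending, BTreeMap::new());
        pending.into_iter().filter(|&(_, delta)| delta != 0).collect()
    }
}

/// Apply net updates to the toy extent table; returns the extents freed.
fn apply(extents: &mut BTreeMap<u64, u64>, updates: &[(u64, i64)]) -> Vec<u64> {
    let mut freed = Vec::new();
    for &(bytenr, delta) in updates {
        let current = extents.get(&bytenr).copied().unwrap_or(0) as i64;
        let updated = current + delta;
        assert!(updated >= 0, "refcount underflow");
        if updated == 0 {
            extents.remove(&bytenr);
            freed.push(bytenr);
        } else {
            extents.insert(bytenr, updated as u64);
        }
    }
    freed
}
```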
Refcount invariant
The refs field in an extent item must always equal the sum of all its
backreferences:
- Each TREE_BLOCK_REF or SHARED_BLOCK_REF contributes 1.
- Each EXTENT_DATA_REF contributes its count field.
- Each SHARED_DATA_REF contributes its count field.
If refs reaches 0, the extent is freed.
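The invariant translates directly into a checker. A sketch with hypothetical names (Backref, expected_refs, refs_consistent), covering only the four backref kinds listed above:

```rust
/// The backreference kinds that contribute to an extent's refcount.
enum Backref {
    TreeBlock,
    SharedBlock,
    ExtentData { count: u32 },
    SharedData { count: u32 },
}

/// Sum of the contributions of all backrefs (inline and standalone).
fn expected_refs(backrefs: &[Backref]) -> u64 {
    backrefs
        .iter()
        .map(|backref| match backref {
            Backref::TreeBlock | Backref::SharedBlock => 1,
            Backref::ExtentData { count } | Backref::SharedData { count } => *count as u64,
        })
        .sum()
}

/// True if the extent item's `refs` field matches its backrefs.
fn refs_consistent(refs: u64, backrefs: &[Backref]) -> bool {
    refs == expected_refs(backrefs)
}
```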
Block groups, chunks, and device extents
Btrfs organizes disk space into three layers: block groups (logical allocation regions), chunks (logical-to-physical mapping), and device extents (physical device reservations).
Block group item (24 bytes)
Stored in the extent tree (or block group tree if BLOCK_GROUP_TREE
compat_ro flag is set).
Key: (logical_offset, BLOCK_GROUP_ITEM=192, length)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | used | Bytes currently allocated within this group. |
| 8 | 8 | chunk_objectid | Always 256 (BTRFS_FIRST_CHUNK_TREE_OBJECTID). |
| 16 | 8 | flags | Type + RAID profile (see 6.5). |
Block groups are the allocation units: when allocating an extent, the allocator finds a block group of the right type (DATA, METADATA, or SYSTEM) with enough free space.
Chunk item (48 + num_stripes * 32 bytes)
Stored in the chunk tree (tree ID 3).
Key: (256, CHUNK_ITEM=228, logical_offset)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | length | Logical size of this chunk. |
| 8 | 8 | owner | Owner tree (always 2, extent tree). |
| 16 | 8 | stripe_len | Stripe unit for RAID (typically 65536). |
| 24 | 8 | type | Flags: same as block group flags. |
| 32 | 4 | io_align | I/O alignment (typically 65536 for non-system, sectorsize for system chunks). |
| 36 | 4 | io_width | I/O width (same as io_align). |
| 40 | 4 | sector_size | Device sector size (typically 4096). |
| 44 | 2 | num_stripes | Number of stripes. |
| 46 | 2 | sub_stripes | Sub-stripes for RAID10 (0 otherwise). |
| 48+ | 32*N | stripes | Array of stripe descriptors. |
Each stripe (32 bytes):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | devid | Device ID. |
| 8 | 8 | offset | Physical byte offset on the device. |
| 16 | 16 | dev_uuid | Device UUID. |
Chunk-to-physical resolution: For a logical address L within a chunk
starting at chunk_start with a single stripe at device offset phys:
physical = phys + (L - chunk_start). RAID profiles use more complex
mapping.
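For the single-stripe case, the resolution is a range lookup plus an offset. A sketch assuming an in-memory list of chunks (the Chunk struct and resolve function are hypothetical); striped and mirrored profiles are not handled.

```rust
/// A chunk with exactly one stripe (SINGLE profile).
struct Chunk {
    start: u64,
    length: u64,
    devid: u64,
    phys: u64,
}

/// Translate a logical address to (devid, physical offset).
/// Returns None if no chunk covers the address.
fn resolve(chunks: &[Chunk], logical: u64) -> Option<(u64, u64)> {
    chunks
        .iter()
        .find(|c| logical >= c.start && logical - c.start < c.length)
        .map(|c| (c.devid, c.phys + (logical - c.start)))
}
```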
Device extent (48 bytes)
Stored in the device tree (tree ID 4).
Key: (devid, DEV_EXTENT=204, physical_offset)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | chunk_tree | Always 3 (BTRFS_CHUNK_TREE_OBJECTID). |
| 8 | 8 | chunk_objectid | Always 256. |
| 16 | 8 | chunk_offset | Logical offset of the owning chunk. |
| 24 | 8 | length | Length of this device extent. |
| 32 | 16 | chunk_tree_uuid | Chunk tree UUID. |
For each stripe in a chunk, there is one device extent on the corresponding device.
Device item (98 bytes)
Stored in the chunk tree (and embedded in the superblock for the local device).
Key: (1, DEV_ITEM=216, devid) (objectid 1 = BTRFS_DEV_ITEMS_OBJECTID)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | devid | Device ID (1, 2, 3, …). |
| 8 | 8 | total_bytes | Total device size. |
| 16 | 8 | bytes_used | Bytes allocated to chunks on this device. |
| 24 | 4 | io_align | I/O alignment. |
| 28 | 4 | io_width | I/O width. |
| 32 | 4 | sector_size | Sector size. |
| 36 | 8 | type | Reserved (0). |
| 44 | 8 | generation | Last transaction touching this device. |
| 52 | 8 | start_offset | Start offset for new allocations. |
| 60 | 4 | dev_group | Reserved (0). |
| 64 | 1 | seek_speed | Hint (0 = unset). |
| 65 | 1 | bandwidth | Hint (0 = unset). |
| 66 | 16 | uuid | Device UUID. |
| 82 | 16 | fsid | Filesystem UUID. |
Block group type flags
| Bit | Name | Hex | Description |
|---|---|---|---|
| 0 | DATA | 0x1 | Data extents. |
| 1 | SYSTEM | 0x2 | System (chunk tree) metadata. |
| 2 | METADATA | 0x4 | Metadata extents. |
| 3 | RAID0 | 0x8 | Striped. |
| 4 | RAID1 | 0x10 | Mirrored (2 copies). |
| 5 | DUP | 0x20 | Duplicated on same device. |
| 6 | RAID10 | 0x40 | Striped + mirrored. |
| 7 | RAID5 | 0x80 | RAID5. |
| 8 | RAID6 | 0x100 | RAID6. |
| 9 | RAID1C3 | 0x200 | Mirrored (3 copies). |
| 10 | RAID1C4 | 0x400 | Mirrored (4 copies). |
A block group’s flags combine one type (DATA, SYSTEM, or METADATA, or
DATA | METADATA together on filesystems with MIXED_GROUPS) with zero or one
RAID profile. If no RAID profile bit is set, the block group is SINGLE (no
replication). The virtual SINGLE bit (bit 48, 0x1000000000000) is used in
some display contexts only.
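A sketch of splitting a flags value into its type bits and profile name using the bit assignments in the table above. The decode function and PROFILES table are hypothetical; a value with no type bit or with more than one profile bit is rejected.

```rust
/// DATA | SYSTEM | METADATA.
const TYPE_MASK: u64 = 0x7;

/// Profile bits 3..10 with their names.
static PROFILES: [(u64, &str); 8] = [
    (0x8, "RAID0"),
    (0x10, "RAID1"),
    (0x20, "DUP"),
    (0x40, "RAID10"),
    (0x80, "RAID5"),
    (0x100, "RAID6"),
    (0x200, "RAID1C3"),
    (0x400, "RAID1C4"),
];

/// Returns (type bits, profile name), or None if the flags are malformed.
fn decode(flags: u64) -> Option<(u64, &'static str)> {
    let type_bits = flags & TYPE_MASK;
    if type_bits == 0 {
        return None;
    }
    let mut set = PROFILES.iter().filter(|p| (flags & p.0) != 0);
    match (set.next(), set.next()) {
        // No profile bit set: SINGLE.
        (None, _) => Some((type_bits, "SINGLE")),
        (Some(&(_, name)), None) => Some((type_bits, name)),
        // More than one profile bit is invalid.
        _ => None,
    }
}
```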
Relationships between structures
For each allocated region of logical space:
- A block group item in the extent tree defines the logical range and tracks usage.
- A chunk item in the chunk tree maps the same logical range to one or more physical stripes.
- For each stripe, a device extent in the device tree reserves the physical space on that device.
- The device item in the chunk tree tracks total and used bytes per device.
All four must be consistent. When allocating a new block group (rare in rescue operations), all four structures must be created atomically within one transaction.
Tree types and key reference
Tree IDs
| ID | Name | Stored in |
|---|---|---|
| 1 | Root tree | Superblock (root field) |
| 2 | Extent tree | Root tree (ROOT_ITEM objectid=2) |
| 3 | Chunk tree | Superblock (chunk_root field) |
| 4 | Device tree | Root tree (ROOT_ITEM objectid=4) |
| 5 | Default FS tree | Root tree (ROOT_ITEM objectid=5) |
| 6 | Root tree directory | (virtual, in root tree) |
| 7 | Checksum tree | Root tree (ROOT_ITEM objectid=7) |
| 8 | Quota tree | Root tree (ROOT_ITEM objectid=8) |
| 9 | UUID tree | Root tree (ROOT_ITEM objectid=9) |
| 10 | Free space tree | Root tree (ROOT_ITEM objectid=10) |
| 11 | Block group tree | Root tree (ROOT_ITEM objectid=11) |
| 12 | RAID stripe tree | Root tree (ROOT_ITEM objectid=12) |
| 256+ | User subvolume/snapshot trees | Root tree (ROOT_ITEM objectid=N) |
The root tree is the master index. It contains a ROOT_ITEM for every other tree (except itself and the chunk tree, whose roots are in the superblock).
Root item (439 bytes used, padded to 496 bytes)
Stored in root tree with key (tree_id, ROOT_ITEM=132, 0).
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 160 | inode | Embedded btrfs_inode_item (see 7.3). |
| 160 | 8 | generation | Transaction generation of this root. |
| 168 | 8 | root_dirid | Root directory objectid (typically 256). |
| 176 | 8 | bytenr | Logical bytenr of this tree’s root block. |
| 184 | 8 | byte_limit | Deprecated (0). |
| 192 | 8 | bytes_used | Total bytes used by this tree’s extents. |
| 200 | 8 | last_snapshot | Generation of last snapshot of this tree. |
| 208 | 8 | flags | Root flags (bit 0 = read-only subvolume). |
| 216 | 4 | refs | Reference count. |
| 220 | 17 | drop_progress | Key tracking in-progress drop operation. |
| 237 | 1 | drop_level | Level of drop progress. |
| 238 | 1 | level | Current B-tree height of this root. |
| 239 | 8 | generation_v2 | Same as generation (marks v2 format). |
| 247 | 16 | uuid | Subvolume UUID. |
| 263 | 16 | parent_uuid | Parent subvolume UUID (for snapshots). |
| 279 | 16 | received_uuid | Source UUID (for received subvolumes). |
| 295 | 8 | ctransid | Transaction of last inode change. |
| 303 | 8 | otransid | Transaction when this root was created. |
| 311 | 8 | stransid | Transaction when sent. |
| 319 | 8 | rtransid | Transaction when received. |
| 327 | 12 | ctime | Change time (8-byte sec + 4-byte nsec). |
| 339 | 12 | otime | Creation time. |
| 351 | 12 | stime | Send time. |
| 363 | 12 | rtime | Receive time. |
| 375 | 64 | reserved | Zero-filled. |
Fields updated when a tree’s root block changes (during commit):
- bytenr: new root block address.
- generation and generation_v2: current transaction generation.
- level: root block level.
Inode item (160 bytes)
Embedded in root items and stored standalone in FS trees.
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | generation | NFS generation. |
| 8 | 8 | transid | Last modifying transaction. |
| 16 | 8 | size | File size. |
| 24 | 8 | nbytes | Sum of EXTENT_DATA.num_bytes for regular/prealloc extents plus inline payload length. See File data extents. |
| 32 | 8 | block_group | Block group hint for allocation. |
| 40 | 4 | nlink | Hard link count. |
| 44 | 4 | uid | User ID. |
| 48 | 4 | gid | Group ID. |
| 52 | 4 | mode | File mode (permissions + type). |
| 56 | 8 | rdev | Device number (block/char devices). |
| 64 | 8 | flags | Inode flags. |
| 72 | 8 | sequence | NFS sequence number. |
| 80 | 32 | reserved | Zero-filled. |
| 112 | 12 | atime | Access time (8-byte sec + 4-byte nsec). |
| 124 | 12 | ctime | Change time. |
| 136 | 12 | mtime | Modification time. |
| 148 | 12 | otime | Creation time. |
Key type reference
All key types with their numeric values:
| Value | Name | Primary tree | Key semantics |
|---|---|---|---|
| 1 | INODE_ITEM | FS tree | (inode#, 1, 0) |
| 12 | INODE_REF | FS tree | (inode#, 12, parent_dir_inode#) |
| 13 | INODE_EXTREF | FS tree | (inode#, 13, hash) |
| 24 | XATTR_ITEM | FS tree | (inode#, 24, name_hash) |
| 36 | VERITY_DESC_ITEM | FS tree | (inode#, 36, 0) |
| 37 | VERITY_MERKLE_ITEM | FS tree | (inode#, 37, offset) |
| 48 | ORPHAN_ITEM | Root/FS tree | (objectid, 48, offset) |
| 60 | DIR_LOG_ITEM | Log tree | (dir_inode#, 60, hash) |
| 72 | DIR_LOG_INDEX | Log tree | (dir_inode#, 72, index) |
| 84 | DIR_ITEM | FS tree | (dir_inode#, 84, name_hash) |
| 96 | DIR_INDEX | FS tree | (dir_inode#, 96, index) |
| 108 | EXTENT_DATA | FS tree | (inode#, 108, file_offset) |
| 128 | EXTENT_CSUM | Csum tree | (-10, 128, logical_bytenr) |
| 132 | ROOT_ITEM | Root tree | (tree_id, 132, 0) |
| 144 | ROOT_BACKREF | Root tree | (child_id, 144, parent_id) |
| 156 | ROOT_REF | Root tree | (parent_id, 156, child_id) |
| 168 | EXTENT_ITEM | Extent tree | (bytenr, 168, size) |
| 169 | METADATA_ITEM | Extent tree | (bytenr, 169, level) |
| 172 | EXTENT_OWNER_REF | (inline only) | – |
| 176 | TREE_BLOCK_REF | Extent tree | (bytenr, 176, root_id) |
| 178 | EXTENT_DATA_REF | Extent tree | (bytenr, 178, hash) |
| 182 | SHARED_BLOCK_REF | Extent tree | (bytenr, 182, parent_bytenr) |
| 184 | SHARED_DATA_REF | Extent tree | (bytenr, 184, parent_bytenr) |
| 192 | BLOCK_GROUP_ITEM | Extent tree* | (logical, 192, length) |
| 198 | FREE_SPACE_INFO | Free space tree | (bg_start, 198, bg_length) |
| 199 | FREE_SPACE_EXTENT | Free space tree | (start, 199, length) |
| 200 | FREE_SPACE_BITMAP | Free space tree | (start, 200, length) |
| 204 | DEV_EXTENT | Device tree | (devid, 204, phys_offset) |
| 216 | DEV_ITEM | Chunk tree | (1, 216, devid) |
| 228 | CHUNK_ITEM | Chunk tree | (256, 228, logical) |
| 230 | RAID_STRIPE | Stripe tree | (logical, 230, length) |
| 240 | QGROUP_STATUS | Quota tree | (0, 240, 0) |
| 242 | QGROUP_INFO | Quota tree | (qgroupid, 242, 0) |
| 244 | QGROUP_LIMIT | Quota tree | (qgroupid, 244, 0) |
| 246 | QGROUP_RELATION | Quota tree | (qgroupid, 246, other_qgroupid) |
| 248 | TEMPORARY_ITEM | Root tree | (objectid, 248, offset) |
| 249 | PERSISTENT_ITEM | Device tree | (objectid, 249, offset) |
| 250 | DEV_REPLACE | Device tree | (objectid, 250, 0) |
*BLOCK_GROUP_ITEM lives in the extent tree by default. With the
BLOCK_GROUP_TREE compat_ro flag, it moves to tree ID 11.
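As an illustration of how these keys are laid out and ordered, here is a small sketch. It assumes the 17-byte on-disk key (8-byte objectid, 1-byte type, 8-byte offset, little-endian) and that keys compare as unsigned values in (objectid, type, offset) order. The Key type and its methods are hypothetical names for this example.

```rust
/// A btrfs key. Deriving Ord on fields in this order gives
/// (objectid, type, offset) comparison on unsigned values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Key {
    pub objectid: u64,
    pub key_type: u8,
    pub offset: u64,
}

pub const INODE_ITEM: u8 = 1;
pub const INODE_REF: u8 = 12;
pub const EXTENT_CSUM: u8 = 128;
pub const ROOT_ITEM: u8 = 132;

/// Negative objectids such as -10 are stored as unsigned 64-bit values.
pub const EXTENT_CSUM_OBJECTID: u64 = (-10i64) as u64;

impl Key {
    /// Serializes the key into its 17-byte on-disk form.
    pub fn to_bytes(&self) -> [u8; 17] {
        let mut out = [0u8; 17];
        out[0..8].copy_from_slice(&self.objectid.to_le_bytes());
        out[8] = self.key_type;
        out[9..17].copy_from_slice(&self.offset.to_le_bytes());
        out
    }

    /// Parses a key from its 17-byte on-disk form.
    pub fn from_bytes(b: &[u8; 17]) -> Key {
        let mut objectid = [0u8; 8];
        let mut offset = [0u8; 8];
        objectid.copy_from_slice(&b[0..8]);
        offset.copy_from_slice(&b[9..17]);
        Key {
            objectid: u64::from_le_bytes(objectid),
            key_type: b[8],
            offset: u64::from_le_bytes(offset),
        }
    }
}
```

Because -10 becomes a very large unsigned value, the EXTENT_CSUM keys sort after every ordinary objectid.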
Root ref and root backref (18+ bytes)
Forward and backward links between parent and child subvolumes.
ROOT_REF key: (parent_tree_id, ROOT_REF=156, child_tree_id)
ROOT_BACKREF key: (child_tree_id, ROOT_BACKREF=144, parent_tree_id)
Both use the same data format:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | dirid | Directory objectid in the parent tree that contains this subvolume. |
| 8 | 8 | sequence | Index in the directory. |
| 16 | 2 | name_len | Length of the subvolume name. |
| 18 | N | name | Subvolume name (not NUL-terminated). |
File data extents
Regular file content lives in EXTENT_DATA items in the FS tree, keyed
(inode#, EXTENT_DATA, file_offset). Each item describes a contiguous
range of the file’s logical bytes; consecutive items must cover
non-overlapping ranges. Three extent types exist:
- BTRFS_FILE_EXTENT_INLINE (0): data embedded directly in the leaf.
- BTRFS_FILE_EXTENT_REG (1): pointer to a separate data extent on disk.
- BTRFS_FILE_EXTENT_PREALLOC (2): reserved on disk but not yet written.
Common header (21 bytes)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | generation | Transid at extent creation. |
| 8 | 8 | ram_bytes | Uncompressed size of the extent’s data. |
| 16 | 1 | compression | 0=none, 1=zlib, 2=LZO, 3=zstd. |
| 17 | 1 | encryption | Always 0. |
| 18 | 2 | other_encoding | Always 0. |
| 20 | 1 | extent_type | 0=inline, 1=regular, 2=prealloc. |
Regular and prealloc body (32 bytes follow header)
| Offset | Size | Field | Description |
|---|---|---|---|
| 21 | 8 | disk_bytenr | Logical address of data extent (0 = hole). |
| 29 | 8 | disk_num_bytes | On-disk size, sectorsize-aligned. |
| 37 | 8 | offset | Byte offset into the on-disk extent (bookend; 0 for non-shared). |
| 45 | 8 | num_bytes | Logical file bytes covered by this item. |
Inline body
For inline extents the bytes after the 21-byte header are the (possibly
compressed) file data. There is no disk_bytenr, no extent-tree entry,
and no csum entry: the inline payload is covered by the FS tree leaf’s
own checksum.
For LZO inline extents the embedded bytes carry an additional framing
header: [4B total_len LE] [4B seg_len LE] [lzo1x compressed bytes],
where total_len includes the 8-byte framing header itself.
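The header and body layouts above can be decoded with a short parser. This is a sketch under the assumption that all integers are little-endian; FileExtent, ExtentBody and parse_file_extent are hypothetical names used only for this example, and the inline payload is returned as-is without decompression or LZO unframing.

```rust
/// Body of an EXTENT_DATA item, depending on extent_type.
#[derive(Debug, PartialEq)]
pub enum ExtentBody<'a> {
    /// Bytes following the 21-byte header (possibly compressed).
    Inline(&'a [u8]),
    /// Regular (prealloc == false) or preallocated (prealloc == true) extent.
    OnDisk {
        prealloc: bool,
        disk_bytenr: u64,
        disk_num_bytes: u64,
        offset: u64,
        num_bytes: u64,
    },
}

#[derive(Debug, PartialEq)]
pub struct FileExtent<'a> {
    pub generation: u64,
    pub ram_bytes: u64,
    pub compression: u8,
    pub body: ExtentBody<'a>,
}

fn le_u64(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

/// Returns None for truncated items or unknown extent types.
pub fn parse_file_extent<'a>(item: &'a [u8]) -> Option<FileExtent<'a>> {
    if item.len() < 21 {
        return None;
    }
    let body = match item[20] {
        0 => ExtentBody::Inline(&item[21..]),
        1 | 2 => {
            // 21-byte header plus the 32-byte body.
            if item.len() < 53 {
                return None;
            }
            ExtentBody::OnDisk {
                prealloc: item[20] == 2,
                disk_bytenr: le_u64(item, 21),
                disk_num_bytes: le_u64(item, 29),
                offset: le_u64(item, 37),
                num_bytes: le_u64(item, 45),
            }
        }
        _ => return None,
    };
    Some(FileExtent {
        generation: le_u64(item, 0),
        ram_bytes: le_u64(item, 8),
        compression: item[16],
        body,
    })
}
```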
Validation rules
These invariants are enforced by btrfs check and must hold for any
EXTENT_DATA written by userspace:
- Regular and prealloc extents:
  - num_bytes must be sectorsize-aligned and non-zero.
  - disk_num_bytes must also be sectorsize-aligned.
  - num_bytes + offset <= ram_bytes.
- Inline extents: total embedded payload (compressed or not) must fit in a leaf, capped at min(nodesize - 147, sectorsize - 1) bytes on a default filesystem. The 147 = HEADER_SIZE (101) + ITEM_SIZE (25) + 21-byte file-extent header. The sectorsize - 1 cap is btrfs’s rule that sector-or-larger files must use a regular extent.
- INODE.nbytes: sum of num_bytes for every regular/prealloc extent (where disk_bytenr > 0) plus the inline payload length for any inline extent. For non-compressed extents num_bytes is the sector-aligned logical size, NOT the on-disk byte count. For compressed extents num_bytes is still the sector-aligned logical size; the smaller disk_num_bytes is not what gets summed.
- INODE.size: the file’s logical size in bytes. May be smaller than the sum of num_bytes (the unwritten tail in the last extent reads as zero up to size).
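The rules for regular and prealloc extents and for INODE.nbytes can be expressed as two small checks. This is only a sketch of the invariants as stated above; RegularExtent, validate_regular and inode_nbytes are hypothetical names and do not reflect how btrfs check is implemented.

```rust
/// Fields of a regular or prealloc EXTENT_DATA item relevant to validation.
pub struct RegularExtent {
    pub disk_bytenr: u64,
    pub disk_num_bytes: u64,
    pub offset: u64,
    pub num_bytes: u64,
    pub ram_bytes: u64,
}

/// Checks the alignment and range invariants listed above.
pub fn validate_regular(e: &RegularExtent, sectorsize: u64) -> Result<(), &'static str> {
    if e.num_bytes == 0 || e.num_bytes % sectorsize != 0 {
        return Err("num_bytes must be non-zero and sectorsize-aligned");
    }
    if e.disk_num_bytes % sectorsize != 0 {
        return Err("disk_num_bytes must be sectorsize-aligned");
    }
    match e.num_bytes.checked_add(e.offset) {
        Some(end) if end <= e.ram_bytes => Ok(()),
        _ => Err("num_bytes + offset must not exceed ram_bytes"),
    }
}

/// INODE.nbytes: num_bytes of every non-hole on-disk extent
/// plus the payload length of every inline extent.
pub fn inode_nbytes(on_disk: &[RegularExtent], inline_payload_lens: &[u64]) -> u64 {
    let disk: u64 = on_disk
        .iter()
        .filter(|e| e.disk_bytenr > 0) // holes (disk_bytenr == 0) do not count
        .map(|e| e.num_bytes)
        .sum();
    let inline: u64 = inline_payload_lens.iter().sum();
    disk + inline
}
```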
LZO regular framing
For non-inline LZO extents, the on-disk bytes use a per-sector framed format:
[4B total_len LE] { [4B seg_len LE] [lzo1x compressed bytes] [zero pad] }*
- Each input sector is compressed independently.
- seg_len is the size of that sector’s compressed segment.
- total_len is the total framed buffer size, including the 4-byte header.
- After each segment, if fewer than 4 bytes remain in the current sector (i.e. the next 4-byte length header would cross a sector boundary), zero-pad to the next sector boundary so the next segment’s length header is sector-aligned.
This per-sector independence lets the kernel decompress individual sectors without reading neighbours.
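To make the padding rule concrete, the sketch below computes the framed length for a list of already-compressed segment sizes. It does no LZO compression itself. The function name lzo_framed_len is hypothetical, and the sketch assumes that padding added after the final segment also counts towards total_len.

```rust
/// Computes total_len for the per-sector LZO framing, given the
/// compressed size of each input sector's segment.
pub fn lzo_framed_len(seg_lens: &[u32], sectorsize: u32) -> u32 {
    // Start after the 4-byte total_len header.
    let mut pos: u32 = 4;
    for &seg in seg_lens {
        // 4-byte seg_len header followed by the compressed bytes.
        pos += 4 + seg;
        let used = pos % sectorsize;
        // If fewer than 4 bytes remain in this sector, the next length
        // header would straddle the boundary: pad with zeros instead.
        if used != 0 && sectorsize - used < 4 {
            pos += sectorsize - used;
        }
    }
    pos
}
```

For example, with 4096-byte sectors a single 4085-byte segment ends at byte 4093, leaving 3 bytes in the sector, so the buffer is padded to 4096.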
Holes
With the NO_HOLES incompat flag (default on modern filesystems), gaps
in the file_offset sequence indicate holes — no EXTENT_DATA item is
written for the unmapped range. Without NO_HOLES, hole regions are
recorded as regular extents with disk_bytenr == 0 and
disk_num_bytes == 0.
Checksum computation
Tree block checksums
The checksum field (bytes 0..31 of the header) covers bytes 32..nodesize.
For CRC32C (type 0), the checksum is 4 bytes stored at offset 0, with
bytes 4..31 zero-padded.
Computation: standard CRC32C (Castagnoli polynomial, initial seed 0xFFFFFFFF,
final XOR with 0xFFFFFFFF) over the data region bytes 32..nodesize.
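A minimal sketch of this computation follows. It uses a slow bitwise CRC32C so that it has no dependencies; a real implementation would use a table-driven or hardware-accelerated routine. The function names are hypothetical, and the sketch assumes the 4-byte result is stored little-endian at offset 0.

```rust
/// Size of the checksum field at the start of every tree block.
pub const CSUM_SIZE: usize = 32;

/// CRC32C register update with the reflected Castagnoli polynomial,
/// without the final XOR.
pub fn crc32c_raw(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if (crc & 1) != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    crc
}

/// Standard CRC32C: seed 0xFFFFFFFF and final XOR with 0xFFFFFFFF.
pub fn crc32c(data: &[u8]) -> u32 {
    !crc32c_raw(0xFFFF_FFFF, data)
}

/// Computes the CRC32C of bytes 32..len and stores it in the checksum
/// field, zero-filling the unused 28 bytes.
pub fn csum_tree_block_crc32c(block: &mut [u8]) {
    assert!(block.len() > CSUM_SIZE);
    let csum = crc32c(&block[CSUM_SIZE..]);
    for b in &mut block[..CSUM_SIZE] {
        *b = 0;
    }
    block[..4].copy_from_slice(&csum.to_le_bytes());
}
```

The well-known CRC32C check value for the ASCII string "123456789" is 0xE3069283, which is a convenient sanity test for any replacement routine.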
Superblock checksums
Same as tree block checksums: bytes 0..31 are the checksum field, covering bytes 32..4095.
Data checksums (csum tree)
Data checksums are stored in the csum tree (tree ID 7) with key
(EXTENT_CSUM_OBJECTID=-10, EXTENT_CSUM=128, logical_bytenr).
The item data is a packed array of checksums, one per sector. For
CRC32C, each checksum is 4 bytes. The number of sectors covered is
item_size / csum_size_for_type. Sectors are consecutive starting at
the key’s offset (logical_bytenr).
Computation: standard CRC32C (the same algorithm as for tree
blocks; initial seed 0xFFFFFFFF, final XOR with 0xFFFFFFFF). The
csum input is the on-disk bytes of the data extent — for compressed
extents, that is the compressed+sector-padded payload, NOT the
uncompressed original.
Note this is distinct from the raw_crc32c (no final invert) used by
EXTENT_DATA_REF_KEY hashes and by the send-stream protocol. On-disk
csum-tree entries always use the standard variant.
A single csum item may cover multiple consecutive sectors. The
practical upper bound for a single item’s payload is roughly
leaf_data_size - 2 * item_header_size - csum_size bytes, leaving
room for a future split. Adjacent items at sector-contiguous logical
addresses may be merged into one larger item, but btrfs check
accepts either layout.
Inline extents have no csum entries — the data lives in the leaf and is covered by the leaf’s own header checksum.
NODATASUM extents (inode flag BTRFS_INODE_NODATASUM) skip csum
computation entirely. btrfs check rejects csum entries for
NODATASUM extents, and rejects missing csum entries for non-NODATASUM
regular extents.
Extent data ref hash
The hash used in EXTENT_DATA_REF_KEY’s offset field:
high_crc = raw_crc32c(seed=0xFFFFFFFF, root.to_le_bytes())
low_crc = raw_crc32c(seed=0xFFFFFFFF, objectid.to_le_bytes())
low_crc = raw_crc32c(seed=low_crc, offset.to_le_bytes())
hash = (high_crc as u64) << 31 ^ (low_crc as u64)
Here raw_crc32c means NO final XOR — the raw CRC register value. This
can be recovered from the standard API: raw = !standard_crc32c(data)
when seed is !0, or equivalently raw = crc32c_with_seed(!0, data) if
the API exposes the seed.
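The pseudocode above translates to Rust as follows. This is a self-contained sketch with a bitwise raw CRC32C (no final XOR); the function names are hypothetical and a production implementation would use an optimized CRC routine.

```rust
/// CRC32C register update (reflected Castagnoli polynomial) with an
/// explicit seed and NO final XOR.
pub fn crc32c_raw(seed: u32, data: &[u8]) -> u32 {
    let mut crc = seed;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if (crc & 1) != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    crc
}

/// Hash stored in the offset field of an EXTENT_DATA_REF key.
pub fn hash_extent_data_ref(root: u64, objectid: u64, offset: u64) -> u64 {
    let high = crc32c_raw(0xFFFF_FFFF, &root.to_le_bytes());
    // The low half chains objectid and offset through one CRC register.
    let low = crc32c_raw(0xFFFF_FFFF, &objectid.to_le_bytes());
    let low = crc32c_raw(low, &offset.to_le_bytes());
    // The shift is 31, not 32, so the two halves overlap by one bit.
    ((high as u64) << 31) ^ (low as u64)
}
```

Since the raw value is the bitwise complement of the standard CRC32C when the seed is all ones, the raw CRC of "123456789" is the complement of 0xE3069283.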
B-tree operations
This section describes the algorithms for searching, inserting, and deleting items in a btrfs B-tree. These are standard B-tree algorithms adapted for the btrfs leaf/node layout and COW model.
Binary search within a block
Given a block and a target key, find the slot:
In a leaf: Binary search over items[0..nritems-1] comparing keys. If found, return (true, slot). If not found, return (false, slot) where slot is the insertion point (the index of the first item with key > target).
In a node: Binary search over ptrs[0..nritems-1] comparing keys. The
result is the slot of the child subtree that could contain the target key.
Specifically, find the largest slot where ptrs[slot].key <= target. If
the target is less than all keys, use slot 0.
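Both lookups can be built on the standard library's slice binary search. The sketch below represents keys as (objectid, type, offset) tuples, whose lexicographic ordering matches the key comparison order; search_leaf and search_node are hypothetical names, and the key slice stands in for the keys extracted from a block.

```rust
/// (objectid, type, offset); tuples compare field by field.
pub type Key = (u64, u8, u64);

/// Leaf lookup: (true, slot) on an exact match, otherwise
/// (false, insertion point).
pub fn search_leaf(keys: &[Key], target: &Key) -> (bool, usize) {
    match keys.binary_search(target) {
        Ok(slot) => (true, slot),
        Err(slot) => (false, slot),
    }
}

/// Node lookup: the largest slot whose key is <= target,
/// or slot 0 if the target is smaller than every key.
pub fn search_node(keys: &[Key], target: &Key) -> usize {
    match keys.binary_search(target) {
        Ok(slot) => slot,
        Err(0) => 0,
        Err(slot) => slot - 1,
    }
}
```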
Search (search_slot)
search_slot(trans, root, key, path, ins_len, cow) descends from the root
to a leaf:
1. Start at the root block (level = root_level).
2. If cow != 0 and the block hasn’t been COWed in this transaction, COW it.
3. Binary search for the key within the block.
4. Store (block, slot) in path.nodes[level] and path.slots[level].
5. If level > 0: read the child at ptrs[slot].blockptr, go to step 2 with the child.
6. If level == 0: done. If the key was found, path.slots[0] points to it. If not found, path.slots[0] is the insertion point.
When ins_len > 0 (insert operation), the search checks whether the target
leaf has enough free space. If not, it triggers a leaf split before
returning.
Item insertion
Given a search path pointing to the insertion slot in a leaf:
- If the leaf has enough free space (>= 25 + data_size):
  a. Shift items at slots [insert_slot..nritems-1] right by 25 bytes (one item descriptor).
  b. Shift all data belonging to items at [insert_slot..nritems-1] left by data_size bytes (making room at the end of the data area).
  c. Update the offset field of shifted items (subtract data_size from each).
  d. Write the new item descriptor at the insert slot.
  e. Write the new item data.
  f. Increment nritems.
- If the leaf is full: split the leaf (see Leaf split), then insert.
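The shifting steps for the case where the leaf has room can be sketched directly on a raw leaf buffer. This example assumes the layout described elsewhere in this document (101-byte header with nritems at bytes 96..100, 25-byte descriptors holding a 17-byte key, a 4-byte data offset relative to the end of the header, and a 4-byte size) and assumes item payloads are tightly packed. insert_item and the helpers are hypothetical names; COW, splitting, and header checksums are out of scope.

```rust
const HEADER: usize = 101;
const ITEM: usize = 25;

fn rd32(buf: &[u8], at: usize) -> usize {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b) as usize
}

fn wr32(buf: &mut [u8], at: usize, v: usize) {
    buf[at..at + 4].copy_from_slice(&(v as u32).to_le_bytes());
}

fn item_off(buf: &[u8], i: usize) -> usize {
    rd32(buf, HEADER + i * ITEM + 17)
}

fn item_size(buf: &[u8], i: usize) -> usize {
    rd32(buf, HEADER + i * ITEM + 21)
}

/// Inserts (key, data) at `slot`, following steps a-f above.
pub fn insert_item(leaf: &mut [u8], slot: usize, key: &[u8; 17], data: &[u8]) -> Result<(), &'static str> {
    let n = rd32(leaf, 96);
    if slot > n {
        return Err("slot out of range");
    }
    let area = leaf.len() - HEADER; // data offsets are relative to the header end
    let data_start = if n == 0 { area } else { item_off(leaf, n - 1) };
    if data_start - n * ITEM < ITEM + data.len() {
        return Err("leaf full");
    }
    // The new payload sits directly below the previous item's payload.
    let upper = if slot == 0 { area } else { item_off(leaf, slot - 1) };
    // (b) move the payloads of items slot..n left by data.len()
    leaf.copy_within(HEADER + data_start..HEADER + upper, HEADER + data_start - data.len());
    // (c) adjust the offsets of the shifted items
    for i in slot..n {
        let off = item_off(leaf, i);
        wr32(leaf, HEADER + i * ITEM + 17, off - data.len());
    }
    // (a) shift descriptors slot..n right by one descriptor
    let d = HEADER + slot * ITEM;
    leaf.copy_within(d..HEADER + n * ITEM, d + ITEM);
    // (d) write the new descriptor, (e) the data, (f) bump nritems
    leaf[d..d + 17].copy_from_slice(key);
    wr32(leaf, d + 17, upper - data.len());
    wr32(leaf, d + 21, data.len());
    leaf[HEADER + upper - data.len()..HEADER + upper].copy_from_slice(data);
    wr32(leaf, 96, n + 1);
    Ok(())
}
```

The sketch performs the descriptor shift after the data shift; the order of steps a to c does not matter as long as offsets are read before they are overwritten.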
Item deletion
Given a search path pointing to items to delete (slot, count):
- If deleting items in the middle: shift items at [slot+count..nritems-1] left by count * 25 bytes.
- Shift data: move data belonging to remaining items to fill the gap left by deleted items’ data. Update offset fields accordingly.
- Decrement nritems by count.
- If the leaf becomes empty: remove the key pointer from the parent node and free the leaf block. If the parent also becomes empty (or has only one child), rebalance upward.
Leaf split
When a leaf is too full for an insertion:
- Allocate a new leaf block.
- Find the split point: aim for roughly half the data in each leaf. The split point should be at an item boundary (never split an item).
- Copy items [split..nritems-1] and their data to the new leaf.
- Update the original leaf’s nritems.
- Insert a new key pointer in the parent node pointing to the new leaf. The key is the first key of the new leaf.
- If the parent node is full, split the parent (see Node split).
Node split
When an internal node is too full for a new key pointer:
- Allocate a new node at the same level.
- Move roughly half the key pointers to the new node.
- Insert a new key pointer in the parent (one level up) for the new node. The key is the first key of the new node.
- If the parent is also full, split it recursively.
- If the root node splits, create a new root one level higher containing two key pointers (to the old and new nodes). Update the tree’s root pointer. The tree grows taller by one level.
Rebalancing (optional optimization)
Before splitting, try to redistribute items to a neighboring sibling:
- Push left: If the left sibling has free space, move items from the start of the full leaf to the end of the left sibling. Update the parent’s key for the full leaf.
- Push right: If the right sibling has free space, move items from the end of the full leaf to the start of the right sibling. Update the parent’s key for the right sibling.
This reduces tree height growth. It’s an optimization, not required for correctness. The same applies to nodes (push key pointers to siblings).
After deletion, if a leaf or node is less than ~25% full, consider merging with a sibling. This is also optional for correctness but prevents excessive tree bloat.
Path advancement
next_leaf(path): advance from the current leaf to the next one.
- Walk up the path until finding a level where slot < nritems - 1.
- Increment that slot.
- Walk back down, always taking slot 0, until reaching a leaf.
- Update the path at each level.
prev_leaf(path): similar but in reverse (walk up until slot > 0,
decrement, walk down taking the last slot at each level).
Free space management
To allocate extents, the transaction crate needs to know which logical addresses are free within each block group.
Extent tree scanning
The simplest approach: walk the extent tree within a block group’s logical
range. Allocated extents appear as EXTENT_ITEM/METADATA_ITEM entries in
ascending address order, and the gaps between them are free space. This is
O(n) in the number of extents but works without additional infrastructure.
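The gap computation itself is a single pass over the sorted extents. The sketch below takes (start, length) pairs that a caller would have collected from the extent tree; free_ranges is a hypothetical name. Note that for METADATA_ITEM keys the key offset holds the level, so the caller is assumed to substitute nodesize as the length.

```rust
/// Returns the free (start, length) ranges inside a block group, given
/// its allocated extents sorted by start address.
pub fn free_ranges(bg_start: u64, bg_len: u64, allocated: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let bg_end = bg_start + bg_len;
    let mut free = Vec::new();
    let mut cursor = bg_start;
    for &(start, len) in allocated {
        // Anything between the cursor and the next extent is a gap.
        if start > cursor {
            free.push((cursor, start - cursor));
        }
        cursor = cursor.max(start + len);
    }
    // Trailing free space up to the end of the block group.
    if cursor < bg_end {
        free.push((cursor, bg_end - cursor));
    }
    free
}
```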
Free space tree (optional optimization)
If the FREE_SPACE_TREE compat_ro flag is set, the free space tree (tree ID
10) provides pre-computed free space information per block group.
For each block group, there is a FREE_SPACE_INFO item:
Key: (block_group_start, FREE_SPACE_INFO=198, block_group_length)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | extent_count | Number of free extents. |
| 4 | 4 | flags | Bit 0: USING_BITMAPS (bitmap mode). |
If not using bitmaps, free extents are stored as:
Key: (start, FREE_SPACE_EXTENT=199, length) — no item data.
If using bitmaps:
Key: (start, FREE_SPACE_BITMAP=200, length) — item data is a bitmap
where each bit represents one sector (1 = free).
The free space tree must be kept in sync with the extent tree during transactions. When allocating or freeing extents, update both.
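When a block group is in bitmap mode, a reader has to turn runs of set bits back into free ranges. The sketch below shows one way to do that. It assumes bit i lives in byte i / 8 at bit position i % 8 (least significant bit first), and that the number of bits equals the key's length divided by the sector size; bitmap_to_free_extents is a hypothetical name.

```rust
/// Converts a FREE_SPACE_BITMAP payload into (start, length) free ranges.
/// `start` is the key's objectid and `nbits` is key.offset / sectorsize.
pub fn bitmap_to_free_extents(start: u64, bitmap: &[u8], nbits: usize, sectorsize: u64) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut run_start: Option<usize> = None;
    // Iterate one past the end so a run touching the last bit is closed.
    for bit in 0..=nbits {
        let set = bit < nbits && ((bitmap[bit / 8] >> (bit % 8)) & 1) == 1;
        match (set, run_start) {
            (true, None) => run_start = Some(bit),
            (false, Some(s)) => {
                out.push((start + s as u64 * sectorsize, (bit - s) as u64 * sectorsize));
                run_start = None;
            }
            _ => {}
        }
    }
    out
}
```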
Allocation strategy
For metadata blocks:
- Find a block group with type METADATA (or SYSTEM for chunk tree blocks).
- Find a free region >= nodesize.
- Prefer the block group hinted by the tree’s root item or the most recently used block group.
For data extents:
- Find a block group with type DATA.
- Find a free region >= requested size.
Rescue command requirements
This section maps each rescue command to the specific tree operations needed.
clear-uuid-tree
Delete all items from the UUID tree and remove its root item.
- Start transaction.
- Search for the first key in the UUID tree: search_slot(uuid_root, min_key).
- Delete items in batches (walk forward, delete, repeat until tree empty).
- Delete the ROOT_ITEM for tree ID 9 from the root tree.
- Free all tree blocks that belonged to the UUID tree (decrement refs).
- Set uuid_tree_generation = 0 in the superblock (tells the kernel to rebuild the UUID tree on next mount).
- Commit transaction.
clear-ino-cache
Remove leftover inode cache items (from the deprecated v1 inode cache).
- Start transaction.
- For each FS tree (tree IDs 5, 256+): search for INODE_ITEM with objectid = BTRFS_FREE_INO_OBJECTID (-12). Delete the inode item and all associated EXTENT_DATA items.
- Free any data extents referenced by the deleted extent data items.
- Commit transaction.
clear-space-cache
Two modes: v1 (free space inode cache) and v2 (free space tree).
v1: Similar to clear-ino-cache — delete free space cache inodes
(objectid = BTRFS_FREE_SPACE_OBJECTID = -11) from each block group.
v2: Delete the entire free space tree (tree ID 10) like clear-uuid-tree.
Clear the FREE_SPACE_TREE_VALID compat_ro flag so the kernel rebuilds it
on next mount.
fix-device-size
Correct device and superblock size fields when they’re inconsistent.
- Start transaction.
- Walk the device tree to find all DEV_EXTENT items for each device.
- Sum the extent lengths to get the true bytes_used per device.
- Update each DEV_ITEM’s total_bytes and bytes_used.
- Update the superblock’s embedded dev_item and total_bytes.
- Commit transaction.
fix-data-checksum
Verify and repair data checksums using mirror redundancy.
- Start transaction.
- Walk the csum tree (EXTENT_CSUM items).
- For each checksummed range, read data from each available mirror.
- Verify each mirror’s data against the stored checksum.
- If a checksum mismatch is found and a good mirror exists: optionally update the csum item to match the good mirror’s data (or rewrite the data from the good mirror).
- Commit transaction.
Requires: extent tree walking for backref resolution (to report which files are affected), multi-device I/O for reading mirrors.
chunk-recover
Rebuild the chunk tree by scanning device surfaces for tree blocks.
- Scan all devices for valid tree block headers (check magic, csum).
- From found tree blocks, reconstruct chunk items by cross-referencing block group items and device extents.
- Rebuild the chunk tree with the recovered mappings.
- Commit.
This is the most complex rescue operation and requires extensive device scanning infrastructure beyond basic tree operations.
mkfs.btrfs: filesystem creation process
This document describes how mkfs.btrfs creates a new btrfs filesystem,
covering both the empty filesystem case (make_btrfs) and the directory
population case (make_btrfs_with_rootdir).
Overview
mkfs.btrfs creates a filesystem by constructing B-tree nodes as raw byte
buffers and writing them directly to a block device or image file with pwrite.
No kernel ioctls or mounting are involved. The process produces a valid,
mountable btrfs filesystem.
The implementation spans several modules:
- mkfs/src/mkfs.rs – orchestration: make_btrfs and make_btrfs_with_rootdir
- mkfs/src/layout.rs – chunk layout computation and block address assignment
- mkfs/src/tree.rs – LeafBuilder and NodeBuilder for individual blocks
- mkfs/src/treebuilder.rs – TreeBuilder for multi-leaf trees
- mkfs/src/items.rs – serializers for all on-disk item types
- mkfs/src/rootdir.rs – directory walking, data writing, compression
- mkfs/src/write.rs – checksum computation and pwrite I/O
Part 1: empty filesystem creation (make_btrfs)
Step 1: validation
Before any I/O, the configuration is validated:
- sectorsize must be a power of 2 and >= 4096.
- nodesize must be a power of 2, >= sectorsize, and <= 65536.
- If the mixed-bg incompat feature is set, nodesize must equal sectorsize.
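These three rules fit in one function. The following sketch uses a hypothetical validate_sizes function and plain error strings; it is not the signature used in mkfs/src/mkfs.rs.

```rust
/// Checks the sector/node size rules listed above.
pub fn validate_sizes(sectorsize: u32, nodesize: u32, mixed_bg: bool) -> Result<(), String> {
    if !sectorsize.is_power_of_two() || sectorsize < 4096 {
        return Err(format!("invalid sectorsize {}", sectorsize));
    }
    if !nodesize.is_power_of_two() || nodesize < sectorsize || nodesize > 65536 {
        return Err(format!("invalid nodesize {}", nodesize));
    }
    if mixed_bg && nodesize != sectorsize {
        return Err("mixed-bg requires nodesize == sectorsize".to_string());
    }
    Ok(())
}
```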
Step 2: chunk layout computation
ChunkLayout::new computes the physical placement of three block groups on
disk:
System block group
- Logical offset: 1 MiB (SYSTEM_GROUP_OFFSET).
- Size: 4 MiB (SYSTEM_GROUP_SIZE).
- Physical offset: same as logical (system chunk has identity mapping on device 1).
- Profile: always SINGLE (one stripe on device 1).
- Contains: the chunk tree block.
Metadata block group
- Logical offset: 5 MiB (CHUNK_START = system offset + system size).
- Size: clamp(total_bytes / 10, 32 MiB, 256 MiB), rounded down to 64 KiB (STRIPE_LEN).
- Profile: DUP on single device (two physical stripes on device 1, sequential after the system group) or RAID1 on multi-device (one stripe per device at CHUNK_START).
- Contains: all non-chunk tree blocks (root, extent, dev, FS, csum, free-space, data-reloc, and optionally block-group tree).
Data block group
- Logical offset: metadata logical + metadata size.
- Size: clamp(total_bytes / 10, 64 MiB, 1 GiB), rounded down to STRIPE_LEN.
- Profile: SINGLE (one stripe on device 1, after the last metadata stripe).
- Contains: file data (empty for a freshly created filesystem).
The layout validates that all stripes fit on their respective devices. If they
do not, ChunkLayout::new returns None and mkfs reports “device too small”.
The minimum device size is approximately 133 MiB: 5 MiB (system) + 64 MiB (2 x 32 MiB metadata DUP) + 64 MiB (data).
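The single-device case can be worked through in code. The sketch below follows the sizes and ordering described above (system group, then two DUP metadata stripes, then data). SingleDeviceLayout and single_device_layout are hypothetical names rather than the actual ChunkLayout API, and the sketch assumes the stripes are packed with no gaps, which yields exactly 133 MiB as the minimum.

```rust
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;
const STRIPE_LEN: u64 = 64 << 10;
const SYSTEM_GROUP_OFFSET: u64 = MIB;
const SYSTEM_GROUP_SIZE: u64 = 4 * MIB;
const CHUNK_START: u64 = SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE;

#[derive(Debug, PartialEq)]
pub struct SingleDeviceLayout {
    pub meta_logical: u64,
    pub meta_size: u64,
    /// Physical offsets of the two DUP stripes.
    pub meta_physical: [u64; 2],
    pub data_logical: u64,
    pub data_size: u64,
    pub data_physical: u64,
}

/// clamp(total / 10, lo, hi), rounded down to STRIPE_LEN.
fn group_size(total: u64, lo: u64, hi: u64) -> u64 {
    let clamped = (total / 10).max(lo).min(hi);
    clamped / STRIPE_LEN * STRIPE_LEN
}

/// Returns None when the stripes do not fit ("device too small").
pub fn single_device_layout(total_bytes: u64) -> Option<SingleDeviceLayout> {
    let meta_size = group_size(total_bytes, 32 * MIB, 256 * MIB);
    let data_size = group_size(total_bytes, 64 * MIB, GIB);
    let meta_physical = [CHUNK_START, CHUNK_START + meta_size];
    let data_physical = CHUNK_START + 2 * meta_size;
    if data_physical + data_size > total_bytes {
        return None;
    }
    Some(SingleDeviceLayout {
        meta_logical: CHUNK_START,
        meta_size,
        meta_physical,
        data_logical: CHUNK_START + meta_size,
        data_size,
        data_physical,
    })
}
```

Note that logical and physical addresses diverge after the first metadata stripe: the data group starts at logical 37 MiB but physical 69 MiB on a minimum-size device, because DUP consumes twice the physical space.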
Step 3: block address assignment
BlockLayout assigns a logical address to each tree block:
- Chunk tree: at SYSTEM_GROUP_OFFSET (1 MiB), in the system chunk.
- Root, Extent, Dev, FS, Csum, FreeSpace, DataReloc trees: sequential in the metadata chunk starting at meta_logical, spaced by nodesize.
- Block-group tree (if enabled): the 8th block in the metadata chunk.
For example, with nodesize = 16384 and meta_logical = 5 MiB:
| Tree | Logical address |
|---|---|
| Chunk | 0x100000 (1 MiB) |
| Root | 0x500000 (5 MiB) |
| Extent | 0x504000 |
| Dev | 0x508000 |
| FS | 0x50C000 |
| Csum | 0x510000 |
| FreeSpace | 0x514000 |
| DataReloc | 0x518000 |
| BlockGroup | 0x51C000 (optional) |
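The table is just meta_logical plus a multiple of nodesize. A small sketch that reproduces it, using a hypothetical metadata_block_addresses function rather than the real BlockLayout type:

```rust
/// Trees placed in the metadata chunk, in assignment order.
pub const METADATA_TREES: [&str; 8] = [
    "Root", "Extent", "Dev", "FS", "Csum", "FreeSpace", "DataReloc", "BlockGroup",
];

/// Assigns one block per tree, spaced by nodesize. The block-group tree
/// takes the 8th block only when the feature is enabled.
pub fn metadata_block_addresses(
    meta_logical: u64,
    nodesize: u64,
    block_group_tree: bool,
) -> Vec<(&'static str, u64)> {
    let count = if block_group_tree { 8 } else { 7 };
    METADATA_TREES
        .iter()
        .take(count)
        .enumerate()
        .map(|(i, &name)| (name, meta_logical + i as u64 * nodesize))
        .collect()
}
```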
Step 4: tree block construction
Each tree is built as a single leaf node using LeafBuilder. Items must be
pushed in strictly ascending key order. The builder handles offset bookkeeping:
item descriptors grow forward from byte 101 (after the header), item data grows
backward from the end of the block.
Tree block format
Bytes 0-31: checksum (32 bytes, computed last)
Bytes 32-47: fsid (16 bytes)
Bytes 48-55: bytenr (logical address, 8 bytes LE)
Bytes 56-63: flags (8 bytes LE)
Bytes 64-79: chunk_tree_uuid (16 bytes)
Bytes 80-87: generation (8 bytes LE)
Bytes 88-95: owner tree objectid (8 bytes LE)
Bytes 96-99: nritems (4 bytes LE)
Byte 100: level (0 for leaf, >0 for internal node)
After the 101-byte header, item descriptors occupy 25 bytes each:
Bytes 0-16: key (objectid:8 + type:1 + offset:8)
Bytes 17-20: data_offset (relative to end of header, 4 bytes LE)
Bytes 21-24: data_size (4 bytes LE)
Item data payloads fill from the end of the block backward. The space between the last descriptor and the first data payload is unused.
Root tree contents
The root tree contains a ROOT_ITEM (key type 132) for each tree that needs
one. The root tree itself and the chunk tree are excluded (the root tree cannot
reference itself; the chunk tree is referenced by the superblock’s chunk_root
pointer, though a ROOT_ITEM is still written for the chunk tree in practice
through the ROOT_ITEM_TREES list).
Trees receiving a ROOT_ITEM: Extent, Dev, FS, Csum, FreeSpace, DataReloc, and optionally BlockGroup. Each ROOT_ITEM is 439 bytes and contains:
- An embedded btrfs_inode_item (160 bytes) for the root directory.
- Tree-specific fields: generation, root_dirid, bytenr (pointing to the tree’s block), byte_limit, bytes_used, refs, level.
The FS tree’s ROOT_ITEM gets additional initialization:
- A deterministic UUID (derived by XOR-flipping the filesystem UUID).
- BTRFS_INODE_ROOT_ITEM_INIT flag set in the embedded inode.
- inode.size = 3, inode.nbytes = nodesize.
- ctime and otime timestamps set to the creation time.
Extent tree contents
The extent tree contains one METADATA_ITEM (or EXTENT_ITEM if skinny
metadata is disabled) for each tree block, plus BLOCK_GROUP_ITEM entries
for each block group (unless the block-group tree is enabled, in which case
block group items go there instead).
Each metadata extent item consists of 24 bytes (btrfs_extent_item: refs,
generation, flags) plus a 9-byte inline TREE_BLOCK_REF (type byte + root
objectid). With skinny metadata, the key is (bytenr, METADATA_ITEM, level).
Without skinny metadata, the key is (bytenr, EXTENT_ITEM, nodesize) and an
additional 18-byte btrfs_tree_block_info is included.
Block group items (24 bytes each) are keyed as (logical_addr, BLOCK_GROUP_ITEM, chunk_size) and contain the bytes used, chunk objectid, and
profile flags.
All items are collected, sorted by key, then pushed to the leaf.
Chunk tree contents
The chunk tree contains:
- DEV_ITEM entries for each device, keyed as (DEV_ITEMS_OBJECTID, DEV_ITEM, devid). Each contains the device’s total bytes, bytes used, sector size, and UUIDs.
- CHUNK_ITEM entries for each block group:
  - System chunk: uses sectorsize for io_align/io_width (bootstrap convention). One stripe on device 1.
  - Metadata chunk: uses STRIPE_LEN (64 KiB) for io_align/io_width. Two stripes for DUP, one per device for RAID1.
  - Data chunk: uses STRIPE_LEN for io_align/io_width. One stripe for SINGLE.
Dev tree contents
The dev tree contains:
- PERSISTENT_ITEM (DEV_STATS) for each device – all five counters zeroed (40 bytes).
- DEV_EXTENT items for each physical allocation:
  - System chunk: device 1 at SYSTEM_GROUP_OFFSET.
  - Metadata stripes: one or two entries per device.
  - Data stripes: one entry per device.
Items are sorted by key (devid, DEV_EXTENT, physical_offset).
FS tree contents
Contains two items for the root directory inode (objectid 256):
- INODE_ITEM: directory mode 040755, nlink=1, nbytes=nodesize, generation=1, timestamps set to creation time.
- INODE_REF: index=0, name=.., parent_ino=256 (self-referencing for the root directory).
Csum tree
Empty leaf (no items). Populated later if files are written.
Free-space tree
If the free-space-tree feature is enabled, contains FREE_SPACE_INFO and
FREE_SPACE_EXTENT items for each block group. Each block group gets:
- One FREE_SPACE_INFO item with extent_count=1.
- One FREE_SPACE_EXTENT item covering the unused portion of the block group (from used_bytes to group_size).
If the free-space-tree feature is disabled, this is an empty leaf.
Data-reloc tree
Same structure as the FS tree: root directory inode (objectid 256) with
INODE_ITEM and INODE_REF.
Block-group tree (optional)
If the block-group-tree compat_ro feature is enabled, block group items are
placed here instead of in the extent tree. Contains three BLOCK_GROUP_ITEM
entries (system, metadata, data).
Step 5: checksum computation
After each tree block is fully constructed,
btrfs_disk::util::csum_tree_block computes the checksum of bytes
CSUM_SIZE..nodesize and writes the result into the first bytes of the block:
- CRC32C: 4 bytes (standard CRC32C via crc32c::crc32c).
- xxHash64: 8 bytes.
- SHA-256: 32 bytes.
- BLAKE2b-256: 32 bytes.
Remaining bytes in the 32-byte checksum field stay zero.
Step 6: writing to disk
Tree blocks
Each tree block is written to its physical location(s) using pwrite_all.
The logical-to-physical mapping is provided by ChunkLayout::logical_to_physical:
- System chunk blocks: one write at the logical address (identity mapping) on device 1.
- Metadata chunk blocks: one write per stripe. For DUP: two writes on device 1 at different offsets. For RAID1: one write per device.
- Data chunk blocks: one write per stripe (typically one for SINGLE).
Superblocks
The superblock is constructed with all necessary fields:
- magic: _BHRfS_M
- root: logical address of the root tree block
- chunk_root: logical address of the chunk tree block
- total_bytes: sum across all devices
- bytes_used: system used + metadata used (no data used for empty filesystem)
- sectorsize, nodesize, leafsize (= nodesize), stripesize (= sectorsize)
- num_devices: device count
- incompat_flags, compat_ro_flags: from configuration
- csum_type: checksum algorithm
- cache_generation: 0 if free-space-tree enabled, u64::MAX otherwise
- sys_chunk_array: embedded copy of the system chunk (disk_key + chunk_item bytes), enabling the kernel to bootstrap chunk mapping from the superblock alone
The sys_chunk_array is the bootstrap mechanism: it contains a serialized
disk key followed by the system chunk item data (including stripe info), stored
in a fixed 2048-byte buffer within the superblock. The kernel reads this array
first to locate the chunk tree block, then reads the chunk tree to find all
other chunks.
Each device gets its own superblock with device-specific fields (devid, dev_uuid, bytes_used for that device). The superblock is written to all valid mirror locations (up to 3):
- Mirror 0: byte offset 65536 (64 KiB) – always written.
- Mirror 1: byte offset 67108864 (64 MiB) – written if device is large enough.
- Mirror 2: byte offset 274877906944 (256 GiB) – written if device is large enough.
After all writes, fsync is called on all device files.
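Selecting which mirrors to write comes down to comparing each offset against the device size. The sketch below assumes "large enough" means that the full 4096-byte superblock fits below the end of the device; mirror_offsets is a hypothetical name.

```rust
/// Byte offsets of the three superblock copies: 64 KiB, 64 MiB, 256 GiB.
pub const SUPER_MIRROR_OFFSETS: [u64; 3] = [64 << 10, 64 << 20, 256 << 30];

/// Size of one superblock on disk.
pub const SUPER_INFO_SIZE: u64 = 4096;

/// Returns the mirror offsets at which a whole superblock fits
/// on a device of the given size.
pub fn mirror_offsets(device_size: u64) -> Vec<u64> {
    SUPER_MIRROR_OFFSETS
        .iter()
        .copied()
        .filter(|off| off + SUPER_INFO_SIZE <= device_size)
        .collect()
}
```

A 1 GiB image therefore receives two copies, and only devices larger than 256 GiB receive all three.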
Part 2: rootdir population (make_btrfs_with_rootdir)
The --rootdir flag populates the new filesystem from a source directory on the
host. This is significantly more complex than the empty filesystem case because:
- The FS tree may need multiple leaf blocks (and internal nodes).
- File data must be written to the data chunk.
- The extent tree must reference both metadata blocks and data extents.
- The csum tree must contain checksums for all data.
- The extent tree must contain entries for its own blocks, creating a circular dependency.
Step 1: directory walk (walk_directory)
The rootdir::walk_directory function performs a depth-first traversal of the
source directory, building all FS tree items and identifying files that need
data extents.
Inode assignment
Inode numbers are assigned sequentially starting at 257 (inode 256 is the root
directory, handled separately). The root directory (objectid 256) gets its
INODE_ITEM and INODE_REF added during the merge phase.
Hardlink detection
For files with nlink > 1, the function tracks (dev, ino) pairs from the
host filesystem in a HashMap. When a subsequent directory entry refers to
the same host inode:
- No new btrfs inode number is assigned; the existing one is reused.
- An
INODE_REFis added (additional reference from the new parent). - No new
INODE_ITEMis created. - The nlink counter for that btrfs inode is incremented.
After all entries are processed, fixup_inode_nlink patches the nlink field
in the INODE_ITEM for all hardlinked inodes.
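The bookkeeping described above can be sketched with two hash maps. InodeAllocator and its methods are hypothetical names for illustration and do not mirror the actual walk_directory internals; the caller is assumed to emit the INODE_ITEM only when the returned flag is true and to patch nlink afterwards.

```rust
use std::collections::HashMap;

/// Assigns btrfs inode numbers and tracks host hardlinks.
pub struct InodeAllocator {
    next_ino: u64,
    /// Host (dev, ino) -> btrfs inode, only for files with nlink > 1.
    seen: HashMap<(u64, u64), u64>,
    /// btrfs inode -> number of directory entries found so far.
    nlink: HashMap<u64, u32>,
}

impl InodeAllocator {
    pub fn new() -> Self {
        InodeAllocator {
            next_ino: 257, // 256 is the root directory
            seen: HashMap::new(),
            nlink: HashMap::new(),
        }
    }

    /// Returns (btrfs inode, is_new). For a repeated hardlink the existing
    /// inode is reused and its link counter is incremented.
    pub fn assign(&mut self, host_dev: u64, host_ino: u64, host_nlink: u64) -> (u64, bool) {
        if host_nlink > 1 {
            if let Some(&ino) = self.seen.get(&(host_dev, host_ino)) {
                *self.nlink.entry(ino).or_insert(1) += 1;
                return (ino, false);
            }
        }
        let ino = self.next_ino;
        self.next_ino += 1;
        if host_nlink > 1 {
            self.seen.insert((host_dev, host_ino), ino);
        }
        self.nlink.insert(ino, 1);
        (ino, true)
    }

    /// Link count to patch into the INODE_ITEM after the walk.
    pub fn nlink(&self, ino: u64) -> u32 {
        self.nlink.get(&ino).copied().unwrap_or(0)
    }
}
```

Counting links as they are encountered, rather than copying the host's nlink, keeps the result correct when some links to the file live outside the source directory.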
Per-entry processing
For each directory entry (file, directory, symlink, special file):
- DIR_ITEM in the parent directory, keyed by name hash (crc32c(0xFFFFFFFE, name)).
- DIR_INDEX in the parent directory, keyed by sequential index (starting at 2 for each directory).
- INODE_REF for the new inode, pointing to the parent.
- INODE_ITEM with metadata copied from the host filesystem (uid, gid, mode, timestamps, rdev for special files).
- XATTR_ITEM entries for each extended attribute on the host file (read via llistxattr/lgetxattr).
Type-specific items:
- Directories: Push children onto the DFS stack (reversed for correct order). Initialize the dir_index counter for the new directory.
- Symlinks: Create an inline FILE_EXTENT_ITEM containing the link target (never compressed).
- Regular files with size > 0:
  - If size <= max_inline_data_size: read the file, optionally compress, create an inline FILE_EXTENT_ITEM.
  - If size > max_inline_data_size: defer to the data writing phase. Record a FileAllocation with the host path, btrfs inode, size, and NODATASUM flag.
- Special files (FIFO, socket, char/block device): INODE_ITEM only, no extent.
Inline extent threshold
The maximum inline data size is min(sectorsize - 1, nodesize - 147). With
the defaults (sectorsize=4096, nodesize=16384), this is 4095 bytes. Files at
or below this threshold are stored directly in the tree leaf.
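As a worked example of the threshold, the function below evaluates the formula for a few size combinations. max_inline_data_size is named after the term used in this section, but the signature is an assumption.

```rust
/// min(sectorsize - 1, nodesize - 147), where 147 is the 101-byte block
/// header plus one 25-byte item descriptor plus the 21-byte extent header.
pub fn max_inline_data_size(sectorsize: u32, nodesize: u32) -> u32 {
    const LEAF_OVERHEAD: u32 = 101 + 25 + 21;
    (sectorsize - 1).min(nodesize - LEAF_OVERHEAD)
}
```

With the defaults the sector cap wins (4095 bytes). With nodesize equal to sectorsize at 4096, the leaf cap wins instead, giving 4096 - 147 = 3949 bytes.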
Inode flags
The --inode-flags argument allows setting NODATACOW and NODATASUM flags
per path. NODATACOW implies NODATASUM for regular files. These flags are
set in the INODE_ITEM and affect whether checksums are generated during the
data writing phase.
Directory size fixup
After the walk, fixup_inode_size patches each non-root directory’s INODE_ITEM
size field to match the sum of name_len * 2 from its DIR_INDEX entries
(the btrfs convention for directory sizes).
Inline nbytes fixup
fixup_inline_nbytes patches the nbytes field of INODE_ITEM entries for
files with inline extents. For inline extents, nbytes equals the inline data
size (the actual stored bytes, which may be compressed).
Output
walk_directory returns a RootdirPlan containing:
- fs_items: sorted list of all FS tree items (excluding root dir inode).
- file_extents: list of FileAllocation entries for files needing data extents.
- data_bytes_needed: total aligned data bytes needed in the data chunk.
- root_dir_nlink, root_dir_size: root directory metadata.
Step 2: data writing (write_file_data)
For each file in plan.file_extents, the function reads the host file in 1 MiB
chunks (MAX_EXTENT_SIZE) and writes each chunk to the data block group:
Per-extent processing
- Read up to 1 MiB of raw data from the host file.
- Optionally try compression (zlib or zstd). If the compressed output is smaller than the input, use it; otherwise store uncompressed.
- Pad the (possibly compressed) data to sectorsize alignment.
- Compute the logical disk address: data_logical + current_offset.
- Write the padded data to all physical locations for this logical address.
- Compute per-sector checksums (skipped for NODATASUM files):
  - For each sector in the padded data, compute the checksum using the configured algorithm.
  - Pack all checksums into a single EXTENT_CSUM item.
- Create a FILE_EXTENT_ITEM (regular type) in the FS tree items: disk_bytenr, disk_num_bytes (aligned compressed size), offset=0, num_bytes (logical file extent size), ram_bytes (uncompressed size), compression type.
- Create an EXTENT_ITEM with inline EXTENT_DATA_REF in the extent tree items: refs=1, generation=1, flags=DATA.
After processing all files, nbytes_updates records the total disk-allocated
bytes per inode, which are patched into the corresponding INODE_ITEM entries
via apply_nbytes_updates.
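The padding and checksum-packing steps can be sketched as follows. This is a minimal illustration, not the actual write_file_data code: the function names are hypothetical, zero-filling the padding is an assumption, and the checksum algorithm is passed in as a closure so the sketch stays independent of any particular hash crate.

```rust
/// Round `len` up to the next multiple of `sectorsize`.
fn align_up(len: usize, sectorsize: usize) -> usize {
    (len + sectorsize - 1) / sectorsize * sectorsize
}

/// Pad (possibly compressed) extent data to sector alignment.
/// Zero-filling the tail is an assumption of this sketch.
fn pad_to_sector(mut data: Vec<u8>, sectorsize: usize) -> Vec<u8> {
    let padded_len = align_up(data.len(), sectorsize);
    data.resize(padded_len, 0);
    data
}

/// Compute one checksum per sector and pack them back to back, which is the
/// payload layout of a single EXTENT_CSUM item.
fn pack_sector_csums(
    padded: &[u8],
    sectorsize: usize,
    csum: impl Fn(&[u8]) -> Vec<u8>,
) -> Vec<u8> {
    padded.chunks(sectorsize).flat_map(|sector| csum(sector)).collect()
}
```

A 5000-byte extent with 4096-byte sectors pads to 8192 bytes and yields two checksums, so with a 4-byte checksum the packed payload is 8 bytes long.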
Step 3: multi-leaf tree building (TreeBuilder)
When a tree has more items than fit in a single leaf, TreeBuilder splits
them across multiple leaves and creates internal nodes to form a valid B-tree.
Leaf packing
Items are packed into leaves sequentially:
- Start a new leaf.
- For each item, check if the leaf has space for the item descriptor (25 bytes) plus the item data. If not, finalize the current leaf and start a new one.
- Record the first key of each leaf for parent node entries.
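The packing rule above can be sketched as a function that assigns item indices to leaves. This is an illustration under stated assumptions, not the TreeBuilder code: the function name is hypothetical, and the 101-byte header size is the figure used elsewhere in this document.

```rust
const HEADER_SIZE: usize = 101; // tree block header
const ITEM_DESC_SIZE: usize = 25; // per-item descriptor in a leaf

/// Pack items sequentially into leaves. Returns, for each leaf, the indices
/// of the items placed in it. `item_sizes` holds the payload size of each item.
fn pack_leaves(nodesize: usize, item_sizes: &[usize]) -> Vec<Vec<usize>> {
    let capacity = nodesize - HEADER_SIZE;
    let mut leaves: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut used = 0usize;

    for (index, &size) in item_sizes.iter().enumerate() {
        let need = ITEM_DESC_SIZE + size;
        // Finalize the current leaf when the next item no longer fits.
        if used + need > capacity && !current.is_empty() {
            leaves.push(std::mem::take(&mut current));
            used = 0;
        }
        current.push(index);
        used += need;
    }
    if !current.is_empty() {
        leaves.push(current);
    }
    leaves
}
```

With a 16 KiB nodesize, 280 items of 33 bytes fit in one leaf and the 281st starts a second leaf. The first index in each inner vector identifies the item whose key would be recorded for the parent node.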
Internal node construction
If more than one leaf is produced:
- Create internal nodes at level 1, each pointing to up to (nodesize - 101) / 33 child blocks (33 bytes per key-pointer entry: 17 key + 8 blockptr + 8 generation).
- If more than one level-1 node is needed, create level-2 nodes, and so on.
- Repeat until a single root node remains.
Node balancing: if the last node at a level would have fewer than 1/4 of the maximum entries, the previous node is split more evenly to avoid a tiny remainder.
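The number of blocks at each level follows directly from the fan-out. The sketch below computes only the counts; the function names are hypothetical, and it ignores node balancing, which changes how entries are distributed between the last two nodes of a level but not how many nodes the level has.

```rust
const HEADER_SIZE: usize = 101;
const KEY_PTR_SIZE: usize = 33; // 17 key + 8 blockptr + 8 generation

/// Maximum number of key-pointer entries in one internal node.
fn max_ptrs(nodesize: usize) -> usize {
    (nodesize - HEADER_SIZE) / KEY_PTR_SIZE
}

/// Number of blocks at each level, from level 0 (leaves) up to the root.
fn blocks_per_level(nodesize: usize, leaf_count: usize) -> Vec<usize> {
    let fanout = max_ptrs(nodesize);
    let mut levels = vec![leaf_count];
    let mut count = leaf_count;
    while count > 1 {
        count = (count + fanout - 1) / fanout; // ceiling division
        levels.push(count);
    }
    levels
}
```

With a 16 KiB nodesize the fan-out is 493, so up to 493 leaves need a single level-1 root, and 494 leaves need two level-1 nodes plus a level-2 root.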
Placeholder addresses
All blocks are initially built with bytenr = 0 in the header. After address
assignment, TreeBuilder::assign_addresses patches:
- The bytenr field in each block’s header (offset 48).
- The blockptr fields in internal nodes (for each key-pointer entry at offset 17 relative to the entry start).
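The two patches amount to overwriting little-endian u64 values at fixed offsets. The helper names below are hypothetical; the sketch assumes the key-pointer array starts immediately after the 101-byte header, as in the btrfs on-disk node layout.

```rust
const HEADER_SIZE: usize = 101;
const HEADER_BYTENR_OFFSET: usize = 48;
const KEY_PTR_SIZE: usize = 33;
const KEY_SIZE: usize = 17;

/// Overwrite the bytenr field in a tree block header.
fn patch_header_bytenr(block: &mut [u8], bytenr: u64) {
    block[HEADER_BYTENR_OFFSET..HEADER_BYTENR_OFFSET + 8]
        .copy_from_slice(&bytenr.to_le_bytes());
}

/// Overwrite the blockptr of the key-pointer entry at `index` in an internal node.
fn patch_blockptr(node: &mut [u8], index: usize, child_bytenr: u64) {
    let start = HEADER_SIZE + index * KEY_PTR_SIZE + KEY_SIZE;
    node[start..start + 8].copy_from_slice(&child_bytenr.to_le_bytes());
}
```

Because only fixed-width fields change, patching never alters the size or item layout of a block, which is why blocks can be built before their addresses are known.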
Step 4: the convergence loop
This is the solution to the bootstrapping problem.
The bootstrapping problem
The extent tree must contain a METADATA_ITEM (or EXTENT_ITEM) for every
tree block in the filesystem, including the extent tree’s own blocks. But the
number of extent tree blocks depends on how many items it contains, which
includes its own self-referential entries. Adding more extent tree blocks
requires more extent items, which might require even more blocks.
Solution: iterate until stable
The converge_extent_tree_block_count function iteratively computes the extent
tree block count:
- Start with extent_tree_block_count = 1.
- Construct a trial set of all extent items:
  - One METADATA_ITEM per tree block (chunk tree, root tree, extent_tree_block_count extent tree blocks, dev tree, FS tree blocks, csum tree blocks, free-space tree block, data-reloc tree blocks, block-group tree block if applicable).
  - All data extent items from the data writing phase.
  - Block group items (if not using block-group tree).
- Sort all trial items by key.
- Build the trial extent tree using TreeBuilder::build to determine how many blocks it needs.
- If trial.blocks.len() == extent_tree_block_count, the count has stabilized; break.
- Otherwise, set extent_tree_block_count = trial.blocks.len() and repeat.
In practice, this converges in 1-3 iterations. The count is monotonically non-decreasing (adding self-referential items can only increase the block count), so convergence is guaranteed.
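The fixed-point iteration can be demonstrated with a simplified model. The sketch below is not converge_extent_tree_block_count itself: it assumes the extent tree contains only 33-byte METADATA_ITEM entries (no data extents or block group items), and the function names and the other_tree_blocks parameter are hypothetical.

```rust
const HEADER_SIZE: usize = 101;
const ITEM_DESC_SIZE: usize = 25;
const METADATA_ITEM_SIZE: usize = 33;
const KEY_PTR_SIZE: usize = 33;

/// Total blocks (leaves plus internal nodes) needed for `items` METADATA_ITEMs.
fn blocks_for_items(nodesize: usize, items: usize) -> usize {
    let per_leaf = (nodesize - HEADER_SIZE) / (ITEM_DESC_SIZE + METADATA_ITEM_SIZE);
    let fanout = (nodesize - HEADER_SIZE) / KEY_PTR_SIZE;
    let mut level = ((items + per_leaf - 1) / per_leaf).max(1);
    let mut total = level;
    while level > 1 {
        level = (level + fanout - 1) / fanout;
        total += level;
    }
    total
}

/// Iterate until the extent tree's block count accounts for its own blocks.
/// Returns (converged block count, number of iterations).
fn converge_block_count(nodesize: usize, other_tree_blocks: usize) -> (usize, usize) {
    let mut count = 1;
    let mut iterations = 0;
    loop {
        iterations += 1;
        // One item per block of every other tree, plus one per extent tree block.
        let trial = blocks_for_items(nodesize, other_tree_blocks + count);
        if trial == count {
            return (count, iterations);
        }
        count = trial;
    }
}
```

In this model, 280 blocks in the other trees push the extent tree past one leaf: the first trial needs 3 blocks (two leaves and a root), and the second trial with 283 items still needs 3, so the loop stops after two iterations.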
Step 5: address assignment
Once the extent tree block count is known, BlockAllocator assigns real
logical addresses in a fixed order:
- Chunk tree: allocate from the system chunk (alloc_system).
- Root tree: allocate from the metadata chunk (alloc_metadata).
- Extent tree blocks (count from convergence loop): sequential metadata allocations.
- Dev tree: one metadata allocation.
- FS tree blocks: sequential metadata allocations.
- Csum tree blocks: sequential metadata allocations.
- Free-space tree: one metadata allocation (if enabled).
- Data-reloc tree blocks: sequential metadata allocations.
- Block-group tree: one metadata allocation (if enabled).
BlockAllocator maintains separate bumping pointers for the system chunk
(SYSTEM_GROUP_OFFSET to SYSTEM_GROUP_OFFSET + SYSTEM_GROUP_SIZE) and the
metadata chunk (meta_logical to meta_logical + meta_size), returning an
error if either runs out of space.
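A bump allocator of this shape can be sketched in a few lines. Only the method names alloc_system and alloc_metadata come from the text; the struct layout, constructor, and error type here are assumptions made for illustration.

```rust
/// One contiguous region with a bump pointer.
struct BumpRegion {
    next: u64,
    end: u64,
}

impl BumpRegion {
    fn alloc(&mut self, size: u64) -> Result<u64, String> {
        if self.next + size > self.end {
            return Err(format!(
                "out of space: need {} bytes at {}, region ends at {}",
                size, self.next, self.end
            ));
        }
        let addr = self.next;
        self.next += size;
        Ok(addr)
    }
}

/// Separate bump pointers for the system chunk and the metadata chunk.
struct BlockAllocator {
    nodesize: u64,
    system: BumpRegion,
    metadata: BumpRegion,
}

impl BlockAllocator {
    fn new(nodesize: u64, sys_start: u64, sys_size: u64, meta_start: u64, meta_size: u64) -> Self {
        Self {
            nodesize,
            system: BumpRegion { next: sys_start, end: sys_start + sys_size },
            metadata: BumpRegion { next: meta_start, end: meta_start + meta_size },
        }
    }

    fn alloc_system(&mut self) -> Result<u64, String> {
        self.system.alloc(self.nodesize)
    }

    fn alloc_metadata(&mut self) -> Result<u64, String> {
        self.metadata.alloc(self.nodesize)
    }
}
```

Because each allocation advances the pointer by exactly one nodesize, the address of a block depends only on how many blocks were allocated before it in the same chunk.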
Step 6: building the real extent tree
With real addresses known, the actual extent tree is built:
- Create METADATA_ITEM entries for every tree block using their real addresses.
- Include all data extent items from the data writing phase.
- Include block group items (in-extent-tree or separate block-group tree).
- Sort all items by key.
- Build with TreeBuilder::build.
- Assert that the block count matches the converged count (if it does not, the convergence loop has a bug).
- Assign addresses to extent tree blocks from the pre-allocated address list.
Step 7: building remaining trees
With all addresses finalized:
- FS tree: TreeBuilder::assign_addresses patches bytenr fields using pre-allocated addresses.
- Csum tree: same.
- Data-reloc tree: same.
- Chunk tree: rebuilt as a single leaf with final device bytes_used values.
- Dev tree: rebuilt as a single leaf with final device extent information.
- Free-space tree: rebuilt with final used-byte counts for each block group.
- Block-group tree: rebuilt with final used-byte counts.
- Root tree: rebuilt with final tree root addresses and levels for all trees.
The root tree is always a single leaf because the number of ROOT_ITEM entries is small (6-8 trees). It is built last because it needs the root address and level of every other tree.
Step 8: writing to disk
All tree blocks are written in order:
- Single-leaf trees (chunk, root, dev): compute checksum, write to all physical locations.
- Multi-block trees (extent, FS, csum, data-reloc): for each block, compute checksum, write to all physical locations.
- Optional single-leaf trees (free-space, block-group): compute checksum, write.
The write_rootdir_trees helper manages this process.
Step 9: superblock
The superblock is built with:
- root: root tree address (from step 5).
- chunk_root: chunk tree address (from step 5).
- bytes_used: system_used + metadata_used + data_used.
Written to all mirror locations on all devices.
Step 10: shrink (optional)
If --shrink is specified and there is a single device:
- Compute the physical end of the last chunk (considering all metadata and data stripes).
- Round up to sectorsize alignment.
- Create a new config with total_bytes set to this shrunk size.
- Rebuild the chunk tree and superblock with the reduced total_bytes (so DEV_ITEM.total_bytes and superblock.total_bytes reflect the actual image size).
- After all writes, truncate the image file to the shrunk size with set_len.
This produces a minimal image file suitable for distribution or flashing.
Item serialization (items.rs)
All item serializers produce Vec<u8> suitable for LeafBuilder::push. They
use the bytes::BufMut trait for little-endian encoding and derive field
positions from std::mem::offset_of! and std::mem::size_of on the bindgen
structs.
Key serializers and their sizes:
| Function | Item type | Approximate size |
|---|---|---|
| root_item | ROOT_ITEM | 439 bytes |
| extent_item | EXTENT_ITEM/METADATA_ITEM | 33 bytes (skinny) or 51 bytes |
| block_group_item | BLOCK_GROUP_ITEM | 24 bytes |
| dev_item | DEV_ITEM | 98 bytes |
| chunk_item | CHUNK_ITEM | 48 + 32*num_stripes bytes |
| dev_extent | DEV_EXTENT | 48 bytes |
| dev_stats_zeroed | PERSISTENT_ITEM | 40 bytes |
| free_space_info | FREE_SPACE_INFO | 8 bytes |
| inode_item_dir | INODE_ITEM | 160 bytes |
| inode_item | INODE_ITEM | 160 bytes |
| inode_ref | INODE_REF | 10 + name_len bytes |
| dir_item | DIR_ITEM/DIR_INDEX | 30 + name_len bytes |
| xattr_item | XATTR_ITEM | 30 + name_len + value_len bytes |
| file_extent_inline | FILE_EXTENT_ITEM | 21 + data_len bytes |
| file_extent_reg | FILE_EXTENT_ITEM | 53 bytes |
| data_extent_item | EXTENT_ITEM | 53 bytes |
Checksum computation (write.rs)
ChecksumType supports four algorithms, each computing checksums of the data
portion (bytes 32..end) of tree blocks and superblocks:
| Algorithm | On-disk type value | Output size | Implementation |
|---|---|---|---|
| CRC32C | 0 | 4 bytes | crc32c crate |
| xxHash64 | 1 | 8 bytes | xxhash-rust crate |
| SHA-256 | 2 | 32 bytes | sha2 crate |
| BLAKE2b-256 | 3 | 32 bytes | blake2 crate |
csum_tree_block writes the computed hash into the first N bytes of the
block’s checksum field (32 bytes total), zero-filling the remaining bytes.
Data block checksums (in the csum tree) use the same algorithm but are computed per-sector.
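For the CRC32C case, the layout of the checksum field can be illustrated with a dependency-free sketch. The real code uses the crc32c crate; the bitwise implementation below is swapped in only to keep the example self-contained, and the function name csum_tree_block_crc32c is hypothetical.

```rust
/// Bitwise CRC32C (Castagnoli polynomial, reflected form), with the standard
/// 0xFFFFFFFF seed and final XOR. Slow but dependency-free.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

const CSUM_FIELD_SIZE: usize = 32;

/// Checksum bytes 32..end of a block and store the result in the first
/// 4 bytes of the 32-byte checksum field, zero-filling the rest.
fn csum_tree_block_crc32c(block: &mut [u8]) {
    let hash = crc32c(&block[CSUM_FIELD_SIZE..]);
    block[..CSUM_FIELD_SIZE].fill(0);
    block[..4].copy_from_slice(&hash.to_le_bytes());
}
```

The same pattern applies to the other algorithms, with the hash occupying the first 8 or 32 bytes of the field instead of 4.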
The bootstrapping problem in detail
The bootstrapping problem is fundamental to mkfs and worth understanding in depth.
The circular dependency
Consider a minimal filesystem with 8 tree blocks. The extent tree must contain
8 METADATA_ITEM entries (one for each block, including itself). But what if
those 8 entries do not fit in a single leaf?
With skinny metadata (METADATA_ITEM, 33-byte payload), each item uses
25 (descriptor) + 33 (data) = 58 bytes. A 16 KiB leaf has 16384 - 101 = 16283
usable bytes, fitting 280 items. So for an empty filesystem, the extent tree
easily fits in one block.
But with --rootdir populating thousands of files, the FS tree, csum tree, and
extent tree can each grow to many blocks. If the FS tree has 100 blocks and
there are 500 data extents, the extent tree might need several blocks itself,
and each additional extent tree block requires another METADATA_ITEM entry
in the extent tree.
Why pre-computing works
The solution works because:
- Addresses are independent of content. Tree block addresses are assigned by sequential bump allocation, so the address of each block depends only on how many blocks precede it, not on the content of any block.
- Block count is monotonically non-decreasing. Adding self-referential entries can only increase (or maintain) the block count, never decrease it.
- The system is finite. There is a maximum number of blocks that can fit in the metadata chunk, bounding the iteration.
- Content depends only on addresses and counts. Once addresses are assigned, every tree block’s content is fully determined. There are no further dependencies.
The convergence loop exploits properties (1) and (2): it guesses a block count, computes trial content, checks if the trial needs the same number of blocks, and if not, tries again with the new count. Properties (2) and (3) together guarantee that this converges: the count can only go up, and it is bounded by the size of the metadata chunk, so it must eventually stabilize.
Implementation detail
The trial in each iteration uses placeholder addresses (sequential from
meta_logical), not the final addresses. This is acceptable because the
TreeBuilder only needs the item count and sizes to determine how many blocks
are needed – the actual address values do not affect block count. After
convergence, the real extent tree is built with the actual addresses from
BlockAllocator.
Default features
The default incompat feature flags are:
- MIXED_BACKREF – mixed backreference format
- BIG_METADATA – larger metadata blocks
- EXTENDED_IREF – extended inode references (INODE_EXTREF)
- SKINNY_METADATA – skinny metadata extent refs (METADATA_ITEM key type)
- NO_HOLES – no explicit hole extent items
The default compat_ro feature flags are:
- FREE_SPACE_TREE – free-space tree (v2 free space tracking)
- FREE_SPACE_TREE_VALID – marks the free-space tree as valid
- BLOCK_GROUP_TREE – separate tree for block group items
Features can be enabled or disabled with -O feature or -O ^feature.
Multi-device support
For multi-device filesystems, chunk layout computation distributes stripes across devices:
- RAID1 metadata: one stripe per device at CHUNK_START.
- SINGLE data: one stripe on device 1.
Each device gets its own superblock with device-specific devid, dev_uuid,
and bytes_used. The chunk tree contains a DEV_ITEM per device, and the dev
tree contains DEV_EXTENT entries mapping physical allocations to chunks.
The logical_to_physical function determines write destinations: system chunk
blocks go to device 1 only, metadata blocks go to all metadata stripe devices,
data blocks go to all data stripe devices.
Limitations
Not yet implemented:
- --rootdir with LZO compression (rejected at argument validation).
- RAID0/5/6/10 profiles.
- Zoned device support.
- Mixed block group mode with --rootdir.
btrfs check: verification phases
This document describes the seven phases of btrfs check, as implemented in the
cli/src/check/ module. The checker operates in read-only mode on an unmounted
filesystem, reading the raw on-disk image through btrfs-disk’s BlockReader
without requiring any kernel ioctls.
Overview
The check command opens the filesystem image and bootstraps the chunk tree (superblock -> sys_chunk_array -> chunk tree -> root tree), then runs seven sequential verification phases:
- Superblock mirror validation
- Tree structure checks (all trees)
- Extent tree cross-checks (reference counting and ownership)
- Chunk / block group / device extent cross-checks
- FS tree inode consistency
- Checksum tree verification
- ROOT_REF / ROOT_BACKREF consistency
Each phase accumulates errors into a CheckResults struct. Errors are printed
to stderr as they are found, and a summary is printed at the end. The process
exits with code 1 if any errors were detected.
Orchestration (check.rs)
The main CheckCommand::run method:
- Rejects unsupported flags (--repair, --init-csum-tree, --init-extent-tree, --backup, --tree-root, --chunk-root, --qgroup-report, --subvol-extents).
- Checks mount status (skippable with --force).
- Validates the superblock mirror index (0-2).
- Opens the filesystem via reader::filesystem_open_mirror, which bootstraps chunk mapping and discovers all tree roots.
- Runs phases 1-7 in order.
- Prints summary and exits.
Statistics tracking
Throughout all phases, CheckResults accumulates byte counts that are printed
in the final summary:
- total_tree_bytes: sum of nodesize for every tree block visited in phase 2.
- total_fs_tree_bytes: subset of the above for FS trees (objectid 5 and >= 256).
- total_extent_tree_bytes: subset of the above for the extent tree (objectid 2).
- btree_space_waste: for each leaf, nodesize minus actual bytes used (header + item descriptors + item data payloads).
- data_bytes_allocated: total length of data extents from extent items.
- data_bytes_referenced: total referenced bytes, accounting for shared extents via ExtentDataRef and SharedDataRef count fields.
- total_csum_bytes: total bytes of checksum data in the csum tree.
Phase 1: Superblocks
Source: cli/src/check/superblock.rs
Purpose: Validate all three superblock mirror copies.
What it checks
Btrfs stores up to three copies of the superblock at fixed byte offsets on the device:
- Mirror 0: byte offset 65536 (64 KiB)
- Mirror 1: byte offset 67108864 (64 MiB)
- Mirror 2: byte offset 274877906944 (256 GiB)
For each mirror (0 through SUPER_MIRROR_MAX - 1):
- Read 4096 bytes from the mirror offset using read_superblock_bytes_at.
- Validate the superblock using superblock_is_valid, which checks:
  - The magic number matches _BHRfS_M (0x4D5F53665248425F).
  - The CRC32C checksum of bytes 32..4096 matches the stored checksum in bytes 0..4.
If a mirror cannot be read (I/O error), this is only reported as an error for mirror 0. Mirrors 1 and 2 may legitimately be absent on small devices where the device is shorter than the mirror offset.
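The mirror offsets and the magic check can be sketched as below. This is an illustration rather than the code in superblock.rs: the function names are hypothetical, the offset formula is the one used by the btrfs on-disk format (16 KiB shifted left by 12 bits per mirror), and the magic is assumed to sit at byte 64 of the superblock, after the checksum, fsid, bytenr, and flags fields.

```rust
const SUPER_INFO_OFFSET: u64 = 64 * 1024;
const SUPER_MIRROR_MAX: u32 = 3;
const SUPER_MIRROR_SHIFT: u32 = 12;
const SUPER_INFO_SIZE: u64 = 4096;
const MAGIC: u64 = 0x4D5F_5366_5248_425F; // "_BHRfS_M" read as little-endian u64
const MAGIC_OFFSET: usize = 64;

/// Byte offset of superblock mirror `mirror` on the device.
fn mirror_offset(mirror: u32) -> u64 {
    if mirror == 0 {
        SUPER_INFO_OFFSET
    } else {
        (16 * 1024u64) << (SUPER_MIRROR_SHIFT * mirror)
    }
}

/// Check the magic number in a raw superblock buffer.
fn has_valid_magic(superblock: &[u8]) -> bool {
    match superblock.get(MAGIC_OFFSET..MAGIC_OFFSET + 8) {
        Some(bytes) => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(bytes);
            u64::from_le_bytes(buf) == MAGIC
        }
        None => false,
    }
}

/// Mirrors that fit entirely on a device of the given size.
fn mirrors_on_device(device_size: u64) -> Vec<u32> {
    (0..SUPER_MIRROR_MAX)
        .filter(|&m| mirror_offset(m) + SUPER_INFO_SIZE <= device_size)
        .collect()
}
```

A 1 GiB image has room for mirrors 0 and 1 only, which is why an unreadable mirror 2 is not treated as an error.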
Generation consistency
The current implementation validates each mirror independently (magic + checksum). The C reference additionally checks that the generation fields across valid mirrors are consistent (the primary mirror should have the highest generation). This is not yet implemented.
Error variants produced
- SuperblockInvalid { mirror, detail } – reported when:
  - A mirror has an invalid checksum or magic number.
  - Mirror 0 cannot be read at all (I/O error).
Return value
Returns the count of valid mirrors found (0-3). This value is currently not used by the caller but could be used for repair decisions in the future.
Phase 2: Tree structure
Source: cli/src/check/tree_structure.rs
Purpose: Walk every tree in the filesystem and verify per-block structural integrity. Collect a map of all tree block addresses and their owners for use in phase 3.
Trees checked
The phase checks:
- Root tree – directly from superblock.root.
- Chunk tree – directly from superblock.chunk_root.
- All trees discovered in the root tree – every (tree_id, (bytenr, gen)) pair from open.tree_roots. This includes the extent tree, dev tree, FS tree, csum tree, free-space tree, data-reloc tree, block-group tree (if present), and all subvolume/snapshot trees.
Each tree is walked using reader::tree_walk_tolerant, which performs a depth-first
traversal through all internal nodes and leaves, calling the visitor callback for
each block. The _tolerant variant collects read errors instead of aborting,
allowing the checker to report all problems rather than stopping at the first.
Per-block checks
For every tree block (leaf or internal node), the following checks are performed:
CRC32C checksum verification
The first 32 bytes of each block contain the checksum. The checker computes
btrfs_csum_data(&raw[32..]) (standard CRC32C with ISO 3309 seed) and compares
it to the stored value in raw[0..4]. This check is only performed when the
superblock’s csum_type is CRC32C; other checksum types emit a warning and
skip verification.
Fsid match
The block header’s fsid field (16 bytes at offset 32) must match the
filesystem’s effective fsid. The effective fsid is metadata_uuid if the
METADATA_UUID incompat flag is set, or fsid otherwise. This distinction
matters for filesystems that have had their metadata UUID changed via
btrfs-tune -m.
Generation bound
The block header’s generation field must not exceed the superblock’s
generation. A block with a higher generation than the superblock indicates
corruption (the block was written in a transaction that was never committed,
or the block has been corrupted).
Level consistency
- Leaf blocks (items present) must have header.level == 0.
- Internal nodes (key-pointer entries) must have header.level > 0.
A mismatch indicates structural corruption where a block’s type disagrees with its declared level.
Key ordering
Within each block, keys must be in strictly ascending order using the compound
key comparison (objectid, type, offset):
- For leaves: consecutive items items[i-1] and items[i] must satisfy key_less(prev, cur).
- For internal nodes: consecutive key-pointers ptrs[i-1] and ptrs[i] must satisfy key_less(prev, cur).
Strictly ascending means no duplicates are allowed. The comparison function uses
the raw type byte for the type field (via key_type.to_raw()), comparing the
tuple (objectid, type_raw, offset) lexicographically.
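The ordering rule maps directly onto a derived lexicographic comparison. The sketch below uses a hypothetical Key struct and function name; deriving Ord on a struct compares fields in declaration order, which gives the (objectid, type_raw, offset) tuple ordering described above.

```rust
/// A btrfs key reduced to its three comparable fields. Field order matters:
/// derived Ord compares objectid first, then type_raw, then offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    objectid: u64,
    type_raw: u8,
    offset: u64,
}

/// Index of the first key that is not strictly greater than its predecessor,
/// or None if the keys are strictly ascending.
fn first_key_order_violation(keys: &[Key]) -> Option<usize> {
    keys.windows(2)
        .position(|pair| pair[0] >= pair[1])
        .map(|i| i + 1)
}
```

The returned index corresponds to the index field of a KeyOrderViolation: it names the key that fails to exceed the one at index - 1. Duplicate keys are reported because the comparison is strict.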
Byte attribution
Each visited block contributes nodesize bytes to the appropriate category:
- Extent tree blocks (objectid 2) -> total_extent_tree_bytes
- FS tree blocks (objectid 5 or >= 256) -> total_fs_tree_bytes
- All blocks -> total_tree_bytes
For leaf blocks, space waste is computed as:
waste = nodesize - (101 + nritems * 25 + sum(item.size for each item))
where 101 is the header size and 25 is the item descriptor size.
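As a worked instance of the formula, the following sketch computes the waste for one leaf. The function name is hypothetical, and the input is simply the list of item payload sizes.

```rust
const HEADER_SIZE: usize = 101;
const ITEM_DESC_SIZE: usize = 25;

/// Unused bytes in a leaf, given the payload size of each item it holds.
fn leaf_waste(nodesize: usize, item_sizes: &[usize]) -> usize {
    let used = HEADER_SIZE
        + item_sizes.len() * ITEM_DESC_SIZE
        + item_sizes.iter().sum::<usize>();
    nodesize - used
}
```

A 16 KiB leaf filled with 280 skinny METADATA_ITEM entries (33 bytes each) uses 101 + 7000 + 9240 = 16341 bytes, leaving 43 bytes of waste.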
Output
Returns a HashMap<u64, u64> mapping each tree block’s logical address to the
objectid of the tree that owns it. This map is used by phase 3 for bidirectional
ownership verification.
Tree name resolution
Tree names for error messages are derived from the objectid using ObjectId
formatting (e.g., objectid 1 = “ROOT_TREE”, objectid 5 = “FS_TREE”, objectid
256+ = the numeric subvolume ID). Names are leaked as &'static str since the
set of tree names is small and bounded.
Error variants produced
- TreeBlockChecksumMismatch { tree, logical } – CRC32C does not match.
- TreeBlockBadFsid { tree, logical } – header fsid does not match the filesystem’s effective fsid.
- TreeBlockBadBytenr { tree, logical, header_bytenr } – the header’s bytenr field does not match the logical address where the block was read. (Note: this check is performed by the block reader during parsing, not directly in this phase, but the error is reported here if it occurs.)
- TreeBlockBadGeneration { tree, logical, block_gen, super_gen } – block generation exceeds superblock generation.
- TreeBlockBadLevel { tree, logical, detail } – level/type mismatch (leaf with non-zero level, or node with zero level).
- KeyOrderViolation { tree, logical, index } – key at index is not strictly greater than the key at index - 1.
- ReadError { logical, detail } – I/O error reading a tree block.
Phase 3: Extents
Source: cli/src/check/extents.rs
Purpose: Walk the extent tree to verify reference counts, detect overlapping extents, and cross-check tree block ownership against extent tree backrefs in both directions.
How it works
The phase walks the extent tree leaf by leaf, processing items in key order.
It maintains an ExtentCheckState that tracks the “current” extent being
verified and accumulates statistics.
Item processing
Items are processed based on their key type:
EXTENT_ITEM / METADATA_ITEM: Start a new extent. The previous extent
(if any) is flushed first. For the new extent:
- Record the bytenr in extent_item_addrs (for later ownership checks).
- Determine the extent length:
  - EXTENT_ITEM: length = key.offset.
  - METADATA_ITEM: length = 0 (skinny refs use key.offset as level, not length, so overlap detection is skipped for metadata items).
- Check for overlap: if bytenr < prev_end and prev_end > 0, report an overlapping extent error.
- Parse the ExtentItem payload to extract:
  - The declared reference count (refs).
  - Inline backrefs and their count.
  - Whether this is a data extent (via BTRFS_EXTENT_FLAG_DATA).
  - For tree block extents: collect TreeBlockBackref inline refs into extent_backref_owners[bytenr].
- Initialize pending state: pending_refs = declared refs, pending_counted = inline ref count.
- For data extents, add length to data_bytes_allocated.
TREE_BLOCK_REF: Standalone tree block backref. Increments pending_counted
by 1. Records key.offset (the root objectid) in extent_backref_owners.
SHARED_BLOCK_REF / EXTENT_OWNER_REF: Standalone backrefs. Each
increments pending_counted by 1.
EXTENT_DATA_REF: Standalone data backref. Parses the item to extract the
count field (number of references from this particular root/objectid/offset
combination). Increments pending_counted by count. Adds
length * count to data_bytes_referenced.
SHARED_DATA_REF: Same as EXTENT_DATA_REF but for shared (relocated)
data references.
All other key types (e.g., BLOCK_GROUP_ITEM): ignored.
Inline reference counting
The count_inline_refs function iterates over the InlineRef variants in an
ExtentItem:
- TreeBlockBackref, SharedBlockBackref, ExtentOwnerRef: count as 1 each.
- ExtentDataBackref, SharedDataBackref: count as their embedded count field (which may be > 1 for multiply-referenced data extents).
Flushing
When a new EXTENT_ITEM/METADATA_ITEM is encountered, or at the end of the
tree walk, flush_pending is called:
- Skip if no extent is pending (pending_bytenr == 0).
- For data extents where data_bytes_referenced is still 0 (only inline refs, no standalone ExtentDataRef), compute data_bytes_referenced += pending_length * pending_counted.
- Compare pending_refs (declared) to pending_counted (actual). If they differ, report an ExtentRefMismatch error.
- Reset pending_bytenr to 0.
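The pending/flush pattern can be reduced to a small state machine. The sketch below keeps only the reference-count comparison and omits the data_bytes_referenced accounting; the field names follow the text, while start_extent and add_standalone_ref are hypothetical method names.

```rust
#[derive(Debug, PartialEq)]
struct ExtentRefMismatch {
    bytenr: u64,
    expected: u64,
    found: u64,
}

#[derive(Default)]
struct ExtentCheckState {
    pending_bytenr: u64,
    pending_refs: u64,    // declared in the ExtentItem
    pending_counted: u64, // inline + standalone backrefs seen so far
    errors: Vec<ExtentRefMismatch>,
}

impl ExtentCheckState {
    /// Called for each EXTENT_ITEM / METADATA_ITEM: flush the previous extent,
    /// then start tracking the new one.
    fn start_extent(&mut self, bytenr: u64, declared_refs: u64, inline_count: u64) {
        self.flush_pending();
        self.pending_bytenr = bytenr;
        self.pending_refs = declared_refs;
        self.pending_counted = inline_count;
    }

    /// Called for each standalone backref item that follows the extent item.
    fn add_standalone_ref(&mut self, count: u64) {
        self.pending_counted += count;
    }

    /// Compare declared and counted references for the pending extent.
    fn flush_pending(&mut self) {
        if self.pending_bytenr == 0 {
            return; // nothing pending
        }
        if self.pending_refs != self.pending_counted {
            self.errors.push(ExtentRefMismatch {
                bytenr: self.pending_bytenr,
                expected: self.pending_refs,
                found: self.pending_counted,
            });
        }
        self.pending_bytenr = 0;
    }
}
```

The caller must invoke flush_pending once more after the last item, otherwise the final extent in the tree is never compared.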
Bidirectional ownership cross-check
After the extent tree walk completes, two cross-checks are performed using the
tree_block_owners map from phase 2:
Direction 1: tree block -> extent tree. For every tree block address found during phase 2 tree walks:
- If the address has no EXTENT_ITEM or METADATA_ITEM in the extent tree, report MissingExtentItem.
- If the address has extent tree entries but none of the claimed owner roots match the actual owner (the tree that contained this block during phase 2 walks), report BackrefOwnerMismatch.
Direction 2: extent tree -> tree block. For every tree block address with backrefs in the extent tree:
- For each claimed owner root, check if the actual owner (from the phase 2 map)
matches. If the block was not found during phase 2 walks, or belongs to a
different tree, report
BackrefOrphan.
Both cross-checks sort addresses before iteration for deterministic error ordering.
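Both directions can be sketched over three lookup tables. This is an illustration, not the code in extents.rs: the function name and error enum are hypothetical, and BTreeMap/BTreeSet are used in place of a HashMap plus an explicit sort so that iteration is already in address order.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, PartialEq)]
enum OwnershipError {
    MissingExtentItem { bytenr: u64 },
    BackrefOwnerMismatch { bytenr: u64, actual_owner: u64, claimed_owners: Vec<u64> },
    BackrefOrphan { bytenr: u64, claimed_owner: u64 },
}

fn cross_check_ownership(
    tree_block_owners: &BTreeMap<u64, u64>,          // from phase 2
    extent_item_addrs: &BTreeSet<u64>,               // from the extent tree walk
    extent_backref_owners: &BTreeMap<u64, Vec<u64>>, // from the extent tree walk
) -> Vec<OwnershipError> {
    let mut errors = Vec::new();

    // Direction 1: every walked tree block must be described by the extent tree.
    for (&bytenr, &actual_owner) in tree_block_owners {
        if !extent_item_addrs.contains(&bytenr) {
            errors.push(OwnershipError::MissingExtentItem { bytenr });
        } else if let Some(claimed) = extent_backref_owners.get(&bytenr) {
            if !claimed.contains(&actual_owner) {
                errors.push(OwnershipError::BackrefOwnerMismatch {
                    bytenr,
                    actual_owner,
                    claimed_owners: claimed.clone(),
                });
            }
        }
    }

    // Direction 2: every claimed backref must match a block found in phase 2.
    for (&bytenr, claimed) in extent_backref_owners {
        for &claimed_owner in claimed {
            if tree_block_owners.get(&bytenr) != Some(&claimed_owner) {
                errors.push(OwnershipError::BackrefOrphan { bytenr, claimed_owner });
            }
        }
    }
    errors
}
```

A single inconsistent block can surface in both directions, for example as a BackrefOwnerMismatch for the walked block and a BackrefOrphan for the stale backref.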
Error variants produced
- ExtentRefMismatch { bytenr, expected, found } – the declared reference count in the ExtentItem does not match the sum of inline and standalone backrefs.
- MissingExtentItem { bytenr } – a tree block observed during phase 2 has no corresponding EXTENT_ITEM or METADATA_ITEM in the extent tree.
- BackrefOwnerMismatch { bytenr, actual_owner, claimed_owners } – the tree block’s actual owner (from phase 2) does not appear in the extent tree’s list of backref owners for that address.
- BackrefOrphan { bytenr, claimed_owner } – the extent tree claims a backref for a tree that does not actually contain a block at that address.
- OverlappingExtent { bytenr, length, prev_end } – two data extents overlap in logical address space (the start of one extent is before the end of the previous).
- ReadError { logical, detail } – I/O error reading the extent tree.
Phase 4: Chunks / block groups / device extents
Source: cli/src/check/chunks.rs
Purpose: Cross-check the chunk tree, block group items, and device extents for mutual consistency.
What it checks
This phase performs the following cross-checks:
Chunk <-> block group cross-check
Every chunk in the chunk tree’s ChunkTreeCache (built during filesystem open)
should have a corresponding BLOCK_GROUP_ITEM in the extent tree (or block-group
tree, if the BLOCK_GROUP_TREE compat_ro feature is enabled). And vice versa:
every block group item should correspond to a chunk.
Block groups are collected by walking either:
- The block-group tree if BTRFS_FEATURE_COMPAT_RO_BLOCK_GROUP_TREE is set in the superblock’s compat_ro_flags.
- The extent tree otherwise (block group items historically lived in the extent tree).
The walk collects all items with key type BLOCK_GROUP_ITEM into a BTreeMap
keyed by logical address.
Then:
- For each chunk in the chunk cache: if no block group exists at that logical address, report ChunkMissingBlockGroup.
- For each block group: if the chunk cache has no chunk at that logical address, report BlockGroupMissingChunk.
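Since both sides are keyed by logical address, the two checks are a pair of set differences. The sketch below is a simplification that tracks only the logical addresses; the function name and error enum are hypothetical.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum ChunkError {
    ChunkMissingBlockGroup { logical: u64 },
    BlockGroupMissingChunk { logical: u64 },
}

/// Compare chunk start addresses against block group start addresses.
fn cross_check_chunks(chunks: &BTreeSet<u64>, block_groups: &BTreeSet<u64>) -> Vec<ChunkError> {
    let mut errors = Vec::new();
    // Chunks with no block group at the same logical address.
    errors.extend(
        chunks
            .difference(block_groups)
            .map(|&logical| ChunkError::ChunkMissingBlockGroup { logical }),
    );
    // Block groups with no chunk at the same logical address.
    errors.extend(
        block_groups
            .difference(chunks)
            .map(|&logical| ChunkError::BlockGroupMissingChunk { logical }),
    );
    errors
}
```

Using ordered sets keeps the reported errors in ascending address order within each category.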
Device extent overlap detection
Device extents are collected from the device tree by walking all items with key
type DEV_EXTENT. Each extent is recorded as (offset, length) grouped by
device ID (key.objectid).
For each device, extents are sorted by physical offset. Then consecutive pairs
are checked: if extents[i].offset < extents[i-1].offset + extents[i-1].length,
the extents overlap and DeviceExtentOverlap is reported.
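The overlap scan can be sketched as a group, sort, and pairwise comparison. The function name is hypothetical, extents are passed as (devid, offset, length) tuples for brevity, and reporting the offset of the later extent in each overlapping pair is an assumption about what DeviceExtentOverlap carries.

```rust
use std::collections::BTreeMap;

/// Returns (devid, offset) for every device extent that starts before the
/// previous extent on the same device ends.
fn device_extent_overlaps(extents: &[(u64, u64, u64)]) -> Vec<(u64, u64)> {
    // Group (offset, length) pairs by device ID.
    let mut by_device: BTreeMap<u64, Vec<(u64, u64)>> = BTreeMap::new();
    for &(devid, offset, length) in extents {
        by_device.entry(devid).or_default().push((offset, length));
    }

    let mut overlaps = Vec::new();
    for (&devid, list) in by_device.iter_mut() {
        list.sort_unstable(); // sort by physical offset
        for pair in list.windows(2) {
            let (prev_offset, prev_length) = pair[0];
            let (offset, _) = pair[1];
            if offset < prev_offset + prev_length {
                overlaps.push((devid, offset));
            }
        }
    }
    overlaps
}
```

Extents that merely touch (one starts exactly where the previous ends) are not reported, because the comparison is strict.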
Error variants produced
- ChunkMissingBlockGroup { logical } – a chunk exists in the chunk tree but no block group item was found at the same logical address.
- BlockGroupMissingChunk { logical } – a block group item exists but no chunk was found at the same logical address.
- DeviceExtentOverlap { devid, offset } – two device extents on the same device overlap in physical address space.
- ReadError { logical, detail } – I/O error reading the block-group tree, extent tree, or device tree.
Phase 5: FS roots
Source: cli/src/check/fs_roots.rs
Purpose: Walk every filesystem tree (the default FS tree and all subvolume trees) and verify inode-level consistency.
Which trees are checked
From the tree_roots map (populated during filesystem open), the phase selects
trees whose objectid is either:
- BTRFS_FS_TREE_OBJECTID (5) – the default filesystem tree.
- >= BTRFS_FIRST_FREE_OBJECTID (256) – subvolume and snapshot trees.
Item collection
For each FS tree, collect_fs_items walks all leaves and groups items by
objectid (inode number). Each item is stored as a (KeyType, key_offset, raw_data_bytes) tuple. Items arrive in sorted key order due to the B-tree
traversal, which means within an objectid group, items are sorted by
(key_type, offset).
Per-inode checks
For each objectid group (inode), the following checks are performed:
INODE_ITEM presence
The checker notes whether the objectid has an INODE_ITEM. If directory entries
reference an objectid that has no INODE_ITEM, the entry is an orphan.
Parsed from INODE_ITEM: nlink, size (isize), nbytes, and mode.
Nlink consistency
The actual reference count is computed by counting entries across all
INODE_REF items (via InodeRef::parse_all) and INODE_EXTREF items (via
InodeExtref::parse_all) for this objectid. If the computed count differs from
inode_item.nlink and the inode has at least one reference, NlinkMismatch is
reported.
The root directory inode (objectid 256, BTRFS_FIRST_FREE_OBJECTID) is excluded
from this check because it has special nlink handling in btrfs.
File extent overlap detection
For regular files, all EXTENT_DATA items are processed to extract
(file_offset, file_offset + length) ranges:
- Regular extents: length = num_bytes from the FileExtentBody::Regular variant.
- Inline extents: length = inline_size from the FileExtentBody::Inline variant.
Since items are in key order and EXTENT_DATA keys use the file offset, ranges
are already sorted by start offset. Consecutive ranges are checked: if
ranges[i].start < ranges[i-1].end, a FileExtentOverlap is reported.
Directory inode size (isize) check
For directory inodes (mode & S_IFMT == S_IFDIR), the expected inode size is
computed by summing name_len * 2 for every DIR_INDEX entry belonging to
this inode. The factor of 2 matches the btrfs convention where directory inode
size counts each entry’s name length twice (once for DIR_ITEM, once for
DIR_INDEX).
If the inode’s stored size field differs from this computed sum, DirSizeWrong
is reported.
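The expected size is a one-line sum over the DIR_INDEX names. The function name below is hypothetical, and names are passed as raw byte slices since btrfs file names are not required to be valid UTF-8.

```rust
/// Expected directory inode size: each DIR_INDEX name length counted twice.
fn expected_dir_size(dir_index_names: &[&[u8]]) -> u64 {
    dir_index_names.iter().map(|name| name.len() as u64 * 2).sum()
}
```

A directory containing the entries "a" and "hello" is expected to have a stored size of (1 + 5) * 2 = 12, and an empty directory a size of 0.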
File nbytes check
For regular files and symlinks (mode & S_IFMT == S_IFREG or S_IFLNK), the
expected nbytes is computed from extent items:
- Inline extents: nbytes += data_len (the inline payload size).
- Regular extents: nbytes += disk_num_bytes, but only for non-prealloc extents. Prealloc extents (preallocated but unwritten) and hole extents (disk_bytenr == 0) do not contribute.
If the inode’s stored nbytes differs from the computed total, NbytesWrong
is reported.
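The accumulation rule can be sketched with a reduced extent type. The FileExtent enum below is a hypothetical stand-in for the parsed file extent item, carrying only the fields this check needs.

```rust
#[derive(Debug, Clone, Copy)]
enum FileExtent {
    Inline { data_len: u64 },
    Regular { disk_bytenr: u64, disk_num_bytes: u64, prealloc: bool },
}

/// Expected nbytes for a file, computed from its extent items.
fn expected_nbytes(extents: &[FileExtent]) -> u64 {
    extents
        .iter()
        .map(|extent| match *extent {
            FileExtent::Inline { data_len } => data_len,
            FileExtent::Regular { disk_bytenr, disk_num_bytes, prealloc } => {
                // Prealloc extents and holes (disk_bytenr == 0) do not count.
                if prealloc || disk_bytenr == 0 {
                    0
                } else {
                    disk_num_bytes
                }
            }
        })
        .sum()
}
```

Only written, on-disk extents and inline payloads contribute, so a file consisting entirely of holes or preallocated space is expected to have nbytes of 0.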
Orphan directory entries
When processing DIR_ITEM and DIR_INDEX items, for each entry whose location
key type is INODE_ITEM and whose target objectid is >= BTRFS_FIRST_FREE_OBJECTID
(256): if the target inode has no INODE_ITEM anywhere in this tree,
DirItemOrphan is reported. Both DIR_ITEM and DIR_INDEX entries are
checked, so an orphan reference in either will be caught.
Error variants produced
- InodeMissing { tree, ino } – an objectid is referenced but has no INODE_ITEM. (Note: this is detected indirectly through DirItemOrphan in the current implementation.)
- NlinkMismatch { tree, ino, expected, found } – the inode’s stored nlink differs from the number of INODE_REF + INODE_EXTREF entries.
- FileExtentOverlap { tree, ino, offset } – two file extent items for the same inode overlap in file offset space.
- DirItemOrphan { tree, parent_ino, name } – a directory entry references an inode that has no INODE_ITEM.
- DirSizeWrong { tree, ino, expected, found } – a directory inode’s stored size does not match the computed sum of DIR_INDEX name lengths times 2.
- NbytesWrong { tree, ino, expected, found } – a file inode’s stored nbytes does not match the computed sum from extent items.
- ReadError { logical, detail } – I/O error reading the FS tree.
Phase 6: Checksums
Source: cli/src/check/csums.rs
Purpose: Walk the checksum tree and optionally verify data block checksums against the actual on-disk data.
Structure of the csum tree
The csum tree contains EXTENT_CSUM items (key type 128). Each item covers a
contiguous range of data sectors:
- Key objectid: BTRFS_EXTENT_CSUM_OBJECTID (fixed constant).
- Key offset: the logical byte address of the first sector covered.
- Item data: packed array of checksums, one per sector. With CRC32C (4 bytes per checksum) and 4K sectors, a single item can cover many sectors.
What it checks
Phase 6a: tree walk and byte counting
Always performed. The phase walks the csum tree and, for each EXTENT_CSUM item,
computes num_csums = item_data_len / csum_size and adds item_data_len to
total_csum_bytes. This total is reported in the final summary.
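The counting step can be sketched like this. `CsumItem`, `CsumTotals`, `count_csum_items`, and `covered_range` are hypothetical names, and the `total_csums` counter is added here only for illustration; the text above mentions only `total_csum_bytes`.

```rust
/// Hypothetical, simplified EXTENT_CSUM item: the key offset plus the raw
/// packed checksum bytes.
struct CsumItem {
    key_offset: u64,
    data: Vec<u8>,
}

/// Totals accumulated during the walk.
#[derive(Debug, Default, PartialEq, Eq)]
struct CsumTotals {
    total_csum_bytes: u64,
    total_csums: u64,
}

/// Walk the items and accumulate byte and checksum counts.
fn count_csum_items(items: &[CsumItem], csum_size: usize) -> CsumTotals {
    let mut totals = CsumTotals::default();
    for item in items {
        let num_csums = item.data.len() / csum_size;
        totals.total_csums += num_csums as u64;
        totals.total_csum_bytes += item.data.len() as u64;
    }
    totals
}

/// Logical byte range `[start, end)` covered by one item: one sector
/// per stored checksum, starting at the key offset.
fn covered_range(item: &CsumItem, csum_size: usize, sectorsize: u64) -> (u64, u64) {
    let num_csums = (item.data.len() / csum_size) as u64;
    (item.key_offset, item.key_offset + num_csums * sectorsize)
}
```

With 4-byte checksums and 4K sectors, an item holding 8 bytes of checksum data covers two sectors, or 8192 bytes of file data.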
Phase 6b: data verification (optional)
Only performed when --check-data-csum is passed. Only supported for CRC32C
checksums; other checksum types emit a warning and skip verification.
For each csum item, the phase iterates over every sector:
- Compute the logical address: item.key.offset + i * sectorsize.
- Read sectorsize bytes from that logical address via reader.read_data.
- Compute btrfs_csum_data(&data) (standard CRC32C).
- Compare to the stored checksum (extracted from the item data at offset i * csum_size).
- If they differ, or if the read fails, report CsumMismatch.
The btrfs_csum_data function uses the standard CRC32C computation (Castagnoli
polynomial, seed = 0xFFFFFFFF, final XOR), matching the kernel’s checksum for
tree blocks and data. This is distinct from the raw CRC32C used in send streams.
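The per-sector verification loop can be sketched as follows. This is an illustration under stated assumptions, not the checker's code: `crc32c` is a plain bitwise implementation standing in for `btrfs_csum_data`, `verify_item` is a hypothetical helper, the `read_data` closure stands in for `reader.read_data`, and the stored checksum is assumed to be the little-endian encoding of the CRC.

```rust
/// CRC32C checksum size in bytes.
const CSUM_SIZE: usize = 4;

/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78) with
/// seed 0xFFFFFFFF and a final XOR. Slow but easy to audit.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Verify every sector covered by one csum item. Returns the logical
/// addresses that mismatch or could not be read.
/// `read_data(logical, len)` returns `None` on a read failure.
fn verify_item<F>(key_offset: u64, csums: &[u8], sectorsize: u64, mut read_data: F) -> Vec<u64>
where
    F: FnMut(u64, usize) -> Option<Vec<u8>>,
{
    let mut mismatches = Vec::new();
    for (i, stored) in csums.chunks_exact(CSUM_SIZE).enumerate() {
        let logical = key_offset + i as u64 * sectorsize;
        let ok = match read_data(logical, sectorsize as usize) {
            // Assumption: checksums are stored little-endian.
            Some(data) => crc32c(&data).to_le_bytes()[..] == stored[..],
            None => false,
        };
        if !ok {
            mismatches.push(logical);
        }
    }
    mismatches
}
```

A common sanity check for any CRC32C implementation is the ASCII string `123456789`, whose checksum is 0xE3069283.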
Error variants produced
- CsumMismatch { logical } – the computed CRC32C of the data at the given logical address does not match the stored checksum, or the data could not be read.
- ReadError { logical, detail } – I/O error reading the csum tree itself.
Phase 7: Root refs
Source: cli/src/check/root_refs.rs
Purpose: Verify that ROOT_REF and ROOT_BACKREF items in the root tree
are consistent with each other.
Background
In btrfs, subvolume parent-child relationships are recorded in the root tree using two item types:
- ROOT_REF (key type 156): stored with objectid = parent_root_id, offset = child_root_id. Contains the directory ID, sequence number, and name of the directory entry that references the child subvolume.
- ROOT_BACKREF (key type 157): stored with objectid = child_root_id, offset = parent_root_id. Contains the same fields as the corresponding ROOT_REF.
These items form a bidirectional link. For every ROOT_REF there should be a
matching ROOT_BACKREF, and vice versa. The fields (dirid, sequence, name)
should be identical between the pair.
What it checks
The phase walks the root tree and collects all ROOT_REF and ROOT_BACKREF
items into two maps, keyed by (child_root_id, parent_root_id). Both item
types are parsed using RootRef::parse (the on-disk format is identical).
Then two passes are made:
Forward check: every ROOT_REF has a matching ROOT_BACKREF
For each (child, parent) pair in the forward refs map:
- If no entry exists in the back refs map, report RootBackrefMissing.
- If an entry exists, compare the three fields:
  - dirid: if they differ, report RootRefMismatch with “dirid mismatch”.
  - sequence: if they differ, report RootRefMismatch with “sequence mismatch”.
  - name: if they differ, report RootRefMismatch with “name mismatch”.
Each field is checked independently, so a single pair can produce up to 3 mismatch errors.
Reverse check: every ROOT_BACKREF has a matching ROOT_REF
For each (child, parent) pair in the back refs map:
- If no entry exists in the forward refs map, report RootRefMissing.
Field comparison is not repeated in this direction because the forward check already caught any field mismatches for pairs that exist in both maps.
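The two passes can be sketched as below, assuming both maps have already been filled and are keyed by (child_root_id, parent_root_id). The `RootRef` struct, `RootRefError` enum, `RefMap` alias, and `check_root_refs` function are simplified stand-ins written for this example; the exact wording of the detail strings is also an assumption.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a parsed ROOT_REF / ROOT_BACKREF payload.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RootRef {
    dirid: u64,
    sequence: u64,
    name: String,
}

/// Simplified stand-in for the phase 7 error variants.
#[derive(Debug, PartialEq, Eq)]
enum RootRefError {
    RootBackrefMissing { child: u64, parent: u64 },
    RootRefMissing { child: u64, parent: u64 },
    RootRefMismatch { child: u64, parent: u64, detail: String },
}

/// Both maps are keyed by (child_root_id, parent_root_id).
type RefMap = BTreeMap<(u64, u64), RootRef>;

fn check_root_refs(forward: &RefMap, back: &RefMap) -> Vec<RootRefError> {
    let mut errors = Vec::new();
    // Forward pass: every ROOT_REF needs a matching ROOT_BACKREF.
    for (&(child, parent), fwd) in forward {
        let Some(bwd) = back.get(&(child, parent)) else {
            errors.push(RootRefError::RootBackrefMissing { child, parent });
            continue;
        };
        // Each field is compared independently: up to three errors per pair.
        let mut mismatch = |detail: String| {
            errors.push(RootRefError::RootRefMismatch { child, parent, detail });
        };
        if fwd.dirid != bwd.dirid {
            mismatch(format!("dirid mismatch: {} vs {}", fwd.dirid, bwd.dirid));
        }
        if fwd.sequence != bwd.sequence {
            mismatch(format!("sequence mismatch: {} vs {}", fwd.sequence, bwd.sequence));
        }
        if fwd.name != bwd.name {
            mismatch(format!("name mismatch: {:?} vs {:?}", fwd.name, bwd.name));
        }
    }
    // Reverse pass: only existence is checked.
    for &(child, parent) in back.keys() {
        if !forward.contains_key(&(child, parent)) {
            errors.push(RootRefError::RootRefMissing { child, parent });
        }
    }
    errors
}
```

Keying both maps by the same (child, parent) tuple keeps each lookup to a single map access, even though the two item types store the IDs in opposite key positions on disk.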
Error variants produced
- RootRefMissing { child, parent } – a ROOT_BACKREF exists for this child/parent pair but no corresponding ROOT_REF was found.
- RootBackrefMissing { child, parent } – a ROOT_REF exists for this child/parent pair but no corresponding ROOT_BACKREF was found.
- RootRefMismatch { child, parent, detail } – both ROOT_REF and ROOT_BACKREF exist but one of their fields (dirid, sequence, or name) differs. The detail string describes which field mismatched and shows both values.
- ReadError { logical, detail } – I/O error reading the root tree.
Complete error type reference
All error variants are defined in cli/src/check/errors.rs as the CheckError
enum. Each variant implements Display for human-readable error messages.
Phase 1 errors
| Variant | Fields | Description |
|---|---|---|
| SuperblockInvalid | mirror: u32, detail: String | Superblock mirror failed validation (bad magic, bad checksum, or read error) |
Phase 2 errors
| Variant | Fields | Description |
|---|---|---|
| TreeBlockChecksumMismatch | tree: &'static str, logical: u64 | CRC32C checksum does not match |
| TreeBlockBadFsid | tree: &'static str, logical: u64 | Header fsid does not match filesystem |
| TreeBlockBadBytenr | tree: &'static str, logical: u64, header_bytenr: u64 | Header bytenr disagrees with read address |
| TreeBlockBadGeneration | tree: &'static str, logical: u64, block_gen: u64, super_gen: u64 | Block generation exceeds superblock generation |
| TreeBlockBadLevel | tree: &'static str, logical: u64, detail: String | Level/type mismatch (leaf with level>0 or node with level==0) |
| KeyOrderViolation | tree: &'static str, logical: u64, index: usize | Key at index is not strictly greater than previous key |
Phase 3 errors
| Variant | Fields | Description |
|---|---|---|
| ExtentRefMismatch | bytenr: u64, expected: u64, found: u64 | Declared refs != counted refs (inline + standalone) |
| MissingExtentItem | bytenr: u64 | Tree block has no extent/metadata item in extent tree |
| BackrefOwnerMismatch | bytenr: u64, actual_owner: u64, claimed_owners: Vec<u64> | Actual tree block owner not in extent tree’s backref list |
| BackrefOrphan | bytenr: u64, claimed_owner: u64 | Extent tree claims a backref but no tree block found |
| OverlappingExtent | bytenr: u64, length: u64, prev_end: u64 | Data extent overlaps with previous extent |
Phase 4 errors
| Variant | Fields | Description |
|---|---|---|
| ChunkMissingBlockGroup | logical: u64 | Chunk has no matching block group item |
| BlockGroupMissingChunk | logical: u64 | Block group has no matching chunk |
| DeviceExtentOverlap | devid: u64, offset: u64 | Two device extents overlap on the same device |
Phase 5 errors
| Variant | Fields | Description |
|---|---|---|
| InodeMissing | tree: u64, ino: u64 | Inode referenced but has no INODE_ITEM |
| NlinkMismatch | tree: u64, ino: u64, expected: u32, found: u32 | Stored nlink differs from counted references |
| FileExtentOverlap | tree: u64, ino: u64, offset: u64 | File extent items overlap in file offset space |
| DirItemOrphan | tree: u64, parent_ino: u64, name: String | Dir entry references non-existent inode |
| DirSizeWrong | tree: u64, ino: u64, expected: u64, found: u64 | Directory inode size does not match DIR_INDEX name sum |
| NbytesWrong | tree: u64, ino: u64, expected: u64, found: u64 | File inode nbytes does not match extent sum |
Phase 6 errors
| Variant | Fields | Description |
|---|---|---|
| CsumMismatch | logical: u64 | Data checksum does not match stored value |
Phase 7 errors
| Variant | Fields | Description |
|---|---|---|
| RootRefMissing | child: u64, parent: u64 | ROOT_BACKREF exists but no matching ROOT_REF |
| RootBackrefMissing | child: u64, parent: u64 | ROOT_REF exists but no matching ROOT_BACKREF |
| RootRefMismatch | child: u64, parent: u64, detail: String | ROOT_REF and ROOT_BACKREF fields disagree |
Cross-phase error
| Variant | Fields | Description |
|---|---|---|
| ReadError | logical: u64, detail: String | I/O error reading any tree block (used in phases 2-7) |
Summary output
After all phases complete, CheckResults::print_summary writes to stdout:
found <bytes_used> bytes used, <error_count> error(s) found
total csum bytes: <total_csum_bytes>
total tree bytes: <total_tree_bytes>
total fs tree bytes: <total_fs_tree_bytes>
total extent tree bytes: <total_extent_tree_bytes>
btree space waste bytes: <btree_space_waste>
file data blocks allocated: <data_bytes_allocated>
referenced <data_bytes_referenced>
If error_count > 0, the process exits with code 1.
Limitations and future work
The following checks from the C reference implementation are not yet implemented:
- --mode lowmem differentiation (the current implementation uses the “original” mode approach of collecting all items then cross-checking).
- Log tree checking (the log tree is not walked in phase 2).
- --repair (all checking is read-only).
- --backup / --tree-root / --chunk-root (alternate root selection).
- --init-csum-tree / --init-extent-tree (destructive reconstruction).
- --qgroup-report (quota group consistency checking).
- --subvol-extents (per-subvolume extent sharing analysis).
- Superblock generation cross-checking between mirror copies.
- Block group used-bytes verification (comparing declared used in block group items against actual allocated extents).