ext4 Journaling and Crash Recovery

JBD2 transaction lifecycle: how ext4 survives power failures

Why journaling?

Without journaling, a power failure mid-write can leave the filesystem in an inconsistent state:

Without journal:
  1. Write inode (update file size)          ← power fail here
  2. Write data block
  3. Update block bitmap

After crash: the inode records the new size, but the data block was never written and the block bitmap still marks it free.
fsck must scan the entire filesystem to find and fix inconsistencies.
On a 4TB filesystem: hours.

With journaling, changes are written to the journal (a circular log) first and only later to their final locations. After a crash, the kernel replays the journal at the next mount, which typically takes seconds.

ext4 journal modes

# Set the journal mode at mount time:
mount -o data=ordered /dev/sda1 /mnt    # default
mount -o data=writeback /dev/sda1 /mnt  # fastest, weakest data guarantees
mount -o data=journal /dev/sda1 /mnt    # slowest, strongest guarantees

Mode	What's journaled	Performance	Crash behavior
`writeback`	Metadata only; data may reach disk after the metadata that references it	Fastest	Files may contain stale or garbage blocks
`ordered`	Metadata only; data is written before the metadata that references it is committed	Good	Metadata never points at unwritten data
`journal`	Both metadata and data	Slowest	Strongest; data and metadata are replayed together

Default is ordered: metadata changes are atomic, data blocks are guaranteed to be written before the metadata that references them.
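As a small illustration of how the three modes are selected, the sketch below extracts the data= mode from a comma-separated mount option string and falls back to ordered when the option is absent. The function name ext4_data_mode is hypothetical and not part of any ext4 tool; the option string is assumed to come from something like the mount command line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any ext4 tool): returns the data=
 * journaling mode named in a comma-separated mount option string, or
 * "ordered" (the ext4 default) when no data= option is present.
 * Returns NULL for an unrecognized data= value. */
const char *ext4_data_mode(const char *opts)
{
    static const char *const modes[] = { "ordered", "writeback", "journal" };
    assert(opts != NULL);

    const char *p = opts;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");           /* length of this option */
        if (len > 5 && strncmp(p, "data=", 5) == 0) {
            for (size_t i = 0; i < sizeof modes / sizeof modes[0]; i++) {
                if (strlen(modes[i]) == len - 5 &&
                    strncmp(p + 5, modes[i], len - 5) == 0)
                    return modes[i];
            }
            return NULL;                        /* unknown data= value */
        }
        p += len;
        if (*p == ',')
            p++;                                /* skip the separator */
    }
    return "ordered";
}
```

Options that merely start with "data", such as data_err=abort, are skipped because the match requires the exact prefix "data=".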

JBD2: the journaling layer

ext4 uses JBD2 (Journaling Block Device 2) as its journaling layer. JBD2 is a generic layer, independent of ext4; OCFS2 uses it as well.

Journal layout on disk

Ext4 partition layout:
┌────────────────────────────────────────────────────┐
│ Superblock │ Block groups │ Journal (special inode) │
│            │ [ext4 data]  │ [JBD2 circular log]     │
└────────────────────────────────────────────────────┘

Journal structure (circular log):
┌──────────────────────────────────────────────────────────┐
│ Journal superblock │ T1: [commit block] │ T2: [commit] │..│
│ (seq, tail_sequence, s_start)                            │
└──────────────────────────────────────────────────────────┘

T1 (transaction 1):
  ┌─────────────────────────────────────────────────────┐
  │ descriptor block │ data/metadata blocks │ commit block│
  │ (lists what blocks follow) │ (new block contents) │ (seq #)│
  └─────────────────────────────────────────────────────┘

The journal superblock records s_start (the log block where the oldest live transaction begins) and s_sequence (the transaction ID expected at that block). Recovery replays every transaction from s_start up to the last valid commit block.
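Because the log is circular, any position that runs past the end of the journal wraps back to its first usable block. The following is a minimal sketch of that arithmetic, assuming a toy struct whose first and last fields play the role of j_first and j_last (last is exclusive); the struct and function names are illustrative and are not kernel APIs.

```c
#include <assert.h>

/* Toy model of the journal's block range. The fields mirror j_first and
 * j_last from the text, but this struct is illustrative only. */
struct toy_journal {
    unsigned int first;   /* first usable log block */
    unsigned int last;    /* one past the last usable log block */
};

/* Advance a log position by n blocks, wrapping from last back to first. */
unsigned int toy_log_advance(const struct toy_journal *j,
                             unsigned int pos, unsigned int n)
{
    assert(j->first < j->last);
    assert(pos >= j->first && pos < j->last);

    unsigned int len = j->last - j->first;      /* usable log blocks */
    return j->first + (pos - j->first + n) % len;
}

/* Blocks occupied between tail (oldest live block) and head (next write). */
unsigned int toy_log_used(const struct toy_journal *j,
                          unsigned int tail, unsigned int head)
{
    assert(j->first < j->last);

    unsigned int len = j->last - j->first;
    return (head + len - tail) % len;
}
```

With first = 1 and last = 9, advancing position 7 by three blocks lands on block 2. Note that head == tail is ambiguous between an empty and a completely full log in this sketch, which is one reason a real implementation tracks free space separately.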

struct journal_t

/* include/linux/jbd2.h */
typedef struct journal_s {
    /* Journal superblock data */
    unsigned long   j_flags;        /* JBD2_UNMOUNT, JBD2_ABORT, ... */
    int             j_errno;
    struct buffer_head *j_sb_buffer; /* journal superblock buffer */

    /* Transaction state */
    transaction_t  *j_running_transaction;    /* current open transaction */
    transaction_t  *j_committing_transaction; /* being committed */
    transaction_t  *j_checkpoint_transactions; /* committed, awaiting checkpoint */

    /* Journal block range */
    unsigned int    j_maxlen;       /* total journal blocks */
    unsigned int    j_tail;         /* log block of oldest live transaction */
    unsigned int    j_free;         /* free journal blocks */
    unsigned int    j_first;        /* journal start block */
    unsigned int    j_last;         /* journal end block (exclusive) */
    unsigned int    j_head;         /* current write position */

    /* Sequence numbers */
    tid_t           j_tail_sequence;
    tid_t           j_transaction_sequence;
    tid_t           j_commit_sequence;
    tid_t           j_commit_request;

    /* Block device */
    struct block_device *j_dev;     /* journal device */
    struct block_device *j_fs_dev;  /* filesystem device */
    unsigned int        j_blocksize;

    /* kjournald2 commit thread (named jbd2/<dev>) */
    struct task_struct *j_task;

    /* Periodic commit timer */
    struct timer_list   j_commit_timer;
} journal_t;
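The sequence-number fields are 32-bit counters, so comparisons between them have to survive wraparound. The sketch below is a userspace illustration modeled on the kernel's tid_gt() helper; the toy_ names are hypothetical, and toy_commit_pending is only an assumed example of how j_commit_sequence and j_commit_request might be compared.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t toy_tid_t;   /* stand-in for the kernel's tid_t */

static_assert(sizeof(toy_tid_t) == 4, "transaction IDs must be 32 bits");

/* True if transaction ID x is newer than y, even across 32-bit wraparound.
 * The subtraction is done in unsigned arithmetic and the result is then
 * reinterpreted as signed, so IDs less than 2^31 apart compare correctly. */
int toy_tid_gt(toy_tid_t x, toy_tid_t y)
{
    int32_t difference = (int32_t)(x - y);
    return difference > 0;
}

/* True if a commit has been requested beyond the last completed one. */
int toy_commit_pending(toy_tid_t commit_sequence, toy_tid_t commit_request)
{
    return toy_tid_gt(commit_request, commit_sequence);
}
```

A plain x > y would report that ID 2 is older than ID 0xFFFFFFFE, although it was issued four transactions later; the signed-difference form handles that case.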

struct transaction_t

typedef struct transaction_s {
    journal_t      *t_journal;
    tid_t           t_tid;          /* transaction ID */
    enum {
        T_RUNNING,          /* accepting new handles */
        T_LOCKED,           /* waiting for handles to close */
        T_SWITCH,
        T_FLUSH,
        T_COMMIT,
        T_COMMIT_DFLUSH,
        T_COMMIT_JFLUSH,
        T_COMMIT_CALLBACK,
        T_FINISHED,
    } t_state;

    struct journal_head *t_buffers;         /* metadata buffers modified in this transaction */
    struct journal_head *t_forget;          /* buffers to unlink after commit */
    struct journal_head *t_checkpoint_list; /* buffers still awaiting checkpoint */

    atomic_t        t_handle_count;        /* handles that have used this transaction */
    atomic_t        t_outstanding_credits; /* journal blocks reserved by handles */
} transaction_t;

Transaction lifecycle

Phase 1: begin

/* Start a handle, joining (or creating) the running transaction */
handle_t *handle = jbd2_journal_start(journal, nblocks);
/* nblocks: worst-case number of blocks this handle will modify */
/* Returns a handle (or an ERR_PTR on failure); increments t_handle_count */

Phase 2: modify buffers

/* Tell JBD2 we are about to modify a buffer: */
jbd2_journal_get_write_access(handle, bh);
/* Attaches bh to the running transaction. If bh is still part of the */
/* committing transaction, JBD2 preserves a frozen copy for that commit. */

/* Modify the buffer: */
ext4_set_bit(bit, bh->b_data);

/* Tell JBD2 the modification is complete: */
jbd2_journal_dirty_metadata(handle, bh);
/* Files bh on the transaction's metadata list (t_buffers) and uses one credit */

/* Close the handle so the transaction can eventually commit: */
jbd2_journal_stop(handle);
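
The role of nblocks can be illustrated with a toy credit model in userspace C. It assumes, as a simplification, that one credit is charged the first time a buffer is dirtied and that dirtying the same buffer again is free; toy_handle, toy_buffer, and the toy_ functions are hypothetical stand-ins and not JBD2 APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for handle_t and a journaled buffer; illustrative only. */
struct toy_handle { int credits; };     /* blocks the handle may still dirty */
struct toy_buffer { bool modified; };   /* already charged to the transaction? */

/* Reserve worst-case credits up front, as the nblocks argument does. */
struct toy_handle toy_start(int nblocks)
{
    assert(nblocks > 0);
    return (struct toy_handle){ .credits = nblocks };
}

/* Charge one credit the first time a buffer is dirtied.
 * Returns 0 on success or -ENOSPC when the reservation was too small. */
int toy_dirty_metadata(struct toy_handle *h, struct toy_buffer *b)
{
    if (b->modified)
        return 0;               /* same buffer again: no extra credit */
    if (h->credits <= 0)
        return -ENOSPC;
    b->modified = true;
    h->credits--;
    return 0;
}
```

Under this model a handle started with one credit can dirty the same buffer repeatedly, but touching a second buffer fails, which is why callers pass a worst-case estimate.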

Phase 3: commit (kjournald2)

/* kjournald2 thread (fs/jbd2/journal.c), simplified */
static int kjournald2(void *arg)
{
    journal_t *journal = arg;
    transaction_t *transaction;

    for (;;) {
        /* Wait for commit needed or timer */
        wait_event_interruptible_timeout(journal->j_wait_commit, ...);

        /* Commit the current transaction */
        jbd2_journal_commit_transaction(journal);
    }
}

Commit phases:

  1. Close: stop accepting new handles for this transaction (T_LOCKED)
  2. Write descriptor: write the descriptor block listing all modified buffers
  3. Write blocks: write all modified metadata buffers to the journal
  4. Flush: wait for all journal writes to complete
  5. Write commit block: write the commit block with the transaction's sequence number
  6. Checkpoint: once the commit block is on disk, the filesystem blocks can be written in place
  7. Free: after checkpoint, free journal space

Phase 4: checkpoint

After a transaction commits, its modified blocks must still be written to their home locations in the filesystem (checkpointed) before the journal space they occupy can be reused:

/* The transaction's modified blocks are eventually written to
   their real filesystem locations (checkpointed) */
jbd2_log_do_checkpoint(journal);

/* After checkpoint: journal space is freed for the next transaction */

Crash recovery

# View journal state
debugfs -R "show_super_stats" /dev/sda1 | grep -i journal
# Journal inode: 8
# Journal backup: inode blocks
# Journal features: journal_64bit, journal_fast_commit
# Journal size: 128m
# Journal length: 32768

# Read-only consistency check (-n never replays the journal;
# a normal mount or e2fsck without -n does):
e2fsck -n /dev/sda1 2>&1 | head -20
# Pass 1: Checking inodes, blocks, and sizes
# Pass 2: Checking directory structure
# ...

Kernel journal recovery

/* fs/jbd2/recovery.c (simplified) */
int jbd2_journal_recover(journal_t *journal)
{
    struct recovery_info info = {};
    int err, err2;

    /* Pass 1: scan the log to find the last committed transaction */
    err = do_one_pass(journal, &info, PASS_SCAN);
    /* Result: info.end_transaction */

    /* Pass 2: collect revoke records (blocks that were freed and must not be replayed) */
    if (!err)
        err = do_one_pass(journal, &info, PASS_REVOKE);

    /* Pass 3: copy the remaining logged blocks to their real filesystem locations */
    if (!err)
        err = do_one_pass(journal, &info, PASS_REPLAY);

    jbd2_journal_clear_revoke(journal);
    err2 = sync_blockdev(journal->j_fs_dev);
    return err ? err : err2;
}

Recovery is O(journal size), not O(filesystem size). Even a 2TB filesystem recovers in seconds because only the journal (commonly 128MB, up to 1GB on large filesystems created with recent e2fsprogs) is replayed.
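
The three passes can be sketched with a toy in-memory journal. This is an illustration under simplifying assumptions, not the kernel's algorithm: each record carries a single int instead of a block image, transaction IDs are compared without wraparound handling, and every toy_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_rec_type { TOY_BLOCK, TOY_REVOKE, TOY_COMMIT };

/* One journal record in this toy model (a real journal stores descriptor
 * blocks, full block images, revoke blocks and commit blocks instead). */
struct toy_rec {
    enum toy_rec_type type;
    uint32_t tid;        /* transaction the record belongs to */
    size_t   blocknr;    /* home location on the toy "disk" */
    int      value;      /* new contents for TOY_BLOCK records */
};

/* Pass 1 (scan): return the number of records covered by committed
 * transactions. IDs must start at first_tid and increase by one per
 * commit; anything after the last valid commit is ignored. */
static size_t toy_scan(const struct toy_rec *log, size_t n, uint32_t first_tid)
{
    size_t end = 0;
    uint32_t expect = first_tid;
    for (size_t i = 0; i < n; i++) {
        if (log[i].tid != expect)
            break;                          /* stale or torn record */
        if (log[i].type == TOY_COMMIT) {
            end = i + 1;
            expect++;
        }
    }
    return end;
}

/* Pass 2 (revoke): is this block revoked by a record in the same or a
 * later committed transaction? */
static int toy_revoked(const struct toy_rec *log, size_t end,
                       size_t blocknr, uint32_t tid)
{
    for (size_t i = 0; i < end; i++)
        if (log[i].type == TOY_REVOKE && log[i].blocknr == blocknr &&
            log[i].tid >= tid)
            return 1;
    return 0;
}

/* Pass 3 (replay): copy surviving block records to their home locations.
 * Returns the number of blocks written. */
size_t toy_recover(const struct toy_rec *log, size_t n, uint32_t first_tid,
                   int *disk, size_t disk_len)
{
    assert(log != NULL && disk != NULL);

    size_t end = toy_scan(log, n, first_tid);
    size_t written = 0;
    for (size_t i = 0; i < end; i++) {
        if (log[i].type != TOY_BLOCK || log[i].blocknr >= disk_len)
            continue;
        if (toy_revoked(log, end, log[i].blocknr, log[i].tid))
            continue;
        disk[log[i].blocknr] = log[i].value;
        written++;
    }
    return written;
}
```

A transaction whose commit record never reached the log is simply ignored, which is what makes each transaction all-or-nothing after a crash. The work done is bounded by the number of journal records, independent of the size of the disk array.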

Fast commit (Linux 5.10+)

JBD2 full commits are expensive (hundreds of microseconds). Fast commit is an optimized path for common operations:

/* fs/ext4/fast_commit.c */
/* Fast commit records only what changed (delta), not full blocks: */
/* - Inode changes (truncate, setattr)           */
/* - Directory entry changes (create, unlink)    */

/* Much smaller journal writes: ~1KB vs ~4KB per commit */

# Enable fast commit on an existing filesystem (it is not on by default):
tune2fs -O fast_commit /dev/sda1
# Or enable it when creating a new filesystem (destroys existing data):
mkfs.ext4 -O fast_commit /dev/sda1

# Check:
tune2fs -l /dev/sda1 | grep "Filesystem features"
# Filesystem features: fast_commit ...

ext4 journaling vs fsync

/* Application: ensure data durability */
int fd = open("file", O_WRONLY);
write(fd, data, len);
fsync(fd);     /* blocks until data AND metadata are on disk */
/* Without fsync: data may still be only in the page cache */

/* fdatasync: data plus only the metadata needed to read it back (e.g. file size) */
fdatasync(fd);

/* O_SYNC: every write() behaves as if followed by fsync() */
fd = open("file", O_WRONLY | O_SYNC);

The journal only guarantees metadata consistency after crash. Data durability requires fsync() or O_SYNC.
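
A slightly fuller sketch of the fsync pattern, with error handling, might look like the following. It uses only standard POSIX calls; the helper name durable_write and its dirpath parameter are hypothetical, and syncing the parent directory is included on the assumption that the file may be newly created.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path and return 0 only after
 * fsync() has succeeded on both the file and its directory. dirpath must
 * be the directory that contains path. Returns -1 on any error. */
int durable_write(const char *dirpath, const char *path,
                  const void *buf, size_t len)
{
    assert(dirpath != NULL && path != NULL && (buf != NULL || len == 0));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {                       /* write() may be partial */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;                   /* interrupted: retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) != 0) {                   /* file data and inode metadata */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* A newly created name lives in the parent directory, so sync that too. */
    int dfd = open(dirpath, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc == 0 ? 0 : -1;
}
```

A call such as durable_write("/tmp", "/tmp/example.dat", data, len) returns 0 only once both fsync() calls have succeeded; until then the caller should not assume the data would survive a power failure.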

Monitoring journal performance

# Journal statistics (directory name is <dev>-<journal inode>, e.g. sda1-8)
cat /proc/fs/jbd2/<dev>-8/info
# Reports the number of transactions plus per-transaction averages
# such as commit time, handles, and logged blocks

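# Trace journal commits and their latency
bpftrace -e '
kprobe:jbd2_journal_commit_transaction { @start[tid] = nsecs; }
kretprobe:jbd2_journal_commit_transaction /@start[tid]/ {
    @commits = count();
    @commit_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'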

# iostat: journal writes appear as writes to the same device
iostat -x 1

# blktrace: journal writes are issued by the jbd2/<dev> kernel thread
blktrace -d /dev/sda1 -o - | blkparse -i - | grep jbd2 | head -50