Skip to content

Path Resolution (namei)

How the kernel converts a string like "/usr/bin/bash" into an inode

The challenge

Path resolution sounds simple: split the path on /, look up each component in the parent directory. But it must handle:

  • Symlinks (which themselves contain paths to resolve)
  • Mount points (crossing into a different filesystem)
  • Concurrent modifications (another process renames a directory while you traverse it)
  • . (current directory) and .. (parent directory)
  • Permission checks at each level

The kernel's path resolution lives in fs/namei.c and is called the namei subsystem (from the historical Unix function name).

Entry point: filename_lookup

Most VFS operations start with a path. The kernel resolves it through:

/* Simplified call chain for open("/usr/bin/bash", O_RDONLY) */
do_sys_open()
   do_filp_open()
     path_openat()
       link_path_walk()    main resolution loop
         walk_component()   resolve one path component
           lookup_fast()    dcache lookup (RCU, lock-free)
           lookup_slow()    filesystem lookup (takes i_rwsem)
       do_last()           handle final component (open file)

The nameidata structure

struct nameidata carries state through the resolution:

/* fs/namei.c */
struct nameidata {
    struct path path;         /* current position: (dentry, vfsmount) */
    struct qstr last;         /* name of last component */
    struct path root;         /* root for this lookup */
    struct inode *inode;      /* inode of current directory */
    unsigned int flags;       /* LOOKUP_* flags */
    unsigned int state;       /* ND_* state bits */
    unsigned int depth;       /* symlink recursion depth */
    int last_type;            /* LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT */
    /* ... */
};

link_path_walk() iterates over path components separated by /:

/* fs/namei.c (simplified) */
static int link_path_walk(const char *name, struct nameidata *nd)
{
    /* Skip leading slashes */
    while (*name == '/')
        name++;

    for (;;) {
        struct qstr this;
        int type;

        /* Get next component (up to next '/') */
        type = hash_name(nd->path.dentry, name, &this);

        /* Special cases: "." and ".." */
        if (type != LAST_NORM) {
            /* handle . and .. */
        }

        /* Look up the component */
        err = walk_component(nd, WALK_MORE);

        /* Advance past the "/" */
        name += this.len;
        while (*name == '/')
            name++;
        if (!*name)
            break;  /* end of path */
    }
    return 0;
}

walk_component: RCU-walk and ref-walk

Each component is looked up in two stages:

Stage 1: RCU walk (lock-free, optimistic)

Linux 2.6.38 introduced RCU-walk by Nick Piggin (LWN), which lets most path lookups complete without taking any lock or incrementing any reference count.

static struct dentry *lookup_fast(struct nameidata *nd)
{
    struct dentry *parent = nd->path.dentry;

    /* Try to find in dcache under RCU (no locks, very fast) */
    dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
    if (!dentry)
        return NULL;  /* miss: fall through to ref-walk */

    /* Validate: did the dentry get deleted while we looked? */
    if (!d_try_get_rcu(dentry, nd->next_seq))
        return NULL;  /* invalidated: retry */

    return dentry;
}

The RCU walk uses d_seq (a seqlock on each dentry) to detect concurrent modifications without taking any lock. This is the fast path — most lookups hit the dcache and return without taking a lock.

Stage 2: Ref walk (takes i_rwsem)

If RCU walk fails (cache miss or retry):

static struct dentry *lookup_slow(const struct qstr *name,
                                   struct dentry *dir, unsigned int flags)
{
    struct inode *inode = dir->d_inode;
    struct dentry *dentry;

    /* Lock directory for reading */
    inode_lock_shared(inode);  /* down_read(&inode->i_rwsem) */

    /* Check dcache again (race: could have been added) */
    dentry = __d_lookup(dir, name);
    if (!dentry) {
        /* Not in cache: call filesystem */
        dentry = d_alloc(dir, name);
        inode->i_op->lookup(inode, dentry, flags);
        /* dentry is now positive (with inode) or negative */
    }

    inode_unlock_shared(inode);
    return dentry;
}

i_op->lookup() calls into the filesystem (e.g., ext4) which reads the directory from disk, finds the entry, and associates its inode with the dentry.

Mount point crossing

When resolution hits a mount point, it silently crosses into the mounted filesystem:

/* At a mountpoint: switch to the new filesystem's root */
if (d_mountpoint(dentry)) {
    /* Follow the mount: nd->path = mnt->mnt_root */
    follow_mount(&nd->path);
    /* Continue resolution in the new filesystem */
}

This is why cd /proc/net works transparently even though /proc is a different filesystem from /.

When a component is a symlink, link_path_walk is called recursively with the symlink's target:

/* Symlink: get the link target and resolve it */
const char *link = nd->inode->i_op->get_link(nd->path.dentry,
                                               nd->inode, &delayed);
/* Recurse into link_path_walk with link as the new path */
nd->depth++;
if (nd->depth > MAXSYMLINKS)  /* default 40 */
    return -ELOOP;
err = link_path_walk(link, nd);
nd->depth--;

LOOKUP flags

LOOKUP_FOLLOW    /* follow final symlink */
LOOKUP_DIRECTORY /* final component must be a directory */
LOOKUP_OPEN      /* called from open(): prepare for file creation */
LOOKUP_CREATE    /* create if not exists */
LOOKUP_EXCL      /* fail if exists (for O_EXCL) */
LOOKUP_REVAL     /* force dentry revalidation (NFS) */
LOOKUP_RCU       /* stay in RCU walk mode if possible */

/proc/self/fd and /proc/self/fdinfo

# Each open file has a symlink in /proc/self/fd/
ls -la /proc/self/fd
# lrwxrwxrwx 1 root 0 -> /dev/null
# lrwxrwxrwx 1 root 1 -> /dev/pts/0
# lrwxrwxrwx 1 root 2 -> /dev/pts/0

# Reading the link resolves to the file's path
readlink /proc/self/fd/0
# /dev/null

The /proc/self/fd/N symlinks use a special resolution path that goes through the process's file descriptor table rather than through the filesystem.

Further reading