Chapter 13. The Virtual Filesystem

"Although the kernel developers may shun C++ and other explicitly object-oriented languages, thinking in terms of objects is often useful." (Robert Love)

The Virtual Filesystem (VFS), sometimes called the Virtual File Switch, is the subsystem of the kernel that implements the file and filesystem-related interfaces to user-space. All filesystems rely on the VFS not only to coexist but also to interoperate. This enables programs to use standard Unix system calls to read and write to different filesystems, even on different media.

Common Filesystem Interface

The VFS is the glue that enables system calls such as open(), read(), and write() to work regardless of the filesystem or underlying physical medium. The system calls work between these different filesystems and media; we can use standard system calls to copy or move files from one filesystem to another. Modern operating systems, such as Linux, abstract access to the filesystems via a virtual interface so that such interoperation and generic access are possible. [p262]

New filesystems and new varieties of storage media can find their way into Linux, and programs need not be rewritten or even recompiled.

The VFS and the block I/O layer (Chapter 14) provide the abstractions, interfaces, and glue that allow user-space programs to issue generic system calls to access files via a uniform naming policy on any filesystem, which itself exists on any storage medium.

Filesystem Abstraction Layer

Such a generic interface for any type of filesystem is feasible only because the kernel implements an abstraction layer around its low-level filesystem interface. This abstraction layer enables Linux to support different filesystems, even if they differ in supported features or behavior.

This is possible because the VFS provides a common file model that can represent any filesystem’s general feature set and behavior. It is biased toward Unix-style filesystems. Regardless, wildly differing filesystem types are still supportable in Linux, from DOS’s FAT to Windows’s NTFS to many Unix-style and Linux-specific filesystems.

The abstraction layer works by defining the basic conceptual interfaces and data structures that all filesystems support. The actual filesystem code hides the implementation details. To the VFS layer and the rest of the kernel, each filesystem looks the same. They all support notions such as files and directories, and they all support operations such as creating and deleting files.

The result is a general abstraction layer that enables the kernel to support many types of filesystems easily and cleanly. The filesystems are programmed to provide the abstracted interfaces and data structures the VFS expects; in turn, the kernel easily works with any filesystem, and the exported user-space interface seamlessly works on any filesystem.

Nothing in the kernel needs to understand the underlying details of the filesystems, except the filesystems themselves. For example, consider a simple user-space program that does:

ret = write(fd, buf, len);

This system call writes the len bytes pointed to by buf into the current position in the file represented by the file descriptor fd.

  1. This system call is first handled by a generic sys_write() system call that determines the actual file writing method for the filesystem on which fd resides.
  2. The generic write system call then invokes this method, which is part of the filesystem implementation, to write the data to the media (or whatever this filesystem does on write).
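Conceptually, step 2 is an indirect function call through the file's operations table. The following is a heavily simplified sketch of the generic layer's dispatch; the real vfs_write() in fs/read_write.c also performs permission, mode, and range checks, so treat this as an illustration rather than the kernel's exact code:

ssize_t vfs_write(struct file *file, const char __user *buf,
                  size_t count, loff_t *pos)
{
    /* The filesystem must have registered a write method. */
    if (!file->f_op || !file->f_op->write)
        return -EINVAL;

    /* Dispatch to the filesystem-specific implementation. */
    return file->f_op->write(file, buf, count, pos);
}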

The following figure shows the flow from user-space’s write() call through the data arriving on the physical media. On one side of the system call is the generic VFS interface, providing the frontend to user-space; on the other side of the system call is the filesystem-specific backend, dealing with the implementation details.

Figure 13.2 The flow of data from user-space issuing a write() call, through the VFS’s generic system call, into the filesystem’s specific write method, and finally arriving at the physical media.

Unix Filesystems

Historically, Unix has provided four basic filesystem-related abstractions: files, directory entries, inodes, and mount points.

Filesystem and namespace *

A filesystem is a hierarchical storage of data adhering to a specific structure. Filesystems contain files, directories, and associated control information. Typical operations performed on filesystems are creation, deletion, and mounting. In Unix, filesystems are mounted at a specific mount point in a global hierarchy known as a namespace. (Linux has made this hierarchy per-process, to give a unique namespace to each process. Because each process inherits its parent’s namespace unless you specify otherwise, there is seemingly one global namespace.) This enables all mounted filesystems to appear as entries in a single tree.

Contrast this single, unified tree with the behavior of DOS and Windows, which break the file namespace up into drive letters, such as C:. This breaks the namespace up among device and partition boundaries, "leaking" hardware details into the filesystem abstraction. As this delineation may be arbitrary and even confusing to the user, it is inferior to Linux’s unified namespace.
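Mounting itself is a single system call. Below is a minimal user-space sketch; the device /dev/sdb1, the ext3 type, and the mount point /mnt/data are assumptions for illustration, and the call requires root privileges:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Graft the filesystem on /dev/sdb1 into the namespace at /mnt/data. */
    if (mount("/dev/sdb1", "/mnt/data", "ext3", 0, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}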

Files and directories *

A file is an ordered string of bytes. The first byte marks the beginning of the file, and the last byte marks the end of the file. Each file is assigned a human-readable name for identification by both the system and the user. Typical file operations are read, write, create, and delete. The Unix concept of the file is in stark contrast to record-oriented filesystems, such as OpenVMS’s Files-11. Record-oriented filesystems provide a richer, more structured representation of files than Unix’s simple byte-stream abstraction, at the cost of simplicity and flexibility.

Files are organized in directories. A directory is analogous to a folder and usually contains related files. Directories can also contain other directories, called subdirectories. In this fashion, directories may be nested to form paths. Each component of a path is called a directory entry. A path example is /home/wolfman/butter: the root directory /, the directories home and wolfman, and the file butter are all directory entries, called dentries. In Unix, directories are actually normal files that simply list the files contained therein. Because a directory is a file to the VFS, the same operations performed on files can be performed on directories.

File metadata, inode and superblock *

Unix systems separate the concept of a file from any associated information about it, such as access permissions, size, owner, creation time, and so on. This information is sometimes called file metadata (data about the file’s data) and is stored in a separate data structure from the file, called the inode. This name is short for index node, although these days the term inode is much more ubiquitous.

All this information is tied together with the filesystem’s own control information, which is stored in the superblock. The superblock is a data structure containing information about the filesystem as a whole. Sometimes the collective data is referred to as filesystem metadata. Filesystem metadata includes information about both the individual files and the filesystem as a whole.
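Both kinds of metadata are visible from user space: stat(2) returns per-file information served from the inode, while statfs(2) returns filesystem-wide information served from the superblock. A minimal sketch, assuming the file /home/wolfman/butter from the earlier example exists:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/vfs.h>

int main(void)
{
    struct stat st;
    struct statfs sfs;

    /* Per-file metadata: inode number, size, permissions. */
    if (stat("/home/wolfman/butter", &st) == 0)
        printf("inode %lu, size %lld bytes, mode %o\n",
               (unsigned long) st.st_ino,
               (long long) st.st_size,
               (unsigned) st.st_mode);

    /* Filesystem-wide metadata: block size and magic number. */
    if (statfs("/home/wolfman/butter", &sfs) == 0)
        printf("block size %ld, fs magic 0x%lx\n",
               (long) sfs.f_bsize,
               (unsigned long) sfs.f_type);

    return 0;
}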

Traditionally, Unix filesystems implement these notions as part of their physical on-disk layout. For example, file information is stored as an inode in a separate block on the disk; directories are files; control information is stored centrally in a superblock, and so on. The Unix file concepts are physically mapped onto the storage medium.

The Linux VFS is designed to work with filesystems that understand and implement such concepts. Non-Unix filesystems, such as FAT or NTFS, still work in Linux, but their filesystem code must provide the appearance of these concepts. For example, even if a filesystem does not support distinct inodes, it must assemble the inode data structure in memory as if it did. Similarly, if a filesystem treats directories as special objects, it must represent them to the VFS as mere files. Often, this involves some special processing done on the fly by the non-Unix filesystems to cope with the Unix paradigm and the requirements of the VFS. Such filesystems still work, however, and the overhead is not unreasonable.

VFS Objects and Their Data Structures

The VFS is object-oriented. A family of data structures represents the common file model. These data structures are akin to objects. Because the kernel is programmed strictly in C, without the benefit of a language directly supporting object-oriented paradigms, the data structures are represented as C structures. The structures contain both data and pointers to filesystem-implemented functions that operate on the data.

The four primary object types of the VFS are:

- The superblock object, which represents a specific mounted filesystem.
- The inode object, which represents a specific file.
- The dentry object, which represents a directory entry, which is a single component of a path.
- The file object, which represents an open file as associated with a process.

Because the VFS treats directories as normal files, there is not a specific directory object. A dentry represents a component in a path, which might include a regular file. In other words, a dentry is not the same as a directory, but a directory is just another kind of file.

An operations object is contained within each of these primary objects. These objects describe the methods that the kernel invokes against the primary objects:

- The super_operations object, which contains the methods that the kernel can invoke on a specific filesystem, such as write_inode() and sync_fs().
- The inode_operations object, which contains the methods that the kernel can invoke on a specific file, such as create() and link().
- The dentry_operations object, which contains the methods that the kernel can invoke on a specific directory entry, such as d_compare() and d_delete().
- The file_operations object, which contains the methods that a process can invoke on an open file, such as read() and write().

The operations objects are implemented as a structure of pointers to functions that operate on the parent object. For many methods, the objects can inherit a generic function if basic functionality is sufficient. Otherwise, the specific instance of the particular filesystem fills in the pointers with its own filesystem-specific methods.
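For example, a hypothetical filesystem (call it myfs; all myfs_* names below are invented for illustration) might fill its file operations table with a mix of generic kernel helpers and its own methods; generic_file_llseek() and generic_file_mmap() are real generic implementations the kernel provides:

const struct file_operations myfs_file_operations = {
    .llseek = generic_file_llseek, /* inherited generic method */
    .mmap   = generic_file_mmap,   /* inherited generic method */
    .read   = myfs_read,           /* filesystem-specific method */
    .write  = myfs_write,          /* filesystem-specific method */
    .open   = myfs_open,           /* filesystem-specific method */
};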

Note that objects refer to structures, not explicit class types, such as those in C++ or Java. These structures, however, represent specific instances of an object, their associated data, and methods to operate on themselves. They are very much objects.

The VFS comprises a few other structures beyond the primary objects previously discussed:

- Each registered filesystem is represented by a file_system_type structure, which describes the filesystem and its capabilities.
- Each mount point is represented by the vfsmount structure, which contains information about the mount point, such as its location and mount flags.

Two per-process structures describe the filesystem and files associated with a process: the fs_struct structure and the file structure, respectively. The rest of this chapter discusses these objects and the role they play in implementing the VFS layer.

The Superblock Object

The superblock object is implemented by each filesystem and is used to store information describing that specific filesystem. This object usually corresponds to the filesystem superblock or the filesystem control block, which is stored in a special sector on disk (hence the object’s name). Filesystems that are not disk-based (a virtual memory–based filesystem, such as sysfs) generate the superblock on the fly and store it in memory.

The superblock object is represented by struct super_block and defined in <linux/fs.h>:

include/linux/fs.h#L1319

struct super_block {
    struct list_head s_list; /* list of all superblocks */
    dev_t s_dev; /* identifier */
    unsigned long s_blocksize; /* block size in bytes */
    unsigned char s_blocksize_bits; /* block size in bits */
    unsigned char s_dirt; /* dirty flag */
    unsigned long long s_maxbytes; /* max file size */
    struct file_system_type *s_type; /* filesystem type */
    const struct super_operations *s_op; /* superblock methods */
    struct dquot_operations *dq_op; /* quota methods */
    struct quotactl_ops *s_qcop; /* quota control methods */
    struct export_operations *s_export_op; /* export methods */
    unsigned long s_flags; /* mount flags */
    unsigned long s_magic; /* filesystem’s magic number */
    struct dentry *s_root; /* directory mount point */
    struct rw_semaphore s_umount; /* unmount semaphore */
    struct semaphore s_lock; /* superblock semaphore */
    int s_count; /* superblock ref count */
    int s_need_sync; /* not-yet-synced flag */
    atomic_t s_active; /* active reference count */
    void *s_security; /* security module */
    struct xattr_handler **s_xattr; /* extended attribute handlers */
    struct list_head s_inodes; /* list of inodes */
    struct list_head s_dirty; /* list of dirty inodes */
    struct list_head s_io; /* list of writebacks */
    struct list_head s_more_io; /* list of more writeback */
    struct hlist_head s_anon; /* anonymous dentries */
    struct list_head s_files; /* list of assigned files */
    struct list_head s_dentry_lru; /* list of unused dentries */
    int s_nr_dentry_unused; /* number of dentries on list */
    struct block_device *s_bdev; /* associated block device */
    struct mtd_info *s_mtd; /* memory technology device info */
    struct list_head s_instances; /* instances of this fs */
    struct quota_info s_dquot; /* quota-specific options */
    int s_frozen; /* frozen status */
    wait_queue_head_t s_wait_unfrozen; /* wait queue on freeze */
    char s_id[32]; /* text name */
    void *s_fs_info; /* filesystem-specific info */
    fmode_t s_mode; /* mount permissions */
    struct semaphore s_vfs_rename_sem; /* rename semaphore */
    u32 s_time_gran; /* granularity of timestamps */
    char *s_subtype; /* subtype name */
    char *s_options; /* saved mount options */
};

The code for creating, managing, and destroying superblock objects lives in fs/super.c. A superblock object is created and initialized via the alloc_super() function. When mounted, a filesystem invokes this function, reads its superblock off of the disk, and fills in its superblock object.
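What "fills in its superblock object" looks like in practice is a fill_super() callback run at mount time. The following is a heavily simplified, hypothetical sketch: the myfs_* names and constants are invented, while d_alloc_root() and iput() are real 2.6-era kernel helpers:

static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
    struct inode *root;

    /* Describe this filesystem instance to the VFS. */
    sb->s_blocksize = MYFS_BLOCK_SIZE;      /* hypothetical constant */
    sb->s_blocksize_bits = MYFS_BLOCK_BITS; /* hypothetical constant */
    sb->s_magic = MYFS_MAGIC;               /* hypothetical constant */
    sb->s_op = &myfs_super_operations;      /* superblock methods */

    /* Read the root inode off the disk (hypothetical helper). */
    root = myfs_read_root_inode(sb);
    if (!root)
        return -EINVAL;

    /* Create the root dentry for this mount. */
    sb->s_root = d_alloc_root(root);
    if (!sb->s_root) {
        iput(root);
        return -ENOMEM;
    }
    return 0;
}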

Superblock Operations

The most important item in the superblock object is s_op, which is a pointer to the superblock operations table. The superblock operations table is represented by struct super_operations and is defined in <linux/fs.h>:

struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*dirty_inode) (struct inode *);
    int (*write_inode) (struct inode *, int);
    void (*drop_inode) (struct inode *);
    void (*delete_inode) (struct inode *);
    void (*put_super) (struct super_block *);
    void (*write_super) (struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_fs) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *);
    int (*remount_fs) (struct super_block *, int *, char *);
    void (*clear_inode) (struct inode *);
    void (*umount_begin) (struct super_block *);
    int (*show_options)(struct seq_file *, struct vfsmount *);
    int (*show_stats)(struct seq_file *, struct vfsmount *);
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
};

Each item in this structure is a pointer to a function that operates on a superblock object. The superblock operations perform low-level operations on the filesystem and its inodes.

When a filesystem needs to perform an operation on its superblock, it follows the pointers from its superblock object to the desired method. For example, if a filesystem wanted to write to its superblock, it would invoke:

sb->s_op->write_super(sb);

In this call, sb is a pointer to the filesystem’s superblock. Following that pointer into s_op yields the superblock operations table and ultimately the desired write_super() function, which is then invoked. Note how the write_super() call must be passed a superblock, despite the method being associated with one. This is because of the lack of object-oriented support in C. In C++, a call such as the following would suffice:

sb.write_super();

In C, there is no way for the method to easily obtain its parent, so you have to pass it.
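Concretely, a filesystem provides these methods by defining a super_operations table and pointing s_op at it. A hypothetical sketch (the myfs_* functions are invented; simple_statfs() is a real generic helper from the kernel's libfs):

static const struct super_operations myfs_super_operations = {
    .alloc_inode   = myfs_alloc_inode,   /* filesystem-specific */
    .destroy_inode = myfs_destroy_inode, /* filesystem-specific */
    .write_inode   = myfs_write_inode,   /* filesystem-specific */
    .write_super   = myfs_write_super,   /* filesystem-specific */
    .put_super     = myfs_put_super,     /* filesystem-specific */
    .statfs        = simple_statfs,      /* generic helper from libfs */
};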

The following are some of the superblock operations that are specified by super_operations:

- alloc_inode(sb): Creates and initializes a new inode object under the given superblock.
- destroy_inode(inode): Deallocates the given inode.
- dirty_inode(inode): Invoked by the VFS when an inode is dirtied (modified). Journaling filesystems such as ext3 and ext4 use this function to perform journal updates.
- write_inode(inode, wait): Writes the given inode to disk. The wait parameter specifies whether the operation should be synchronous.
- drop_inode(inode): Called by the VFS when the last reference to an inode is dropped.
- delete_inode(inode): Deletes the given inode from the disk.
- put_super(sb): Called by the VFS on unmount to release the given superblock object.
- write_super(sb): Updates the on-disk superblock with the specified superblock. The VFS uses this function to synchronize a modified in-memory superblock with the disk.
- sync_fs(sb, wait): Synchronizes filesystem metadata with the on-disk filesystem. The wait parameter specifies whether the operation is synchronous.
- statfs(dentry, statfs): Called by the VFS to obtain filesystem statistics.
- remount_fs(sb, flags, data): Called by the VFS when the filesystem is remounted with new mount options.
- umount_begin(sb): Called by the VFS to interrupt a mount operation. It is used by network filesystems, such as NFS.

The Inode Object

Inode Operations

The Dentry Object

Dentry State

The Dentry Cache

Dentry Operations

The File Object

File Operations

Data Structures Associated with Filesystems

Data Structures Associated with a Process