Chapter 14. The Block I/O Layer

Block devices are hardware devices distinguished by the random (not necessarily sequential) access of fixed-size chunks of data. The fixed-size chunks of data are called blocks. The most common block device is a hard disk, but many other block devices exist, such as floppy drives, Blu-ray readers, and flash memory. Notice how these are all devices on which you mount a filesystem; filesystems are the lingua franca of block devices.

The other basic type of device is a character device. Character devices, or char devices, are accessed as a stream of sequential data, one byte after another. Example character devices are serial ports and keyboards.

Character Device vs. Block Device *

If the hardware device is accessed as a stream of data, it is implemented as a character device. On the other hand, if the device is accessed randomly (nonsequentially), it is a block device.

The difference comes down to whether the device accesses data randomly, in other words, whether the device can seek to one position from another. [p289]

Managing block devices in the kernel requires more care, preparation, and work than managing character devices. Character devices have only one position, the current one, whereas block devices must be able to navigate back and forth between any location on the media.

The kernel does not have to provide an entire subsystem dedicated to the management of character devices, but block devices receive exactly that. Such a subsystem is a necessity partly because of the complexity of block devices. A large reason for such extensive support is that block devices are quite performance sensitive; getting every last drop out of your hard disk is much more important than squeezing an extra percent of speed out of your keyboard. Furthermore, the complexity of block devices provides a lot of room for such optimizations.

The topic of this chapter is how the kernel manages block devices and their requests. This part of the kernel is known as the block I/O layer.

Anatomy of a Block Device

Sector *

The smallest addressable unit on a block device is a sector. Sector sizes are powers of two, but 512 bytes is the most common size. The sector size is a physical property of the device, and the sector is the fundamental unit of all block devices; the device cannot address or operate on a unit smaller than the sector, although many block devices can operate on multiple sectors at one time. Most block devices have 512-byte sectors, although other sizes are common. For example, many CD-ROM discs have 2-kilobyte sectors.

Block *

The block is the smallest logically addressable unit for software.

The block is an abstraction of the filesystem; filesystems can be accessed only in multiples of a block. Although the physical device is addressable at the sector level, the kernel performs all disk operations in terms of blocks.

Block vs. Sector *

Therefore, block sizes are a power-of-two multiple of the sector size and are not greater than the page size. Common block sizes are 512 bytes, 1 kilobyte, and 4 kilobytes.

Confusion of block and sector *

Some people confusingly refer to sectors and blocks with different names:

This chapter continues to call the two notions sectors and blocks, but you should keep these other terms in mind. Below is a diagram of the relationship between sectors and blocks:

Figure 14.1 Relationship between sectors and blocks.

Hard disk related terminology, such as clusters, cylinders, and heads are specific only to certain block devices and are mostly invisible to user-space software. The reason that the sector is important to the kernel is because all device I/O must be done in units of sectors. In turn, blocks, which is the higher-level concept used by the kernel, are built on top of sectors.

Buffers and Buffer Heads

When a block is stored in memory (e.g. after a read or pending a write), it is stored in a buffer. Each buffer is associated with exactly one block.The buffer serves as the object that represents a disk block in memory. A block is composed of one or more sectors but is no more than a page in size, a single page can hold one or more blocks in memory. Because the kernel requires some associated control information to accompany the data (such as from which block device and which specific block the buffer is), each buffer is associated with a descriptor. This descriptor is called a buffer head and is of type struct buffer_head. The buffer_head structure holds all the information that the kernel needs to manipulate buffers and is defined in <linux/buffer_head.h> (include/linux/buffer_head.h).

include/linux/buffer_head.h#L61

struct buffer_head {
    unsigned long b_state;             /* buffer state flags */
    struct buffer_head *b_this_page;   /* list of page’s buffers */
    struct page *b_page;               /* associated page */
    sector_t b_blocknr;                /* starting block number */
    size_t b_size;                     /* size of mapping */
    char *b_data;                      /* pointer to data within the page */
    struct block_device *b_bdev;       /* associated block device */
    bh_end_io_t *b_end_io;             /* I/O completion */
    void *b_private;                   /* reserved for b_end_io */
    struct list_head b_assoc_buffers;  /* associated mappings */
    struct address_space *b_assoc_map; /* associated address space */
    atomic_t b_count;                  /* use count */
};

The b_state field and bh_state_bits enumeration *

The b_state field specifies the state of this particular buffer. It can be one or more of the flags in the following table. The legal flags are stored in the bh_state_bits (include/linux/buffer_head.h#L19) enumeration, which is defined in <linux/buffer_head.h>.

Status Flag Meaning
BH_Uptodate Buffer contains valid data.
BH_Dirty Buffer is dirty. (The contents of the buffer are newer than the contents of the block on disk and therefore the buffer must eventually be written back to disk.)
BH_Lock Buffer is undergoing disk I/O and is locked to prevent concurrent access.
BH_Req Buffer is involved in an I/O request.
BH_Mapped Buffer is a valid buffer mapped to an on-disk block.
BH_New Buffer is newly mapped via get_block() and not yet accessed.
BH_Async_Read Buffer is undergoing asynchronous read I/O via end_buffer_async_read().
BH_Async_Write Buffer is undergoing asynchronous write I/O via end_buffer_async_write().
BH_Delay Buffer does not yet have an associated on-disk block (delayed allocation).
BH_Boundary Buffer forms the boundary of contiguous blocks; the next block is discontinuous.
BH_Write_EIO Buffer incurred an I/O error on write.
BH_Ordered Ordered write.
BH_Eopnotsupp Buffer incurred a "not supported" error.
BH_Unwritten Space for the buffer has been allocated on disk but the actual data has not yet been written out.
BH_Quiet Suppress errors for this buffer.

The bh_state_bits enumeration also contains a BH_PrivateStart flag (as the last value in the list). This is not a valid state flag but instead corresponds to the first usable bit of which other code can make use. All bit values equal to and greater than BH_PrivateStart are not used by the block I/O layer proper, so these bits are safe to use by individual drivers who want to store information in the b_state field. Drivers can base the bit values of their internal flags off this flag and rest assured that they are not encroaching on an official bit used by the block I/O layer.

b_count

The b_count field is the buffer’s usage count. The value is incremented and decremented by two inline functions defined in <linux/buffer_head.h>:

include/linux/buffer_head.h#L252

static inline void get_bh(struct buffer_head *bh)
{
        atomic_inc(&bh->b_count);
}

static inline void put_bh(struct buffer_head *bh)
{
        smp_mb__before_atomic_dec();
        atomic_dec(&bh->b_count);
}

Physical block, page and buffer *

The purpose of a buffer head is to describe this mapping between the on-disk block and the physical in-memory buffer (which is a sequence of bytes on a specific page). Acting as a descriptor of this buffer-to-block mapping is the data structure’s only role in the kernel.

Problems with buffer heads *

Before the 2.6 kernel, the buffer head was an important data structure. It was the unit of I/O in the kernel:

However, it had two primary problems:

  1. The buffer head was a large and unwieldy data structure. It was neither clean nor simple to manipulate data in terms of buffer heads.
    • Instead, the kernel prefers to work in terms of pages, which are simple and enable for greater performance.
    • A large buffer head describing each individual buffer (which might be smaller than a page) was inefficient.
    • Consequently, in the 2.6 kernel, much work has gone into making the kernel work directly with pages and address spaces instead of buffers. Some of this work is discussed in Chapter 16 "The Page Cache and Page Writeback", where the address_space structure and the pdflush daemons are discussed.
  2. Buffer heads describe only a single buffer. When used as the container for all I/O operations, the buffer head forces the kernel to break up potentially large block I/O operations into multiple buffer_head structures. This results in needless overhead and space consumption.
    • The primary goal of the 2.5 development kernel was to introduce a new, flexible, and lightweight container, bio structure (discussed in the next section), for block I/O operations.

The bio Structure

The bio structure is the basic container for block I/O within the kernel is the bio structure. Defined in <linux/bio.h>, this structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By allowing the buffers to be described in chunks, the bio structure provides the capability for the kernel to perform block I/O operations of even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.

The following is struct bio:

include/linux/bio.h#L62

struct bio {
    sector_t bi_sector;               /* associated sector on disk */
    struct bio *bi_next;              /* list of requests */
    struct block_device *bi_bdev;     /* associated block device */
    unsigned long bi_flags;           /* status and command flags */
    unsigned long bi_rw;              /* read or write? */
    unsigned short bi_vcnt;           /* number of bio_vecs off */
    unsigned short bi_idx;            /* current index in bi_io_vec */
    unsigned short bi_phys_segments;  /* number of segments */
    unsigned int bi_size;             /* I/O count */
    unsigned int bi_seg_front_size;   /* size of first segment */
    unsigned int bi_seg_back_size;    /* size of last segment */
    unsigned int bi_max_vecs;         /* maximum bio_vecs possible */
    unsigned int bi_comp_cpu;         /* completion CPU */
    atomic_t bi_cnt;                  /* usage counter */
    struct bio_vec *bi_io_vec;        /* bio_vec list */
    bio_end_io_t *bi_end_io;          /* I/O completion method */
    void *bi_private;                 /* owner-private method */
    bio_destructor_t *bi_destructor;  /* destructor method */
    struct bio_vec bi_inline_vecs[0]; /* inline bio vectors */
};

The primary purpose of a bio structure is to represent an in-flight (in progress) block I/O operation. To this end, the majority of the fields in the structure are housekeeping related. The most important fields are bi_io_vec, bi_vcnt, and bi_idx. The following figure shows the relationship between struct bio, struct bio_vec, and struct page.

Figure 14.2 Relationship between struct bio, struct bio_vec, and struct page.

I/O vectors

The bi_io_vec field points to an array of bio_vec structures, each of which is used as a list of individual segments in this specific block I/O operation. The entire array of these vectors describes the entire buffer.

Each bio_vec is treated as a vector of the form <page, offset, len>, which describes a specific segment:

The bio_vec structure is defined in <linux/bio.h>:

struct bio_vec {
    /* pointer to the physical page on which this buffer resides */
    struct page *bv_page;

    /* the length in bytes of this buffer */
    unsigned int bv_len;

    /* the byte offset within the page where the buffer resides */
    unsigned int bv_offset;
};

In each given block I/O operation, there are bi_vcnt vectors in the bio_vec array starting with bi_io_vec. As the block I/O operation is carried out, the bi_idx field is used to point to the current index into the array.

[p295]

Each block I/O request is represented by a bio structure. Each request is composed of one or more blocks, which are stored in an array of bio_vec structures. These structures act as vectors and describe each segment’s location in a physical page in memory. The first segment in the I/O operation is pointed to by b_io_vec. Each additional segment follows after the first, for a total of bi_vcnt segments in the list. As the block I/O layer submits segments in the request, the bi_idx field is updated to point to the current segment.

Splitting of bio structures and RAID *

The bi_idx field is used to point to the current bio_vec in the list, which helps the block I/O layer keep track of partially completed block I/O operations. More importantly, it allows the splitting of bio structures. With this feature, RAID (a hard disk setup that enables single volumes to span multiple disks for performance and reliability purposes) drivers can take a single bio structure, initially intended for a single device and split it among the multiple hard drives in the RAID array. All the RAID driver needs to do is copy the bio structure and update the bi_idx field to point to where the individual drive should start its operation.

Usage count *

The bio structure maintains a usage count in the bi_cnt field. When this field reaches zero, the structure is destroyed and the backing memory is freed. The following two functions manage the usage counters:

void bio_get(struct bio *bio)
void bio_put(struct bio *bio)

The former increments the usage count, whereas the latter decrements the usage count (if the count reaches zero, destroys the bio structure). Before manipulating an in-flight bio structure, be sure to increment its usage count to make sure it does not complete and deallocate out. When you finish, decrement the usage count in turn.

The bi_private field is a private field for the owner (creator) of the structure. As a rule, you can read or write this field only if you allocated the bio structure.

The Old Versus the New

The bio structure represents an I/O operation, which may include one or more pages in memory. On the other hand, the buffer_head structure represents a single buffer, which describes a single block on the disk.

bio structure benefits *

The bio structure also has the following benefits:

Why is buffer head still needed? *

The concept of buffer heads is still required. Buffer heads function as descriptors, mapping disk blocks to pages. The bio structure does not contain any information about the state of a buffer: it is simply an array of vectors describing one or more segments of data for a single block I/O operation, plus related information. In the current setup, the buffer_head structure is still needed to contain information about buffers while the bio structure describes in-flight I/O. Keeping the two structures separate enables each to remain as small as possible.

Request Queues

Block devices maintain request queues to store their pending block I/O requests. The request queue is represented by the request_queue structure and is defined in <linux/blkdev.h>. The request queue contains a doubly linked list of requests and associated control information. Requests are added to the queue by higher-level code in the kernel, such as filesystems. As long as the request queue is nonempty, the block device driver associated with the queue grabs the request from the head of the queue and submits it to its associated block device. Each item in the queue’s request list is a single request, of type struct request (also defined in <linux/blkdev.h>).

Each request can be composed of more than one bio structure because individual requests can operate on multiple consecutive disk blocks. Note that although the blocks on the disk must be adjacent, the blocks in memory need not be; each bio structure can describe multiple segments (segments are contiguous chunks of a block in memory) and the request can be composed of multiple bio structures.

Doubts and Solutions

Verbatim

p294 on the bio structure:

The bio structure is the basic container for block I/O within the kernel is the bio structure. Defined in <linux/bio.h>, this structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By allowing the buffers to be described in chunks, the bio structure provides the capability for the kernel to perform block I/O operations of even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.

Need in-depth understanding on this paragrah.

Solution: