Chapter 15. The Process Address Space

Chapter 12, Memory Management, covers how the kernel manages physical memory. In addition to managing its own memory, the kernel must also manage the memory of user-space processes. This memory is called the process address space: the representation of memory given to each user-space process on the system.

Linux is a virtual memory operating system: the memory is virtualized among the processes on the system. To an individual process, the view is as if it alone has full access to the system's physical memory. More important, the address space of even a single process can be much larger than physical memory.

This chapter discusses how the kernel manages the process address space.

Address Spaces

Flat and segmented address space *

The process address space consists of the virtual memory addressable by a process and the addresses within the virtual memory that the process is allowed to use. Each process is given a flat 32- or 64-bit address space, with the size depending on the architecture. The term flat denotes that the address space exists in a single range. (For example, a 32-bit address space extends from the address 0 to 4294967295.)

Some operating systems provide a segmented address space, with addresses existing not in a single linear range, but instead in multiple segments. Modern virtual memory operating systems generally have a flat memory model and not a segmented one.

In the flat memory model, the flat address space is unique to each process. A memory address in one process's address space is completely unrelated to that same memory address in another process's address space. Both processes can have different data at the same address in their respective address spaces. Alternatively, processes can elect to share their address space with other processes; such processes are known as threads.

Memory areas *

A memory address is a given value within the address space, such as 4021f000. This particular value identifies a specific byte in a process's 32-bit address space. Although a process can address up to 4GB of memory (with a 32-bit address space), it doesn't have permission to access all of it.

These intervals of legal addresses that the process has permission to access, such as 08048000-0804c000, are called memory areas. The process, through the kernel, can dynamically add memory areas to, and remove memory areas from, its address space.

The process can access a memory address only in a valid memory area. Memory areas have associated permissions, such as readable, writable, and executable, that the associated process must respect. If a process accesses a memory address not in a valid memory area, or if it accesses a valid area in an invalid manner, the kernel kills the process with the dreaded "Segmentation Fault" message.
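
As a contrived user-space illustration, dereferencing an address that is almost certainly not in any valid memory area produces exactly this fault; a minimal sketch:

int main(void)
{
    char *p = (char *) 0x4021f000; /* arbitrary address, likely unmapped */

    *p = 1; /* invalid access: the kernel delivers SIGSEGV */
    return 0;
}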

Memory areas can contain the following things:

- A memory map of the executable file's code, called the text section.
- A memory map of the executable file's initialized global variables, called the data section.
- A memory map of the zero page containing uninitialized global variables, called the bss section.
- A memory map of the zero page used for the process's user-space stack.
- An additional text, data, and bss section for each shared library, such as the C library and dynamic linker, loaded into the process's address space.
- Any memory mapped files.
- Any shared memory segments.
- Any anonymous memory mappings, such as those associated with malloc().

The Memory Descriptor

The kernel represents a process's address space with a data structure called the memory descriptor. This structure contains all the information related to the process address space.

The memory descriptor is represented by struct mm_struct and defined in <linux/mm_types.h> (include/linux/mm_types.h):

include/linux/mm_types.h#L222

struct mm_struct {
    struct vm_area_struct *mmap; /* list of memory areas */
    struct rb_root mm_rb; /* red-black tree of VMAs */
    struct vm_area_struct *mmap_cache; /* last used memory area */
    unsigned long free_area_cache; /* 1st address space hole */
    pgd_t *pgd; /* page global directory */
    atomic_t mm_users; /* address space users */
    atomic_t mm_count; /* primary usage counter */
    int map_count; /* number of memory areas */
    struct rw_semaphore mmap_sem; /* memory area semaphore */
    spinlock_t page_table_lock; /* page table lock */
    struct list_head mmlist; /* list of all mm_structs */
    unsigned long start_code; /* start address of code */
    unsigned long end_code; /* final address of code */
    unsigned long start_data; /* start address of data */
    unsigned long end_data; /* final address of data */
    unsigned long start_brk; /* start address of heap */
    unsigned long brk; /* final address of heap */
    unsigned long start_stack; /* start address of stack */
    unsigned long arg_start; /* start of arguments */
    unsigned long arg_end; /* end of arguments */
    unsigned long env_start; /* start of environment */
    unsigned long env_end; /* end of environment */
    unsigned long rss; /* pages allocated */
    unsigned long total_vm; /* total number of pages */
    unsigned long locked_vm; /* number of locked pages */
    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* saved auxv */
    cpumask_t cpu_vm_mask; /* lazy TLB switch mask */
    mm_context_t context; /* arch-specific data */
    unsigned long flags; /* status flags */
    int core_waiters; /* thread core dump waiters */
    struct core_state *core_state; /* core dump support */
    spinlock_t ioctx_lock; /* AIO I/O list lock */
    struct hlist_head ioctx_list; /* AIO I/O list */
};

The address space has two usage counters. mm_users counts the number of processes using this address space; for example, if two threads share the address space, mm_users is two. mm_count is the primary reference count: all of the mm_users collectively count as one increment of mm_count, so mm_count reaches zero only when the last process using the address space goes away. Having two counters enables the kernel to differentiate between the main usage counter (mm_count) and the number of processes using the address space (mm_users).

This redundancy (the mmap list and the mm_rb tree describing the same set of data) is handy: the mmap linked list is used when every memory area needs to be traversed, whereas the mm_rb red-black tree is better suited to searching for a specific memory area.

The kernel isn't duplicating the vm_area_struct structures, just the containing data structures. Overlaying a linked list onto a tree, and using both to access the same set of data, is sometimes called a threaded tree.

All of the mm_struct structures are strung together in a doubly linked list via the mmlist field. The initial element in the list is the init_mm memory descriptor, which describes the address space of the init process. The list is protected from concurrent access via the mmlist_lock, which is defined in kernel/fork.c.

Allocating a Memory Descriptor

The memory descriptor associated with a given task is stored in the mm field of the task's process descriptor, so current->mm is the current process's memory descriptor. During fork(), copy_mm() copies the parent's memory descriptor to the child; the child's mm_struct is allocated from the mm_cachep slab cache via allocate_mm() in kernel/fork.c. Normally, each process receives a unique mm_struct and thus a unique process address space.

Processes may share their address spaces with their children via the CLONE_VM flag to clone(). The process is then called a thread. This is essentially the only difference between normal processes and so-called threads in Linux (Chapter 3); the Linux kernel does not otherwise differentiate between them. Threads are regular processes to the kernel that merely share certain resources.

When CLONE_VM is specified, allocate_mm() is not called, and the process's mm field is set to point to the memory descriptor of its parent via this logic in copy_mm():

if (clone_flags & CLONE_VM) {
    /*
     * current is the parent process and
     * tsk is the child process during a fork()
     */
    atomic_inc(&current->mm->mm_users);
    tsk->mm = current->mm;
}

Destroying a Memory Descriptor

When the process associated with a specific address space exits, exit_mm(), defined in kernel/exit.c, is invoked. This function performs some housekeeping and updates some statistics. It then calls mmput(), which decrements the memory descriptor's mm_users counter. If the user count reaches zero, mmdrop() is called to decrement the mm_count usage counter. If that counter also reaches zero, the free_mm() macro is invoked to return the mm_struct to the mm_cachep slab cache via kmem_cache_free(), because the memory descriptor no longer has any users.
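
A condensed sketch of this teardown path, simplified from kernel/fork.c (the real code also handles mmlist locking, AIO contexts, and other details that vary across 2.6 releases):

/* drop one user of the address space; tear it down on the last user */
void mmput(struct mm_struct *mm)
{
    if (atomic_dec_and_test(&mm->mm_users)) {
        exit_mmap(mm);         /* unmap and free all memory areas */
        list_del(&mm->mmlist); /* remove from the global mmlist */
        mmdrop(mm);            /* release the primary reference */
    }
}

/* drop the primary usage counter; free the mm_struct at zero */
static inline void mmdrop(struct mm_struct *mm)
{
    if (atomic_dec_and_test(&mm->mm_count))
        free_mm(mm);           /* kmem_cache_free() back to mm_cachep */
}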

The mm_struct and Kernel Threads

Kernel threads do not have a process address space and therefore do not have an associated memory descriptor. Thus, the mm field of a kernel thread's process descriptor is NULL. This is the definition of a kernel thread: processes that have no user context.

Kernel threads never access any user-space memory. Because kernel threads do not have any pages in user-space, they do not deserve their own memory descriptor and page tables (discussed later in the chapter). However, kernel threads need some of the data, such as the page tables, even to access kernel memory. To provide kernel threads the needed data, without wasting memory on a memory descriptor and page tables, or wasting processor cycles switching to a new address space whenever a kernel thread begins running, kernel threads use the memory descriptor of whatever task ran previously: when a kernel thread is scheduled, the kernel notices that its mm is NULL and keeps the previous process's address space loaded, updating the active_mm field of the kernel thread's process descriptor to refer to the previous process's memory descriptor. The kernel thread can then use the previous process's page tables as needed. Because kernel threads touch only kernel memory, which is identical in every address space, this borrowing is safe.
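
A condensed sketch of this borrowing, adapted from the lazy-TLB logic in context_switch() (kernel/sched.c); the exact code varies by kernel version:

struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;

if (!mm) {                          /* next is a kernel thread... */
    next->active_mm = oldmm;        /* ...so borrow the previous mm */
    atomic_inc(&oldmm->mm_count);   /* pin it via the primary counter */
    enter_lazy_tlb(oldmm, next);    /* skip the address-space switch */
} else
    switch_mm(oldmm, mm, next);     /* load next's page tables */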

Virtual Memory Areas

Memory areas are represented by the memory area structure, vm_area_struct, defined in <linux/mm_types.h>. In the Linux kernel, memory areas are often called virtual memory areas (VMAs).

The vm_area_struct structure describes a single memory area over a contiguous interval in a given address space. Each VMA structure can thus represent a different type of memory area, e.g. a memory-mapped file or the process's user-space stack. This is similar to the object-oriented approach taken by the VFS layer (Chapter 13).

include/linux/mm_types.h#L130

struct vm_area_struct {
    struct mm_struct *vm_mm;        /* associated mm_struct */
    unsigned long vm_start;         /* VMA start, inclusive */
    unsigned long vm_end;           /* VMA end, exclusive */
    struct vm_area_struct *vm_next; /* list of VMAs */
    pgprot_t vm_page_prot;          /* access permissions */
    unsigned long vm_flags;         /* flags */
    struct rb_node vm_rb;           /* VMA's node in the tree */
    union {                         /* links to address_space->i_mmap or i_mmap_nonlinear */
        struct {
            struct list_head list;
            void *parent;
            struct vm_area_struct *head;
        } vm_set;
        struct prio_tree_node prio_tree_node;
    } shared;
    struct list_head anon_vma_node; /* anon_vma entry */
    struct anon_vma *anon_vma;      /* anonymous VMA object */
    struct vm_operations_struct *vm_ops; /* associated ops */
    unsigned long vm_pgoff;         /* offset within file */
    struct file *vm_file;           /* mapped file, if any */
    void *vm_private_data;          /* private data */
};

Each vm_area_struct structure is associated with a unique interval in the process's address space.

vm_end - vm_start is the length in bytes of the memory area, which exists over the interval [vm_start, vm_end). Intervals in different memory areas in the same address space cannot overlap.

The vm_mm field points to this VMA's associated mm_struct. Each VMA is unique to the mm_struct with which it is associated: even if two separate processes map the same file into their respective address spaces, each has a unique vm_area_struct to identify its unique memory area. Conversely, two threads that share an address space also share all the vm_area_struct structures therein.

VMA Flags

The vm_flags field contains bit flags, defined in <linux/mm.h>, that specify the behavior of and provide information about the pages contained in the memory area. Unlike permissions associated with a specific physical page, the VMA flags specify behavior for which the kernel is responsible, not the hardware. Furthermore, vm_flags contains information that relates to the memory area as a whole (every page in the area), not to specific individual pages.

The following table lists the possible vm_flags values.

| Flag | Effect on the VMA and Its Pages |
| ---- | ------------------------------- |
| VM_READ | Pages can be read from. |
| VM_WRITE | Pages can be written to. |
| VM_EXEC | Pages can be executed. |
| VM_SHARED | Pages are shared. |
| VM_MAYREAD | The VM_READ flag can be set. |
| VM_MAYWRITE | The VM_WRITE flag can be set. |
| VM_MAYEXEC | The VM_EXEC flag can be set. |
| VM_MAYSHARE | The VM_SHARED flag can be set. |
| VM_GROWSDOWN | The area can grow downward. |
| VM_GROWSUP | The area can grow upward. |
| VM_SHM | The area is used for shared memory. |
| VM_DENYWRITE | The area maps an unwritable file. |
| VM_EXECUTABLE | The area maps an executable file. |
| VM_LOCKED | The pages in this area are locked. |
| VM_IO | The area maps a device's I/O space. |
| VM_SEQ_READ | The pages seem to be accessed sequentially. |
| VM_RAND_READ | The pages seem to be accessed randomly. |
| VM_DONTCOPY | This area must not be copied on fork(). |
| VM_DONTEXPAND | This area cannot grow via mremap(). |
| VM_RESERVED | This area must not be swapped out. |
| VM_ACCOUNT | This area is an accounted VM object. |
| VM_HUGETLB | This area uses hugetlb pages. |
| VM_NONLINEAR | This area is a nonlinear mapping. |

The VM_SEQ_READ and VM_RAND_READ flags hint to the kernel that the application is reading the mapped file sequentially or randomly, so the kernel can increase or decrease its file readahead accordingly. These flags are set via the madvise() system call with the MADV_SEQUENTIAL and MADV_RANDOM advice values, respectively.
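
As a user-space illustration (a minimal sketch, assuming fd is an open file descriptor for the file and len is the length to map):

#include <sys/mman.h>

/* map the file privately and read-only, then hint that access
 * will be sequential so the kernel can read ahead aggressively */
void *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
if (buf != MAP_FAILED)
    madvise(buf, len, MADV_SEQUENTIAL); /* sets VM_SEQ_READ on the VMA */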

VMA Operations

The vm_ops field in the vm_area_struct structure points to the table of operations associated with a given memory area, which the kernel can invoke to manipulate the VMA. The vm_area_struct acts as a generic object for representing any type of memory area, and the operations table describes the specific methods that can operate on this particular instance of the object.

The operations table is represented by struct vm_operations_struct and is defined in <linux/mm.h>:

include/linux/mm.h#L185

struct vm_operations_struct {
    /* invoked when the memory area is added to an address space */
    void (*open) (struct vm_area_struct *);
    /* invoked when the memory area is removed from an address space */
    void (*close) (struct vm_area_struct *);
    /* invoked by the page fault handler when a page that is not
     * present in physical memory is accessed */
    int (*fault) (struct vm_area_struct *, struct vm_fault *);
    /* invoked by the page fault handler when a previously read-only
     * page is made writable */
    int (*page_mkwrite) (struct vm_area_struct *, struct vm_fault *);
    /* invoked by access_process_vm() when get_user_pages() fails */
    int (*access) (struct vm_area_struct *, unsigned long,
                   void *, int, int);
};

Lists and Trees of Memory Areas

As discussed earlier, memory areas are accessed via both the mmap and the mm_rb fields of the memory descriptor. These two data structures independently point to all the memory area objects associated with the memory descriptor. They both contain pointers to the same vm_area_struct structures, represented in different ways.

The first field, mmap, links together all the memory area objects in a singly linked list. Each vm_area_struct structure is linked in via its vm_next field, and the areas are sorted by ascending address; the last memory area's vm_next field is NULL. This list is used when, for example, every memory area must be traversed, as the sketch below shows.
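
A minimal sketch of kernel code walking the list, assuming mm points to the memory descriptor of interest:

struct vm_area_struct *vma;

down_read(&mm->mmap_sem); /* the memory areas are protected by mmap_sem */
for (vma = mm->mmap; vma; vma = vma->vm_next)
    printk(KERN_INFO "vma: %08lx-%08lx\n", vma->vm_start, vma->vm_end);
up_read(&mm->mmap_sem);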

The second field, mm_rb, links together all the memory area objects in a red-black tree.

A red-black tree is a type of balanced binary search tree.

The kernel uses the redundant data structures to provide optimal performance regardless of the operation performed on the memory areas: the linked list is used when every node must be iterated over, and the red-black tree is used when a specific memory area in the address space must be located.

Memory Areas in Real Life

This section discusses a particular process's address space and the memory areas inside it, using the /proc filesystem and the pmap(1) utility. The example is a simple user-space program that does nothing:

int main(int argc, char *argv[])
{
    return 0;
}

Memory areas in this process's address space include the text, data, and bss sections of the executable; the text, data, and bss sections of the C library, libc.so, and the dynamic linker, ld.so; and the process's stack.

The output from /proc/<pid>/maps lists the memory areas in this process's address space:

$ cat /proc/1426/maps
00e80000-00faf000 r-xp 00000000 03:01 208530 /lib/tls/libc-2.5.1.so
00faf000-00fb2000 rw-p 0012f000 03:01 208530 /lib/tls/libc-2.5.1.so
00fb2000-00fb4000 rw-p 00000000 00:00 0
08048000-08049000 r-xp 00000000 03:03 439029 /home/rlove/src/example
08049000-0804a000 rw-p 00000000 03:03 439029 /home/rlove/src/example
40000000-40015000 r-xp 00000000 03:01 80276 /lib/ld-2.5.1.so
40015000-40016000 rw-p 00015000 03:01 80276 /lib/ld-2.5.1.so
4001e000-4001f000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0

The data is in the form:

start-end permission offset major:minor inode file

The pmap(1) utility (of the procps package) formats this information in a bit more readable manner:

$ pmap 1426
example[1426]
00e80000    (1212 KB)   r-xp    (03:01 208530)  /lib/tls/libc-2.5.1.so
00faf000    (12 KB)     rw-p    (03:01 208530)  /lib/tls/libc-2.5.1.so
00fb2000    (8 KB)      rw-p    (00:00 0)
08048000    (4 KB)      r-xp    (03:03 439029)  /home/rlove/src/example
08049000    (4 KB)      rw-p    (03:03 439029)  /home/rlove/src/example
40000000    (84 KB)     r-xp    (03:01 80276)   /lib/ld-2.5.1.so
40015000    (4 KB)      rw-p    (03:01 80276)   /lib/ld-2.5.1.so
4001e000    (4 KB)      rw-p    (00:00 0)
bfffe000    (8 KB)      rwxp        (00:00 0)       [ stack ]
mapped: 1340 KB writable/private: 40 KB shared: 0 KB

Note the following about the output:

Description: The first three rows are the text section, data section, and bss of the C library, libc.so. The next two rows are the text section and the data/bss of the executable itself. The next three rows are the text section, data section, and bss of the dynamic linker, ld.so. The last row is the process's stack.

Permissions: The text sections are readable and executable, as expected for code (r-x). The data and bss sections, which contain variables, are readable and writable but not executable (rw-). The stack is readable, writable, and, on this machine, executable (rwx).

The entire address space takes up about 1340KB, but only 40KB are writable and private. If a memory region is shared or nonwritable, the kernel keeps only one copy of the backing file in memory. Since a nonwritable mapping can never be changed (the mapping is only read from), it is safe to load the image only once into memory. Therefore, the C library needs to occupy only 1212KB in physical memory (not 1212KB multiplied by every process using the library). Because this process has access to about 1340KB worth of data and code, yet consumes only about 40KB of physical memory, the space savings from such sharing is substantial.

The memory areas without a mapped file on device 00:00 and inode zero are mappings of the zero page, a page whose contents are all zeros. By mapping the zero page over a writable memory area, the area is in effect "initialized" to all zeros. This is important in that it provides a zeroed memory area, which is expected by the bss and the stack. Because the mapping is not shared, as soon as the process writes to this memory, a copy is made (à la copy-on-write) and the value updated from zero.

Each of the memory areas associated with the process corresponds to a vm_area_struct structure. Because the process was not a thread, it has a unique mm_struct structure referenced from its task_struct.

Manipulating Memory Areas

The kernel often has to perform operations on a memory area, e.g. checking whether a given address exists in a given VMA. These operations are frequent and form the basis of the mmap() routine (discussed later). Helper functions are defined to assist these jobs.

These functions are all declared in <linux/mm.h>.

find_vma()

The kernel provides the find_vma() function (include/linux/mm.h#L1335) to search for the VMA in which a given memory address resides. It is declared in <linux/mm.h> and defined in mm/mmap.c:

struct vm_area_struct * find_vma(struct mm_struct *mm, unsigned long addr);

mm/mmap.c#L1569

struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma = NULL;

    if (mm) {
        /* Check the cache first. */
        /* (Cache hit rate is typically around 35%.) */
        vma = mm->mmap_cache;
        if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
            struct rb_node * rb_node;

            rb_node = mm->mm_rb.rb_node;
            vma = NULL;

            while (rb_node) {
                struct vm_area_struct * vma_tmp;

                vma_tmp = rb_entry(rb_node,
                        struct vm_area_struct, vm_rb);

                if (vma_tmp->vm_end > addr) {
                    vma = vma_tmp;
                    if (vma_tmp->vm_start <= addr)
                        break;
                    rb_node = rb_node->rb_left;
                } else
                    rb_node = rb_node->rb_right;
            }
            if (vma)
                mm->mmap_cache = vma;
        }
    }
    return vma;
}
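
Note that the returned VMA is the first area whose vm_end is greater than addr, so the address may actually lie before the area's start; callers that require containment must also check vm_start. A usage sketch, assuming the caller holds mmap_sem for reading:

struct vm_area_struct *vma;

down_read(&mm->mmap_sem);
vma = find_vma(mm, addr);
if (vma && addr >= vma->vm_start) {
    /* addr lies inside [vma->vm_start, vma->vm_end) */
}
up_read(&mm->mmap_sem);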

find_vma_prev()

The find_vma_prev() function works the same as find_vma(), but it also returns the last VMA before addr. The function is also defined in mm/mmap.c and declared in <linux/mm.h>:

struct vm_area_struct * find_vma_prev(struct mm_struct *mm, unsigned long addr,
                                      struct vm_area_struct **pprev);

The pprev argument stores a pointer to the VMA preceding addr.

find_vma_intersection()

The find_vma_intersection() function returns the first VMA that overlaps a given address interval. The function is defined in <linux/mm.h> because it is inline:

static inline struct vm_area_struct *
find_vma_intersection(struct mm_struct *mm,
                      unsigned long start_addr,
                      unsigned long end_addr)
{
    struct vm_area_struct *vma;

    vma = find_vma(mm, start_addr);
    if (vma && end_addr <= vma->vm_start)
        vma = NULL;
    return vma;
}

The first parameter is the address space to search, start_addr is the start of the interval, and end_addr is the end of the interval.

If find_vma() returns NULL, so does find_vma_intersection(). If find_vma() returns a valid VMA, however, find_vma_intersection() returns the same VMA only if it does not start after the end of the given address range. If the returned memory area does start after the end of the given range, the function returns NULL.
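
For example, kernel code preparing a fixed mapping can use this function to verify that the requested interval is unused; a minimal sketch (the error value here is illustrative):

/* fail if any existing VMA overlaps [start, start + len) */
if (find_vma_intersection(mm, start, start + len))
    return -EINVAL;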

mmap() and do_mmap(): Creating an Address Interval

The do_mmap() function creates a new linear address interval. This function does not technically always create a new VMA, because if the new address interval is adjacent to an existing address interval and the two share the same permissions, the two intervals are merged into one. If this is not possible, a new VMA is created.

In any case, do_mmap() is the function used to add an address interval to a process's address space, whether that means expanding an existing memory area or creating a new one.

The do_mmap() function is declared in <linux/mm.h>:

include/linux/mm.h#L1271

unsigned long do_mmap(struct file *file, unsigned long addr,
                      unsigned long len, unsigned long prot,
                      unsigned long flag, unsigned long offset)

This function maps the file specified by file at offset offset for length len. If file is NULL and offset is zero, the mapping is anonymous, i.e. not backed by a file. The addr parameter optionally specifies the initial address from which to begin the search for a free interval.

The prot parameter specifies the access permissions for pages in the memory area:

| Flag | Effect on the Pages in the New Interval |
| ---- | --------------------------------------- |
| PROT_READ | Corresponds to VM_READ. |
| PROT_WRITE | Corresponds to VM_WRITE. |
| PROT_EXEC | Corresponds to VM_EXEC. |
| PROT_NONE | Pages cannot be accessed. |

The flag parameter specifies the remaining VMA flags, which denote the type of the mapping and change its behavior:

| Flag | Effect on the New Interval |
| ---- | -------------------------- |
| MAP_SHARED | The mapping can be shared. |
| MAP_PRIVATE | The mapping cannot be shared. |
| MAP_FIXED | The new interval must start at the given address addr. |
| MAP_ANONYMOUS | The mapping is not file-backed, but is anonymous. |
| MAP_GROWSDOWN | Corresponds to VM_GROWSDOWN. |
| MAP_DENYWRITE | Corresponds to VM_DENYWRITE. |
| MAP_EXECUTABLE | Corresponds to VM_EXECUTABLE. |
| MAP_LOCKED | Corresponds to VM_LOCKED. |
| MAP_NORESERVE | No need to reserve space for the mapping. |
| MAP_POPULATE | Populate (prefault) page tables. |
| MAP_NONBLOCK | Do not block on I/O. |

The do_mmap() functionality is exported to user-space via the mmap() system call, defined as:

void * mmap2(void *start,
             size_t length,
             int prot,
             int flags,
             int fd,
             off_t pgoff)

This system call is named mmap2() because it is the second variant of mmap(). The original mmap() took a byte offset as its final argument, whereas mmap2() takes a page offset, which enables larger files and larger offsets to be mapped. The C library's mmap() wrapper invokes mmap2(), converting the byte offset to a page offset.
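
For illustration, a minimal user-space program that maps a file read-only via the C library's mmap() wrapper and then removes the mapping (error handling abbreviated; the file path is an arbitrary example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY); /* hypothetical example file */
    struct stat sb;
    void *p;

    if (fd < 0 || fstat(fd, &sb) < 0)
        return 1;
    p = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        fwrite(p, 1, sb.st_size, stdout); /* use the mapping */
        munmap(p, sb.st_size);            /* remove the interval */
    }
    close(fd);
    return 0;
}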

munmap() and do_munmap(): Removing an Address Interval

The do_munmap() function removes an address interval from a specified process address space. The function is declared in <linux/mm.h>:

include/linux/mm.h#L1284

int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)

The first parameter mm specifies the address space from which the interval starting at address start of length len bytes is removed. On success, zero is returned. Otherwise, a negative error code is returned.

As the complement of mmap(), the munmap() system call is exported to user-space as a means to enable processes to remove address intervals from their address space:

int munmap(void *start, size_t length)

The system call is defined in mm/mmap.c and acts as a simple wrapper to do_munmap():

asmlinkage long sys_munmap(unsigned long addr, size_t len)
{
    int ret;
    struct mm_struct *mm;

    mm = current->mm;
    down_write(&mm->mmap_sem);
    ret = do_munmap(mm, addr, len);
    up_write(&mm->mmap_sem);
    return ret;
}

Page Tables

Although applications operate on virtual memory mapped to physical addresses, processors operate directly on those physical addresses. Consequently, when an application accesses a virtual memory address, the address must first be converted to a physical address before the processor can resolve the request. This lookup is performed via page tables. Page tables work by splitting the virtual address into chunks; each chunk is used as an index into a table, and the table points to either another table or the associated physical page.

In Linux, the page tables consist of three levels. The multiple levels enable a sparsely populated address space, even on 64-bit machines. If the page tables were implemented as a single static array, their size on even 32-bit architectures would be enormous. Linux uses three levels of page tables even on architectures that do not support three levels in hardware. (For example, some hardware uses only two levels or implements a hash in hardware.) Using three levels is a sort of "greatest common denominator": architectures with a less complicated implementation can simplify the kernel page tables as needed with compiler optimizations.

The following figure diagrams the flow of a virtual to physical address lookup using page tables.

Figure 15.1 Virtual-to-physical address lookup


[UTLK p57-58]

Up to version 2.6.10, the Linux paging model consisted of three paging levels. Starting with version 2.6.11, a four-level paging model was adopted. The four types of page tables, illustrated in the figure below, are the Page Global Directory (PGD), the Page Upper Directory (PUD), the Page Middle Directory (PMD), and the Page Table (PTE).

Figure 2-12. The Linux paging model (UTLK)
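
To make the four levels concrete, here is a sketch of how 2.6-era kernel code can walk them to find the struct page backing a user virtual address; the function name follow_addr() is hypothetical, and locking and huge-page handling are omitted:

#include <linux/mm.h>

static struct page *follow_addr(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    struct page *page = NULL;

    pgd = pgd_offset(mm, addr);      /* level 1: Page Global Directory */
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return NULL;
    pud = pud_offset(pgd, addr);     /* level 2: Page Upper Directory */
    if (pud_none(*pud) || pud_bad(*pud))
        return NULL;
    pmd = pmd_offset(pud, addr);     /* level 3: Page Middle Directory */
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return NULL;
    pte = pte_offset_map(pmd, addr); /* level 4: the Page Table itself */
    if (pte_present(*pte))
        page = pte_page(*pte);       /* the backing physical page */
    pte_unmap(pte);
    return page;
}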


Translation Lookaside Buffer *

Because nearly every access of a page in virtual memory must be resolved to its corresponding address in physical memory, the performance of the page tables is critical. Unfortunately, looking up all these addresses in memory cannot be done quickly. To facilitate the lookup, most processors implement a translation lookaside buffer (TLB), which acts as a hardware cache of virtual-to-physical mappings. When accessing a virtual address, the processor first checks whether the mapping is cached in the TLB; if it is (a hit), the physical address is returned immediately. Otherwise (a miss), the page tables are consulted for the correct physical address.

Future work of page tables *

Page table management is still a critical and evolving part of the kernel. Changes to this area in 2.6 include allocating parts of the page table out of high memory. Future possibilities include shared page tables with copy-on-write semantics. In that scheme, page tables would be shared between parent and child across a fork(). When the parent or the child attempted to modify a particular page table entry, a copy would be created, and the two processes would no longer share that entry. Sharing page tables would remove the overhead of copying the page table entries on fork().