Chapter 12. Memory Management

Unlike user space, the kernel cannot always easily allocate memory. This chapter discusses how the kernel handles memory and the methods used to obtain memory inside the kernel.

Pages

To the kernel, physical pages are the basic unit of memory management. Although the processor’s smallest addressable unit is a byte or a word, the memory management unit (MMU, the hardware that manages memory and performs virtual-to-physical address translations) typically deals in pages. Therefore, the MMU maintains the system’s page tables with page-sized granularity. In terms of virtual memory, pages are the smallest unit that matters.

Each architecture defines its own page size, and many architectures even support multiple page sizes. Most 32-bit architectures have 4KB pages, whereas most 64-bit architectures have 8KB pages.

This implies that on a machine with 4KB pages and 1GB of memory, physical memory is divided into 262,144 distinct pages.

The kernel represents every physical page on the system with a struct page structure, defined in <linux/mm_types.h> (include/linux/mm_types.h). The following is a simplified definition (two confusing unions are removed):

include/linux/mm_types.h#L34

struct page {
    unsigned long flags;            /* status of the page (dirty, locked, ...) */
    atomic_t _count;                /* usage count of the page */
    atomic_t _mapcount;
    unsigned long private;
    struct address_space *mapping;  /* address space, if used by the page cache */
    pgoff_t index;
    struct list_head lru;
    void *virtual;                  /* page's virtual address; NULL if not mapped */
};

The page structure is associated with physical pages, not virtual pages; what the structure describes is transient at best. Even if the data contained in the page continues to exist, it might not always be associated with the same page structure because of swapping and so on. The kernel uses this data structure to describe the associated physical page. The data structure’s goal is to describe physical memory, not the data contained therein.

The kernel uses this structure to keep track of all the pages in the system, because the kernel needs to know whether a page is free (that is, not allocated). If a page is not free, the kernel needs to know who owns it. Possible owners include, but are not limited to, user-space processes, dynamically allocated kernel data, static kernel code, and the page cache.

An instance of this structure is allocated for each physical page in the system. How bad (or good) is the space consumption from all these structures? Assume struct page consumes 40 bytes of memory, the system uses 8KB physical pages, and the system has 4GB of physical memory. In that case, there are 524,288 pages and page structures on the system, and the page structures consume 20MB: perhaps a surprisingly large number in absolute terms, but only a small fraction of a percent relative to the system’s 4GB. This is not too high a cost for managing all the system’s physical pages.

Zones

The kernel cannot treat all pages as identical due to hardware limitations. Some pages, because of their physical address in memory, cannot be used for certain tasks. Thus, the kernel divides pages into different zones. The kernel uses the zones to group pages of similar properties.

Linux has to deal with two shortcomings of hardware with respect to memory addressing:

  1. Some hardware devices can perform DMA (direct memory access) only to certain memory addresses.
  2. Some architectures can physically address larger amounts of memory than they can virtually address; consequently, some memory is not permanently mapped into the kernel address space.

Due to these constraints, Linux has four primary memory zones:

  1. ZONE_DMA contains pages that can undergo DMA.
  2. ZONE_DMA32 contains pages that can undergo DMA but are accessible only by 32-bit devices.
  3. ZONE_NORMAL contains normal, regularly mapped pages.
  4. ZONE_HIGHMEM contains "high memory", pages not permanently mapped into the kernel’s address space.

These zones are defined in <linux/mmzone.h> (include/linux/mmzone.h).

The layout of the memory zones is architecture-dependent. For example, some architectures can perform DMA into any memory address, so on those architectures ZONE_DMA is empty. On x86, however, ISA devices can access only the first 16MB of physical memory, so ZONE_DMA on x86 consists of all memory in the range 0–16MB.

ZONE_HIGHMEM works similarly. On 32-bit x86 systems, ZONE_HIGHMEM is all memory above the physical 896MB mark. On other architectures, ZONE_HIGHMEM is empty because all memory is directly mapped. The memory contained in ZONE_HIGHMEM is called high memory. The rest of the system’s memory is called low memory.

ZONE_NORMAL is the remainder after the previous two zones claim their requisite shares. On x86, ZONE_NORMAL is all physical memory from 16MB to 896MB. On other architectures, ZONE_NORMAL is all available memory.

The following table is a listing of each zone and its consumed pages on x86-32.

Zone          Description                 Physical Memory
ZONE_DMA      DMA-able pages              < 16MB
ZONE_NORMAL   Normally addressable pages  16–896MB
ZONE_HIGHMEM  Dynamically mapped pages    > 896MB

Linux partitions pages into zones so that it has pools in place to satisfy allocations as needed. For example, with a ZONE_DMA pool, the kernel can satisfy memory allocations needed for DMA: if such memory is needed, the kernel can simply pull the required number of pages from ZONE_DMA. The zones do not have any physical relevance; they are simply logical groupings used by the kernel to keep track of pages.

Although some allocations may require pages from a particular zone, other allocations may pull from multiple zones. For example, an allocation for DMA-able memory must originate from ZONE_DMA, but a normal allocation can come from ZONE_DMA or ZONE_NORMAL (though not both; allocations cannot cross zone boundaries).

Not all architectures define all zones. For example, a 64-bit architecture such as Intel’s x86-64 can fully map and handle 64 bits of memory. Thus, x86-64 has no ZONE_HIGHMEM, and all physical memory is contained within ZONE_DMA and ZONE_NORMAL.

Each zone is represented by struct zone, which is defined in <linux/mmzone.h>:

include/linux/mmzone.h#L280

struct zone {
    unsigned long watermark[NR_WMARK];
    unsigned long lowmem_reserve[MAX_NR_ZONES];
    struct per_cpu_pageset pageset[NR_CPUS];
    spinlock_t lock;
    struct free_area free_area[MAX_ORDER];
    spinlock_t lru_lock;
    struct zone_lru {
        struct list_head list;
        unsigned long nr_saved_scan;
    } lru[NR_LRU_LISTS];
    struct zone_reclaim_stat reclaim_stat;
    unsigned long pages_scanned;
    unsigned long flags;
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
    int prev_priority;
    unsigned int inactive_ratio;
    wait_queue_head_t *wait_table;
    unsigned long wait_table_hash_nr_entries;
    unsigned long wait_table_bits;
    struct pglist_data *zone_pgdat;
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;
    unsigned long present_pages;
    const char *name;
};

Getting Pages

This section discusses the interfaces the kernel implements to enable you to allocate and free memory within the kernel.

The kernel provides one low-level mechanism for requesting memory, along with several interfaces to access it. All these interfaces allocate memory with page-sized granularity and are declared in <linux/gfp.h> (include/linux/gfp.h). The core function is:

struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)

This allocates 2^order (that is, 1 << order) contiguous physical pages and returns a pointer to the first page’s page structure; on error, it returns NULL.

You can convert a given page to its logical address with the function:

void * page_address(struct page *page)

This returns a pointer to the logical address where the given physical page currently resides.
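
For example, a minimal sketch combining the two calls (allocating four pages, order 2):

struct page *page;
void *addr;

page = alloc_pages(GFP_KERNEL, 2);  /* 2^2 = four contiguous pages */
if (!page)
    return -ENOMEM;
addr = page_address(page);          /* logical address of the first page */
/* ... use the pages; free them later with __free_pages(page, 2) ... */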

If you have no need for the actual struct page, you can call:

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

This function works the same as alloc_pages(), except that it directly returns the logical address of the first requested page. Because the pages are contiguous, the other pages simply follow from the first.

If you need only one page, two functions are implemented as wrappers to save you a bit of typing:

struct page * alloc_page(gfp_t gfp_mask)
unsigned long __get_free_page(gfp_t gfp_mask)

These functions work the same but pass zero for the order (2^0 = one page).

Getting Zeroed Pages

If you need the returned page filled with zeros, use the function:

unsigned long get_zeroed_page(gfp_t gfp_mask)

This function works the same as __get_free_page(), except that the allocated page is then zero-filled (every bit of every byte is unset). This is useful for pages given to userspace because the random garbage in an allocated page is not so random; it might contain sensitive data. All data must be zeroed or otherwise cleaned before it is returned to userspace to ensure system security is not compromised.
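
For example, a minimal sketch of allocating a zeroed page and releasing it later:

unsigned long page;

page = get_zeroed_page(GFP_KERNEL);
if (!page)
    return -ENOMEM;
/* the page is zero-filled; safe to expose its contents to user-space ... */
free_page(page);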

The following table is a listing of all the low-level page allocation methods.

Function                           Description
alloc_page(gfp_mask)               Allocates a single page and returns a pointer to its page structure
alloc_pages(gfp_mask, order)       Allocates 2^order pages and returns a pointer to the first page’s page structure
__get_free_page(gfp_mask)          Allocates a single page and returns its logical address
__get_free_pages(gfp_mask, order)  Allocates 2^order pages and returns the logical address of the first page
get_zeroed_page(gfp_mask)          Allocates a single page, zeros its contents, and returns its logical address

Freeing Pages

A family of functions enables you to free allocated pages when you no longer need them:

void __free_pages(struct page *page, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
void free_page(unsigned long addr)

Be careful to free only pages you allocate. Passing the wrong struct page or address, or the incorrect order, can result in corruption. [p237]

For example, we want to allocate eight pages:

unsigned long page;
page = __get_free_pages(GFP_KERNEL, 3);
if (!page) {
    /* insufficient memory: you must handle this error! */
    return -ENOMEM;
}
/* 'page' is now the address of the first of eight contiguous pages ... */

Free the eight pages, after we are done using them:

free_pages(page, 3);
/*
 * our pages are now freed and we should no
 * longer access the address stored in 'page'
 */

The GFP_KERNEL parameter is an example of a gfp_mask flag (discussed shortly).

A kernel allocation can fail, and your code must check for and handle such errors after the call to __get_free_pages(). It therefore often makes sense to allocate your memory at the start of the routine to make handling the error easier. [p237]

These low-level page functions are useful when you need page-sized chunks of physically contiguous pages, especially if you need exactly a single page or two. For more general byte-sized allocations, the kernel provides kmalloc().

kmalloc()

The kmalloc() function is similar to user-space’s malloc(), with the exception of the additional flags parameter. The kmalloc() function is a simple interface for obtaining kernel memory in byte-sized chunks. If you need whole pages, the previously discussed interfaces might be a better choice. For most kernel allocations, kmalloc() is the preferred interface.

The function is declared in <linux/slab.h> (include/linux/slab.h):

include/linux/slab_def.h#L128

void * kmalloc(size_t size, gfp_t flags)

The kmalloc() function returns a pointer to a region of memory that is at least size bytes in length. It may allocate more than you asked, although you have no way of knowing how much more: because the kernel allocator is page-based, some allocations may be rounded up. The kernel never returns less memory than requested. If the kernel is unable to find at least the requested amount, the allocation fails and the function returns NULL. The region of memory allocated is physically contiguous. Kernel allocations always succeed unless an insufficient amount of memory is available; thus, you must check for NULL after all calls to kmalloc() and handle the error appropriately.

For example:

struct dog *p;
p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p)
    /* handle error ... */

If the kmalloc() call succeeds, p now points to a block of memory that is at least the requested size. The GFP_KERNEL flag specifies the behavior of the memory allocator while trying to obtain the memory to return to the caller of kmalloc().

gfp_mask Flags

Flags are represented by the gfp_t type, which is defined in <linux/types.h> (include/linux/types.h#L179) as an unsigned int. gfp stands for __get_free_pages() (discussed earlier).

The flags are broken up into three categories: action modifiers, zone modifiers, and type flags. Action modifiers specify how the kernel is supposed to allocate the requested memory. Zone modifiers specify from where to allocate memory. Type flags specify a combination of action and zone modifiers as a convenience.

Action Modifiers

All the flags are declared in <linux/gfp.h> (include/linux/gfp.h). The file <linux/slab.h> (include/linux/slab.h) includes this header.

The table below is a list of the action modifiers.

Flag Description
__GFP_WAIT The allocator can sleep.
__GFP_HIGH The allocator can access emergency pools.
__GFP_IO The allocator can start disk I/O.
__GFP_FS The allocator can start filesystem I/O.
__GFP_COLD The allocator should use cache cold pages.
__GFP_NOWARN The allocator does not print failure warnings.
__GFP_REPEAT The allocator repeats the allocation if it fails, but the allocation can potentially fail.
__GFP_NOFAIL The allocator indefinitely repeats the allocation. The allocation cannot fail.
__GFP_NORETRY The allocator never retries if the allocation fails.
__GFP_NOMEMALLOC The allocator does not fall back on reserves.
__GFP_HARDWALL The allocator enforces "hardwall" cpuset boundaries.
__GFP_RECLAIMABLE The allocator marks the pages reclaimable.
__GFP_COMP The allocator adds compound page metadata (used internally by the hugetlb code).

These flags can be specified together. For example:

ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);

This call instructs the page allocator (ultimately alloc_pages()) that the allocation can block, perform I/O, and perform filesystem operations. This gives the kernel great freedom in how it can find the free memory to satisfy the allocation.

Zone Modifiers

Zone modifiers specify from which memory zone the allocation should originate. Though allocations can be fulfilled from any zone, the kernel prefers ZONE_NORMAL to ensure that the other zones have free pages when they are needed.

There are only three zone modifiers because there are only three zones other than ZONE_NORMAL, as in the following table:

Flag Description
__GFP_DMA Allocates only from ZONE_DMA
__GFP_DMA32 Allocates only from ZONE_DMA32
__GFP_HIGHMEM Allocates from ZONE_HIGHMEM or ZONE_NORMAL

If no zone modifier is specified, the kernel fulfills the allocation from either ZONE_DMA or ZONE_NORMAL, with a strong preference to satisfy the allocation from ZONE_NORMAL.

You cannot specify __GFP_HIGHMEM to either __get_free_pages() or kmalloc(), because these both return a logical address rather than a page structure; they might allocate memory not currently mapped in the kernel’s virtual address space, memory that thus has no logical address. Only alloc_pages() can allocate high memory.
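
For example, a minimal sketch of allocating and using a page of high memory (kmap() is covered later in this chapter; this must run in process context, because kmap() can sleep):

struct page *page;
void *addr;

page = alloc_pages(GFP_HIGHUSER, 0);  /* GFP_HIGHUSER includes __GFP_HIGHMEM */
if (!page)
    return -ENOMEM;
addr = kmap(page);                    /* map the page into the kernel's address space */
/* ... use addr ... */
kunmap(page);
__free_pages(page, 0);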

The majority of allocations do not specify a zone modifier because ZONE_NORMAL is sufficient.

Type Flags

The type flags specify the required action and zone modifiers to fulfill a particular type of transaction. Therefore, kernel code tends to use the correct type flag and not specify the myriad of other flags it might need. This is both simpler and less error-prone.

The table below is a list of the type flags:

Flag Description
GFP_ATOMIC The allocation is high priority and must not sleep. This is the flag to use in interrupt handlers, in bottom halves, while holding a spinlock, and in other situations where you cannot sleep.
GFP_NOWAIT Like GFP_ATOMIC, except that the call will not fall back on emergency memory pools. This increases the likelihood of the memory allocation failing.
GFP_NOIO This allocation can block, but must not initiate disk I/O. This is the flag to use in block I/O code when you cannot cause more disk I/O, which might lead to some unpleasant recursion.
GFP_NOFS This allocation can block and can initiate disk I/O, if it must, but it will not initiate a filesystem operation. This is the flag to use in filesystem code when you cannot start another filesystem operation.
GFP_KERNEL This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep. The kernel will do whatever it has to do to obtain the memory requested by the caller. This flag should be your default choice.
GFP_USER This is a normal allocation and might block. This flag is used to allocate memory for user-space processes.
GFP_HIGHUSER This is an allocation from ZONE_HIGHMEM and might block. This flag is used to allocate memory for user-space processes.
GFP_DMA This is an allocation from ZONE_DMA. Device drivers that need DMA-able memory use this flag, usually in combination with one of the preceding flags.

The following table shows which modifiers are associated with each type flag:

Flag          Modifier Flags
GFP_ATOMIC    __GFP_HIGH
GFP_NOWAIT    0
GFP_NOIO      __GFP_WAIT
GFP_NOFS      (__GFP_WAIT | __GFP_IO)
GFP_KERNEL    (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_USER      (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_HIGHUSER  (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
GFP_DMA       __GFP_DMA

Allocations initiated with GFP_NOIO and GFP_NOFS might block, but they refrain from performing certain other operations.

They are needed for certain low-level block I/O or filesystem code, respectively. Imagine if a common path in the filesystem code allocated memory without the GFP_NOFS flag: the allocation could result in more filesystem operations, which would beget other allocations and thus more filesystem operations. This could continue indefinitely. Code such as this that invokes the allocator must ensure that the allocator also does not execute itself, or else the allocation can create a deadlock.

The GFP_DMA flag specifies that the allocator must satisfy the request from ZONE_DMA. This flag is used by device drivers that need DMA-able memory for their devices. Normally, you combine this flag with GFP_ATOMIC or GFP_KERNEL.
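
For example (a sketch; size is whatever the driver requires):

char *buf;

/* need DMA-able memory in process context, where sleeping is allowed */
buf = kmalloc(size, GFP_DMA | GFP_KERNEL);
if (!buf)
    /* handle error ... */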

The majority of the code uses either GFP_KERNEL or GFP_ATOMIC. Regardless of the allocation type, you must check for and handle failures.

Below is a list of the common situations and the flags to use.

Situation Solution
Process context, can sleep Use GFP_KERNEL.
Process context, cannot sleep Use GFP_ATOMIC, or perform your allocations with GFP_KERNEL at an earlier or later point when you can sleep.
Interrupt handler Use GFP_ATOMIC.
Softirq Use GFP_ATOMIC.
Tasklet Use GFP_ATOMIC.
Need DMA-able memory, can sleep Use (GFP_DMA | GFP_KERNEL).
Need DMA-able memory, cannot sleep Use (GFP_DMA | GFP_ATOMIC), or perform your allocation at an earlier point when you can sleep.

kfree()

The counterpart to kmalloc() is kfree(), declared in <linux/slab.h>:

include/linux/slab.h#L144

void kfree(const void *ptr)

The kfree() method frees a block of memory previously allocated with kmalloc().

Do not call this function on memory not previously allocated with kmalloc(), or on memory that has already been freed. Doing so is a bug, resulting in bad behavior such as freeing memory belonging to another part of the kernel.

As in user-space, be careful to balance your allocations with your deallocations to prevent memory leaks and other bugs. kfree(NULL) is explicitly checked for and safe.
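
For example (BUF_SIZE is a hypothetical constant; GFP_ATOMIC because the allocation happens in an interrupt handler):

char *buf;

buf = kmalloc(BUF_SIZE, GFP_ATOMIC);  /* cannot sleep in interrupt context */
if (!buf)
    /* handle error ... */

Later, when the buffer is no longer needed:

kfree(buf);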

[p243]

vmalloc()

The vmalloc() function works in a similar fashion to kmalloc(), except that vmalloc() allocates memory that is only virtually contiguous and not necessarily physically contiguous. This is also how user-space malloc() works: the pages returned are contiguous within the virtual address space of the process, but not necessarily contiguous in physical RAM.

The vmalloc() function ensures that the pages are contiguous in the virtual address space by allocating potentially noncontiguous chunks of physical memory and "fixing up" the page tables to map the memory into a contiguous chunk of the logical address space.

Though physically contiguous memory is required in only certain cases, most kernel code uses kmalloc() and not vmalloc() to obtain memory primarily for performance. The vmalloc() function, to make nonphysically contiguous pages contiguous in the virtual address space, must specifically set up the page table entries. Worse, pages obtained via vmalloc() must be mapped by their individual pages (because they are not physically contiguous), which results in much greater TLB thrashing than you see when directly mapped memory is used. Because of these concerns, vmalloc() is used only when absolutely necessary (typically, to obtain large regions of memory). For example, when modules are dynamically inserted into the kernel, they are loaded into memory created via vmalloc().

The vmalloc() function is declared in <linux/vmalloc.h> and defined in mm/vmalloc.c. Its usage is identical to user-space’s malloc():

include/linux/vmalloc.h#L53

void * vmalloc(unsigned long size)

To free an allocation obtained via vmalloc(), use:

include/linux/vmalloc.h#L62

void vfree(const void *addr)
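
For example, a sketch of allocating and freeing 16 virtually contiguous pages:

char *buf;

buf = vmalloc(16 * PAGE_SIZE);  /* want 16 * PAGE_SIZE bytes, virtually contiguous */
if (!buf)
    /* error! failed to allocate memory */

/*
 * buf now points to at least 16 * PAGE_SIZE bytes
 * of virtually contiguous memory ...
 */

vfree(buf);  /* both vmalloc() and vfree() can sleep: process context only */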

[p245]

Slab Layer

Free Lists *

To facilitate frequent allocations and deallocations of data, programmers often introduce free lists. A free list contains a block of available, already allocated data structures. When code requires a new instance of a data structure, it can grab one of the structures off the free list rather than allocate memory and set up the structure; later, when the data structure is no longer needed, it is returned to the free list instead of being deallocated.

In this sense, the free list acts as an object cache, caching a frequently used type of object.
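
For example, a minimal sketch of the technique (a hypothetical struct foo; no locking, so this is safe only with a single manipulator):

struct foo {
    /* ... object payload ... */
    struct foo *next;          /* links free objects together */
};

static struct foo *free_list;  /* head of the free list */

static struct foo *foo_alloc(void)
{
    struct foo *p = free_list;

    if (p)
        free_list = p->next;   /* reuse a previously freed object */
    else
        p = kmalloc(sizeof(*p), GFP_KERNEL);
    return p;
}

static void foo_free(struct foo *p)
{
    p->next = free_list;       /* cache the object instead of freeing it */
    free_list = p;
}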

A main problem with free lists in the kernel is that there exists no global control. When available memory is low, there is no way for the kernel to communicate to every free list that it should shrink the size of its cache to free up memory. The kernel has no understanding of the random free lists at all. To remedy this, and to consolidate code, the Linux kernel provides the slab layer (also called the slab allocator). The slab layer acts as a generic data structure-caching layer.

Slab Layer: generic data structure-caching layer *

The concept of a slab allocator was first implemented in Sun Microsystems’ SunOS 5.4 operating system. The Linux data structure caching layer shares the same name and basic design.

The slab layer attempts to leverage several basic tenets:

The slab layer in Linux was designed and implemented with these premises in mind.

Design of the Slab Layer

The following figure diagrams the relationship between caches, slabs, and objects.

Figure 12.1 The relationship between caches, slabs, and objects.

Consider, for example, the inode structure, the in-memory representation of a disk inode (Chapter 13). These structures are frequently created and destroyed, so it makes sense to manage them via the slab allocator.

Each cache is represented by a kmem_cache structure, which contains three lists: slabs_full, slabs_partial, and slabs_empty (stored inside a kmem_list3 structure, which is defined in mm/slab.c).

These lists contain all the slabs associated with the cache.

A slab descriptor, struct slab, represents each slab:

struct slab {
    struct list_head list; /* full, partial, or empty list */
    unsigned long colouroff; /* offset for the slab coloring */
    void *s_mem; /* first object in the slab */
    unsigned int inuse; /* allocated objects in the slab */
    kmem_bufctl_t free; /* first free object, if any */
};

Slab descriptors are allocated either outside the slab in a general cache or inside the slab itself, at the beginning. The descriptor is stored inside the slab if the total size of the slab is sufficiently small, or if internal slack space is sufficient to hold the descriptor.

The slab allocator creates new slabs by interfacing with the low-level kernel page allocator via __get_free_pages():

mm/slab.c#L1609

static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
    struct page *page;
    void *addr;
    int i;

    flags |= cachep->gfpflags;
    if (likely(nodeid == -1)) {
        addr = (void*)__get_free_pages(flags, cachep->gfporder);
        if (!addr)
            return NULL;
        page = virt_to_page(addr);
    } else {
        page = alloc_pages_node(nodeid, flags, cachep->gfporder);
        if (!page)
            return NULL;
        addr = page_address(page);
    }

    i = (1 << cachep->gfporder);
    if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
        atomic_add(i, &slab_reclaim_pages);
    add_page_state(nr_slab, i);
    while (i--) {
        SetPageSlab(page);
        page++;
    }
    return addr;
}

This function uses __get_free_pages() to allocate memory sufficient to hold the cache. The flags parameter is binary ORed with the cache’s own gfpflags, and cachep->gfporder determines the order of the allocation; if nodeid is not -1, the allocation is fulfilled via alloc_pages_node() from the given NUMA node.

A simplified version of kmem_getpages() that ignores the NUMA-aware code (for educational purposes) looks like this:

static inline void * kmem_getpages(struct kmem_cache *cachep, gfp_t flags)
{
    void *addr;
    flags |= cachep->gfpflags;
    addr = (void*) __get_free_pages(flags, cachep->gfporder);

    return addr;
}

Memory is freed by kmem_freepages(), which calls free_pages() on the given cache’s pages.

The point of the slab layer is to refrain from allocating and freeing pages. Thus, the slab layer invokes the page allocation function only when there are no partial or empty slabs in a given cache. The freeing function is called only when available memory grows low and the system is attempting to free memory, or when a cache is explicitly destroyed.

The slab layer is managed on a per-cache basis through a simple interface, which is exported to the entire kernel. The interface enables the creation and destruction of new caches and the allocation and freeing of objects within the caches. The sophisticated management of caches and the slabs within is entirely handled by the internals of the slab layer. After you create a cache, the slab layer works just like a specialized allocator for the specific type of object.

Slab Allocator Interface

A new cache is created via kmem_cache_create():

mm/slab.c#L2098

struct kmem_cache * kmem_cache_create(const char *name,
                                      size_t size,
                                      size_t align,
                                      unsigned long flags,
                                      void (*ctor)(void *));

kmem_cache_destroy() destroys a cache:

mm/slab.c#L2543

int kmem_cache_destroy(struct kmem_cache *cachep)

This function is invoked from module shutdown code in modules that create their own caches. It must not be called from interrupt context because it may sleep. The caller must ensure two conditions before invoking this function:

  1. All slabs in the cache are empty. If an object in one of the slabs were still allocated and in use, how could the cache be destroyed?
  2. No one accesses the cache during (and obviously after) a call to kmem_cache_destroy(). The caller must ensure this synchronization.

Allocating from the Cache

[p250]
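
For reference, the interface covered on p250 consists of two functions (prototypes as in 2.6-era kernels):

void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
void kmem_cache_free(struct kmem_cache *cachep, void *objp)

The first returns a pointer to an object from the given cache, growing the cache (ultimately via kmem_getpages()) if no free objects are available; the second marks the object objp in cachep as free.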

Example of Using the Slab Allocator

[p251]
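
The book’s example revolves around the task_struct cache. A condensed sketch of that example (ARCH_MIN_TASKALIGN is an architecture-defined alignment; the cache is created during kernel initialization, in kernel/fork.c):

struct kmem_cache *task_struct_cachep;

/* create the cache; SLAB_PANIC means no NULL check is needed */
task_struct_cachep = kmem_cache_create("task_struct",
                                       sizeof(struct task_struct),
                                       ARCH_MIN_TASKALIGN,
                                       SLAB_PANIC | SLAB_NOTRACK,
                                       NULL);

/* each call to fork() allocates a new task_struct from the cache */
struct task_struct *tsk;

tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
    return NULL;

/* when a process exits and nothing holds a reference, free the object */
kmem_cache_free(task_struct_cachep, tsk);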

Statically Allocating on the Stack

User-space can afford a large, dynamically growing stack, whereas the kernel’s stack is small and fixed.

The size of the per-process kernel stack depends on both the architecture and a compile-time option. Historically, the kernel stack has been two pages per process. This is usually 8KB for 32-bit architectures and 16KB for 64-bit architectures, because they usually have 4KB and 8KB pages, respectively.

Single-Page Kernel Stacks

Early in the 2.6 kernel series, an option was introduced to move to single-page kernel stacks, where each process is given only a single page (4KB on 32-bit architectures and 8KB on 64-bit architectures). This was done for two reasons:

  1. It results in one page less memory consumption per process.
  2. As uptime increases, it becomes increasingly hard to find two physically contiguous unallocated pages. Physical memory becomes fragmented, and the resulting VM pressure from allocating a single new process is expensive.

There is one more complication. Each process’s entire call chain has to fit in its kernel stack. Historically, however, interrupt handlers also used the kernel stack of the process they interrupted, thus they too had to fit. This was efficient and simple, but it placed even tighter constraints on the already meager kernel stack. When the stack moved to only a single page, interrupt handlers no longer fit.

Interrupt stacks

To rectify this problem, the kernel developers implemented a new feature: interrupt stacks. Interrupt stacks provide a single per-processor stack used for interrupt handlers. With this option, interrupt handlers no longer share the kernel stack of the interrupted process. Instead, they use their own stacks. This consumes only a single page per processor.

To summarize, kernel stacks are either one or two pages, depending on compile-time configuration options. The stack can therefore range from 4KB to 16KB. Historically, interrupt handlers shared the stack of the interrupted process. When single page stacks are enabled, interrupt handlers are given their own stacks. In any case, unbounded recursion and alloca() are obviously not allowed.

Playing Fair on the Stack

In any given function, you must keep stack usage to a minimum. You should keep the sum of all local (automatic) variables in a function to a maximum of a couple hundred bytes. Performing a large static allocation on the stack (e.g. a large array or structure) is dangerous. Otherwise, stack allocations are performed in the kernel just as in user-space.

Stack overflows occur silently and will undoubtedly result in problems. Because the kernel does not make any effort to manage the stack, when the stack overflows, the excess data simply spills into whatever exists at the tail end of the stack, the first thing of which is the thread_info structure, which is allocated at the end of each process’s kernel stack (Chapter 3).

Beyond the stack, any kernel data might lurk. At best, the machine will crash when the stack overflows. At worst, the overflow will silently corrupt data. Therefore, it is wise to use a dynamic allocation scheme (discussed previously in this chapter) for any large memory allocations.

High Memory Mappings

By definition, pages in high memory might not be permanently mapped into the kernel’s (virtual) address space. Thus, pages obtained via alloc_pages() with the __GFP_HIGHMEM flag might not have a logical address (see the relevant text in the Zone Modifiers subsection).

On the x86 architecture, all physical memory beyond the 896MB mark is high memory and is not permanently or automatically mapped into the kernel’s address space, despite x86 processors being capable of physically addressing up to 4GB (64GB with PAE) of physical RAM. After they are allocated, these pages must be mapped into the kernel’s logical address space. On x86, pages in high memory are mapped somewhere between the 3GB and 4GB mark.

Permanent Mappings

To map a given page structure into the kernel’s address space, use this function, declared in <linux/highmem.h>:

include/linux/highmem.h#L58

void *kmap(struct page *page)

This function works on either high or low memory. If the page structure belongs to a page in low memory, the page’s virtual address is simply returned. If the page resides in high memory, a permanent mapping is created and the address is returned.

The function may sleep, so kmap() works only in process context.

Because the number of permanent mappings is limited, high memory should be unmapped when no longer needed. This is done via the kunmap() function, which unmaps the given page:

void kunmap(struct page *page)

Temporary Mappings

When a mapping must be created but the current context cannot sleep, the kernel provides temporary mappings (also called atomic mappings). The kernel can atomically map a high memory page into one of the reserved mappings (which can hold temporary mappings). Consequently, a temporary mapping can be used in places that cannot sleep, such as interrupt handlers, because obtaining the mapping never blocks.

Setting up a temporary mapping is done via kmap_atomic():

include/linux/highmem.h#L68

void *kmap_atomic(struct page *page, enum km_type type)

The type parameter is one of the following enumerations, which describe the purpose of the temporary mapping:

include/asm-generic/kmap_types.h

enum km_type {
    KM_BOUNCE_READ,
    KM_SKB_SUNRPC_DATA,
    KM_SKB_DATA_SOFTIRQ,
    KM_USER0,
    KM_USER1,
    KM_BIO_SRC_IRQ,
    KM_BIO_DST_IRQ,
    KM_PTE0,
    KM_PTE1,
    KM_PTE2,
    KM_IRQ0,
    KM_IRQ1,
    KM_SOFTIRQ0,
    KM_SOFTIRQ1,
    KM_SYNC_ICACHE,
    KM_SYNC_DCACHE,
    KM_UML_USERCOPY,
    KM_IRQ_PTE,
    KM_NMI,
    KM_NMI_PTE,
    KM_TYPE_NR
};

This function does not block and thus can be used in interrupt context and other places that cannot reschedule. It also disables kernel preemption, which is needed because the mappings are unique to each processor and a reschedule might change which task is running on which processor.

The mapping is undone via:

void kunmap_atomic(void *kvaddr, enum km_type type)

This function also does not block. In many architectures it does not do anything at all except enable kernel preemption, because a temporary mapping is valid only until the next temporary mapping. Thus, the kernel can just "forget about" the kmap_atomic() mapping, and kunmap_atomic() does not need to do anything special. The next atomic mapping then simply overwrites the previous one.
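
A minimal sketch of a temporary mapping, assuming page is a struct page for a page of high memory obtained earlier:

void *vaddr;

vaddr = kmap_atomic(page, KM_USER0);  /* maps the page and disables preemption */
/* ... work on the mapping; this code must not sleep ... */
kunmap_atomic(vaddr, KM_USER0);       /* unmaps and reenables preemption */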

Per-CPU Allocations

Modern SMP-capable operating systems use per-CPU data, that is, data that is unique to a given processor. Typically, per-CPU data is stored in an array; each item in the array corresponds to a possible processor on the system, and the current processor number indexes the array. This approach was used in the 2.4 kernel and remains valid in 2.6. [p255] You declare the data as:

unsigned long my_percpu[NR_CPUS];

Then you can access it as:

int cpu;

cpu = get_cpu(); /* get current processor and disable kernel preemption */
my_percpu[cpu]++; /* ... or whatever */
printk("my_percpu on cpu=%d is %lu\n", cpu, my_percpu[cpu]);
put_cpu(); /* enable kernel preemption */

No lock is required because this data is unique to the current processor. If no processor other than the current one touches this data, no concurrency concerns exist, and the current processor can safely access the data without a lock.

Kernel preemption is the only concern with per-CPU data, which poses two problems:

  1. If your code is preempted and reschedules on another processor, the cpu variable is no longer valid because it points to the wrong processor. (In general, code cannot sleep after obtaining the current processor number.)
  2. If another task preempts your code, it can concurrently access my_percpu on the same processor, which is a race condition.

The call get_cpu(), on top of returning the current processor number, also disables kernel preemption. The corresponding call to put_cpu() enables kernel preemption. If you use a call to smp_processor_id() to get the current processor number, kernel preemption is not disabled. Always use the aforementioned methods to remain safe.

The New percpu Interface

The 2.6 kernel introduced a new interface, percpu, for creating and manipulating per-CPU data. This interface generalizes the previous example. Creation and manipulation of per-CPU data is simplified with this new approach.

The previously discussed method of creating and accessing per-CPU data is still valid and accepted. This new interface, however, grew out of the need for a simpler and more powerful method for manipulating per-CPU data on large symmetrical multiprocessing computers.

The header <linux/percpu.h> (include/linux/percpu.h) declares all the routines. You can find the actual definitions there, in mm/slab.c, and in <asm/percpu.h>.

Per-CPU Data at Compile-Time

Define a per-CPU variable at compile time:

DEFINE_PER_CPU(type, name);

This creates an instance of a variable of type type, named name, for each processor on the system. If you need a declaration of the variable elsewhere, to avoid compile warnings, use the following macro:

DECLARE_PER_CPU(type, name);

You can manipulate the variables with the get_cpu_var() and put_cpu_var() routines. A call to get_cpu_var() returns an lvalue for the given variable on the current processor. It also disables preemption, which put_cpu_var() correspondingly enables.

get_cpu_var(name)++; /* increment name on this processor */
put_cpu_var(name); /* done; enable kernel preemption */

You can obtain the value of another processor’s per-CPU data, too:

per_cpu(name, cpu)++; /* increment name on the given processor */

You need to be careful with this approach because per_cpu() neither disables kernel preemption nor provides any sort of locking mechanism. The lockless nature of per-CPU data exists only if the current processor is the only manipulator of the data. If other processors touch other processors’ data, you need locks. Chapter 9 and Chapter 10 discuss locking.

These compile-time per-CPU examples do not work for modules because the linker actually creates them in a unique executable section (.data.percpu). If you need to access per-CPU data from modules, or if you need to create such data dynamically, you cannot use compile-time per-CPU data.

Per-CPU Data at Runtime
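
The 2.6 kernel also lets you create per-CPU data at runtime. A minimal sketch of that interface, using alloc_percpu(), per_cpu_ptr(), and free_percpu() as provided by 2.6-era kernels:

void *percpu_ptr;
unsigned long *foo;
int cpu;

percpu_ptr = alloc_percpu(unsigned long);  /* one instance per processor */
if (!percpu_ptr)
    /* handle allocation failure ... */

cpu = get_cpu();                           /* disables kernel preemption */
foo = per_cpu_ptr(percpu_ptr, cpu);        /* this processor's copy */
(*foo)++;
put_cpu();                                 /* reenables kernel preemption */

free_percpu(percpu_ptr);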

Doubts and Solutions

Verbatim

p235 on zones:

A specific lock does not protect individual pages, although parts of the kernel may lock the data that happens to reside in said pages.

p244 on vmalloc():

The vmalloc() function, to make nonphysically contiguous pages contiguous in the virtual address space, must specifically set up the page table entries. Worse, pages obtained via vmalloc() must be mapped by their individual pages (because they are not physically contiguous), which results in much greater TLB thrashing than you see when directly mapped memory is used.

p248 on slab descriptors:

The descriptor is stored inside the slab if the total size of the slab is sufficiently small, or if internal slack space is sufficient to hold the descriptor.

p249 on kmem_getpages():

For educational purposes, we can ignore the NUMA-aware code and write a simple kmem_getpages():

static inline void * kmem_getpages(struct kmem_cache *cachep, gfp_t flags)
{
    void *addr;
    flags |= cachep->gfpflags;
    addr = (void*) __get_free_pages(flags, cachep->gfporder);

    return addr;
}

The "educational" code merely wraps around __get_free_pages without any slab related code. WTF?

p250 on flags to kmem_cache_create():

SLAB_HWCACHE_ALIGN instructs the slab layer to align each object within a slab to a cache line, which prevents "false sharing" (two or more objects mapping to the same cache line despite existing at different addresses in memory). This improves performance but comes at a cost of increased memory footprint because the stricter alignment results in more wasted slack space. How large the increase in memory consumption is depends on the size of the objects and how they naturally align with respect to the system’s cache lines. For frequently used caches in performance-critical code, setting this option is a good idea; otherwise, think twice.

p253 on single-page kernel stacks

When single page stacks are enabled, interrupt handlers are given their own stacks. In any case, unbounded recursion and alloca() are obviously not allowed.

What is unbounded recursion and alloca()?