Sunday, February 9, 2014

Linux Memory Management


  • Physical pages are the basic unit of memory management in the kernel.
  • The MMU (memory management unit) manages memory in units of pages.
  • Page size is architecture-specific: most 32-bit architectures, and x86-64 as well, use 4KB pages, while some 64-bit architectures such as Alpha use 8KB pages.
  • The kernel stores information about every physical page in a struct page.
  • In older 2.6 kernels this structure is defined in <linux/mm.h>; later kernels moved it to <linux/mm_types.h>.

struct page {
        page_flags_t          flags;
        atomic_t              _count;
        atomic_t              _mapcount;
        unsigned long         private;
        struct address_space  *mapping;
        pgoff_t               index;
        struct list_head      lru;
        void                  *virtual;
};
  • Some important fields are:
  1. flags stores the status of the page, such as whether the page is dirty (it has been modified) or locked in memory. Each flag is a single bit, so at least 32 flags can be set at the same time. The flag values are defined in <linux/page-flags.h>.
  2. _count is the usage count of the page, that is, how many references to the page exist. Kernel code should call page_count() rather than read the field directly; a result of zero means no one is using the page and it is free for a new allocation.
  3. virtual is the page's address in kernel virtual memory. High memory (on 32-bit x86, physical memory above roughly 896MB) is not permanently mapped into the kernel's address space, so for such pages this field is NULL.

  • The goal of this data structure is to describe the physical page itself, not the data stored in it.
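
To make the flags and usage-count fields a little more concrete, here is a small user-space C model. It is only a sketch: struct model_page, the model_* helpers and the two bit numbers are made up for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers; the real ones live in <linux/page-flags.h>
 * and differ between kernel versions. */
enum { MODEL_PG_LOCKED = 0, MODEL_PG_DIRTY = 1 };

/* Simplified stand-in for struct page: one descriptor per physical page. */
struct model_page {
    unsigned long flags;   /* one status bit per flag */
    int count;             /* usage count; 0 means the page is free */
    void *virtual_addr;    /* NULL when the page has no kernel mapping */
};

static void model_set_flag(struct model_page *p, int bit)
{
    assert(bit >= 0 && bit < 32);
    p->flags |= 1UL << bit;
}

static void model_clear_flag(struct model_page *p, int bit)
{
    assert(bit >= 0 && bit < 32);
    p->flags &= ~(1UL << bit);
}

static bool model_test_flag(const struct model_page *p, int bit)
{
    assert(bit >= 0 && bit < 32);
    return ((p->flags >> bit) & 1UL) != 0;
}

/* Mirrors the idea behind page_count(): zero users means the page is free. */
static bool model_page_is_free(const struct model_page *p)
{
    return p->count == 0;
}
```

Note that the descriptor says nothing about what is stored in the page; it only tracks the state of the page itself.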


  • On 32-bit x86 the kernel divides physical memory into three zones: ZONE_DMA (<16MB), ZONE_NORMAL (16MB-896MB) and ZONE_HIGHMEM (>896MB).
  • The kernel groups pages with similar properties into the same zone.
  • Zones have no physical significance; they are only a logical grouping the kernel uses to keep track of pages.
  • Each zone is represented by a struct zone, which is defined in <linux/mmzone.h>.
  • For more details on zones, read my other post on Linux addressing.
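
The zone boundaries quoted above can be sketched as a simple classification of a physical address. This is a user-space illustration that assumes the 32-bit x86 limits (16MB and 896MB); the model_* names are hypothetical, and the real kernel works with struct zone rather than a function like this.

```c
#include <assert.h>
#include <stdint.h>

enum model_zone { MODEL_ZONE_DMA, MODEL_ZONE_NORMAL, MODEL_ZONE_HIGHMEM };

#define MODEL_MB (1024ULL * 1024ULL)

/* Classify a physical address using the 32-bit x86 boundaries. */
static enum model_zone model_zone_of(uint64_t phys_addr)
{
    if (phys_addr < 16 * MODEL_MB)
        return MODEL_ZONE_DMA;
    if (phys_addr < 896 * MODEL_MB)
        return MODEL_ZONE_NORMAL;
    return MODEL_ZONE_HIGHMEM;
}

/* Human-readable name for a zone, for printing or logging. */
static const char *model_zone_name(enum model_zone z)
{
    static const char *const names[] = {
        "ZONE_DMA", "ZONE_NORMAL", "ZONE_HIGHMEM"
    };

    assert((unsigned int)z <= MODEL_ZONE_HIGHMEM);
    return names[z];
}
```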


  • The kernel provides several interfaces for allocating and freeing memory in kernel space.
  • All of these interfaces allocate memory with page-sized granularity and are declared in <linux/gfp.h>.
  • We can allocate memory that is physically contiguous, or memory that is only virtually contiguous.
  • One should never attempt to allocate memory for userspace from the kernel; this is a serious violation of the kernel's abstraction layering.
  • Instead, have userspace mmap pages owned by your driver directly into its address space, or have userspace ask the driver how much space it needs, allocate that buffer itself, and then fetch the data from the kernel.
  • There is no way to allocate physically contiguous memory from userspace in Linux.
  • This is because a userspace program cannot control, or even find out, whether the memory backing its addresses is physically contiguous.
  • The core function is struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
  • This allocates 2^order (that is, 1 << order) physically contiguous pages and returns a pointer to the first page's page structure; on error it returns NULL.
  • To convert a given page to its logical address we can use the function void * page_address(struct page *page).
  • This function returns a pointer to the logical address where the given physical page currently resides.
  • If we only need the address of the pages and not their page structures, we can use the function unsigned long __get_free_pages(unsigned int gfp_mask, unsigned int order).
  • The pages obtained this way are physically contiguous, and therefore also contiguous in the kernel's logical address space.
  • This function is built on alloc_pages(), but it directly returns the logical address of the first page.
  • If we need just a single page (order 0), there are two functions, one returning the page structure and the other the logical address:
  1. struct page * alloc_page(unsigned int gfp_mask)
  2. unsigned long __get_free_page(unsigned int gfp_mask) 
  • If we need a page filled with zeros, we can use unsigned long get_zeroed_page(unsigned int gfp_mask). This matters for security when the page will be handed to user space: a freshly allocated page still holds whatever was written to it previously, and zeroing it ensures that user space cannot read those stale contents.
  • This function works the same as __get_free_page(), except that the allocated page is zero-filled.
  • To free pages we have the following functions:
  1. void __free_pages(struct page *page, unsigned int order)
  2. void free_pages(unsigned long addr, unsigned int order)
  3. void free_page(unsigned long addr)

  • Page allocation may fail, so we must always check the return value and handle the failure.
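
The order argument is easy to get wrong, so here is a small user-space sketch of the arithmetic behind it. It assumes 4KB pages and an arbitrary upper limit on the order; MODEL_PAGE_SIZE, MODEL_MAX_ORDER and the model_* functions are hypothetical names for this illustration. The kernel has its own helper, get_order(), for the size-to-order calculation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumption: 4KB pages */
#define MODEL_MAX_ORDER 10U      /* assumption: arbitrary cap for this sketch */

/* Number of pages in an order-N block: 2^order. */
static unsigned long model_pages_for_order(unsigned int order)
{
    assert(order <= MODEL_MAX_ORDER);
    return 1UL << order;
}

/* Bytes covered by an order-N block. */
static unsigned long model_bytes_for_order(unsigned int order)
{
    return model_pages_for_order(order) * MODEL_PAGE_SIZE;
}

/* Smallest order whose block holds at least `size` bytes. */
static unsigned int model_order_for_size(size_t size)
{
    unsigned int order = 0;

    while (model_bytes_for_order(order) < size) {
        order++;
    }
    return order;
}
```

For example, a request for 4097 bytes needs order 1, which is two pages (8KB), so almost a full page is wasted. This rounding up to a power of two is the price of page-level allocation.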


  • The kmalloc() function works much like user space's familiar malloc() routine, except that it takes an additional flags parameter.
  • It is used when we want to allocate a small chunk of memory with byte granularity.
  • For larger allocations, the page allocation functions described above are a good option.
  • Most memory allocation in the kernel is done with kmalloc().
  • The function is declared in <linux/slab.h>:
  • void * kmalloc(size_t size, int flags)
  • The function returns a pointer to a region of memory that is at least size bytes in length.
  • The allocated region is physically contiguous.
  • On error, it returns NULL.
  • Kernel allocations almost always succeed unless there is not enough memory available.
  • Even so, we must check for NULL after every call to kmalloc() and handle the error appropriately.
  • e.g.
    struct abc *ptr;

    ptr = kmalloc(sizeof(struct abc), GFP_KERNEL);
    if (!ptr)
            return -ENOMEM;  /* handle the error, here by reporting out-of-memory */

  • The GFP_KERNEL flag specifies the behavior of the memory allocator while trying to obtain the memory to return to the caller of kmalloc().

gfp_mask Flags

In this section we discuss the flags used with kmalloc() and the low-level page allocation functions.

The flags are broken up into three categories:

  1. action modifiers
  2. zone modifiers
  3. types

  • Action modifiers specify how the kernel is supposed to allocate the requested memory.
  • In certain situations, only certain methods can be employed to allocate memory.
  • For example, interrupt handlers must instruct the kernel not to sleep in the course of allocating memory, because interrupt handlers cannot be rescheduled.
  • Zone modifiers specify where to allocate memory from.
  • As we saw in the article on Linux addressing, the kernel divides physical memory into multiple zones, each of which serves a different purpose.
  • Zone modifiers specify which of these zones to allocate from.
  • Type flags specify a combination of action and zone modifiers as needed by a certain type of memory allocation.
  • Type flags save us from spelling out numerous modifiers; we generally specify just one type flag instead.
  • All the flags are declared in <linux/gfp.h>.
  • The file <linux/slab.h> includes this header, however, so we rarely need to include it directly.

Action modifiers-

Flag             Description
__GFP_WAIT       The allocator can sleep.
__GFP_HIGH       The allocator can access emergency pools.
__GFP_IO         The allocator can start disk I/O.
__GFP_FS         The allocator can start filesystem I/O.
__GFP_COLD       The allocator should use cache cold pages.
__GFP_NOWARN     The allocator will not print failure warnings.
__GFP_REPEAT     The allocator will repeat the allocation if it fails, but the allocation can potentially fail.
__GFP_NOFAIL     The allocator will indefinitely repeat the allocation. The allocation cannot fail.
__GFP_NORETRY    The allocator will never retry if the allocation fails.
__GFP_NO_GROW    Used internally by the slab layer.
__GFP_COMP       Add compound page metadata. Used internally by the hugetlb code.

  • These modifiers can be specified together. For example, ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);
  • Let's see how this allocation works.
  • It tells the page allocator (the call ultimately reaches alloc_pages(), as we saw before) that the allocation can:
  1. block
  2. perform I/O
  3. perform filesystem operations, if needed.
  • This gives the kernel great freedom in how it finds free memory to satisfy the allocation.
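
Since the modifiers are plain bit flags, combining and testing them is ordinary bitmask arithmetic. The sketch below models that in user space. The MODEL_GFP_* names and their bit values are illustrative assumptions only; the real definitions are in <linux/gfp.h> and vary between kernel versions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only, not the kernel's definitions. */
#define MODEL_GFP_WAIT 0x10u   /* may sleep */
#define MODEL_GFP_IO   0x40u   /* may start disk I/O */
#define MODEL_GFP_FS   0x80u   /* may start filesystem I/O */
#define MODEL_GFP_ALL  (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

/* What an allocation with a given mask is allowed to do. */
struct model_permissions {
    bool sleep;
    bool disk_io;
    bool fs_io;
};

static struct model_permissions model_decode(unsigned int mask)
{
    struct model_permissions p;

    /* This sketch only knows about the three bits modelled above. */
    assert((mask & ~MODEL_GFP_ALL) == 0);

    p.sleep   = (mask & MODEL_GFP_WAIT) != 0;
    p.disk_io = (mask & MODEL_GFP_IO) != 0;
    p.fs_io   = (mask & MODEL_GFP_FS) != 0;
    return p;
}
```

Decoding MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS yields all three permissions, which corresponds to the kmalloc() call shown above; dropping a bit removes exactly one freedom from the allocator.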

Zone Modifier-

  • Zone modifiers specify from which memory zone the allocation should originate. 
  • Normally, allocations can be fulfilled from any zone. 
  • The kernel prefers ZONE_NORMAL, however, to ensure that the other zones have free pages when they are needed.
  • There are only two zone modifiers because there are only two zones other than ZONE_NORMAL (which is where, by default, allocations originate). 

Flag             Description
__GFP_DMA        Allocate only from ZONE_DMA
__GFP_HIGHMEM    Allocate from ZONE_HIGHMEM or ZONE_NORMAL

  • If neither flag is specified, the kernel fulfills the allocation from either ZONE_DMA or ZONE_NORMAL, with a strong preference for ZONE_NORMAL.
  • We cannot pass __GFP_HIGHMEM to either __get_free_pages() or kmalloc(), because both return a logical address and not a page structure.
  • With __GFP_HIGHMEM these functions could end up allocating memory that is not currently mapped in the kernel's virtual address space and thus has no logical address to return.
  • Only alloc_pages() can allocate high memory.
  • For the majority of our allocations, however, we don't need to specify a zone modifier because ZONE_NORMAL is sufficient.

Type Flags-

  • The type flags specify the action and zone modifiers required to fulfill a particular type of allocation.
  • The good news is that kernel code therefore tends to use the appropriate type flag rather than spelling out the many individual modifiers it needs.

Flag            Description
GFP_ATOMIC      The allocation is high priority and must not sleep. This is the flag to use in interrupt handlers, in bottom halves, while holding a spinlock, and in other situations where we cannot sleep.
GFP_NOIO        This allocation can block, but must not initiate disk I/O. This is the flag to use in block I/O code when we cannot cause more disk I/O, which might lead to some unpleasant recursion.
GFP_NOFS        This allocation can block and can initiate disk I/O, if it must, but will not initiate a filesystem operation. This is the flag to use in filesystem code when we cannot start another filesystem operation.
GFP_KERNEL      This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep. The kernel will do whatever it has to in order to obtain the memory requested by the caller. This flag should be our first choice.
GFP_USER        This is a normal allocation and might block. This flag is used to allocate memory for user-space processes.
GFP_HIGHUSER    This is an allocation from ZONE_HIGHMEM and might block. This flag is used to allocate memory for user-space processes.
GFP_DMA         This is an allocation from ZONE_DMA. Device drivers that need DMA-able memory use this flag, usually in combination with one of the above.

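Which modifiers does each type flag combine internally?

Flag            Modifier flags
GFP_ATOMIC      __GFP_HIGH
GFP_NOIO        __GFP_WAIT
GFP_NOFS        (__GFP_WAIT | __GFP_IO)
GFP_KERNEL      (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_USER        (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_HIGHUSER    (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
GFP_DMA         __GFP_DMA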

Let's look at the important type flags in more detail.


  • The vast majority of allocations in the kernel use the GFP_KERNEL flag.
  • The resulting allocation is a normal-priority allocation that might sleep.
  • Because the call can block, this flag can be used only from process context that can safely reschedule (that is, no locks are held and so on).
  • Because this flag places no restrictions on how the kernel may obtain the requested memory, the allocation has a high probability of succeeding.

  • The GFP_ATOMIC flag is at the opposite extreme from GFP_KERNEL.
  • Because this flag specifies an allocation that cannot sleep, the allocator is very restricted in the memory it can obtain for the caller.
  • If no sufficiently large contiguous chunk of memory is available, the kernel is unlikely to free any, because it cannot put the caller to sleep.
  • Conversely, a GFP_KERNEL allocation can put the caller to sleep to swap inactive pages to disk, flush dirty pages to disk, and so on.
  • Because GFP_ATOMIC cannot perform any of these actions, it has less chance of succeeding (at least when memory is low) than a GFP_KERNEL allocation.
  • Still, GFP_ATOMIC is the only option when the current code cannot sleep, as in interrupt handlers, softirqs, and tasklets.

GFP_NOIO and GFP_NOFS flags-

  • In between GFP_KERNEL and GFP_ATOMIC are GFP_NOIO and GFP_NOFS.
  • Allocations made with these flags might block, but they refrain from performing certain other operations.
  • A GFP_NOIO allocation does not initiate any disk I/O whatsoever to fulfill the request.
  • GFP_NOFS, on the other hand, might initiate disk I/O but does not initiate filesystem I/O.
  • Why might we need these flags?
  • They are needed for certain low-level block I/O and filesystem code, respectively.
  • Imagine if a common path in the filesystem code allocated memory without the GFP_NOFS flag. The allocation could result in more filesystem operations, which would then beget other allocations and, thus, more filesystem operations! This could continue indefinitely.
  • Code such as this, which invokes the allocator, must ensure that the allocator does not in turn execute it, or else the allocation can create a deadlock.
  • Not surprisingly, the kernel uses these two flags in only a few places.

GFP_DMA flag-

  • The GFP_DMA flag specifies that the allocator must satisfy the request from ZONE_DMA.
  • This flag is used by device drivers that need DMA-able memory for their devices. Normally, we combine it with the GFP_ATOMIC or GFP_KERNEL flag.

Which flag to use when?

Situation                             Solution
Process context, can sleep            Use GFP_KERNEL
Process context, cannot sleep         Use GFP_ATOMIC, or perform your allocations with GFP_KERNEL at an earlier or later point when you can sleep
Interrupt handler                     Use GFP_ATOMIC
Need DMA-able memory, can sleep       Use (GFP_DMA | GFP_KERNEL)
Need DMA-able memory, cannot sleep    Use (GFP_DMA | GFP_ATOMIC), or perform your allocation at an earlier point when you can sleep
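
The choice comes down to two questions: can the code sleep, and does it need DMA-able memory? A minimal user-space sketch of that decision follows. The model_* names are hypothetical and exist only in this example; the function merely reports which flag combination the rules above point to.

```c
#include <assert.h>
#include <stdbool.h>

enum model_context { MODEL_PROCESS, MODEL_INTERRUPT };

enum model_choice {
    MODEL_USE_KERNEL,
    MODEL_USE_ATOMIC,
    MODEL_USE_DMA_KERNEL,
    MODEL_USE_DMA_ATOMIC
};

/* Pick a flag combination from the execution context and requirements. */
static enum model_choice model_choose(enum model_context ctx,
                                      bool can_sleep, bool need_dma)
{
    /* Interrupt handlers can never sleep, whatever the caller claims. */
    if (ctx == MODEL_INTERRUPT)
        can_sleep = false;

    if (need_dma)
        return can_sleep ? MODEL_USE_DMA_KERNEL : MODEL_USE_DMA_ATOMIC;
    return can_sleep ? MODEL_USE_KERNEL : MODEL_USE_ATOMIC;
}

/* The flag expression a driver would write for each choice. */
static const char *model_choice_name(enum model_choice c)
{
    static const char *const names[] = {
        "GFP_KERNEL", "GFP_ATOMIC",
        "GFP_DMA | GFP_KERNEL", "GFP_DMA | GFP_ATOMIC"
    };

    assert((unsigned int)c <= MODEL_USE_DMA_ATOMIC);
    return names[c];
}
```

What the sketch cannot express is the other solution: restructuring the code so that the allocation happens at a point where sleeping is allowed, which is often the better fix.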


  • kfree() undoes the work done by kmalloc().
  • This function is declared in <linux/slab.h>.
  • void kfree(const void *ptr)
  • Use it only on blocks of memory that were previously allocated with kmalloc() and have not already been freed.
  • e.g.
    char *buf;

    buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
    if (!buf)
            return -ENOMEM;  /* error allocating memory! */

    Later, when we no longer need the memory, we must free it:

    kfree(buf);


  • vmalloc() is a kernel function that works like the user-space function malloc().
  • Both vmalloc() and malloc() return memory that is virtually contiguous but not necessarily physically contiguous.
  • In the kernel we normally use kmalloc() and seldom use vmalloc().
  • vmalloc() is used when the requested size is large, because kmalloc() may fail to find a large enough block of physically contiguous memory.
  • The vmalloc() function is declared in <linux/vmalloc.h> and defined in mm/vmalloc.c.
  • Usage is identical to user-space's malloc(): void * vmalloc(unsigned long size).
  • vmalloc() also costs performance: the physically scattered pages must be mapped one by one through page table entries, which puts more pressure on the TLB than directly mapped memory does.
  • To free an allocation obtained via vmalloc(), we use
    void vfree(void *addr).
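
To picture the difference between virtually and physically contiguous memory, here is a toy user-space model of a mapping from virtual pages to physical frames. It is only a sketch assuming 4KB pages; struct model_mapping and the model_* functions are invented for this example and do not correspond to kernel data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumption: 4KB pages */

/* One entry per virtual page: the physical frame number backing it. */
struct model_mapping {
    const unsigned long *frames;
    size_t npages;
};

/* Translate an offset inside the virtually contiguous region
 * to the physical address that backs it. */
static unsigned long model_translate(const struct model_mapping *m,
                                     unsigned long offset)
{
    size_t vpage = offset / MODEL_PAGE_SIZE;

    assert(vpage < m->npages);
    return m->frames[vpage] * MODEL_PAGE_SIZE + offset % MODEL_PAGE_SIZE;
}

/* True when every frame directly follows the previous one,
 * which is what kmalloc() guarantees and vmalloc() does not. */
static bool model_physically_contiguous(const struct model_mapping *m)
{
    for (size_t i = 1; i < m->npages; i++) {
        if (m->frames[i] != m->frames[i - 1] + 1)
            return false;
    }
    return true;
}
```

With frames such as {7, 3, 42}, offsets 0, 4096 and 8192 look adjacent to the code using the buffer, yet they land in three unrelated physical locations. That is fine for software, but not for hardware that needs one physically contiguous block.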


  1. very good link.. keep up the good work :)

  2. thanks, keep visiting the blog so that I stay encouraged to write more :D

  3. very nice posts..keep doing good job mate!!!!!!!!

  4. 64 bit architectures have 8KB page size - not true for x86-64

  5. Nice. It Helps me a lot to understand with this simple description.
