LINUX KERNEL INTERNALS: February 2014

Thursday, February 27, 2014

Linux Kernel Questions

For answers to these questions, see "Linux Kernel Questions and Answers" on the right side. This post is just meant to make you think about the answers before reading them :)

1. Why do we need two bootloaders viz. primary and secondary?

2. When the bootloader has finished its job, what is the first location and what is the first process that gets executed in the Linux kernel?

3. What tasks does the Linux kernel perform once it is loaded?

4. What is the first user-space process that runs when the Linux kernel loads?

5. Why aren't we allowed to sleep in interrupt context?

6. What are the possible task states?

7. How to Pass Command Line Arguments to a Kernel Module?

8. What is the Linux Device Driver Model?

9. Explain the basics of Linux kernel.

10. What is a Loadable Kernel Module?

Interrupts

The kernel is responsible for servicing requests from hardware devices.
The CPU must process each request from the hardware.
Because hardware devices run more slowly than the CPU, they cannot send data or requests to the CPU synchronously.
There are two ways in which the CPU can check for a request from a hardware device:

1. Polling
2. Interrupt

In polling, the CPU repeatedly checks every hardware device to see whether it has a request.
With interrupts, the CPU attends to a hardware device only when that device requests service.
Polling is expensive because the CPU spends time checking devices even when they have nothing to report.
Interrupts are the better approach, as the hardware signals the CPU only when it has a request to be serviced.
Different devices are given different interrupt values, called IRQ (interrupt request) lines.
For example, on a classic PC, IRQ zero is the timer interrupt and IRQ one is the keyboard interrupt.
An interrupt is physically produced by electronic signals originating from hardware devices and directed into input pins on an interrupt controller.
Some interrupt numbers are static and some interrupts are dynamically assigned.
Be it static or dynamic, the kernel must know which interrupt number is associated with which hardware.
The interrupt controller, in turn, sends a signal to the processor. The processor detects this signal and interrupts its current execution to handle the interrupt.
The processor can then notify the operating system that an interrupt has occurred, and the operating system can handle the interrupt appropriately.
Interrupt handlers in Linux need not be reentrant. When a given interrupt handler is executing, the corresponding interrupt line is masked out on all processors, preventing another interrupt on the same line from being received. Normally all other interrupts are enabled, so other interrupts are serviced, but the current line is always disabled.

Comparison between interrupts and exceptions-

Exceptions occur synchronously with respect to the processor clock, while interrupts occur asynchronously.
That is why exceptions are often called synchronous interrupts.
Exceptions are produced by the processor while executing instructions, either in response to a programming error (for example, divide by zero) or abnormal conditions that must be handled by the kernel (for example, a page fault).
Many processor architectures handle exceptions in a similar manner to interrupts; therefore, the kernel infrastructure for handling the two is similar.
Exceptions are of two types: traps and software interrupts.
A trap is a kind of exception whose main purpose is debugging (e.g., notifying the debugger that an instruction has been reached); it can also occur during abnormal conditions.
A software interrupt occurs at the request of the programmer, e.g., a system call.

Interrupt Handler

These are the C functions that get executed when an interrupt arrives.
Each interrupt is associated with a particular interrupt handler.
An interrupt handler is also known as an interrupt service routine (ISR).
Since interrupts can arrive at any time, interrupt handlers have to be short and quick.
At a minimum, the interrupt handler has to acknowledge the hardware; the rest of the work can be done at a later time.

Top Halves Versus Bottom Halves

An interrupt handler has two goals: 1. execute quickly and 2. perform a large amount of work.
Because of these conflicting goals, the processing of interrupts is split into two parts, or halves.
The interrupt handler is the top half.
It is run immediately upon receipt of the interrupt and performs only the work that is time critical, such as acknowledging receipt of the interrupt or resetting the hardware.
Work that can be performed later is delayed until the bottom half.
The bottom half runs in the future, at a more convenient time, with all interrupts enabled.
Consider a case where we need to collect data from a data card and then process it.
The most important job is to copy the data from the data card into memory and free the card for incoming data; this is done in the top half.
The rest, which deals with processing the data, is done in the bottom half.
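The data card example can be sketched as a small user-space model. This is only an illustration of the split, not kernel code: the names demo_card, demo_driver, demo_top_half() and demo_bottom_half() are made up for this post, and the "processing" is just a checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CARD_BUF_SIZE 64

/* Hypothetical data card: a small on-board buffer the driver must drain. */
struct demo_card {
    unsigned char hw_buf[CARD_BUF_SIZE];
    size_t hw_len;            /* bytes waiting on the card */
};

/* Driver state shared between the two halves. */
struct demo_driver {
    unsigned char mem_buf[CARD_BUF_SIZE];
    size_t mem_len;           /* bytes copied to memory, not yet processed */
    int bottom_half_pending;  /* set by the top half, cleared by the bottom half */
    unsigned long checksum;   /* result of the deferred processing */
};

/* Top half: time critical. Copy the data off the card and free the card. */
static void demo_top_half(struct demo_card *card, struct demo_driver *drv)
{
    assert(card->hw_len <= CARD_BUF_SIZE);
    memcpy(drv->mem_buf, card->hw_buf, card->hw_len);
    drv->mem_len = card->hw_len;
    card->hw_len = 0;               /* card is free for incoming data */
    drv->bottom_half_pending = 1;   /* defer the rest */
}

/* Bottom half: runs later and does the slow processing. */
static void demo_bottom_half(struct demo_driver *drv)
{
    size_t i;

    if (!drv->bottom_half_pending)
        return;
    for (i = 0; i < drv->mem_len; i++)
        drv->checksum += drv->mem_buf[i];
    drv->mem_len = 0;
    drv->bottom_half_pending = 0;
}
```

The point of the split is visible in the state: after demo_top_half() the card is empty again while the checksum is still untouched, and only demo_bottom_half() does the expensive part.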

Registering an Interrupt Handler

Interrupt handlers are the responsibility of the driver managing the hardware. Each device has one associated driver and, if that device uses interrupts (and most do), then that driver registers one interrupt handler.
Drivers can register an interrupt handler and enable a given interrupt line for handling via the function

/* request_irq: allocate a given interrupt line */
int request_irq(unsigned int irq,
                irqreturn_t (*handler)(int, void *, struct pt_regs *),
                unsigned long irqflags,
                const char *devname,
                void *dev_id);

The first parameter, irq, specifies the interrupt number to allocate. For some devices, for example legacy PC devices such as the system timer or keyboard, this value is typically hard-coded. For most other devices, it is probed or otherwise determined programmatically and dynamically.
The second parameter, handler, is a function pointer to the actual interrupt handler that services this interrupt. This function is invoked whenever the operating system receives the interrupt. Note the specific prototype of the handler function: It takes three parameters and has a return value of irqreturn_t.
The third parameter, irqflags, might be either zero or a bit mask of one or more of the following flags:

SA_INTERRUPT This flag specifies that the given interrupt handler is a fast interrupt handler. Fast interrupt handlers run with all interrupts disabled on the local processor. This enables a fast handler to complete quickly, without possible interruption from other interrupts. By default (without this flag), all interrupts are enabled except the interrupt lines of any running handlers, which are masked out on all processors. Sans the timer interrupt, most interrupts do not want to enable this flag.
SA_SAMPLE_RANDOM This flag specifies that interrupts generated by this device should contribute to the kernel entropy pool. The kernel entropy pool provides truly random numbers derived from various random events. If this flag is specified, the timing of interrupts from this device is fed to the pool as entropy. Do not set this if your device issues interrupts at a predictable rate (for example, the system timer) or can be influenced by external attackers (for example, a networking device). On the other hand, most other hardware generates interrupts at nondeterministic times and is, therefore, a good source of entropy.
SA_SHIRQ This flag specifies that the interrupt line can be shared among multiple interrupt handlers. Each handler registered on a given line must specify this flag; otherwise, only one handler can exist per line. More information on shared handlers is provided in a following section.
The fourth parameter, devname, is an ASCII text representation of the device associated with the interrupt. For example, this value for the keyboard interrupt on a PC is "keyboard". These text names are used by /proc/irq and /proc/interrupts for communication with the user, which is discussed shortly.
The fifth parameter, dev_id, is used primarily for shared interrupt lines. When an interrupt handler is freed (discussed later), dev_id provides a unique cookie to allow the removal of only the desired interrupt handler from the interrupt line. Without this parameter, it would be impossible for the kernel to know which handler to remove on a given interrupt line. You can pass NULL here if the line is not shared, but you must pass a unique cookie if your interrupt line is shared (and unless your device is old and crusty and lives on the ISA bus, there is good chance it must support sharing). This pointer is also passed into the interrupt handler on each invocation. A common practice is to pass the driver's device structure: This pointer is unique and might be useful to have within the handlers and the Device Model.
On success, request_irq() returns zero. A nonzero value indicates error, in which case the specified interrupt handler was not registered. A common error is -EBUSY, which denotes that the given interrupt line is already in use (and either the current user or you did not specify SA_SHIRQ).
Note that request_irq() can sleep and therefore cannot be called from interrupt context or other situations where code cannot block.
On registration, an entry corresponding to the interrupt is created in /proc/irq.
The function proc_mkdir() is used to create new procfs entries. This function calls proc_create() to set up the new procfs entries, which in turn calls kmalloc() to allocate memory. Because kmalloc() can sleep, request_irq() can sleep as well.
In a driver, requesting an interrupt line and installing a handler is done via request_irq():

if (request_irq(irqn, my_interrupt, SA_SHIRQ, "my_device", dev)) {
        printk(KERN_ERR "my_device: cannot register IRQ %d\n", irqn);
        return -EIO;
}

In this example, irqn is the requested interrupt line, my_interrupt is the handler, the line can be shared, the device is named "my_device," and we passed dev for dev_id. On failure, the code prints an error and returns. If the call returns zero, the handler has been successfully installed. From that point forward, the handler is invoked in response to an interrupt. It is important to initialize hardware and register an interrupt handler in the proper order to prevent the interrupt handler from running before the device is fully initialized.
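To make the sharing rule behind -EBUSY concrete, here is a minimal user-space model of the registration bookkeeping. It is a sketch under stated assumptions, not the kernel's implementation: demo_request_irq(), DEMO_SHARED (standing in for SA_SHIRQ), the fixed-size table and the -EINVAL/-ENOMEM cases are all inventions of this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_IRQS   16
#define DEMO_MAX_SHARE 4
#define DEMO_SHARED    0x1UL   /* hypothetical stand-in for SA_SHIRQ */

typedef int (*demo_handler_t)(int irq, void *dev_id);

struct demo_action {
    demo_handler_t handler;
    unsigned long flags;
    void *dev_id;
};

struct demo_line {
    struct demo_action actions[DEMO_MAX_SHARE];
    int count;
};

static struct demo_line demo_lines[DEMO_NR_IRQS];

/* Placeholder handler so there is something to register. */
static int demo_handler(int irq, void *dev_id)
{
    (void)irq;
    (void)dev_id;
    return 1;
}

/* Returns 0 on success or a negative errno value on failure. */
static int demo_request_irq(unsigned int irq, demo_handler_t handler,
                            unsigned long flags, void *dev_id)
{
    struct demo_line *line;
    struct demo_action *slot;

    if (irq >= DEMO_NR_IRQS || handler == NULL)
        return -EINVAL;
    /* Model choice: a shared handler must bring a unique cookie. */
    if ((flags & DEMO_SHARED) && dev_id == NULL)
        return -EINVAL;

    line = &demo_lines[irq];
    assert(line->count >= 0 && line->count <= DEMO_MAX_SHARE);
    if (line->count > 0) {
        /* Line in use: both the current user and the caller must share. */
        if (!(line->actions[0].flags & flags & DEMO_SHARED))
            return -EBUSY;
        if (line->count == DEMO_MAX_SHARE)
            return -ENOMEM;   /* capacity limit of this model only */
    }
    slot = &line->actions[line->count++];
    slot->handler = handler;
    slot->flags = flags;
    slot->dev_id = dev_id;
    return 0;
}
```

In this model, two callers that both pass DEMO_SHARED can sit on the same line, while a second registration fails with -EBUSY as soon as either side did not ask for sharing.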

Freeing an Interrupt Handler

When we unregister a device driver, we must also free the interrupt handler that we registered for the device.
This is done with free_irq(), which releases the interrupt line:

void free_irq(unsigned int irq, void *dev_id);

If the specified interrupt line is not shared, this function removes the handler and disables the line. If the interrupt line is shared, the handler identified via dev_id is removed, but the interrupt line itself is disabled only when the last handler is removed.

A call to free_irq() must be made from process context.
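The role of dev_id as a removal cookie can be sketched in a few lines of user-space C. This is an illustrative model only; demo_line, demo_add() and demo_free_irq() are hypothetical names, and a real kernel keeps a linked list of actions rather than a fixed array.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SHARE 4

/* One interrupt line with the dev_id cookies of its registered handlers. */
struct demo_line {
    void *dev_ids[DEMO_MAX_SHARE];
    int count;
    int enabled;
};

/* Model-only helper: add a handler's cookie and enable the line. */
static int demo_add(struct demo_line *line, void *dev_id)
{
    if (line->count == DEMO_MAX_SHARE)
        return -1;
    line->dev_ids[line->count++] = dev_id;
    line->enabled = 1;
    return 0;
}

/* Remove only the handler whose cookie matches; disable the line when empty. */
static void demo_free_irq(struct demo_line *line, void *dev_id)
{
    int i;

    for (i = 0; i < line->count; i++) {
        if (line->dev_ids[i] != dev_id)
            continue;
        /* Close the gap left by the removed entry. */
        for (; i < line->count - 1; i++)
            line->dev_ids[i] = line->dev_ids[i + 1];
        line->count--;
        break;
    }
    assert(line->count >= 0);
    if (line->count == 0)
        line->enabled = 0;
}
```

With two handlers on the line, freeing the first leaves the line enabled for the second; the line is only disabled when the last cookie is removed.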

Interrupt handling concepts

Each CPU core has only one interrupt line coming to it from the interrupt controller, which itself has n interrupt lines.
While a core is executing the handler for an interrupt, that interrupt is said to be in the active state. If the same interrupt arrives again immediately, the new occurrence is marked pending and the interrupt line is in the active-pending state. The pending occurrence is handled only after the active one is cleared.
Suppose we are in such an active-pending situation and a higher-priority interrupt arrives. The lower-priority handler is temporarily interrupted and the higher-priority one runs; when it is done, the lower-priority handler resumes. The lower-priority interrupt remains active while interrupted because its context is still on the stack.

On an SMP architecture, an Advanced Programmable Interrupt Controller (APIC) is used to route interrupts from peripherals to the CPUs.

The APIC routes each interrupt based on:
1. the routing table,
2. the priority of the interrupt,
3. the load on the CPUs (a busier CPU is given fewer interrupts).

Consider the case of the same interrupt being received on an SMP system. Each core has its own APIC, and there is one more APIC for the external interrupt interface.

For example, suppose an interrupt is received on IRQ line 10. It goes through the external APIC and is routed to a particular CPU's APIC, say CPU0. The interrupt line is masked until the ISR has finished, which means we will not get another interrupt of the same type while the ISR is executing; a new occurrence is put in the pending state (only one can be pending).

Only once the ISR has finished is the interrupt line unmasked for future interrupts.
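The active and pending states described in this section can be written down as a tiny state machine. This is a simplified sketch for a single line, assuming exactly one occurrence can be remembered while the handler runs; the enum and function names are made up for illustration.

```c
#include <assert.h>

enum demo_irq_state { DEMO_INACTIVE, DEMO_ACTIVE, DEMO_ACTIVE_PENDING };

/* A new occurrence of the interrupt arrives on the line. */
static enum demo_irq_state demo_raise(enum demo_irq_state s)
{
    switch (s) {
    case DEMO_INACTIVE:       return DEMO_ACTIVE;          /* handler starts */
    case DEMO_ACTIVE:         return DEMO_ACTIVE_PENDING;  /* remember one */
    case DEMO_ACTIVE_PENDING: return DEMO_ACTIVE_PENDING;  /* extras merge */
    }
    return s;
}

/* The running handler finishes. */
static enum demo_irq_state demo_complete(enum demo_irq_state s)
{
    switch (s) {
    case DEMO_ACTIVE_PENDING: return DEMO_ACTIVE;    /* pending one now runs */
    case DEMO_ACTIVE:         return DEMO_INACTIVE;
    case DEMO_INACTIVE:       return DEMO_INACTIVE;
    }
    return s;
}
```

Note that raising the interrupt a third time while it is already active-pending changes nothing, which is the "only one pending" behaviour mentioned above.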

How to write an interrupt handler?

The declaration of interrupt handler:

static irqreturn_t intr_handler(int irq, void *dev_id, struct pt_regs *regs)

The first parameter, irq, is the numeric value (10, 11, 30, etc.) of the interrupt line the handler is supposed to service.

The second parameter, dev_id, is a generic pointer to the same dev_id that was given to request_irq() when the interrupt handler was registered. This value should be unique if it is intended for shared interrupt as it can act as a cookie to differentiate between multiple devices using the same interrupt handler.

The final parameter, regs, holds a pointer to a structure containing the processor registers and state before servicing the interrupt.

The return value of an interrupt handler is the special type irqreturn_t. An interrupt handler can return two special values, IRQ_NONE or IRQ_HANDLED, depending on whether it serviced the interrupt. The former is returned when the interrupt handler detects an interrupt for which its device was not the originator. The latter is returned if the interrupt handler was correctly invoked, and its device did indeed cause the interrupt. Alternatively, IRQ_RETVAL(val) may be used. If val is non-zero, this macro returns IRQ_HANDLED. Otherwise, the macro returns IRQ_NONE. These special values are used to let the kernel know whether devices are issuing spurious (that is, unrequested) interrupts. If all the interrupt handlers on a given interrupt line return IRQ_NONE, then the kernel can detect the problem. Note the curious return type, irqreturn_t, which is simply an int.

The interrupt handler is normally marked static because it is never called directly from another file.

The role of the interrupt handler depends entirely on the device and its reasons for issuing the interrupt. At a minimum, most interrupt handlers need to provide acknowledgment to the device that they received the interrupt. Devices that are more complex need to additionally send and receive data and perform extended work in the interrupt handler. As mentioned, the extended work is pushed as much as possible into the bottom half handler, which I have discussed in another post on Bottom Halves.

Shared Handlers

A shared handler is registered and executed much like a non-shared handler. There are three main differences:

The SA_SHIRQ flag must be set in the flags argument to request_irq().
The dev_id argument must be unique to each registered handler. A pointer to any per-device structure is sufficient; a common choice is the device structure as it is both unique and potentially useful to the handler. You cannot pass NULL for a shared handler!
The interrupt handler must be capable of distinguishing whether its device actually generated an interrupt. This requires both hardware support and associated logic in the interrupt handler. If the hardware did not offer this capability, there would be no way for the interrupt handler to know whether its associated device or some other device sharing the line caused the interrupt.
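The three points above can be sketched in user space. In this hypothetical model each device has an irq_pending field standing in for a hardware status register, the handler uses dev_id to find its own device, and the dispatcher runs every handler on the line. DEMO_IRQ_NONE and DEMO_IRQ_HANDLED are stand-ins for the kernel's values, not the real definitions.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_IRQ_NONE = 0, DEMO_IRQ_HANDLED = 1 };

/* Hypothetical device with a status bit that says "I raised the interrupt". */
struct demo_dev {
    int irq_pending;
    int serviced;
};

struct demo_action {
    int (*handler)(int irq, void *dev_id);
    void *dev_id;
    struct demo_action *next;
};

/* Shared handler: dev_id tells it which device instance to check. */
static int demo_shared_handler(int irq, void *dev_id)
{
    struct demo_dev *dev = dev_id;

    (void)irq;
    if (!dev->irq_pending)
        return DEMO_IRQ_NONE;      /* some other device on the line */
    dev->irq_pending = 0;          /* acknowledge the device */
    dev->serviced++;
    return DEMO_IRQ_HANDLED;
}

/* Run every handler on the line and OR the results together. */
static int demo_dispatch(int irq, struct demo_action *action)
{
    int retval = DEMO_IRQ_NONE;

    assert(action != NULL);
    do {
        retval |= action->handler(irq, action->dev_id);
        action = action->next;
    } while (action);
    return retval;
}
```

If every handler on the line answers DEMO_IRQ_NONE, the combined result is DEMO_IRQ_NONE as well, which is how a spurious interrupt would show up in this model.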

Interrupt Context

When executing an interrupt handler or bottom half, the kernel is in interrupt context. (Process context is the mode of operation the kernel is in while it is executing on behalf of a process, for example, executing a system call or running a kernel thread.)
In process context, the current macro points to the associated task. Furthermore, because a process is coupled to the kernel in process context, process context can sleep or otherwise invoke the scheduler.
Interrupt context is not associated with a process. The current macro is not relevant (although it points to the interrupted process). Without a backing process, interrupt context cannot sleep; how would it ever reschedule? Therefore, you cannot call certain functions from interrupt context. If a function sleeps, you cannot use it from your interrupt handler; this limits the functions that one can call from an interrupt handler.
Interrupt context is time critical because the interrupt handler interrupts other code. Code should be quick and simple. Busy looping is discouraged. This is a very important point; always keep in mind that your interrupt handler has interrupted other code (possibly even another interrupt handler on a different line!).
Because of this asynchronous nature, it is imperative that all interrupt handlers be as quick and as simple as possible. As much as possible, work should be pushed out from the interrupt handler and performed in a bottom half, which runs at a more convenient time.
The setup of an interrupt handler's stacks is a configuration option. Historically, interrupt handlers did not receive their own stacks. Instead, they would share the stack of the process that they interrupted.
The kernel stack is two pages in size; typically, that is 8KB on 32-bit architectures and 16KB on 64-bit architectures. Because in this setup interrupt handlers share the stack, they must be exceptionally economical with what data they allocate there. Of course, the kernel stack is limited to begin with, so all kernel code should be cautious.

Implementation of Interrupt Handling

The implementation of interrupt handling is architecture dependent and hardware dependent.

Path Of interrupt

A device issues an interrupt by sending an electric signal over its bus to the interrupt controller.
If the interrupt line is enabled (they can be masked out), the interrupt controller sends the interrupt to the processor.
In most architectures, this is accomplished by an electrical signal that is sent over a special pin to the processor. Unless interrupts are disabled in the processor (which can also happen), the processor immediately stops what it is doing, disables the interrupt system, and jumps to a predefined location in memory and executes the code located there. This predefined point is set up by the kernel and is the entry point for interrupt handlers.
The interrupt's journey in the kernel begins at this predefined entry point, just as system calls enter the kernel through a predefined exception handler.
For each interrupt line, the processor jumps to a unique location in memory and executes the code located there.
In this manner, the kernel knows the IRQ number of the incoming interrupt. The initial entry point simply saves this value and stores the current register values (which belong to the interrupted task) on the stack; then the kernel calls do_IRQ(). From here onward, most of the interrupt handling code is written in C; however, it is still architecture dependent.
The do_IRQ() function is declared as

unsigned int do_IRQ(struct pt_regs regs)

Because the C calling convention places function arguments at the top of the stack, the pt_regs structure contains the initial register values that were previously saved in the assembly entry routine. Because the interrupt value was also saved, do_IRQ() can extract it.
The x86 code is int irq = regs.orig_eax & 0xff;
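A small user-space sketch shows what that masking does. It assumes, for illustration only, that the entry stub saved the IRQ number minus 256 in orig_eax (so that the saved value is negative); that detail is an assumption of this sketch and is not stated above. The struct here is a cut-down stand-in, not the kernel's struct pt_regs.

```c
#include <assert.h>

/* Simplified stand-in for the saved register frame; one field only. */
struct demo_pt_regs {
    long orig_eax;
};

/* Assumption: the entry stub stored (irq - 256), a negative value. */
static long demo_entry_value(int irq)
{
    assert(irq >= 0 && irq < 256);
    return (long)irq - 256;
}

/* Same masking as the line above: keep only the low eight bits. */
static int demo_irq_from_regs(struct demo_pt_regs regs)
{
    return (int)(regs.orig_eax & 0xff);
}
```

Under that assumption, masking with 0xff recovers the original line number for any IRQ from 0 to 255.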
After the interrupt line is calculated, do_IRQ() acknowledges the receipt of the interrupt and disables interrupt delivery on the line. On normal PC machines, these operations are handled by mask_and_ack_8259A(), which do_IRQ() calls.
Next, do_IRQ() ensures that a valid handler is registered on the line, and that it is enabled and not currently executing. If so, it calls handle_IRQ_event() to run the installed interrupt handlers for the line. On x86, handle_IRQ_event() is

asmlinkage int handle_IRQ_event(unsigned int irq, struct pt_regs *regs,
                                struct irqaction *action)
{
        int status = 1;
        int retval = 0;

        if (!(action->flags & SA_INTERRUPT))
                local_irq_enable();

        do {
                status |= action->flags;
                retval |= action->handler(irq, action->dev_id, regs);
                action = action->next;
        } while (action);

        if (status & SA_SAMPLE_RANDOM)
                add_interrupt_randomness(irq);

        local_irq_disable();

        return retval;
}

First, because the processor disabled interrupts, they are turned back on unless SA_INTERRUPT was specified during the handler's registration. Recall that SA_INTERRUPT specifies that the handler must be run with interrupts disabled. Next, each potential handler is executed in a loop.
If this line is not shared, the loop terminates after the first iteration. Otherwise, all handlers are executed. After that, add_interrupt_randomness() is called if SA_SAMPLE_RANDOM was specified during registration.
This function uses the timing of the interrupt to generate entropy for the random number generator.
Finally, interrupts are again disabled (do_IRQ() expects them still to be off) and the function returns. Back in do_IRQ(), the function cleans up and returns to the initial entry point, which then jumps to ret_from_intr().
The routine ret_from_intr() is, as with the initial entry code, written in assembly. This routine checks whether a reschedule is pending (this implies that need_resched is set).
If a reschedule is pending, and the kernel is returning to user-space (that is, the interrupt interrupted a user process), schedule() is called. If the kernel is returning to kernel-space (that is, the interrupt interrupted the kernel itself), schedule() is called only if the preempt_count is zero (otherwise it is not safe to preempt the kernel). After schedule() returns, or if there is no work pending, the initial registers are restored and the kernel resumes whatever was interrupted.
On x86, the initial assembly routines are located in arch/i386/kernel/entry.S and the C methods are located in arch/i386/kernel/irq.c. Other supported architectures are similar.
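The reschedule decision made on the way out of an interrupt boils down to three inputs. Here is that rule as a plain C predicate; it is a paraphrase of the description above, and demo_should_schedule() is a made-up name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether schedule() would be called when leaving an interrupt. */
static bool demo_should_schedule(bool need_resched, bool returning_to_user,
                                 int preempt_count)
{
    assert(preempt_count >= 0);
    if (!need_resched)
        return false;             /* no reschedule pending */
    if (returning_to_user)
        return true;              /* interrupted a user process */
    return preempt_count == 0;    /* kernel is preemptible only at zero */
}
```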

`/proc/interrupts`

Procfs is a virtual filesystem that exists only in kernel memory and is typically mounted at /proc.
Reading or writing files in procfs invokes kernel functions that simulate reading or writing from a real file.
A relevant example is the /proc/interrupts file, which is populated with statistics related to interrupts on the system. Here is sample output from a uniprocessor PC:

      CPU0  
 0:   3602371   XT-PIC   timer
 1:   3048      XT-PIC   i8042
 2:   0         XT-PIC   cascade
 4:   2689466   XT-PIC   uhci-hcd, eth0
 5:   0         XT-PIC   EMU10K1
 12:  85077     XT-PIC   uhci-hcd
 15:  24571     XT-PIC   aic7xxx
NMI:  0 
LOC:  3602236 
ERR:  0

The first column is the interrupt line. On this system, interrupts numbered 0 through 2, 4, 5, 12, and 15 are present. Handlers are not installed on lines not displayed.
The second column is a counter of the number of interrupts received. A column is present for each processor on the system, but this machine has only one processor.
As you can see, the timer interrupt has received 3,602,371 interrupts, whereas the sound card (EMU10K1) has received none (which is an indication that it has not been used since the machine booted).
The third column is the interrupt controller handling this interrupt. XT-PIC corresponds to the standard PC programmable interrupt controller. On systems with an I/O APIC, most interrupts would list IO-APIC-level or IO-APIC-edge as their interrupt controller.
Finally, the last column is the device associated with this interrupt. This name is supplied by the devname parameter to request_irq(), as discussed previously. If the interrupt is shared, as is the case with interrupt number four in this example, all the devices registered on the interrupt line are listed.
procfs code is located primarily in fs/proc. The function that provides /proc/interrupts is, not surprisingly, architecture dependent and named show_interrupts().
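From user space, the four columns are easy to pull apart. The following is a minimal sketch that parses one row in the uniprocessor format shown in the sample output; it assumes a single counter column, so it would need adjusting for a machine with several CPUs. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct demo_irq_stat {
    int irq;                 /* first column: interrupt line */
    unsigned long count;     /* second column: interrupts received */
    char controller[32];     /* third column: interrupt controller */
    char devices[64];        /* last column: registered device names */
};

/* Parse one uniprocessor-style row; returns 0 on success, -1 otherwise. */
static int demo_parse_row(const char *row, struct demo_irq_stat *out)
{
    int n;

    assert(row != NULL && out != NULL);
    n = sscanf(row, " %d: %lu %31s %63[^\n]",
               &out->irq, &out->count, out->controller, out->devices);
    return n == 4 ? 0 : -1;
}
```

Rows such as NMI, LOC and ERR do not start with a number, so this parser simply rejects them.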

Saturday, February 22, 2014

System Calls

In this post we will mainly discuss what system calls are, why we need them, and how to implement them.

What is a system call?

To understand this, we first ask ourselves what the OS (read: kernel) needs to do:

Process Management (starting, running, stopping processes)
File Management (creating, opening, closing, reading, writing, renaming files)
Memory Management (allocating, deallocating memory)
Other stuff (timing, scheduling, network management).

So, a system call is an interface through which user-space applications request the kernel to perform the operations listed above.

An example would be user space requesting to open a device (hardware).

In short, we can say that system calls are an interface between user-space processes and hardware.

Why do we need system call?

It provides an abstraction to the user-space process. E.g., for the user, an open call just means opening the device; the user doesn't need to care about the intricacies of the call.
It maintains system security and stability, as the kernel checks the validity of the call before servicing it.
It helps in virtualization of the system for processes, i.e., various processes can use it independently.

System call interface and C library.

The system call interface in Linux, as with most Unix systems, is provided in part by the C library.

We will see how a system call works using the example of a printf() call in user space.

Syscalls

System calls (syscalls in Linux) are accessed via function calls. System calls take inputs and also provide a return value (of type long) that signifies success or error (0 generally means success, and a negative value usually denotes an error).
System calls have a defined behavior.

For example, the system call getpid() is defined to return an integer that is the current process's PID.

The implementation of this syscall in the kernel is very simple:


asmlinkage long sys_getpid(void)
{
        return current->tgid;
}


Some important observations from this:


By convention, the kernel implementation of a system call is prefixed with sys_: getpid() is implemented as sys_getpid().
The asmlinkage modifier tells the compiler that the function should not expect to find any of its arguments in registers (a common optimization), but only on the CPU's stack.

In Linux, each system call is assigned a syscall number. This is a unique number that is used to reference a specific system call.
Once a syscall number is assigned, it cannot be changed or recycled.
System calls in Linux are faster than in many other operating systems, partly because of Linux's fast context switch times.
The kernel keeps track of all the registered system calls in the table sys_call_table, which is defined in entry.S (an assembler file) in arch/arch-name/kernel/.
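Syscall numbers are visible from user space too. As a small sketch, assuming a Linux system with glibc, the getpid system call can be invoked by its number through syscall() and the SYS_getpid constant from <sys/syscall.h>, bypassing the usual getpid() wrapper. demo_raw_getpid() is just a name chosen for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid by its syscall number instead of the C library wrapper. */
static long demo_raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both paths end up in the same kernel function, so demo_raw_getpid() and getpid() should report the same PID.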

System Call Handler

Since the system call code lives in the kernel, the processor must switch to kernel mode to execute a system call.
This is done by issuing a software interrupt.
In this mechanism an exception is raised, and the processor switches to kernel mode and executes the system call handler.
The defined software interrupt on x86 is the int $0x80 instruction. On ARM, the corresponding exception vector is at offset 0x08 from the exception vector base (0x00000000 or 0xFFFF0000).
On x86, int $0x80 triggers a switch to kernel mode and the execution of exception vector 128, which is the system call handler.
The system call handler function is system_call().
It is architecture dependent and typically implemented in assembly in entry.S.
User space first places the system call number in the eax register (x86) and then causes the trap.
The kernel reads the value of the eax register and calls the corresponding system call.
The system_call() function checks the validity of the given system call number by comparing it to NR_syscalls.
If it is larger than or equal to NR_syscalls, the function returns -ENOSYS. Otherwise, the specified system call is invoked:

call *sys_call_table(,%eax,4)

Because each element in the system call table is 32 bits (four bytes), the kernel multiplies the given system call number by four to arrive at its location in the system call table.
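The same check-then-index logic can be modelled in C with an array of function pointers. This is a user-space sketch only: the table contents, the return values 1234 and 1000, and all demo_ names are made up, and array indexing in C performs the "multiply by the element size" step that the assembly does explicitly.

```c
#include <assert.h>
#include <errno.h>

typedef long (*demo_syscall_fn)(void);

static long demo_sys_getpid(void) { return 1234; }   /* made-up result */
static long demo_sys_getuid(void) { return 1000; }   /* made-up result */

/* Hypothetical table: the index is the syscall number. */
static const demo_syscall_fn demo_sys_call_table[] = {
    demo_sys_getpid,   /* 0 */
    demo_sys_getuid,   /* 1 */
};

#define DEMO_NR_SYSCALLS \
    (sizeof(demo_sys_call_table) / sizeof(demo_sys_call_table[0]))

/* Validate the number, then make an indirect call through the table. */
static long demo_system_call(unsigned long nr)
{
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;
    return demo_sys_call_table[nr]();
}
```

Because the number is checked before it is used as an index, an out-of-range request never reaches the table and simply gets -ENOSYS back.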

The system call is invoked with some parameters, generally up to five, whose values are stored in the registers ebx, ecx, edx, esi, and edi.
In the unlikely case of six or more parameters, a single register is used to hold a pointer to the user-space memory where all the parameters are stored.
The return value is also passed in a register (eax on x86).

How to implement system calls?

Adding a system call is an easy task. But it is the implementation that has to be done carefully.

Now we will look at the steps used to implement a system call.

First we must define its purpose. What is the use of this system call? The syscall should have exactly one purpose.

Next, we must define system call's arguments, return value, and error codes.

The system call should have a clean and simple interface with the smallest number of arguments possible.

Final Steps in Binding a System Call

First, add an entry to the end of the system call table.
For each architecture supported, the syscall number needs to be defined in <asm/unistd.h>.
The syscall needs to be compiled into the kernel image.

How system call verifies parameters(arguments)?

System calls must carefully verify all their parameters, including access permissions, to ensure that they are valid and legal.
The system call runs in kernel-space, and if the user is able to pass invalid input into the kernel without restraint, the system's security and stability can suffer; in short, the kernel can be hacked!
For example, for file I/O syscalls, the syscall must check whether the file descriptor is valid. Process-related functions must check whether the provided PID is valid. Every parameter must be checked to ensure it is not just valid and legal, but correct.
One of the most important checks is the validity of any pointers that the user provides. Imagine if a process could pass any pointer into the kernel, unchecked, with warts and all, even passing a pointer for which it did not have read access! Processes could then trick the kernel into copying data for which they did not have access permission, such as data belonging to another process. Before following a pointer into user-space, the system must ensure that:

The pointer points to a region of memory in user-space. Processes must not be able to trick the kernel into reading data in kernel-space on their behalf.

The pointer points to a region of memory in the process's address space. The process must not be able to trick the kernel into reading someone else's data.

If reading, the memory is marked readable. If writing, the memory is marked writable. The process must not be able to bypass memory access restrictions.

The kernel provides two methods for performing the requisite checks and the desired copy to and from user-space:

For writing into user-space, the method copy_to_user(destination address in user-space, source pointer in kernel-space, size of the data to copy in bytes) is provided.
For reading from user-space, the method copy_from_user(destination pointer in kernel-space, source address in user-space, number of bytes to read from the second parameter into the first) is used.

Both of these functions return the number of bytes they failed to copy on error. On success, they return zero. It is standard for the syscall to return -EFAULT in the case of such an error.
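The "bytes not copied" convention and the -EFAULT mapping can be illustrated with a user-space model. Here a static array plays the role of the process's user-space memory, and demo_access_ok(), demo_copy_from_user() and demo_sys_read_value() are hypothetical names for this sketch; the real kernel functions check page tables and permissions, which this model does not attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "user-space" region for this model. */
static unsigned char demo_user_mem[256];

/* Check that [addr, addr + len) lies entirely inside the user region. */
static int demo_access_ok(const void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)demo_user_mem;
    uintptr_t p = (uintptr_t)addr;

    if (p < start || p - start > sizeof(demo_user_mem))
        return 0;
    return len <= sizeof(demo_user_mem) - (p - start);
}

/* Like copy_from_user(): returns the number of bytes NOT copied. */
static size_t demo_copy_from_user(void *to, const void *from, size_t n)
{
    if (!demo_access_ok(from, n))
        return n;
    memcpy(to, from, n);
    return 0;
}

/* A syscall-style wrapper turning a failed copy into -EFAULT. */
static long demo_sys_read_value(const void *user_ptr, int *out)
{
    if (demo_copy_from_user(out, user_ptr, sizeof(*out)))
        return -EFAULT;
    return 0;
}
```

A pointer outside the region, or one that starts inside but would run past the end, is rejected before any data is touched, and the caller sees -EFAULT.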
A final possible check is for valid permission. A call to capable() with a valid capabilities flag returns nonzero if the caller holds the specified capability and zero otherwise. For example, capable(CAP_SYS_NICE) checks whether the caller has the ability to modify nice values of other processes.
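
The permission check can be pictured with a small user-space model. Everything here is an assumption made for illustration: struct model_task, the MODEL_CAP_* numbers, model_capable() and model_renice_other() are hypothetical, and the real kernel tracks capabilities quite differently. The sketch only shows the shape of the check: test the capability first and return -EPERM if the caller lacks it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability numbers, valid for this model only. */
enum model_cap {
    MODEL_CAP_SYS_NICE = 0,
    MODEL_CAP_SYS_ADMIN = 1,
    MODEL_CAP_COUNT
};

/* Hypothetical, heavily simplified task description. */
struct model_task {
    uint32_t caps; /* one bit per capability held */
    int nice;      /* current nice value */
};

/* Like capable(): nonzero if the task holds the capability, else zero. */
static int model_capable(const struct model_task *t, enum model_cap cap)
{
    assert(t != NULL);
    assert((unsigned int)cap < MODEL_CAP_COUNT);
    return (t->caps & (UINT32_C(1) << cap)) != 0;
}

/* Change another task's nice value, but only with the right capability. */
static long model_renice_other(const struct model_task *caller,
                               struct model_task *target, int nice)
{
    if (!model_capable(caller, MODEL_CAP_SYS_NICE))
        return -EPERM;

    target->nice = nice;
    return 0;
}
```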

Sunday, February 9, 2014

Linux Memory Management

PAGES:

Physical pages are the basic unit of memory management for the kernel.
The MMU (memory management unit) manages memory in units of pages.
The page size depends on the architecture: most 32-bit architectures use 4KB pages, while 64-bit architectures vary (x86-64 uses 4KB by default, whereas Alpha uses 8KB).
The kernel stores information about each physical page in a struct page.
This structure is defined in <linux/mm.h> in the older kernels this post follows (newer kernels define it in <linux/mm_types.h>).

struct page {
        page_flags_t          flags;
        atomic_t              _count;
        atomic_t              _mapcount;
        unsigned long         private;
        struct address_space  *mapping;
        pgoff_t               index;
        struct list_head      lru;
        void                  *virtual;
};

Some important fields are-

The flags field stores the status of the page. Such flags include whether the page is dirty (it has been modified) or whether it is locked in memory. Each flag is a single bit, so at least 32 different flags are available. The flag values are defined in <linux/page-flags.h>.
The _count field stores the usage count of the page, that is, how many references to the physical page exist. When the count shows that no references remain, no one is using the page and it becomes available for a new allocation. Kernel code should check this with page_count() rather than reading the field directly.
The virtual field is the page's address in virtual memory. For high memory (on 32-bit x86, physical memory above 896MB), which is not permanently mapped into the kernel's address space, this field is NULL.

The goal of this data structure is to describe the physical pages and not the data contained in that page.

ZONES:

On 32-bit x86, the kernel divides physical memory into three zones: ZONE_DMA (<16MB), ZONE_NORMAL (16MB-896MB) and ZONE_HIGHMEM (>896MB).
The kernel groups pages with similar properties into separate zones.
Zones have no physical relevance; they are just a logical grouping.
Each zone is represented by struct zone, which is defined in <linux/mmzone.h>.
For more details on zones, read my other post on linux addressing.

GETTING PAGES:

The kernel provides several interfaces to allocate and free memory within kernel space.
All these interfaces allocate memory with page-sized granularity and are declared in <linux/gfp.h>.
We can allocate either physically contiguous memory or memory that is only virtually contiguous.
One should never attempt to allocate memory for userspace from the kernel - this is a huge violation of the kernel's abstraction layering.
Instead, either have userspace mmap pages owned by your driver directly into its address space, or have userspace ask the kernel how much space is needed, allocate the buffer itself, and then get the data from the kernel.
There is no way to allocate physically contiguous memory from userspace in Linux.
This is because a userspace program has no way of controlling, or even knowing, whether the underlying memory is contiguous.
The core function is struct page * alloc_pages(unsigned int gfp_mask, unsigned int order).
This allocates 2^order (that is, 1 << order) physically contiguous pages and returns a pointer to the first page's page structure; on error it returns NULL.
To convert a given page to its logical address, we can use the function void * page_address(struct page *page).
This function returns a pointer to the logical address where the given physical page currently resides.
If we just need the address of the pages (and not the page structure), we can use the function unsigned long __get_free_pages(unsigned int gfp_mask, unsigned int order).
This function is built on the core function alloc_pages(), but it directly gives us the logical address of the first page.
Because the pages are physically contiguous, the pages that follow are contiguous in the logical address space as well.
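
Since these functions take an order rather than a byte count, callers usually have to convert a size into the smallest sufficient order. Below is a minimal user-space sketch of that arithmetic, assuming a 4KB page size. MODEL_PAGE_SHIFT, MODEL_PAGE_SIZE and model_size_to_order() are hypothetical names used only here; the kernel has its own get_order() helper for this job.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4KB pages */
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)

/*
 * Return the smallest order such that (MODEL_PAGE_SIZE << order) >= size.
 * An order of n stands for 2^n physically contiguous pages.
 */
static unsigned int model_size_to_order(size_t size)
{
    unsigned int order = 0;
    size_t pages;

    assert(size > 0);

    /* Round the byte count up to whole pages. */
    pages = (size + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;

    /* Round the page count up to the next power of two. */
    while (((size_t)1 << order) < pages)
        order++;

    return order;
}
```

For example, a 20000-byte buffer needs five 4KB pages, so the sketch yields order 3 (eight pages). The three unused pages are the price of the power-of-two granularity.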
If we just need a single page (order 0), we have two functions, one returning the page structure and the other the logical address:

struct page * alloc_page(unsigned int gfp_mask)

unsigned long __get_free_page(unsigned int gfp_mask)

If we need a page filled with zeros, we can use the function unsigned long get_zeroed_page(unsigned int gfp_mask). This matters for security: a freshly allocated page may still hold data written there previously, so a page that will be passed to user space should be zeroed first, otherwise user space would get access to those old contents.
This function works the same as __get_free_page(), except that the allocated page is then zero-filled.
To free the pages we have these functions:

void __free_pages(struct page *page, unsigned int order)

void free_pages(unsigned long addr, unsigned int order)

void free_page(unsigned long addr)

Page allocation may fail, so we must always check the return value and handle the failure.

kmalloc()

The kmalloc() function's operation is very similar to that of user-space's familiar malloc() routine, with the exception of the addition of a flags parameter.
It is used when we want to allocate a small chunk of memory with byte-sized granularity.
For larger allocations, the page allocation functions described above are a good option.
In the kernel, kmalloc() is the function we use for most memory allocations.
The function is declared in <linux/slab.h>
void * kmalloc(size_t size, int flags)
The function returns a pointer to a region of memory that is at least size bytes in length.
The region of memory allocated is physically contiguous.
On error, it returns NULL.
Kernel allocations almost always succeed, unless there is an insufficient amount of memory available.
Still we must check for NULL after all calls to kmalloc() and handle the error appropriately.

eg.   struct abc *ptr;

ptr = kmalloc(sizeof(struct abc), GFP_KERNEL);
if (!ptr)
        /* handle error ... */

The GFP_KERNEL flag specifies the behavior of the memory allocator while trying to obtain the memory to return to the caller of kmalloc().

`gfp_mask` Flags

In this section we will discuss the flags that we use with kmalloc() and the low-level page functions.

The flags are broken up into three categories:

action modifiers
zone modifiers
types.

Action modifiers specify how the kernel is supposed to allocate the requested memory.
In certain situations, only certain methods can be employed to allocate memory.
For example, interrupt handlers must instruct the kernel not to sleep (because interrupt handlers cannot reschedule) in the course of allocating memory.
Zone modifiers specify from where to allocate memory.
As we saw in the article on linux addressing (http://learnlinuxconcepts.blogspot.in/2014/02/linux-addressing.html) the kernel divides physical memory into multiple zones, each of which serves a different purpose.
Zone modifiers specify from which of these zones to allocate.
Type flags specify a combination of action and zone modifiers as needed by a certain type of memory allocation.
Type flags simplify specifying numerous modifiers; instead, we generally specify just one type flag.
All the flags are declared in <linux/gfp.h>.
The file <linux/slab.h> includes this header, however, so we often need not include it directly.

Action modifiers-

Flag	Description
`__GFP_WAIT`	The allocator can sleep.
`__GFP_HIGH`	The allocator can access emergency pools.
`__GFP_IO`	The allocator can start disk I/O.
`__GFP_FS`	The allocator can start filesystem I/O.
`__GFP_COLD`	The allocator should use cache cold pages.
`__GFP_NOWARN`	The allocator will not print failure warnings.
`__GFP_REPEAT`	The allocator will repeat the allocation if it fails, but the allocation can potentially fail.
`__GFP_NOFAIL`	The allocator will indefinitely repeat the allocation. The allocation cannot fail.
`__GFP_NORETRY`	The allocator will never retry if the allocation fails.
`__GFP_NO_GROW`	Used internally by the slab layer.
`__GFP_COMP`	Add compound page metadata. Used internally by the `hugetlb` code.

These modifiers can be specified together. For example, ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);
Let's see how this allocation will work.
It instructs the page allocator (the call ultimately ends up in alloc_pages(), as we saw before) that the allocation can:

block
perform I/O
perform filesystem operations, if needed.

This allows the kernel great freedom in how it can find the free memory to satisfy the allocation.

Zone Modifier-

Zone modifiers specify from which memory zone the allocation should originate.
Normally, allocations can be fulfilled from any zone.
The kernel prefers ZONE_NORMAL, however, to ensure that the other zones have free pages when they are needed.
There are only two zone modifiers because there are only two zones other than ZONE_NORMAL (which is where, by default, allocations originate).

Flag	Description
`__GFP_DMA`	Allocate only from `ZONE_DMA`
`__GFP_HIGHMEM`	Allocate from `ZONE_HIGHMEM` or `ZONE_NORMAL`

If none of the flags are specified, the kernel fulfills the allocation from either ZONE_DMA or ZONE_NORMAL, with a strong preference to satisfy the allocation from ZONE_NORMAL.
We cannot specify __GFP_HIGHMEM to either __get_free_pages() or kmalloc(), because both return a logical address and not a page structure.
With __GFP_HIGHMEM, these functions could allocate memory that is not currently mapped in the kernel's virtual address space and thus does not have a logical address.
Only alloc_pages() can allocate high memory.
For the majority of our allocations, however, we don't need to specify a zone modifier because ZONE_NORMAL is sufficient.

Type Flags-

The type flags specify the required action and zone modifiers to fulfill a particular type of transaction.
The good news is that kernel code can therefore use the appropriate type flag instead of spelling out the many individual flags it would otherwise need.

Flag	Description
`GFP_ATOMIC`	The allocation is high priority and must not sleep. This is the flag to use in interrupt handlers, in bottom halves, while holding a spinlock, and in other situations where we cannot sleep.
`GFP_NOIO`	This allocation can block, but must not initiate disk I/O. This is the flag to use in block I/O code when we cannot cause more disk I/O, which might lead to some unpleasant recursion.
`GFP_NOFS`	This allocation can block and can initiate disk I/O, if it must, but will not initiate a filesystem operation. This is the flag to use in filesystem code when we cannot start another filesystem operation.
`GFP_KERNEL`	This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep. The kernel will do whatever it has to in order to obtain the memory requested by the caller. This flag should be our first choice.
`GFP_USER`	This is a normal allocation and might block. This flag is used to allocate memory for user-space processes.
`GFP_HIGHUSER`	This is an allocation from `ZONE_HIGHMEM` and might block. This flag is used to allocate memory for user-space processes.
`GFP_DMA`	This is an allocation from `ZONE_DMA`. Device drivers that need DMA-able memory use this flag, usually in combination with one of the above.

Which modifier flags does each type flag include?

Flag	Modifier Flags
`GFP_ATOMIC`	`__GFP_HIGH`
`GFP_NOIO`	`__GFP_WAIT`
`GFP_NOFS`	`(__GFP_WAIT | __GFP_IO)`
`GFP_KERNEL`	`(__GFP_WAIT | __GFP_IO | __GFP_FS)`
`GFP_USER`	`(__GFP_WAIT | __GFP_IO | __GFP_FS)`
`GFP_HIGHUSER`	`(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)`
`GFP_DMA`	`__GFP_DMA`
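
The composition in the table above can be sketched as plain bit masks. This is a user-space model with made-up bit values: the M_GFP_* macros and the model_gfp_*() helpers are hypothetical, and the real definitions in <linux/gfp.h> use different values that change between kernel versions. The point is only to show how a type flag is an OR of modifiers and how individual permissions can be tested.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative bit values only; the real ones live in <linux/gfp.h>
 * and differ between kernel versions.
 */
#define M_GFP_DMA     0x01u
#define M_GFP_HIGHMEM 0x02u
#define M_GFP_WAIT    0x10u
#define M_GFP_HIGH    0x20u
#define M_GFP_IO      0x40u
#define M_GFP_FS      0x80u

/* Type flags built from the modifiers, following the table above. */
#define M_GFP_ATOMIC   (M_GFP_HIGH)
#define M_GFP_NOIO     (M_GFP_WAIT)
#define M_GFP_NOFS     (M_GFP_WAIT | M_GFP_IO)
#define M_GFP_KERNEL   (M_GFP_WAIT | M_GFP_IO | M_GFP_FS)
#define M_GFP_USER     (M_GFP_WAIT | M_GFP_IO | M_GFP_FS)
#define M_GFP_HIGHUSER (M_GFP_WAIT | M_GFP_IO | M_GFP_FS | M_GFP_HIGHMEM)

/* GFP_KERNEL is GFP_NOFS plus permission to enter the filesystem. */
static_assert(M_GFP_KERNEL == (M_GFP_NOFS | M_GFP_FS),
              "GFP_KERNEL should add filesystem access to GFP_NOFS");

/* May the allocator put the caller to sleep? */
static bool model_gfp_may_sleep(unsigned int mask)
{
    return (mask & M_GFP_WAIT) != 0;
}

/* May the allocator start disk I/O? */
static bool model_gfp_may_start_io(unsigned int mask)
{
    return (mask & M_GFP_IO) != 0;
}

/* May the allocator start filesystem operations? */
static bool model_gfp_may_enter_fs(unsigned int mask)
{
    return (mask & M_GFP_FS) != 0;
}
```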

Let's try to understand the important type flags.

GFP_KERNEL flag-

The vast majority of allocations in the kernel use the GFP_KERNEL flag.
The resulting allocation is of normal priority and may sleep.
Because the call can block, this flag can be used only from process context that can safely reschedule (that is, no locks are held and so on).
Because this flag does not make any stipulations as to how the kernel may obtain the requested memory, the memory allocation has a high probability of succeeding.

GFP_ATOMIC flag-

The GFP_ATOMIC flag is at the opposite extreme from the GFP_KERNEL flag.
Because this flag specifies a memory allocation that cannot sleep, the allocation is very restrictive in the memory it can obtain for the caller.
If no sufficiently sized contiguous chunk of memory is available, the kernel is not very likely to free memory because it cannot put the caller to sleep.
Conversely, the GFP_KERNEL allocation can put the caller to sleep to swap inactive pages to disk, flush dirty pages to disk, and so on.
Because GFP_ATOMIC is unable to perform any of these actions, it has less of a chance of succeeding (at least when memory is low) compared to GFP_KERNEL allocations.
Still, the GFP_ATOMIC flag is the only option when the current code is unable to sleep, such as with interrupt handlers, softirqs, and tasklets.

GFP_NOIO and GFP_NOFS flags-

In between these two flags are GFP_NOIO and GFP_NOFS.
Allocations initiated with these flags might block, but they refrain from performing certain other operations.
A GFP_NOIO allocation does not initiate any disk I/O whatsoever to fulfill the request.
On the other hand, GFP_NOFS might initiate disk I/O, but does not initiate filesystem I/O.
One question immediately comes to mind: why might we need these flags?
They are needed for certain low-level block I/O or filesystem code, respectively.
Imagine if a common path in the filesystem code allocated memory without the GFP_NOFS flag. The allocation could result in more filesystem operations, which would then beget other allocations and, thus, more filesystem operations! This could continue indefinitely.
Code such as this that invokes the allocator must ensure that the allocator also does not execute it, or else the allocation can create a deadlock.
Not surprisingly, the kernel uses these two flags in only a few places.

GFP_DMA flag-

The GFP_DMA flag is used to specify that the allocator must satisfy the request from ZONE_DMA.
This flag is used by device drivers, which need DMA-able memory for their devices. Normally, we combine this flag with the GFP_ATOMIC or GFP_KERNEL flag.

Which flag to use when?

Situation	Solution
Process context, can sleep	Use `GFP_KERNEL`
Process context, cannot sleep	Use `GFP_ATOMIC`, or perform your allocations with `GFP_KERNEL` at an earlier or later point when you can sleep
Interrupt handler	Use `GFP_ATOMIC`
Softirq	Use `GFP_ATOMIC`
Tasklet	Use `GFP_ATOMIC`
Need DMA-able memory, can sleep	Use `(GFP_DMA | GFP_KERNEL)`
Need DMA-able memory, cannot sleep	Use `(GFP_DMA | GFP_ATOMIC)`, or perform your allocation at an earlier point when you can sleep
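
The table boils down to two questions: can the current code sleep, and is DMA-able memory needed? A small user-space sketch of that decision follows. The enums and model_pick_gfp() are hypothetical names for illustration only, and treating "process context while holding a spinlock" as the one process-context case that cannot sleep is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of where the allocating code runs. */
enum model_ctx {
    MODEL_CTX_PROCESS,
    MODEL_CTX_INTERRUPT,
    MODEL_CTX_SOFTIRQ,
    MODEL_CTX_TASKLET
};

/* Hypothetical result: which flag combination the table recommends. */
enum model_gfp_choice {
    MODEL_USE_GFP_KERNEL,
    MODEL_USE_GFP_ATOMIC,
    MODEL_USE_GFP_DMA_KERNEL, /* GFP_DMA | GFP_KERNEL */
    MODEL_USE_GFP_DMA_ATOMIC  /* GFP_DMA | GFP_ATOMIC */
};

static enum model_gfp_choice model_pick_gfp(enum model_ctx ctx,
                                            bool holding_spinlock,
                                            bool need_dma)
{
    bool can_sleep;

    assert((unsigned int)ctx <= MODEL_CTX_TASKLET);

    /* Only process context that holds no spinlock may sleep. */
    can_sleep = (ctx == MODEL_CTX_PROCESS) && !holding_spinlock;

    if (need_dma)
        return can_sleep ? MODEL_USE_GFP_DMA_KERNEL
                         : MODEL_USE_GFP_DMA_ATOMIC;

    return can_sleep ? MODEL_USE_GFP_KERNEL : MODEL_USE_GFP_ATOMIC;
}
```

The sketch does not capture the other option the table mentions for code that cannot sleep: moving the allocation to an earlier or later point where GFP_KERNEL is safe.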

`kfree()`

kfree() undoes the work done by kmalloc().
This function is declared in <linux/slab.h>.
void kfree(const void *ptr)
Use it only on blocks of memory that were previously allocated using kmalloc().

eg. char *buffer;

buffer = kmalloc(BUF_SIZE, GFP_ATOMIC);
if (!buffer)
        /* error allocating memory ! */

Later, when we no longer need the memory, we must free it.

kfree(buffer);

vmalloc()

This kernel function is similar to the user-space function malloc().
Both vmalloc() and malloc() return memory that is virtually contiguous but not necessarily physically contiguous.
In the kernel we normally use kmalloc() and seldom use vmalloc().
vmalloc() is used when the requested size is quite large, because kmalloc() may fail to find a large enough block of physically contiguous memory.

The vmalloc() function is declared in <linux/vmalloc.h> and defined in mm/vmalloc.c.

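Usage is identical to user-space's malloc(): void * vmalloc(unsigned long size).

Using vmalloc() also costs some performance: because the pages are not physically contiguous, they must be mapped individually through page table entries, which puts more pressure on the TLB than directly mapped memory does.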


To free an allocation obtained via vmalloc(), we use
void vfree(void *addr).
