Wednesday, April 13, 2016

Wait Queues in Kernel

When one or more processes/threads need to sleep until some event occurs, the kernel provides wait queues to handle the situation. A wait queue is a higher-level mechanism implemented inside the Linux kernel for putting processes/threads to sleep and waking them up.

Important Points :-

  • A Wait queue for an event is a list of nodes
  • Each node points to the thread/process waiting for that particular event. 
  • An individual node in the list is called a wait queue entry.
  • On the occurrence of the event, one or more processes on the list are woken.
  • After waking up, the processes remove themselves from the list.
A wait queue is defined and initialised in the following way :-
  • wait_queue_head_t my_event;
  • init_waitqueue_head(&my_event);

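The same can also be achieved with this macro:
DECLARE_WAIT_QUEUE_HEAD(my_event);

Any thread/process that wants to wait for my_event can use either of the following options:
  • wait_event(my_event, (event_present == 1) ); //Uninterruptible sleep
  • wait_event_interruptible(my_event, (event_present == 1) ); //Interruptible sleep; returns -ERESTARTSYS if interrupted by a signal

Note that these are macros that take the wait queue head itself, not a pointer to it. The second argument is the condition to wait for: the caller sleeps until the condition evaluates to true, and the condition is re-checked each time the queue is woken.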


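How to wake up threads/processes that are sleeping on a wait queue:
  • wake_up(&my_event);: wakes up all non-exclusive waiters on the wait queue, plus at most one exclusive waiter.
  • wake_up_all(&my_event);: wakes up all the processes on the wait queue, including every exclusive waiter.
  • wake_up_interruptible(&my_event);: same as wake_up(), but only wakes processes that are in interruptible sleep.
Each woken process re-evaluates its condition and goes back to sleep if it is still false, so the condition must be set before the wake-up function is called.
Below is an example from a kernel driver (drivers/mmc/core/core.c):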

/*
 * Start a new MMC custom command request for a host.
 * If there is an ongoing async request, wait for completion
 * of that request and start the new one and return.
 * Does not wait for the new request to complete.
 */
static int mmc_wait_for_data_req_done(struct mmc_host *host,
                                      struct mmc_request *mrq,
                                      struct mmc_async_req *next_req)
{
        while (1) {
                wait_event_interruptible(context_info->wait,
                                (context_info->is_done_rcv ||
                                 context_info->is_new_req));
                .........
                .........
                .........
        }
}


/*
 * Wakes up mmc context, passed as a callback to host controller driver
 */
static void mmc_wait_data_done(struct mmc_request *mrq)
{
        mrq->host->context_info.is_done_rcv = true;
        wake_up_interruptible(&mrq->host->context_info.wait);
}
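
The wait/wake pattern used by the driver above can be tried out in userspace. The following is only an analogy, assuming POSIX threads: a mutex plus condition variable stands in for the wait queue head, event_wait() plays the role of wait_event(), and event_signal() plays the role of "set the condition, then wake_up_all()". All names here (struct event_wq, event_wait, event_signal, run_wait_queue_demo) are made up for illustration and are not kernel APIs.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a wait queue head plus the condition it guards. */
struct event_wq {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            event_present;
};

static void event_wq_init(struct event_wq *wq)
{
    pthread_mutex_init(&wq->lock, NULL);
    pthread_cond_init(&wq->cond, NULL);
    wq->event_present = false;
}

/* Rough analogue of wait_event(): sleep until the condition is true. */
static void event_wait(struct event_wq *wq)
{
    pthread_mutex_lock(&wq->lock);
    while (!wq->event_present)          /* re-check after every wake-up */
        pthread_cond_wait(&wq->cond, &wq->lock);
    pthread_mutex_unlock(&wq->lock);
}

/* Rough analogue of setting the condition and calling wake_up_all(). */
static void event_signal(struct event_wq *wq)
{
    pthread_mutex_lock(&wq->lock);
    wq->event_present = true;           /* set the condition first ... */
    pthread_cond_broadcast(&wq->cond);  /* ... then wake every waiter  */
    pthread_mutex_unlock(&wq->lock);
}

static void *waiter_thread(void *arg)
{
    event_wait(arg);
    return NULL;
}

/* Starts one waiter, signals the event, returns 1 once the waiter is done. */
static int run_wait_queue_demo(void)
{
    struct event_wq wq;
    pthread_t tid;

    event_wq_init(&wq);
    if (pthread_create(&tid, NULL, waiter_thread, &wq) != 0)
        return 0;
    event_signal(&wq);
    if (pthread_join(tid, NULL) != 0)
        return 0;
    return wq.event_present ? 1 : 0;
}
```

Because the waiter checks the condition while holding the lock, the demo finishes whether the waiter starts before or after event_signal() runs, which is the same property the condition argument gives wait_event().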




Tuesday, April 12, 2016

The Classic Lost Wake up in Linux Kernel

The lost wake-up problem arises out of a race condition that occurs while a thread/process goes to conditional sleep. It is a classic problem in operating systems.

Consider two threads/processes, A and B.
  • Thread/process A is the consumer: it processes nodes from a list.
  • Thread/process B is the producer: it adds nodes to this list.
  • When the list is empty, thread/process A sleeps.
  • Thread/process B wakes A up when it appends anything to the list.
The code looks like this:

 Process/Thread A (Processing the List):

1  spin_lock(&list_lock);
2  if(list_empty(&list_head)) {
3      spin_unlock(&list_lock);  
4      set_current_state(TASK_INTERRUPTIBLE);  
5      schedule();
6      spin_lock(&list_lock);
7  }
8
9  /* Rest of the code ... */
10 spin_unlock(&list_lock);


Process/Thread B( Adding to the List ):

100  spin_lock(&list_lock);
101  list_add_tail(new_node, &list_head);
102  spin_unlock(&list_lock);
103  wake_up_process(processa_task);

This code has one problem, described below.

STEP1 : It may happen that after thread/process A executes line 3, but before it executes line 4, thread/process B is scheduled on another processor. In other words, A has found the list empty and dropped the lock, but has not yet set its state to TASK_INTERRUPTIBLE when B starts running.

STEP2 : In this timeslice, thread/process B executes all its instructions, 100 through 103. Thus, it performs a wake-up on thread/process A, which has not yet gone to sleep. Since A is still in the TASK_RUNNING state, the wake-up has no effect.

STEP3 : Now, thread/process A sets its state to TASK_INTERRUPTIBLE and calls schedule(), so it goes to sleep.

Thus, the wake-up from thread/process B is lost. This is known as the lost wake-up problem: thread/process A sleeps even though there are nodes available on the list.

SOLUTION to the problem :- We need to rewrite thread/process A so that it doesn't miss any wake-up.

Process A:

1  set_current_state(TASK_INTERRUPTIBLE);
2  spin_lock(&list_lock);
3  if(list_empty(&list_head)) {
4         spin_unlock(&list_lock);
5         schedule();
6         spin_lock(&list_lock);
7  }
8  set_current_state(TASK_RUNNING);
9
10 /* Rest of the code ... */
11 spin_unlock(&list_lock);

How did this solve the problem?

1. Thread/process A now sets its state to TASK_INTERRUPTIBLE (line 1) before it checks the list. If thread/process B calls wake_up_process() at any point after that, including the window between line 4 and the call to schedule() on line 5, A is put back into the TASK_RUNNING state. Hence the wake-up from B is not missed.


2. If a wake-up is delivered by thread/process B at any point after the check for list_empty is made, the state of thread/process A is changed back to TASK_RUNNING. Hence, the call to schedule() does not put thread/process A to sleep; it merely schedules it out for a while. If the list was not empty in the first place, line 8 restores TASK_RUNNING before A carries on.
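
The two orderings can be compared with a small single-threaded model. This is a userspace sketch, not kernel code: it assumes that schedule() only puts the caller to sleep when its state is not TASK_RUNNING and that a wake-up simply forces the state back to TASK_RUNNING, and it runs producer B at the worst possible moment in each version. The struct and function names are invented for the illustration.

```c
#include <stdbool.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct model {
    enum task_state a_state;  /* state of consumer A */
    int  list_len;            /* number of nodes on the shared list */
    bool a_asleep;            /* true once A has really been descheduled */
};

/* Producer B: add a node, then wake A (force its state to TASK_RUNNING). */
static void producer_b(struct model *m)
{
    m->list_len++;
    m->a_state = TASK_RUNNING;
    m->a_asleep = false;
}

/* Modelled schedule(): the caller only sleeps if it is not TASK_RUNNING. */
static void model_schedule(struct model *m)
{
    if (m->a_state != TASK_RUNNING)
        m->a_asleep = true;
}

/* Buggy ordering: check list, (B runs here), set state, schedule. */
static bool buggy_consumer_sleeps_with_work(void)
{
    struct model m = { TASK_RUNNING, 0, false };

    if (m.list_len == 0) {
        producer_b(&m);                  /* B runs inside the race window */
        m.a_state = TASK_INTERRUPTIBLE;  /* overwrites B's wake-up */
        model_schedule(&m);
    }
    return m.a_asleep && m.list_len > 0;
}

/* Fixed ordering: set state, check list, (B runs here), schedule. */
static bool fixed_consumer_sleeps_with_work(void)
{
    struct model m = { TASK_RUNNING, 0, false };

    m.a_state = TASK_INTERRUPTIBLE;
    if (m.list_len == 0) {
        producer_b(&m);                  /* B's wake-up resets the state */
        model_schedule(&m);              /* state is RUNNING: no sleep */
    }
    m.a_state = TASK_RUNNING;
    return m.a_asleep && m.list_len > 0;
}
```

In this model the buggy ordering ends with A asleep while the list holds a node, and the fixed ordering does not.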


Monday, April 4, 2016

Sleeping in the Kernel - Part 1

We will discuss here a simple way of putting a thread to sleep and waking it up in the kernel.

In the Linux kernel, there are scenarios where a thread has nothing to do until something else happens. In such a case, the thread should be allowed to sleep for as long as possible and should not be scheduled unnecessarily. The thread sleeps while it waits for an event (asynchronous or synchronous). In other words, a process/thread is allowed to voluntarily relinquish the CPU.

One classic scenario is a thread that sleeps until an interrupt occurs, then goes back onto the run queue and does its work.

In Linux, the ready-to-run processes are maintained on a run queue. A ready-to-run process has the state TASK_RUNNING. Once the time-slice of a running process is over, the Linux scheduler picks another appropriate process from the run queue and allocates the CPU to that process.

Two task states are important for a discussion of sleeping :-

TASK_RUNNING - A process/thread that is either running or ready to run on the run queue.

TASK_INTERRUPTIBLE - The process/task is suspended and waiting for some event to occur. It can be woken either explicitly by the code or by the delivery of a signal.

The other task states are not covered in this post.

Sleep scenario with kernel code

At times, threads want to wait until a certain event occurs, such as a device to initialise, I/O to complete or a timer to expire. In such a case, the thread is said to sleep on that event. A thread can go to sleep using the schedule() function. The following code puts the executing thread to sleep:

sleeping_task = current;
set_current_state(TASK_INTERRUPTIBLE);
schedule();
func1();
/* The rest of the code */

sleeping_task stores a pointer to the task structure of the current task so that it can be woken up later on an interrupt, timer expiration or any similar event.

set_current_state() takes the new task state as its argument and switches the current task to that state. On its own it only changes the state; the task keeps running until it calls schedule().

When the schedule() function is called with the state set to TASK_INTERRUPTIBLE, the currently executing process is moved off the run queue before another process is scheduled. The effect is that the executing process goes to sleep, as it is no longer on the run queue. Hence, the scheduler does not run it again until it is woken up. That is how a process can sleep.

By calling schedule(), the thread tells the scheduler that it can run some other process on the processor.

Let's wake it up now. Given a reference to a task structure, the thread can be woken up by calling:
wake_up_process(sleeping_task);








Sunday, April 3, 2016

Introduction to Kthreads in Linux Kernel

Threads are programming abstractions used in concurrent processing. A kernel thread is a way to implement background tasks inside the kernel. A background task can be busy handling asynchronous events or can be asleep, waiting for an event to occur. Kernel threads are similar to user processes, except that they live in kernel space and have access to kernel functions and data structures. Like user processes, kernel threads appear to monopolize the processor because of preemptive scheduling.

To see these threads, run ps -ef from the command line and note that all of the processes shown in [square brackets] at the beginning of the listing are kernel threads.





Examples of kthreads

(A). [ksoftirqd/n] is a kthread that supports the implementation of softirqs in the kernel, where 'n' represents the core on SMP systems. Soft IRQs are raised by interrupt handlers to request “bottom half” processing of portions of the interrupt handler whose execution can be deferred. The idea is to minimize the code inside interrupt handlers, which results in reduced interrupt-off times in the system, thus resulting in lower latencies. ksoftirqd ensures that a high load of soft IRQs neither starves the soft IRQs nor overwhelms the system. On Symmetric Multi-Processing (SMP) machines, where multiple thread instances can run on different processors in parallel, one instance of ksoftirqd is created per processor to improve throughput.

(B). The [events/n] threads (where n is the processor number) help implement work queues, which are another way of deferring work in the kernel. If a part of the kernel wants to defer execution of work, it can either create its own work queue or make use of the default events/worker thread.

(C). The [pdflush] kernel thread flushes dirty pages from the page cache. 

(D). The [khubd] thread, part of the Linux USB core, monitors the machine’s USB hub and configures USB devices when they are hot-plugged into the system.


Example from Kernel Source

File :- kernel/time/clocksource.c

static int clocksource_watchdog_kthread(void *data)
{
        mutex_lock(&clocksource_mutex);
        if (__clocksource_watchdog_kthread())
                clocksource_select();
        mutex_unlock(&clocksource_mutex);
        return 0;
}

/* Returns non-zero when a new clocksource should be selected */
static int __clocksource_watchdog_kthread(void)
{
--- CODE -----
}

Mutex examples from Linux Kernel

(A). The ADC read function of the TWL6030 (power-management integrated circuit) device driver uses a mutex to create a critical section. The same mutex is taken everywhere the ADC is accessed, so all of those functions are serialised against each other : -

int twl6030_gpadc_read_raw( )
{
      Mutex Lock ;

      Start_Conversion of GPADC channel;

      /* Waiting for IRQ to complete with a given timeout */
      wait_for_completion_interruptible_timeout();
   
     ----  CODE  ---

      Mutex UnLock ;
}


static irqreturn_t twl6030_gpadc_irq_handler(int irq, void *indio_dev)
{

       ----- CODE ----

       /* Triggers the completion of the IRQ */
complete(irq_complete);

}

(B). Device tree for clocks - The OF clock data structures use a mutex to keep updates to the provider list consistent. One example is the function below, which deletes a node while holding the mutex. Since the same mutex is taken everywhere the list is accessed, nobody else can walk the list during the deletion :-

/* Remove a previously registered clock provider */
of_clk_del_provider()
{
mutex_lock(&of_clk_mutex);

        Find the node and delete it;

mutex_unlock(&of_clk_mutex);
}



Friday, February 26, 2016

Introduction to Memory Barriers


A memory barrier instruction makes the CPU (and the compiler) complete the memory accesses issued before the barrier before it performs those issued after it. Barriers are used to ensure that the writes made by one piece of code are visible, in the intended order, before execution continues.

A CPU is free to perform memory accesses in an order different from program order. This can be a problem when multiple CPUs or I/O devices access the same memory. One remedy is to create a critical section using a spinlock, mutex or semaphore, but this approach has more overhead.

Linux uses memory barriers to guarantee ordered access to RAM where it matters. Reordering of memory accesses by the CPU becomes more problematic on a multi-core system. Memory barriers are only required where there's a possibility of interaction between two CPUs or between a CPU and a device (refer to the Abstract Memory Access Model figure). If it can be guaranteed that there won't be any such interaction in any particular piece of code, then memory barriers are not required.

Memory barriers impose a perceived partial ordering over the memory operations on either side of the barrier. Such enforcement is important because the CPUs and other devices in a system can use a variety of tricks to improve performance, including reordering, deferral and combination of memory operations; speculative loads; speculative branch prediction and various types of caching. Memory barriers are used to override or suppress these tricks, allowing the code to sanely control the interaction of multiple CPUs and/or devices.
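
As a rough userspace illustration of the ordering that barriers provide, here is a sketch using C11 atomics. It is an analogy rather than kernel code: the release store and acquire load below play roughly the role that smp_store_release() and smp_load_acquire() play in the kernel. The mailbox structure and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct mailbox {
    int         payload;  /* plain data, written before the flag */
    atomic_bool ready;    /* flag that publishes the payload */
};

/* Writer side: fill in the payload, then publish it. */
static void mailbox_publish(struct mailbox *mb, int value)
{
    mb->payload = value;
    /* Release: the payload write cannot be reordered after this store. */
    atomic_store_explicit(&mb->ready, true, memory_order_release);
}

/* Reader side: only touch the payload once the flag has been observed. */
static bool mailbox_try_consume(struct mailbox *mb, int *out)
{
    /* Acquire: the payload read cannot be reordered before this load. */
    if (!atomic_load_explicit(&mb->ready, memory_order_acquire))
        return false;
    *out = mb->payload;
    return true;
}

/* Single-threaded walk-through; returns the consumed value or -1. */
static int mailbox_demo(void)
{
    struct mailbox mb = { 0, false };
    int value = -1;

    if (mailbox_try_consume(&mb, &value))
        return -1;                /* nothing has been published yet */
    mailbox_publish(&mb, 42);
    if (!mailbox_try_consume(&mb, &value))
        return -1;
    return value;
}
```

Without the release/acquire pair, a reader on another CPU could observe ready as true and still read a stale payload, which is exactly the kind of reordering described above.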


[Figure: Abstract Memory Access Model]

It is important to note that the ordering guarantees of memory barriers do not apply to bit fields, because compilers often generate code that modifies them using non-atomic read-modify-write sequences. Hence bit fields must not be used to synchronise parallel code.

Even in cases where bitfields are protected by locks, all fields in a given bitfield must be protected by one lock. If two fields in a given bitfield are protected by different locks, the compiler's non-atomic read-modify-write sequences can cause an update to one field to corrupt the value of an adjacent field.
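
The corruption of an adjacent field can be reproduced deterministically with a small model. The sketch below assumes two 4-bit fields packed into one 32-bit word and spells out by hand the load/modify/store sequence a compiler typically generates for a bit-field assignment. The masks and function names are illustrative only.

```c
#include <stdint.h>

#define FIELD_A_MASK  0x0000000Fu   /* field a: bits 0-3 */
#define FIELD_B_MASK  0x000000F0u   /* field b: bits 4-7 */
#define FIELD_B_SHIFT 4

/* Two CPUs update different fields, but both load before either stores. */
static uint32_t interleaved_update(void)
{
    uint32_t word = 0;
    uint32_t snap_cpu0 = word;   /* CPU0 loads the word to update a */
    uint32_t snap_cpu1 = word;   /* CPU1 loads the word to update b */

    word = (snap_cpu0 & ~FIELD_A_MASK) | 0x3u;                     /* a = 3 */
    word = (snap_cpu1 & ~FIELD_B_MASK) | (0x5u << FIELD_B_SHIFT);  /* b = 5, wipes a */
    return word;
}

/* Same updates under a single lock: each load sees the previous store. */
static uint32_t serialized_update(void)
{
    uint32_t word = 0;
    uint32_t snap;

    snap = word;
    word = (snap & ~FIELD_A_MASK) | 0x3u;                          /* a = 3 */
    snap = word;
    word = (snap & ~FIELD_B_MASK) | (0x5u << FIELD_B_SHIFT);       /* b = 5 */
    return word;
}
```

In the interleaved case the second store is built from a stale snapshot, so the update to field a is lost even though each CPU only meant to touch its own field.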





Wednesday, February 24, 2016

Usage of Mutex, Semaphore and Spinlocks

This article is about the usage of semaphores, mutexes and spinlocks with reference to the Linux kernel.

Semaphore: Use a semaphore when you (a thread) want to sleep until some other thread tells you to wake up. Semaphore 'down' happens in one thread (the producer) and semaphore 'up' (for the same semaphore) happens in another thread (the consumer). e.g.: In the producer-consumer problem, the producer wants to sleep until at least one buffer slot is empty - only the consumer thread can tell when a buffer slot is empty.

Hence this is mainly used for thread synchronization where two threads depend on each other.

Mutex: Use a mutex when you (a thread) want to execute code that should not be executed by any other thread at the same time. Mutex 'down' (lock) happens in one thread and mutex 'up' (unlock) must happen in the same thread later on. This is called the 'ownership property'. e.g.: If you are deleting a node from a global linked list, you do not want another thread to muck around with pointers while you are deleting the node. When you acquire a mutex and are busy deleting a node, if another thread tries to acquire the same mutex, it will be put to sleep until you release the mutex.

It is also important to note that semaphores and mutexes make the thread sleep when blocked, hence they are never acquired in IRQ handlers. However, a semaphore can be released with up() from an IRQ handler. A mutex cannot, because it must be unlocked by the thread that locked it.

Spinlock: Use a spinlock when you really want to use a mutex but your thread is not allowed to sleep. e.g.: An interrupt handler within the OS kernel must never sleep. If it does, the system will freeze / crash. If you need to insert a node into a globally shared linked list from the interrupt handler, acquire a spinlock - insert node - release spinlock.

In other words, a spinlock is effectively a special type of semaphore that does not sleep and instead sits in a busy-wait loop. Linus himself agrees with this view.

 [ Refer : - http://yarchive.net/comp/linux/semaphores.html ]

It is important to note that while you hold a spinlock, interrupts might be disabled, hence the lock should be released as soon as possible. Since the usual practice is to make your critical sections as short as possible, the result is that the kernel uses a lot more spinlocks than semaphores.

Why is locking a semaphore/mutex in an IRQ handler illegal, or a strict no-no?

These locks put the caller to sleep when the lock is not available, and sleeping in an IRQ handler is not allowed for the following reasons.
  • An IRQ handler does not run in process context, so there is no task that the scheduler could put to sleep and resume later.
  • Interrupts are (at least partly) disabled while the IRQ handler runs.
  • If interrupts are disabled and the handler then sleeps on a mutex/semaphore, other threads that are waiting for other interrupts might get blocked, and hence a system freeze or watchdog reset might happen.
  • A spinlock is the right choice in IRQ handlers, as a CPU that is acquiring a spinlock never sleeps.
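
To make the busy-wait behaviour concrete, here is a toy spinlock written with C11 atomics. It is a userspace sketch only and is not how the kernel's spinlock_t is implemented. The names (toy_spinlock, toy_spin_lock and so on) are invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_spinlock {
    atomic_flag locked;
};

/* Spin until the flag was previously clear; the caller never sleeps. */
static void toy_spin_lock(struct toy_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* busy-wait */
}

/* One attempt only: returns true if the lock was taken. */
static bool toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Returns 3 when both checks behave as expected. */
static int toy_spinlock_demo(void)
{
    struct toy_spinlock l = { ATOMIC_FLAG_INIT };
    int score = 0;

    toy_spin_lock(&l);
    if (!toy_spin_trylock(&l))
        score += 1;              /* lock is held: second attempt fails */
    toy_spin_unlock(&l);

    if (toy_spin_trylock(&l))
        score += 2;              /* lock is free again: attempt succeeds */
    toy_spin_unlock(&l);
    return score;
}
```

A contended toy_spin_lock() burns CPU cycles instead of yielding, which is why spinlock critical sections have to stay short.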

Wednesday, January 27, 2016

Important Uboot Commands - Loady usage in Booting from RAM

The exact set of commands depends on the U-Boot configuration, but the important ones are listed here for reference. They are also quite helpful for debugging bootloader (U-Boot) issues :-

This post doesn't describe the usage of every command in detail; however, two important examples are covered.

  • help and help command
  • boot, runs the default boot command, stored in bootcmd
  • bootm / bootz  starts a kernel image loaded at the given address in RAM
  • ext2load, loads a file from an ext2 filesystem to RAM; also ext2ls to list files and ext2info for information
  • fatload, loads a file from a FAT filesystem to RAM and also fatls and fatinfo
  • tftp, loads a file from the network to RAM
  • ping, to test the network
  • loadb, loads, loady, load a file from the serial line to RAM
  • usb, to initialize and control the USB subsystem, mainly used for USB storage devices such as USB keys
  • mmc or mmc rescan , to initialize and control the MMC subsystem, used for SD and microSD cards
  • nand, to erase, read and write contents to NAND flash
  • erase, protect, cp, to erase, modify protection and write to NOR flash
  • md, displays memory contents. Can be useful to check the contents loaded in memory, or to look at hardware registers.
  • mm, modifies memory contents. Can be useful to modify directly hardware registers, for testing purposes.


Commands to manipulate environment variables:
  • printenv  [ Shows the value of a variable ]
  • setenv [ Changes the value of a variable, only in RAM ]
  • editenv [ Edits the value of a variable, only in RAM ]
  • saveenv  [ Saves the current state of the environment to flash ]

List of Important Environment variables :-

  • bootcmd, contains the command that U-Boot will automatically execute at boot time after a configurable delay(bootdelay), if the process is not interrupted.
  • bootargs, contains the arguments passed to the Linux kernel. 
  • serverip, the IP address of the server that U-Boot will contact for network related commands.
  • ipaddr, the IP address that U-Boot will use. 
  • netmask, the network mask to contact the server.
  • ethaddr, the MAC address, can only be set once. 
  • autostart, if yes, U-Boot starts automatically an image that has been loaded into memory.
  • filesize, the size of the latest copy to memory (from tftp, fatload, nand read...)


Example of Uboot commands to boot a Linux/Arm system from MMC (assuming that bootargs are not part of kernel configuration) :-

/* This command sets the bootargs */
setenv bootargs console=ttymxc1,115200 root=/dev/mmcblk0p7 ro rootwait rootfstype=ext4 earlyprintk no_console_suspend=1 consoleblank=0 init=/sbin/init log_buf_len=1M cma=128M mem=256M galcore.contiguousSize=67108864 lpj=3948544

Note that an important setting in bootargs is the size of RAM (mem=256M here).

The usual commands to boot into a normal kernel from an MMC device :-

/* Select the MMC device */
mmc dev 2  

/* Read from the MMC and place the Linux kernel uImage (assuming that the DTB is already part of the uImage) at address 0x10500000 in RAM */
mmc read 0x10500000 800 4000  

Note that 800 is the starting block number and 4000 is the number of blocks to read from the MMC; U-Boot interprets both as hexadecimal.

/* Jump to the image at address 0x10500000 and boot it */
bootm 0x10500000
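
Since the arguments to mmc read are hexadecimal block counts, it helps to convert them to bytes when checking that the image fits. The helper below is a small worked calculation, assuming the usual 512-byte MMC block size. The function name is made up for this example.

```c
#include <stdint.h>

#define MMC_BLOCK_SIZE 512u  /* assumption: standard 512-byte blocks */

/* Convert a block number or block count into a byte offset or size. */
static uint64_t mmc_blocks_to_bytes(uint32_t blocks)
{
    return (uint64_t)blocks * MMC_BLOCK_SIZE;
}
```

With these assumptions, block 0x800 starts 1 MiB into the device and 0x4000 blocks amount to 8 MiB, so the command above reads 8 MiB starting at the 1 MiB mark.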


Off Topic :- the uImage for U-Boot is created in the following two steps :-

1. Appending the DTB to the zImage (the kernel image) - This is optional, as the uImage may or may not contain the DTB.

2. Using the mkimage utility to create the uImage.

What if the DTB is not part of the uImage?

Let's have a look at the steps below, which load the DTB and the uImage from USB to particular addresses in RAM and then boot the system. The commands are run from the U-Boot prompt.

fatload usb 0 0x1070000 am335x-boneblack.dtb

fatload usb 0 0x1090000 uImage

bootm 0x1090000 - 0x1070000

Note that bootm takes the kernel image address first, then the initrd address ('-' when there is none), then the DTB address.

How to boot a Linux kernel image without flashing it into eMMC or NAND, by loading it directly into RAM from the U-Boot prompt :-

1. loady 0x10500000 921600

2. Switch the baud rate of Tera Term, or whichever console you are using, to 921600

3. Send kernel image over ymodem  : Go to File -> Transfer -> YMODEM -> Send -> browse and select the kernel image which you want to boot.

The pic below for reference.




4. Switch back to baud 115200 after the transfer is complete and press ESC to come back to the U-Boot prompt.
5. bootm 0x10500000  // To boot the kernel (assuming that the DTB is already part of the uImage)

Note that all U-Boot commands depend on the overall porting and configuration of U-Boot, which is beyond the scope of this blog post.

Monday, January 25, 2016

Linux Kernel Debug Configs

Some important debug options of the Linux kernel :-


CONFIG_DEBUG_SPINLOCK : With this option enabled, the kernel catches operations on uninitialized spinlocks and various other errors (such as unlocking a lock twice).

SPINLOCK_TIMEOUT and SPINLOCK_TIMEOUT_TIME: The spinlocks will time out after X number of seconds. This is useful for catching deadlocks, and for making sure locks are not held too long.

CONFIG_DEBUG_SPINLOCK_SLEEP (renamed CONFIG_DEBUG_ATOMIC_SLEEP in later kernels) : This option enables a check for attempts to sleep while holding a spinlock. In fact, it complains if you call a function that could potentially sleep, even if the call in question would not sleep.

CONFIG_DEBUG_SLAB: This crucial option turns on several types of checks in the kernel memory allocation functions; with these checks enabled, it is possible to detect a number of memory overrun and missing initialization errors.

CONFIG_INPUT_EVBUG : This option (under "Device drivers/Input device support") turns on verbose logging of input events. If you are working on a driver for an input device, this option may be helpful. Be aware of the security implications of this option, however: it logs everything you type, including your passwords.


Sunday, January 24, 2016

Linux Kernel Debugging using Ftrace - Basic Post

Ftrace is a tracing utility built directly into the Linux kernel. Many distributions already have various configurations of Ftrace enabled in their most recent releases. 

Note that this post doesn't cover all tracers in details.


Important Kernel Configs for Tracing and debugging :-

CONFIG_FUNCTION_TRACER
CONFIG_FUNCTION_DURATION_TRACER
CONFIG_FUNCTION_GRAPH_TRACER
CONFIG_STACK_TRACER
CONFIG_DYNAMIC_FTRACE

To find which tracers are available :-

If debugfs is not mounted:
mount -t debugfs nodev /sys/kernel/debug

Go to the tracing directory as below

[~]# cd /sys/kernel/debug/tracing 
[tracing]#   
[tracing]# cat available_tracers 
function_graph function sched_switch nop

Note that the kernel has a large list of tracers, and covering all of them is not possible in a single blog post. The links below describe them in detail.


http://lwn.net/Articles/365835/


https://lwn.net/Articles/366796/


I will surely come back with a blog post on debugging critical moments of Linux kernel using FTRACE. I am working toward that right now.


BTW, There is another important tip as below :-


Use Ftrace's trace_printk() instead of printk() :-

printk() is the king of all debuggers, but it has a problem. If you are debugging a high volume area such as the timer interrupt, the scheduler, or the network, printk() can lead to bogging down the system or can even create a live lock. It is also quite common to see a bug "disappear" when adding a few printk()s. This is due to the sheer overhead that printk() introduces.

Ftrace introduces a new form of printk() called trace_printk(). It can be used just like printk(), and can also be used in any context (interrupt code, NMI code, and scheduler code). What is nice about trace_printk() is that it does not output to the console. Instead it writes to the Ftrace ring buffer and can be read via the trace file.


Writing into the ring buffer with trace_printk() only takes around a tenth of a microsecond or so. But using printk(), especially when writing to the serial console, may take several milliseconds per write. The performance advantage of trace_printk() lets you record the most sensitive areas of the kernel with very little impact.


For example you can add something like this to the kernel or module:


    trace_printk("read foo %d out of bar %p\n", bar->foo, bar);



tracing_on() and tracing_off() for Kernel Debugging :-

There are times when the kernel crashes or a driver gets stuck in a sleep state and never wakes up. Disabling the tracer from user space when such a kernel event occurs is difficult, and usually results in the trace buffer overflowing and the relevant information being lost before the user can stop the trace.

There are two functions that work well inside the kernel: tracing_on() and tracing_off(). 
These two act just like echoing "1" or "0" respectively into the tracing_on file. 

If there is some condition that can be checked for inside the kernel, then the tracer may be stopped by adding something like the following:

        if (test_for_error())
                tracing_off();

This gives the flexibility to turn tracing on and off within a particular driver: you can enable tracing on some error and use trace_printk() until you disable it later.

The trace can then be examined, or saved to another file with:
cat trace > ~/trace.sav



Monday, January 18, 2016

How does the Linux kernel handle shared IRQs?

This topic is discussed in many places and is also covered in Linux Device Drivers, 3rd edition, by Corbet et al. However, I would still like to write it up in my own way.

The reason is that some things about this topic are quite confusing.


Basics :-

The handlers registered on each IRQ can be viewed via /proc/interrupts. They come from the drivers that have called request_irq(), passing a handler of the form:

irqreturn_t (*handler)(int, void *)

If the handler finds that the interrupt is not meant for its device, it must return IRQ_NONE. Each interrupt handler is passed its dev_id. Usually a bit is set in the hardware when it raises an interrupt, and the handler may need to acknowledge the interrupt by clearing a bit in one of the hardware's memory-mapped registers.


The Real Case :-

With a shared IRQ, multiple drivers register their handlers on a single interrupt number (interrupt line); that is, several device drivers listen on the same line. Each device entry in the device tree has an interrupt number, which the driver usually reads when registering its handler. In the shared case, multiple device nodes in the device tree carry the same IRQ number, representing a single interrupt line. Each of these drivers must pass the IRQF_SHARED flag and a non-NULL dev_id to request_irq().


Key Points :-

1. All interrupt handlers registered on the line are called when the interrupt fires.

2. Each driver knows about its own device's interrupts via memory-mapped registers, hence it can easily check whether its device is the cause of the interrupt or not. Each device's register space is mapped separately, so a driver only looks at its own device.

3. The dev_id of each driver is unique, but it is not useful for telling which device raised a shared interrupt: every handler is always passed its own dev_id, whichever device caused the interrupt. Hence it wouldn't be correct to try to solve this problem from the dev_id point of view.

Hence the only way to differentiate is step (2) above.

NOTE : It's the usual practice to trigger bottom halves or any other logic in the IRQ handler only after checking the IRQ status in a memory-mapped register. Hence a well-written driver solves the problem by default.
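
The key points above can be modelled in plain C. This is a userspace sketch and not kernel code: it assumes two hypothetical devices sharing one line, each with a status register holding a "pending" bit, and a dispatcher that calls every registered handler the way the kernel does for a shared IRQ. The irqreturn_t values are redefined locally so the example stands alone.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

#define FAKE_IRQ_PENDING 0x1u

/* Hypothetical device: one status register and a counter for the demo. */
struct fake_dev {
    uint32_t irq_status;
    int      handled_count;
};

/* Handler: check our own status register before doing any work. */
static irqreturn_t fake_dev_irq(int irq, void *dev_id)
{
    struct fake_dev *dev = dev_id;

    (void)irq;
    if (!(dev->irq_status & FAKE_IRQ_PENDING))
        return IRQ_NONE;                    /* not raised by our device */

    dev->irq_status &= ~FAKE_IRQ_PENDING;   /* acknowledge the interrupt */
    dev->handled_count++;
    return IRQ_HANDLED;
}

/* Models the kernel calling every handler registered on a shared line. */
static int dispatch_shared_irq(int irq, struct fake_dev **devs, size_t n)
{
    int handled = 0;

    for (size_t i = 0; i < n; i++)
        if (fake_dev_irq(irq, devs[i]) == IRQ_HANDLED)
            handled++;
    return handled;
}

/* Device b raised the interrupt, device a did not. */
static int shared_irq_demo(void)
{
    struct fake_dev a = { 0, 0 };
    struct fake_dev b = { FAKE_IRQ_PENDING, 0 };
    struct fake_dev *line[] = { &a, &b };
    int handled = dispatch_shared_irq(42, line, 2);

    return handled * 100 + a.handled_count * 10 + b.handled_count;
}
```

Both handlers are called, each with its own dev_id, and only the one whose status register shows a pending interrupt does any work. The other returns IRQ_NONE.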


An Example :

Magnetic Card Driver's Interrupt handler:-


  static irqreturn_t timag_irq(int irq, void *dev)
{
            struct timag *mag_dev = dev;
            bool  over_or_under_flow = false;
            unsigned int status, irqclr = 0;

             status = timag_readl(mag_dev, REG_IRQSTATUS);

             if ((status & IRQENB_FIFO_IRQ) )  {
                timag_enable(mag_dev, true);

Friday, January 15, 2016

Linux Security Topic : Key Management/Service

The text below explains the basics of the key service and the relevant terms. It discusses at length the role of the kernel in key service management. The key service is the kernel's abstraction for creating and managing cryptographic keys; the actual interface between crypto hardware drivers and the service is not covered here.

The Master References :- https://www.kernel.org/doc/Documentation/security/


===============================================================
HOW TO ENABLE KEY MANAGEMENT/SERVICE IN THE KERNEL/USERSPACE
===============================================================

The key service can be configured by enabling "Security options"/"Enable access key retention support" (CONFIG_KEYS), either in the arch/../../defconfig or dynamically using make menuconfig.

This service allows cryptographic keys, authentication tokens, cross-domain user mappings, and similar to be cached in the kernel for the use of filesystems and other kernel services.

In userspace, keyutils (the in-kernel key management utilities) is needed to support it. Refer :- http://man7.org/linux/man-pages/man7/keyutils.7.html for more details.


============
KEY BASICS
============

KEYS represent units of cryptographic data, authentication tokens, keyrings, etc. These are represented in the kernel by struct key.

Each key has a number of attributes:

A serial number :- Each key is issued a serial number of type key_serial_t that is unique for the lifetime of that key. All serial numbers are positive non-zero 32-bit integers.

A type :- Each key is of a defined "type". Types must be registered inside the kernel by a kernel service (such as a filesystem) before keys of that type can be added or used. Userspace programs cannot define new types directly. Key types are represented in the kernel by struct key_type. This defines a number of operations that can be performed on a key of that type. Note that if a type is removed from the system, all the keys of that type will be invalidated.

A description (for matching a key in a search) : Each key has a description. This should be a printable string. The key type provides an operation to perform a match between the description on a key and a criterion string.

Access control information : Each key has an owner user ID, a group ID and a permissions mask. These are used to control what a process may do to a key from userspace, and whether a kernel service will  be able to find the key. 

An expiry time : Each key can be set to expire at a specific time by the key type's instantiation function. Keys can also be immortal.

A payload : This is a quantity of data that represent the actual "key".  In the case of a keyring, this is a list of keys to which the keyring links; In the case of a user-defined key, it's an arbitrary blob of data.


State :

Each key can be in one of a number of basic states:

(a) Un-instantiated. The key exists, but does not have any data attached. Keys being requested from userspace will be in this state.

(b) Instantiated. This is the normal state. The key is fully formed, and has data attached.

(c) Negative. This is a relatively short-lived state. The key acts as a note saying that a previous call out to userspace failed, and acts as a throttle on key lookups. A negative key can be updated to a normal state.

(d) Expired. Keys can have lifetimes set. If their lifetime is exceeded, they traverse to this state. An expired key can be updated back to a normal state.

(e) Revoked. A key is put in this state by userspace action. It can't be found or operated upon (apart from by unlinking it).

(f) Dead. The key's type was unregistered, and so the key is now useless.
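The states listed above can be summarised in a toy model. This is not how the kernel stores key state; the enum and both helper functions are hypothetical names invented here purely to restate the rules from the list (only an instantiated key is "normal", and only negative and expired keys are described as updatable back to normal).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the states described above (not a kernel type). */
enum key_state_model {
    KS_UNINSTANTIATED,
    KS_INSTANTIATED,
    KS_NEGATIVE,
    KS_EXPIRED,
    KS_REVOKED,
    KS_DEAD
};

/* True only for the normal state, where the key is fully formed. */
bool key_state_is_normal(enum key_state_model s)
{
    return s == KS_INSTANTIATED;
}

/* True for the states the list says can be updated back to normal. */
bool key_state_recoverable_by_update(enum key_state_model s)
{
    switch (s) {
    case KS_NEGATIVE:
    case KS_EXPIRED:
        return true;
    case KS_UNINSTANTIATED:
    case KS_INSTANTIATED:
    case KS_REVOKED:
    case KS_DEAD:
        return false;
    }
    assert(!"unknown key state");
    return false;
}
```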

 
Important Points : 

(a) A payload is not mandatory; the payload can, in fact, just be a value stored in the struct key itself.

(b) When a key is instantiated, the key type's instantiation function is called with a blob of data, and that then creates the key's payload in the kernel.

(c) When userspace wants to read back the contents of the key, if permitted, another key type operation will be called to convert the key's attached payload back into a blob of data.

(d) Userspace programs can use a key's serial number as a way to gain access to it, subject to permission checking.

(e) Userspace programs cannot define new types directly. Key types are represented in the kernel by struct key_type. For further information on key_type refer :- https://www.kernel.org/doc/Documentation/security/keys.txt

(f) The kernel maintains all data structures for key management. Userspace can make system calls to use/update keys. We will discuss this in detail in the later part of the document.


=========
KEYRINGS
=========

Keyrings are a special type of key that can hold links to other keys. Each process has three standard keyring subscriptions that a kernel service can search for relevant keys.
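The three standard subscriptions are the thread, process and session keyrings. As a rough sketch, userspace can ask the kernel which real serial number sits behind each of them. The example assumes a Linux system with the kernel headers installed and uses the raw syscall interface rather than libkeyutils; resolve_keyring() and print_standard_keyrings() are helper names made up for this example, and key_serial_t is typedef'd locally.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/keyctl.h>   /* KEYCTL_GET_KEYRING_ID, KEY_SPEC_* */

typedef int32_t key_serial_t;   /* local typedef for this sketch */

/*
 * Map a special (negative) keyring ID to the serial number of the real
 * keyring. The last argument 0 means "do not create it if it is absent".
 * Returns a positive serial number, or -1 with errno set.
 */
key_serial_t resolve_keyring(key_serial_t special_id)
{
    assert(special_id < 0);
    return (key_serial_t)syscall(SYS_keyctl, (long)KEYCTL_GET_KEYRING_ID,
                                 (long)special_id, 0L);
}

/* Print the serial number of each standard keyring, if it exists. */
void print_standard_keyrings(void)
{
    static const struct { const char *name; key_serial_t id; } rings[] = {
        { "thread",  KEY_SPEC_THREAD_KEYRING  },
        { "process", KEY_SPEC_PROCESS_KEYRING },
        { "session", KEY_SPEC_SESSION_KEYRING },
    };

    for (size_t i = 0; i < sizeof rings / sizeof rings[0]; i++) {
        key_serial_t real = resolve_keyring(rings[i].id);
        if (real > 0)
            printf("%-8s keyring: serial %d\n", rings[i].name, (int)real);
        else
            printf("%-8s keyring: not available (%s)\n",
                   rings[i].name, strerror(errno));
    }
}
```

A thread or process keyring that has never been used usually does not exist yet, so "not available" for those two is expected on a fresh process.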


==================================================
USERSPACE SYSTEM CALL INTERFACE WITH THE KERNEL
==================================================

Userspace can manipulate keys directly through three syscalls: add_key, request_key and keyctl. It is important to note that in most cases the key is created and added by the kernel, and userspace only sees it and authenticates with it.

For example, Trusted and Encrypted Keys are two new key types added to the existing kernel key ring service. Both of these new types are variable length symmetric keys, and in both cases all keys are created in the kernel, and user space sees, stores, and loads only encrypted blobs.

1. Create a new key and add it to the nominated keyring:

key_serial_t add_key(const char *type, const char *desc,
    const void *payload, size_t plen,
    key_serial_t keyring);
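A minimal sketch of calling add_key from a program is shown here. It assumes the built-in "user" key type and the process keyring as the destination, and it goes through the raw syscall so that libkeyutils is not required; add_user_key() and add_user_key_demo() are names invented for this example. In sandboxes or containers that block the key syscalls the call simply fails with errno set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/keyctl.h>   /* KEY_SPEC_PROCESS_KEYRING */

typedef int32_t key_serial_t;   /* local typedef for this sketch */

/*
 * Add a key of type "user" to the process keyring.
 * Returns the new key's serial number, or -1 with errno set.
 */
key_serial_t add_user_key(const char *desc, const void *payload, size_t plen)
{
    assert(desc != NULL);
    return (key_serial_t)syscall(SYS_add_key, "user", desc, payload, plen,
                                 (long)KEY_SPEC_PROCESS_KEYRING);
}

/* Small demonstration: add one key and report the result. */
int add_user_key_demo(void)
{
    const char secret[] = "s3cret";
    key_serial_t id = add_user_key("demo:token", secret, strlen(secret));

    if (id < 0) {
        fprintf(stderr, "add_key failed: %s\n", strerror(errno));
        return -1;
    }
    printf("added key with serial %d\n", (int)id);
    return 0;
}
```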

2. Search the process's keyrings for a key, potentially calling out to userspace to create it. When referring to a key directly, userspace programs should use the key's serial number. Note that this API is similar to the one provided to kernel services (e.g. a filesystem) to search for or request a key.

key_serial_t request_key(const char *type, const char *description,
    const char *callout_info,
    key_serial_t dest_keyring);

For a fuller description of this API and its usage in relation to processes,
refer :- https://www.kernel.org/doc/Documentation/security/keys-request-key.txt
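As a hedged illustration of the lookup side, the helper below searches the calling process's keyrings for a "user" key by description. It passes NULL as callout_info, so no call out to userspace is attempted and a missing key is reported as an error (ENOKEY), and it passes 0 as dest_keyring so the found key is not linked anywhere. find_user_key() is a made-up name and the raw syscall is used instead of libkeyutils.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/keyctl.h>   /* KEY_SPEC_* constants */

typedef int32_t key_serial_t;   /* local typedef for this sketch */

/*
 * Look up a "user" key by description in the caller's keyrings.
 * Returns the key's serial number, or -1 with errno set
 * (ENOKEY when no matching key exists).
 */
key_serial_t find_user_key(const char *desc)
{
    assert(desc != NULL);
    return (key_serial_t)syscall(SYS_request_key, "user", desc,
                                 (const char *)NULL, 0L);
}
```

The serial number returned here is the same one add_key returned when the key was created, so it can be handed straight to keyctl operations.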


3. Read the payload data from a key, which is one of the important uses of keyctl. This function attempts to read the payload data from the specified key into the buffer. The process must have read permission on the key to succeed.

long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer, size_t buflen);
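One common way to use KEYCTL_READ is a two-step pattern: call it once with no buffer to learn how much data is available, then allocate and read. The sketch below follows that pattern through the raw syscall. read_key_payload() is a hypothetical helper name, and for brevity it gives up (rather than retrying) if the payload grows between the two calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/keyctl.h>   /* KEYCTL_READ */

typedef int32_t key_serial_t;   /* local typedef for this sketch */

/*
 * Read a key's payload into a freshly allocated, NUL-terminated buffer.
 * On success returns the buffer (caller frees it) and stores the payload
 * length in *len_out. Returns NULL on failure.
 */
char *read_key_payload(key_serial_t id, size_t *len_out)
{
    assert(len_out != NULL);

    /* First call: no buffer, just ask how many bytes are available. */
    long need = syscall(SYS_keyctl, (long)KEYCTL_READ, (long)id,
                        (char *)NULL, (size_t)0);
    if (need < 0)
        return NULL;

    char *buf = malloc((size_t)need + 1);
    if (buf == NULL)
        return NULL;

    /* Second call: actually copy the payload out. */
    long got = syscall(SYS_keyctl, (long)KEYCTL_READ, (long)id,
                       buf, (size_t)need);
    if (got < 0 || got > need) {    /* error, or payload grew meanwhile */
        free(buf);
        return NULL;
    }

    buf[got] = '\0';
    *len_out = (size_t)got;
    return buf;
}
```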

For all other keyctl operations, please refer :- http://man7.org/linux/man-pages/man7/keyutils.7.html

A practical use of keyctl is creating a key for a filesystem and then mounting the filesystem using that key.

Refer :- https://www.kernel.org/doc/Documentation/security/keys-ecryptfs.txt


============================================
USAGE OF KEYCTL for Filesystem key authentication
============================================

Important commands for keyctl :-

keyctl add encrypted name "new ecryptfs key-type:master-key-name keylen" ring
keyctl add encrypted name "load hex_blob" ring
keyctl update keyid "update key-type:master-key-name"

Example of encrypted key usage with the eCryptfs filesystem:

Create an encrypted key "1000100010001000" of length 64 bytes with format
'ecryptfs' and save it using a previously loaded user key "test":

    $ keyctl add encrypted 1000100010001000 "new ecryptfs user:test 64" @u
    19184530

    $ keyctl print 19184530
    ecryptfs user:test 64 490045d4bfe48c99f0d465fbbbb79e7500da954178e2de0697
    dd85091f5450a0511219e9f7cd70dcd498038181466f78ac8d4c19504fcc72402bfc41c2
    f253a41b7507ccaa4b2b03fff19a69d1cc0b16e71746473f023a95488b6edfd86f7fdd40
    9d292e4bacded1258880122dd553a661

    $ keyctl pipe 19184530 > ecryptfs.blob

Mount an eCryptfs filesystem using the created encrypted key "1000100010001000"
into the '/secret' directory:

    $ mount -i -t ecryptfs -oecryptfs_sig=1000100010001000,\
      ecryptfs_cipher=aes,ecryptfs_key_bytes=32 /secret /secret


=====================================
KERNEL SERVICES FOR KEY MANAGEMENT
=====================================

The kernel services for key management can be broken down into two areas: keys and key types.

Firstly, the kernel service registers its type, then it searches for a key of that type. It should retain the key as long as it has need of it, and then it should release it. 

E.g. For a filesystem or device file, a search would probably be performed during the open call, and the key released upon close. 

To access the key manager, the following header must be included:
#include <linux/key.h>

Some of the important kernel functions are described below; for the complete list refer :- https://www.kernel.org/doc/Documentation/security/keys.txt

1. A keyring can be created by:

struct key *keyring_alloc(const char *description, uid_t uid, gid_t gid,
    const struct cred *cred,
    key_perm_t perm,
    unsigned long flags,
    struct key *dest);

2. A kernel service may want to define its own key type. For instance, an AFS filesystem might want to define a key type. To do this, its author fills in a key_type struct and registers it with the system.

(a) To register a key type, the following function should be called:

int register_key_type(struct key_type *type);

(b) To unregister a key type, call:

void unregister_key_type(struct key_type *type);


3. To take an extra reference on a key it already holds, a kernel service calls one of the following functions:

struct key *__key_get(struct key *key);
struct key *key_get(struct key *key);

4. A key's serial number can be obtained by calling:

key_serial_t key_serial(struct key *key);

5. To search for a key, call:

struct key *request_key(const struct key_type *type,
    const char *description,
    const char *callout_info);

References :- https://www.kernel.org/doc/Documentation/security/