Thursday, August 23, 2012

Scheduling in Linux: Part 1


Introduction
---------------
The Linux scheduler is an interesting study in competing pressures. On one side are the use models in which Linux is applied. Although Linux was originally developed as a desktop operating system experiment, you'll now find it on servers, tiny embedded devices, mainframes, and supercomputers. Not surprisingly, the scheduling loads for these domains differ. On the other side are the technological advances made in the platform, including architectures (multiprocessing, symmetric multithreading, non-uniform memory access [NUMA]) and virtualization. Also embedded here is the balance between interactivity (user responsiveness) and overall fairness. From this perspective, it's easy to see how difficult the scheduling problem can be within Linux.


History
---------
Early Linux schedulers used minimal designs, obviously not focused on massive architectures with many processors or even hyperthreading. The 1.2 Linux scheduler used a circular queue for runnable task management that operated with a round-robin scheduling policy. This scheduler was efficient for adding and removing processes (with a lock to protect the structure). In short, the scheduler wasn't complex but was simple and fast.

Linux version 2.2 introduced the idea of scheduling classes, permitting scheduling policies for real-time tasks, non-preemptible tasks, and non-real-time tasks. The 2.2 scheduler also included support for symmetric multiprocessing (SMP).

The 2.4 kernel included a relatively simple scheduler that operated in O(N) time (as it iterated over every task during a scheduling event). The 2.4 scheduler divided time into epochs, and within each epoch, every task was allowed to execute up to its time slice. If a task did not use all of its time slice, then half of the remaining time slice was added to the new time slice to allow it to execute longer in the next epoch. The scheduler would simply iterate over the tasks, applying a goodness function (metric) to determine which task to execute next. Although this approach was simple, it was inefficient, scaled poorly, and was weak for real-time systems. It also lacked features to exploit new hardware architectures such as multi-core processors.
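To make the mechanics concrete, here is a small user-space sketch of the idea, written from the description above rather than from the 2.4 sources; the struct task fields, the goodness() weighting, and pick_next() are simplifications of what the kernel actually did:

#include <stddef.h>

/* Simplified model of a 2.4-era runnable task. */
struct task {
    struct task *next;   /* singly linked list of runnable tasks */
    int counter;         /* remaining time slice, in ticks       */
    int priority;        /* static priority / base time slice    */
};

/* "Goodness" metric: roughly, how deserving a task is of the CPU.
 * The real goodness() also considered things like CPU affinity;
 * this sketch keeps only the time-slice component.                */
static int goodness(const struct task *t)
{
    return t->counter ? t->counter + t->priority : 0;
}

/* O(N) selection: every scheduling decision walks all runnable tasks. */
struct task *pick_next(struct task *runnable)
{
    struct task *best = NULL;
    int best_g = -1;

    for (struct task *t = runnable; t; t = t->next) {
        int g = goodness(t);
        if (g > best_g) {
            best_g = g;
            best   = t;
        }
    }

    /* All runnable tasks have exhausted their slices: start a new epoch.
     * Unused ticks carry over as half of the old counter.               */
    if (best && best_g == 0) {
        for (struct task *t = runnable; t; t = t->next)
            t->counter = (t->counter / 2) + t->priority;
        return pick_next(runnable);   /* assumes priority > 0 */
    }
    return best;
}

With thousands of runnable tasks, both the selection walk and the epoch recharge touch every task, which is exactly the scalability problem the next generation of schedulers attacked.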

The early 2.6 scheduler, called the O(1) scheduler, was designed to solve many of the problems with the 2.4 scheduler: it no longer had to iterate the entire task list to identify the next task to schedule, which is what gave it its name, O(1), and made it far more efficient and scalable. The O(1) scheduler kept track of runnable tasks in a run queue (actually, two run queues for each priority level, one for active and one for expired tasks), so identifying the task to execute next simply meant dequeuing the next task from the appropriate active per-priority run queue. The O(1) scheduler also incorporated interactivity metrics with numerous heuristics to determine whether tasks were I/O-bound or processor-bound. But the O(1) scheduler became unwieldy in the kernel: the large mass of code needed to calculate the heuristics was fundamentally difficult to manage and, for the purist, lacked algorithmic substance.



Linux's Old O(1) Scheduler
-----------------------------------
The Linux scheduler was overhauled completely with the release of kernel 2.6. This new scheduler is called the O(1) scheduler—O(...) is referred to as “big O notation”. The name was chosen because the scheduler's algorithm required constant time to make a scheduling decision, regardless of the number of tasks. The algorithm used by the O(1) scheduler relies on active and expired arrays of processes to achieve constant scheduling time. Each process is given a fixed time quantum, after which it is preempted and moved to the expired array. Once all the tasks from the active array have exhausted their time quantum and have been moved to the expired array, an array switch takes place. This switch makes the active array the new empty expired array, while the expired array becomes the active array.
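The data structure behind that constant-time decision is easy to sketch. The following is a stripped-down user-space model, not the kernel's code: NUM_PRIO, struct prio_array, and the bitmap scan mirror the idea of the kernel's per-CPU priority arrays, with GCC's __builtin_ctzl standing in for the kernel's find-first-bit helper.

#include <stddef.h>

#define NUM_PRIO      140            /* 0..99 real-time, 100..139 normal */
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))
#define BITMAP_WORDS  ((NUM_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct task { struct task *next; int prio; };

/* One priority array: a list of runnable tasks per priority level,
 * plus a bitmap recording which levels are non-empty.               */
struct prio_array {
    unsigned long bitmap[BITMAP_WORDS];
    struct task  *queue[NUM_PRIO];
};

struct runqueue {
    struct prio_array  arrays[2];
    struct prio_array *active;    /* tasks that still have time slice */
    struct prio_array *expired;   /* tasks that used their slice up   */
};

static void enqueue(struct prio_array *a, struct task *t)
{
    t->next = a->queue[t->prio];
    a->queue[t->prio] = t;
    a->bitmap[t->prio / BITS_PER_LONG] |= 1UL << (t->prio % BITS_PER_LONG);
}

/* Find the highest-priority non-empty list with a bit scan.  The cost
 * depends only on the fixed number of priority levels, never on how
 * many tasks are runnable - hence "O(1)".                             */
struct task *pick_next(struct runqueue *rq)
{
    for (int w = 0; w < BITMAP_WORDS; w++) {
        if (!rq->active->bitmap[w])
            continue;
        int prio = w * BITS_PER_LONG + __builtin_ctzl(rq->active->bitmap[w]);
        return rq->active->queue[prio];
    }

    /* Active array drained: swap the pointers so expired becomes active. */
    struct prio_array *tmp = rq->active;
    rq->active  = rq->expired;
    rq->expired = tmp;
    return NULL;   /* caller retries after the swap in this sketch */
}

The pointer swap is what replaces the 2.4 epoch recalculation loop: each task gets a fresh time slice as it is moved to the expired array, so switching the two pointers is enough to start the next round.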
The main issue with this algorithm is the complex heuristics used to mark a task as interactive or non-interactive. The algorithm tries to identify interactive processes by analyzing average sleep time (the amount of time the process spends waiting for input). Processes that sleep for long periods of time are probably waiting for user input, so the scheduler assumes they're interactive. The scheduler gives a priority bonus to interactive tasks (for better responsiveness) while penalizing non-interactive tasks by lowering their priorities. All the calculations to determine the interactivity of tasks are complex and subject to potential miscalculations, causing non-interactive behavior from an interactive process.
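The flavour of those heuristics can be shown with a deliberately simplified calculation; MAX_BONUS, MAX_SLEEP_AVG, and effective_prio() below are illustrative stand-ins, not the kernel's actual constants or code:

#define MAX_BONUS      10
#define MAX_SLEEP_AVG  1000   /* ms of tracked sleep time (illustrative) */

struct task_info {
    int static_prio;   /* 100..139 for normal (non-real-time) tasks      */
    int sleep_avg;     /* running average of time spent sleeping, in ms  */
};

static int effective_prio(const struct task_info *t)
{
    /* Scale the sleep average into a 0..MAX_BONUS bonus: long sleepers
     * look interactive, while CPU hogs accumulate little sleep time.    */
    int bonus = t->sleep_avg * MAX_BONUS / MAX_SLEEP_AVG;

    /* A larger bonus yields a better (numerically lower) priority;
     * CPU-bound tasks end up below their static priority instead.       */
    int prio = t->static_prio - bonus + MAX_BONUS / 2;

    if (prio < 100) prio = 100;   /* clamp into the normal-task range */
    if (prio > 139) prio = 139;
    return prio;
}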



Difference between Kernel-Space and User-Space Task Scheduling
------------------------------------------------------------------------------------
The difference here is that, for kernel-space tasks, the interrupt is guaranteed to return the CPU to the task sooner or later. For user-space tasks, the interrupt could cause another task to be scheduled on that CPU, and execution of other tasks might well happen - the interrupted user-space task must be chosen again by the scheduler before it runs. Of course this explanation is not the entire story (as usual). Firstly, there are kernel-space subsystems that can register code to be run on the way back from an interrupt, such as bottom halves and tasklets. This does not change the fact, however, that the scheduler is not involved in the interruption of a kernel-space task. Secondly, interrupts can interrupt interrupts - a salient example is the ARM architecture, where fast interrupts (FIQs) have a higher hardware priority than normal interrupts (IRQs), so returning from an FIQ can land in another interrupt handler. Sooner or later, though, the original interrupt will complete and return the CPU to a kernel-space task.
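As an example of code registered to run on the way back from an interrupt, a driver's hard interrupt handler commonly defers work into a tasklet, which then runs in softirq context on the interrupt return path, still without the scheduler picking a task. This is a minimal sketch using the old-style (pre-5.9) tasklet API of that era; MY_IRQ and the "tasklet-demo" name are placeholders of mine, not anything from a real driver:

#include <linux/module.h>
#include <linux/interrupt.h>

#define MY_IRQ 42   /* placeholder IRQ number, purely illustrative */

/* Bottom half: runs after the hard interrupt handler returns, still in
 * interrupt (softirq) context, with no task scheduled to run it.       */
static void my_tasklet_fn(unsigned long data)
{
    pr_info("deferred work, in_interrupt()=%s\n",
            in_interrupt() ? "yes" : "no");
}
static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

/* Top half: acknowledge the hardware quickly and defer the rest. */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    tasklet_schedule(&my_tasklet);
    return IRQ_HANDLED;
}

static int __init demo_init(void)
{
    return request_irq(MY_IRQ, my_irq_handler, IRQF_SHARED,
                       "tasklet-demo", &my_tasklet);
}

static void __exit demo_exit(void)
{
    free_irq(MY_IRQ, &my_tasklet);
    tasklet_kill(&my_tasklet);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Note that the tasklet itself runs in interrupt context, so it must not sleep; that restriction comes up again in the next section.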


Processes, fibers, threads, etc. are all treated as tasks in the kernel
---------------------------------------------------------------------------------
Remember, to the scheduler and the kernel at large, every schedulable object (i.e. anything that can be chosen by the schedule() routine) is known as a task. No distinction is made between any of these objects, so what are often called processes, LWPs, kernel threads, fibers, threads, etc. are all just tasks to the kernel, each with its own particular characteristics. This is a big win in terms of kernel cleanliness - there is no real reason to separate the cases out.

These characteristics are particularly interesting, though. For example, some tasks may have user-space memory mappings and a stack - a typical example being a user-space process. The term process context is used to refer to one of these tasks executing in kernel space - it has both the user-space mappings and the (possibly temporary) kernel mappings and stack. In this context, copying to and from user memory makes sense.

Once again, what are sometimes known as "kernel threads" or "fibers" are not treated differently from other tasks. They may have user-space memory mappings just like "normal" processes. The only distinguishing feature is that the code executed by a kernel thread comes from the kernel or a module image, rather than from a binary process image.

The term interrupt context is often used to mean code currently executing as a result of a hardware interrupt. This encompasses ISRs, bottom halves, softirqs, and tasklets. Here there is no associated task as such, so calling schedule() is meaningless (and in fact a bug that will panic the kernel). This also means that you cannot sleep here, as sleeping implies a schedule.
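One way to see this uniformity from inside the kernel is to walk the task list: user processes and kernel threads alike show up as struct task_struct entries. Below is a minimal sketch of such a module (its name and output format are mine); note that on kernels newer than this post, for_each_process() lives in <linux/sched/signal.h> rather than <linux/sched.h>:

#include <linux/module.h>
#include <linux/sched.h>       /* for_each_process(), struct task_struct */
#include <linux/rcupdate.h>    /* rcu_read_lock() / rcu_read_unlock()    */

static int __init tasks_init(void)
{
    struct task_struct *p;

    rcu_read_lock();
    for_each_process(p) {
        /* Every schedulable object appears here as a task_struct;
         * p->mm points at the user-space memory descriptor, if any. */
        pr_info("task %-16s pid %d  user mm: %s\n",
                p->comm, p->pid, p->mm ? "yes" : "no");
    }
    rcu_read_unlock();

    return 0;
}

static void __exit tasks_exit(void)
{
}

module_init(tasks_init);
module_exit(tasks_exit);
MODULE_LICENSE("GPL");

Loading the module prints one line per task; kernel threads such as kswapd appear alongside ordinary processes, which is the point: the scheduler sees them all the same way.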

References:
http://kernelnewbies.org/Documents/SchedulingInUNIXAndLinux
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/


