The main objective of task scheduling is to map arriving tasks onto the available processors so as to minimize the schedule length without violating precedence constraints. Different sections of an application often require different types of computation, so a single-machine architecture cannot satisfy all the computational requirements of such applications. It is therefore essential to schedule the different tasks of these applications onto suitable processors across distributed heterogeneous computing resources. Such a system makes it possible to execute computationally intensive applications with diverse computational requirements, but its performance depends heavily on the scheduling algorithm used.
We assume that the physical locations of VMs are determined by the provider. The responsibility for deciding when to request resources, how many of which type, and how to schedule the user's application on the allocated resources lies with the user.
In a heterogeneous environment, it is important to schedule a job on its preferred resources to achieve high performance. In addition, it is not straightforward to provide fairness when multiple jobs share the cluster. Moreover, the capacity of different machine types needs to be considered when a user makes a resource request.
Task scheduling algorithms are either static or dynamic in nature. Static algorithms are fairer, more stable, and easier to use, so they are more commonly employed; Hadoop likewise uses a static algorithm for task scheduling.
Performance metrics for task scheduling algorithms include schedule length, speedup, efficiency, and time complexity:
- Schedule length: the maximum finish time of the exit task in the scheduled DAG (Directed Acyclic Graph).
- Speedup: the ratio of the schedule length obtained by assigning all tasks to the fastest processor to the schedule length of the application.
- Efficiency: the speedup divided by the number of processors used; it measures the percentage of processor time spent doing useful computation.
- Time complexity: the amount of time taken to assign every task to a specific processor according to a given priority.

Once a set of resources such as virtual machines is allocated by the cloud driver, the analytics engine uses resources that are heterogeneous and shared among multiple jobs. In this section, we consider the challenges of job scheduling on a shared, heterogeneous cluster: providing good performance while guaranteeing fairness.
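The schedule length, speedup, and efficiency definitions above can be sketched in a few lines of code. This is a minimal illustration with made-up task finish times and processor counts, not output from any actual scheduler:

```python
# Sketch: computing schedule length, speedup, and efficiency for a
# hypothetical schedule. All task names and times are illustrative.

def schedule_length(finish_times):
    """Schedule length (makespan): maximum finish time over all tasks."""
    return max(finish_times.values())

def speedup(time_on_fastest_processor, makespan):
    """Ratio of the all-tasks-on-fastest-processor time to the makespan."""
    return time_on_fastest_processor / makespan

def efficiency(speedup_value, num_processors):
    """Speedup divided by the number of processors used."""
    return speedup_value / num_processors

# Example: three tasks scheduled across two processors.
finish = {"t1": 4.0, "t2": 7.0, "t3": 10.0}  # finish times after scheduling
makespan = schedule_length(finish)           # 10.0
s = speedup(16.0, makespan)                  # 16 units sequentially -> 1.6
e = efficiency(s, 2)                         # 0.8 on 2 processors
```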
Share and Fairness
In a shared cluster, providing fairness is one of the most important features that the analytics engine should support. There are different ways to define fairness, but one approach is to have each job receive an equal (or weighted) share of computing resources at any given moment. In that sense, the Hadoop Fair Scheduler takes the number of slots assigned to a job as its metric of share, and fairness is provided by assigning the same number of slots to each job. In a heterogeneous cluster, however, slots differ from one another, so the number of slots may not be an appropriate metric of share. Moreover, even on the same slot, the computation speed varies from job to job. This performance variance across resources also matters for the overall performance of the data analytics cluster: if jobs are assigned to slots irrespective of their preferences, they will not only run slowly but may also prevent other jobs that prefer those slots from utilizing them.
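The mismatch between slot count and actual progress can be made concrete with a small numerical sketch. The jobs, slot counts, and per-slot rates below are invented for illustration:

```python
# Sketch (illustrative numbers): why slot count can be a misleading share
# metric on a heterogeneous cluster. Two jobs each hold 2 slots, so by
# slot count their shares are equal, yet their computing rates on those
# slots differ widely.

slots_held = {"jobA": 2, "jobB": 2}

# Per-slot computing rates (tasks/minute) on the slots each job received;
# jobA landed on slots it prefers, jobB did not.
rates = {"jobA": [3.0, 3.0], "jobB": [1.0, 0.5]}

slot_share = {j: n / sum(slots_held.values()) for j, n in slots_held.items()}
throughput = {j: sum(r) for j, r in rates.items()}

# slot_share  -> {'jobA': 0.5, 'jobB': 0.5}  "equal" by slot count
# throughput  -> {'jobA': 6.0, 'jobB': 1.5}  far from equal progress
```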
Progress share refers to how much progress each job is making with its assigned resources (or slots in Hadoop) compared to running the job on the entire cluster without sharing; it therefore lies between 0 (no progress at all) and 1 (the job occupies all available resources).
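The progress-share definition above can be sketched directly. The computing-rate table `CR` below is a hypothetical per-slot rate matrix, not data from a real cluster:

```python
# Sketch of progress share. CR[job][slot] is the job's computing rate on
# that slot (work units per second); all values are assumptions.

CR = {
    "jobA": {"s1": 4.0, "s2": 1.0, "s3": 1.0},
    "jobB": {"s1": 2.0, "s2": 2.0, "s3": 2.0},
}

def progress_share(job, assigned_slots):
    """Progress with the assigned slots relative to running the job
    alone on the entire cluster, so the value lies in [0, 1]."""
    whole_cluster = sum(CR[job].values())
    assigned = sum(CR[job][s] for s in assigned_slots)
    return assigned / whole_cluster

# jobA gets one fast slot, jobB gets two slower ones: both reach the
# same progress share (4/6) despite unequal slot counts.
share_a = progress_share("jobA", {"s1"})        # 4.0 / 6.0
share_b = progress_share("jobB", {"s2", "s3"})  # 4.0 / 6.0
```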
To calculate the progress share of each job, the analytics engine scheduler should be aware of the per-slot computing rate (CR). To that end, each job goes through two phases: calibration and normal. The calibration phase starts when a job is submitted; in this phase, the scheduler measures the completion times of the job's initial tasks to determine the CR. Once the scheduler knows the CR, the job enters the normal phase. During this phase, when a slot becomes available, a job whose share is less than its minimum or fair share is selected. However, if there is another job with a significantly higher computing rate on the slot, the scheduler chooses that job instead to improve overall performance. This is similar to the "delay scheduling" mechanism in the Hadoop Fair Scheduler. By using progress share, the scheduler can make appropriate decisions, so the cluster is better utilized and each job receives a fair amount of computing resources.
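The two-phase scheme can be sketched as follows. All names and the `BOOST` threshold are assumptions made for illustration; this is not the actual Hadoop Fair Scheduler implementation:

```python
# Minimal sketch of the calibration/normal two-phase scheduling idea.
BOOST = 2.0  # assumed factor by which another job must be faster to win

class Job:
    def __init__(self, name):
        self.name = name
        self.cr = {}              # slot type -> computing rate (calibrated)
        self.progress_share = 0.0
        self.fair_share = 0.0

    def calibrated(self, slot_type):
        return slot_type in self.cr

def record_calibration(job, slot_type, task_time, task_work=1.0):
    # Calibration phase: derive the computing rate from a measured
    # task completion time on this slot type.
    job.cr[slot_type] = task_work / task_time

def pick_job(jobs, slot_type):
    # Normal phase: prefer the job furthest below its fair share...
    candidates = [j for j in jobs if j.calibrated(slot_type)]
    if not candidates:
        return None
    fair_pick = min(candidates, key=lambda j: j.progress_share - j.fair_share)
    # ...but hand the slot to a much faster job when one exists,
    # trading strict fairness for throughput (akin to delay scheduling).
    fastest = max(candidates, key=lambda j: j.cr[slot_type])
    if fastest.cr[slot_type] > BOOST * fair_pick.cr[slot_type]:
        return fastest
    return fair_pick

# Example: job B is much faster on "cpu" slots, so it is chosen even
# though job A is further below its fair share.
a, b = Job("A"), Job("B")
record_calibration(a, "cpu", task_time=2.0)   # CR = 0.5
record_calibration(b, "cpu", task_time=0.2)   # CR = 5.0
a.progress_share, a.fair_share = 0.1, 0.5
b.progress_share, b.fair_share = 0.5, 0.5
chosen = pick_job([a, b], "cpu")              # job B
```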