WHOLEPRO: An Online, Holistic Job Scheduling and Resource Provisioning Framework for Datacenter Architectures and Applications

WHOLEPRO is a top-down, holistic approach that aims to meet diverse user Quality-of-Experience (QoE) requirements for individual customers while allowing for user-utility-aware fair resource allocation; together these are called the dual goal in this project. It starts from the very top, i.e., the QoE or user utility requirements at the user level. It then works its way downward by mapping user-level requirements into datacenter system-level resource demands, which, together with the datacenter computing and networking resource constraints, set the boundary conditions for a user-utility-aware optimization framework. The solution to this framework is composed of a set of host-based computing and flow controllers and in-network load balancers. These controllers and load balancers not only ensure that QoE requirements for individual users are met, but also work in concert to enable user-utility-aware fair resource allocation among all the users, hence achieving the dual goal.


WHOLEPRO framework


WHOLEPRO is a resource allocation framework upon which job schedulers can be developed to provide tail-latency SLO guarantees, high resource utilization, and fair resource sharing. It is a top-down approach that takes the entire system stack into account. It decouples the upper job-level design from the lower task-level or runtime-system design. The solution is composed of a set of fully distributed, host-based task compute and flow controllers.
The above figure shows the job scheduling framework, which applies to both centralized and distributed job scheduling. A job scheduler, J-S, runs on a master node, and distributed task schedulers, T-Ss, run on the individual workers in the cluster. Each task scheduler is mainly composed of a computing controller, C-C, and a flow controller, F-C, for each flow emitted from the worker.


SLO-to-Task-Resource-Demand Translation

Consider job j with fanout Nj and the n-th tail-latency SLO Tn arriving at a job master. The master calculates the corresponding task response time budget ( Ej , Vj ) for each task. The task response time budget is divided into a networking budget ( Efj , Vfj ) and a compute budget ( Ecj , Vcj ). Network flow control and task scheduling control schemes are then developed to meet the networking and computing time budgets.
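The translation step can be illustrated with a minimal sketch. It is not the WHOLEPRO translation itself, which is not spelled out here. It assumes that task latencies are independent and identically distributed, that a job finishes when its slowest task finishes, that ( E , V ) denotes a mean and a variance, and that the networking and compute components are independent so their means and variances add. The function names and the net_share knob are hypothetical.

```python
def per_task_percentile(job_percentile, fanout):
    """Percentile each task must meet so that the slowest of `fanout`
    tasks meets `job_percentile`. Assumes i.i.d. task latencies, so
    P(max <= T) = P(task <= T) ** fanout."""
    if not 0.0 < job_percentile < 1.0:
        raise ValueError("job_percentile must be in (0, 1)")
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return job_percentile ** (1.0 / fanout)


def split_budget(mean, variance, net_share):
    """Split a task budget (mean, variance) into networking and compute
    parts. Assumes independent components (means and variances add) and
    applies the same hypothetical share to both moments."""
    if not 0.0 <= net_share <= 1.0:
        raise ValueError("net_share must be in [0, 1]")
    net = (mean * net_share, variance * net_share)
    compute = (mean - net[0], variance - net[1])
    return net, compute


if __name__ == "__main__":
    # A 99th-percentile job SLO with fanout 100 needs a much stricter
    # per-task percentile.
    print(per_task_percentile(0.99, 100))
    print(split_budget(10.0, 4.0, 0.25))
```

Under the independence assumption, a larger fanout pushes the per-task percentile closer to 1, which is why the job-level SLO has to be translated rather than applied to each task directly.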


Utility Maximization (UM) Framework

A network utility maximization (NUM) framework maximizes the total user utility while maintaining the minimum user utility as a flow constraint. For each task with a networking utility function u(rfj,k) and a computing utility function u(rcj,k), the NUM framework is

subject to both link bandwidth constraints and flow/computing rate constraints for flows/tasks with minimum user utility requirements.
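A much simplified instance of such a problem can be solved in closed form. The sketch below is an illustration only, not the WHOLEPRO formulation. It assumes a single shared link, a weighted logarithmic utility w * log(r) per flow, and a minimum rate per flow standing in for the minimum user utility. The function name and parameters are hypothetical.

```python
def allocate_rates(capacity, weights, min_rates):
    """Maximize sum(w_i * log(r_i)) subject to sum(r_i) <= capacity and
    r_i >= min_rates[i], on one link (assumed setting).

    The optimum is r_i = max(min_i, w_i / price). It is found by pinning
    flows whose weighted share falls below their minimum and sharing the
    remaining capacity among the other flows in proportion to weight."""
    if sum(min_rates) > capacity:
        raise ValueError("minimum rates exceed link capacity")
    n = len(weights)
    pinned = [False] * n
    while True:
        free_cap = capacity - sum(m for m, p in zip(min_rates, pinned) if p)
        free_w = sum(w for w, p in zip(weights, pinned) if not p)
        rates = [
            m if p else w * free_cap / free_w
            for w, m, p in zip(weights, min_rates, pinned)
        ]
        violated = [i for i in range(n) if not pinned[i] and rates[i] < min_rates[i]]
        if not violated:
            return rates
        for i in violated:
            pinned[i] = True


if __name__ == "__main__":
    # Flow 0 needs at least 4 units; the rest is shared 1:2 by weight.
    print(allocate_rates(10.0, [1.0, 1.0, 2.0], [4.0, 0.0, 0.0]))
```

In this toy case the flow with a minimum requirement is served first, and the remaining capacity is divided among the other flows in proportion to their weights.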


HOLNET: Holistic, user-utility-based flow rate allocation framework

HOLNET introduces the HOLNET network utility maximization (NUM), a soft-minimum-user-utility-guaranteed, center-of-utility-fairness-based NUM. It covers a large solution design space, allowing for integrated, multi-Class-of-Service (CoS) enabled, host-based single/multipath congestion control and in-network load balancing.
The different user utilities are mapped to a weighted base user utility. The weights for the different user utilities are based on the center of utility, which yields a center-of-utility-fair flow rate allocation, provided that the minimum flow rates needed to sustain the minimum user utilities are satisfied.


TLG: tail-latency-SLO-guaranteed job scheduler

In TLG, a decomposition technique is employed to translate the job tail-latency SLO into task-level performance budgets for the individual tasks spawned by the job. This effectively decomposes a hard job-level cotask/coflow resource allocation problem into distributed task-level resource allocation subproblems for individual tasks and task flows. At the task level, a utility maximization (UM) framework is proposed to enable joint task compute and flow fair resource sharing, subject to task budget constraints. The solution to this UM is in the form of distributed task compute and flow controllers that work in concert to achieve the three design objectives.
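How distributed controllers can jointly solve a utility maximization problem can be illustrated with a textbook dual (price-based) iteration. This is a generic sketch, not the TLG controller design. It assumes a single shared resource, a weighted logarithmic utility per controller, and a fixed step size. All names and parameter values are hypothetical.

```python
def price_iteration(capacity, weights, step=0.01, rounds=2000, price=1.0):
    """Dual-decomposition sketch for maximizing sum(w_i * log(r_i))
    subject to sum(r_i) <= capacity (assumed single resource).

    Each controller picks its own rate from the current price alone,
    and the resource raises the price when demand exceeds capacity and
    lowers it otherwise."""
    for _ in range(rounds):
        # Local step: each controller maximizes w * log(r) - price * r.
        rates = [w / price for w in weights]
        # Resource step: gradient update on the price, kept positive.
        price = max(1e-9, price + step * (sum(rates) - capacity))
    return [w / price for w in weights], price


if __name__ == "__main__":
    rates, price = price_iteration(10.0, [1.0, 1.0, 2.0])
    print(rates, price)
```

No controller needs to know the other controllers' weights or rates; the price is the only shared signal. With these example values the rates settle at a weight-proportional split of the capacity.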