The evolution of cluster scheduler architectures
This post is our first in a series of posts about task scheduling on large clusters, such as those operated by internet companies like Amazon, Google, Facebook, Microsoft, or Yahoo!, but increasingly elsewhere too. Scheduling is an important topic because it directly affects the cost of operating a cluster: a poor scheduler results in low utilization, which costs money as expensive machines are left idle. High utilization, however, is not sufficient on its own: antagonistic workloads interfere with other workloads unless the decisions are made carefully.
Architectural evolution
This post discusses how scheduler architectures have evolved over the last few years, and why this happened. Figure 1 visualises the different approaches: a gray square corresponds to a machine, a coloured circle to a task, and a rounded rectangle with an “S” inside corresponds to a scheduler. Arrows indicate placement decisions made by schedulers, and the three colours correspond to different workloads (e.g., web serving, batch analytics, and machine learning).
Many cluster schedulers – such as most high-performance computing (HPC) schedulers, the Borg scheduler, various early Hadoop schedulers and the Kubernetes scheduler – are monolithic. A single scheduler process runs on one machine (e.g., the JobTracker in Hadoop v1, and kube-scheduler in Kubernetes) and assigns tasks to machines. All workloads are handled by the same scheduler, and all tasks run through the same scheduling logic (Figure 1a). This is simple and uniform, and has led to increasingly sophisticated schedulers being developed. As an example, see the Paragon and Quasar schedulers, which use a machine learning approach to avoid negative interference between workloads competing for resources.
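To make the monolithic model concrete, here is a minimal sketch in Go of a single scheduler loop that pushes every task, from every workload type, through the same placement logic. The Task and Machine types and the first-fit policy are illustrative inventions, not Borg’s or the kube-scheduler’s actual data structures.

```go
package main

import "fmt"

// Task and Machine are simplified, hypothetical types; real schedulers
// track many more dimensions (priority, constraints, affinity, ...).
type Task struct {
	Name     string
	CPU, RAM int
}

type Machine struct {
	Name             string
	FreeCPU, FreeRAM int
}

// schedule is the single, shared placement logic that every workload
// passes through in a monolithic design: first machine with enough room.
func schedule(t Task, cluster []*Machine) *Machine {
	for _, m := range cluster {
		if m.FreeCPU >= t.CPU && m.FreeRAM >= t.RAM {
			m.FreeCPU -= t.CPU
			m.FreeRAM -= t.RAM
			return m
		}
	}
	return nil // no feasible machine; the task waits in the queue
}

func main() {
	cluster := []*Machine{{"m1", 4, 8}, {"m2", 8, 16}}
	queue := []Task{{"web-1", 2, 4}, {"batch-1", 8, 8}, {"ml-1", 4, 8}}

	// One process, one queue, one policy for all workload types.
	for _, t := range queue {
		if m := schedule(t, cluster); m != nil {
			fmt.Printf("%s -> %s\n", t.Name, m.Name)
		} else {
			fmt.Printf("%s -> unscheduled\n", t.Name)
		}
	}
}
```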
Most clusters run different types of applications today (as opposed to, say, just Hadoop MapReduce jobs in the early days). However, maintaining a single scheduler implementation that handles mixed (heterogeneous) workloads can be tricky, for several reasons:
- It is quite reasonable to expect a scheduler to treat long-running service jobs and batch analytics jobs differently.
- Since different applications have different needs, supporting them all keeps adding features to the scheduler, increasing the complexity of its logic and implementation.
- The order in which the scheduler processes tasks becomes an issue: queueing effects (e.g., head-of-line blocking) and backlog can build up unless the scheduler is carefully designed.
Overall, this sounds like the makings of an engineering nightmare – and the never-ending lists of feature requests that scheduler maintainers receive attest to this.
Two-level scheduling architectures address this problem by separating the concerns of resource allocation and task placement. This allows the task placement logic to be tailored towards specific applications, but also maintains the ability to share the cluster between them. The Mesos cluster manager pioneered this approach, and YARN supports a limited version of it. In Mesos, resources are offered to application-level schedulers (which may pick and choose from them), while YARN allows the application-level schedulers to request resources (and receive allocations in return). Figure 1b shows the general idea: workload-specific schedulers (S0–S2) interact with a resource manager that carves out dynamic partitions of the cluster resources for each workload. This is a very flexible approach that allows for custom, workload-specific scheduling policies.
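Here is a minimal sketch of the offer-based flavour of this idea, loosely inspired by Mesos but using made-up Go types and method names rather than the real Mesos framework API: the resource manager hands out offers, and a workload-specific scheduler picks the ones it wants.

```go
package main

import "fmt"

// Offer is a slice of a machine's resources that the resource manager
// hands to one application-level scheduler (a made-up type, not Mesos' API).
type Offer struct {
	Machine string
	CPU     int
}

// FrameworkScheduler is an application-level scheduler: it only sees the
// offers it is given, not the whole cluster state.
type FrameworkScheduler interface {
	// ResourceOffers returns, for each accepted offer, the task to launch;
	// offers it declines go back to the resource manager.
	ResourceOffers(offers []Offer) map[string]string // machine -> task
}

// batchScheduler is a toy workload-specific policy: accept any offer with
// at least 2 CPUs and launch one batch task on it.
type batchScheduler struct{ next int }

func (s *batchScheduler) ResourceOffers(offers []Offer) map[string]string {
	placed := map[string]string{}
	for _, o := range offers {
		if o.CPU >= 2 {
			s.next++
			placed[o.Machine] = fmt.Sprintf("batch-%d", s.next)
		}
	}
	return placed
}

func main() {
	// The resource manager decides which resources each framework may see.
	offers := []Offer{{"m1", 1}, {"m2", 4}, {"m3", 8}}
	var sched FrameworkScheduler = &batchScheduler{}

	for machine, task := range sched.ResourceOffers(offers) {
		fmt.Printf("launch %s on %s\n", task, machine)
	}
	// m1's offer was declined and returns to the resource manager's pool.
}
```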
Yet, the separation of concerns in two-level architectures comes with a drawback: the application-level schedulers lose omniscience, i.e., they cannot see all the possible placement options any more. Instead, they merely see those options that correspond to resources offered (Mesos) or allocated (YARN) by the resource manager component. This has several disadvantages:
- Priority preemption (higher priority tasks kick out lower priority ones) becomes difficult to implement: in an offer-based model, the resources occupied by running tasks aren’t visible to the upper-level schedulers; in a request-based model, the lower-level resource manager must understand the preemption policy (which may be application-dependent).
- Schedulers are unable to consider interference from running workloads that may degrade resource quality (e.g., “noisy neighbours” that saturate I/O bandwidth), since they cannot see them.
- Application-specific schedulers care about many different aspects of the underlying resources, but their only means of choosing resources is the offer/request interface with the resource manager. This interface can easily become quite complex.
Shared-state architectures address this by moving to a semi-distributed model, in which multiple replicas of cluster state are independently updated by application-level schedulers, as shown in Figure 1c. After the change is applied locally, the scheduler issues an optimistically concurrent transaction to update the shared cluster state. This transaction may fail, of course: another scheduler may have made a conflicting change in the meantime.
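A minimal sketch of this optimistic-concurrency pattern, using hypothetical Go types (Omega’s real cell state carries far more information than a version counter and free CPU counts): each scheduler reads a snapshot, decides locally, and then tries to commit; the commit fails if another scheduler changed the state in the meantime.

```go
package main

import (
	"fmt"
	"sync"
)

// CellState is the shared cluster state; the version number lets commits
// detect conflicting concurrent changes (a simplified, hypothetical model).
type CellState struct {
	mu      sync.Mutex
	version int
	freeCPU map[string]int
}

// Snapshot returns a consistent local copy of the state for a scheduler
// to plan against, together with the version it was read at.
func (c *CellState) Snapshot() (int, map[string]int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	copyCPU := map[string]int{}
	for m, v := range c.freeCPU {
		copyCPU[m] = v
	}
	return c.version, copyCPU
}

// Commit applies a placement only if the state has not changed since the
// scheduler read it (optimistic concurrency); otherwise the transaction fails
// and the scheduler must re-read the state and try again.
func (c *CellState) Commit(readVersion int, machine string, cpu int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if readVersion != c.version || c.freeCPU[machine] < cpu {
		return false // conflict: another scheduler got there first
	}
	c.freeCPU[machine] -= cpu
	c.version++
	return true
}

func main() {
	state := &CellState{freeCPU: map[string]int{"m1": 4}}

	// Two schedulers read the same snapshot and race to place a task on m1.
	v1, _ := state.Snapshot()
	v2, _ := state.Snapshot()
	fmt.Println("scheduler A commit:", state.Commit(v1, "m1", 4)) // succeeds
	fmt.Println("scheduler B commit:", state.Commit(v2, "m1", 4)) // fails, must retry
}
```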
The most prominent examples of shared-state designs are Omega at Google, and Apollo at Microsoft, as well as the Nomad container scheduler by Hashicorp. All of these materialise the shared cluster state in a single location: the “cell state” in Omega, the “resource monitor” in Apollo, and the “plan queue” in Nomad. Apollo differs from the other two as its shared-state is read-only, and the scheduling transactions are submitted directly to the cluster machines. The machines themselves check for conflicts and accept or reject the changes. This allows Apollo to make progress even if the shared-state is temporarily unavailable.
A “logical” shared-state design can also be achieved without materialising the full cluster state anywhere. In this approach (somewhat similar to what Apollo does), each machine maintains its own state and sends updates to different interested agents such as schedulers, machine health monitors, and resource monitoring systems. Each machine’s local view of its state now forms a “shard” of the global shared-state.
However, shared-state architectures have some drawbacks, too: they must work with stale information (unlike a centralized scheduler), and may experience degraded scheduler performance under high contention (although this can apply to other architectures as well).
Fully-distributed architectures take the disaggregation even further: they have no coordination between schedulers at all, and use many independent schedulers to service the incoming workload, as shown in Figure 1d. Each of these schedulers works purely with its local, partial, and often out-of-date view of the cluster. Jobs can typically be submitted to any scheduler, and each scheduler may place tasks anywhere in the cluster. Unlike with two-level schedulers, there are no partitions that each scheduler is responsible for. Instead, the overall schedule and resource partitioning are emergent consequences of statistical multiplexing and randomness in workload and scheduler decisions – similar to shared-state schedulers, albeit without any central control at all.
The recent distributed scheduler movement probably started with the Sparrow paper, although the underlying concept (power of multiple random choices) first appeared in 1996. The key premise of Sparrow is a hypothesis that the tasks we run on clusters are becoming ever shorter in duration, supported by an argument that fine-grained tasks have many benefits. Consequently, the authors assume that tasks are becoming more numerous, meaning that a higher decision throughput must be supported by the scheduler. Since a single scheduler may not be able to keep up with this throughput (assumed to be a million tasks per second!), Sparrow spreads the load across many schedulers.
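A minimal sketch of the power-of-multiple-random-choices idea that Sparrow builds on (my own simplification in Go, not Sparrow’s actual batch-sampling and late-binding protocol): probe a small number of randomly chosen workers and enqueue the task at the least-loaded one.

```go
package main

import (
	"fmt"
	"math/rand"
)

// queueLen simulates probing a worker for its current queue length; in a
// real system this would be an RPC to the worker's local task queue.
func queueLen(worker int, queues []int) int {
	return queues[worker]
}

// place picks d workers uniformly at random and enqueues the task at the
// least-loaded of them -- the "power of d random choices".
func place(queues []int, d int) int {
	best := rand.Intn(len(queues))
	for i := 1; i < d; i++ {
		candidate := rand.Intn(len(queues))
		if queueLen(candidate, queues) < queueLen(best, queues) {
			best = candidate
		}
	}
	queues[best]++ // the task joins the worker-side FIFO queue
	return best
}

func main() {
	queues := make([]int, 100) // 100 workers, initially idle
	for task := 0; task < 1000; task++ {
		place(queues, 2) // each scheduling decision probes just 2 workers
	}

	max := 0
	for _, q := range queues {
		if q > max {
			max = q
		}
	}
	// With d=2 the longest queue stays close to the mean (10 tasks),
	// far better than purely random placement would achieve.
	fmt.Println("longest worker queue:", max)
}
```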
This makes perfect sense: the lack of central control can be conceptually appealing, and it suits some workloads very well – more on this in a future post. For the moment, it suffices to note that since the distributed schedulers are uncoordinated, they apply significantly simpler logic than advanced monolithic, two-level, or shared-state schedulers. For example:
- Distributed schedulers are typically based on a simple “slot” concept that chops each machine into n uniform slots, and places up to n parallel tasks. This simplifies over the fact that tasks’ resource requirements are not uniform.
- They also use worker-side queues with simple service disciplines (e.g., FIFO in Sparrow), which restricts scheduling flexibility, as the scheduler can merely choose at which machine to enqueue a task.
- Distributed schedulers have difficulty enforcing global invariants (e.g., fairness policies or strict priority precedence), since there is no central control.
- Since they are designed for rapid decisions based on minimal knowledge, distributed schedulers cannot support or afford complex or application-specific scheduling policies. Avoiding interference between tasks, for example, becomes tricky.
Hybrid architectures are a recent (mostly academic) invention that seeks to address these drawbacks of fully distributed architectures by combining them with monolithic or shared-state designs. The way this typically works – e.g., in Tarcil, Mercury, and Hawk – is that there really are two scheduling paths: a distributed one for part of the workload (e.g., very short tasks, or low-priority batch workloads), and a centralized one for the rest. Figure 1e illustrates this design. The behaviour of each constituent part of a hybrid scheduler is identical to the part’s architecture described above. In practice, however, no hybrid schedulers have been deployed in production settings yet, as far as I know.
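A minimal sketch of the hybrid idea: a front-end dispatcher sends short, low-priority work down a distributed fast path and everything else to a centralized scheduler. The 5-second cut-off and the priority rule below are made-up illustrations, not the classification logic of Tarcil, Mercury, or Hawk.

```go
package main

import (
	"fmt"
	"time"
)

type Job struct {
	Name        string
	ExpectedRun time.Duration
	Priority    int
}

// dispatch chooses a scheduling path for each job. The threshold values
// are illustrative only, not taken from any real system.
func dispatch(j Job) string {
	if j.ExpectedRun < 5*time.Second && j.Priority < 5 {
		return "distributed fast path" // e.g., Sparrow-style sampling
	}
	return "centralized scheduler" // full-featured, slower placement
}

func main() {
	jobs := []Job{
		{"interactive-query", 200 * time.Millisecond, 1},
		{"web-frontend", 30 * 24 * time.Hour, 9},
		{"nightly-etl", 2 * time.Hour, 3},
	}
	for _, j := range jobs {
		fmt.Printf("%s -> %s\n", j.Name, dispatch(j))
	}
}
```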
What does this mean in practice?
Discussion about the relative merits of different scheduler architectures is not merely an academic topic, although it naturally revolves around research papers. For an extensive discussion of the Borg, Mesos and Omega papers from an industry perspective, for example, see Andrew Wang’s excellent blog post. Moreover, many of the systems discussed are deployed in production settings at large enterprises (e.g., Apollo at Microsoft, Borg at Google, and Mesos at Apple), and they have in turn inspired other systems that are available as open source projects.
These days, many clusters run containerised workloads, and consequently a variety of container-focused “orchestration frameworks” have appeared. These are similar to what Google and others call “cluster managers”. However, there are few detailed discussions of the schedulers within these frameworks and their design principles, and they typically focus more on the user-facing scheduler APIs (e.g., this report by Armand Grillet, which compares Docker Swarm, Mesos/Marathon, and the Kubernetes default scheduler). Moreover, many users neither know what difference the scheduler architecture makes, nor which one is most suitable for their applications.
Figure 2 shows an overview of a selection of open-source orchestration frameworks, their architecture and the features supported by their schedulers. At the bottom of the table, we also include closed-source systems at Google and Microsoft for reference. The resource granularity column indicates whether the scheduler assigns tasks to fixed-size slots, or whether it allocates resources in multiple dimensions (e.g., CPU, memory, disk I/O bandwidth, network bandwidth, etc.).
One key aspect that helps determine an appropriate scheduler architecture is whether your cluster runs a heterogeneous (i.e., mixed) workload. This is the case, for example, when combining production front-end services (e.g., load-balanced web servers and memcached) with batch data analytics (e.g., MapReduce or Spark). Such combinations make sense in order to improve utilization, but the different applications have different scheduling needs. In a mixed setting, a monolithic scheduler likely results in sub-optimal assignments, since the logic cannot be diversified on a per-application basis. A two-level or shared-state scheduler will likely offer benefits here.
Most user-facing service workloads run with resource allocations sized to serve peak demand expected of each container, but in practice they typically under-utilize their allocations substantially. In this situation, being able to opportunistically over-subscribe the resources with lower-priority workloads (while maintaining QoS guarantees) is the key to an efficient cluster. Mesos is currently the only open-source system that ships support for such over-subscription, although Kubernetes has a fairly mature proposal for adding it. We should expect more activity in this space in the future, since the utilization of most clusters is still substantially lower than the 60-70% reported for Google’s Borg clusters. We will focus on resource estimation, over-subscription and efficient machine utilization in a future post in this series.
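A minimal sketch of the estimation step behind over-subscription (my own simplification; the Mesos oversubscription module and the Kubernetes proposal are considerably more careful): CPU that is allocated to a service but, by recent observation, unused can be offered to lower-priority, revocable tasks.

```go
package main

import "fmt"

// reclaimable estimates how much of a machine's allocated CPU can be
// offered to best-effort tasks: allocation minus observed peak usage,
// minus a safety margin to protect the latency-sensitive tasks' QoS.
// The 10% margin is an illustrative choice, not from any real system.
func reclaimable(allocated, observedPeak, capacity float64) float64 {
	margin := 0.1 * capacity
	free := allocated - observedPeak - margin
	if free < 0 {
		return 0
	}
	return free
}

func main() {
	// A machine with 32 cores: 28 are reserved for peak front-end demand,
	// but the service has only been observed to use 12 of them.
	fmt.Printf("cores available for revocable tasks: %.1f\n",
		reclaimable(28, 12, 32))
}
```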
Finally, specific analytics and OLAP-style applications (for example, Dremel or SparkSQL queries) can benefit from fully-distributed schedulers. However, fully-distributed schedulers (like e.g., Sparrow) come with fairly restricted feature sets, and thus work best when the workload is homogeneous (i.e., all tasks run for roughly the same time), set-up times are low (i.e., tasks are scheduled to long-running workers, as e.g., with MapReduce application-level tasks in YARN), and task churn is very high (i.e., many scheduling decisions must be made in a short time). We will talk more about these conditions and why fully-distributed schedulers – and the distributed components of hybrid schedulers – only make sense for these applications in the next blog post in this series. For now, it suffices to observe that distributed schedulers are substantially simpler than others, and do not support multiple resource dimensions, over-subscription, or re-scheduling.
Overall, the table in Figure 2 is evidence that the open-source frameworks still have some way to go until they match the feature sets of advanced, but closed-source systems. This should serve as a call to action: as a result of missing features, utilization suffers, task performance is unpredictable, noisy neighbours cause pagers to go off, and elaborate hacks are required to coerce schedulers into supporting some user needs.
However, there is some good news: while many frameworks have monolithic schedulers today, many are also moving towards more flexible designs. Kubernetes already supports pluggable schedulers (the kube-scheduler pod can be replaced by another API-compatible scheduler pod), multiple schedulers from v1.2, and has ongoing work on “extenders” to supply custom policies. Docker Swarm may – to my understanding – also gain pluggable scheduler support in the future.