Morpheus: Towards Automated SLOs for Enterprise Clusters

Posted 2022-12-10 银灯玉箫


Title
2016, Sangeeth Abdu Jyothi, OSDI

Summary
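To be filled in last, after the notes are finished: a summary of the paper's content, to be read first when revisiting these notes. Note: the summary must come from your own thinking and be written in your own words. Avoid copying (Ctrl + C) directly from the original.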

Research Objective(s)
Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and job performance predictability, which are coveted by operators and users respectively.
covet (v.): to crave or long for, especially something that belongs to someone else.
coveted [ˈkʌvɪtid] (US and UK): much sought after; enviable.

We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance (e.g. due to failures) by means of dynamic reprovisioning of jobs.

Background / Problem Statement
Unpredictability comes from several sources, which can be roughly grouped as:

Sharing-induced: performance variability caused by inconsistent allocations of resources across job runs.
Inherent: due to changes in the job input (size, skew, availability), source code tweaks, and failures. This is endemic even in dedicated and lightly used clusters.

Method(s)

Morpheus gathers two kinds of historical data for each periodic job:
(a) Data dependencies, in the Provenance Graph (PG). The PG is built from logs (application logs, filesystem logs…).
(b) The resource utilization of each run, in a Telemetry-History (TH) infrastructure database.

It then analyzes this history:
(a) From the PG, it derives a deadline d, the SLO. The deadline for the periodic job is the time at which downstream consumers read the job's output.
(b) From the TH, it derives a model of the job's resource demand over time, R*: a time series of the resources the job uses, sampled every minute.

We refer to R* as the job resource model.
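
To make the two derivations above concrete, here is a minimal Python sketch. It is not the paper's inference procedure. The function names, the input shapes, and the choice of the earliest observed downstream read as the deadline are all assumptions made for illustration.

```python
from statistics import mean


def infer_deadline(read_offsets_min):
    """Return a deadline d in minutes after the start of the period.

    ASSUMPTION: d is the earliest offset at which any downstream consumer
    was observed reading the job's output. The paper may use a different rule.
    """
    if not read_offsets_min:
        raise ValueError("need at least one observed downstream read")
    return min(read_offsets_min)


def per_minute_usage(samples):
    """Turn (minute, containers) samples of one run into a per-minute series.

    Samples that fall into the same minute are averaged; minutes without
    any sample are reported as 0 containers.
    """
    buckets = {}
    for minute, containers in samples:
        buckets.setdefault(minute, []).append(containers)
    length = max(buckets) + 1 if buckets else 0
    return [mean(buckets[m]) if m in buckets else 0 for m in range(length)]
```

For example, `per_minute_usage` could be applied to each historical run to obtain the per-run series that a resource model such as R* would be fitted to.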
Morpheus enforces SLOs via recurring reservations:
(a) Adds a recurring reservation for JobX to the cluster agenda. This sets aside resources over time based on the job resource model R*.
Formally, the skyline of the i-th instance is defined by the sequence $\{s_{i,k}\}_k$, the average number of containers used at each time-step $k$. Taking a collection of such sequences as input, the optimization problem outputs the vector $s=(s_1,\ldots,s_K)$, the number of containers reserved at each time-step.
The optimization objective is a cost function that is a linear combination of two terms: one penalizes over-allocation and the other penalizes under-allocation:
minimize $a A_o(s) + (1-a) A_u(s)$
The over-allocation penalty $A_o(s)$ is defined as the average over-allocation of containers.
The problem is solved using linear programming.

(b) New instances of JobX run within the recurring reservation (dedicated resources).
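
The skyline optimization described under (a) can be illustrated with a small Python sketch. The over-allocation penalty follows the definition in the notes (average over-allocation of containers). The under-allocation penalty is an assumption here: it is taken to be the average per-step shortfall, which may be simpler than the paper's definition. Under that assumption the cost separates per time-step, so the sketch searches the observed values at each step instead of calling a linear programming solver. All function names are hypothetical.

```python
def over_allocation(s, runs):
    """Average number of containers reserved but unused, over all runs and steps."""
    n, k = len(runs), len(s)
    return sum(max(s[j] - run[j], 0) for run in runs for j in range(k)) / (n * k)


def under_allocation(s, runs):
    """ASSUMPTION: average shortfall of the reservation, over all runs and steps."""
    n, k = len(runs), len(s)
    return sum(max(run[j] - s[j], 0) for run in runs for j in range(k)) / (n * k)


def cost(s, runs, a):
    """Weighted objective: a * over-allocation + (1 - a) * under-allocation."""
    return a * over_allocation(s, runs) + (1 - a) * under_allocation(s, runs)


def fit_skyline(runs, a):
    """Minimize the simplified cost one time-step at a time.

    Each per-step cost is piecewise linear and convex, with breakpoints at the
    observed values, so trying every observed value finds a minimizer.
    """
    skyline = []
    for j in range(len(runs[0])):
        observed = [run[j] for run in runs]

        def step_cost(c):
            over = sum(max(c - v, 0) for v in observed)
            under = sum(max(v - c, 0) for v in observed)
            return a * over + (1 - a) * under

        skyline.append(min(sorted(set(observed)), key=step_cost))
    return skyline


# Three historical runs of the same job, three time-steps each.
runs = [[2, 4, 1], [4, 4, 3], [6, 10, 2]]
balanced = fit_skyline(runs, a=0.5)      # equal weight on both penalties
conservative = fit_skyline(runs, a=0.1)  # under-allocation penalized more
```

A small weight `a` pushes the reservation toward the largest observed usage, while `a = 0.5` lands on the per-step median of the runs.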

The Dynamic Reprovisioning component monitors job progress online and increases or decreases the reservation to mitigate inherent execution variability.
Reprovisioning is triggered when a job's resource demand (used containers plus pending asks) exceeds the resources allocated in the predicted skyline.
Morpheus continuously feeds the PG and TH information from new runs back into Step 2 (the derivation of the deadline and the job resource model), for continuous learning and refinement of the SLO and the job resource model.
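
The reprovisioning trigger described above reduces to a simple comparison, sketched here in Python. The function names and the per-step `(used, pending)` input format are assumptions for illustration. The sketch only detects when the trigger fires and does not model how Morpheus resizes the reservation.

```python
def needs_reprovisioning(used, pending, reserved):
    """True when demand (used containers plus pending asks) exceeds the reservation."""
    return used + pending > reserved


def reprovisioning_steps(skyline, demand):
    """Return the time-steps at which observed demand exceeds the reserved skyline.

    `skyline` holds the reserved containers per step; `demand` holds one
    (used, pending) pair per step.
    """
    return [
        k
        for k, (reserved, (used, pending)) in enumerate(zip(skyline, demand))
        if needs_reprovisioning(used, pending, reserved)
    ]


# Demand exceeds the reservation only at step 1 (4 used + 2 pending > 4 reserved).
triggered = reprovisioning_steps([4, 4, 2], [(3, 0), (4, 2), (2, 0)])
```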

Evaluation
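How do the authors evaluate their method? What is the experimental setup? Which experimental data and results are of interest? Are there any problems, or anything worth borrowing?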

Conclusion
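What conclusions do the authors draw? Which are strong conclusions and which are weak ones (i.e., the authors give no experimental evidence and mention them only in the discussion, or the experimental data do not provide sufficient evidence)?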

Notes
(optional) Notes that do not fit into the sections above but are worth recording.

References
(optional) List highly relevant papers so that they can be tracked later.
