K8S-–集群调度

Posted 2021-04-23 K8S学习之路

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了K8S-–集群调度相关的知识，希望对你有一定的参考价值。

K8S–集群调度

调度说明

简介

schedule是Kubernetes的调度器, 主要任务是把定义的Pod分配到集群的节点上.

公平: 保证每个节点都能被分配资源
资源高效利用: 集群所有资源最大化被使用
效率: 调度的性能要好, 能够尽快地对大批量的Pod完成调度任务
灵活: 允许用户根据自己的需求控制调度的逻辑

Schedule是作为单独的程序运行的, 启动之后会一直监听API Server, 获取PodSpec.NodeName为空的Pod, 对每个Pod都会创建一个Binding, 表明该Pod应该放到哪个节点上

调度过程

调度分为几个部分: 1. 过滤掉不满足条件的节点, 这个过程为 predicate; 2. 对通过的节点按照优先级排序, 这个是priority; 3. 从中选择优先级最高的节点. 如果中间发生错误, 就直接返回错误

predicate有一系列算法可以使用:

PodFitsResources : 节点上剩余的资源是否大于Pod请求的资源
PodFitsHost: 如果Pod指定了NodeName, 检查节点名称是否和NodeName匹配
PodFitsHostPorts : 节点上已经使用的Port是否和Pod申请的port冲突
PodSelectorMatches : 过滤掉和Pod指定的label不匹配的节点
NoDiskConflict: 已经mount的Volume和Pod指定的Volume不冲突, 除非都是只读

如果在predicate过程中没有合适的节点, Pod会一直在pending状态, 不断重试调度, 直到有节点满足条件.经过这个步骤,如果有多个节点满足条件, 就继续Priority过程: 按照优先级大小对节点排序.

优先级由一系列键值对组成, 键是该优先级项的名称, 值是他的权重. 优先级选项包括:

leastRequestedPriority: 通过计算CPU和Memory的使用率来决定权重, 使用率越低权重越高.这个优先级指标倾向于资源使用比例更低的节点.
BalancedResourceAllocation: 节点上CPU和Memory使用率越接近, 权重越高. 这个应该和leastRequestedPriority一起使用, 不应该单独使用
ImageLocalityPriority: 倾向于已经有要使用镜像的节点, 镜像总大小值越大, 权重越高

通过算法对所有的优先级项目和权重进行计算, 得出最终的结果

自定义调度器

除了Kubernetes自带的调度器, 也可以编写自己的调度器.通过spec.schedulername参数指定调度器的名字, 可以为Pod选择某个调度器进行调度.比如下面pod选择my-scheduler进行调度,而不是默认的default-scheduler

apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulername: my-scheduler
  containers: 
  - name: pod-with-second-annotation-container
    image: gcr.io/google_containers/pause:2.0

调度亲和性

Node亲和性

pod.spec.nodeAffinity

preferredDuringSchedulingIgnoredDuringExecution: 软策略
requiredDuringSchedulingIgnoredDuringExecution: 硬策略

requiredDuringSchedulingIgnoredDuringExecution

apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-addinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
  affinity:
    nodeAffinity: 
      requiredDuringSchedulingIgnoredDuringExecution: 
        nodeSelectorTerms:
        - matchExpressions: 
          - key: kubernetes.io/hostname
            operator: NotIn
            values: 
            - xuh04
            - k8s-node02

operator: NotIn 表示不会在values节点上创建pod:

将operator: NotIn修改为operator: In values设为04, 每次删除创建,都会创建在values对应节点上

K8S-–集群调度

preferredDuringSchedulingIgnoredDuringExecution:

apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-addinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
  affinity:
    nodeAffinity: 
      preferredDuringSchedulingIgnoredDuringExecution: 
      - weight: 1  # 权重, 越大越亲和
        preference: 
          matchExpressions: 
          - key: kubernetes.io/hostname
            operator: In
            values: 
            - xuh02

结合使用

apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-addinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
  affinity:
    nodeAffinity: 
      requiredDuringSchedulingIgnoredDuringExecution: 
        nodeSelectorTerms:
        - matchExpressions: 
          - key: kubernetes.io/hostname
            operator: NotIn
            values: 
            - xuh04
      preferredDuringSchedulingIgnoredDuringExecution: 
      - weight: 1  # 权重, 越大越亲和
        preference: 
          matchExpressions: 
          - key: kubernetes.io/hostname
            operator: In
            values: 
            - xuh03

键值运算关系

In : label的值在某个列表中
NotIn: 不在某个列表中
Gt: 大于
Lt: 小于
Exists: 某个label存在
DoesNotExist: 某个label不存在

Pod亲和性

pod.spec.affinity.podAffinity/podAntiAffinity

preferredDuringSchedulingIgnoredDuringExecution: 软策略
requiredDuringSchedulingIgnoredDuringExecution: 硬策略

apiVersion: v1
kind: Pod
metadata:
  name: pod-3
  labels:
    app: pod-3
spec:
  containers:
  - name: pod-3
    image: busybox
  affinity:
    podAffinity: 
      requiredDuringSchedulingIgnoredDuringExecution: 
      - labelSelector:
          matchExpressions: 
          - key: app
            operator: In
            values: 
            - pod-1
        topologyKey: kubernetes.io/hostname
    podAntiAffinity: 
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1  # 权重, 越大越亲和
        podAffinityTerm:
          labelSelector: 
            matchExpressions: 
            - key: app
              operator: In
              values: 
              - pod-1
          topologyKey: kubernetes.io/hostname

K8S-–集群调度

亲和性/反亲和性调度策略比较

调度策略	匹配标签	操作符	拓扑域支持	调度目标
nodeAffinity	主机	In, NotIn, Exists, DoesNotExists, Gt, Lt	否	指定主机
podAffinity	Pod	In, NotIn, Exists, DoesNotExists	是	Pod与指定Pod同一拓扑域
podAntiAffinity	Pod	In, NotIn, Exists, DoesNotExists	是	Pod与指定Pod同一拓扑域

污点与容忍

Taint和Toleration

节点亲和性, 是Pod的一种属性(偏好或硬性要求), 他使Pod被吸引到一类特定的节点. Taint则相反, 他使节点能够排斥一类特定的Pod

Taint和toleration相互配合, 可以用来避免Pod被分配到不合适的节点上. 每个节点上都可以应用一个或多个taint, 这表示对于那些不能容忍这些Taint的Pod, 是不会被该节点接受的. 如果将toleration应用于Pod上, 表示这些Pod可以(但不要求)被调度到具有匹配taint的节点上

污点(Taint)

污点(taint)的组成
使用kubectl taint命令可以给某个Node节点设置污点, Node被设置上污点之后就和Pod之间存在了一种相斥的关系, 可以让Node拒绝Pod的调度执行, 甚至将Node已经存在的Pod驱逐出去
每个污点的组成如下:
```
key=value:effect
```
每个污点有一个key和value作为污点的标签, 其中value可以为空, effect描述污点的作用. 当前taint effect支持下列三种选项:

NoSchedule k8s将不会将Pod调度到具有该污点的Node上
PreferNoSchedule: k8s将尽量避免将pod调度到具有该污点的Node上
NoExecute: k8s不会将Pod调度到具有该污点的Node上, 同时会将Node上已经存在的Pod驱逐出去

污点的设置查看和去除

# 设置污点
kubectl taint nodes xuh01 key1=value1:NoSchedule

# 节点说明中, 查找Taints字段
kubectl describe pod pod-name

# 去除污点
kubectl taint nodes xuh01 key1=NoSchedule-

K8S-–集群调度

容忍(Toleration)

设置了污点的Node将根据Taint的effect: NoSchedule、PreferNoSchedule、NoExecute和Pod之间产生互斥的关系, Pod将在一定程度上不会被调度到Node上. 但我们可以在Pod上设置容忍(toleration), 意思是设置了容忍的Pod将可以容忍污点的存在, 可以被调度到存在污点的Node上

pod.spec.tolerations

tolerations: 
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
  tolerationSecond: 3600
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
- key: "key2"
  operator: "Exists"
  effect: "NoSchedule"

其中key, value, effect要与Node上设置的taint保持一致
operator的值为Exists将会忽略value的值
tolerationSeconds用于描述当Pod需要被驱逐时可以在Pod上继续保留运行的时间

**当不指定key值时, 表示容忍所有的污点key: **
```
toleration:
- operator: "Exists"
```
当不指定effect值时, 表示容忍所有的污点作用
```
toleration:
- key: "key"
  operator: "Exists"
```

有多个Master存在时, 防止资源浪费, 可以设置

kubectl taint nodes Node-Name node-role.kubernetes.io/master=:PreferNoSchedule

固定节点调度

pod.spec.nodeName将Pod直接调度到指定的Node节点上, 会跳过Scheduler的调度策略, 匹配规则是强制匹配

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myweb
spec: 
  replicas: 7
  template:
    metadata:
      labels:
        app: myweb
    spec:
      nodeName: xuh03
      containers: 
      - name: myweb
        image: nginx
        ports:
        - containerPort: 80

K8S-–集群调度

pod.spec.nodeSelector: 通过Kubernetes的label-selector机制选择节点, 由调度器策略匹配label, 而后调度pod到目标节点, 该匹配规则属于强制约束

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myweb1
spec: 
  replicas: 5
  template:
    metadata:
      labels:
        app: myweb1
    spec:
      nodeSelector:
        disk: ssd
      containers: 
      - name: myweb
        image: nginx
        ports:
        - containerPort: 80

没有Node上有disk=ssd的标签, pod找不到调度的节点, 一直处于pending状态

kubectl label node xuh02 disk=ssd

以上是关于K8S-–集群调度的主要内容，如果未能解决你的问题，请参考以下文章