Kubernetes Node Controller Source Code Analysis: Configuration

Posted 2022-12-06 WaltonWang

Author: xidianwangtao@gmail.com

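Abstract: In my view, Node Controller is one of the most important of the several dozen controllers in Kubernetes, ranking in the top three, yet it is probably also one of the most complex. I have not found any source-level analysis of Node Controller online, so I think a series of posts on it is worthwhile, and I hope writing it helps me build an understanding that is both deep and accessible. This post analyzes how NodeController is started, how it is defined, and how its behavior is configured. Readers are expected to be well familiar with the relevant Kubernetes features.

Starting the Node Controller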


// Excerpt: error handling for the err values below is omitted for brevity.
if ctx.IsControllerEnabled(nodeControllerName) {

    // Parse the cluster CIDR: the CIDR range for Pods in the cluster.
    _, clusterCIDR, err := net.ParseCIDR(s.ClusterCIDR)

    // Parse the service CIDR: the CIDR range for Services in the cluster.
    _, serviceCIDR, err := net.ParseCIDR(s.ServiceCIDR)

    // Create the NodeController instance.
    nodeController, err := nodecontroller.NewNodeController(
        sharedInformers.Core().V1().Pods(),
        sharedInformers.Core().V1().Nodes(),
        sharedInformers.Extensions().V1beta1().DaemonSets(),
        cloud,
        clientBuilder.ClientOrDie("node-controller"),
        s.PodEvictionTimeout.Duration,
        s.NodeEvictionRate,
        s.SecondaryNodeEvictionRate,
        s.LargeClusterSizeThreshold,
        s.UnhealthyZoneThreshold,
        s.NodeMonitorGracePeriod.Duration,
        s.NodeStartupGracePeriod.Duration,
        s.NodeMonitorPeriod.Duration,
        clusterCIDR,
        serviceCIDR,
        int(s.NodeCIDRMaskSize),
        s.AllocateNodeCIDRs,
        s.EnableTaintManager,
        utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
    )

    // Call Run to start the controller.
    nodeController.Run()

    // Sleep for a randomized duration equal to
    // ControllerStartInterval + rand.Float64()*1.0*float64(ControllerStartInterval).
    // ControllerStartInterval is set with the kube-controller-manager
    // flag --controller-start-interval.
    time.Sleep(wait.Jitter(s.ControllerStartInterval.Duration, ControllerStartJitter))
}

So the two key steps are clear:

nodeController, err := nodecontroller.NewNodeController(...) creates the NodeController instance.
nodeController.Run() starts the controller.

Definition of NodeController

Before analyzing how NodeController works, we should look at how it is defined. The complete definition is as follows:

type NodeController struct {
    allocateNodeCIDRs bool
    cloud             cloudprovider.Interface
    clusterCIDR       *net.IPNet
    serviceCIDR       *net.IPNet
    knownNodeSet      map[string]*v1.Node
    kubeClient        clientset.Interface
    // Method for easy mocking in unittest.
    lookupIP func(host string) ([]net.IP, error)
    // Value used if sync_nodes_status=False. NodeController will not proactively
    // sync node status in this case, but will monitor node status updated from kubelet. If
    // it doesn't receive update for this amount of time, it will start posting "NodeReady==
    // ConditionUnknown". The amount of time before which NodeController start evicting pods
    // is controlled via flag 'pod-eviction-timeout'.
    // Note: be cautious when changing the constant, it must work with nodeStatusUpdateFrequency
    // in kubelet. There are several constraints:
    // 1. nodeMonitorGracePeriod must be N times more than nodeStatusUpdateFrequency, where
    //    N means number of retries allowed for kubelet to post node status. It is pointless
    //    to make nodeMonitorGracePeriod be less than nodeStatusUpdateFrequency, since there
    //    will only be fresh values from Kubelet at an interval of nodeStatusUpdateFrequency.
    //    The constant must be less than podEvictionTimeout.
    // 2. nodeMonitorGracePeriod can't be too large for user experience - larger value takes
    //    longer for user to see up-to-date node status.
    nodeMonitorGracePeriod time.Duration
    // Value controlling NodeController monitoring period, i.e. how often does NodeController
    // check node status posted from kubelet. This value should be lower than nodeMonitorGracePeriod.
    // TODO: Change node status monitor to watch based.
    nodeMonitorPeriod time.Duration
    // Value used if sync_nodes_status=False, only for node startup. When node
    // is just created, e.g. cluster bootstrap or node creation, we give a longer grace period.
    nodeStartupGracePeriod time.Duration
    // per Node map storing last observed Status together with a local time when it was observed.
    // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
    // to avoid the problem with time skew across the cluster.
    nodeStatusMap map[string]nodeStatusData
    now           func() metav1.Time
    // Lock to access evictor workers
    evictorLock sync.Mutex
    // workers that evict pods from unresponsive nodes.
    zonePodEvictor map[string]*RateLimitedTimedQueue
    // workers that are responsible for tainting nodes.
    zoneNotReadyOrUnreachableTainer map[string]*RateLimitedTimedQueue
    podEvictionTimeout              time.Duration
    // The maximum duration before a pod evicted from a node can be forcefully terminated.
    maximumGracePeriod time.Duration
    recorder           record.EventRecorder

    nodeLister         corelisters.NodeLister
    nodeInformerSynced cache.InformerSynced

    daemonSetStore          extensionslisters.DaemonSetLister
    daemonSetInformerSynced cache.InformerSynced

    podInformerSynced cache.InformerSynced

    // allocate/recycle CIDRs for node if allocateNodeCIDRs == true
    cidrAllocator CIDRAllocator
    // manages taints
    taintManager *NoExecuteTaintManager

    forcefullyDeletePod        func(*v1.Pod) error
    nodeExistsInCloudProvider  func(types.NodeName) (bool, error)
    computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, zoneState)
    enterPartialDisruptionFunc func(nodeNum int) float32
    enterFullDisruptionFunc    func(nodeNum int) float32

    zoneStates                  map[string]zoneState
    evictionLimiterQPS          float32
    secondaryEvictionLimiterQPS float32
    largeClusterThreshold       int32
    unhealthyZoneThreshold      float32

    // if set to true NodeController will start TaintManager that will evict Pods from
    // tainted nodes, if they're not tolerated.
    runTaintManager bool

    // if set to true NodeController will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'
    // taints instead of evicting Pods itself.
    useTaintBasedEvictions bool
}

Behavior configuration of NodeController

The NodeController struct is quite complex, with more than 30 fields. We will focus on the following:

clusterCIDR - set via --cluster-cidr; the CIDR range for Pods in the cluster.
serviceCIDR - set via --service-cluster-ip-range; the CIDR range for Services in the cluster.
knownNodeSet - the set of Nodes that NodeController has observed.
nodeMonitorGracePeriod - set via --node-monitor-grace-period, default 40s; how long a Node may remain unresponsive before it is marked unhealthy.
nodeMonitorPeriod - set via --node-monitor-period, default 5s; the period at which NodeController syncs NodeStatus.
nodeStatusMap - records the most recently observed Status of each Node.
zonePodEvictor - per-zone workers that evict pods from unresponsive nodes.
zoneNotReadyOrUnreachableTainer - per-zone workers that are responsible for tainting nodes.
podEvictionTimeout - set via --pod-eviction-timeout, default 5min; how long NodeController waits on an unhealthy Node before it starts evicting that Node's Pods.
maximumGracePeriod - the maximum duration before a pod evicted from a node can be forcefully terminated. Not configurable; hard-coded to 5min.
nodeLister - the interface used to fetch Node data.
daemonSetStore - the interface used to fetch DaemonSet data. When Pods are deleted through eviction, all Pods on the Node that belong to a DaemonSet are skipped.
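taintManager - a NoExecuteTaintManager object. When runTaintManager (default true) is true:
- PodInformer and NodeInformer watch for PodAdd, PodDelete, PodUpdate and NodeAdd, NodeDelete, NodeUpdate events;
- each event triggers the corresponding TaintManager method, NoExecuteTaintManager.PodUpdated or NoExecuteTaintManager.NodeUpdated;
- these methods add the event to the corresponding queue (podUpdateQueue or nodeUpdateQueue), and TaintController consumes messages from these queues;
- TaintController handles them by calling handlePodUpdate and handleNodeUpdate respectively.
- The detailed TaintController logic will be analyzed separately in a later post.
forcefullyDeletePod - the method NodeController uses to forcefully delete a Pod through the apiserver. It is used for Pods scheduled onto Nodes whose kubelet version is lower than v1.1.0, because kubelet versions before v1.1.0 do not support graceful termination.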
computeZoneStateFunc - returns the number of NotReady Nodes in a Zone together with the state of that Zone.
- If there is not a single Ready Node, the zone state is FullDisruption;
- if the proportion of unhealthy Nodes is greater than or equal to unhealthyZoneThreshold, the zone state is PartialDisruption;
- otherwise the zone state is Normal.
enterPartialDisruptionFunc - compares the current node count with largeClusterThreshold:
- if nodeNum > largeClusterThreshold, it returns secondaryEvictionLimiterQPS (default 0.01);
- otherwise it returns 0, which stops eviction.
enterFullDisruptionFunc - the method that returns evictionLimiterQPS (default 0.1); evictionLimiterQPS is explained below.
zoneStates - the state of each zone. Possible values are:
- Initial;
- Normal;
- FullDisruption;
- PartialDisruption.
evictionLimiterQPS - set via --node-eviction-rate, default 0.1; the number of Nodes per second from which Pods are evicted while the Zone is healthy, i.e. one Node every 10s.
secondaryEvictionLimiterQPS - set via --secondary-node-eviction-rate, default 0.01; the number of Nodes per second from which Pods are evicted while the Zone is unhealthy, i.e. one Node every 100s.
largeClusterThreshold - set via --large-cluster-size-threshold, default 50; when the number of Nodes is less than or equal to 50, secondary-node-eviction-rate is treated as 0.
unhealthyZoneThreshold - set via --unhealthy-zone-threshold, default 0.55; when the proportion of unhealthy Nodes in a Zone (at least 3 Nodes) reaches 0.55, the Zone is considered unhealthy.
runTaintManager - set via --enable-taint-manager, default true. If true, NodeController starts TaintManager, which evicts Pods from tainted Nodes when those Pods do not tolerate the taint.
useTaintBasedEvictions - set via --feature-gates; the default is TaintBasedEvictions=false, and the feature is still Alpha. If true, Pods are evicted by tainting Nodes.
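The zone-state rules and rate-selection rules described in the list above can be sketched as a small standalone program. This is a simplified illustration under stated assumptions, not the controller's code: it takes plain readiness flags instead of *v1.NodeCondition values, the function names computeZoneState, reducedQPS, and healthyQPS are chosen for this example, and the "more than 2 not-ready Nodes" condition encodes the "at least 3" minimum mentioned for unhealthyZoneThreshold.

```go
package main

import "fmt"

type zoneState string

const (
	stateNormal            zoneState = "Normal"
	stateFullDisruption    zoneState = "FullDisruption"
	statePartialDisruption zoneState = "PartialDisruption"
)

// computeZoneState returns the number of not-ready nodes in a zone and the
// resulting zone state. ready holds one flag per node (true = Ready).
func computeZoneState(ready []bool, unhealthyZoneThreshold float32) (int, zoneState) {
	readyNodes, notReadyNodes := 0, 0
	for _, r := range ready {
		if r {
			readyNodes++
		} else {
			notReadyNodes++
		}
	}
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		// No Ready node at all.
		return notReadyNodes, stateFullDisruption
	case notReadyNodes > 2 &&
		float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= unhealthyZoneThreshold:
		// At least 3 unhealthy nodes and their share reaches the threshold.
		return notReadyNodes, statePartialDisruption
	default:
		return notReadyNodes, stateNormal
	}
}

// reducedQPS mirrors the enterPartialDisruptionFunc rule: large zones fall
// back to the secondary rate, small ones stop evicting (rate 0).
func reducedQPS(nodeNum int, largeClusterThreshold int32, secondaryQPS float32) float32 {
	if int32(nodeNum) > largeClusterThreshold {
		return secondaryQPS
	}
	return 0
}

// healthyQPS mirrors the enterFullDisruptionFunc rule: it returns the
// normal eviction rate regardless of zone size.
func healthyQPS(nodeNum int, evictionQPS float32) float32 {
	return evictionQPS
}

func main() {
	// Defaults quoted in the article.
	const (
		evictionQPS    float32 = 0.1
		secondaryQPS   float32 = 0.01
		largeThreshold int32   = 50
		unhealthyRatio float32 = 0.55
	)

	zone := []bool{true, false, false, false} // 3 of 4 nodes not ready
	notReady, state := computeZoneState(zone, unhealthyRatio)
	fmt.Println(notReady, state)

	fmt.Println(reducedQPS(len(zone), largeThreshold, secondaryQPS)) // small zone
	fmt.Println(reducedQPS(100, largeThreshold, secondaryQPS))       // large zone
	fmt.Println(healthyQPS(len(zone), evictionQPS))
}
```

The asymmetry is deliberate: a partially disrupted small zone stops evicting entirely, on the reasoning that a handful of failing nodes in a small cluster more likely signals a network or control-plane problem than real node failures.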
