Notes on a Network Failure: Pods Unable to Communicate

Posted jayce9102

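I. Background

  1. The cluster was deployed from binaries.
  2. After deployment everything looked normal, and all kinds of resource objects could be created without problems.
  3. After deploying an application, pods could not communicate across nodes, and every pod IP was in the 172.17.0.0 range.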

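II. Troubleshooting

  1. Checking the node's routes showed that the docker0 interface was itself in the 172.17.0.0 range (what?).
  2. Further reading turned up the following: when flannel is deployed on top of Docker's CNM, /run/flannel/subnet.env must be supplied to Docker as an environment file, and dockerd must be started with flannel's subnet settings.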
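The symptom in step 1 can be checked mechanically: if the docker0 address is not inside the flannel network, Docker is still using its default bridge range. The following is a minimal sketch of that check, not a definitive diagnostic. The function name `bridge_matches_overlay` and the flannel network `10.244.0.0/16` are assumptions for illustration; substitute the network your own flannel configuration uses.

```python
import ipaddress


def bridge_matches_overlay(bridge_cidr: str, overlay_cidr: str) -> bool:
    """Return True if the bridge address lies inside the overlay network."""
    bridge_ip = ipaddress.ip_interface(bridge_cidr).ip
    overlay = ipaddress.ip_network(overlay_cidr, strict=False)
    return bridge_ip in overlay


if __name__ == "__main__":
    # 172.17.0.1/16 is Docker's default bridge address.
    # 10.244.0.0/16 is a hypothetical flannel network, not taken from the article.
    print(bridge_matches_overlay("172.17.0.1/16", "10.244.0.0/16"))
```

A result of `False` for the docker0 address matches the situation described above: docker0 was never handed a flannel-assigned subnet.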

III. Solution (edit the configuration file /usr/lib/systemd/system/docker.service)

 

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
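
# Load the variables from flannel's environment file, including DOCKER_NETWORK_OPTIONS
EnvironmentFile=/run/flannel/subnet.env
# Unbraced $DOCKER_NETWORK_OPTIONS is split on whitespace, so each flag becomes its own argument
ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS -H fd:// --containerd=/run/containerd/containerd.sock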
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always

# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3

# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this option.
TasksMax=infinity

# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target

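The unit reads DOCKER_NETWORK_OPTIONS from /run/flannel/subnet.env and passes it to dockerd, which sets the network range used for pods. After editing the unit file, run systemctl daemon-reload and restart the docker service for the change to take effect.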
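To confirm what dockerd will actually receive, it helps to look at the bridge address carried in DOCKER_NETWORK_OPTIONS. The sketch below parses an environment file and extracts the value of the `--bip` flag. The helper names `parse_env_file` and `docker_bip` and the sample file content are assumptions for illustration; the real content of /run/flannel/subnet.env depends on how flannel was set up on your nodes.

```python
import shlex


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex.split removes one layer of surrounding quotes.
        env[key.strip()] = " ".join(shlex.split(value))
    return env


def docker_bip(env: dict):
    """Return the value of --bip in DOCKER_NETWORK_OPTIONS, or None."""
    for opt in shlex.split(env.get("DOCKER_NETWORK_OPTIONS", "")):
        if opt.startswith("--bip="):
            return opt[len("--bip="):]
    return None


if __name__ == "__main__":
    # Hypothetical file content; the addresses are illustrative only.
    sample = 'DOCKER_NETWORK_OPTIONS=" --bip=10.244.3.1/24 --ip-masq=false --mtu=1450"\n'
    print(docker_bip(parse_env_file(sample)))
```

If this returns `None` for the real file, the variable is missing or carries no `--bip` flag, and dockerd would fall back to its default bridge range.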

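IV. Additional notes

  1. With CNI, docker0's IP has nothing to do with pods; a pod requests its own IP dynamically only when it is created.
  2. In CNM mode, the pod subnet is already fixed when the Docker engine starts.
  3. CNI mode is recommended.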