Installing KubeEdge on an NVIDIA AGX Edge Server
Posted by Kris_u
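Reference: Deploying using Keadm | KubeEdge (the official installation guide)
The local machine runs Ubuntu 20.04, and the NVIDIA AGX edge server is connected to it.
The local machine uses iptables forwarding to route traffic to the edge server, and I connect to the NVIDIA edge server from a terminal over ssh.
NVIDIA AGX edge server OS: Ubuntu 18.04
Steps 1-7 run on the NVIDIA edge node; they install the dependencies that KubeEdge needs on the NVIDIA edge server.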
1. Set the root password:
sudo passwd root
2. Install the required tools:
sudo apt-get update
sudo apt-get install net-tools make vim ssh docker.io
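3. Enable root login over ssh
sudo vim /etc/ssh/sshd_config
Set the parameters PermitRootLogin and PasswordAuthentication to yes, then restart the service:
sudo service sshd restart
Set up ssh public-key (passwordless) login to the NVIDIA edge server (run this on the machine you connect from):
cd ~/.ssh/ && ssh-keygen
ssh-copy-id 192.168.30.101    # IP address of the NVIDIA edge server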
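Before restarting sshd it can help to confirm that both options really ended up set to yes. The following is a minimal sketch, not part of the original procedure: the function names are hypothetical, and it only reads the main config file (it ignores Match blocks and Include files).

```shell
# Print the effective value of an sshd option from a config file.
# sshd uses the first value it finds for a keyword, so stop at the first match.
sshd_option_value() {
    # $1 = path to sshd_config, $2 = option name (compared case-insensitively)
    awk -v key="$2" '
        /^[ \t]*#/ { next }                          # skip commented-out lines
        tolower($1) == tolower(key) { print $2; exit }
    ' "$1"
}

# Succeed only if both options required in step 3 are set to "yes".
root_password_login_enabled() {
    [ "$(sshd_option_value "$1" PermitRootLogin)" = "yes" ] &&
    [ "$(sshd_option_value "$1" PasswordAuthentication)" = "yes" ]
}

# Example:
#   root_password_login_enabled /etc/ssh/sshd_config && echo "root login enabled"
```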
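4. IP forwarding
Enable IP forwarding (the append to /etc/sysctl.conf must itself run as root, so pipe through sudo tee rather than using sudo echo with a redirection):
$ echo "net.ipv4.ip_forward = 1" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p
Then check it:
$ sudo sysctl -p | grep ip_forward
net.ipv4.ip_forward = 1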
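Running the append command more than once leaves duplicate lines in /etc/sysctl.conf. A small sketch of an idempotent variant follows; the function name is hypothetical, and it takes the file path as an argument so it can be tried on a copy first.

```shell
# Append "net.ipv4.ip_forward = 1" to a sysctl file only if no
# uncommented line already sets it to 1.
enable_ip_forward() {
    # $1 = sysctl configuration file (normally /etc/sysctl.conf)
    if grep -Eq '^[[:space:]]*net\.ipv4\.ip_forward[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1" 2>/dev/null; then
        return 0    # already enabled, nothing to do
    fi
    printf 'net.ipv4.ip_forward = 1\n' >> "$1"
}

# Example (as root):
#   enable_ip_forward /etc/sysctl.conf && sysctl -p
```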
5. Install the snap package manager, then use snap to install the three Kubernetes tools (kubectl, kubelet, kubeadm)
if command -v apt > /dev/null 2>&1; then
    APT=apt
    sudo $APT install net-tools ssh docker.io
    sudo apt-get install snapd    # the snap package manager is packaged as snapd
    sudo snap install kubectl --classic
    sudo snap install kubelet --classic
    sudo snap install kubeadm --classic
else
    APT=yum
    sudo $APT install net-tools openssh-server docker
    sudo $APT install kubectl
    sudo $APT install kubelet
    sudo $APT install kubeadm    # installed with yum here, like kubectl and kubelet
fi
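6. Download and install Go (from the Go语言中文网 download page) and set the environment variables
Edit ~/.bashrc (for example with gedit ~/.bashrc) and add the lines below to set the environment variables. Also create the src and bin directories under the GOPATH directory.
# Download the Go archive and unpack it into /usr/local
# The AGX is an arm64 (aarch64) machine, so use the linux-arm64 archive; linux-amd64 is only for x86_64 hosts
cd /usr/local/
sudo wget https://dl.google.com/go/go1.18.linux-arm64.tar.gz
sudo tar -zxvf go1.18.linux-arm64.tar.gz
# Set the Go environment variables
# GOROOT is where the Go distribution is installed
echo 'export GOROOT=/usr/local/go' >> ~/.bashrc
# GOPATH is the location of the working directory
echo 'export GOPATH=/home/hadoop/GOPATH' >> ~/.bashrc
# Single quotes keep $GOPATH, $GOROOT and $PATH unexpanded until ~/.bashrc is sourced
echo 'export PATH=$GOPATH/bin:$GOROOT/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
go version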
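The Go archive has to match the CPU architecture of the machine: an NVIDIA AGX reports aarch64 from uname -m, while a typical cloud server reports x86_64. The sketch below picks the archive name from the machine type; the function name is hypothetical, and the naming pattern go<version>.linux-<arch>.tar.gz is the one used on the Go download site.

```shell
# Map a `uname -m` value to the matching Go archive name.
go_tarball_name() {
    # $1 = Go version (e.g. 1.18), $2 = machine name as printed by `uname -m`
    case "$2" in
        x86_64)         arch=amd64 ;;
        aarch64|arm64)  arch=arm64 ;;
        armv6l|armv7l)  arch=armv6l ;;
        *) echo "unsupported machine: $2" >&2; return 1 ;;
    esac
    printf 'go%s.linux-%s.tar.gz\n' "$1" "$arch"
}

# Example:
#   wget "https://dl.google.com/go/$(go_tarball_name 1.18 "$(uname -m)")"
```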
7. Install keadm
docker run --rm kubeedge/installation-package:v1.10.0 cat /usr/local/bin/keadm > /usr/local/bin/keadm && chmod +x /usr/local/bin/keadm
8. Run on the master node (the master node is a Huawei Cloud server):
# internal IP = 192.168.x.xx2; public IP = 124.70.221.xxx
kubeadm init --apiserver-advertise-address=192.168.x.xx2 --pod-network-cidr=10.244.0.0/16 \
    --kubernetes-version=1.23.1 --apiserver-cert-extra-sans=124.70.221.xxx
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.0.xxx:6443 --token 64ij3s.705fntphiz6mzijw \
--discovery-token-ca-cert-hash sha256:eb40233d27e17bdd1e585db9fadf05b2eff99ff5055f4775636b889a9edacecc
Follow the steps printed above. If you forget the cluster token, you can print a fresh join command with:
kubeadm token create --print-join-command
9. Install KubeEdge with keadm: first initialize with keadm (run on the master node)
It is best to download the KubeEdge xxx.tar.gz package manually and place it in the /etc/kubeedge/ directory.
keadm init --advertise-address=124.70.221.xxx    # master node public IP address: 124.70.221.xxx
Alternatively, keadm beta init installs cloudcore as a container:
# Install cloudcore as a container
keadm beta init --advertise-address=$ip --kubeedge-version=1.10.0 \
    --kube-config=/root/.kube/config --force --set \
    cloudCore.modules.dynamicController.enable=true
10. Get the token for the edge side (run on the master node)
keadm gettoken
Copy and save the token that is printed; step 11 uses it when setting up the edge node.
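A saved token does not stay valid forever. The token printed later in this post has the shape <hex digest>.<JWT>, and the payload of that JWT carries an exp claim (a Unix timestamp). Assuming tokens from keadm gettoken generally have that shape, the sketch below extracts the expiry time so you can tell whether a saved token is stale; the function name is hypothetical, and base64 -d is the GNU coreutils spelling.

```shell
# Print the "exp" claim (Unix time) from a token of the form
# <hex digest>.<JWT header>.<JWT payload>.<JWT signature>
token_expiry() {
    # $1 = token string
    # The JWT payload is the third dot-separated field; convert base64url to base64.
    payload=$(printf '%s' "$1" | cut -d. -f3 | tr '_-' '/+')
    # Restore the "=" padding that base64url leaves out.
    case $(( ${#payload} % 4 )) in
        2) payload="$payload==" ;;
        3) payload="$payload=" ;;
    esac
    printf '%s' "$payload" | base64 -d | sed -n 's/.*"exp":\([0-9]*\).*/\1/p'
}

# Example (GNU date): date -d "@$(token_expiry "$TOKEN")"
```

If the expiry has passed, run keadm gettoken on the master node again and use the new token.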
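11. Set up the edge node (run the join command on the edge node)
Join the cluster:
keadm join --cloudcore-ipport=124.70.221.xxx:10000 --token=$TOKEN
Check whether edgecore is running successfully:
journalctl -u edgecore.service -b
On the master node, run the following command to check whether the edge node was added successfully:
kubectl get nodes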
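To check one node in a script rather than by eye, the STATUS column can be pulled out of the kubectl get nodes table. This is a minimal sketch with a hypothetical function name; it assumes the default table layout (NAME in the first column, STATUS in the second) and reads the table from standard input.

```shell
# Read `kubectl get nodes` output on stdin and print the STATUS column
# for the node named in $1 (prints nothing if the node is not listed).
node_status() {
    awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Example, using the edge node's hostname as it appears in the logs in this post:
#   kubectl get nodes | node_status nvidia-desktop
```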
Note:
1. Downloading the KubeEdge tar archive and service file may take a long time; you can download them manually
from the official website and then copy them to /etc/kubeedge/.
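After copying files by hand it is easy to miss one. The sketch below lists which expected files are still absent from a directory. The function name is hypothetical, and the archive name kubeedge-v<version>-linux-<arch>.tar.gz is an assumption based on the naming of the KubeEdge release assets; check it against the release you actually download.

```shell
# Print the names of the expected KubeEdge files that are missing from a directory.
missing_kubeedge_files() {
    # $1 = directory (normally /etc/kubeedge)
    # $2 = KubeEdge version without the "v" prefix (e.g. 1.10.0)
    # $3 = architecture (arm64 on the AGX, amd64 on an x86_64 host)
    for f in "kubeedge-v$2-linux-$3.tar.gz" "edgecore.service"; do
        [ -f "$1/$f" ] || printf '%s\n' "$f"
    done
}

# Example on the edge node:
#   missing_kubeedge_files /etc/kubeedge 1.10.0 arm64
```

No output means both files are in place.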
Issues encountered during the KubeEdge setup:
1. kubeadm init fails with the following errors:
[root@huawei-node2 kubeedge]# kubeadm init --apiserver-advertise-address=192.168.x.xxx --pod-network-cidr=10.xxx.0.0/16 --kubernetes-version=1.23.1 --apiserver-cert-extra-sans=124.70.xxx.xxx
[init] Using Kubernetes version: v1.23.1
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Port-6443]: Port 6443 is in use
[ERROR Port-10259]: Port 10259 is in use
[ERROR Port-10257]: Port 10257 is in use
[ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1
[ERROR Port-10250]: Port 10250 is in use
[ERROR Port-2379]: Port 2379 is in use
[ERROR Port-2380]: Port 2380 is in use
[ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
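Running kubeadm reset fixed this. It cleans up the state left behind by an earlier kubeadm init, such as the static pod manifests and /var/lib/etcd.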
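When the pre-flight output is long, a short list of the failed checks is easier to reason about. A minimal sketch (hypothetical function name) that extracts the check names from the [ERROR ...] lines on standard input:

```shell
# Print the name of every failed pre-flight check found on stdin,
# i.e. the text between "[ERROR " and "]".
preflight_failed_checks() {
    sed -n 's/^[[:space:]]*\[ERROR \([^]]*\)\].*/\1/p'
}

# Example:
#   kubeadm init ... 2>&1 | preflight_failed_checks
```

Port and file checks such as Port-6443 or DirAvailable--var-lib-etcd usually point to a previous cluster on the same machine, which is the case kubeadm reset handles.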
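Errors 2 and 3 appeared when I connected to the NVIDIA edge server over ssh from an Ubuntu 18.04 virtual machine running on my Windows computer. Other bloggers attribute this to network permissions that prevent many files from being downloaded, so they have to be fetched manually. There are many pitfalls, so it is best not to ssh to the NVIDIA edge server from a virtual machine on your computer. After I replaced Windows with Ubuntu 20.04 on the computer, these errors no longer appeared. (The NVIDIA edge server could not connect to my computer over USB under Windows, which is why I reinstalled the computer with Ubuntu.)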
2. Error: fail to download service file, error: failed to exec 'bash -c cd /etc/kubeedge/ && sudo -E wget -t 5 -k --no-check-certificate https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/tools/edgecore.service'
[Run as service] start to download service file for edgecore
Error: fail to download service file,error:failed to exec 'bash -c cd /etc/kubeedge/ && sudo -E wget -t 5 -k --no-check-certificate https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/tools/edgecore.service', err: --2022-04-09 17:43:49-- https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/tools/edgecore.service
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... failed: Connection refused.
Converted links in 0 files in 0 seconds.
, err: exit status 4
Connecting to raw.githubusercontent.com Unable to establish SSL connection
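DNS resolution for raw.githubusercontent.com is poisoned on this network, so the host cannot be reached.
To get the real IP, go to the ipaddress website and enter raw.githubusercontent.com in the search box to look up its actual IP address.
Edit hosts on Linux:
Open /etc/hosts with administrator privileges and add the following line:
185.199.108.133 raw.githubusercontent.com
Then retry the download:
cd /etc/kubeedge/ && sudo -E wget -t 5 -k --no-check-certificate https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/tools/edgecore.service
If it still fails with the same error, download the file manually into /etc/kubeedge/.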
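The hosts change can be scripted so that repeated runs do not pile up duplicate lines. This is a sketch under the assumption that an existing uncommented entry for the host should be left untouched; the function name is hypothetical.

```shell
# Add "<ip> <host>" to a hosts file unless the host already has an uncommented entry.
ensure_hosts_entry() {
    # $1 = hosts file, $2 = IP address, $3 = host name
    if awk -v host="$3" '
            /^[ \t]*#/ { next }                                   # ignore comments
            { for (i = 2; i <= NF; i++) if ($i == host) found = 1 }
            END { exit(found ? 0 : 1) }
        ' "$1"; then
        return 0    # an entry for this host exists; leave it alone
    fi
    printf '%s %s\n' "$2" "$3" >> "$1"
}

# Example (as root):
#   ensure_hosts_entry /etc/hosts 185.199.108.133 raw.githubusercontent.com
```

The address can change over time, so look it up again if the download starts failing later.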
3. Check the system log for the failure
Log output from the command:
It seems like the kubelet isn't running or healthy. The HTTP call equal to 'curl
-sSL http://localhost:10248/healthz' failed with error:
Get "http://localhost:10248/healthz": dial tcp [::1]:10248:
connect: connection refused
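Solution:
1. Turn off swap:
swapoff -a
2. Comment out the swap entry in /etc/fstab so that swap stays off after a reboot:
vi /etc/fstab
Comment out the swap line (the last line in this file):
#UUID=6042e061-f29b-4ac1-9f32-87980ddf0e1f swap swap defaults 0 0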
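The fstab edit can also be done with a filter instead of by hand. The sketch below (hypothetical function name) comments out every active entry whose third field, the filesystem type, is swap, and writes the result to standard output so the original file is not modified until you have reviewed it.

```shell
# Read fstab content on stdin and print it with active swap entries commented out.
comment_out_swap() {
    awk '
        /^[ \t]*#/   { print; next }             # already a comment: keep as is
        $3 == "swap" { print "#" $0; next }      # third field is the filesystem type
        { print }
    '
}

# Example (as root):
#   comment_out_swap < /etc/fstab > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new    # review, then copy /tmp/fstab.new back
```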
4. keadm join fails with "Error: failed to get CA certificate". The cause was that I used the internal IP; switching to the public IP fixed it.
Error: failed to get CA certificate, err: Get "https://192.168.0.132:10002/ca.crt":
dial tcp 192.168.0.xxx:10002: i/o timeout
(192.168.0.xxx is the master node's internal IP, which the edge server cannot reach.)
edgecore.service: Main process exited, code=exited, status=1/FAILURE
4月 11 12:54:56 nvidia-desktop systemd[1]: edgecore.service: Failed with result 'exit-code'.
Solution: the master node has internal IP 192.168.x.xx2 and public IP 124.70.221.xxx; use the public IP instead.
sudo keadm join --cloudcore-ipport=124.70.221.xxx:10000 --kubeedge-version=1.9.1 --token=1800442bfda63fe9f43bba266c257d535ab13967ddf6b102b3bb7fa2feb8ee50.eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NDk3MzUxMDF9.0arUxCTpt0azdp99tR88aHeAObo5KAIEDiPfDWX44k0 --edgenode-name=ru
Then run the following command on the master node:
kubectl get node
5. keadm init --advertise-address=124.70.221.177 --kubeedge-version=1.10.0
Error: the command wget -k --no-check-certificate --progress=bar:force https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/crds/router/router_v1_rule.yaml fails to download the file.
execute keadm command failed: failed to exec 'bash -c cd /etc/kubeedge/crds/router
&& wget -k --no-check-certificate --progress=bar:force
https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/crds/router/router_v1_rule.yaml', err: --2022-04-25 10:24:11--
https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/crds/router/router_v1_rule.yaml
Solution: download the file manually, or copy it from the source tree into the /etc/kubeedge/crds/router/ directory:
$GOPATH/src/github.com/kubeedge/kubeedge/build/crds/router/router_v1_ruleEndpoint.yaml
6. On the edge node, keadm join fails with:
Error: failed to get CA certificate, err: Get "https://124.70.221.177:10002/ca.crt": dial tcp 124.70.221.177:10002: connect: connection refused
...exited, status=1/FAILURE
The cloudcore process had been killed. Restart cloudcore on the master node:
nohup cloudcore > cloudcore.log 2>&1 &