Installing KubeEdge on an NVIDIA AGX edge server

Posted by Kris_u


Reference: Deploying using Keadm | KubeEdge (official installation guide)

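The local machine runs Ubuntu 20.04, and the NVIDIA AGX edge server is connected to it.

On the local machine, iptables forwards traffic to the edge server, and a terminal on the local machine connects to the NVIDIA edge server over ssh.

The NVIDIA AGX edge server runs Ubuntu 18.04.

Steps 1-7 run on the NVIDIA edge node; they install the dependencies that KubeEdge needs on the edge server.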

1. Set the root password:

sudo passwd root

2. Install the required tools:

sudo apt-get update
sudo apt-get install net-tools make vim ssh docker.io

3. Enable root login over ssh

sudo vim /etc/ssh/sshd_config

Set the PermitRootLogin and PasswordAuthentication parameters to yes, then restart the ssh service:

sudo service sshd restart
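
The same two settings can be applied without opening an editor. The following is a minimal sketch, assuming GNU sed; set_sshd_option is a hypothetical helper name introduced here for illustration, not something OpenSSH provides.

```shell
#!/bin/sh
# Hypothetical helper: set "KEY VALUE" in an sshd_config-style file.
# Usage: set_sshd_option FILE KEY VALUE
set_sshd_option() {
    file=$1
    key=$2
    value=$3
    if grep -Eq "^[#[:space:]]*${key}[[:space:]]" "$file"; then
        # Rewrite the existing (possibly commented-out) directive in place.
        sed -i -E "s|^[#[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file"
    else
        # Directive not present at all: append it.
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}

# Example (as root on the edge server), followed by a restart of the ssh service:
#   set_sshd_option /etc/ssh/sshd_config PermitRootLogin yes
#   set_sshd_option /etc/ssh/sshd_config PasswordAuthentication yes
```

If the file ends with a Match block, an appended line would fall inside that block, so it is worth checking the result by eye.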

Set up ssh public-key (passwordless) login to the NVIDIA edge server from the local machine:

cd ~/.ssh/ && ssh-keygen
ssh-copy-id 192.168.30.101   # IP address of the NVIDIA edge server

4. IP forwarding
Enable IP forwarding:

# sudo does not apply to the >> redirection, so append through tee instead
$ echo "net.ipv4.ip_forward = 1" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p

Then check it:

$ sudo sysctl -p | grep ip_forward
net.ipv4.ip_forward = 1
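
Appending to /etc/sysctl.conf on every run leaves duplicate lines behind. A minimal sketch of an idempotent append follows; ensure_line is a hypothetical helper name, and it only compares whole lines literally.

```shell
#!/bin/sh
# Hypothetical helper: append LINE to FILE only if an identical line is not already there.
# Usage: ensure_line LINE FILE
ensure_line() {
    line=$1
    file=$2
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root):
#   ensure_line "net.ipv4.ip_forward = 1" /etc/sysctl.conf && sysctl -p
```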

5. Install the snap package manager, then install the three Kubernetes tools (kubectl, kubelet, kubeadm) through snap


if command -v apt > /dev/null 2>&1; then
    # Debian/Ubuntu: the snap daemon is packaged as snapd
    APT=apt
    sudo $APT install net-tools ssh docker.io
    sudo apt-get install snapd
    sudo snap install kubectl --classic
    sudo snap install kubelet --classic
    sudo snap install kubeadm --classic
else
    # RPM-based systems
    APT=yum
    sudo $APT install net-tools openssh-server docker
    sudo $APT install kubectl
    sudo $APT install kubelet
    sudo snap install kubeadm --classic
fi
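
Before moving on, it helps to confirm that everything from this step is actually on PATH. A minimal sketch, in which missing_tools is a hypothetical helper name:

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: empty output means all four tools are installed.
#   missing_tools docker kubectl kubelet kubeadm
```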

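6. Download and install Go (from Go语言中文网) and set the environment variables

Download Go from the Go download page of Go语言中文网 (the Golang Chinese community). The environment variables can also be edited by hand with gedit ~/.bashrc, adding the lines below in the editor that opens. Also create the src and bin directories under the GOPATH directory.

# Download the Go tarball and extract it into /usr/local
# (the AGX is an ARM64 machine, so use the linux-arm64 build; linux-amd64 is for x86_64 hosts only)

cd /usr/local/
sudo wget https://dl.google.com/go/go1.18.linux-arm64.tar.gz
sudo tar -zxvf go1.18.linux-arm64.tar.gz

# Set the Go environment variables
# GOROOT is where the Go distribution is installed
echo 'export GOROOT=/usr/local/go' >> ~/.bashrc
# GOPATH is the location of the workspace directory
echo 'export GOPATH=/home/hadoop/GOPATH' >> ~/.bashrc
# Single quotes keep $GOPATH, $GOROOT and $PATH unexpanded until ~/.bashrc is sourced
echo 'export PATH=$GOPATH/bin:$GOROOT/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
go version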
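
The tarball has to match the CPU architecture of the machine it is installed on. The sketch below derives the download URL from the output of uname -m; go_arch and go_tarball_url are hypothetical helper names, and only the two architectures relevant to this setup are handled.

```shell
#!/bin/sh
# Map a `uname -m` machine name to the architecture suffix used in Go tarball names.
go_arch() {
    case "$1" in
        x86_64)        echo amd64 ;;
        aarch64|arm64) echo arm64 ;;
        *)             return 1 ;;   # not handled in this sketch
    esac
}

# Build the download URL for a Go version and a machine name.
# Usage: go_tarball_url VERSION MACHINE
go_tarball_url() {
    arch=$(go_arch "$2") || return 1
    echo "https://dl.google.com/go/go$1.linux-${arch}.tar.gz"
}

# Example: print the URL for the current machine.
#   go_tarball_url 1.18 "$(uname -m)"
```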

7. Install keadm

docker run kubeedge/installation-package:v1.10.0 cat /usr/local/bin/keadm > /usr/local/bin/keadm && chmod +x /usr/local/bin/keadm

8. Run on the master node (the master node is a Huawei Cloud server):

# internal IP = 192.168.x.xx2; public IP = 124.70.221.xxx
kubeadm init --apiserver-advertise-address=192.168.x.xx2 --pod-network-cidr=10.244.0.0/16 \
  --kubernetes-version=1.23.1 --apiserver-cert-extra-sans=124.70.221.xxx

On success, kubeadm prints:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.0.xxx:6443 --token 64ij3s.705fntphiz6mzijw \
        --discovery-token-ca-cert-hash sha256:eb40233d27e17bdd1e585db9fadf05b2eff99ff5055f4775636b889a9edacecc

Follow the steps printed above. If you forget the cluster join token, print a fresh join command with:

kubeadm token create --print-join-command 

9. Install KubeEdge with keadm: first initialize the cloud side (run on the master node)

It is best to download the KubeEdge xxx.tar.gz package manually beforehand and place it in /etc/kubeedge/.

keadm init --advertise-address=124.70.221.xxx   # public IP address of the master node: 124.70.221.xxx

keadm beta init installs cloudcore in a container instead:

# Install cloudcore as a container
keadm beta init --advertise-address=$ip --kubeedge-version=1.10.0 \
--kube-config=/root/.kube/config --force --set \
cloudCore.modules.dynamicController.enable=true

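10. Get the token for the edge side (run on the master node)

keadm gettoken

Copy and save the token; step 11 needs it when setting up the edge node.

11. Set up the edge node (run the join command on the edge node)

Join the cluster: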

keadm join --cloudcore-ipport=124.70.221.xxx:10000 --token=$TOKEN

Check whether edgecore is running successfully:

journalctl -u edgecore.service -b

On the master node, run the following command to check whether the edge node was added successfully:

kubectl get nodes
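
The check can also be scripted. The sketch below assumes the usual column layout of kubectl get nodes (name first, status second); node_is_ready is a hypothetical helper name, and the node name in the example is a placeholder.

```shell
#!/bin/sh
# Succeed if the node named $1 is listed with status Ready.
# Reads the output of `kubectl get nodes --no-headers` on stdin.
node_is_ready() {
    awk -v name="$1" '
        $1 == name {
            split($2, parts, ",")          # status may look like "Ready,SchedulingDisabled"
            if (parts[1] == "Ready") ok = 1
        }
        END { exit (ok ? 0 : 1) }
    '
}

# Example (on the master node):
#   kubectl get nodes --no-headers | node_is_ready <edge-node-name> && echo "edge node joined"
```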

Note:

        1. Downloading the KubeEdge tarball and service file may take a long time; you can download them manually from the official website and then copy them to /etc/kubeedge/.

Issues encountered during the KubeEdge setup:

1. kubeadm init fails with the following errors:

[root@huawei-node2 kubeedge]# kubeadm init --apiserver-advertise-address=192.168.x.xxx --pod-network-cidr=10.xxx.0.0/16 --kubernetes-version=1.23.1 --apiserver-cert-extra-sans=124.70.xxx.xxx
[init] Using Kubernetes version: v1.23.1
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR Port-6443]: Port 6443 is in use
        [ERROR Port-10259]: Port 10259 is in use
        [ERROR Port-10257]: Port 10257 is in use
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
        [ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1
        [ERROR Port-10250]: Port 10250 is in use
        [ERROR Port-2379]: Port 2379 is in use
        [ERROR Port-2380]: Port 2380 is in use
        [ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

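This happens when state from an earlier kubeadm init is still present (ports in use, manifests already written, /var/lib/etcd not empty). Running kubeadm reset resolves it.

Errors 2 and 3 appeared when I connected to the NVIDIA edge server over ssh from an Ubuntu 18.04 virtual machine running inside Windows on my computer. Other bloggers attribute them to network permissions that block many downloads, which then have to be done manually. There are many pitfalls on this path, so it is best not to ssh to the NVIDIA edge server from a virtual machine. After I replaced Windows on the computer with Ubuntu 20.04, these errors no longer occurred. (The edge server's USB connection could not be used from Windows, which is why I reinstalled the computer with Ubuntu.)

2. Error: fail to download service file (edgecore.service)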

[Run as service] start to download service file for edgecore
Error: fail to download service file,error:failed to exec 'bash -c cd /etc/kubeedge/ && sudo -E wget -t 5 -k --no-check-certificate https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/tools/edgecore.service', err: --2022-04-09 17:43:49--  https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/tools/edgecore.service
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... failed: Connection refused.
Converted links in 0 files in 0 seconds.
, err: exit status 4

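A variant of the same failure is "Connecting to raw.githubusercontent.com ... Unable to establish SSL connection".

For well-known reasons, DNS resolution for raw.githubusercontent.com is polluted and the host cannot be reached.

To find the real IP address, open the ipaddress website and enter raw.githubusercontent.com in its search box.

Edit hosts on Linux
Open /etc/hosts with administrator privileges and add the following line: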

185.199.108.133 raw.githubusercontent.com

Then retry the download:

cd /etc/kubeedge/ && sudo -E wget -t 5 -k --no-check-certificate https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/tools/edgecore.service

If the same error persists, download the file manually into /etc/kubeedge/.

3. kubelet health check fails; check the system log
Command output:

It seems like the kubelet isn't running or healthy. The HTTP call equal to 'curl 
-sSL http://localhost:10248/healthz' failed with error:
 Get "http://localhost:10248/healthz": dial tcp [::1]:10248: 
connect: connection refused

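Solution:
1. Turn off swap
swapoff -a
2. Comment out the swap entry so that swap stays off after a reboot
vi /etc/fstab
Comment out the swap line (the last line here):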
#UUID=6042e061-f29b-4ac1-9f32-87980ddf0e1f swap swap defaults 0 0
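
The fstab edit can be done non-interactively as well. This is a minimal sketch, assuming GNU sed and a standard whitespace-separated fstab; comment_out_swap is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: comment out every active swap entry in an fstab-format FILE.
# Lines that are already commented out are left untouched, so it is safe to rerun.
comment_out_swap() {
    sed -i -E '/^[[:space:]]*[^#[:space:]].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}

# Example (as root), after `swapoff -a`:
#   cp /etc/fstab /etc/fstab.bak && comment_out_swap /etc/fstab
```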

4. keadm join fails with "Error: failed to get CA certificate". The cause was that I used the internal IP; switching to the public IP fixed it.

Error: failed to get CA certificate, err: Get "https://192.168.0.132:10002/ca.crt": 
dial tcp 192.168.0.xxx:10002: i/o timeout  //192.168.0.xxx:10002  is the edge server
core.service: Main process exited, code=exited, status=1/FAILURE
edgecore.service: Main process exited, code=exited,
4月 11 12:54:56 nvidia-desktop systemd[1]: edgecore.service: Failed with result 'exit-code'.

Solution: the master node has internal IP 192.168.x.xx2 and public IP 124.70.221.xxx. Use the public IP instead:

sudo keadm join --cloudcore-ipport=124.70.221.xxx:10000 --kubeedge-version=1.9.1 --token=1800442bfda63fe9f43bba266c257d535ab13967ddf6b102b3bb7fa2feb8ee50.eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NDk3MzUxMDF9.0arUxCTpt0azdp99tR88aHeAObo5KAIEDiPfDWX44k0 --edgenode-name=ru

Then run the following command on the master node:

kubectl get node
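
Issue 4 comes down to passing a private address to an edge node that sits on a different network. A small pre-check can flag that before running keadm join. The sketch below is an assumption-laden illustration: is_private_ipv4 is a hypothetical helper name, it only recognizes the RFC 1918 ranges, and a private address is still fine when the edge node is on the same network as the master.

```shell
#!/bin/sh
# Succeed if $1 is an IPv4 address in an RFC 1918 private range
# (10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16).
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.*)
            second=${1#172.}        # drop the first octet
            second=${second%%.*}    # keep only the second octet
            [ "$second" -ge 16 ] 2>/dev/null && [ "$second" -le 31 ]
            return ;;
        *) return 1 ;;
    esac
}

# Example (CLOUD_IP is a placeholder variable):
#   is_private_ipv4 "$CLOUD_IP" && echo "warning: $CLOUD_IP is a private address"
```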

5. keadm init --advertise-address=124.70.221.177 --kubeedge-version=1.10.0

Error: the following download fails: wget -k --no-check-certificate --progress=bar:force https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/crds/router/router_v1_rule.yaml

execute keadm command failed:  failed to exec 'bash -c cd /etc/kubeedge/crds/router 
&& wget -k --no-check-certificate --progress=bar:force 
https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/crds/router/router_v1_rule.yaml', err: --2022-04-25 10:24:11--  
https://raw.githubusercontent.com/kubeedge/kubeedge/release-1.10/build/crds/router/router_v1_rule.yaml

Solution: download the files manually, or copy them from the KubeEdge source tree into /etc/kubeedge/crds/router/, for example:

$GOPATH/src/github.com/kubeedge/kubeedge/build/crds/router/router_v1_ruleEndpoint.yaml

6. keadm join error on the edge node:

Error: failed to get CA certificate, err: Get "https://124.70.221.177:10002/ca.crt": dial tcp 124.70.221.177:10002: connect: connection refused
ted, status=1/FAILURE

The cloudcore process had been killed; restart cloudcore on the master node:

nohup cloudcore > cloudcore.log 2>&1 &
