Using lxd to do vlan test (by quqi99)

Author: Zhang Hua    Published: 2022-08-15
Copyright: This article may be freely reproduced, but any reproduction must state the original source and author information together with this copyright notice, in hyperlink form.

Problem

A customer reports that ARP replies are not received inside their SR-IOV VMs. In each VM, two SR-IOV NICs form one pkt0 (a bond?), made up of an active NIC (pkt0_p) and a standby NIC (pkt0_s):

<no-ip>/fa:16:3e:d8:3f:b9(pkt0)
<no-ip>/fa:16:3e:d8:3f:b9(pkt0_p)
<no-ip>/fa:16:3e:70:be:ba(pkt0_s)
151.2.143.1/151.2.143.2/fa:16:3e:d8:3f:b9(pkt0.610@pkt0)
10.139.99.1/10.139.99.2/fa:16:3e:d8:3f:b9(pkt0.510@pkt0)
10.139.160.10/10.139.160.11/10.139.160.12/fa:16:3e:d8:3f:b9(pkt0.700@pkt0)

They say that an ICMP heartbeat check on the active NIC works fine, but an ARP heartbeat check toward the GW on the standby NIC gets no ARP reply (although the captures below seem to show one being received?):

1, arp for active port(fa:16:3e:d8:3f:b9)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:d8:3f:b9 and arp |tail -n1
357602 8141.824956 fa:16:3e:d8:3f:b9 → IETF-VRRP-VRID_64 ARP 60 Who has 10.139.160.254? Tell 10.139.160.10
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:d8:3f:b9 and arp |tail -n1
357603 8141.825416 IETF-VRRP-VRID_64 → fa:16:3e:d8:3f:b9 ARP 60 10.139.160.254 is at 00:00:5e:00:01:64

2, icmp for active port(fa:16:3e:d8:3f:b9)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:d8:3f:b9 and icmp |tail -n1
358835 8169.867056 10.139.160.254 → 10.139.160.10 ICMP 102 Echo (ping) reply    id=0x000a, seq=15233/33083, ttl=64 (request in 358834)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:d8:3f:b9 and icmp |tail -n1
358834 8169.863263 10.139.160.10 → 10.139.160.254 ICMP 102 Echo (ping) request  id=0x000a, seq=15233/33083, ttl=64

3, arp for standby port(fa:16:3e:70:be:ba)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:70:be:ba and arp |tail -n1
358848 8170.244743 fa:16:3e:70:be:ba → Broadcast    ARP 60 Who has 10.139.160.254? (ARP Probe)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:70:be:ba and arp |tail -n1
358849 8170.245117 IETF-VRRP-VRID_64 → fa:16:3e:70:be:ba ARP 60 10.139.160.254 is at 00:00:5e:00:01:64

4, icmp for standby port(fa:16:3e:70:be:ba)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:70:be:ba and icmp |tail -n1
<empty>
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:70:be:ba and icmp |tail -n1
<empty>
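
One further check worth running on the same capture (not shown in the original output) is whether the standby port's ARP frames carry an 802.1Q tag at all; tshark's standard vlan.id display filter can tell:

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:70:be:ba and vlan.id==700
#empty output would mean the standby port's frames are untagged (or the tag was stripped before capture)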

The following analysis has already been done:

  • Confirmed that in the SR-IOV OVN configuration below, br-data (used for the external network) contains no SR-IOV NIC. If it did, and the SR-IOV NIC were consumed via macvtap rather than passthrough, a hairpin-mode problem could exist: a VM on a host cannot reach the network on its own chassis, though it can reach the networks of other chassis.
$ juju config ovn-chassis-sriov-hugepages ovn-bridge-mappings
dcfabric:br-data sriovfabric1:br-data sriovfabric2:br-data
$ juju config ovn-chassis-sriov-hugepages bridge-interface-mappings
br-data:bond1
$ juju config ovn-chassis-sriov-hugepages sriov-device-mappings
sriovfabric1:ens3f0 sriovfabric1:ens6f0 sriovfabric2:ens3f1 sriovfabric2:ens6f1
$ juju config ovn-chassis-sriov-hugepages sriov-numvfs
ens3f0:32 ens3f1:32 ens6f0:32 ens6f1:32
  • Ruled out LP bug 1875852; the customer does not use VLAN as the tenant network.
  • Seeing only ARP requests with tcpdump on the PF is normal: ARP requests are broadcast, so they show up on the PF, while ARP replies are unicast. If the PF is not in promiscuous mode (some Intel SR-IOV NICs have a hardware bug and do not support promiscuous mode), then not seeing ARP replies with tcpdump on the PF is expected. Also, tcpdump cannot be used on a VF.
  • DHCP is disabled. Generally, with SR-IOV OVN the SR-IOV subnet should have DHCP enabled, but having it disabled here should also be fine because the customer assigns IPs statically.
  • The customer's statically assigned IP (set by heat) differs from the IP allocated in nova; this should not matter either, because SR-IOV bypasses the host, so the host-side SG (mainly the IP/MAC anti-spoofing SW rules) does not affect it.
  • Since the actual IP differs from the nova-allocated IP, the OpenStack application-level SG cannot affect it; but what about the SR-IOV hardware-level equivalent? Confirmed that spoof checking is off as well:
$ grep -E 'fa:16:3e:f8:42:fe|fa:16:3e:70:be:ba|fa:16:3e:8f:56:5a|fa:16:3e:d8:3f:b9' sos_commands/networking/ip_-s_-d_link
vf 30 MAC fa:16:3e:70:be:ba, spoof checking off, link-state auto, trust on
vf 31 MAC fa:16:3e:f8:42:fe, spoof checking off, link-state auto, trust on
vf 29 MAC fa:16:3e:8f:56:5a, spoof checking off, link-state auto, trust on
vf 30 MAC fa:16:3e:d8:3f:b9, spoof checking off, link-state auto, trust on
  • MAC filtering is ruled out (spoof checking above), but what about VLAN filtering? The tcpdump data suggests the customer defined a VLAN inside the VM (pkt0.700@pkt0); a VF-level check is sketched right after this list.
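
Whether a VLAN is programmed at the VF level can be read from the PF's VF table on the compute host. A minimal sketch (ens3f0 and VF index 30 are example values taken from the configs/outputs above; which PF actually hosts each VF has to be checked):

$ ip link show ens3f0 | grep 'vf 30'
#a 'vlan 700' field in this line would mean the NIC itself tags/filters that VF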

The test in this article mainly simulates that in-guest VLAN setup; SR-IOV hardware is of course not involved here.

Setting up the VLAN test environment

lxc remote add faster https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
lxc image list faster:
lxc remote list
#Failed creating instance record: Failed detecting root disk device: No root device could be found
#lxc profile device add default root disk path=/ pool=default
#lxc profile show default
#lxc launch ubuntu:focal master -p juju-default --config=user.network-config="$(cat network.yml)"
lxc launch faster:ubuntu/jammy test1
lxc launch faster:ubuntu/jammy test2

#add two NICs from NET1 for two containers
lxc network create NET1 ipv6.address=none ipv4.address=10.139.160.1/24
lxc network attach NET1 test1 eth1
lxc network attach NET1 test1 eth2
lxc network attach NET1 test2 eth1
lxc network attach NET1 test2 eth2
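
A quick sanity check at this point (extra, not part of the original steps) confirms the bridge exists and both NICs were attached:

lxc network show NET1
lxc exec test1 -- ip -br link    #eth1 and eth2 should be listed alongside eth0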

#https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking#vlan
#ip link add ptk0 type bond miimon 100 mode active-backup
#ip link set eth2 master ptk0
#ip link set eth1 master ptk0
lxc exec test1 -- /bin/bash
cat << EOF |tee /etc/netplan/11-test.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:15:bd:58
    eth2:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:68:72:0f
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - eth1
        - eth2
      parameters:
        mode: active-backup
        primary: eth1
  vlans:
    ptk0.700:
      id: 700
      link: ptk0
      dhcp4: no
      addresses: [ 10.139.160.10/24 ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
netplan apply
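
Still inside test1, a quick check (not in the original steps) confirms that the bond mode and the VLAN id took effect:

grep -E 'Bonding Mode|Currently Active' /proc/net/bonding/ptk0
ip -d link show ptk0.700 | grep vlan    #should show 'vlan protocol 802.1Q id 700'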

lxc exec test2 -- /bin/bash
cat << EOF |tee /etc/netplan/11-test.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:1e:19:25
    eth2:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:f7:9e:22
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - eth1
        - eth2
      parameters:
        mode: active-backup
        primary: eth1
  vlans:
    ptk0.700:
      id: 700
      link: ptk0
      dhcp4: no
      addresses: [ 10.139.160.11/24 ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
netplan apply

The steps above create two LXD containers, build an active/standby bond (ptk0) inside each, and then add a VLAN (ptk0.700) on top of the bond. For this network to pass traffic, the trunk must also be set up on the host; only then does the VLAN network work.
Note: the macaddress fields above are needed to pin the MAC of each of the two NICs; without them, all NICs end up with the same MAC after the bond and VLAN are created.

$ sudo brctl show |grep NET1 -A3
NET1		8000.00163eeb79c4	no		veth2af34c1d
							veth3a5b458e
							veth82c292b2
							veth9b8e8cb6
#sudo bridge vlan add vid 2-4094 dev NET1 self
sudo bridge vlan add vid 700 dev NET1 self
sudo bridge vlan add vid 700 dev veth2af34c1d
sudo bridge vlan add vid 700 dev veth3a5b458e
sudo bridge vlan add vid 700 dev veth82c292b2
sudo bridge vlan add vid 700 dev veth9b8e8cb6
sudo bridge vlan show
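
Note that bridge vlan entries only take effect when vlan_filtering is enabled on the bridge; whether LXD enables it by default depends on the version, so it is worth verifying (a check not in the original steps):

ip -d link show NET1 | grep -o 'vlan_filtering [01]'
sudo ip link set NET1 type bridge vlan_filtering 1    #enable it if the value above is 0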

At this point, test1 can ping test2 over VLAN 700:

root@test1:~# ping 10.139.160.11 -c1
PING 10.139.160.11 (10.139.160.11) 56(84) bytes of data.
64 bytes from 10.139.160.11: icmp_seq=1 ttl=64 time=0.133 ms
root@test2:~# tcpdump -i eth1 -nn -e -l
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
05:54:36.128602 00:16:3e:15:bd:58 > 00:16:3e:1e:19:25, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.10 > 10.139.160.11: ICMP echo request, id 37135, seq 1, length 64
05:54:36.128643 00:16:3e:1e:19:25 > 00:16:3e:15:bd:58, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.11 > 10.139.160.10: ICMP echo reply, id 37135, seq 1, length 64

But pinging the GW still fails:

root@test1:~# ping 10.139.160.1 -c1
PING 10.139.160.1 (10.139.160.1) 56(84) bytes of data.
From 10.139.160.10 icmp_seq=1 Destination Host Unreachable
$ sudo tcpdump -i NET1 -nn -e -l
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on NET1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:25:24.761131 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 700, p 0, ethertype ARP (0x0806), Request who-has 10.139.160.1 tell 10.139.160.10, length 28

Neither creating an eth0.700 nor creating a tap0 with vlan=700 on the host makes the GW reachable:

#use eth0.700
sudo ip link add link eth0 name eth0.700 type vlan id 700
sudo brctl addif NET1 eth0.700
sudo ifconfig eth0.700 up
sudo ip addr add 10.139.160.254/24 dev eth0.700
sudo bridge vlan add vid 700 dev eth0.700

#use a tap
sudo ip tuntap add mode tap tap0
sudo ip link set tap0 master NET1
sudo bridge vlan add dev tap0 vid 700 pvid untagged master
sudo ip addr add 10.139.160.254/24 dev tap0
sudo bridge vlan show
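
A third variant, not tried above and sketched here purely as an assumption, is to put the GW IP on a VLAN subinterface of the bridge device itself, so that frames entering the bridge tagged with 700 can reach the host IP stack:

#untested sketch; assumes the eth0.700/tap0 attempts above were removed first
sudo ip link add link NET1 name NET1.700 type vlan id 700
sudo ip link set NET1.700 up
sudo ip addr add 10.139.160.254/24 dev NET1.700
#vid 700 on the bridge device itself was already allowed above via 'bridge vlan add vid 700 dev NET1 self'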

Test 1

So let's treat test2 as the GW instead, ping it from test1 and capture packets.
First, ICMP from the active port only:

root@test1:~# ping -I eth1 10.139.160.1 -c1
ping: Warning: source address might be selected on device other than: eth1
PING 10.139.160.1 (10.139.160.1) from 192.168.121.88 eth1: 56(84) bytes of data.
^C
--- 10.139.160.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

$ sudo tcpdump -i NET1 -nn -e -l
14:32:04.483156 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.139.160.1 tell 192.168.121.88, length 28
14:32:04.483185 00:16:3e:eb:79:c4 > 00:16:3e:15:bd:58, ethertype ARP (0x0806), length 42: Reply 10.139.160.1 is-at 00:16:3e:eb:79:c4, length 28

Running 'ping -I eth1 10.139.160.11 -c1' and 'ping -I eth2 10.139.160.11 -c1' produces no output in either case.
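
That silence is consistent with tagging: eth1 and eth2 are bond slaves, and only ptk0.700 inserts the 802.1Q tag, so traffic injected directly on a slave leaves untagged and can never terminate at ptk0.700 on test2. One way to confirm this (an extra check, not in the original) is to watch the slave on test2:

root@test2:~# tcpdump -i eth1 -nn -e arp or icmp
#the requests should arrive here without any 'vlan 700' field and go unanswered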

Test 2

The arping command requires a source IP when sending an ARP request, but the standby port has no IP, so one is supplied via '-S'.

root@test1:~# arping -I ptk0.700 10.139.160.11 -S 10.139.160.2 -C1
ARPING 10.139.160.11
42 bytes from 00:16:3e:1e:19:25 (10.139.160.11): index=0 time=8.119 usec
root@test2:~# sudo tcpdump -i ptk0.700 -nn -e -l
09:08:16.814374 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.139.160.11 tell 10.139.160.2, length 44
09:08:16.814410 00:16:3e:1e:19:25 > 00:16:3e:15:bd:58, ethertype ARP (0x0806), length 42: Reply 10.139.160.11 is-at 00:16:3e:1e:19:25, length 28

Running 'arping -I eth1 10.139.160.11 -S 10.139.160.2 -C1' and 'arping -I eth2 10.139.160.11 -S 10.139.160.2 -C1' gets no reply in either case:

root@test1:~# arping -I eth2 10.139.160.11 -S 10.139.160.2 -C1
ARPING 10.139.160.11
Timeout

Presumably because frames sent directly on eth1 and eth2 do not carry the vlan=700 tag?

Some Outputs

root@test1:~# cat /proc/net/bonding/ptk0 
Ethernet Channel Bonding Driver: v5.15.0-43-generic

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:16:3e:15:bd:58
Slave queue ID: 0

Slave Interface: eth2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:16:3e:68:72:0f
Slave queue ID: 0

Another pure-CLI method

Instead of setting up the network with netplan as above, this variant creates the bond directly with pure CLI commands (the linked article also covers the alternative that does not use vlan-filtering) - https://developers.redhat.com/blog/2017/09/14/vlan-filter-support-on-bridge#bridge_and_vlan

lxc launch faster:ubuntu/jammy test1
lxc launch faster:ubuntu/jammy test2
#add two NICs from NET1 for two containers
lxc network create NET1 ipv6.address=none ipv4.address=10.139.160.1/24
lxc network attach NET1 test1 eth1
lxc network attach NET1 test1 eth2
lxc network attach NET1 test2 eth1
lxc network attach NET1 test2 eth2

#inside test1
lxc exec test1 -- /bin/bash
sudo ip link add ptk0 type bond miimon 100 mode active-backup
sudo ip link set eth1 down
sudo ip link set eth1 master ptk0
sudo ip link set eth2 down
sudo ip link set eth2 master ptk0
sudo ip link set dev ptk0 address 00:16:3e:15:bd:58
sudo ip link set dev eth1 address 00:16:3e:15:bd:58
sudo ip link set dev eth2 address 00:16:3e:68:72:0f
sudo ip link set ptk0 up
sudo ip link add link ptk0 name ptk0.700 type vlan id 700
sudo ip addr add 10.139.160.10/24 dev ptk0.700

#inside test2
lxc exec test2 -- /bin/bash
sudo ip link add ptk0 type bond miimon 100 mode active-backup
sudo ip link set eth1 down
sudo ip link set eth1 master ptk0
sudo ip link set eth2 down
sudo ip link set eth2 master ptk0
sudo ip link set dev ptk0 address 00:16:3e:1e:19:25
sudo ip link set dev eth1 address 00:16:3e:1e:19:25
sudo ip link set dev eth2 address 00:16:3e:f7:9e:22
sudo ip link set ptk0 up
sudo ip link add link ptk0 name ptk0.700 type vlan id 700
sudo ip addr add 10.139.160.11/24 dev ptk0.700

#on host
sudo bridge vlan add vid 700 dev NET1 self
brctl show NET1 | grep -o 'veth\w*' | xargs -I{} sudo bridge vlan add vid 700 dev {}
sudo bridge vlan show
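
With the trunk in place, the same check as in the netplan variant should pass (a quick sanity test mirroring the earlier one):

lxc exec test2 -- tcpdump -i eth1 -nn -e -c2 vlan 700    #run this first in one terminal
lxc exec test1 -- ping -c1 10.139.160.11                 #then ping from another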

Reference

[1] LACP bond configuration - https://blog.csdn.net/quqi99/article/details/51251210
[2] Three ways to use VLAN - https://blog.csdn.net/quqi99/article/details/51218884
[3] creating vlan over openstack - https://blog.csdn.net/quqi99/article/details/118341936
[4] VLAN filter support on bridge - https://developers.redhat.com/blog/2017/09/14/vlan-filter-support-on-bridge#
