Author: 张华 (Zhang Hua)  Published: 2022-08-15
Copyright: This article may be freely reproduced, provided the original source, the author, and this copyright notice are kept, with a hyperlink to the original.
A customer reported that ARP replies never arrive inside their SR-IOV VM. The VM bonds two SR-IOV NICs into a single pkt0 (apparently an active-backup bond), made up of an active NIC (pkt0_p) and a standby NIC (pkt0_s):
/fa:16:3e:d8:3f:b9(pkt0)
/fa:16:3e:d8:3f:b9(pkt0_p)
/fa:16:3e:70:be:ba(pkt0_s)
151.2.143.1/151.2.143.2/fa:16:3e:d8:3f:b9(pkt0.610@pkt0)
10.139.99.1/10.139.99.2/fa:16:3e:d8:3f:b9(pkt0.510@pkt0)
10.139.160.10/10.139.160.11/10.139.160.12/fa:16:3e:d8:3f:b9(pkt0.700@pkt0)
The customer says the ICMP heartbeat check on the active NIC works fine, but the ARP heartbeat check to the GW on the standby NIC never gets an ARP reply (although the capture below seems to show a reply arriving?):
1, arp for active port(fa:16:3e:d8:3f:b9)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:d8:3f:b9 and arp |tail -n1
357602 8141.824956 fa:16:3e:d8:3f:b9 → IETF-VRRP-VRID_64 ARP 60 Who has 10.139.160.254? Tell 10.139.160.10
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:d8:3f:b9 and arp |tail -n1
357603 8141.825416 IETF-VRRP-VRID_64 → fa:16:3e:d8:3f:b9 ARP 60 10.139.160.254 is at 00:00:5e:00:01:64
2, icmp for active port(fa:16:3e:d8:3f:b9)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:d8:3f:b9 and icmp |tail -n1
358835 8169.867056 10.139.160.254 → 10.139.160.10 ICMP 102 Echo (ping) reply id=0x000a, seq=15233/33083, ttl=64 (request in 358834)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:d8:3f:b9 and icmp |tail -n1
358834 8169.863263 10.139.160.10 → 10.139.160.254 ICMP 102 Echo (ping) request id=0x000a, seq=15233/33083, ttl=64
3, arp for standby port(fa:16:3e:70:be:ba)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:70:be:ba and arp |tail -n1
358848 8170.244743 fa:16:3e:70:be:ba → Broadcast ARP 60 Who has 10.139.160.254? (ARP Probe)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:70:be:ba and arp |tail -n1
358849 8170.245117 IETF-VRRP-VRID_64 → fa:16:3e:70:be:ba ARP 60 10.139.160.254 is at 00:00:5e:00:01:64
4, icmp for standby port(fa:16:3e:70:be:ba)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:70:be:ba and icmp |tail -n1
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:70:be:ba and icmp |tail -n1
Item 4 returns nothing in either direction, i.e. no ICMP touches the standby MAC at all. The following analysis had already been done:
$ juju config ovn-chassis-sriov-hugepages ovn-bridge-mappings
dcfabric:br-data sriovfabric1:br-data sriovfabric2:br-data
$ juju config ovn-chassis-sriov-hugepages bridge-interface-mappings
br-data:bond1
$ juju config ovn-chassis-sriov-hugepages sriov-device-mappings
sriovfabric1:ens3f0 sriovfabric1:ens6f0 sriovfabric2:ens3f1 sriovfabric2:ens6f1
$ juju config ovn-chassis-sriov-hugepages sriov-numvfs
ens3f0:32 ens3f1:32 ens6f0:32 ens6f1:32
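On the host these counts correspond to each PF's sriov_numvfs knob in sysfs; a quick cross-check (PF names taken from the mappings above):
#VFs currently enabled on one PF, and the upper bound the card supports
cat /sys/class/net/ens3f0/device/sriov_numvfs
cat /sys/class/net/ens3f0/device/sriov_totalvfs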
$ grep -E 'fa:16:3e:f8:42:fe|fa:16:3e:70:be:ba|fa:16:3e:8f:56:5a|fa:16:3e:d8:3f:b9' sos_commands/networking/ip_-s_-d_link
vf 30 MAC fa:16:3e:70:be:ba, spoof checking off, link-state auto, trust on
vf 31 MAC fa:16:3e:f8:42:fe, spoof checking off, link-state auto, trust on
vf 29 MAC fa:16:3e:8f:56:5a, spoof checking off, link-state auto, trust on
vf 30 MAC fa:16:3e:d8:3f:b9, spoof checking off, link-state auto, trust on
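The per-VF state (MAC, spoof checking, trust) can also be read straight from ip link on the PF, and toggled there; a small sketch, with the PF name and VF index assumed from the output above:
#list the VFs hanging off one PF, with full per-VF details
ip -d link show ens3f0
#e.g. flip spoof checking for VF 30 if it were suspected:
#sudo ip link set ens3f0 vf 30 spoofchk on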
The test in this article simulates this VLAN setup (without the SR-IOV hardware, of course).
lxc remote add faster https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
lxc image list faster:
lxc remote list
#Failed creating instance record: Failed detecting root disk device: No root device could be found
#lxc profile device add default root disk path=/ pool=default
#lxc profile show default
#lxc launch ubuntu:focal master -p juju-default --config=user.network-config="$(cat network.yml)"
lxc launch faster:ubuntu/jammy test1
lxc launch faster:ubuntu/jammy test2
#add two NICs from NET1 for two containers
lxc network create NET1 ipv6.address=none ipv4.address=10.139.160.1/24
lxc network attach NET1 test1 eth1
lxc network attach NET1 test1 eth2
lxc network attach NET1 test2 eth1
lxc network attach NET1 test2 eth2
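To confirm that both NICs really landed in each container, something like this can be used:
#eth1/eth2 should both show up as NET1 NICs
lxc config device show test1
lxc exec test1 -- ip -br link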
#https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking#vlan
#ip link add ptk0 type bond miimon 100 mode active-backup
#ip link set eth2 master ptk0
#ip link set eth1 master ptk0
lxc exec test1 -- /bin/bash
cat << EOF |tee /etc/netplan/11-test.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:15:bd:58
    eth2:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:68:72:0f
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - eth1
        - eth2
      parameters:
        mode: active-backup
        primary: eth1
  vlans:
    ptk0.700:
      id: 700
      link: ptk0
      dhcp4: no
      addresses: [ 10.139.160.10/24 ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
netplan apply
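Before configuring test2, a quick sanity check inside test1 (a sketch; the expected state is all links UP with the MACs above):
ip -br link | grep -E 'eth1|eth2|ptk0'
grep -E 'Currently Active Slave|Slave Interface' /proc/net/bonding/ptk0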
lxc exec test2 -- /bin/bash
cat << EOF |tee /etc/netplan/11-test.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:1e:19:25
    eth2:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:f7:9e:22
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - eth1
        - eth2
      parameters:
        mode: active-backup
        primary: eth1
  vlans:
    ptk0.700:
      id: 700
      link: ptk0
      dhcp4: no
      addresses: [ 10.139.160.11/24 ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
netplan apply
The above creates two LXD containers, each with an active/standby bond (ptk0) and a VLAN interface on top of it (ptk0.700). For the VLAN network to actually work, the trunk still has to be configured on the host, as shown below.
Note: the macaddress must be set explicitly for both NICs above; without it, after the bond and VLAN are created, all of the NICs end up sharing the same MAC.
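A quick way to verify that from the host (eth2 should keep 00:16:3e:68:72:0f rather than cloning eth1's MAC):
lxc exec test1 -- ip -br link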
$ sudo brctl show |grep NET1 -A3
NET1 8000.00163eeb79c4 no veth2af34c1d
veth3a5b458e
veth82c292b2
veth9b8e8cb6
#sudo bridge vlan add vid 2-4094 dev NET1 self
sudo bridge vlan add vid 700 dev NET1 self
sudo bridge vlan add vid 700 dev veth2af34c1d
sudo bridge vlan add vid 700 dev veth3a5b458e
sudo bridge vlan add vid 700 dev veth82c292b2
sudo bridge vlan add vid 700 dev veth9b8e8cb6
sudo bridge vlan show
At this point test1 can ping test2 over vlan 700:
root@test1:~# ping 10.139.160.11 -c1
PING 10.139.160.11 (10.139.160.11) 56(84) bytes of data.
64 bytes from 10.139.160.11: icmp_seq=1 ttl=64 time=0.133 ms
root@test2:~# tcpdump -i eth1 -nn -e -l
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
05:54:36.128602 00:16:3e:15:bd:58 > 00:16:3e:1e:19:25, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.10 > 10.139.160.11: ICMP echo request, id 37135, seq 1, length 64
05:54:36.128643 00:16:3e:1e:19:25 > 00:16:3e:15:bd:58, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.11 > 10.139.160.10: ICMP echo reply, id 37135, seq 1, length 64
But pinging the GW still fails:
root@test1:~# ping 10.139.160.1 -c1
PING 10.139.160.1 (10.139.160.1) 56(84) bytes of data.
From 10.139.160.10 icmp_seq=1 Destination Host Unreachable
$ sudo tcpdump -i NET1 -nn -e -l
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on NET1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:25:24.761131 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 700, p 0, ethertype ARP (0x0806), Request who-has 10.139.160.1 tell 10.139.160.10, length 28
The ARP request shows up on NET1 tagged with vlan 700, but the host's 10.139.160.1 lives on the untagged NET1 device, so it presumably never answers. Neither creating an eth0.700 nor creating a tap0 with vlan=700 makes the GW pingable either:
#use eth0.700
sudo ip link add link eth0 name eth0.700 type vlan id 700
sudo brctl addif NET1 eth0.700
sudo ifconfig eth0.700 up
sudo ip addr add 10.139.160.254/24 dev eth0.700
sudo bridge vlan add vid 700 dev eth0.700
#use a tap
sudo ip tuntap add mode tap tap0
sudo ip link set tap0 master NET1
sudo bridge vlan add dev tap0 vid 700 pvid untagged master
sudo ip addr add 10.139.160.254/24 dev tap0
sudo bridge vlan show
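One variant not tried here (a sketch only, left unverified): terminate vlan 700 on the bridge device itself, so the tagged ARP request lands on an interface that actually owns a GW address:
#untested sketch - NET1 'self' already carries vid 700, so a vlan subinterface of it should decapsulate
sudo ip link add link NET1 name NET1.700 type vlan id 700
sudo ip addr add 10.139.160.254/24 dev NET1.700
sudo ip link set NET1.700 up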
So let's treat test2 as the GW instead, ping it from test1, and capture the traffic.
If ICMP is used from just the active port:
root@test1:~# ping -I eth1 10.139.160.1 -c1
ping: Warning: source address might be selected on device other than: eth1
PING 10.139.160.1 (10.139.160.1) from 192.168.121.88 eth1: 56(84) bytes of data.
^C
--- 10.139.160.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
$ sudo tcpdump -i NET1 -nn -e -l
14:32:04.483156 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.139.160.1 tell 192.168.121.88, length 28
14:32:04.483185 00:16:3e:eb:79:c4 > 00:16:3e:15:bd:58, ethertype ARP (0x0806), length 42: Reply 10.139.160.1 is-at 00:16:3e:eb:79:c4, length 28
Running 'ping -I eth1 10.139.160.11 -c1' and 'ping -I eth2 10.139.160.11 -c1' both produce no output.
The arping command requires a source IP when sending an ARP request, but the standby port has no IP, so one is supplied via '-S':
root@test1:~# arping -I ptk0.700 10.139.160.11 -S 10.139.160.2 -C1
ARPING 10.139.160.11
42 bytes from 00:16:3e:1e:19:25 (10.139.160.11): index=0 time=8.119 usec
root@test2:~# sudo tcpdump -i ptk0.700 -nn -e -l
09:08:16.814374 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.139.160.11 tell 10.139.160.2, length 44
09:08:16.814410 00:16:3e:1e:19:25 > 00:16:3e:15:bd:58, ethertype ARP (0x0806), length 42: Reply 10.139.160.11 is-at 00:16:3e:1e:19:25, length 28
Running 'arping -I eth1 10.139.160.11 -S 10.139.160.2 -C1' and 'arping -I eth2 10.139.160.11 -S 10.139.160.2 -C1' both get no reply:
root@test1:~# arping -I eth2 10.139.160.11 -S 10.139.160.2 -C1
ARPING 10.139.160.11
Timeout
Is that because eth1 and eth2 are not on vlan=700? More likely it is because both are enslaved to ptk0: the bonding driver hands frames received on a slave up to the master (and in active-backup mode drops frames arriving on the backup slave), so a socket bound directly to a slave never sees the reply:
root@test1:~# cat /proc/net/bonding/ptk0
Ethernet Channel Bonding Driver: v5.15.0-43-generic
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:16:3e:15:bd:58
Slave queue ID: 0
Slave Interface: eth2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:16:3e:68:72:0f
Slave queue ID: 0
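To exercise the standby path in this lab, the active slave can be forced over with the bonding sysfs knobs and the arping repeated; a quick sketch inside test1:
cat /sys/class/net/ptk0/bonding/active_slave
#force a failover to eth2, then retry the arping via ptk0.700
echo eth2 |sudo tee /sys/class/net/ptk0/bonding/active_slave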
The same setup again, this time without netplan: the bond is created directly with plain CLI commands, and without the vlan-filtering approach - https://developers.redhat.com/blog/2017/09/14/vlan-filter-support-on-bridge#bridge_and_vlan
lxc launch faster:ubuntu/jammy test1
lxc launch faster:ubuntu/jammy test2
#add two NICs from NET1 for two containers
lxc network create NET1 ipv6.address=none ipv4.address=10.139.160.1/24
lxc network attach NET1 test1 eth1
lxc network attach NET1 test1 eth2
lxc network attach NET1 test2 eth1
lxc network attach NET1 test2 eth2
#inside test1
lxc exec test1 -- /bin/bash
sudo ip link add ptk0 type bond miimon 100 mode active-backup
sudo ip link set eth1 down
sudo ip link set eth1 master ptk0
sudo ip link set eth2 down
sudo ip link set eth2 master ptk0
sudo ip link set dev ptk0 address 00:16:3e:15:bd:58
sudo ip link set dev eth1 address 00:16:3e:15:bd:58
sudo ip link set dev eth2 address 00:16:3e:68:72:0f
sudo ip link set ptk0 up
sudo ip link add link ptk0 name ptk0.700 type vlan id 700
#the vlan device is created DOWN, so it must be brought up explicitly
sudo ip link set ptk0.700 up
sudo ip addr add 10.139.160.10/24 dev ptk0.700
#inside test2
lxc exec test2 -- /bin/bash
sudo ip link add ptk0 type bond miimon 100 mode active-backup
sudo ip link set eth1 down
sudo ip link set eth1 master ptk0
sudo ip link set eth2 down
sudo ip link set eth2 master ptk0
sudo ip link set dev ptk0 address 00:16:3e:1e:19:25
sudo ip link set dev eth1 address 00:16:3e:1e:19:25
sudo ip link set dev eth2 address 00:16:3e:f7:9e:22
sudo ip link set ptk0 up
sudo ip link add link ptk0 name ptk0.700 type vlan id 700
sudo ip link set ptk0.700 up
sudo ip addr add 10.139.160.11/24 dev ptk0.700
#on host
sudo bridge vlan add vid 700 dev NET1 self
brctl show NET1 |grep veth |xargs -i sudo bridge vlan add vid 700 dev {}
sudo bridge vlan show
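With the trunk in place, the earlier checks can be repeated end to end from the host (same addressing as above assumed):
lxc exec test1 -- ping -c1 10.139.160.11
lxc exec test1 -- arping -I ptk0.700 10.139.160.11 -S 10.139.160.2 -C1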
[1] LACP bond configuration - https://blog.csdn.net/quqi99/article/details/51251210
[2] Three ways to use VLANs - https://blog.csdn.net/quqi99/article/details/51218884
[3] Creating VLANs on OpenStack - https://blog.csdn.net/quqi99/article/details/118341936
[4] VLAN filter support on bridge - https://developers.redhat.com/blog/2017/09/14/vlan-filter-support-on-bridge