Understanding Kubernetes (k8s) Networking Internals, Part 6: Ways to Connect Pods on the Same Host, with a Performance Comparison
This installment was originally going to be about writing a CNI plugin, but there isn't much to say there: CNI source code is all over GitHub, and the general approach is simply to implement the commands from the earlier articles in Go. So, sticking with the write-whatever-comes-to-mind principle, I've decided to cover the other ways of connecting pods on the same host and compare their performance.
In the first article of this series we connected pods to the host with veth interfaces. In fact, there are several ways to connect pods on the same host:
- veth to the host, using the host as a router. This is what the first article covered (referred to below as the veth approach).
- A Linux bridge connecting the pods, with the gateway address assigned to the bridge. This is what flannel uses.
- macvlan. Its use cases are somewhat restricted; it generally can't be used on cloud platforms.
- ipvlan. It has kernel requirements; the default 3.10 kernel doesn't support it.
- Any of the above with eBPF support enabled on top.
In this article we compare the differences and performance of these approaches.
Because eBPF has kernel version requirements, my environment runs Linux kernel 4.18 on an ordinary PC with 8 cores and 32 GB of RAM.
When running the commands below, make sure the interface names you create don't clash with the host's physical NIC. My host NIC is eno2, so I create interfaces named eth0, but that name may well clash in your environment, so adjust it first.
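A quick way to see which interface names are already taken before picking one (just a sanity check, not required for the rest of the steps):
ip -br link show   ## pick a name that does not appear in this list and is not the physical NIC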
The veth approach
This approach was covered in detail in the first article, so let's go straight to the commands.
Create pod-a and pod-b:
ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-a up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-a/proxy_arp
ip route add 192.168.10.10 dev veth-pod-a scope link
ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-b up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-b/proxy_arp
ip route add 192.168.10.11 dev veth-pod-b scope link
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -I FORWARD -s 192.168.0.0/16 -d 192.168.0.0/16 -j ACCEPT
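Before benchmarking, a quick connectivity check between the two namespaces doesn't hurt (optional; plain ping is enough here):
ip netns exec pod-b ping -c 2 192.168.10.10   ## replies confirm the proxy-ARP and host routes above are working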
Start an iperf3 server in pod-a:
ip netns exec pod-a iperf3 -s
Open another terminal and run the client from pod-b against pod-a to measure throughput:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 35014 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.32 GBytes 62.9 Gbits/sec 0 964 KBytes
[ 5] 1.00-2.00 sec 7.90 GBytes 67.8 Gbits/sec 0 964 KBytes
[ 5] 2.00-3.00 sec 7.79 GBytes 66.9 Gbits/sec 0 1012 KBytes
[ 5] 3.00-4.00 sec 7.92 GBytes 68.0 Gbits/sec 0 1012 KBytes
[ 5] 4.00-5.00 sec 7.89 GBytes 67.8 Gbits/sec 0 1012 KBytes
[ 5] 5.00-6.00 sec 7.87 GBytes 67.6 Gbits/sec 0 1012 KBytes
[ 5] 6.00-7.00 sec 7.68 GBytes 66.0 Gbits/sec 0 1.16 MBytes
[ 5] 7.00-8.00 sec 7.79 GBytes 66.9 Gbits/sec 0 1.27 MBytes
[ 5] 8.00-9.00 sec 7.75 GBytes 66.5 Gbits/sec 0 1.47 MBytes
[ 5] 9.00-10.00 sec 7.90 GBytes 67.8 Gbits/sec 0 1.47 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 77.8 GBytes 66.8 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 77.8 GBytes 66.6 Gbits/sec receiver
Since applications deployed in containers usually talk to each other through a clusterIP, we should also test access via a clusterIP. First create the clusterIP with an iptables rule, as covered in the second article of this series:
iptables -A PREROUTING -t nat -d 10.96.0.100 -j DNAT --to-destination 192.168.10.10
To keep the results free from other interference, make sure this is the only NAT rule on the host. It can stay in place for all the tests below, so we won't keep deleting and recreating it.
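If you want to double-check that the nat table really contains nothing else (purely a verification step):
iptables -t nat -L -n -v --line-numbers   ## only the single DNAT rule added above should appear in PREROUTING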
The results:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 44528 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 6.41 GBytes 55.0 Gbits/sec 0 725 KBytes
[ 5] 1.00-2.00 sec 7.21 GBytes 61.9 Gbits/sec 0 902 KBytes
[ 5] 2.00-3.00 sec 7.99 GBytes 68.6 Gbits/sec 0 902 KBytes
[ 5] 3.00-4.00 sec 7.88 GBytes 67.7 Gbits/sec 0 902 KBytes
[ 5] 4.00-5.00 sec 7.94 GBytes 68.2 Gbits/sec 0 947 KBytes
[ 5] 5.00-6.00 sec 7.86 GBytes 67.5 Gbits/sec 0 947 KBytes
[ 5] 6.00-7.00 sec 7.79 GBytes 66.9 Gbits/sec 0 947 KBytes
[ 5] 7.00-8.00 sec 8.04 GBytes 69.1 Gbits/sec 0 947 KBytes
[ 5] 8.00-9.00 sec 7.85 GBytes 67.5 Gbits/sec 0 996 KBytes
[ 5] 9.00-10.00 sec 7.88 GBytes 67.7 Gbits/sec 0 996 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 76.9 GBytes 66.0 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 76.9 GBytes 65.8 Gbits/sec receiver
As you can see, with the veth approach there is little difference between accessing the pod by podIP and via the clusterIP. Of course, once there are many iptables rules (say, 2000 Kubernetes services), performance does suffer, but that is not the focus of this article.
Clean up: stop the iperf3 server first, then delete the namespaces (keep the iptables rule above):
ip netns del pod-a
ip netns del pod-b
Next, let's try the bridge approach.
The bridge approach
If the bridge is used purely as a layer-2 switch to connect the two pods, performance is quite good, better than the veth approach above. But to support Kubernetes service IPs, packets need to traverse the host's iptables rules, so the usual practice is to assign a gateway IP to the bridge. That way the bridge can answer the pods' ARP requests (so there is no need to enable proxy ARP on the host side of the veth pairs), and packets also pass through the host's netfilter hooks, so the iptables rules take effect.
In theory a Linux bridge is a layer-2 switch, yet the source shows that the bridge really does invoke the netfilter hook functions (PREROUTING/INPUT/FORWARD/OUTPUT/POSTROUTING). There is of course a switch to turn this off (net.bridge.bridge-nf-call-iptables).
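On my machine this sysctl only appears once the br_netfilter module is loaded (this may differ on your distribution), so the current state can be checked like this before the test:
lsmod | grep br_netfilter                    ## confirm the module is loaded
sysctl net.bridge.bridge-nf-call-iptables    ## 1 means bridged traffic is handed to iptables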
Now let's measure the performance of connecting the two pods with a bridge. Create a bridge br0 and attach pod-a and pod-b to it:
ip link add br0 type bridge
ip addr add 192.168.10.1 dev br0 ## assign the pods' default gateway IP to br0
ip link set br0 up
ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0 onlink ## the default gateway is br0's address
ip link set veth-pod-a master br0 ## plug the host side of the veth pair into the bridge
ip link set veth-pod-a up
ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0 onlink
ip link set veth-pod-b master br0
ip link set veth-pod-b up
ip route add 192.168.10.0/24 via 192.168.10.1 dev br0 scope link ## route from the host to all pods, next hop br0
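To verify that both host-side veth interfaces really are ports of br0 (optional):
bridge link show   ## veth-pod-a and veth-pod-b should both show 'master br0'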
Run iperf3 in pod-a again:
ip netns exec pod-a iperf3 -s
Measure from pod-b:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 38232 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.64 GBytes 65.6 Gbits/sec 0 642 KBytes
[ 5] 1.00-2.00 sec 7.71 GBytes 66.2 Gbits/sec 0 642 KBytes
[ 5] 2.00-3.00 sec 7.49 GBytes 64.4 Gbits/sec 0 786 KBytes
[ 5] 3.00-4.00 sec 7.61 GBytes 65.3 Gbits/sec 0 786 KBytes
[ 5] 4.00-5.00 sec 7.54 GBytes 64.8 Gbits/sec 0 786 KBytes
[ 5] 5.00-6.00 sec 7.71 GBytes 66.2 Gbits/sec 0 786 KBytes
[ 5] 6.00-7.00 sec 7.66 GBytes 65.8 Gbits/sec 0 826 KBytes
[ 5] 7.00-8.00 sec 7.63 GBytes 65.5 Gbits/sec 0 826 KBytes
[ 5] 8.00-9.00 sec 7.64 GBytes 65.7 Gbits/sec 0 826 KBytes
[ 5] 9.00-10.00 sec 7.71 GBytes 66.2 Gbits/sec 0 826 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 76.3 GBytes 65.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 76.3 GBytes 65.3 Gbits/sec receiver
Not much different from veth. I repeated the test several times with similar results, which didn't seem right to me: judging from the source, the bridge forwarding path should definitely be faster than forwarding through the host's protocol stack. So I wondered whether disabling the bridge's netfilter hooks would make a difference:
sysctl -w net.bridge.bridge-nf-call-iptables=0
Try again:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 40658 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.55 GBytes 73.4 Gbits/sec 0 810 KBytes
[ 5] 1.00-2.00 sec 8.86 GBytes 76.1 Gbits/sec 0 940 KBytes
[ 5] 2.00-3.00 sec 8.96 GBytes 77.0 Gbits/sec 0 940 KBytes
[ 5] 3.00-4.00 sec 9.04 GBytes 77.6 Gbits/sec 0 940 KBytes
[ 5] 4.00-5.00 sec 8.89 GBytes 76.4 Gbits/sec 0 987 KBytes
[ 5] 5.00-6.00 sec 9.18 GBytes 78.9 Gbits/sec 0 987 KBytes
[ 5] 6.00-7.00 sec 9.09 GBytes 78.1 Gbits/sec 0 987 KBytes
[ 5] 7.00-8.00 sec 9.10 GBytes 78.1 Gbits/sec 0 1.01 MBytes
[ 5] 8.00-9.00 sec 8.98 GBytes 77.1 Gbits/sec 0 1.12 MBytes
[ 5] 9.00-10.00 sec 9.13 GBytes 78.4 Gbits/sec 0 1.12 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 89.8 GBytes 77.1 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 89.8 GBytes 76.8 Gbits/sec receiver
Sure enough, it is quite a bit faster. The result still surprised me somewhat: I had made sure there was only one iptables rule on the host, yet merely having the bridge call the netfilter hooks makes a difference of about 10 Gbits/sec.
However, this flag cannot stay off: as mentioned earlier, we need the iptables rules to translate the clusterIP into a podIP, so turn it back on:
sysctl -w net.bridge.bridge-nf-call-iptables=1
Now test with the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 47414 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.02 GBytes 60.3 Gbits/sec 0 669 KBytes
[ 5] 1.00-2.00 sec 7.23 GBytes 62.1 Gbits/sec 0 738 KBytes
[ 5] 2.00-3.00 sec 7.17 GBytes 61.6 Gbits/sec 0 899 KBytes
[ 5] 3.00-4.00 sec 7.21 GBytes 62.0 Gbits/sec 0 899 KBytes
[ 5] 4.00-5.00 sec 7.31 GBytes 62.8 Gbits/sec 0 899 KBytes
[ 5] 5.00-6.00 sec 7.19 GBytes 61.8 Gbits/sec 0 1008 KBytes
[ 5] 6.00-7.00 sec 7.24 GBytes 62.2 Gbits/sec 0 1008 KBytes
[ 5] 7.00-8.00 sec 7.22 GBytes 62.0 Gbits/sec 0 1.26 MBytes
[ 5] 8.00-9.00 sec 6.99 GBytes 60.0 Gbits/sec 0 1.26 MBytes
[ 5] 9.00-10.00 sec 7.07 GBytes 60.8 Gbits/sec 0 1.26 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 71.6 GBytes 61.5 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 71.6 GBytes 61.3 Gbits/sec receiver
I repeated this several times and got roughly the same numbers: accessing via the clusterIP is about 8% slower than direct podIP access.
Next, macvlan. Clean up first:
ip netns del pod-a
ip netns del pod-b
ip link del br0
macvlan
macvlan virtualizes multiple network interfaces on top of one physical NIC. Each virtual interface has its own MAC address and can be assigned an IP address. In bridge mode (the other modes aren't suitable here, so we won't discuss them), the parent interface acts as a switch for communication between the sub-interfaces, and the sub-interfaces reach the outside world through the parent.
We use bridge mode:
ip link add link eno2 name eth0 type macvlan mode bridge ## eno2 is my physical NIC; eth0 is the virtual interface name; adjust for your environment
ip netns add pod-a
ip l set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0
ip link add link eno2 name eth0 type macvlan mode bridge
ip netns add pod-b
ip l set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0
ip link add link eno2 name eth0 type macvlan mode bridge
ip addr add 192.168.10.1/24 dev eth0 ## create one more sub-interface and keep it on the host, holding the pods' default gateway IP, so traffic from the pods to the clusterIP reaches the host and the host's iptables rules get a chance to run
ip link set eth0 up
iptables -A POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE ## the return packets must also pass through the host stack, so do source NAT
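If you want to confirm the sub-interfaces really are macvlan devices in bridge mode, each with its own MAC address (verification only, not needed for the test):
ip -d link show eth0                         ## host-side sub-interface; the details should read 'macvlan mode bridge'
ip netns exec pod-a ip -d link show eth0     ## pod-a's sub-interface; note its MAC differs from eno2's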
Results:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 47050 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 9.47 GBytes 81.3 Gbits/sec 0 976 KBytes
[ 5] 1.00-2.00 sec 9.81 GBytes 84.3 Gbits/sec 0 976 KBytes
[ 5] 2.00-3.00 sec 9.74 GBytes 83.6 Gbits/sec 0 1.16 MBytes
[ 5] 3.00-4.00 sec 9.77 GBytes 83.9 Gbits/sec 0 1.16 MBytes
[ 5] 4.00-5.00 sec 9.73 GBytes 83.6 Gbits/sec 0 1.16 MBytes
[ 5] 5.00-6.00 sec 9.69 GBytes 83.2 Gbits/sec 0 1.16 MBytes
[ 5] 6.00-7.00 sec 9.75 GBytes 83.7 Gbits/sec 0 1.22 MBytes
[ 5] 7.00-8.00 sec 9.78 GBytes 84.0 Gbits/sec 0 1.30 MBytes
[ 5] 8.00-9.00 sec 9.81 GBytes 84.3 Gbits/sec 0 1.30 MBytes
[ 5] 9.00-10.00 sec 9.84 GBytes 84.5 Gbits/sec 0 1.30 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 97.4 GBytes 83.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 97.4 GBytes 83.3 Gbits/sec receiver
Direct podIP access is very fast, the fastest of all these approaches.
Now via the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 35780 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.16 GBytes 61.5 Gbits/sec 0 477 KBytes
[ 5] 1.00-2.00 sec 7.18 GBytes 61.7 Gbits/sec 0 477 KBytes
[ 5] 2.00-3.00 sec 7.13 GBytes 61.2 Gbits/sec 0 608 KBytes
[ 5] 3.00-4.00 sec 7.18 GBytes 61.7 Gbits/sec 0 670 KBytes
[ 5] 4.00-5.00 sec 7.16 GBytes 61.5 Gbits/sec 0 670 KBytes
[ 5] 5.00-6.00 sec 7.30 GBytes 62.7 Gbits/sec 0 822 KBytes
[ 5] 6.00-7.00 sec 7.34 GBytes 63.1 Gbits/sec 0 822 KBytes
[ 5] 7.00-8.00 sec 7.30 GBytes 62.7 Gbits/sec 0 1.00 MBytes
[ 5] 8.00-9.00 sec 7.19 GBytes 61.8 Gbits/sec 0 1.33 MBytes
[ 5] 9.00-10.00 sec 7.15 GBytes 61.4 Gbits/sec 0 1.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 72.1 GBytes 61.9 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 72.1 GBytes 61.7 Gbits/sec receiver
One pass through the host's kernel stack costs nearly 25%.
Clean up:
ip netns del pod-a
ip netns del pod-b
iptables -D POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE
ipvlan
Similar to macvlan, ipvlan also creates multiple virtual sub-interfaces on one physical NIC. The difference is that all ipvlan sub-interfaces share the same MAC address and only their IP addresses differ.
ipvlan has l2 and l3 modes. In l2 mode it works much like macvlan: the parent interface acts as a switch forwarding packets between the sub-interfaces. The difference is that when forwarding, ipvlan decides that traffic is between sub-interfaces by checking whether dmac == smac.
l2 mode:
ip l add link eno2 name eth0 type ipvlan mode l2
ip netns add pod-a
ip l set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0
ip l add link eno2 name eth0 type ipvlan mode l2
ip netns add pod-b
ip l set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0
ip link add link eno2 name eth0 type ipvlan mode l2
ip link set eth0 up
ip addr add 192.168.10.1/24 dev eth0
iptables -A POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE ## same as before: the return packets must pass through the host stack, so do source NAT
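Unlike macvlan, every ipvlan sub-interface inherits the parent's MAC address, which is easy to confirm (verification only):
ip link show eno2 | grep ether                       ## the parent's MAC
ip netns exec pod-a ip link show eth0 | grep ether   ## the same MAC shows up on the sub-interface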
Results:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 59580 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 9.70 GBytes 83.3 Gbits/sec 0 748 KBytes
[ 5] 1.00-2.00 sec 9.71 GBytes 83.4 Gbits/sec 0 748 KBytes
[ 5] 2.00-3.00 sec 10.1 GBytes 86.9 Gbits/sec 0 748 KBytes
[ 5] 3.00-4.00 sec 9.84 GBytes 84.5 Gbits/sec 0 826 KBytes
[ 5] 4.00-5.00 sec 9.82 GBytes 84.3 Gbits/sec 0 826 KBytes
[ 5] 5.00-6.00 sec 9.74 GBytes 83.6 Gbits/sec 0 1.19 MBytes
[ 5] 6.00-7.00 sec 9.77 GBytes 83.9 Gbits/sec 0 1.19 MBytes
[ 5] 7.00-8.00 sec 9.53 GBytes 81.8 Gbits/sec 0 1.41 MBytes
[ 5] 8.00-9.00 sec 9.56 GBytes 82.1 Gbits/sec 0 1.41 MBytes
[ 5] 9.00-10.00 sec 9.54 GBytes 82.0 Gbits/sec 0 1.41 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 97.3 GBytes 83.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 97.3 GBytes 83.3 Gbits/sec receiver
The results show that in ipvlan l2 mode, same-host pod-to-pod performance is about the same as macvlan; l3 mode is similar, so it is not shown.
Via the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 38540 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.17 GBytes 61.6 Gbits/sec 0 611 KBytes
[ 5] 1.00-2.00 sec 7.22 GBytes 62.0 Gbits/sec 0 708 KBytes
[ 5] 2.00-3.00 sec 7.31 GBytes 62.8 Gbits/sec 0 708 KBytes
[ 5] 3.00-4.00 sec 7.29 GBytes 62.6 Gbits/sec 0 833 KBytes
[ 5] 4.00-5.00 sec 7.27 GBytes 62.4 Gbits/sec 0 833 KBytes
[ 5] 5.00-6.00 sec 7.36 GBytes 63.3 Gbits/sec 0 833 KBytes
[ 5] 6.00-7.00 sec 7.26 GBytes 62.4 Gbits/sec 0 874 KBytes
[ 5] 7.00-8.00 sec 7.19 GBytes 61.8 Gbits/sec 0 874 KBytes
[ 5] 8.00-9.00 sec 7.17 GBytes 61.6 Gbits/sec 0 874 KBytes
[ 5] 9.00-10.00 sec 7.28 GBytes 62.5 Gbits/sec 0 874 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 72.5 GBytes 62.3 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 72.5 GBytes 62.0 Gbits/sec receiver
Again, clusterIP access is about 25% slower than direct podIP access. There is another problem, too: same-host pod-to-pod traffic never traverses the iptables rules, so network policies cannot take effect. It would be nice if the clusterIP problem could be solved cleanly in ipvlan or macvlan mode. Next, let's try attaching eBPF programs to the NIC inside the pod to handle the clusterIP.
Attaching eBPF programs in ipvlan mode
The following requires attaching our custom eBPF program (mst_lxc.o), so you won't be able to follow along; just look at the results.
Attach the eBPF programs to tc ingress and tc egress on pod-b's eth0 (because we are inside the pod, the programs are attached the other way around: they were written with the host-side NIC in mind):
ip netns exec pod-b bash
ulimit -l unlimited
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress # the ingress direction gets lxc-egress, which mainly performs the reverse DNAT: podIP -> clusterIP
tc filter add dev eth0 egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress # the egress direction performs the DNAT: clusterIP -> podIP
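Still inside the pod-b shell opened above, the attachments can be checked with tc (verification only):
tc filter show dev eth0 ingress   ## should list the bpf filter loaded from section lxc-egress
tc filter show dev eth0 egress    ## should list the bpf filter loaded from section lxc-ingress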
Add the service and its backend (functionally similar to the iptables DNAT rule above):
mustang svc add --service=10.96.0.100:5201 --backend=192.168.10.10:5201
I0908 09:51:35.754539 5694 bpffs_linux.go:260] Detected mounted BPF filesystem at /sys/fs/bpf
I0908 09:51:35.891046 5694 service.go:578] Restored services from maps
I0908 09:51:35.891103 5694 bpf_svc_add.go:86] created success,id:1
Check the configuration (mustang is our custom user-space tool for manipulating the eBPF maps):
mustang svc ls
==========================================================================
Mustang Service count:1
==========================================================================
id pro service port backends
--------------------------------------------------------------------------
1 NONE 10.96.0.100 5201 192.168.10.10:5201,
==========================================================================
With everything in place, test access to pod-a via the clusterIP again:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 57152 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.96 GBytes 77.0 Gbits/sec 0 1.04 MBytes
[ 5] 1.00-2.00 sec 8.83 GBytes 75.8 Gbits/sec 0 1.26 MBytes
[ 5] 2.00-3.00 sec 9.13 GBytes 78.4 Gbits/sec 0 1.46 MBytes
[ 5] 3.00-4.00 sec 9.10 GBytes 78.2 Gbits/sec 0 1.46 MBytes
[ 5] 4.00-5.00 sec 9.37 GBytes 80.5 Gbits/sec 0 1.46 MBytes
[ 5] 5.00-6.00 sec 9.44 GBytes 81.1 Gbits/sec 0 1.46 MBytes
[ 5] 6.00-7.00 sec 9.14 GBytes 78.5 Gbits/sec 0 1.46 MBytes
[ 5] 7.00-8.00 sec 9.36 GBytes 80.4 Gbits/sec 0 1.68 MBytes
[ 5] 8.00-9.00 sec 9.11 GBytes 78.3 Gbits/sec 0 1.68 MBytes
[ 5] 9.00-10.00 sec 9.29 GBytes 79.8 Gbits/sec 0 1.68 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 91.7 GBytes 78.8 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 91.7 GBytes 78.5 Gbits/sec receiver
As you can see, in ipvlan mode, implementing the clusterIP with eBPF programs improves performance dramatically: compared with direct podIP access the loss is under 5%, and compared with implementing the clusterIP via the host's iptables rules it is about 20% faster, making it the best of all the approaches above. Our in-house CNI (mustang) supports this mode, and also supports attaching eBPF programs in the veth approach; the test results for veth with eBPF follow.
First, clean up the previous setup.
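Following the same pattern as the earlier clean-ups (stop the iperf3 server in pod-a first), that roughly means:
ip netns del pod-a
ip netns del pod-b
ip link del eth0     ## the host-side ipvlan sub-interface holding 192.168.10.1
iptables -D POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE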
Attaching eBPF programs in the veth approach
- First, create pod-a and pod-b again:
ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-a up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-a/proxy_arp
ip route add 192.168.10.10 dev veth-pod-a scope link
ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-b up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-b/proxy_arp
ip route add 192.168.10.11 dev veth-pod-b scope link
echo 1 > /proc/sys/net/ipv4/ip_forward
- Attach the eBPF programs
Attach eBPF programs to both tc ingress and tc egress on the host-side interfaces of pod-a and pod-b:
tc qdisc add dev veth-pod-b clsact
tc filter add dev veth-pod-b ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress ## attached on the host-side interface, so the directions are the opposite of attaching inside the container
tc filter add dev veth-pod-b egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress
tc qdisc add dev veth-pod-a clsact
tc filter add dev veth-pod-a ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress
tc filter add dev veth-pod-a egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress
- Add the service and the endpoints (so packets can be redirected quickly):
mustang svc add --service=10.96.0.100:5201 --backend=192.168.10.10:5201
mustang ep add --ifname=eth0 --netns=/var/run/netns/pod-a
mustang ep add --ifname=eth0 --netns=/var/run/netns/pod-b
- Check the configuration
mustang svc ls
==========================================================================
Mustang Service count:1
==========================================================================
id pro service port backends
--------------------------------------------------------------------------
1 NONE 10.96.0.100 5201 192.168.10.10:5201,
==========================================================================
mustang ep ls
==========================================================================
Mustang Endpoint count:2
==========================================================================
Id IP Host IfIndex LxcMAC NodeMAC
--------------------------------------------------------------------------
1 192.168.10.10 false 653 76:6C:37:2C:81:05 3E:E9:02:96:60:D6
2 192.168.10.11 false 655 6A:CD:11:13:76:2C 72:90:74:4A:CB:84
==========================================================================
- Time to test. Same routine as before: run the server in pod-a and the client in pod-b. First via the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 36492 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.52 GBytes 73.2 Gbits/sec 0 1.07 MBytes
[ 5] 1.00-2.00 sec 8.54 GBytes 73.4 Gbits/sec 0 1.07 MBytes
[ 5] 2.00-3.00 sec 8.64 GBytes 74.2 Gbits/sec 0 1.07 MBytes
[ 5] 3.00-4.00 sec 8.57 GBytes 73.6 Gbits/sec 0 1.12 MBytes
[ 5] 4.00-5.00 sec 8.61 GBytes 73.9 Gbits/sec 0 1.18 MBytes
[ 5] 5.00-6.00 sec 8.48 GBytes 72.8 Gbits/sec 0 1.18 MBytes
[ 5] 6.00-7.00 sec 8.57 GBytes 73.6 Gbits/sec 0 1.18 MBytes
[ 5] 7.00-8.00 sec 9.11 GBytes 78.3 Gbits/sec 0 1.18 MBytes
[ 5] 8.00-9.00 sec 8.86 GBytes 76.1 Gbits/sec 0 1.18 MBytes
[ 5] 9.00-10.00 sec 8.71 GBytes 74.8 Gbits/sec 0 1.18 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 86.6 GBytes 74.4 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 86.6 GBytes 74.1 Gbits/sec receiver
Now directly via the podIP:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 56460 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.76 GBytes 75.3 Gbits/sec 0 676 KBytes
[ 5] 1.00-2.00 sec 8.87 GBytes 76.2 Gbits/sec 0 708 KBytes
[ 5] 2.00-3.00 sec 8.78 GBytes 75.5 Gbits/sec 0 708 KBytes
[ 5] 3.00-4.00 sec 8.99 GBytes 77.2 Gbits/sec 0 708 KBytes
[ 5] 4.00-5.00 sec 8.85 GBytes 76.1 Gbits/sec 0 782 KBytes
[ 5] 5.00-6.00 sec 8.98 GBytes 77.1 Gbits/sec 0 782 KBytes
[ 5] 6.00-7.00 sec 8.64 GBytes 74.2 Gbits/sec 0 1.19 MBytes
[ 5] 7.00-8.00 sec 8.40 GBytes 72.2 Gbits/sec 0 1.55 MBytes
[ 5] 8.00-9.00 sec 8.16 GBytes 70.1 Gbits/sec 0 1.88 MBytes
[ 5] 9.00-10.00 sec 8.16 GBytes 70.1 Gbits/sec 0 1.88 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 86.6 GBytes 74.4 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 86.6 GBytes 74.1 Gbits/sec receiver
With the eBPF programs attached, podIP access is roughly 12% faster than before (see the plain veth results at the top: 66 Gbits/sec), and there is almost no difference between clusterIP and podIP access.
Summary
- veth: (podIP) 66 Gbits/sec, (clusterIP) 66 Gbits/sec
- bridge: (podIP) 77 Gbits/sec, (clusterIP) 61 Gbits/sec
- macvlan: (podIP) 83 Gbits/sec, (clusterIP) 62 Gbits/sec
- ipvlan: (podIP) 83 Gbits/sec, (clusterIP) 62 Gbits/sec
- ipvlan + eBPF: (podIP) 83 Gbits/sec, (clusterIP) 78 Gbits/sec
- veth + eBPF: (podIP) 74 Gbits/sec, (clusterIP) 74 Gbits/sec
To sum up:
- macvlan and ipvlan give the highest podIP performance of all the approaches, but clusterIP access goes through the host stack and drops about 25%; with eBPF it gains back about 20%. macvlan, however, doesn't work on wireless NICs, is limited by the number of MAC addresses a NIC supports, and fails cloud platforms' source/destination checks, so ipvlan + eBPF is generally the recommended combination. Attaching the eBPF programs in this mode is still a bit fiddly: in the demo above, for brevity, the eBPF maps were created inside the pod, whereas in practice the maps are created on the host and shared by all pods.
- With eBPF programs attached in the veth approach, serviceIP and podIP access perform the same, about 12% better than without eBPF, and the host's iptables firewall rules can still be used. Attaching the eBPF programs on the host side is also simpler, so this is mustang's default mode.
- The bridge approach outperforms veth for podIP access, but running the iptables rules drags it down, so without eBPF the veth approach is still the more recommended one; it is calico's default. The bridge approach is flannel's default.