Understanding Kubernetes (k8s) Networking Internals, Part 6: Ways to Connect Pods on the Same Host, with a Performance Comparison
This installment was originally going to be about writing a CNI plugin, but there isn't much to say there: CNI source code is all over GitHub, and the general approach is simply to implement the commands from the earlier articles in Go. So, sticking with the write-whatever-comes-to-mind principle, I've decided to cover the other ways of connecting pods on the same host and compare their performance.
In the first article of this series we connected pods to the host with veth interfaces. In fact, there are several ways to connect pods on the same host:
- veth to the host, using the host as a router. This is what the first article covered (referred to below as the veth approach).
- A Linux bridge connecting the pods, with the gateway address assigned to the bridge. This is what flannel uses.
- macvlan. Its use cases are somewhat restricted; it generally can't be used on cloud platforms.
- ipvlan. It has kernel requirements; the default 3.10 kernel doesn't support it.
- Any of the above with eBPF support enabled on top.
In this article we compare the differences and performance of these approaches.
Because eBPF has kernel version requirements, my environment runs Linux kernel 4.18 on an ordinary PC with 8 cores and 32 GB of RAM.
When running the commands below, make sure the interface names you create don't clash with the host's physical NIC. My host NIC is eno2, so I create interfaces named eth0, but that name may well clash in your environment, so adjust it first.
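A quick way to see which interface names are already taken before picking one (just a sanity check, not required for the rest of the steps):
ip -br link show   ## pick a name that does not appear in this list and is not the physical NIC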
The veth approach
This approach was covered in detail in the first article, so let's go straight to the commands.
Create pod-a and pod-b:
ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-a up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-a/proxy_arp
ip route add 192.168.10.10 dev veth-pod-a scope link
ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-b up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-b/proxy_arp
ip route add 192.168.10.11 dev veth-pod-b scope link
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -I FORWARD -s 192.168.0.0/16 -d 192.168.0.0/16 -j ACCEPT
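Before benchmarking, a quick connectivity check between the two namespaces doesn't hurt (optional; plain ping is enough here):
ip netns exec pod-b ping -c 2 192.168.10.10   ## replies confirm the proxy-ARP and host routes above are working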
Start an iperf3 server in pod-a:
ip netns exec pod-a iperf3 -s
Open another terminal and run the client from pod-b against pod-a to measure throughput:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 35014 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.32 GBytes 62.9 Gbits/sec 0 964 KBytes
[ 5] 1.00-2.00 sec 7.90 GBytes 67.8 Gbits/sec 0 964 KBytes
[ 5] 2.00-3.00 sec 7.79 GBytes 66.9 Gbits/sec 0 1012 KBytes
[ 5] 3.00-4.00 sec 7.92 GBytes 68.0 Gbits/sec 0 1012 KBytes
[ 5] 4.00-5.00 sec 7.89 GBytes 67.8 Gbits/sec 0 1012 KBytes
[ 5] 5.00-6.00 sec 7.87 GBytes 67.6 Gbits/sec 0 1012 KBytes
[ 5] 6.00-7.00 sec 7.68 GBytes 66.0 Gbits/sec 0 1.16 MBytes
[ 5] 7.00-8.00 sec 7.79 GBytes 66.9 Gbits/sec 0 1.27 MBytes
[ 5] 8.00-9.00 sec 7.75 GBytes 66.5 Gbits/sec 0 1.47 MBytes
[ 5] 9.00-10.00 sec 7.90 GBytes 67.8 Gbits/sec 0 1.47 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 77.8 GBytes 66.8 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 77.8 GBytes 66.6 Gbits/sec receiver
Since applications deployed in containers usually talk to each other through a clusterIP, we should also test access via a clusterIP. First create the clusterIP with an iptables rule, as covered in the second article of this series:
iptables -A PREROUTING -t nat -d 10.96.0.100 -j DNAT --to-destination 192.168.10.10
To keep the results free from other interference, make sure this is the only NAT rule on the host. It can stay in place for all the tests below, so we won't keep deleting and recreating it.
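If you want to double-check that the nat table really contains nothing else (purely a verification step):
iptables -t nat -L -n -v --line-numbers   ## only the single DNAT rule added above should appear in PREROUTING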
The results:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 44528 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 6.41 GBytes 55.0 Gbits/sec 0 725 KBytes
[ 5] 1.00-2.00 sec 7.21 GBytes 61.9 Gbits/sec 0 902 KBytes
[ 5] 2.00-3.00 sec 7.99 GBytes 68.6 Gbits/sec 0 902 KBytes
[ 5] 3.00-4.00 sec 7.88 GBytes 67.7 Gbits/sec 0 902 KBytes
[ 5] 4.00-5.00 sec 7.94 GBytes 68.2 Gbits/sec 0 947 KBytes
[ 5] 5.00-6.00 sec 7.86 GBytes 67.5 Gbits/sec 0 947 KBytes
[ 5] 6.00-7.00 sec 7.79 GBytes 66.9 Gbits/sec 0 947 KBytes
[ 5] 7.00-8.00 sec 8.04 GBytes 69.1 Gbits/sec 0 947 KBytes
[ 5] 8.00-9.00 sec 7.85 GBytes 67.5 Gbits/sec 0 996 KBytes
[ 5] 9.00-10.00 sec 7.88 GBytes 67.7 Gbits/sec 0 996 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 76.9 GBytes 66.0 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 76.9 GBytes 65.8 Gbits/sec receiver
As you can see, with the veth approach there is little difference between accessing the pod by podIP and via the clusterIP. Of course, once there are many iptables rules (say, 2000 Kubernetes services), performance does suffer, but that is not the focus of this article.
Clean up: stop the iperf3 server first, then delete the namespaces (keep the iptables rule above):
ip netns del pod-a
ip netns del pod-b
Next, let's try the bridge approach.
The bridge approach
If the bridge is used purely as a layer-2 switch to connect the two pods, performance is quite good, better than the veth approach above. But to support Kubernetes service IPs, packets need to traverse the host's iptables rules, so the usual practice is to assign a gateway IP to the bridge. That way the bridge can answer the pods' ARP requests (so there is no need to enable proxy ARP on the host side of the veth pairs), and packets also pass through the host's netfilter hooks, so the iptables rules take effect.
In theory a Linux bridge is a layer-2 switch, yet the source shows that the bridge really does invoke the netfilter hook functions (PREROUTING/INPUT/FORWARD/OUTPUT/POSTROUTING). There is of course a switch to turn this off (net.bridge.bridge-nf-call-iptables).
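On my machine this sysctl only appears once the br_netfilter module is loaded (this may differ on your distribution), so the current state can be checked like this before the test:
lsmod | grep br_netfilter                    ## confirm the module is loaded
sysctl net.bridge.bridge-nf-call-iptables    ## 1 means bridged traffic is handed to iptables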
Now let's measure the performance of connecting the two pods with a bridge. Create a bridge br0 and attach pod-a and pod-b to it:
ip link add br0 type bridge
ip addr add 192.168.10.1 dev br0 ## assign the pods' default gateway IP to br0
ip link set br0 up
ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0 onlink ## the default gateway is br0's address
ip link set veth-pod-a master br0 ## plug the host side of the veth pair into the bridge
ip link set veth-pod-a up
ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0 onlink
ip link set veth-pod-b master br0
ip link set veth-pod-b up
ip route add 192.168.10.0/24 via 192.168.10.1 dev br0 scope link ## route from the host to all pods, next hop br0
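To verify that both host-side veth interfaces really are ports of br0 (optional):
bridge link show   ## veth-pod-a and veth-pod-b should both show 'master br0'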
Run iperf3 in pod-a again:
ip netns exec pod-a iperf3 -s
Measure from pod-b:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 38232 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.64 GBytes 65.6 Gbits/sec 0 642 KBytes
[ 5] 1.00-2.00 sec 7.71 GBytes 66.2 Gbits/sec 0 642 KBytes
[ 5] 2.00-3.00 sec 7.49 GBytes 64.4 Gbits/sec 0 786 KBytes
[ 5] 3.00-4.00 sec 7.61 GBytes 65.3 Gbits/sec 0 786 KBytes
[ 5] 4.00-5.00 sec 7.54 GBytes 64.8 Gbits/sec 0 786 KBytes
[ 5] 5.00-6.00 sec 7.71 GBytes 66.2 Gbits/sec 0 786 KBytes
[ 5] 6.00-7.00 sec 7.66 GBytes 65.8 Gbits/sec 0 826 KBytes
[ 5] 7.00-8.00 sec 7.63 GBytes 65.5 Gbits/sec 0 826 KBytes
[ 5] 8.00-9.00 sec 7.64 GBytes 65.7 Gbits/sec 0 826 KBytes
[ 5] 9.00-10.00 sec 7.71 GBytes 66.2 Gbits/sec 0 826 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 76.3 GBytes 65.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 76.3 GBytes 65.3 Gbits/sec receiver
Not much different from veth. I repeated the test several times with similar results, which didn't seem right to me: judging from the source, the bridge forwarding path should definitely be faster than forwarding through the host's protocol stack. So I wondered whether disabling the bridge's netfilter hooks would make a difference:
sysctl -w net.bridge.bridge-nf-call-iptables=0
Try again:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 40658 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.55 GBytes 73.4 Gbits/sec 0 810 KBytes
[ 5] 1.00-2.00 sec 8.86 GBytes 76.1 Gbits/sec 0 940 KBytes
[ 5] 2.00-3.00 sec 8.96 GBytes 77.0 Gbits/sec 0 940 KBytes
[ 5] 3.00-4.00 sec 9.04 GBytes 77.6 Gbits/sec 0 940 KBytes
[ 5] 4.00-5.00 sec 8.89 GBytes 76.4 Gbits/sec 0 987 KBytes
[ 5] 5.00-6.00 sec 9.18 GBytes 78.9 Gbits/sec 0 987 KBytes
[ 5] 6.00-7.00 sec 9.09 GBytes 78.1 Gbits/sec 0 987 KBytes
[ 5] 7.00-8.00 sec 9.10 GBytes 78.1 Gbits/sec 0 1.01 MBytes
[ 5] 8.00-9.00 sec 8.98 GBytes 77.1 Gbits/sec 0 1.12 MBytes
[ 5] 9.00-10.00 sec 9.13 GBytes 78.4 Gbits/sec 0 1.12 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 89.8 GBytes 77.1 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 89.8 GBytes 76.8 Gbits/sec receiver
Sure enough, it is quite a bit faster. The result still surprised me somewhat: I had made sure there was only one iptables rule on the host, yet merely having the bridge call the netfilter hooks makes a difference of about 10 Gbits/sec.
However, this flag cannot stay off: as mentioned earlier, we need the iptables rules to translate the clusterIP into a podIP, so turn it back on:
sysctl -w net.bridge.bridge-nf-call-iptables=1
Now test with the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 47414 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.02 GBytes 60.3 Gbits/sec 0 669 KBytes
[ 5] 1.00-2.00 sec 7.23 GBytes 62.1 Gbits/sec 0 738 KBytes
[ 5] 2.00-3.00 sec 7.17 GBytes 61.6 Gbits/sec 0 899 KBytes
[ 5] 3.00-4.00 sec 7.21 GBytes 62.0 Gbits/sec 0 899 KBytes
[ 5] 4.00-5.00 sec 7.31 GBytes 62.8 Gbits/sec 0 899 KBytes
[ 5] 5.00-6.00 sec 7.19 GBytes 61.8 Gbits/sec 0 1008 KBytes
[ 5] 6.00-7.00 sec 7.24 GBytes 62.2 Gbits/sec 0 1008 KBytes
[ 5] 7.00-8.00 sec 7.22 GBytes 62.0 Gbits/sec 0 1.26 MBytes
[ 5] 8.00-9.00 sec 6.99 GBytes 60.0 Gbits/sec 0 1.26 MBytes
[ 5] 9.00-10.00 sec 7.07 GBytes 60.8 Gbits/sec 0 1.26 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 71.6 GBytes 61.5 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 71.6 GBytes 61.3 Gbits/sec receiver
I repeated this several times and got roughly the same numbers: accessing via the clusterIP is about 8% slower than direct podIP access.
Next, macvlan. Clean up first:
ip netns del pod-a
ip netns del pod-b
ip link del br0
macvlan
macvlan virtualizes multiple network interfaces on top of one physical NIC. Each virtual interface has its own MAC address and can be assigned an IP address. In bridge mode (the other modes aren't suitable here, so we won't discuss them), the parent interface acts as a switch for communication between the sub-interfaces, and the sub-interfaces reach the outside world through the parent.
We use bridge mode:
ip link add link eno2 name eth0 type macvlan mode bridge ## eno2 is my physical NIC; eth0 is the virtual interface name; adjust for your environment
ip netns add pod-a
ip l set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0
ip link add link eno2 name eth0 type macvlan mode bridge
ip netns add pod-b
ip l set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0
ip link add link eno2 name eth0 type macvlan mode bridge
ip addr add 192.168.10.1/24 dev eth0 ## create one more sub-interface and keep it on the host, holding the pods' default gateway IP, so traffic from the pods to the clusterIP reaches the host and the host's iptables rules get a chance to run
ip link set eth0 up
iptables -A POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE ## the return packets must also pass through the host stack, so do source NAT
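If you want to confirm the sub-interfaces really are macvlan devices in bridge mode, each with its own MAC address (verification only, not needed for the test):
ip -d link show eth0                         ## host-side sub-interface; the details should read 'macvlan mode bridge'
ip netns exec pod-a ip -d link show eth0     ## pod-a's sub-interface; note its MAC differs from eno2's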
Results:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 47050 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 9.47 GBytes 81.3 Gbits/sec 0 976 KBytes
[ 5] 1.00-2.00 sec 9.81 GBytes 84.3 Gbits/sec 0 976 KBytes
[ 5] 2.00-3.00 sec 9.74 GBytes 83.6 Gbits/sec 0 1.16 MBytes
[ 5] 3.00-4.00 sec 9.77 GBytes 83.9 Gbits/sec 0 1.16 MBytes
[ 5] 4.00-5.00 sec 9.73 GBytes 83.6 Gbits/sec 0 1.16 MBytes
[ 5] 5.00-6.00 sec 9.69 GBytes 83.2 Gbits/sec 0 1.16 MBytes
[ 5] 6.00-7.00 sec 9.75 GBytes 83.7 Gbits/sec 0 1.22 MBytes
[ 5] 7.00-8.00 sec 9.78 GBytes 84.0 Gbits/sec 0 1.30 MBytes
[ 5] 8.00-9.00 sec 9.81 GBytes 84.3 Gbits/sec 0 1.30 MBytes
[ 5] 9.00-10.00 sec 9.84 GBytes 84.5 Gbits/sec 0 1.30 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 97.4 GBytes 83.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 97.4 GBytes 83.3 Gbits/sec receiver
Direct podIP access is very fast, the fastest of all these approaches.
Now via the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 35780 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.16 GBytes 61.5 Gbits/sec 0 477 KBytes
[ 5] 1.00-2.00 sec 7.18 GBytes 61.7 Gbits/sec 0 477 KBytes
[ 5] 2.00-3.00 sec 7.13 GBytes 61.2 Gbits/sec 0 608 KBytes
[ 5] 3.00-4.00 sec 7.18 GBytes 61.7 Gbits/sec 0 670 KBytes
[ 5] 4.00-5.00 sec 7.16 GBytes 61.5 Gbits/sec 0 670 KBytes
[ 5] 5.00-6.00 sec 7.30 GBytes 62.7 Gbits/sec 0 822 KBytes
[ 5] 6.00-7.00 sec 7.34 GBytes 63.1 Gbits/sec 0 822 KBytes
[ 5] 7.00-8.00 sec 7.30 GBytes 62.7 Gbits/sec 0 1.00 MBytes
[ 5] 8.00-9.00 sec 7.19 GBytes 61.8 Gbits/sec 0 1.33 MBytes
[ 5] 9.00-10.00 sec 7.15 GBytes 61.4 Gbits/sec 0 1.33 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 72.1 GBytes 61.9 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 72.1 GBytes 61.7 Gbits/sec receiver
One pass through the host's kernel stack costs nearly 25%.
Clean up:
ip netns del pod-a
ip netns del pod-b
iptables -D POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE
ipvlan
Similar to macvlan, ipvlan also creates multiple virtual sub-interfaces on one physical NIC. The difference is that all ipvlan sub-interfaces share the same MAC address and only their IP addresses differ.
ipvlan has l2 and l3 modes. In l2 mode it works much like macvlan: the parent interface acts as a switch forwarding packets between the sub-interfaces. The difference is that when forwarding, ipvlan decides that traffic is between sub-interfaces by checking whether dmac == smac.
l2 mode:
ip l add link eno2 name eth0 type ipvlan mode l2
ip netns add pod-a
ip l set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 192.168.10.1 dev eth0
ip l add link eno2 name eth0 type ipvlan mode l2
ip netns add pod-b
ip l set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 192.168.10.1 dev eth0
ip link add link eno2 name eth0 type ipvlan mode l2
ip link set eth0 up
ip addr add 192.168.10.1/24 dev eth0
iptables -A POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE ## same as before: the return packets must pass through the host stack, so do source NAT
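Unlike macvlan, every ipvlan sub-interface inherits the parent's MAC address, which is easy to confirm (verification only):
ip link show eno2 | grep ether                       ## the parent's MAC
ip netns exec pod-a ip link show eth0 | grep ether   ## the same MAC shows up on the sub-interface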
Results:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 59580 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 9.70 GBytes 83.3 Gbits/sec 0 748 KBytes
[ 5] 1.00-2.00 sec 9.71 GBytes 83.4 Gbits/sec 0 748 KBytes
[ 5] 2.00-3.00 sec 10.1 GBytes 86.9 Gbits/sec 0 748 KBytes
[ 5] 3.00-4.00 sec 9.84 GBytes 84.5 Gbits/sec 0 826 KBytes
[ 5] 4.00-5.00 sec 9.82 GBytes 84.3 Gbits/sec 0 826 KBytes
[ 5] 5.00-6.00 sec 9.74 GBytes 83.6 Gbits/sec 0 1.19 MBytes
[ 5] 6.00-7.00 sec 9.77 GBytes 83.9 Gbits/sec 0 1.19 MBytes
[ 5] 7.00-8.00 sec 9.53 GBytes 81.8 Gbits/sec 0 1.41 MBytes
[ 5] 8.00-9.00 sec 9.56 GBytes 82.1 Gbits/sec 0 1.41 MBytes
[ 5] 9.00-10.00 sec 9.54 GBytes 82.0 Gbits/sec 0 1.41 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 97.3 GBytes 83.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 97.3 GBytes 83.3 Gbits/sec receiver
The results show that in ipvlan l2 mode, same-host pod-to-pod performance is about the same as macvlan; l3 mode is similar, so it is not shown.
Via the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 38540 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.17 GBytes 61.6 Gbits/sec 0 611 KBytes
[ 5] 1.00-2.00 sec 7.22 GBytes 62.0 Gbits/sec 0 708 KBytes
[ 5] 2.00-3.00 sec 7.31 GBytes 62.8 Gbits/sec 0 708 KBytes
[ 5] 3.00-4.00 sec 7.29 GBytes 62.6 Gbits/sec 0 833 KBytes
[ 5] 4.00-5.00 sec 7.27 GBytes 62.4 Gbits/sec 0 833 KBytes
[ 5] 5.00-6.00 sec 7.36 GBytes 63.3 Gbits/sec 0 833 KBytes
[ 5] 6.00-7.00 sec 7.26 GBytes 62.4 Gbits/sec 0 874 KBytes
[ 5] 7.00-8.00 sec 7.19 GBytes 61.8 Gbits/sec 0 874 KBytes
[ 5] 8.00-9.00 sec 7.17 GBytes 61.6 Gbits/sec 0 874 KBytes
[ 5] 9.00-10.00 sec 7.28 GBytes 62.5 Gbits/sec 0 874 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 72.5 GBytes 62.3 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 72.5 GBytes 62.0 Gbits/sec receiver
Again, clusterIP access is about 25% slower than direct podIP access. There is another problem, too: same-host pod-to-pod traffic never traverses the iptables rules, so network policies cannot take effect. It would be nice if the clusterIP problem could be solved cleanly in ipvlan or macvlan mode. Next, let's try attaching eBPF programs to the NIC inside the pod to handle the clusterIP.
Attaching eBPF programs in ipvlan mode
The following requires attaching our custom eBPF program (mst_lxc.o), so you won't be able to follow along; just look at the results.
Attach the eBPF programs to tc ingress and tc egress on pod-b's eth0 (because we are inside the pod, the programs are attached the other way around: they were written with the host-side NIC in mind):
ip netns exec pod-b bash
ulimit -l unlimited
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress # the ingress direction gets lxc-egress, which mainly performs the reverse DNAT: podIP -> clusterIP
tc filter add dev eth0 egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress # the egress direction performs the DNAT: clusterIP -> podIP
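Still inside the pod-b shell opened above, the attachments can be checked with tc (verification only):
tc filter show dev eth0 ingress   ## should list the bpf filter loaded from section lxc-egress
tc filter show dev eth0 egress    ## should list the bpf filter loaded from section lxc-ingress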
Add the service and its backend (functionally similar to the iptables DNAT rule above):
mustang svc add --service=10.96.0.100:5201 --backend=192.168.10.10:5201
I0908 09:51:35.754539 5694 bpffs_linux.go:260] Detected mounted BPF filesystem at /sys/fs/bpf
I0908 09:51:35.891046 5694 service.go:578] Restored services from maps
I0908 09:51:35.891103 5694 bpf_svc_add.go:86] created success,id:1
Check the configuration (mustang is our custom user-space tool for manipulating the eBPF maps):
mustang svc ls
==========================================================================
Mustang Service count:1
==========================================================================
id pro service port backends
--------------------------------------------------------------------------
1 NONE 10.96.0.100 5201 192.168.10.10:5201,
==========================================================================
With everything in place, test access to pod-a via the clusterIP again:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 57152 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.96 GBytes 77.0 Gbits/sec 0 1.04 MBytes
[ 5] 1.00-2.00 sec 8.83 GBytes 75.8 Gbits/sec 0 1.26 MBytes
[ 5] 2.00-3.00 sec 9.13 GBytes 78.4 Gbits/sec 0 1.46 MBytes
[ 5] 3.00-4.00 sec 9.10 GBytes 78.2 Gbits/sec 0 1.46 MBytes
[ 5] 4.00-5.00 sec 9.37 GBytes 80.5 Gbits/sec 0 1.46 MBytes
[ 5] 5.00-6.00 sec 9.44 GBytes 81.1 Gbits/sec 0 1.46 MBytes
[ 5] 6.00-7.00 sec 9.14 GBytes 78.5 Gbits/sec 0 1.46 MBytes
[ 5] 7.00-8.00 sec 9.36 GBytes 80.4 Gbits/sec 0 1.68 MBytes
[ 5] 8.00-9.00 sec 9.11 GBytes 78.3 Gbits/sec 0 1.68 MBytes
[ 5] 9.00-10.00 sec 9.29 GBytes 79.8 Gbits/sec 0 1.68 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 91.7 GBytes 78.8 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 91.7 GBytes 78.5 Gbits/sec receiver
As you can see, in ipvlan mode, implementing the clusterIP with eBPF programs improves performance dramatically: compared with direct podIP access the loss is under 5%, and compared with implementing the clusterIP via the host's iptables rules it is about 20% faster, making it the best of all the approaches above. Our in-house CNI (mustang) supports this mode, and also supports attaching eBPF programs in the veth approach; the test results for veth with eBPF follow.
First, clean up the previous setup.
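Following the same pattern as the earlier clean-ups (stop the iperf3 server in pod-a first), that roughly means:
ip netns del pod-a
ip netns del pod-b
ip link del eth0     ## the host-side ipvlan sub-interface holding 192.168.10.1
iptables -D POSTROUTING -t nat -s 192.168.10.0/24 -d 192.168.10.0/24 -j MASQUERADE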
Attaching eBPF programs in the veth approach
- First, create pod-a and pod-b again:
ip netns add pod-a
ip link add eth0 type veth peer name veth-pod-a
ip link set eth0 netns pod-a
ip netns exec pod-a ip addr add 192.168.10.10/24 dev eth0
ip netns exec pod-a ip link set eth0 up
ip netns exec pod-a ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-a up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-a/proxy_arp
ip route add 192.168.10.10 dev veth-pod-a scope link
ip netns add pod-b
ip link add eth0 type veth peer name veth-pod-b
ip link set eth0 netns pod-b
ip netns exec pod-b ip addr add 192.168.10.11/24 dev eth0
ip netns exec pod-b ip link set eth0 up
ip netns exec pod-b ip route add default via 169.254.10.24 dev eth0 onlink
ip link set veth-pod-b up
echo 1 > /proc/sys/net/ipv4/conf/veth-pod-b/proxy_arp
ip route add 192.168.10.11 dev veth-pod-b scope link
echo 1 > /proc/sys/net/ipv4/ip_forward
- Attach the eBPF programs
Attach eBPF programs to both tc ingress and tc egress on the host-side interfaces of pod-a and pod-b:
tc qdisc add dev veth-pod-b clsact
tc filter add dev veth-pod-b ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress ## attached on the host-side interface, so the directions are the opposite of attaching inside the container
tc filter add dev veth-pod-b egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress
tc qdisc add dev veth-pod-a clsact
tc filter add dev veth-pod-a ingress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-ingress
tc filter add dev veth-pod-a egress prio 1 handle 1 bpf da obj mst_lxc.o sec lxc-egress
- Add the service and the endpoints (so packets can be redirected quickly):
mustang svc add --service=10.96.0.100:5201 --backend=192.168.10.10:5201
mustang ep add --ifname=eth0 --netns=/var/run/netns/pod-a
mustang ep add --ifname=eth0 --netns=/var/run/netns/pod-b
- Check the configuration
mustang svc ls
==========================================================================
Mustang Service count:1
==========================================================================
id pro service port backends
--------------------------------------------------------------------------
1 NONE 10.96.0.100 5201 192.168.10.10:5201,
==========================================================================
mustang ep ls
==========================================================================
Mustang Endpoint count:2
==========================================================================
Id IP Host IfIndex LxcMAC NodeMAC
--------------------------------------------------------------------------
1 192.168.10.10 false 653 76:6C:37:2C:81:05 3E:E9:02:96:60:D6
2 192.168.10.11 false 655 6A:CD:11:13:76:2C 72:90:74:4A:CB:84
==========================================================================
- Time to test. Same routine as before: run the server in pod-a and the client in pod-b. First via the clusterIP:
ip netns exec pod-b iperf3 -c 10.96.0.100 -i 1 -t 10
Connecting to host 10.96.0.100, port 5201
[ 5] local 192.168.10.11 port 36492 connected to 10.96.0.100 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.52 GBytes 73.2 Gbits/sec 0 1.07 MBytes
[ 5] 1.00-2.00 sec 8.54 GBytes 73.4 Gbits/sec 0 1.07 MBytes
[ 5] 2.00-3.00 sec 8.64 GBytes 74.2 Gbits/sec 0 1.07 MBytes
[ 5] 3.00-4.00 sec 8.57 GBytes 73.6 Gbits/sec 0 1.12 MBytes
[ 5] 4.00-5.00 sec 8.61 GBytes 73.9 Gbits/sec 0 1.18 MBytes
[ 5] 5.00-6.00 sec 8.48 GBytes 72.8 Gbits/sec 0 1.18 MBytes
[ 5] 6.00-7.00 sec 8.57 GBytes 73.6 Gbits/sec 0 1.18 MBytes
[ 5] 7.00-8.00 sec 9.11 GBytes 78.3 Gbits/sec 0 1.18 MBytes
[ 5] 8.00-9.00 sec 8.86 GBytes 76.1 Gbits/sec 0 1.18 MBytes
[ 5] 9.00-10.00 sec 8.71 GBytes 74.8 Gbits/sec 0 1.18 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 86.6 GBytes 74.4 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 86.6 GBytes 74.1 Gbits/sec receiver
Now directly via the podIP:
ip netns exec pod-b iperf3 -c 192.168.10.10 -i 1 -t 10
Connecting to host 192.168.10.10, port 5201
[ 5] local 192.168.10.11 port 56460 connected to 192.168.10.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.76 GBytes 75.3 Gbits/sec 0 676 KBytes
[ 5] 1.00-2.00 sec 8.87 GBytes 76.2 Gbits/sec 0 708 KBytes
[ 5] 2.00-3.00 sec 8.78 GBytes 75.5 Gbits/sec 0 708 KBytes
[ 5] 3.00-4.00 sec 8.99 GBytes 77.2 Gbits/sec 0 708 KBytes
[ 5] 4.00-5.00 sec 8.85 GBytes 76.1 Gbits/sec 0 782 KBytes
[ 5] 5.00-6.00 sec 8.98 GBytes 77.1 Gbits/sec 0 782 KBytes
[ 5] 6.00-7.00 sec 8.64 GBytes 74.2 Gbits/sec 0 1.19 MBytes
[ 5] 7.00-8.00 sec 8.40 GBytes 72.2 Gbits/sec 0 1.55 MBytes
[ 5] 8.00-9.00 sec 8.16 GBytes 70.1 Gbits/sec 0 1.88 MBytes
[ 5] 9.00-10.00 sec 8.16 GBytes 70.1 Gbits/sec 0 1.88 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 86.6 GBytes 74.4 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 86.6 GBytes 74.1 Gbits/sec receiver
With the eBPF programs attached, podIP access is roughly 12% faster than before (see the plain veth results at the top: 66 Gbits/sec), and there is almost no difference between clusterIP and podIP access.
Summary
- veth: (podIP) 66 Gbits/sec, (clusterIP) 66 Gbits/sec
- bridge: (podIP) 77 Gbits/sec, (clusterIP) 61 Gbits/sec
- macvlan: (podIP) 83 Gbits/sec, (clusterIP) 62 Gbits/sec
- ipvlan: (podIP) 83 Gbits/sec, (clusterIP) 62 Gbits/sec
- ipvlan + eBPF: (podIP) 83 Gbits/sec, (clusterIP) 78 Gbits/sec
- veth + eBPF: (podIP) 74 Gbits/sec, (clusterIP) 74 Gbits/sec
To sum up:
- macvlan and ipvlan give the highest podIP performance of all the approaches, but clusterIP access goes through the host stack and drops about 25%; with eBPF it gains back about 20%. macvlan, however, doesn't work on wireless NICs, is limited by the number of MAC addresses a NIC supports, and fails cloud platforms' source/destination checks, so ipvlan + eBPF is generally the recommended combination. Attaching the eBPF programs in this mode is still a bit fiddly: in the demo above, for brevity, the eBPF maps were created inside the pod, whereas in practice the maps are created on the host and shared by all pods.
- With eBPF programs attached in the veth approach, serviceIP and podIP access perform the same, about 12% better than without eBPF, and the host's iptables firewall rules can still be used. Attaching the eBPF programs on the host side is also simpler, so this is mustang's default mode.
- The bridge approach outperforms veth for podIP access, but running the iptables rules drags it down, so without eBPF the veth approach is still the more recommended one; it is calico's default. The bridge approach is flannel's default.