RDMA: RoCE/IB Bandwidth Testing with Mellanox ConnectX-4 NICs

2024-05-30 22:38:59

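Introduction

This article records and shares some basics I picked up while learning the RDMA/RoCE v2 network protocols. If you spot any mistakes, please leave a comment.

NIC types and selection

In data centers, high-performance computing, and networked storage, the choice of NIC is critical to system performance, reliability, and efficiency. Since my goal is learning, I tried to pick the cheapest NIC that supports RDMA. Mellanox has a large product line: some cards support VPI (the port can run as either InfiniBand or Ethernet), some support only Ethernet, and some support only InfiniBand. The link below can help you choose a suitable card.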

firmware-download

Hardware

NICs: 2 x Mellanox ConnectX-4 MCX455A-ECAT 100G

Cable: 100G QSFP28 AOC (active optical cable)

Operating system: Ubuntu 20.04

Test machine: ASUS Z390-A motherboard, 4 GB DDR4 memory

1. Mellanox ConnectX-4 setup

1.1 Check whether the NICs are enumerated:

Code language: sh
bing@ubuntu2004:~$ lspci -k | grep Mellanox
01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
	Subsystem: Mellanox Technologies MT27700 Family [ConnectX-4]
02:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
	Subsystem: Mellanox Technologies MT27700 Family [ConnectX-4]

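If the output looks like the above, the NICs have been enumerated.

1.2 Download the mlnx_ofed driver

mlnx_ofed driver download link

Following the guidance on that page, since my NIC is a ConnectX-4, I chose the MLNX_OFED 5.4/5.8-x LTS release.

In the MLNX_OFED Download Center you can select the ISO file you need.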

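1.3 Install the mlnx_ofed driver

Copy the ISO file to the test machine, mount it, and run the mlnxofedinstall script. The script compiles some kernel modules along the way, so it is fairly slow; just be patient. When the installation finishes, it prompts for a reboot.

Code language: sh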
➜ scp -r MLNX_OFED_LINUX-5.8-4.1.5.0-ubuntu20.04-x86_64.iso bing@192.168.100.101:/tmp
root@ubuntu2004:/tmp# mount -o ro,loop MLNX_OFED_LINUX-5.8-4.1.5.0-ubuntu20.04-x86_64.iso /mnt/
root@ubuntu2004:/tmp# cd /mnt/
root@ubuntu2004:/mnt# ls
DEBS     RPM-GPG-KEY-Mellanox  common_installers.pl            distro  mlnx-ofed-keyring.gpg       mlnxofedinstall  uninstall.sh
LICENSE  common.pl             create_mlnx_ofed_installers.pl  docs    mlnx_add_kernel_support.sh  src
root@ubuntu2004:/mnt# ./mlnxofedinstall

The installation logs are written under /tmp. To confirm the exact model of the card you bought, search the logs for the "Part Number" keyword:

Code language: sh
root@ubuntu2004:/tmp# grep -Rsn "Part Number" ./
./mlnx_fw_update.log:8:  Part Number:      MCX455A-ECA_Ax
./mlnx_fw_update.log:28:  Part Number:      MCX455A-ECA_Ax
root@ubuntu2004:/tmp#

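1.4 Check the installation results

Most of the installed files go under /usr, some go under /opt, and the kernel driver modules are installed under /lib/modules/:

Code language: sh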
root@ubuntu2004:/lib/modules/5.15.0-107-generic/updates/dkms# ls
ib_cm.ko     ib_iser.ko   ib_umad.ko    iw_cm.ko      mlx5_ib.ko     mlxfw.ko        rdma_cm.ko
ib_core.ko   ib_isert.ko  ib_uverbs.ko  knem.ko       mlx_compat.ko  mst_pci.ko      rdma_ucm.ko
ib_ipoib.ko  ib_srp.ko    irdma.ko      mlx5_core.ko  mlxdevm.ko     mst_pciconf.ko  scsi_transport_srp.ko
root@ubuntu2004:/lib/modules/5.15.0-107-generic/updates/dkms#
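  • ib_cm.ko, ib_core.ko, ib_ipoib.ko, ib_iser.ko, ib_isert.ko, ib_srp.ko, ib_umad.ko, ib_uverbs.ko: InfiniBand driver modules that handle different layers of the InfiniBand stack (for example, ib_core is the core layer, ib_ipoib carries IP over InfiniBand, and ib_uverbs exposes verbs to user space).
  • mlx5_core.ko, mlx5_ib.ko: drivers for Mellanox network devices. mlx5_core is the core driver and also provides the Ethernet interface; mlx5_ib adds RDMA (InfiniBand/RoCE) support on top of it.
  • rdma_cm.ko, rdma_ucm.ko: the RDMA (Remote Direct Memory Access) connection manager; rdma_ucm is its user-space interface.
  • knem.ko: a kernel module for high-performance memory copy operations, typically used in high-performance computing environments.
  • mst_pci.ko, mst_pciconf.ko: kernel modules used by the Mellanox management tools (MST) to access and configure devices over PCI.
  • scsi_transport_srp.ko: transport-layer support for SRP (SCSI RDMA Protocol), which carries SCSI traffic over an RDMA interface.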
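
After a reboot it can be useful to confirm that the modules you care about are actually loaded, not just installed. The following is a minimal sketch, assuming the usual lsmod layout (one header line, module name in the first column). The function name missing_modules is a made-up helper, and the list of modules passed to it is only an example.

```shell
#!/bin/sh
# Hypothetical helper: reads lsmod-style output on stdin and prints
# every module named in the arguments that does NOT appear in it.
missing_modules() {
    # Column 1 of every line after the header is a loaded module name.
    loaded=$(awk 'NR > 1 { print $1 }')
    for m in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string.
        printf '%s\n' "$loaded" | grep -qxF "$m" || printf '%s\n' "$m"
    done
}

# Example usage (prints nothing if all listed modules are loaded):
#   lsmod | missing_modules mlx5_core mlx5_ib ib_core ib_uverbs rdma_cm rdma_ucm
```

An empty result means every listed module was found; any name printed is a module that is not currently loaded.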

2. Mellanox ConnectX-4 NIC configuration

2.1 Use mst (Mellanox Software Tools) or ibv_devinfo to display and check the status of Mellanox devices

Code language: txt
root@ubuntu2004:~# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4115_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:01:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4115_pciconf1         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:02:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

root@ubuntu2004:~#

Code language: txt
root@ubuntu2004:~# ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				12.28.2006
	node_guid:			248a:0703:009c:81c4
	sys_image_guid:			248a:0703:009c:81c4
	vendor_id:			0x02c9
	vendor_part_id:			4115
	hw_ver:				0x0
	board_id:			MT_2180110032
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_1
	transport:			InfiniBand (0)
	fw_ver:				12.28.2006
	node_guid:			248a:0703:00a0:199a
	sys_image_guid:			248a:0703:00a0:199a
	vendor_id:			0x02c9
	vendor_part_id:			4115
	hw_ver:				0x0
	board_id:			MT_2180110032
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

root@ubuntu2004:~#
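
The full ibv_devinfo listing is long when all you want is a quick health check. Below is a small sketch that condenses it to one line per device: name, port state, active MTU, and link layer. It assumes one port per device, as in the output above, and the function name summarize_ibv_devinfo is a hypothetical helper rather than part of the OFED tools.

```shell
#!/bin/sh
# Hypothetical helper: condense ibv_devinfo output (read from stdin)
# to "<hca_id> <state> <active_mtu> <link_layer>", one line per device.
summarize_ibv_devinfo() {
    awk '
        $1 == "hca_id:"     { dev = $2 }      # start of a new device block
        $1 == "state:"      { state = $2 }    # e.g. PORT_ACTIVE
        $1 == "active_mtu:" { mtu = $2 }      # e.g. 1024
        $1 == "link_layer:" { print dev, state, mtu, $2 }  # last field of a port
    '
}

# Example usage:
#   ibv_devinfo | summarize_ibv_devinfo
```

For the listing above this should give one line for mlx5_0 and one for mlx5_1, both showing PORT_ACTIVE with an Ethernet link layer.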

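2.2 Use mlxconfig to query the parameters the device supports

The LINK_TYPE_P1 parameter selects the port's link type: set it to IB(1) for InfiniBand or ETH(2) for Ethernet.

Code language: txt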
root@ubuntu2004:~# mlxconfig  -d /dev/mst/mt4115_pciconf0 q

Device #1:
----------

Device type:    ConnectX4
Name:           MCX455A-ECA_Ax
Description:    ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; ROHS R6
Device:         /dev/mst/mt4115_pciconf0

Configurations:                                      Next Boot
         MEMIC_BAR_SIZE                              0
         MEMIC_SIZE_LIMIT                            _256KB(1)
         FLEX_PARSER_PROFILE_ENABLE                  0
         FLEX_IPV4_OVER_VXLAN_PORT                   0
         ROCE_NEXT_PROTOCOL                          254
         NON_PREFETCHABLE_PF_BAR                     False(0)
         VF_VPD_ENABLE                               False(0)
         STRICT_VF_MSIX_NUM                          False(0)
         VF_NODNIC_ENABLE                            False(0)
         NUM_PF_MSIX_VALID                           True(1)
         NUM_OF_VFS                                  0
         NUM_OF_PF                                   1
         FPP_EN                                      True(1)
         SRIOV_EN                                    False(0)
         PF_LOG_BAR_SIZE                             5
         VF_LOG_BAR_SIZE                             1
         NUM_PF_MSIX                                 63
         NUM_VF_MSIX                                 11
         INT_LOG_MAX_PAYLOAD_SIZE                    AUTOMATIC(0)
         PCIE_CREDIT_TOKEN_TIMEOUT                   0
         PARTIAL_RESET_EN                            False(0)
         SW_RECOVERY_ON_ERRORS                       False(0)
         RESET_WITH_HOST_ON_ERRORS                   False(0)
         PCI_DOWNSTREAM_PORT_OWNER                   Array[0..15]
         CQE_COMPRESSION                             BALANCED(0)
         IP_OVER_VXLAN_EN                            False(0)
         MKEY_BY_NAME                                False(0)
         UCTX_EN                                     True(1)
         PCI_ATOMIC_MODE                             PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
         TUNNEL_ECN_COPY_DISABLE                     False(0)
         LRO_LOG_TIMEOUT0                            6
         LRO_LOG_TIMEOUT1                            7
         LRO_LOG_TIMEOUT2                            8
         LRO_LOG_TIMEOUT3                            13
         TX_SCHEDULER_BURST                          0
         LOG_DCR_HASH_TABLE_SIZE                     14
         MAX_PACKET_LIFETIME                         0
         DCR_LIFO_SIZE                               16384
         LINK_TYPE_P1                                ETH(2)
         ROCE_CC_PRIO_MASK_P1                        255
         CLAMP_TGT_RATE_AFTER_TIME_INC_P1            True(1)
         CLAMP_TGT_RATE_P1                           False(0)
         RPG_TIME_RESET_P1                           300
         RPG_BYTE_RESET_P1                           32767
         RPG_THRESHOLD_P1                            1
         RPG_MAX_RATE_P1                             0
         RPG_AI_RATE_P1                              5
         RPG_HAI_RATE_P1                             50
         RPG_GD_P1                                   11
         RPG_MIN_DEC_FAC_P1                          50
         RPG_MIN_RATE_P1                             1
         RATE_TO_SET_ON_FIRST_CNP_P1                 0
         DCE_TCP_G_P1                                1019
         DCE_TCP_RTT_P1                              1
         RATE_REDUCE_MONITOR_PERIOD_P1               4
         INITIAL_ALPHA_VALUE_P1                      1023
         MIN_TIME_BETWEEN_CNPS_P1                    0
         CNP_802P_PRIO_P1                            6
         CNP_DSCP_P1                                 48
         LLDP_NB_DCBX_P1                             False(0)
         LLDP_NB_RX_MODE_P1                          OFF(0)
         LLDP_NB_TX_MODE_P1                          OFF(0)
         DCBX_IEEE_P1                                True(1)
         DCBX_CEE_P1                                 True(1)
         DCBX_WILLING_P1                             True(1)
         KEEP_ETH_LINK_UP_P1                         True(1)
         KEEP_IB_LINK_UP_P1                          False(0)
         KEEP_LINK_UP_ON_BOOT_P1                     False(0)
         KEEP_LINK_UP_ON_STANDBY_P1                  False(0)
         DO_NOT_CLEAR_PORT_STATS_P1                  False(0)
         AUTO_POWER_SAVE_LINK_DOWN_P1                False(0)
         NUM_OF_VL_P1                                _4_VLs(3)
         NUM_OF_TC_P1                                _8_TCs(0)
         NUM_OF_PFC_P1                               8
         VL15_BUFFER_SIZE_P1                         0
         DUP_MAC_ACTION_P1                           LAST_CFG(0)
         SRIOV_IB_ROUTING_MODE_P1                    LID(1)
         IB_ROUTING_MODE_P1                          LID(1)
         PCI_WR_ORDERING                             per_mkey(0)
         MULTI_PORT_VHCA_EN                          False(0)
         PORT_OWNER                                  True(1)
         ALLOW_RD_COUNTERS                           True(1)
         RENEG_ON_CHANGE                             True(1)
         TRACER_ENABLE                               True(1)
         IP_VER                                      IPv4(0)
         BOOT_UNDI_NETWORK_WAIT                      0
         UEFI_HII_EN                                 False(0)
         BOOT_DBG_LOG                                False(0)
         UEFI_LOGS                                   DISABLED(0)
         BOOT_VLAN                                   1
         LEGACY_BOOT_PROTOCOL                        PXE(1)
         BOOT_INTERRUPT_DIS                          False(0)
         BOOT_LACP_DIS                               True(1)
         BOOT_VLAN_EN                                False(0)
         BOOT_PKEY                                   0
         P2P_ORDERING_MODE                           DEVICE_DEFAULT(0)
         DYNAMIC_VF_MSIX_TABLE                       False(0)
         EXP_ROM_UEFI_ARM_ENABLE                     False(0)
         EXP_ROM_UEFI_x86_ENABLE                     False(0)
         EXP_ROM_PXE_ENABLE                          True(1)
         ADVANCED_PCI_SETTINGS                       False(0)
         SAFE_MODE_THRESHOLD                         10
         SAFE_MODE_ENABLE                            True(1)
root@ubuntu2004:~#

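2.3 Set the VPI link type, configure IP addresses, and test bandwidth with iperf3

Code language: txt
# Set the port to Ethernet
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2
# Set the port to InfiniBand
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1
# A reboot is required for the change to take effect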
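
The numeric values are easy to mix up, so a small wrapper can help. The sketch below only builds and prints the mlxconfig command for a given device, port number, and link type name; it does not execute anything. The function names link_type_value and build_set_cmd are hypothetical helpers, not part of the Mellanox tools, and the mapping (1 = IB, 2 = ETH) is the one stated in section 2.2.

```shell
#!/bin/sh
# Hypothetical helper: map a link type name to the LINK_TYPE_Px value.
link_type_value() {
    case "$1" in
        ib|IB)   echo 1 ;;
        eth|ETH) echo 2 ;;
        *) echo "unknown link type: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print (not run) the mlxconfig command.
# Arguments: <mst device> <port number> <ib|eth>
build_set_cmd() {
    value=$(link_type_value "$3") || return 1
    echo "mlxconfig -d $1 set LINK_TYPE_P$2=$value"
}

# Example: prints the command that switches port 1 to InfiniBand.
build_set_cmd /dev/mst/mt4115_pciconf0 1 ib
```

Printing the command first gives you a chance to review it before running it as root.
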

Code language: txt
sudo ip addr add 192.168.1.1/24 dev enp1s0np0
sudo ip link set enp1s0np0 up
sudo ip addr add 192.168.1.2/24 dev enp2s0np0
sudo ip link set enp2s0np0 up
ifconfig
enp1s0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.1  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 24:8a:07:9c:81:c4  txqueuelen 1000  (Ethernet)
        RX packets 101  bytes 21028 (21.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 204  bytes 31375 (31.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp2s0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.2  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 24:8a:07:a0:19:9a  txqueuelen 1000  (Ethernet)
        RX packets 102  bytes 21268 (21.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 195  bytes 30044 (30.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
# Open one terminal as the server
root@ubuntu2004:/home/bing# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 192.168.1.1, port 51764
[  5] local 192.168.1.1 port 5201 connected to 192.168.1.1 port 51768
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  5.69 GBytes  48.8 Gbits/sec
[  5]   1.00-2.00   sec  5.93 GBytes  51.0 Gbits/sec
[  5]   2.00-3.00   sec  5.86 GBytes  50.4 Gbits/sec
[  5]   3.00-4.00   sec  5.90 GBytes  50.7 Gbits/sec
[  5]   4.00-5.00   sec  5.90 GBytes  50.7 Gbits/sec
[  5]   5.00-6.00   sec  5.88 GBytes  50.5 Gbits/sec
[  5]   6.00-7.00   sec  5.89 GBytes  50.6 Gbits/sec
[  5]   7.00-8.00   sec  5.94 GBytes  51.0 Gbits/sec
[  5]   8.00-9.00   sec  5.94 GBytes  51.0 Gbits/sec
[  5]   9.00-10.00  sec  5.91 GBytes  50.7 Gbits/sec
[  5]  10.00-10.04  sec   251 MBytes  50.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  59.1 GBytes  50.5 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
# Open another terminal as the client
root@ubuntu2004:/home/bing# iperf3 -c 192.168.1.1
Connecting to host 192.168.1.1, port 5201
[  5] local 192.168.1.1 port 51768 connected to 192.168.1.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  5.93 GBytes  51.0 Gbits/sec    0   1.44 MBytes
[  5]   1.00-2.00   sec  5.93 GBytes  51.0 Gbits/sec    0   1.44 MBytes
[  5]   2.00-3.00   sec  5.86 GBytes  50.4 Gbits/sec    0   2.37 MBytes
[  5]   3.00-4.00   sec  5.90 GBytes  50.6 Gbits/sec    0   2.37 MBytes
[  5]   4.00-5.00   sec  5.90 GBytes  50.7 Gbits/sec    0   2.37 MBytes
[  5]   5.00-6.00   sec  5.88 GBytes  50.5 Gbits/sec    0   2.37 MBytes
[  5]   6.00-7.00   sec  5.89 GBytes  50.6 Gbits/sec    0   2.37 MBytes
[  5]   7.00-8.00   sec  5.94 GBytes  51.0 Gbits/sec    0   2.37 MBytes
[  5]   8.00-9.00   sec  5.94 GBytes  51.0 Gbits/sec    0   2.37 MBytes
[  5]   9.00-10.00  sec  5.91 GBytes  50.7 Gbits/sec    0   2.37 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  59.1 GBytes  50.8 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  59.1 GBytes  50.5 Gbits/sec                  receiver

iperf Done.
root@ubuntu2004:/home/bing#
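
One thing worth checking in the output above: both ends of the connection are reported as 192.168.1.1. When both ports sit in the same host and the same network namespace, Linux normally delivers traffic between two local addresses internally, so these packets may never have crossed the cable. A common way to force traffic onto the wire is to move one port into its own network namespace. The sketch below is a dry-run helper: it only prints the commands unless DRY_RUN=0 is set. The namespace name ns_client and the function names are made up for illustration; the interface name and address come from the listing above.

```shell
#!/bin/sh
# Dry run by default: print commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it (needs root).
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: move <dev> into namespace <ns> and give it <addr>.
# Moving an interface into a namespace drops its addresses, so the
# address is added again inside the namespace.
setup_netns() {
    ns=$1; dev=$2; addr=$3
    run ip netns add "$ns"
    run ip link set "$dev" netns "$ns"
    run ip netns exec "$ns" ip addr add "$addr" dev "$dev"
    run ip netns exec "$ns" ip link set "$dev" up
}

# Second port becomes the client side, isolated from the first port.
setup_netns ns_client enp2s0np0 192.168.1.2/24
```

With the namespace in place, the server could stay in the default namespace (iperf3 -s) and the client would be started with: ip netns exec ns_client iperf3 -c 192.168.1.1. Adding -P 4 runs several parallel streams, which may help if a single TCP stream turns out to be CPU-bound.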

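2.4 Bottleneck analysis

Clearly, the result (about 50 Gbit/s) falls well short of 100 Gbit/s. Where is the bottleneck?

The test machine's motherboard is fairly old. According to its user manual, it provides only one PCIe x16 link. With two NICs installed, the lanes are split between the slots, so each card gets only a PCIe Gen3 x8 link, which is about 63 Gb/s and therefore below the 100 Gb/s line rate. dmesg reports the same thing:

Code language: txt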
root@ubuntu2004:/home/bing# dmesg  | grep mlx5_core
[    0.996646] mlx5_core 0000:01:00.0: firmware version: 12.28.2006
[    0.996676] mlx5_core 0000:01:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[    1.253026] mlx5_core 0000:01:00.0: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[    1.256063] mlx5_core 0000:01:00.0: Port module event: module 0, Cable plugged
[    1.261729] mlx5_core 0000:01:00.0: mlx5_fw_tracer_start:821:(pid 152): FWTracer: Ownership granted and active
[    1.270139] mlx5_core 0000:02:00.0: firmware version: 12.28.2006
[    1.270170] mlx5_core 0000:02:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[    1.517526] mlx5_core 0000:02:00.0: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[    1.520523] mlx5_core 0000:02:00.0: Port module event: module 0, Cable plugged
[    1.523676] mlx5_core 0000:02:00.0: mlx5_fw_tracer_start:821:(pid 152): FWTracer: Ownership granted and active
[    1.531393] mlx5_core 0000:01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[    1.728205] mlx5_core 0000:01:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[    1.760452] mlx5_core 0000:02:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[    1.951984] mlx5_core 0000:02:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[    1.982937] mlx5_core 0000:01:00.0 enp1s0np0: renamed from eth0
[    2.002012] mlx5_core 0000:02:00.0 enp2s0np0: renamed from eth1
[    9.867953] mlx5_core 0000:01:00.0 enp1s0np0: Link up
[    9.960668] mlx5_core 0000:02:00.0 enp2s0np0: Link up
root@ubuntu2004:/home/bing#
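
The 63.008 Gb/s and 126.016 Gb/s figures in the dmesg output can be reproduced by hand. PCIe Gen3 runs at 8.0 GT/s per lane and uses 128b/130b encoding, so each lane carries slightly less than 8 Gb/s of data. The sketch below does the arithmetic in integer Mb/s. Truncating the per-lane value before multiplying is an assumption, chosen because it reproduces the numbers shown in dmesg; the function name pcie_gen3_mbps is made up.

```shell
#!/bin/sh
# Hypothetical helper: usable PCIe Gen3 bandwidth in Mb/s for <lanes> lanes.
pcie_gen3_mbps() {
    lanes=$1
    # 8000 Mb/s raw per lane, scaled by 128/130 for the line encoding.
    # Integer division truncates 7876.9 to 7876.
    per_lane=$(( 8000 * 128 / 130 ))
    echo $(( per_lane * lanes ))
}

pcie_gen3_mbps 8    # x8 link:  63008 Mb/s  = 63.008 Gb/s
pcie_gen3_mbps 16   # x16 link: 126016 Mb/s = 126.016 Gb/s
```

An x8 link therefore cannot carry a full 100 Gb/s stream, while an x16 link leaves some headroom above the line rate.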

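2.5 In IB mode, the following commands can be used for bandwidth testing

Code language: txt
# Start the subnet manager. It runs in the foreground, so use a separate
# terminal for it (or run "sudo opensm -B" to start it as a daemon).
sudo opensm
sudo ip addr add 192.168.1.1/24 dev ibp1s0
sudo ip link set ibp1s0 up
sudo ip addr add 192.168.1.2/24 dev ibp2s0
sudo ip link set ibp2s0 up

# Terminal 1: server side
ib_send_bw -d mlx5_0 -i 1 --report_gbits
# Terminal 2: client side
ib_send_bw -d mlx5_1 -i 1 192.168.1.1 --report_gbits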

Closing remarks:

RDMA and RoCE v2 are rich topics. When I have the chance, I will cover the code side and bandwidth testing in IB mode in later articles.
