Introduction
This post is just a record of some basics from my study of RDMA and the RoCE v2 network protocol. If you spot any mistakes, feel free to point them out in the comments.
NIC types and selection
In data centers, high-performance computing, and network storage, the choice of NIC is critical to system performance, reliability, and efficiency. Since I am doing this purely for learning, I tried to pick the cheapest cards that still support RDMA. Mellanox ships a very wide range of NICs: some support VPI, some are Ethernet-only, and some are InfiniBand-only. The link below can help you choose a suitable card.
firmware-download
Hardware
NIC: 2 × Mellanox ConnectX-4 MCX455A-ECAT 100G
Optics: 100G QSFP28 AOC modules
OS: Ubuntu 20.04
Test machine: ASUS Z390-A motherboard, 4 GB DDR4 memory
1. Mellanox ConnectX-4 setup
1.1 Check that the NIC devices are enumerated:
bing@ubuntu2004:~$ lspci -k | grep Mellanox
01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Subsystem: Mellanox Technologies MT27700 Family [ConnectX-4]
02:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Subsystem: Mellanox Technologies MT27700 Family [ConnectX-4]
If you see output like the above, the cards have been enumerated successfully.
1.2 Download the MLNX_OFED driver
MLNX_OFED driver download link
Since my card is a ConnectX-4, I chose the MLNX_OFED 5.4/5.8-x LTS release as the download page suggests.
From the MLNX_OFED Download Center you can pick whichever ISO file you need.
1.3 Install the MLNX_OFED driver
Copy the ISO to the test machine, mount it, and run the mlnxofedinstall script. While the script runs it compiles a number of kernel modules, which is fairly slow, so be patient. Once the driver is installed it will prompt you to reboot.
➜ scp -r MLNX_OFED_LINUX-5.8-4.1.5.0-ubuntu20.04-x86_64.iso bing@192.168.100.101:/tmp
root@ubuntu2004:/tmp# mount -o ro,loop MLNX_OFED_LINUX-5.8-4.1.5.0-ubuntu20.04-x86_64.iso /mnt/
root@ubuntu2004:/tmp# cd /mnt/
root@ubuntu2004:/mnt# ls
DEBS RPM-GPG-KEY-Mellanox common_installers.pl distro mlnx-ofed-keyring.gpg mlnxofedinstall uninstall.sh
LICENSE common.pl create_mlnx_ofed_installers.pl docs mlnx_add_kernel_support.sh src
root@ubuntu2004:/mnt# ./mlnxofedinstall
The installation logs are written to /tmp. If you need to confirm the exact part number of the card you bought, grep the logs for the "Part Number" keyword:
root@ubuntu2004:/tmp# grep -Rsn "Part Number" ./
./mlnx_fw_update.log:8: Part Number: MCX455A-ECA_Ax
./mlnx_fw_update.log:28: Part Number: MCX455A-ECA_Ax
root@ubuntu2004:/tmp#
1.4 Inspect the installation results
Most of the files land under /usr, some under /opt/, and the driver modules under /lib/modules/:
root@ubuntu2004:/lib/modules/5.15.0-107-generic/updates/dkms# ls
ib_cm.ko ib_iser.ko ib_umad.ko iw_cm.ko mlx5_ib.ko mlxfw.ko rdma_cm.ko
ib_core.ko ib_isert.ko ib_uverbs.ko knem.ko mlx_compat.ko mst_pci.ko rdma_ucm.ko
ib_ipoib.ko ib_srp.ko irdma.ko mlx5_core.ko mlxdevm.ko mst_pciconf.ko scsi_transport_srp.ko
root@ubuntu2004:/lib/modules/5.15.0-107-generic/updates/dkms#
- ib_cm.ko, ib_core.ko, ib_ipoib.ko, ib_iser.ko, ib_isert.ko, ib_srp.ko, ib_umad.ko, ib_uverbs.ko: InfiniBand driver modules covering the different layers of the InfiniBand protocol stack.
- mlx5_core.ko, mlx5_ib.ko: drivers for Mellanox network devices, supporting both their Ethernet and InfiniBand NICs.
- rdma_cm.ko, rdma_ucm.ko: support for the RDMA (Remote Direct Memory Access) connection manager.
- knem.ko: a kernel module providing high-performance memory copies, commonly used in HPC environments.
- mst_pci.ko, mst_pciconf.ko: Mellanox management-tool modules for configuring and managing devices over PCI.
- scsi_transport_srp.ko: transport-layer support for the SCSI RDMA Protocol (SRP), allowing SCSI transfers over an RDMA interface.
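After the post-install reboot, it is worth verifying that the kernel actually picked up the OFED modules rather than the in-tree driver. A minimal sketch (the exact module list depends on the MLNX_OFED version installed):

```shell
# List the loaded Mellanox/RDMA modules
lsmod | grep -E '^(mlx5_core|mlx5_ib|ib_core|ib_uverbs|rdma_cm)'
# Show where mlx5_core was loaded from; with MLNX_OFED installed
# this should point into /lib/modules/<kernel>/updates/dkms
modinfo -n mlx5_core
```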
2 Mellanox ConnectX-4 NIC configuration
2.1 Use mst (Mellanox Software Tools) or ibv_devinfo to display and inspect the state of Mellanox devices
root@ubuntu2004:~# mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4115_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:01:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
/dev/mst/mt4115_pciconf1 - PCI configuration cycles access.
domain:bus:dev.fn=0000:02:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
root@ubuntu2004:~#
root@ubuntu2004:~# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.28.2006
node_guid: 248a:0703:009c:81c4
sys_image_guid: 248a:0703:009c:81c4
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 12.28.2006
node_guid: 248a:0703:00a0:199a
sys_image_guid: 248a:0703:00a0:199a
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
root@ubuntu2004:~#
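Two details worth noting in the ibv_devinfo output above: link_layer is Ethernet, meaning the ports are running RDMA over Converged Ethernet (RoCE), and active_mtu (1024) is below max_mtu (4096). In RoCE mode, active_mtu follows the netdev MTU, so raising the interface MTU raises the RDMA MTU as well. A sketch, with the interface name taken from this setup:

```shell
# RoCE frames add headers on top of the RDMA payload MTU, so the
# Ethernet MTU must be set comfortably above 4096 (4200 is common)
sudo ip link set enp1s0np0 mtu 4200
# active_mtu should now report 4096 (5)
ibv_devinfo -d mlx5_0 | grep active_mtu
```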
2.2 Query the parameters a device supports with mlxconfig
By changing the LINK_TYPE_P1 parameter you can set the port to IB(1) or Ethernet(2).
root@ubuntu2004:~# mlxconfig -d /dev/mst/mt4115_pciconf0 q
Device #1:
----------
Device type: ConnectX4
Name: MCX455A-ECA_Ax
Description: ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; ROHS R6
Device: /dev/mst/mt4115_pciconf0
Configurations: Next Boot
MEMIC_BAR_SIZE 0
MEMIC_SIZE_LIMIT _256KB(1)
FLEX_PARSER_PROFILE_ENABLE 0
FLEX_IPV4_OVER_VXLAN_PORT 0
ROCE_NEXT_PROTOCOL 254
NON_PREFETCHABLE_PF_BAR False(0)
VF_VPD_ENABLE False(0)
STRICT_VF_MSIX_NUM False(0)
VF_NODNIC_ENABLE False(0)
NUM_PF_MSIX_VALID True(1)
NUM_OF_VFS 0
NUM_OF_PF 1
FPP_EN True(1)
SRIOV_EN False(0)
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 1
NUM_PF_MSIX 63
NUM_VF_MSIX 11
INT_LOG_MAX_PAYLOAD_SIZE AUTOMATIC(0)
PCIE_CREDIT_TOKEN_TIMEOUT 0
PARTIAL_RESET_EN False(0)
SW_RECOVERY_ON_ERRORS False(0)
RESET_WITH_HOST_ON_ERRORS False(0)
PCI_DOWNSTREAM_PORT_OWNER Array[0..15]
CQE_COMPRESSION BALANCED(0)
IP_OVER_VXLAN_EN False(0)
MKEY_BY_NAME False(0)
UCTX_EN True(1)
PCI_ATOMIC_MODE PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
TUNNEL_ECN_COPY_DISABLE False(0)
LRO_LOG_TIMEOUT0 6
LRO_LOG_TIMEOUT1 7
LRO_LOG_TIMEOUT2 8
LRO_LOG_TIMEOUT3 13
TX_SCHEDULER_BURST 0
LOG_DCR_HASH_TABLE_SIZE 14
MAX_PACKET_LIFETIME 0
DCR_LIFO_SIZE 16384
LINK_TYPE_P1 ETH(2)
ROCE_CC_PRIO_MASK_P1 255
CLAMP_TGT_RATE_AFTER_TIME_INC_P1 True(1)
CLAMP_TGT_RATE_P1 False(0)
RPG_TIME_RESET_P1 300
RPG_BYTE_RESET_P1 32767
RPG_THRESHOLD_P1 1
RPG_MAX_RATE_P1 0
RPG_AI_RATE_P1 5
RPG_HAI_RATE_P1 50
RPG_GD_P1 11
RPG_MIN_DEC_FAC_P1 50
RPG_MIN_RATE_P1 1
RATE_TO_SET_ON_FIRST_CNP_P1 0
DCE_TCP_G_P1 1019
DCE_TCP_RTT_P1 1
RATE_REDUCE_MONITOR_PERIOD_P1 4
INITIAL_ALPHA_VALUE_P1 1023
MIN_TIME_BETWEEN_CNPS_P1 0
CNP_802P_PRIO_P1 6
CNP_DSCP_P1 48
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
KEEP_ETH_LINK_UP_P1 True(1)
KEEP_IB_LINK_UP_P1 False(0)
KEEP_LINK_UP_ON_BOOT_P1 False(0)
KEEP_LINK_UP_ON_STANDBY_P1 False(0)
DO_NOT_CLEAR_PORT_STATS_P1 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P1 False(0)
NUM_OF_VL_P1 _4_VLs(3)
NUM_OF_TC_P1 _8_TCs(0)
NUM_OF_PFC_P1 8
VL15_BUFFER_SIZE_P1 0
DUP_MAC_ACTION_P1 LAST_CFG(0)
SRIOV_IB_ROUTING_MODE_P1 LID(1)
IB_ROUTING_MODE_P1 LID(1)
PCI_WR_ORDERING per_mkey(0)
MULTI_PORT_VHCA_EN False(0)
PORT_OWNER True(1)
ALLOW_RD_COUNTERS True(1)
RENEG_ON_CHANGE True(1)
TRACER_ENABLE True(1)
IP_VER IPv4(0)
BOOT_UNDI_NETWORK_WAIT 0
UEFI_HII_EN False(0)
BOOT_DBG_LOG False(0)
UEFI_LOGS DISABLED(0)
BOOT_VLAN 1
LEGACY_BOOT_PROTOCOL PXE(1)
BOOT_INTERRUPT_DIS False(0)
BOOT_LACP_DIS True(1)
BOOT_VLAN_EN False(0)
BOOT_PKEY 0
P2P_ORDERING_MODE DEVICE_DEFAULT(0)
DYNAMIC_VF_MSIX_TABLE False(0)
EXP_ROM_UEFI_ARM_ENABLE False(0)
EXP_ROM_UEFI_x86_ENABLE False(0)
EXP_ROM_PXE_ENABLE True(1)
ADVANCED_PCI_SETTINGS False(0)
SAFE_MODE_THRESHOLD 10
SAFE_MODE_ENABLE True(1)
root@ubuntu2004:~#
2.3 Set the VPI link type, configure IPs, and test bandwidth with iperf3
# set the port to Ethernet (ETH)
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2
# set the port to InfiniBand (IB)
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1
# a reboot is required after changing the link type
sudo ip addr add 192.168.1.1/24 dev enp1s0np0
sudo ip link set enp1s0np0 up
sudo ip addr add 192.168.1.2/24 dev enp2s0np0
sudo ip link set enp2s0np0 up
ifconfig
enp1s0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.1 netmask 255.255.255.0 broadcast 0.0.0.0
ether 24:8a:07:9c:81:c4 txqueuelen 1000 (Ethernet)
RX packets 101 bytes 21028 (21.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 204 bytes 31375 (31.3 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp2s0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.2 netmask 255.255.255.0 broadcast 0.0.0.0
ether 24:8a:07:a0:19:9a txqueuelen 1000 (Ethernet)
RX packets 102 bytes 21268 (21.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 195 bytes 30044 (30.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
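One caveat about this single-machine setup: because both addresses belong to the same host, Linux installs them in the local routing table and may deliver 192.168.1.x traffic over the loopback path rather than over the cable. If you want to be sure the physical link is exercised, a common trick is to move one port into its own network namespace. A sketch, with interface names taken from this setup:

```shell
# Put the second port in a separate namespace so traffic between
# 192.168.1.1 and 192.168.1.2 must cross the physical link
sudo ip netns add ns_server
sudo ip link set enp2s0np0 netns ns_server
sudo ip netns exec ns_server ip addr add 192.168.1.2/24 dev enp2s0np0
sudo ip netns exec ns_server ip link set enp2s0np0 up
# run the server inside the namespace, the client outside
sudo ip netns exec ns_server iperf3 -s
```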
# in one terminal, start the server
root@ubuntu2004:/home/bing# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 192.168.1.1, port 51764
[ 5] local 192.168.1.1 port 5201 connected to 192.168.1.1 port 51768
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 5.69 GBytes 48.8 Gbits/sec
[ 5] 1.00-2.00 sec 5.93 GBytes 51.0 Gbits/sec
[ 5] 2.00-3.00 sec 5.86 GBytes 50.4 Gbits/sec
[ 5] 3.00-4.00 sec 5.90 GBytes 50.7 Gbits/sec
[ 5] 4.00-5.00 sec 5.90 GBytes 50.7 Gbits/sec
[ 5] 5.00-6.00 sec 5.88 GBytes 50.5 Gbits/sec
[ 5] 6.00-7.00 sec 5.89 GBytes 50.6 Gbits/sec
[ 5] 7.00-8.00 sec 5.94 GBytes 51.0 Gbits/sec
[ 5] 8.00-9.00 sec 5.94 GBytes 51.0 Gbits/sec
[ 5] 9.00-10.00 sec 5.91 GBytes 50.7 Gbits/sec
[ 5] 10.00-10.04 sec 251 MBytes 50.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 59.1 GBytes 50.5 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
# in another terminal, start the client
root@ubuntu2004:/home/bing# iperf3 -c 192.168.1.1
Connecting to host 192.168.1.1, port 5201
[ 5] local 192.168.1.1 port 51768 connected to 192.168.1.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 5.93 GBytes 51.0 Gbits/sec 0 1.44 MBytes
[ 5] 1.00-2.00 sec 5.93 GBytes 51.0 Gbits/sec 0 1.44 MBytes
[ 5] 2.00-3.00 sec 5.86 GBytes 50.4 Gbits/sec 0 2.37 MBytes
[ 5] 3.00-4.00 sec 5.90 GBytes 50.6 Gbits/sec 0 2.37 MBytes
[ 5] 4.00-5.00 sec 5.90 GBytes 50.7 Gbits/sec 0 2.37 MBytes
[ 5] 5.00-6.00 sec 5.88 GBytes 50.5 Gbits/sec 0 2.37 MBytes
[ 5] 6.00-7.00 sec 5.89 GBytes 50.6 Gbits/sec 0 2.37 MBytes
[ 5] 7.00-8.00 sec 5.94 GBytes 51.0 Gbits/sec 0 2.37 MBytes
[ 5] 8.00-9.00 sec 5.94 GBytes 51.0 Gbits/sec 0 2.37 MBytes
[ 5] 9.00-10.00 sec 5.91 GBytes 50.7 Gbits/sec 0 2.37 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 59.1 GBytes 50.8 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 59.1 GBytes 50.5 Gbits/sec receiver
iperf Done.
root@ubuntu2004:/home/bing#
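As a sanity check on the summary line: iperf3 reports Transfer in binary gigabytes (GiB) but Bitrate in decimal gigabits per second, so the receiver totals above (59.1 GBytes in 10.04 s) reconcile like this:

```shell
# 59.1 GiB in 10.04 s, converted to decimal Gbit/s
awk 'BEGIN { printf "%.1f Gbits/sec\n", 59.1 * 1024^3 * 8 / 10.04 / 1e9 }'
```

which lands within rounding of the reported 50.5 Gbits/sec (iperf3 itself works from the unrounded byte count).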
2.4 Bottleneck analysis
Clearly we did not reach 100G, so where is the bottleneck?
The motherboard in my test machine is fairly old. Its user manual says it supports only a single PCIe x16 link; with two cards plugged in, each slot runs at half width, i.e. PCIe gen3 x8. dmesg shows the same thing:
root@ubuntu2004:/home/bing# dmesg | grep mlx5_core
[ 0.996646] mlx5_core 0000:01:00.0: firmware version: 12.28.2006
[ 0.996676] mlx5_core 0000:01:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[ 1.253026] mlx5_core 0000:01:00.0: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[ 1.256063] mlx5_core 0000:01:00.0: Port module event: module 0, Cable plugged
[ 1.261729] mlx5_core 0000:01:00.0: mlx5_fw_tracer_start:821:(pid 152): FWTracer: Ownership granted and active
[ 1.270139] mlx5_core 0000:02:00.0: firmware version: 12.28.2006
[ 1.270170] mlx5_core 0000:02:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[ 1.517526] mlx5_core 0000:02:00.0: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[ 1.520523] mlx5_core 0000:02:00.0: Port module event: module 0, Cable plugged
[ 1.523676] mlx5_core 0000:02:00.0: mlx5_fw_tracer_start:821:(pid 152): FWTracer: Ownership granted and active
[ 1.531393] mlx5_core 0000:01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 1.728205] mlx5_core 0000:01:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[ 1.760452] mlx5_core 0000:02:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 1.951984] mlx5_core 0000:02:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[ 1.982937] mlx5_core 0000:01:00.0 enp1s0np0: renamed from eth0
[ 2.002012] mlx5_core 0000:02:00.0 enp2s0np0: renamed from eth1
[ 9.867953] mlx5_core 0000:01:00.0 enp1s0np0: Link up
[ 9.960668] mlx5_core 0000:02:00.0 enp2s0np0: Link up
root@ubuntu2004:/home/bing#
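The 63.008 Gb/s figure dmesg reports is essentially the raw PCIe gen3 x8 arithmetic: 8 GT/s per lane across 8 lanes, minus the 128b/130b line-encoding overhead (the kernel applies its own overhead constant, hence slightly below the ideal value):

```shell
# PCIe gen3: 8 GT/s per lane, 128b/130b encoding, 8 lanes
awk 'BEGIN { printf "%.3f Gb/s\n", 8 * 8 * 128 / 130 }'
# The negotiated link speed/width can also be read directly:
#   sudo lspci -s 01:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
```

With a full x16 link the same arithmetic gives about 126.03 Gb/s, matching (to within the kernel's rounding) the "capable of 126.016 Gb/s" hint in the log.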
2.5 In IB mode, the following commands can be used for a bandwidth test
sudo opensm
sudo ip addr add 192.168.1.1/24 dev ibp1s0
sudo ip link set ibp1s0 up
sudo ip addr add 192.168.1.2/24 dev ibp2s0
sudo ip link set ibp2s0 up
ib_send_bw -d mlx5_0 -i 1 --report_gbits
# in another terminal
ib_send_bw -d mlx5_1 -i 1 192.168.1.1 --report_gbits
Closing remarks:
There is a great deal more to RDMA and RoCE v2. The code side, and bandwidth testing in IB mode, will hopefully come in follow-up posts.