UCX-UCT Unified Communication Transport Layer, Part 2: A Deep Dive into Connection Establishment and the Main Send/Receive Flow

2023-11-19 11:09:24

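Terminology

devx: mlx development library. The DevX library uses the KABI mechanism to access the mlx5 device driver directly from user space. The main goal is to keep the user-space driver as independent of the kernel as possible, so that future device features and commands can be enabled with minimal or no kernel changes. Reference: https://github.com/Mellanox/devx

TSC: Time Stamp Counter. A counter implemented in every x86 microprocessor as a 64-bit register called the TSC register. It counts the clock signals arriving at the processor's CLK pin, and the current value can be read by accessing the TSC register. The duration of one tick is 1/(clock frequency); for a 1 GHz clock that is one tick per nanosecond. Knowing the duration between two consecutive ticks matters because the clock frequency of one processor may differ from that of another, so the tick duration varies across processors. The CPU clock frequency is computed during system boot by the calibrate_tsc() callback defined in the x86_platform_ops structure. If the x86 TSC frequency cannot be read from the CPU model, do not fall back to the measured CPU frequency in /proc/cpuinfo, because it only represents the core frequency, not the TSC frequency.

ODP: On-Demand Paging. ODP is a technique that mitigates the drawbacks of memory registration. The application no longer needs to pin down the physical pages underlying its address space or track the validity of the mappings. Instead, the HCA asks the operating system for the latest translation when a page is not present, and the operating system invalidates translations that are no longer valid because a page is not present or a mapping has changed. ODP does not support contiguous pages. ODP falls into two subclasses:
  • Explicit ODP: the application still registers memory buffers for communication, but the registration defines access control for I/O rather than pinning down pages. An ODP memory region (MR) does not need a valid mapping at registration time.
  • Implicit ODP: the application is given a special memory key that represents its entire address space. I/O accesses that reference this key (subject to the access rights associated with the key) do not need any virtual address range to be registered.
For more on ODP, see the community post "Understanding On Demand Paging (ODP)".

wire_speed: wire speed. Formula: Width (link width) * SignalRate (signaling rate) * Encoding (encoding efficiency) * Num_paths (number of paths), divided by 8 to convert bits to bytes, for example: wire_speed = (width * signal_rate * encoding * num_path) / 8.0

LAG: Link Aggregation Group (bonding). Network bonding combines two or more network interfaces into a single interface. It increases network throughput and bandwidth, and provides redundancy if one of the interfaces fails. The NVIDIA® BlueField® DPU can optionally configure network bonding on the Arm side in a way that is transparent to the host. In that configuration the host sees only one PF. Reference: https://docs.nvidia.com/networking/display/bluefielddpuosv385/link aggregation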
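To make the TSC entry above concrete, here is a minimal sketch of the tick arithmetic. The function names are made up for illustration (they are not UCX or kernel APIs), and `tsc_hz` is assumed to be the calibrated TSC frequency rather than the core frequency from /proc/cpuinfo.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of one TSC tick in nanoseconds: 1 / frequency.
 * tsc_hz is an assumed input (calibrated TSC frequency in Hz). */
double demo_tsc_tick_period_ns(double tsc_hz)
{
    assert(tsc_hz > 0.0);
    return 1e9 / tsc_hz;
}

/* Convert a difference of two TSC readings into nanoseconds. */
double demo_tsc_ticks_to_ns(uint64_t ticks, double tsc_hz)
{
    assert(tsc_hz > 0.0);
    return (double)ticks * 1e9 / tsc_hz;
}
```

With a 1 GHz TSC, `demo_tsc_tick_period_ns(1e9)` gives one nanosecond per tick, matching the statement in the glossary entry.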

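Main flow

Main flow (server or client):

  1. The main function parses the command-line arguments (parse_cmd) and sets the default server port
  2. Initialize the context (ucs_async_context_create; the async event context manages timers and FD notifications). Inside it, initialize the multi-producer/multi-consumer queue (ucs_mpmc_queue_init), the non-blocking async poller (ucs_async_poll_init), the recursive spinlock context, and so on
  3. Create the worker (uct_worker_create). A worker represents a progress engine. An application can create several progress engines, for example one per thread
  4. Look up the requested transport based on the input arguments (dev_tl_lookup; the device and transport to use are chosen by minimum latency)
  5. Set the callback (uct_iface_set_am_handler) that the server runs when it receives data from the client
  6. Establish the socket connection (connect_common): the server listens on the port and waits for the client to connect
  7. Once the client has connected, both sides exchange addresses (sendrecv: first send and receive the length over the socket, then send and receive the address itself)
  8. Create the endpoint (uct_ep_create), get the endpoint address (uct_ep_get_address), and connect to the peer endpoint (uct_ep_connect_to_ep, which internally uses ibv_modify_qp to walk the QP state machine and establish the QP connection)
  9. After the connection is established, the client sends data with a short message (do_am_short), a buffered copy (do_am_bcopy), or zero copy (do_am_zcopy)
  10. Explicitly drive worker progress (uct_worker_progress; this routine explicitly progresses any outstanding communication operations and active message requests, and underneath it polls the NIC for completion events with ibv_poll_cq)
  11. Destroy resources (uct_ep_destroy, free the other resources, etc.)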
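Step 7 (exchange the length first, then the address) can be sketched with plain sockets. This is an illustrative model, not the sendrecv code from the UCX example: every function name below is hypothetical, both "peers" live in one process on a socketpair, and the length is sent in host byte order on the assumption that both sides share the same endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send exactly len bytes; returns 0 on success, -1 on error. */
int demo_send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) {
            return -1;
        }
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes; returns 0 on success, -1 on error or EOF. */
int demo_recv_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0) {
            return -1;
        }
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send the length first, then the address bytes. */
int demo_send_blob(int fd, const void *addr, uint64_t len)
{
    assert(addr != NULL);
    if (demo_send_all(fd, &len, sizeof(len)) != 0) {
        return -1;
    }
    return demo_send_all(fd, addr, (size_t)len);
}

/* Receive the length, check it against the buffer capacity, then the bytes. */
int demo_recv_blob(int fd, void *buf, uint64_t cap, uint64_t *len)
{
    if (demo_recv_all(fd, len, sizeof(*len)) != 0 || *len > cap) {
        return -1;
    }
    return demo_recv_all(fd, buf, (size_t)*len);
}

/* Self-check: two peers swap opaque "addresses" over a local socketpair. */
int demo_exchange(void)
{
    const char a_addr[] = "addr-of-A";
    const char b_addr[] = "addr-of-B!";
    char       seen_by_a[32], seen_by_b[32];
    uint64_t   len_a = 0, len_b = 0;
    int        sv[2], ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }
    /* Each side sends first and receives afterwards. */
    ok = demo_send_blob(sv[0], a_addr, sizeof(a_addr)) == 0 &&
         demo_send_blob(sv[1], b_addr, sizeof(b_addr)) == 0 &&
         demo_recv_blob(sv[0], seen_by_a, sizeof(seen_by_a), &len_a) == 0 &&
         demo_recv_blob(sv[1], seen_by_b, sizeof(seen_by_b), &len_b) == 0 &&
         len_a == sizeof(b_addr) && memcmp(seen_by_a, b_addr, sizeof(b_addr)) == 0 &&
         len_b == sizeof(a_addr) && memcmp(seen_by_b, a_addr, sizeof(a_addr)) == 0;
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

Sending the length first lets the receiver check the size before reading a variable-length address, which is why the exchange runs in two stages.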

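Initialization (constructors (CTOR) run before main)

void UCS_F_CTOR ucm_init()
  ucm_init_log() -> ucm_log_hostname gets the host name
  ucm_init_malloc_hook() -> ucs_recursive_spinlock_init -> initializes the recursive lock

void UCS_F_CTOR ucs_init()
  ucs_check_cpu_flags -> checks CPU features
  ucs_log_early_init -> initializes logging, process ID, etc.
  ucs_global_opts_init -> initializes the global configuration options, gets the config file path, fills in the config file, adds the global vfs node (ucs/global_opts), the vfs log level, etc.
  ucs_init_ucm_opts -> adds the ucm global config table (memory alignment 16, log level WARN, etc.), fills in the configuration (including environment variables with the UCX_ prefix), initializes the ucs library (khash hash table and map initialization), opens the shared library /home/xb/project/ucx/src/ucm/.libs/libucm.so.0, and applies the CUDA bistro hooks implemented for aarch64 (to enable the memory cache on that platform)
  ucs_memtype_cache_global_init -> initializes the mutex
  ucs_cpu_init -> selects the optimized memcpy (ucs_cpu_builtin_memcpy) and gets the CPU vendor (ucs_cpu_vendor)
  ucs_log_init -> log formatting macros, opens the log file, handles the mutex across forked processes
  ucs_stats_init -> initializes statistics, linked lists, the database, etc.
  ucs_memtrack_init -> initializes memory tracking
  ucs_debug_init -> initializes debugging and the signal stack (sigaltstack)
  ucs_profile_init -> initializes the thread-private context and private data
  ucs_async_global_init -> initializes async events and timers
  ucs_numa_init -> initializes the NUMA distance hash table
  ucs_topo_init -> looks up devices by bus ID
  ucs_rand_seed_init -> TCP allowed port range, generates the boot ID from its high and low parts (3817213743958597775), random value, process ID, time, thread ID, version, etc.
  ucs_sys_get_lib_path -> finds the current library libucs.so.0 (searching several paths)
  ucs_get_process_cmdline -> prints the current cmdline
  ucs_modules_load -> initializes the virtual file system used to monitor the process and loads all ucs modules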
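The "runs before main" behaviour comes from the compiler's constructor attribute. The sketch below assumes a GCC- or Clang-compatible compiler; `DEMO_F_CTOR` is a hypothetical stand-in for `UCS_F_CTOR` (which, to my understanding, wraps the same attribute), and the function names are invented for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for UCS_F_CTOR (GCC/Clang attribute). */
#define DEMO_F_CTOR __attribute__((constructor))

int demo_init_count = 0;

/* Called automatically when the program or shared library is loaded,
 * before main() is entered. */
void DEMO_F_CTOR demo_init(void)
{
    assert(demo_init_count == 0); /* expected to run exactly once */
    demo_init_count++;
}

/* Returns 1 once the constructor has run. */
int demo_is_initialized(void)
{
    return demo_init_count == 1;
}
```

Any code reached from main() can therefore rely on `demo_is_initialized()` being true, which is the same guarantee ucm_init() and ucs_init() give the rest of the library.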

Server-side flame graph

Client-side flame graph

Short message (am_short)

Call stack
ucs_status_t do_am_short
  UCT_INLINE_API ucs_status_t uct_ep_am_short
    ucs_status_t uct_rc_verbs_ep_am_short
      uct_rc_verbs_iface_fill_inl_am_sge
      uct_rc_verbs_ep_post_send
        ibv_post_send

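Buffered copy (bcopy)

Zero copy (zcopy)

Important data structures

Component (struct uct_component)

As shown in the figure below; search the source for (.name =)

Supported transport-layer components

Basic data structures

src/ucs/datastruct

API header file

src/uct/api/uct.h

Interface configuration

struct uct_ib_iface_config is the transport-layer IB interface configuration. It inherits the parent interface configuration and adds its own send and receive settings, inline sizes, IB traffic class, MTU, counters, and other options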

CPU details

CPU vendors:
typedef enum ucs_cpu_vendor {
    UCS_CPU_VENDOR_UNKNOWN,
    UCS_CPU_VENDOR_INTEL,
    UCS_CPU_VENDOR_AMD,
    UCS_CPU_VENDOR_GENERIC_ARM,
    UCS_CPU_VENDOR_GENERIC_PPC,
    UCS_CPU_VENDOR_FUJITSU_ARM,
    UCS_CPU_VENDOR_ZHAOXIN,
    UCS_CPU_VENDOR_GENERIC_RV64G,
    UCS_CPU_VENDOR_LAST
} ucs_cpu_vendor_t;

/* CPU cache types */
typedef enum ucs_cpu_cache_type {
    UCS_CPU_CACHE_L1d, /**< L1 data cache */
    UCS_CPU_CACHE_L1i, /**< L1 instruction cache */
    UCS_CPU_CACHE_L2,  /**< L2 cache */
    UCS_CPU_CACHE_L3,  /**< L3 cache */
    UCS_CPU_CACHE_LAST
} ucs_cpu_cache_type_t;
struct { /* sysfs entries for system cache sizes */
    int         level;
    const char *type;
} const ucs_cpu_cache_sysfs_name[] = {
    [UCS_CPU_CACHE_L1d] = {.level = 1, .type = "Data"},
    [UCS_CPU_CACHE_L1i] = {.level = 1, .type = "Instruction"},
    [UCS_CPU_CACHE_L2]  = {.level = 2, .type = "Unified"},
    [UCS_CPU_CACHE_L3]  = {.level = 3, .type = "Unified"}
};

Transport-layer I/O descriptor

typedef struct uct_iov {
    void     *buffer;   /**< Data buffer */
    size_t    length;   /**< Length of the payload in bytes */
    uct_mem_h memh;     /**< Local memory key descriptor for the data */
    size_t    stride;   /**< Stride between beginnings of payload elements in
                             the buffer in bytes */
    unsigned  count;    /**< Number of payload elements in the buffer */
} uct_iov_t;
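Reading the field comments literally, each entry describes `count` payload elements of `length` bytes whose starts are `stride` bytes apart. The sketch below works on a local mirror of those fields (`demo_iov_t` and all function names are hypothetical, and the UCX memory handle is omitted) to show how the total payload size and element addresses follow from them; individual transports may restrict which values they accept.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirror of the payload-related fields of uct_iov_t (hypothetical). */
typedef struct demo_iov {
    void     *buffer;  /* data buffer */
    size_t    length;  /* bytes per payload element */
    size_t    stride;  /* bytes between element starts */
    unsigned  count;   /* number of payload elements */
} demo_iov_t;

/* Total payload bytes: each entry contributes count * length bytes;
 * the stride affects placement only, not the amount of data. */
size_t demo_iov_total_length(const demo_iov_t *iov, size_t iovcnt)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < iovcnt; ++i) {
        total += iov[i].length * iov[i].count;
    }
    return total;
}

/* Address of the k-th payload element of one entry. */
void *demo_iov_element(const demo_iov_t *iov, unsigned k)
{
    assert(k < iov->count);
    return (char *)iov->buffer + (size_t)k * iov->stride;
}

/* Self-check with two entries: 4 x 8 bytes (one every 16 bytes) + 1 x 16 bytes. */
int demo_iov_selfcheck(void)
{
    char       a[64], b[16];
    demo_iov_t iov[2] = {
        {a, 8, 16, 4},
        {b, 16, 0, 1}
    };

    return demo_iov_total_length(iov, 2) == 48 &&
           demo_iov_element(&iov[0], 3) == (void *)(a + 48);
}
```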

Memory domain resource descriptor

typedef struct uct_md_resource_desc

Interface attributes (capabilities and limitations), a nested struct

struct uct_iface_attr

Interface information

typedef struct {
    uct_iface_attr_t    iface_attr; /* Interface attributes: capabilities and limitations */
    uct_iface_h         iface;      /* Communication interface context */
    uct_md_attr_t       md_attr;    /* Memory domain attributes: capabilities and limitations */
    uct_md_h            md;         /* Memory domain */
    uct_worker_h        worker;     /* Workers represent allocated resources in a communication thread */
} iface_info_t;

IB memory domain configuration

typedef struct uct_ib_md_config {
    uct_md_config_t          super;

    ucs_linear_func_t        reg_cost;     /**< Memory registration cost estimation */
    unsigned                 fork_init;    /**< Use ibv_fork_init() */
    int                      async_events; /**< Whether async events should be delivered */
​
    uct_ib_md_ext_config_t   ext;          /**< External configuration */
​
    UCS_CONFIG_STRING_ARRAY_FIELD(spec) custom_devices; /**< Custom device specifications */
​
    char                     *subnet_prefix; /**< Filter of subnet_prefix for IB ports */
​
    UCS_CONFIG_ARRAY_FIELD(ucs_config_bw_spec_t, device) pci_bw; /**< List of PCI BW for devices */
​
    int                      mlx5dv; /**< mlx5 support */
    int                      devx; /**< DEVX support */
    unsigned                 devx_objs;    /**< Objects to be created by DevX */
    ucs_ternary_auto_value_t mr_relaxed_order; /**< Allow reorder memory accesses */
    int                      enable_gpudirect_rdma; /**< Enable GPUDirect RDMA */
} uct_ib_md_config_t;

uct_ib_md_config_t is the IB memory domain configuration, and it includes the DevX support options (devx, devx_objs).

Mellanox mlx5 memory domain type (mlx5)

typedef struct uct_ib_mlx5_md {
    uct_ib_md_t               super;
    uint32_t                  flags;
    ucs_mpool_t               dbrec_pool;
    ucs_recursive_spinlock_t  dbrec_lock;
    uct_ib_port_select_mode_t port_select_mode;
#if HAVE_DEVX
    void                     *zero_buf;
    uct_ib_mlx5_devx_umem_t  zero_mem;
​
    struct {
        ucs_list_link_t      list;
        khash_t(rkeys)       hash;
        size_t               count;
    } lru_rkeys;
​
    struct ibv_mr            *flush_mr;
    struct mlx5dv_devx_obj   *flush_dvmr;
    uint8_t                  mkey_tag;
​
    /* Cached values of counter set id per port */
    uint8_t                  port_counter_set_ids[UCT_IB_DEV_MAX_PORTS];
#endif
    struct {
        size_t dc;
        size_t ud;
    } dv_tx_wqe_ratio;
    /* The maximum number of outstanding RDMA Read/Atomic operations per DC QP. */
    uint8_t                  max_rd_atomic_dc;
    uint8_t                  log_max_dci_stream_channels;
} uct_ib_mlx5_md_t;

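HCA capability bits used by the IB transport's common mlx5 command for querying NIC features

struct uct_ib_mlx5_cmd_hca_cap_bits

  • Enables hardware DCS support
  • Adds a runtime configuration option to limit the use of DCI streams
  • Updates the *post_op routines to set the DCI stream channel ID
  • Adds reading of the HCA capabilities to obtain the maximum number of DCI streams

The DEVX API uses the KABI mechanism to allow direct access to the mlx5 device driver from user space. The main goal is to keep the user-space driver as independent of the kernel as possible, so that future device features and commands can be enabled with minimal or no kernel changes.

A DEVX object represents some underlying firmware object. The input command that creates it is raw data supplied by the user application, and that data must match the device specification. After successful creation, the output buffer contains raw data from the device, laid out according to its specification, which can be used as part of related firmware commands for that object.

Important functions

uct_md_open -> important function that opens a memory domain

ibv_fork_init: initializes fork support for IB (controls how registered memory behaves across fork)

ibv_fork_init -> core idea: the memory pages backing every registered MR are marked with MADV_DONTFORK. After a child process is created, those pages do not go through copy-on-write (COW), which avoids the problem where COW leaves the NIC's DMA address out of sync with the memory the process actually uses. The cost is extra bookkeeping and lookup overhead for the tracked memory ranges, which reduces performance

Related concepts: kinds of fork (fork-exec, which starts a different program, versus a plain fork), and rounding memory addresses up to the alignment boundary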
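The page-marking step can be tried directly with madvise. This Linux-only sketch is not how libibverbs implements ibv_fork_init internally; it only illustrates the MADV_DONTFORK flag and the page alignment mentioned above. All function names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MADV_DONTFORK and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Mark every page touched by [addr, addr + len) as "do not inherit on fork".
 * madvise works on whole pages, so the start is rounded down and the end is
 * rounded up to page boundaries. Returns 0 on success, -1 on error. */
int demo_mark_dontfork(void *addr, size_t len)
{
    uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~(page - 1);
    uintptr_t end   = ((uintptr_t)addr + len + page - 1) & ~(page - 1);

    assert(len > 0);
    return madvise((void *)start, end - start, MADV_DONTFORK);
}

/* Self-check: map two anonymous pages and mark a small unaligned range. */
int demo_dontfork_selfcheck(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE) * 2;
    char  *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int    rc;

    if (buf == MAP_FAILED) {
        return -1;
    }
    rc = demo_mark_dontfork(buf + 100, 10);
    munmap(buf, len);
    return rc;
}
```

Because the flag applies to whole pages, unrelated data that shares a page with a marked buffer also disappears from the child's address space, which is one reason the ranges have to be tracked carefully.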

IB configuration table

static ucs_config_field_t uct_ib_md_config_table[]

// dfl: default value

Show all UCX configuration options with their descriptions (ucx_info -f)

Show the UCX default configuration (ucx_info -c)

ucx_info -c    # print the default configuration
UCX_LOG_LEVEL=WARN
UCX_LOG_FILE_FILTER=*
UCX_LOG_BUFFER=1K
UCX_LOG_DATA_SIZE=0
UCX_LOG_PRINT_ENABLE=n
UCX_HANDLE_ERRORS=bt
UCX_ERROR_MAIL_TO=
UCX_ERROR_MAIL_FOOTER=
UCX_GDB_COMMAND=gdb -quiet
UCX_DEBUG_SIGNO=HUP
UCX_LOG_LEVEL_TRIGGER=FATAL
UCX_WARN_UNUSED_ENV_VARS=y
UCX_MEMTYPE_CACHE=try
UCX_ASYNC_MAX_EVENTS=1024
UCX_ASYNC_SIGNO=ALRM
UCX_MEMTRACK_LIMIT=inf
UCX_RCACHE_CHECK_PFN=0
UCX_MODULE_DIR=/usr/lib64/ucx
UCX_MODULE_LOG_LEVEL=TRACE
UCX_MODULES=all
UCX_TOPO_PRIO=sysfs,default
UCX_LOG_FILE=
UCX_LOG_FILE_SIZE=inf
UCX_LOG_FILE_ROTATE=0
UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE
UCX_VFS_ENABLE=y
UCX_VFS_THREAD_AFFINITY=n
UCX_MEMTRACK_DEST=
UCX_PROFILE_MODE=
UCX_PROFILE_FILE=ucx_%h_%p.prof
UCX_PROFILE_LOG_SIZE=4M
UCX_RCACHE_STAT_MIN=4K
UCX_RCACHE_STAT_MAX=1M
UCX_BUILTIN_MEMCPY_MIN=auto
UCX_BUILTIN_MEMCPY_MAX=auto
UCX_MEM_LOG_LEVEL=WARN
UCX_MEM_ALLOC_ALIGN=16
UCX_MEM_EVENTS=y
UCX_MEM_MMAP_HOOK_MODE=bistro
UCX_MEM_MALLOC_HOOKS=y
UCX_MEM_MALLOC_RELOC=y
UCX_MEM_CUDA_HOOK_MODE=bistro
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_MEM_DLOPEN_PROCESS_RPATH=y
UCX_MEM_MODULE_UNLOAD_PREVENT_MODE=lazy
UCX_SELF_NUM_DEVICES=1
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_SELF_FAILURE=DIAG
UCX_SELF_MAX_NUM_EPS=inf
UCX_SELF_SEG_SIZE=8K
UCX_TCP_AF_PRIO=inet,inet6
UCX_TCP_CM_FAILURE=DIAG
UCX_TCP_CM_REUSEADDR=n
UCX_TCP_CM_PRIV_DATA_LEN=2K
UCX_TCP_CM_SNDBUF=auto
UCX_TCP_CM_RCVBUF=auto
UCX_TCP_CM_SYN_CNT=auto
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_FAILURE=DIAG
UCX_TCP_MAX_NUM_EPS=256
UCX_TCP_TX_SEG_SIZE=8K
UCX_TCP_RX_SEG_SIZE=64K
UCX_TCP_MAX_IOV=6
UCX_TCP_SENDV_THRESH=2K
UCX_TCP_PREFER_DEFAULT=y
UCX_TCP_PUT_ENABLE=y
UCX_TCP_CONN_NB=n
UCX_TCP_MAX_POLL=16
UCX_TCP_MAX_CONN_RETRIES=25
UCX_TCP_NODELAY=y
UCX_TCP_SNDBUF=auto
UCX_TCP_RCVBUF=auto
UCX_TCP_SYN_CNT=auto
UCX_TCP_TX_MAX_BUFS=-1
UCX_TCP_TX_BUFS_GROW=8
UCX_TCP_RX_MAX_BUFS=-1
UCX_TCP_RX_BUFS_GROW=8
UCX_TCP_PORT_RANGE=0
UCX_TCP_KEEPIDLE=10000000.00us
UCX_TCP_KEEPCNT=auto
UCX_TCP_KEEPINTVL=2000000.00us
UCX_SYSV_HUGETLB_MODE=try
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SYSV_FAILURE=DIAG
UCX_SYSV_MAX_NUM_EPS=inf
UCX_SYSV_BW=12179.00MBps
UCX_SYSV_FIFO_SIZE=64
UCX_SYSV_SEG_SIZE=8256
UCX_SYSV_FIFO_RELEASE_FACTOR=0.500
UCX_SYSV_RX_MAX_BUFS=-1
UCX_SYSV_RX_BUFS_GROW=512
UCX_SYSV_FIFO_HUGETLB=n
UCX_SYSV_FIFO_ELEM_SIZE=128
UCX_SYSV_FIFO_MAX_POLL=16
UCX_SYSV_ERROR_HANDLING=n
UCX_POSIX_HUGETLB_MODE=try
UCX_POSIX_DIR=/dev/shm
UCX_POSIX_USE_PROC_LINK=y
UCX_POSIX_ALLOC=md,mmap,heap
UCX_POSIX_FAILURE=DIAG
UCX_POSIX_MAX_NUM_EPS=inf
UCX_POSIX_BW=12179.00MBps
UCX_POSIX_FIFO_SIZE=64
UCX_POSIX_SEG_SIZE=8256
UCX_POSIX_FIFO_RELEASE_FACTOR=0.500
UCX_POSIX_RX_MAX_BUFS=-1
UCX_POSIX_RX_BUFS_GROW=512
UCX_POSIX_FIFO_HUGETLB=n
UCX_POSIX_FIFO_ELEM_SIZE=128
UCX_POSIX_FIFO_MAX_POLL=16
UCX_POSIX_ERROR_HANDLING=n
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_PROTOS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_TLS_PRIORITY=rdmacm,tcp,sockcm
UCX_SELECT_DISTANCE_MD=cuda_cpy
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda
UCX_RNDV_MEMTYPE_DIRECT_SIZE=inf
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=auto
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MULTI_LANE_MAX_RATIO=4.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=2
UCX_RNDV_SCHEME=auto
UCX_RKEY_PTR_SEG_SIZE=512K
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=auto
UCX_ATOMIC_MODE=guess
UCX_ADDRESS_DEBUG_INFO=n
UCX_MAX_WORKER_ADDRESS_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8K
UCX_TM_THRESH=1K
UCX_TM_MAX_BB_SIZE=1K
UCX_TM_FORCE_THRESH=8K
UCX_TM_SW_RNDV=n
UCX_NUM_EPS=auto
UCX_NUM_PPN=auto
UCX_RNDV_FRAG_SIZE=host:512K,cuda:4M
UCX_RNDV_FRAG_ALLOC_COUNT=host:128,cuda:128
UCX_RNDV_FRAG_MEM_TYPE=host
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_RNDV_PIPELINE_SHM_ENABLE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
UCX_CM_USE_ALL_DEVICES=y
UCX_LISTENER_BACKLOG=auto
UCX_PROTO_ENABLE=n
UCX_KEEPALIVE_INTERVAL=20000000.00us
UCX_KEEPALIVE_NUM_EPS=128
UCX_RESOLVE_REMOTE_EP_ID=off
UCX_PROTO_INDIRECT_ID=auto
UCX_RNDV_PUT_FORCE_FLUSH=n
UCX_SA_DATA_VERSION=v1
UCX_RKEY_MPOOL_MAX_MD=2
UCX_RX_MPOOL_SIZES=64,1K
UCX_ADDRESS_VERSION=v1
UCX_RCACHE_ENABLE=try
UCX_PROTO_INFO=n
UCX_IB_REG_METHODS=rcache,odp,direct
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=auto
UCX_IB_RCACHE_ADDR_ALIGN=16
UCX_IB_RCACHE_MAX_REGIONS=inf
UCX_IB_RCACHE_MAX_SIZE=inf
UCX_IB_RCACHE_MAX_UNRELEASED=512M
UCX_IB_RCACHE_PURGE_ON_FORK=y
UCX_IB_MEM_REG_OVERHEAD=16.00us
UCX_IB_MEM_REG_GROWTH=0.00us
UCX_IB_FORK_INIT=try
UCX_IB_ASYNC_EVENTS=y
UCX_IB_ETH_PAUSE_ON=y
UCX_IB_ODP_NUMA_POLICY=preferred
UCX_IB_ODP_PREFETCH=n
UCX_IB_ODP_MAX_SIZE=auto
UCX_IB_DEVICE_SPECS=
UCX_IB_PREFER_NEAREST_DEVICE=y
UCX_IB_INDIRECT_ATOMIC=y
UCX_IB_GID_INDEX=auto
UCX_IB_SUBNET_PREFIX=
UCX_IB_GPU_DIRECT_RDMA=try
UCX_IB_PCI_BW=
UCX_IB_MLX5_DEVX=try
UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq,dci
UCX_IB_REG_MT_THRESH=4G
UCX_IB_REG_MT_CHUNK=2G
UCX_IB_REG_MT_BIND=n
UCX_IB_PCI_RELAXED_ORDERING=auto
UCX_IB_MAX_IDLE_RKEY_COUNT=16
UCX_DC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_DC_MLX5_FAILURE=DIAG
UCX_DC_MLX5_MAX_NUM_EPS=inf
UCX_DC_MLX5_SEG_SIZE=8256
UCX_DC_MLX5_TX_QUEUE_LEN=128
UCX_DC_MLX5_TX_MAX_BATCH=16
UCX_DC_MLX5_TX_MAX_POLL=16
UCX_DC_MLX5_TX_MIN_INLINE=64
UCX_DC_MLX5_TX_INLINE_RESP=64
UCX_DC_MLX5_TX_MIN_SGE=5
UCX_DC_MLX5_TX_MAX_BUFS=-1
UCX_DC_MLX5_TX_BUFS_GROW=1024
UCX_DC_MLX5_RX_QUEUE_LEN=4095
UCX_DC_MLX5_RX_MAX_BATCH=16
UCX_DC_MLX5_RX_MAX_POLL=16
UCX_DC_MLX5_RX_INLINE=64
UCX_DC_MLX5_RX_MAX_BUFS=-1
UCX_DC_MLX5_RX_BUFS_GROW=0
UCX_DC_MLX5_ADDR_TYPE=auto
UCX_DC_MLX5_IS_GLOBAL=n
UCX_DC_MLX5_SL=auto
UCX_DC_MLX5_TRAFFIC_CLASS=auto
UCX_DC_MLX5_HOP_LIMIT=255
UCX_DC_MLX5_NUM_PATHS=auto
UCX_DC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_DC_MLX5_ROCE_SUBNET_PREFIX_LEN=auto
UCX_DC_MLX5_ROCE_PATH_FACTOR=1
UCX_DC_MLX5_LID_PATH_BITS=0
UCX_DC_MLX5_PKEY=auto
UCX_DC_MLX5_PATH_MTU=default
UCX_DC_MLX5_MAX_RD_ATOMIC=auto
UCX_DC_MLX5_TIMEOUT=1000000.00us
UCX_DC_MLX5_RETRY_COUNT=7
UCX_DC_MLX5_RNR_TIMEOUT=1000.00us
UCX_DC_MLX5_RNR_RETRY_COUNT=7
UCX_DC_MLX5_FC_ENABLE=y
UCX_DC_MLX5_FC_WND_SIZE=512
UCX_DC_MLX5_FC_HARD_THRESH=0.250
UCX_DC_MLX5_FENCE=auto
UCX_DC_MLX5_MAX_GET_ZCOPY=auto
UCX_DC_MLX5_TX_NUM_GET_BYTES=inf
UCX_DC_MLX5_TX_POLL_ALWAYS=n
UCX_DC_MLX5_DM_SIZE=2K
UCX_DC_MLX5_DM_COUNT=1
UCX_DC_MLX5_MMIO_MODE=auto
UCX_DC_MLX5_AR_ENABLE=auto
UCX_DC_MLX5_CQE_ZIPPING_ENABLE=n
UCX_DC_MLX5_TX_MAX_BB=inf
UCX_DC_MLX5_TM_ENABLE=n
UCX_DC_MLX5_TM_LIST_SIZE=1024
UCX_DC_MLX5_TM_SEG_SIZE=48K
UCX_DC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_DC_MLX5_TM_MP_NUM_STRIDES=8
UCX_DC_MLX5_EXP_BACKOFF=0
UCX_DC_MLX5_SRQ_TOPO=list
UCX_DC_MLX5_LOG_ACK_REQ_FREQ=8
UCX_DC_MLX5_RX_QUEUE_LEN_INIT=128
UCX_DC_MLX5_NUM_DCI=8
UCX_DC_MLX5_TX_POLICY=dcs_quota
UCX_DC_MLX5_DCI_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCI_KA_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCT_FULL_HANDSHAKE=n
UCX_DC_MLX5_RAND_DCI_SEED=0
UCX_DC_MLX5_QUOTA=32
UCX_DC_MLX5_FC_HARD_REQ_TIMEOUT=5000000.00us
UCX_DC_MLX5_COMPACT_AV=y
UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_RC_VERBS_FAILURE=DIAG
UCX_RC_VERBS_MAX_NUM_EPS=256
UCX_RC_VERBS_SEG_SIZE=8256
UCX_RC_VERBS_TX_QUEUE_LEN=256
UCX_RC_VERBS_TX_MAX_BATCH=16
UCX_RC_VERBS_TX_MAX_POLL=16
UCX_RC_VERBS_TX_MIN_INLINE=64
UCX_RC_VERBS_TX_INLINE_RESP=64
UCX_RC_VERBS_TX_MIN_SGE=5
UCX_RC_VERBS_TX_MAX_BUFS=-1
UCX_RC_VERBS_TX_BUFS_GROW=1024
UCX_RC_VERBS_RX_QUEUE_LEN=4095
UCX_RC_VERBS_RX_MAX_BATCH=16
UCX_RC_VERBS_RX_MAX_POLL=16
UCX_RC_VERBS_RX_INLINE=64
UCX_RC_VERBS_RX_MAX_BUFS=-1
UCX_RC_VERBS_RX_BUFS_GROW=0
UCX_RC_VERBS_ADDR_TYPE=auto
UCX_RC_VERBS_IS_GLOBAL=n
UCX_RC_VERBS_SL=auto
UCX_RC_VERBS_TRAFFIC_CLASS=auto
UCX_RC_VERBS_HOP_LIMIT=255
UCX_RC_VERBS_NUM_PATHS=auto
UCX_RC_VERBS_ROCE_LOCAL_SUBNET=n
UCX_RC_VERBS_ROCE_SUBNET_PREFIX_LEN=auto
UCX_RC_VERBS_ROCE_PATH_FACTOR=1
UCX_RC_VERBS_LID_PATH_BITS=0
UCX_RC_VERBS_PKEY=auto
UCX_RC_VERBS_PATH_MTU=default
UCX_RC_VERBS_MAX_RD_ATOMIC=auto
UCX_RC_VERBS_TIMEOUT=1000000.00us
UCX_RC_VERBS_RETRY_COUNT=7
UCX_RC_VERBS_RNR_TIMEOUT=1000.00us
UCX_RC_VERBS_RNR_RETRY_COUNT=7
UCX_RC_VERBS_FC_ENABLE=y
UCX_RC_VERBS_FC_WND_SIZE=512
UCX_RC_VERBS_FC_HARD_THRESH=0.250
UCX_RC_VERBS_FENCE=auto
UCX_RC_VERBS_MAX_GET_ZCOPY=auto
UCX_RC_VERBS_TX_NUM_GET_BYTES=inf
UCX_RC_VERBS_TX_POLL_ALWAYS=n
UCX_RC_VERBS_FC_SOFT_THRESH=0.500
UCX_RC_VERBS_TX_CQ_MODERATION=64
UCX_RC_VERBS_TX_CQ_LEN=4096
UCX_RC_VERBS_MAX_AM_HDR=128
UCX_RC_VERBS_TX_MAX_WR=inf
UCX_RC_VERBS_FLUSH_MODE=auto
UCX_RC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_RC_MLX5_FAILURE=DIAG
UCX_RC_MLX5_MAX_NUM_EPS=256
UCX_RC_MLX5_SEG_SIZE=8256
UCX_RC_MLX5_TX_QUEUE_LEN=256
UCX_RC_MLX5_TX_MAX_BATCH=16
UCX_RC_MLX5_TX_MAX_POLL=16
UCX_RC_MLX5_TX_MIN_INLINE=64
UCX_RC_MLX5_TX_INLINE_RESP=64
UCX_RC_MLX5_TX_MIN_SGE=5
UCX_RC_MLX5_TX_MAX_BUFS=-1
UCX_RC_MLX5_TX_BUFS_GROW=1024
UCX_RC_MLX5_RX_QUEUE_LEN=4095
UCX_RC_MLX5_RX_MAX_BATCH=16
UCX_RC_MLX5_RX_MAX_POLL=16
UCX_RC_MLX5_RX_INLINE=64
UCX_RC_MLX5_RX_MAX_BUFS=-1
UCX_RC_MLX5_RX_BUFS_GROW=0
UCX_RC_MLX5_ADDR_TYPE=auto
UCX_RC_MLX5_IS_GLOBAL=n
UCX_RC_MLX5_SL=auto
UCX_RC_MLX5_TRAFFIC_CLASS=auto
UCX_RC_MLX5_HOP_LIMIT=255
UCX_RC_MLX5_NUM_PATHS=auto
UCX_RC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_RC_MLX5_ROCE_SUBNET_PREFIX_LEN=auto
UCX_RC_MLX5_ROCE_PATH_FACTOR=1
UCX_RC_MLX5_LID_PATH_BITS=0
UCX_RC_MLX5_PKEY=auto
UCX_RC_MLX5_PATH_MTU=default
UCX_RC_MLX5_MAX_RD_ATOMIC=auto
UCX_RC_MLX5_TIMEOUT=1000000.00us
UCX_RC_MLX5_RETRY_COUNT=7
UCX_RC_MLX5_RNR_TIMEOUT=1000.00us
UCX_RC_MLX5_RNR_RETRY_COUNT=7
UCX_RC_MLX5_FC_ENABLE=y
UCX_RC_MLX5_FC_WND_SIZE=512
UCX_RC_MLX5_FC_HARD_THRESH=0.250
UCX_RC_MLX5_FENCE=auto
UCX_RC_MLX5_MAX_GET_ZCOPY=auto
UCX_RC_MLX5_TX_NUM_GET_BYTES=inf
UCX_RC_MLX5_TX_POLL_ALWAYS=n
UCX_RC_MLX5_FC_SOFT_THRESH=0.500
UCX_RC_MLX5_TX_CQ_MODERATION=64
UCX_RC_MLX5_TX_CQ_LEN=4096
UCX_RC_MLX5_DM_SIZE=2K
UCX_RC_MLX5_DM_COUNT=1
UCX_RC_MLX5_MMIO_MODE=auto
UCX_RC_MLX5_AR_ENABLE=auto
UCX_RC_MLX5_CQE_ZIPPING_ENABLE=n
UCX_RC_MLX5_TX_MAX_BB=inf
UCX_RC_MLX5_TM_ENABLE=n
UCX_RC_MLX5_TM_LIST_SIZE=1024
UCX_RC_MLX5_TM_SEG_SIZE=48K
UCX_RC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_RC_MLX5_TM_MP_NUM_STRIDES=8
UCX_RC_MLX5_EXP_BACKOFF=0
UCX_RC_MLX5_SRQ_TOPO=cyclic,cyclic_emulated
UCX_RC_MLX5_LOG_ACK_REQ_FREQ=8
UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_UD_VERBS_FAILURE=DIAG
UCX_UD_VERBS_MAX_NUM_EPS=inf
UCX_UD_VERBS_SEG_SIZE=8K
UCX_UD_VERBS_TX_QUEUE_LEN=256
UCX_UD_VERBS_TX_MAX_BATCH=16
UCX_UD_VERBS_TX_MAX_POLL=16
UCX_UD_VERBS_TX_MIN_INLINE=64
UCX_UD_VERBS_TX_INLINE_RESP=0
UCX_UD_VERBS_TX_MIN_SGE=5
UCX_UD_VERBS_TX_MAX_BUFS=-1
UCX_UD_VERBS_TX_BUFS_GROW=1024
UCX_UD_VERBS_RX_QUEUE_LEN=4096
UCX_UD_VERBS_RX_MAX_BATCH=16
UCX_UD_VERBS_RX_MAX_POLL=16
UCX_UD_VERBS_RX_INLINE=0
UCX_UD_VERBS_RX_MAX_BUFS=-1
UCX_UD_VERBS_RX_BUFS_GROW=0
UCX_UD_VERBS_ADDR_TYPE=auto
UCX_UD_VERBS_IS_GLOBAL=n
UCX_UD_VERBS_SL=auto
UCX_UD_VERBS_TRAFFIC_CLASS=auto
UCX_UD_VERBS_HOP_LIMIT=255
UCX_UD_VERBS_NUM_PATHS=auto
UCX_UD_VERBS_ROCE_LOCAL_SUBNET=n
UCX_UD_VERBS_ROCE_SUBNET_PREFIX_LEN=auto
UCX_UD_VERBS_ROCE_PATH_FACTOR=1
UCX_UD_VERBS_LID_PATH_BITS=0
UCX_UD_VERBS_PKEY=auto
UCX_UD_VERBS_PATH_MTU=default
UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128
UCX_UD_VERBS_LINGER_TIMEOUT=300000000.00us
UCX_UD_VERBS_TIMEOUT=30000000.00us
UCX_UD_VERBS_TIMER_TICK=10000.00us
UCX_UD_VERBS_TIMER_BACKOFF=2.000
UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us
UCX_UD_VERBS_MIN_POKE_TIME=250000.00us
UCX_UD_VERBS_ETH_DGID_CHECK=y
UCX_UD_VERBS_MAX_WINDOW=1025
UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_UD_MLX5_FAILURE=DIAG
UCX_UD_MLX5_MAX_NUM_EPS=inf
UCX_UD_MLX5_SEG_SIZE=8K
UCX_UD_MLX5_TX_QUEUE_LEN=256
UCX_UD_MLX5_TX_MAX_BATCH=16
UCX_UD_MLX5_TX_MAX_POLL=16
UCX_UD_MLX5_TX_MIN_INLINE=64
UCX_UD_MLX5_TX_INLINE_RESP=0
UCX_UD_MLX5_TX_MIN_SGE=5
UCX_UD_MLX5_TX_MAX_BUFS=-1
UCX_UD_MLX5_TX_BUFS_GROW=1024
UCX_UD_MLX5_RX_QUEUE_LEN=4096
UCX_UD_MLX5_RX_MAX_BATCH=16
UCX_UD_MLX5_RX_MAX_POLL=16
UCX_UD_MLX5_RX_INLINE=0
UCX_UD_MLX5_RX_MAX_BUFS=-1
UCX_UD_MLX5_RX_BUFS_GROW=0
UCX_UD_MLX5_ADDR_TYPE=auto
UCX_UD_MLX5_IS_GLOBAL=n
UCX_UD_MLX5_SL=auto
UCX_UD_MLX5_TRAFFIC_CLASS=auto
UCX_UD_MLX5_HOP_LIMIT=255
UCX_UD_MLX5_NUM_PATHS=auto
UCX_UD_MLX5_ROCE_LOCAL_SUBNET=n
UCX_UD_MLX5_ROCE_SUBNET_PREFIX_LEN=auto
UCX_UD_MLX5_ROCE_PATH_FACTOR=1
UCX_UD_MLX5_LID_PATH_BITS=0
UCX_UD_MLX5_PKEY=auto
UCX_UD_MLX5_PATH_MTU=default
UCX_UD_MLX5_RX_QUEUE_LEN_INIT=128
UCX_UD_MLX5_LINGER_TIMEOUT=300000000.00us
UCX_UD_MLX5_TIMEOUT=30000000.00us
UCX_UD_MLX5_TIMER_TICK=10000.00us
UCX_UD_MLX5_TIMER_BACKOFF=2.000
UCX_UD_MLX5_ASYNC_TIMER_TICK=100000.00us
UCX_UD_MLX5_MIN_POKE_TIME=250000.00us
UCX_UD_MLX5_ETH_DGID_CHECK=y
UCX_UD_MLX5_MAX_WINDOW=1025
UCX_UD_MLX5_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_DM_SIZE=2K
UCX_UD_MLX5_DM_COUNT=1
UCX_UD_MLX5_MMIO_MODE=auto
UCX_UD_MLX5_AR_ENABLE=auto
UCX_UD_MLX5_CQE_ZIPPING_ENABLE=n
UCX_UD_MLX5_COMPACT_AV=y
UCX_RDMA_CM_FAILURE=DIAG
UCX_RDMA_CM_REUSEADDR=n
UCX_RDMA_CM_SOURCE_ADDRESS=
UCX_RDMA_CM_TIMEOUT=10000000.00us
UCX_RDMA_CM_RESERVED_QPN=try
UCX_CMA_MEMORY_INVALIDATE=n
UCX_CMA_ALLOC=huge,thp,mmap,heap
UCX_CMA_FAILURE=DIAG
UCX_CMA_MAX_NUM_EPS=inf
UCX_CMA_BW=11145.00MBps
UCX_CMA_MAX_IOV=16
UCX_CMA_SEG_SIZE=512K
UCX_CMA_TX_QUOTA=1
UCX_CMA_TX_MAX_BUFS=-1
UCX_CMA_TX_BUFS_GROW=8
UCX_KNEM_RCACHE=try
UCX_KNEM_RCACHE_MEM_PRIO=1000
UCX_KNEM_RCACHE_OVERHEAD=auto
UCX_KNEM_RCACHE_ADDR_ALIGN=64
UCX_KNEM_RCACHE_MAX_REGIONS=inf
UCX_KNEM_RCACHE_MAX_SIZE=inf
UCX_KNEM_RCACHE_MAX_UNRELEASED=512M
UCX_KNEM_RCACHE_PURGE_ON_FORK=y
UCX_KNEM_ALLOC=huge,thp,md,mmap,heap
UCX_KNEM_FAILURE=DIAG
UCX_KNEM_MAX_NUM_EPS=inf
UCX_KNEM_BW=13862.00MBps
UCX_KNEM_MAX_IOV=16
UCX_KNEM_SEG_SIZE=512K
UCX_KNEM_TX_QUOTA=1
UCX_KNEM_TX_MAX_BUFS=-1
UCX_KNEM_TX_BUFS_GROW=8
UCX_XPMEM_HUGETLB_MODE=try
UCX_XPMEM_ALLOC=md,mmap,heap
UCX_XPMEM_FAILURE=DIAG
UCX_XPMEM_MAX_NUM_EPS=inf
UCX_XPMEM_BW=12179.00MBps
UCX_XPMEM_FIFO_SIZE=64
UCX_XPMEM_SEG_SIZE=8256
UCX_XPMEM_FIFO_RELEASE_FACTOR=0.500
UCX_XPMEM_RX_MAX_BUFS=-1
UCX_XPMEM_RX_BUFS_GROW=512
UCX_XPMEM_FIFO_HUGETLB=n
UCX_XPMEM_FIFO_ELEM_SIZE=128
UCX_XPMEM_FIFO_MAX_POLL=16
UCX_XPMEM_ERROR_HANDLING=n
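Many of the values above are sizes written with a unit suffix (8K, 4M, 2G), as plain byte counts (8256), or as inf. The following simplified sketch shows how such strings map to byte counts. It is not UCX's own parser: the function names are hypothetical, K/M/G are assumed to be binary multiples (1024-based), and values such as auto or try are simply rejected.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse "8K", "4M", "2G", "8256" or "inf" into a byte count.
 * Returns 0 on success and -1 if the string is not a size. */
int demo_parse_memunits(const char *s, uint64_t *bytes)
{
    char              *end;
    unsigned long long value;

    assert(s != NULL && bytes != NULL);
    if (strcmp(s, "inf") == 0) {
        *bytes = UINT64_MAX;
        return 0;
    }
    if (!isdigit((unsigned char)s[0])) {
        return -1; /* e.g. "auto", "try", "-1" */
    }
    value = strtoull(s, &end, 10);
    switch (*end) {
    case '\0':
        break;
    case 'K':
        value <<= 10;
        end++;
        break;
    case 'M':
        value <<= 20;
        end++;
        break;
    case 'G':
        value <<= 30;
        end++;
        break;
    default:
        return -1;
    }
    if (*end != '\0') {
        return -1; /* trailing characters after the unit */
    }
    *bytes = value;
    return 0;
}

/* Convenience wrapper: returns the byte count, or 0 if parsing fails. */
uint64_t demo_memunits_or_zero(const char *s)
{
    uint64_t bytes;

    return (demo_parse_memunits(s, &bytes) == 0) ? bytes : 0;
}
```

For example, the UCX_SEG_SIZE=8K entry corresponds to 8192 bytes under this assumption, while UCX_RC_VERBS_SEG_SIZE=8256 is already a plain byte count.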
​

Printing the configuration documentation

ucx_info -f
print_flags |= UCS_CONFIG_PRINT_CONFIG | UCS_CONFIG_PRINT_HEADER | UCS_CONFIG_PRINT_DOC
if (flags & UCS_CONFIG_PRINT_DOC)
  ucs_config_print_doc_line_by_line

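ppn: processes per node

Bandwidth specification as a function of processes per node (PPN): f(ppn) = dedicated + shared / ppn. This structure specifies a function that serves as the basis for bandwidth estimates of the various UCT operations. The information can be used to select the best-performing combination of UCT operations
typedef struct uct_ppn_bandwidth {
    double                   dedicated; /**< Dedicated bandwidth, bytes/second */
    double                   shared;    /**< Shared bandwidth, bytes/second */
} uct_ppn_bandwidth_t;
bandwidth: 11.91/ppn + 0.00 MB/sec
           shared      dedicated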
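The formula f(ppn) = dedicated + shared / ppn is easy to evaluate by hand; the small sketch below does it on a local copy of the two fields. `demo_ppn_bandwidth_t` and the function names are hypothetical and only mirror the structure shown above.

```c
#include <assert.h>

/* Local mirror of uct_ppn_bandwidth_t (hypothetical type name). */
typedef struct demo_ppn_bandwidth {
    double dedicated; /* bytes/second available to each process */
    double shared;    /* bytes/second split among all processes on the node */
} demo_ppn_bandwidth_t;

/* Estimated per-process bandwidth: f(ppn) = dedicated + shared / ppn. */
double demo_ppn_bandwidth(const demo_ppn_bandwidth_t *bw, unsigned ppn)
{
    assert(ppn > 0);
    return bw->dedicated + bw->shared / ppn;
}

/* Convenience wrapper taking the two components directly. */
double demo_bandwidth_for(double dedicated, double shared, unsigned ppn)
{
    demo_ppn_bandwidth_t bw = {dedicated, shared};

    return demo_ppn_bandwidth(&bw, ppn);
}
```

With a purely shared link of 12e9 bytes/second and 4 processes per node, each process is estimated at 3e9 bytes/second; a purely dedicated bandwidth does not change with ppn.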

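Log levels

The log level option is defined in ucs_global_opts_table and can be set through the environment, for example:
export UCX_LOG_LEVEL=debug
export UCX_LOG_LEVEL=trace

Compile-time maximum log level (from the configure script):
        [no],          [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_DEBUG], [Highest log level])],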
        [warn],        [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_WARN], [Highest log level])],
        [diag],        [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_DIAG], [Highest log level])],
        [info],        [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_INFO], [Highest log level])],
        [debug],       [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_DEBUG], [Highest log level])],
        [trace],       [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_TRACE], [Highest log level])],
        [trace_req],   [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_TRACE_REQ], [Highest log level])],
        [trace_data],  [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_TRACE_DATA], [Highest log level])],
        [trace_async], [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_TRACE_ASYNC], [Highest log level])],
        [trace_func],  [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_TRACE_FUNC], [Highest log level])],
        [trace_poll],  [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_TRACE_POLL], [Highest log level])],
                       [AC_DEFINE([UCS_MAX_LOG_LEVEL], [UCS_LOG_LEVEL_TRACE_POLL], [Highest log level])])
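The levels are ordered from least to most verbose, and a message is emitted only when its level is not more verbose than the configured one. The sketch below models that filter. The level names from warn onward follow the list above; placing fatal and error before warn is my assumption about the ordering, the names are matched in lowercase only, and none of the function names are UCX APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Level names from least to most verbose ("fatal" and "error" assumed). */
static const char *const demo_levels[] = {
    "fatal", "error", "warn", "diag", "info", "debug", "trace",
    "trace_req", "trace_data", "trace_async", "trace_func", "trace_poll"
};

#define DEMO_NUM_LEVELS (sizeof(demo_levels) / sizeof(demo_levels[0]))

/* Returns the index of a level name, or -1 if the name is unknown. */
int demo_log_level_index(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < DEMO_NUM_LEVELS; ++i) {
        if (strcmp(name, demo_levels[i]) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Returns 1 if a message at level `message` passes a filter set to
 * `configured`, and 0 otherwise (including for unknown names). */
int demo_log_enabled(const char *configured, const char *message)
{
    int cfg = demo_log_level_index(configured);
    int msg = demo_log_level_index(message);

    return (cfg >= 0) && (msg >= 0) && (msg <= cfg);
}
```

So with the level set to warn, debug messages are dropped, while setting it to trace also lets debug messages through.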

All shared library modules (*.so)

[root@node63 examples]# ls -alh /usr/lib64/ucx/*so.0
lrwxrwxrwx 1 root root 19 Sep  7 18:58 /usr/lib64/ucx/libuct_cma.so.0 -> libuct_cma.so.0.0.0
lrwxrwxrwx 1 root root 18 Sep  7 18:58 /usr/lib64/ucx/libuct_ib.so.0 -> libuct_ib.so.0.0.0
lrwxrwxrwx 1 root root 20 Sep  7 18:58 /usr/lib64/ucx/libuct_knem.so.0 -> libuct_knem.so.0.0.0
lrwxrwxrwx 1 root root 22 Sep  7 18:58 /usr/lib64/ucx/libuct_rdmacm.so.0 -> libuct_rdmacm.so.0.0.0
lrwxrwxrwx 1 root root 21 Sep  7 18:58 /usr/lib64/ucx/libuct_xpmem.so.0 -> libuct_xpmem.so.0.0.0

NIC speeds

Data transfer rates

Ethernet: 10/25/40/50/100 Gb/s

InfiniBand: SDR, DDR, QDR, FDR, EDR, HDR100

/**
 * IB port active speed.
 */
enum {
    UCT_IB_SPEED_SDR     = 1,  // 2.5Gbps
    UCT_IB_SPEED_DDR     = 2,  // 5Gbps
    UCT_IB_SPEED_QDR     = 4,  // 10Gbps
    UCT_IB_SPEED_FDR10   = 8,
    UCT_IB_SPEED_FDR     = 16,
    UCT_IB_SPEED_EDR     = 32,
    UCT_IB_SPEED_HDR     = 64,
    UCT_IB_SPEED_NDR     = 128,
    UCT_IB_SPEED_LAST
};

    case UCT_IB_SPEED_SDR:
        iface_attr->latency.c = 5000e-9;
        signal_rate           = 2.5e9;
        encoding              = 8.0/10.0;
        break;
    case UCT_IB_SPEED_DDR:
        iface_attr->latency.c = 2500e-9;
        signal_rate           = 5.0e9;
        encoding              = 8.0/10.0;
        break;
    case UCT_IB_SPEED_QDR:
        iface_attr->latency.c = 1300e-9;
        if (uct_ib_iface_is_roce(iface)) {
            /* 10/40g Eth  */
            signal_rate       = 10.3125e9;
            encoding          = 64.0/66.0;
        } else {
            /* QDR */
            signal_rate       = 10.0e9;
            encoding          = 8.0/10.0;
        }
        break;
    case UCT_IB_SPEED_FDR10:
        iface_attr->latency.c = 700e-9;
        signal_rate           = 10.3125e9;
        encoding              = 64.0/66.0;
        break;
    case UCT_IB_SPEED_FDR:
        iface_attr->latency.c = 700e-9;
        signal_rate           = 14.0625e9;
        encoding              = 64.0/66.0;
        break;
    case UCT_IB_SPEED_EDR:
        iface_attr->latency.c = 600e-9;
        signal_rate           = 25.78125e9;
        encoding              = 64.0/66.0;
        break;
    case UCT_IB_SPEED_HDR:
        iface_attr->latency.c = 600e-9;
        signal_rate           = 25.78125e9 * 2;
        encoding              = 64.0/66.0;
        break;
    case UCT_IB_SPEED_NDR:
        iface_attr->latency.c = 600e-9;
        signal_rate           = 100e9;
        encoding              = 64.0/66.0;
        break;
    }
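The signal_rate and encoding values chosen in the switch above feed the wire-speed formula wire_speed = (width * signal_rate * encoding * num_path) / 8.0. The sketch below evaluates it for two cases; the link width of 4 lanes and the single path are assumptions made for the example, and the function names are hypothetical.

```c
#include <assert.h>

/* Wire speed in bytes/second:
 * lanes * bits/second per lane * encoding efficiency * paths, over 8 bits. */
double demo_wire_speed(unsigned width, double signal_rate, double encoding,
                       unsigned num_path)
{
    assert(width > 0 && num_path > 0);
    return (width * signal_rate * encoding * num_path) / 8.0;
}

/* Relative comparison helper for floating-point results. */
int demo_close(double actual, double expected)
{
    double diff = actual - expected;

    if (diff < 0) {
        diff = -diff;
    }
    return diff <= 1e-9 * expected;
}
```

For a 4x EDR link, 25.78125e9 * 64/66 is 25e9 payload bits per second per lane, so `demo_wire_speed(4, 25.78125e9, 64.0 / 66.0, 1)` comes to about 12.5e9 bytes/second. A 4x QDR InfiniBand link with 8b/10b encoding gives 4e9 bytes/second.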

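Important concepts

vhca_id

vhca_id is used as the device index in vport rx rules. The device index is like the PF index and is limited to the maximum number of physical ports. For example, for SFs created under a PF, the device index is the PF device index. vhca_id provides the firmware index of each vport and is used for vport rx rules and vport pair events

User Access Region (UAR)

The User Access Region (UAR) mechanism gives multiple processes isolated, protected, and independent direct access to the HCA hardware. The UAR is part of the PCI address space and is mapped for direct access to the HCA from the CPU. It consists of multiple pages, each containing registers that control HCA operation. The UAR mechanism is used to post execution or control requests to the HCA, and the HCA uses it to enforce protection and isolation between processes. Cross-process isolation and protection rely on four key mechanisms:

  1. Host software maps different UAR pages to different consumers, which enforces isolation between consumers accessing the same page in the HCA control space
  2. Each control object can be accessed (controlled) through a UAR page
  3. The HCA driver associates a UAR page with a control object when it initializes ("opens") the object
  4. Before executing a control operation on an object, the HCA verifies that the UAR page used to post the command matches the page specified in that object's context

An operation is passed to the HCA by posting a WQE to the corresponding work queue (WQ), updating the doorbell record (where applicable), and writing to the corresponding doorbell register in the UAR page associated with that WQ. Writing to the HCA's doorbell register is referred to as ringing the doorbell. The DoorBell register in the UAR is the first 2 DWORDs of the BlueFlame buffer

IB events

enum ibv_event_type {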
    IBV_EVENT_CQ_ERR,
    IBV_EVENT_QP_FATAL,
    IBV_EVENT_QP_REQ_ERR,
    IBV_EVENT_QP_ACCESS_ERR,
    IBV_EVENT_COMM_EST,
    IBV_EVENT_SQ_DRAINED,
    IBV_EVENT_PATH_MIG,
    IBV_EVENT_PATH_MIG_ERR,
    IBV_EVENT_DEVICE_FATAL,
    IBV_EVENT_PORT_ACTIVE,
    IBV_EVENT_PORT_ERR,
    IBV_EVENT_LID_CHANGE,
    IBV_EVENT_PKEY_CHANGE,
    IBV_EVENT_SM_CHANGE,
    IBV_EVENT_SRQ_ERR,
    IBV_EVENT_SRQ_LIMIT_REACHED,
    IBV_EVENT_QP_LAST_WQE_REACHED,
    IBV_EVENT_CLIENT_REREGISTER,
    IBV_EVENT_GID_CHANGE,
    IBV_EVENT_WQ_FATAL,
};

Memory types

typedef enum ucs_memory_type {
    UCS_MEMORY_TYPE_HOST,          /**< Default system memory */
    UCS_MEMORY_TYPE_CUDA,          /**< NVIDIA CUDA memory */
    UCS_MEMORY_TYPE_CUDA_MANAGED,  /**< NVIDIA CUDA managed (or unified) memory */
    UCS_MEMORY_TYPE_ROCM,          /**< AMD ROCM memory */
    UCS_MEMORY_TYPE_ROCM_MANAGED,  /**< AMD ROCM managed system memory */
    UCS_MEMORY_TYPE_LAST,
    UCS_MEMORY_TYPE_UNKNOWN = UCS_MEMORY_TYPE_LAST
} ucs_memory_type_t;

Other references

UCT notes

https://github.com/ssbandjl/ucx/blob/master/category/uct_readme

A Performance Study of UCX over InfiniBand

https://www.semanticscholar.org/paper/A-Performance-Study-of-UCX-over-InfiniBand-Papadopoulou-Oden/f063a0fae6909af162baa5795a285e067bb197f3

Xiaobing (ssbandjl)

Blog: https://logread.cn | https://blog.csdn.net/ssbandjl | https://cloud.tencent.com/developer/user/5060293/articles

DAOS summary: https://cloud.tencent.com/developer/article/2344030
