DPU/IPU SPDK Storage Offload: Userspace vfio (vfio-user)

2024-09-01 14:16:04

Terminology

IPU: Infrastructure Processing Unit (IPU), an infrastructure processing card (hardware) to which functions such as storage processing can be offloaded.

Introduction

The reference architecture for offloading the NVMe target onto an IPU is as follows:

The SPDK block device can be a local NVMe SSD or a remote backend (NVMe-oF, iSCSI, Ceph, DAOS, etc.):

Timeline of NVMe target development in SPDK:

  • 2016: nvmf_target support added (RDMA transport only)
  • 2020: vfio-user transport layer added (software-emulated NVMe disk)

Combining SPDK, IPUs, and Kubernetes:

VFIO-USER Overview

vfio-user is a protocol that allows devices to be emulated in a separate process outside the virtual machine monitor (VMM).

  • The vfio-user specification is largely based on the Linux VFIO ioctl interface, re-cast as messages sent over a UNIX domain socket.
  • vfio-user has two parts: the vfio-user client, which runs inside the VMM or application, and the vfio-user server, which emulates the device in a separate process.
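For orientation, every vfio-user command travels over that UNIX domain socket as a small fixed header followed by a command-specific payload (plus, for some commands, file descriptors passed as ancillary data). The sketch below approximates that header based on the public vfio-user draft specification and the libvfio-user headers; treat the exact field names and widths as illustrative rather than authoritative.

Code language: C
#include <stdint.h>

/* Approximate on-the-wire header of a vfio-user message (draft spec).
 * Every request and reply starts with this header; a reply echoes the
 * msg_id of the request it answers. */
struct vfio_user_header {
    uint16_t msg_id;    /* pairs a reply with its request */
    uint16_t cmd;       /* e.g. VFIO_USER_VERSION, VFIO_USER_DMA_MAP,
                           VFIO_USER_REGION_READ/WRITE, VFIO_USER_DEVICE_RESET */
    uint32_t msg_size;  /* total size in bytes, header + payload */
    uint32_t flags;     /* request/reply type, no-reply and error bits */
    uint32_t error_no;  /* errno value when the error bit is set in a reply */
} __attribute__((packed));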

Background

Presenting paravirtualized devices to a virtual machine has historically required a dedicated driver in the guest operating system; the most common examples are virtio-blk and virtio-scsi. These devices can be provided either by the host operating system (for example, KVM can emulate virtio-blk and virtio-scsi devices for the guest) or by a separate userspace process (the vhost-user protocol connects to such targets, typically provided by SPDK). However, today only Linux ships built-in virtio-blk and virtio-scsi drivers: the BIOS generally has no such driver and therefore cannot boot from these devices, and operating systems such as Windows require a separate driver installation. This talk introduces a new, standardized protocol called vfio-user that a virtual machine can use to communicate with another process capable of emulating any PCI device; QEMU will support it in an upcoming release. It then describes how SPDK uses this new protocol to present a paravirtualized NVMe device to the guest, allowing the guest BIOS and guest OS to use these disks with their existing NVMe drivers, unmodified. It closes with benchmarks showing the very low overhead of this form of virtualization.

VFIO-USER: a new virtualization protocol, May 4, 2021, by Ben Walker.

We are excited to announce support for NVMe over vfio-user, a technology that lets SPDK present a fully emulated NVMe device to a virtual machine. The VM can use its existing NVMe driver to talk to the device, and data is transferred to and from SPDK efficiently over shared memory. In other words, it is just like vhost-user, but emulating an NVMe device instead of a virtio-blk or virtio-scsi device.

vfio-user is currently being standardized and adopted by the QEMU community. The draft specification has agreement from all interested parties and is maturing quickly. The protocol itself can emulate any physical device, not just NVMe, but so far emulating NVMe devices with SPDK is the first and primary consumer of the new interface. Nutanix has released a convenience library that helps implement the server side of the protocol, and SPDK leverages it in its implementation.

NVMe device emulation is implemented using SPDK's existing NVMe-oF target, treating vfio-user as a shared-memory "transport" in the same way it supports TCP, RDMA, PCIe, and Fibre Channel. The NVMe-oF specification already requires extensive emulation of an NVMe device, so nearly all of the logic needed to emulate a physical NVMe SSD was already present; it really was as simple as adding a new transport plugin for vfio-user. While this transport is not formally part of the NVMe-oF specification, the transport plugin system is designed to allow custom transports. Such plugins can even live in shared libraries outside the SPDK repository and be discovered at runtime, although the vfio-user transport is built into SPDK itself.

We are excited about this advance in virtualization. By emulating a physical NVMe device, any operating system with an NVMe driver (i.e. all of them) can talk to the device. This matters especially for Windows: no more loading virtio drivers! It also allows virtual machines to boot from the emulated NVMe device, since the UEFI BIOS includes an NVMe driver.

Longer term, we fully expect vfio-user to win market share not only in storage but also in networking and other areas. We expect its performance to be excellent, at least as good as virtio-blk and possibly better, and we will be very happy to share benchmarks as soon as they are available.

libvfio-user Overall Architecture

SPDK vfio-user QEMU Test Architecture

Components

Client-side components:

Server-side components:

PCI Layout of the Emulated virtio Device

Code language: C
/* virtio device layout:
 *
 * region 1: MSI-X Table
 * region 2: MSI-X PBA
 * region 4: virtio modern memory 64bits BAR
 *     Common configuration          0x0    - 0x1000
 *     ISR access                    0x1000 - 0x2000
 *     Device specific configuration 0x2000 - 0x3000
 *     Notifications                 0x3000 - 0x4000
 */
#define VIRTIO_PCI_COMMON_CFG_OFFSET    (0x0)
#define VIRTIO_PCI_COMMON_CFG_LENGTH    (0x1000)
#define VIRTIO_PCI_ISR_ACCESS_OFFSET    (VIRTIO_PCI_COMMON_CFG_OFFSET + VIRTIO_PCI_COMMON_CFG_LENGTH)
#define VIRTIO_PCI_ISR_ACCESS_LENGTH    (0x1000)
#define VIRTIO_PCI_SPECIFIC_CFG_OFFSET    (VIRTIO_PCI_ISR_ACCESS_OFFSET + VIRTIO_PCI_ISR_ACCESS_LENGTH)
#define VIRTIO_PCI_SPECIFIC_CFG_LENGTH    (0x1000)
#define VIRTIO_PCI_NOTIFICATIONS_OFFSET    (VIRTIO_PCI_SPECIFIC_CFG_OFFSET + VIRTIO_PCI_SPECIFIC_CFG_LENGTH)
#define VIRTIO_PCI_NOTIFICATIONS_LENGTH    (0x1000)

#define VIRTIO_PCI_BAR4_LENGTH        (VIRTIO_PCI_NOTIFICATIONS_OFFSET + VIRTIO_PCI_NOTIFICATIONS_LENGTH)
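The guest does not learn this BAR4 layout out of band: a modern virtio PCI device advertises each window (common configuration, ISR, device-specific configuration, notifications) through vendor-specific PCI capabilities in configuration space, which is what the get_vendor_capability/vfu_pci_add_capability calls in the endpoint-realize flow below construct. The structure comes from the virtio specification and is reproduced here only as a reminder of its shape, not as SPDK code.

Code language: C
#include <stdint.h>

/* Virtio-over-PCI vendor capability (struct virtio_pci_cap in the virtio spec).
 * One capability is emitted per window, pointing into the BAR4 layout above. */
struct virtio_pci_cap {
    uint8_t  cap_vndr;   /* PCI_CAP_ID_VNDR (0x09) */
    uint8_t  cap_next;   /* offset of the next capability */
    uint8_t  cap_len;    /* length of this capability */
    uint8_t  cfg_type;   /* COMMON_CFG / NOTIFY_CFG / ISR_CFG / DEVICE_CFG ... */
    uint8_t  bar;        /* which BAR the window lives in (BAR4 here) */
    uint8_t  padding[3];
    uint32_t offset;     /* offset of the window within that BAR */
    uint32_t length;     /* length of the window */
};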

Source Code Analysis

On the SPDK target side, create the emulated device (a virtio-blk endpoint in this example) and its userspace vfio (vfio-user) endpoint, then start the VM in QEMU with the vfio-user-pci device type:

Code language: bash
1. build/bin/spdk_tgt
2. scripts/rpc.py bdev_malloc_create -b malloc0 $((512)) 512
3. scripts/rpc.py vfu_virtio_create_blk_endpoint vfu.0 --bdev-name malloc0 \
   --cpumask=0x1 --num-queues=2 \
   --qsize=256 --packed-ring
4. Start QEMU with '-device vfio-user-pci,socket=/spdk/vfu.0'

VFU Block Device Endpoint Creation Flow

Code language: C
Executing the RPC:
SPDK_RPC_REGISTER("vfu_virtio_create_blk_endpoint", rpc_vfu_virtio_create_blk_endpoint, SPDK_RPC_RUNTIME)
rpc_vfu_virtio_create_blk_endpoint
    spdk_vfu_create_endpoint(req.name, req.cpumask, "virtio_blk")
        vfu_parse_core_mask
        spdk_vfu_get_endpoint_by_name(endpoint_name)
            TAILQ_FOREACH_SAFE(endpoint, &g_endpoint, link, tmp)
        ops = tgt_get_pci_device_ops(dev_type_name)
        TAILQ_FOREACH_SAFE(pci_ops, &g_pci_device_ops, link, tmp) -> look it up in the global ops table
        basename = tgt_get_base_path() -> g_endpoint_path_dirname
        endpoint->endpoint_ctx = ops->init(endpoint, basename, endpoint_name) -> vfu_virtio_blk_endpoint_init
        tgt_endpoint_realize(endpoint)
            ret = endpoint->ops.get_device_info(endpoint, &pci_dev) -> vfu_virtio_blk_get_device_info
                vfu_virtio_get_device_info(&blk_endpoint->virtio, device_info)
                    memcpy(device_info, &vfu_virtio_device_info, sizeof(*device_info))
                    device_info->regions[VFU_PCI_DEV_BAR4_REGION_IDX].fd = virtio_endpoint->devmem_fd
                device_info->id.did = PCI_DEVICE_ID_VIRTIO_BLK_MODERN
            endpoint->vfu_ctx = vfu_create_ctx
                vfu_ctx = calloc(1, sizeof(vfu_ctx_t))
                vfu_ctx->tran = &tran_sock_ops
                vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_ERR_IRQ, 1) -> vfu_ctx->irq_count[type] = count
                vfu_ctx->tran->init(vfu_ctx) -> tran_sock_init
            vfu_pci_init(endpoint->vfu_ctx, VFU_PCI_TYPE_EXPRESS, PCI_HEADER_TYPE_NORMAL, 0)
                cfg_space = calloc(1, size)
                vfu_ctx->pci.config_space = cfg_space
                vfu_ctx->reg_info[VFU_PCI_DEV_CFG_REGION_IDX].size = size
            vfu_pci_set_id
            vfu_pci_set_class
            cap_size = endpoint->ops.get_vendor_capability(endpoint, buf, 256, vendor_cap_idx)
            cap_offset = vfu_pci_add_capability(endpoint->vfu_ctx, 0, 0, vendor_cap)
            vfu_setup_region(endpoint->vfu_ctx, region_idx, region->len, region->access_cb, region->flags, region->nr_sparse_mmaps ? sparse_mmap : NULL, region->nr_sparse_mmaps, region->fd, region->offset)
                copyin_mmap_areas(reg, mmap_areas, nr_mmap_areas)
                    memcpy(reg_info->mmap_areas, mmap_areas, size)
            vfu_setup_device_dma(endpoint->vfu_ctx, tgt_memory_region_add_cb, tgt_memory_region_remove_cb)
                vfu_ctx->dma = dma_controller_create(vfu_ctx, MAX_DMA_REGIONS, MAX_DMA_SIZE)
                    dma = malloc(offsetof(dma_controller_t, regions) + max_regions * sizeof(dma->regions[0]))
            vfu_setup_device_reset_cb(endpoint->vfu_ctx, tgt_device_reset_cb)
            vfu_setup_device_quiesce_cb(endpoint->vfu_ctx, tgt_device_quiesce_cb)
            vfu_setup_device_nr_irqs
            vfu_realize_ctx
            endpoint->pci_config_space = vfu_pci_get_config_space(endpoint->vfu_ctx)
            init_pci_config_space
                p->hdr.bars[0].raw = 0x0
                p->hdr.intr.ipin = ipin
                ...
        endpoint->thread = spdk_thread_create(endpoint_name, &cpumask)
        spdk_thread_send_msg(endpoint->thread, tgt_endpoint_start_thread, endpoint)
            endpoint->accept_poller = SPDK_POLLER_REGISTER(tgt_accept_poller, endpoint, 1000) -> the server side accepts incoming vfio-user protocol requests
                vfu_attach_ctx -> tran_sock_attach
                    ts->conn_fd = accept(ts->listen_fd, NULL, NULL)
                    ret = tran_negotiate(vfu_ctx, &ts->client_cmd_socket_fd) -> switch ctx
                endpoint->vfu_ctx_poller = SPDK_POLLER_REGISTER(tgt_vfu_ctx_poller, endpoint, 1000)
                    vfu_run_ctx(vfu_ctx)
                    do
                        err = get_request(vfu_ctx, &msg)
                        handle_request(vfu_ctx, msg)
                            switch (msg->hdr.cmd) -> dispatch according to the protocol command
                            case VFIO_USER_DMA_MAP
                                handle_dma_map(vfu_ctx, msg, msg->in.iov.iov_base)
                            ...
                endpoint->ops.attach_device(endpoint)
    vfu_virtio_blk_add_bdev(req.name, req.bdev_name, req.num_queues, req.qsize, req.packed_ring)
        spdk_bdev_open_ext(bdev_name, true, bdev_event_cb, blk_endpoint, &blk_endpoint->bdev_desc) -> open and set bdev_desc -> associate the bdev
        virtio_blk_update_config(&blk_endpoint->blk_cfg, blk_endpoint->bdev, blk_endpoint->virtio.num_queues)
            blk_cfg->blk_size = spdk_bdev_get_block_size(bdev)

Linking the global ops table g_pci_device_ops with vfu_virtio_blk_ops

Code language: C
__attribute__((constructor)) _vfu_virtio_blk_pci_model_register
    spdk_vfu_register_endpoint_ops(&vfu_virtio_blk_ops)
        TAILQ_INSERT_TAIL(&g_pci_device_ops, pci_ops, link)

struct spdk_vfu_endpoint_ops vfu_virtio_blk_ops = {
    .name = "virtio_blk",
    .init = vfu_virtio_blk_endpoint_init,
        vfu_virtio_endpoint_setup(&blk_endpoint->virtio, endpoint, basename, endpoint_name, &virtio_blk_ops)
            snprintf(path, PATH_MAX, "%s%s_bar4", basename, endpoint_name)
            open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)
            ftruncate(virtio_endpoint->devmem_fd, VIRTIO_PCI_BAR4_LENGTH)
            virtio_endpoint->doorbells = mmap(NULL, VIRTIO_PCI_NOTIFICATIONS_LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, virtio_endpoint->devmem_fd, VIRTIO_PCI_NOTIFICATIONS_OFFSET)
            virtio_endpoint->virtio_ops = *ops
            virtio_endpoint->num_queues = VIRTIO_DEV_MAX_VQS -> 64
            virtio_endpoint->qsize = VIRTIO_VQ_DEFAULT_SIZE -> 128
    .get_device_info = vfu_virtio_blk_get_device_info,
    .get_vendor_capability = vfu_virtio_get_vendor_capability,
    .post_memory_add = vfu_virtio_post_memory_add,
        virtio_dev_map_vq(dev, &dev->vqs[i])
            phys_addr = ((((uint64_t)vq->desc_hi) << 32) | vq->desc_lo)
            vfu_virtio_map_q(dev, &vq->desc, phys_addr, len)
                addr = spdk_vfu_map_one(virtio_endpoint->endpoint, phys_addr, len, mapping->sg, &mapping->iov, PROT_READ | PROT_WRITE)
            vfu_virtio_map_q(dev, &vq->avail, phys_addr, len)
            vfu_virtio_map_q(dev, &vq->used, phys_addr, len)
    .pre_memory_remove = vfu_virtio_pre_memory_remove,
    .reset_device = vfu_virtio_pci_reset_cb,
    .quiesce_device = vfu_virtio_quiesce_cb,
    .destruct = vfu_virtio_blk_endpoint_destruct,
    .attach_device = vfu_virtio_attach_device,
        for (j = 0; j <= vq->qsize; j++)
            req = vfu_virtio_vq_alloc_req(virtio_endpoint, vq)
                endpoint->virtio_ops.alloc_req(endpoint, vq)
            req->indirect_sg = virtio_req_to_sg_t(req, VIRTIO_DEV_MAX_IOVS)
            req->dev = dev
            STAILQ_INSERT_TAIL(&vq->free_reqs, req, link)
    .detach_device = vfu_virtio_detach_device,
};
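The registration relies on a common C idiom: a constructor function runs before main() and inserts the ops structure into a global TAILQ, so that tgt_get_pci_device_ops() can later look the device model up by name. The standalone toy program below illustrates only that idiom; the types and names are invented for illustration and are not SPDK's.

Code language: C
#include <stdio.h>
#include <string.h>
#include <sys/queue.h>

struct model_ops {
    const char *name;
    TAILQ_ENTRY(model_ops) link;
};

static TAILQ_HEAD(, model_ops) g_model_ops = TAILQ_HEAD_INITIALIZER(g_model_ops);

static struct model_ops virtio_blk_like_ops = { .name = "virtio_blk" };

/* runs before main(), just like _vfu_virtio_blk_pci_model_register above */
__attribute__((constructor))
static void
register_virtio_blk_like(void)
{
    TAILQ_INSERT_TAIL(&g_model_ops, &virtio_blk_like_ops, link);
}

/* analogous to tgt_get_pci_device_ops(): find a registered model by name */
static struct model_ops *
find_model(const char *name)
{
    struct model_ops *ops;

    TAILQ_FOREACH(ops, &g_model_ops, link) {
        if (strcmp(ops->name, name) == 0) {
            return ops;
        }
    }
    return NULL;
}

int
main(void)
{
    printf("virtio_blk registered: %s\n", find_model("virtio_blk") ? "yes" : "no");
    return 0;
}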

I/O processing flow:

Code language: C
virtio-blk OP I/O operation types:
#define VIRTIO_BLK_T_IN 0   // read: data is read from the block device into the virtio VQ buffers (device-to-driver)
#define VIRTIO_BLK_T_OUT 1  // write: data is taken from the virtio VQ buffers and written to the block device (driver-to-device)
#define VIRTIO_BLK_T_FLUSH 4
#define VIRTIO_BLK_T_GET_ID 8
#define VIRTIO_BLK_T_GET_LIFETIME 10
#define VIRTIO_BLK_T_DISCARD 11
#define VIRTIO_BLK_T_WRITE_ZEROES 13
#define VIRTIO_BLK_T_SECURE_ERASE 14
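/* For reference (virtio spec, not SPDK code): the first descriptor of every
 * virtio-blk request carries the header below, which is why the exec path
 * later does hdr = iov->iov_base and computes the byte offset as
 * hdr->sector * 512. The last descriptor of the request holds a single
 * status byte written back by the device:
 * VIRTIO_BLK_S_OK = 0, VIRTIO_BLK_S_IOERR = 1, VIRTIO_BLK_S_UNSUPP = 2. */
struct virtio_blk_outhdr {
    uint32_t type;      /* one of the VIRTIO_BLK_T_* values above */
    uint32_t ioprio;    /* priority / reserved */
    uint64_t sector;    /* offset in 512-byte sectors, independent of the bdev block size */
};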


vfu_virito_dev_process_split_ring
    virtio_endpoint->virtio_ops.exec_request(virtio_endpoint, vq, req)

vfu_virtio_blk_vring_poll
    struct vfu_virtio_dev *dev = blk_endpoint->virtio.dev
    count += vfu_virito_dev_process_packed_ring(dev, vq)
        req = vfu_virtio_dev_get_req(virtio_endpoint, vq) -> init req
        virtio_dev_packed_iovs_setup(dev, vq, vq->last_avail_idx, desc, req)
            if (virtio_vring_packed_desc_is_indirect(current_desc))
                desc_table = virtio_vring_packed_desc_to_iov(dev, current_desc, req->indirect_sg, req->indirect_iov)
                    spdk_vfu_map_one(virtio_endpoint->endpoint, desc->addr, desc->len, sg, iov, PROT_READ | PROT_WRITE)
                        vfu_addr_to_sgl(endpoint->vfu_ctx, (void *)(uintptr_t)addr, len, sg, 1, prot) -> takes a guest physical address range and fills an array of scatter/gather entries that can be individually mapped into the program's virtual memory; because of how memory is mapped, a single linear guest physical address span may have to be split into several scatter/gather regions; vfu_setup_device_dma() must have been called before using this function
                        vfu_sgl_get(endpoint->vfu_ctx, sg, iov, 1, 0)
            virtio_vring_packed_desc_to_iov(dev, desc, virtio_req_to_sg_t(req, req->iovcnt), &req->iovs[req->iovcnt])
        virtio_endpoint->virtio_ops.exec_request(virtio_endpoint, vq, req)

struct vfu_virtio_ops virtio_blk_ops = {
	.get_device_features = virtio_blk_get_supported_features,
	.alloc_req = virtio_blk_alloc_req,
        blk_req = calloc(1, sizeof(*blk_req) + dma_sg_size() * (VIRTIO_DEV_MAX_IOVS + 1))
	.free_req = virtio_blk_free_req,
	.exec_request = virtio_blk_process_req, -> execute the I/O request
        iov = &req->iovs[0]
        hdr = iov->iov_base
        case VIRTIO_BLK_T_IN -> read
            spdk_bdev_readv(blk_endpoint->bdev_desc, blk_endpoint->io_channel, &req->iovs[1], iovcnt, hdr->sector * 512, payload_len, blk_request_complete_cb, blk_req)
        case VIRTIO_BLK_T_OUT -> write
            spdk_bdev_writev(blk_endpoint->bdev_desc, blk_endpoint->io_channel, &req->iovs[1], iovcnt, hdr->sector * 512, payload_len, blk_request_complete_cb, blk_req)
        ...
        case VIRTIO_BLK_T_FLUSH
            spdk_bdev_flush(blk_endpoint->bdev_desc, blk_endpoint->io_channel, 0, flush_bytes, blk_request_complete_cb, blk_req)
	.get_config = virtio_blk_get_device_specific_config,
	.start_device = virtio_blk_start,
        blk_endpoint->io_channel = spdk_bdev_get_io_channel(blk_endpoint->bdev_desc)
        blk_endpoint->ring_poller = SPDK_POLLER_REGISTER(vfu_virtio_blk_vring_poll, blk_endpoint, 0)
	.stop_device = virtio_blk_stop,
};
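As context for virtio_dev_packed_iovs_setup above: with the packed ring negotiated via --packed-ring, descriptors live in a single ring of fixed-size entries, and an entry flagged VRING_DESC_F_INDIRECT points at an external table of further descriptors, which is the case handled by virtio_vring_packed_desc_is_indirect. The layout below is taken from the virtio 1.1 specification and is shown here only as a reminder.

Code language: C
#include <stdint.h>

/* Packed virtqueue descriptor (virtio 1.1). Driver and device walk a single
 * ring of these entries; the AVAIL/USED wrap-counter bits in flags indicate
 * who currently owns each entry. */
struct vring_packed_desc {
    uint64_t addr;   /* guest physical address of the buffer (or of an indirect table) */
    uint32_t len;    /* buffer length in bytes */
    uint16_t id;     /* buffer id, echoed back in the used element */
    uint16_t flags;  /* VRING_DESC_F_NEXT / _WRITE / _INDIRECT plus avail/used bits */
};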

libvfio-user Server and Client Source Code Analysis

Code language: C
Starting the server:
server.c -> main
    vfu_create_ctx
        vfu_ctx->tran = &tran_sock_ops
        vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_ERR_IRQ, 1)
        vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_REQ_IRQ, 1)
        vfu_ctx->tran->init(vfu_ctx)
    vfu_setup_log(vfu_ctx, _log, verbose ? LOG_DEBUG : LOG_ERR)
    vfu_pci_init(vfu_ctx, VFU_PCI_TYPE_CONVENTIONAL, PCI_HEADER_TYPE_NORMAL, 0)
        vfu_pci_config_space_t *cfg_space
        case VFU_PCI_TYPE_PCI_X_1
            size = PCI_CFG_SPACE_SIZE -> 256
        cfg_space = calloc(1, size)
    vfu_pci_set_id(vfu_ctx, 0xdead, 0xbeef, 0xcafe, 0xbabe)
    vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR0_REGION_IDX, sizeof(time_t), &bar0_access, VFU_REGION_FLAG_RW, NULL, 0, -1, 0)
        reg->cb = cb
    umask(0022)
    tmpfd = mkstemp(template)
    unlink(template)
    ftruncate(tmpfd, server_data.bar1_size)
    server_data.bar1 = mmap(NULL, server_data.bar1_size, PROT_READ | PROT_WRITE, MAP_SHARED, tmpfd, 0)
    vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR1_REGION_IDX, server_data.bar1_size, &bar1_access, VFU_REGION_FLAG_RW, bar1_mmap_areas, 2, tmpfd, 0)
        copyin_mmap_areas(reg, mmap_areas, nr_mmap_areas)
            reg_info->mmap_areas = malloc(size)
            memcpy(reg_info->mmap_areas, mmap_areas, size)
    vfu_setup_device_migration_callbacks(vfu_ctx, &migr_callbacks)
        vfu_ctx->migration = init_migration(callbacks, &ret)
            migr->pgsize = sysconf(_SC_PAGESIZE)
            migr->callbacks = *callbacks
    vfu_setup_device_reset_cb(vfu_ctx, &device_reset)
    vfu_setup_device_dma(vfu_ctx, &dma_register, &dma_unregister)
        vfu_ctx->dma = dma_controller_create(vfu_ctx, MAX_DMA_REGIONS, MAX_DMA_SIZE)
            dma = malloc(offsetof(dma_controller_t, regions) + max_regions * sizeof(dma->regions[0]))
            memset(dma->regions, 0, max_regions * sizeof(dma->regions[0]))
            dma->dirty_pgsize = 0
    vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_INTX_IRQ, 1)
    vfu_realize_ctx(vfu_ctx)
        vfu_ctx->pci.config_space = calloc(1, cfg_reg->size)
    vfu_attach_ctx(vfu_ctx) -> vfu_ctx->tran->attach(vfu_ctx)
    do
        vfu_run_ctx(vfu_ctx)
            do
                err = get_request(vfu_ctx, &msg)
                    should_exec_command(vfu_ctx, msg->hdr.cmd)
                handle_request(vfu_ctx, msg)
                    handle_region_access(vfu_ctx, msg)
                    region_access
                        ret = pci_config_space_access(vfu_ctx, buf, count, offset, is_write)
                    do_reply
        if (ret == -1 && errno == EINTR)
            ret = vfu_irq_trigger(vfu_ctx, 0)
                eventfd_write(vfu_ctx->irqs->efds[subindex], val)
            do_dma_io(vfu_ctx, &server_data, 1, false)
                sg = alloca(dma_sg_size())
                vfu_addr_to_sgl
                    dma_addr_to_sgl(vfu_ctx->dma, dma_addr, len, sgl, max_nr_sgs, prot) -> takes a linear DMA address span and returns an sg list suitable for DMA; because of how memory is mapped, a single linear DMA address span may have to be split into multiple scatter/gather regions
                        dma_init_sg
                            sg->dma_addr = region->info.iova.iov_base
                            sg->offset = dma_addr - region->info.iova.iov_base
                        _dma_addr_sg_split
                            dma_init_sg
                vfu_sgl_write
                    vfu_dma_transfer
                        memcpy(rbuf + sizeof(*dma_req), data + count, dma_req->count)
                        ret = vfu_ctx->tran->send_msg(vfu_ctx, msg_id++, VFIO_USER_DMA_WRITE, rbuf,
                vfu_sgl_mark_dirty(vfu_ctx, sg, 1)
                    dma_sgl_mark_dirty(vfu_ctx->dma, sgl, cnt)
                vfu_sgl_put(vfu_ctx, sg, &iov, 1)
                crc1 = rte_hash_crc(buf, sizeof(buf), 0)
            do_dma_io(vfu_ctx, &server_data, 0, true)
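/* Rough sketch of the region access callback that handle_region_access()
 * above ends up invoking. The signature follows libvfio-user's
 * vfu_region_access_cb_t; the body is illustrative only, loosely modelled on
 * the sample server's bar0_access (BAR0 exposes a time_t the client can
 * read and write). */
static ssize_t
bar0_access_sketch(vfu_ctx_t *vfu_ctx, char *buf, size_t count, loff_t offset,
                   bool is_write)
{
    struct server_data *sd = vfu_get_private(vfu_ctx);

    if (offset + count > sizeof(sd->bar0)) {
        return -1;                                         /* out-of-range access */
    }
    if (is_write) {
        memcpy((char *)&sd->bar0 + offset, buf, count);    /* client wrote BAR0 */
    } else {
        memcpy(buf, (char *)&sd->bar0 + offset, count);    /* client read BAR0 */
    }
    return count;                                          /* bytes handled */
}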
​
​
​
Migration callbacks:
const vfu_migration_callbacks_t migr_callbacks = {
    .version = VFU_MIGR_CALLBACKS_VERS,
    .transition = &migration_device_state_transition,
    .read_data = &migration_read_data,
    .write_data = &migration_write_data
        memcpy(server_data->bar1 + write_start, buf, length_in_bar1)
        memcpy((char *)&server_data->bar0 + write_start, buf + length_in_bar1, length_in_bar0)
};

Socket transport function ops:
struct transport_ops tran_sock_ops = {
    .init = tran_sock_init,
        ts->listen_fd = socket(AF_UNIX, SOCK_STREAM, 0) -> bind -> listen
    .get_poll_fd = tran_sock_get_poll_fd,
    .attach = tran_sock_attach,
        ts->conn_fd = accept(ts->listen_fd, NULL, NULL)
        tran_negotiate(vfu_ctx, &ts->client_cmd_socket_fd)
            recv_version(vfu_ctx, &msg_id, &client_version, &twin_socket_supported)
                vfu_ctx->tran->recv_msg(vfu_ctx, &msg)
    .get_request_header = tran_sock_get_request_header,
    .recv_body = tran_sock_recv_body,
    .reply = tran_sock_reply,
    .recv_msg = tran_sock_recv_msg,
        tran_sock_recv_alloc(ts->conn_fd, &msg->hdr, false, NULL, &msg->in.iov.iov_base, &msg->in.iov.iov_len)
            tran_sock_recv(sock, hdr, is_reply, msg_id, NULL, NULL) -> tran_sock_recv_fds
                get_msg(hdr, sizeof(*hdr), fds, nr_fds, sock, 0)
                    recvmsg(sock_fd, &msg, sock_flags)
            recv(sock, data, len, MSG_WAITALL)
    .send_msg = tran_sock_send_msg,
        maybe_print_cmd_collision_warning
        tran_sock_msg(fd, msg_id, cmd, send_data, send_len, hdr, recv_data, recv_len) -> tran_sock_msg_fds
            tran_sock_msg_iovec
    .detach = tran_sock_detach,
    .fini = tran_sock_fini
};

Client startup:
client.c -> main
    void *dirty_pages = malloc(dirty_pages_size) -> 24B
    dirty_pages_control = (void *)(dirty_pages_feature + 1)
    sock = init_sock(argv[optind]) -> sock -> connect
    negotiate(sock, &server_max_fds, &server_max_data_xfer_size, &pgsize)
        send_version(sock)
        recv_version(sock, server_max_fds, server_max_data_xfer_size, pgsize)
    ret = access_region(sock, 0xdeadbeef, false, 0, &ret, sizeof(ret))
        op = VFIO_USER_REGION_READ
        tran_sock_msg_iovec -> tran_sock_send_iovec
    get_device_info(sock, &client_dev_info)
    get_device_regions_info(sock, &client_dev_info)
        get_device_region_info
            do_get_device_region_info
    send_device_reset(sock)
    map_dma_regions(sock, dma_regions, nr_dma_regions)
        tran_sock_msg_iovec(sock, 0x1234 + i, VFIO_USER_DMA_MAP,
    irq_fd = configure_irqs(sock) -> VFIO_USER_DEVICE_GET_IRQ_INFO -> VFIO_USER_DEVICE_SET_IRQS
    access_bar0(sock, &t)
    wait_for_irq(irq_fd) -> read(irq_fd, &val, sizeof(val)
    handle_dma_io(sock, dma_regions, nr_dma_regions)
        handle_dma_write
            c = pwrite(dma_regions[i].fd, data, dma_access.count, offset)
            tran_sock_send(sock, msg_id, true, VFIO_USER_DMA_WRITE, &dma_access, sizeof(dma_access))
        handle_dma_read
    get_dirty_bitmap
    nr_iters = migrate_from
    migrate_to
        fork()
        ret = execvp(_argv[0], _argv)
        sock = init_sock(sock_path) -> reconnect to new server
        set_migration_state(sock, device_state)
        write_migr_data -> tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_MIG_DATA_WRITE,
        dst_crc = rte_hash_crc(buf, bar1_size, 0)
            init_val = rte_hash_crc_8byte(*(const uint64_t *)pd, init_val) -> crc32c_2words(data, init_val)
                term1 = CRC32_UPD(crc, 7) -> static const uint32_t crc32c_tables[8][256]
    map_dma_regions
    configure_irqs
    wait_for_irq
    handle_dma_io

DMA map handling:
handle_dma_map
    dma_controller_add_region -> MOCK_DEFINE(dma_controller_add_region)
        region->info.iova.iov_base = (void *)dma_addr
        dma_map_region(dma, region)
            mmap_len = ROUND_UP(region->info.iova.iov_len, region->info.page_size) -> 4096
            mmap_base = mmap(NULL, mmap_len, region->info.prot, MAP_SHARED, region->fd, offset)
            madvise(mmap_base, mmap_len, MADV_DONTDUMP)
            region->info.vaddr = mmap_base   (region->offset - offset)
    vfu_ctx->dma_register(vfu_ctx, &vfu_ctx->dma->regions[ret].info) -> dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
        struct server_data *server_data = vfu_get_private(vfu_ctx)
        server_data->regions[idx].iova = info->iova
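/* For reference: the register/unregister callbacks passed to
 * vfu_setup_device_dma() receive a vfu_dma_info_t describing the guest
 * memory region that was just mapped (or is about to be removed). The field
 * list below is an approximation of libvfio-user's public definition, and the
 * type name is suffixed to stress that it is illustrative. */
typedef struct {
    struct iovec iova;      /* guest IOVA range of the region */
    void        *vaddr;     /* where the region is mmap()ed in this process (may be NULL) */
    struct iovec mapping;   /* the local mapping backing the region */
    size_t       page_size; /* page size used e.g. for dirty-page tracking */
    uint32_t     prot;      /* PROT_READ / PROT_WRITE */
} vfu_dma_info_t_sketch;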
                                               
Live migration state set:
/* Analogous to enum vfio_device_mig_state */
enum vfio_user_device_mig_state {
    VFIO_USER_DEVICE_STATE_ERROR = 0,
    VFIO_USER_DEVICE_STATE_STOP = 1,
    VFIO_USER_DEVICE_STATE_RUNNING = 2,
    VFIO_USER_DEVICE_STATE_STOP_COPY = 3,
    VFIO_USER_DEVICE_STATE_RESUMING = 4,
    VFIO_USER_DEVICE_STATE_RUNNING_P2P = 5,
    VFIO_USER_DEVICE_STATE_PRE_COPY = 6,
    VFIO_USER_DEVICE_STATE_PRE_COPY_P2P = 7,
    VFIO_USER_DEVICE_NUM_STATES = 8,
};

Server Initialization

DMA Flow

Accessing BAR Space

Client Initialization

Reading the Configuration Space

Interrupt Flow

References

SPDK IPU OFFLOAD NVME: https://www.sniadeveloper.org/sites/default/files/SDC/2022/pdfs/SNIA-SDC22-Harris-SPDK-and-Infrastructure-Offload.pdf

libvfio-user git repo: https://github.com/nutanix/libvfio-user.git

晓兵 (ssbandjl)

Blog: https://cloud.tencent.com/developer/user/5060293/articles | https://logread.cn | https://blog.csdn.net/ssbandjl | https://www.zhihu.com/people/ssbandjl/posts

https://chattoyou.cn (feedback/comments)

DPU column: https://cloud.tencent.com/developer/column/101987

Networking: anyone interested in DPU/SmartNIC, offload, networking, storage acceleration, or security isolation technologies is welcome to join the DPU technical discussion group.