Terminology
IPU: Infrastructure Processing Unit, a hardware card for infrastructure processing; e.g., storage processing can be offloaded to the IPU.
Introduction
The reference architecture for offloading the NVMe target to an IPU:
The SPDK block device can be a local NVMe SSD or a remote backend (NVMe-oF, iSCSI, Ceph, DAOS, etc.):
Timeline of NVMe target development in SPDK:
- 2016: NVMe-oF target support (RDMA only)
- 2020: vfio-user transport support (software-emulated NVMe disk)
SPDK + IPU + K8s integration:
VFIO-USER Overview
vfio-user is a protocol that allows a device to be emulated in a separate process outside the virtual machine monitor (VMM). The vfio-user specification is largely based on the Linux VFIO ioctl interface, reimplemented as messages sent over a UNIX domain socket. vfio-user has two parts:
- The vfio-user client, which runs in the VMM or application.
- The vfio-user server, which emulates the device in a separate process.
Background
Presenting paravirtualized devices to a virtual machine has historically required a dedicated driver in the guest operating system; the most common examples are virtio-blk and virtio-scsi. These devices can be backed either by the host operating system (e.g., KVM can emulate virtio-blk and virtio-scsi devices for guests) or by a separate userspace process (the vhost-user protocol connects to such targets, typically provided by SPDK). However, only Linux currently ships with built-in virtio-blk and virtio-scsi drivers: the BIOS usually has no driver available, so it cannot boot from these devices, and operating systems such as Windows require a separate driver installation. This talk introduces a new standardized protocol, called vfio-user, that virtual machines can use to communicate with a separate process capable of emulating any PCI device; QEMU will support it in an upcoming release. It then covers how SPDK uses this new protocol to present paravirtualized NVMe devices to guests, allowing the guest BIOS and guest OS to use these disks with their existing, unmodified NVMe drivers. The talk closes with benchmarks demonstrating the extremely low overhead of this form of virtualization.
vfio-user: A New Virtualization Protocol, May 4, 2021, by Ben Walker

We are excited to announce support for NVMe over vfio-user, a technology that allows SPDK to present fully emulated NVMe devices into virtual machines. The VM can use its existing NVMe driver to communicate with the device, and data is transferred to and from SPDK efficiently over shared memory. In other words, it is just like vhost-user, but emulating an NVMe device instead of a virtio-blk or virtio-scsi device.

The QEMU community is currently standardizing and adding support for vfio-user. The draft specification has buy-in from all the relevant parties and is maturing quickly. The protocol itself can emulate any physical device, not just NVMe, but so far NVMe emulation with SPDK is the first and primary consumer of the new interface. Nutanix has released a convenience library that helps implement the server side of the protocol, which SPDK leverages in its implementation.

NVMe device emulation is implemented using SPDK's existing NVMe-oF target, treating vfio-user as a shared-memory "transport" alongside its TCP, RDMA, PCIe, and Fibre Channel support. The NVMe-oF specification already requires extensive emulation of an NVMe device, so nearly all of the logic needed to emulate a physical NVMe SSD was already present; it really was as simple as adding a new transport plugin for vfio-user. While this transport is not formally part of the NVMe-oF specification, the transport plugin system is designed to allow custom transports. Transport plugins can even live in shared libraries outside the SPDK repository and be discovered at runtime, although the vfio-user transport is built into SPDK itself.

We are excited about this advance in virtualization technology. By emulating a physical NVMe device, any operating system with an NVMe driver (i.e., all of them) can talk to the device. This is especially significant for Windows: no more loading virtio drivers! It also allows VMs to boot from the emulated NVMe device, since the UEFI BIOS includes an NVMe driver.

Longer term, we fully believe vfio-user will win market share not only in storage but also in networking and other areas. We expect its performance to be excellent, at least as good as virtio-blk and possibly better, and we will be very happy to share benchmarks once they are available.
libvfio-user overall architecture
SPDK vfio-user QEMU test architecture
Components
Client components:
Server components:
PCI layout of the emulated virtio device
/* virtio device layout:
 *
 * region 1: MSI-X Table
 * region 2: MSI-X PBA
 * region 4: virtio modern memory 64bits BAR
 *     Common configuration          0x0    - 0x1000
 *     ISR access                    0x1000 - 0x2000
 *     Device specific configuration 0x2000 - 0x3000
 *     Notifications                 0x3000 - 0x4000
 */
#define VIRTIO_PCI_COMMON_CFG_OFFSET    (0x0)
#define VIRTIO_PCI_COMMON_CFG_LENGTH    (0x1000)
#define VIRTIO_PCI_ISR_ACCESS_OFFSET    (VIRTIO_PCI_COMMON_CFG_OFFSET + VIRTIO_PCI_COMMON_CFG_LENGTH)
#define VIRTIO_PCI_ISR_ACCESS_LENGTH    (0x1000)
#define VIRTIO_PCI_SPECIFIC_CFG_OFFSET  (VIRTIO_PCI_ISR_ACCESS_OFFSET + VIRTIO_PCI_ISR_ACCESS_LENGTH)
#define VIRTIO_PCI_SPECIFIC_CFG_LENGTH  (0x1000)
#define VIRTIO_PCI_NOTIFICATIONS_OFFSET (VIRTIO_PCI_SPECIFIC_CFG_OFFSET + VIRTIO_PCI_SPECIFIC_CFG_LENGTH)
#define VIRTIO_PCI_NOTIFICATIONS_LENGTH (0x1000)
#define VIRTIO_PCI_BAR4_LENGTH          (VIRTIO_PCI_NOTIFICATIONS_OFFSET + VIRTIO_PCI_NOTIFICATIONS_LENGTH)
Source Code Analysis
On the SPDK target side, create the emulated device and a userspace vfio endpoint, then start the VM in QEMU with the vfio-user-pci device type:
1. build/bin/spdk_tgt
2. scripts/rpc.py bdev_malloc_create -b malloc0 $((512)) 512
3. scripts/rpc.py vfu_virtio_create_blk_endpoint vfu.0 --bdev-name malloc0
--cpumask=0x1 --num-queues=2
--qsize=256 --packed-ring
4. Start QEMU with '-device vfio-user-pci,socket=/spdk/vfu.0'
VFU block device endpoint creation flow
RPC execution:
SPDK_RPC_REGISTER("vfu_virtio_create_blk_endpoint", rpc_vfu_virtio_create_blk_endpoint, SPDK_RPC_RUNTIME)
rpc_vfu_virtio_create_blk_endpoint
spdk_vfu_create_endpoint(req.name, req.cpumask, "virtio_blk")
vfu_parse_core_mask
spdk_vfu_get_endpoint_by_name(endpoint_name)
TAILQ_FOREACH_SAFE(endpoint, &g_endpoint, link, tmp)
ops = tgt_get_pci_device_ops(dev_type_name)
TAILQ_FOREACH_SAFE(pci_ops, &g_pci_device_ops, link, tmp) -> look up in the global ops table
basename = tgt_get_base_path() -> g_endpoint_path_dirname
endpoint->endpoint_ctx = ops->init(endpoint, basename, endpoint_name) -> vfu_virtio_blk_endpoint_init
tgt_endpoint_realize(endpoint)
ret = endpoint->ops.get_device_info(endpoint, &pci_dev) -> vfu_virtio_blk_get_device_info
vfu_virtio_get_device_info(&blk_endpoint->virtio, device_info)
memcpy(device_info, &vfu_virtio_device_info, sizeof(*device_info))
device_info->regions[VFU_PCI_DEV_BAR4_REGION_IDX].fd = virtio_endpoint->devmem_fd
device_info->id.did = PCI_DEVICE_ID_VIRTIO_BLK_MODERN
endpoint->vfu_ctx = vfu_create_ctx
vfu_ctx = calloc(1, sizeof(vfu_ctx_t))
vfu_ctx->tran = &tran_sock_ops
vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_ERR_IRQ, 1) -> vfu_ctx->irq_count[type] = count
vfu_ctx->tran->init(vfu_ctx) -> tran_sock_init
vfu_pci_init(endpoint->vfu_ctx, VFU_PCI_TYPE_EXPRESS, PCI_HEADER_TYPE_NORMAL, 0)
cfg_space = calloc(1, size)
vfu_ctx->pci.config_space = cfg_space
vfu_ctx->reg_info[VFU_PCI_DEV_CFG_REGION_IDX].size = size
vfu_pci_set_id
vfu_pci_set_class
cap_size = endpoint->ops.get_vendor_capability(endpoint, buf, 256, vendor_cap_idx)
cap_offset = vfu_pci_add_capability(endpoint->vfu_ctx, 0, 0, vendor_cap)
vfu_setup_region(endpoint->vfu_ctx, region_idx, region->len, region->access_cb, region->flags, region->nr_sparse_mmaps ? sparse_mmap : NULL, region->nr_sparse_mmaps, region->fd, region->offset)
copyin_mmap_areas(reg, mmap_areas, nr_mmap_areas)
memcpy(reg_info->mmap_areas, mmap_areas, size)
vfu_setup_device_dma(endpoint->vfu_ctx, tgt_memory_region_add_cb, tgt_memory_region_remove_cb)
vfu_ctx->dma = dma_controller_create(vfu_ctx, MAX_DMA_REGIONS, MAX_DMA_SIZE)
dma = malloc(offsetof(dma_controller_t, regions) + max_regions * sizeof(dma->regions[0]))
vfu_setup_device_reset_cb(endpoint->vfu_ctx, tgt_device_reset_cb)
vfu_setup_device_quiesce_cb(endpoint->vfu_ctx, tgt_device_quiesce_cb)
vfu_setup_device_nr_irqs
vfu_realize_ctx
endpoint->pci_config_space = vfu_pci_get_config_space(endpoint->vfu_ctx)
init_pci_config_space
p->hdr.bars[0].raw = 0x0
p->hdr.intr.ipin = ipin
...
endpoint->thread = spdk_thread_create(endpoint_name, &cpumask)
spdk_thread_send_msg(endpoint->thread, tgt_endpoint_start_thread, endpoint)
endpoint->accept_poller = SPDK_POLLER_REGISTER(tgt_accept_poller, endpoint, 1000) -> the server accepts incoming vfio-user protocol connections
vfu_attach_ctx -> tran_sock_attach
ts->conn_fd = accept(ts->listen_fd, NULL, NULL)
ret = tran_negotiate(vfu_ctx, &ts->client_cmd_socket_fd) -> switch ctx
endpoint->vfu_ctx_poller = SPDK_POLLER_REGISTER(tgt_vfu_ctx_poller, endpoint, 1000)
vfu_run_ctx(vfu_ctx)
do
err = get_request(vfu_ctx, &msg)
handle_request(vfu_ctx, msg)
switch (msg->hdr.cmd) -> dispatch on the protocol command type
case VFIO_USER_DMA_MAP
handle_dma_map(vfu_ctx, msg, msg->in.iov.iov_base)
...
endpoint->ops.attach_device(endpoint)
vfu_virtio_blk_add_bdev(req.name, req.bdev_name, req.num_queues, req.qsize, req.packed_ring)
spdk_bdev_open_ext(bdev_name, true, bdev_event_cb, blk_endpoint, &blk_endpoint->bdev_desc) -> open and set bdev_desc -> associate the bdev
virtio_blk_update_config(&blk_endpoint->blk_cfg, blk_endpoint->bdev, blk_endpoint->virtio.num_queues)
blk_cfg->blk_size = spdk_bdev_get_block_size(bdev)
Linking vfu_virtio_blk_ops into the global ops table g_pci_device_ops
__attribute__((constructor)) _vfu_virtio_blk_pci_model_register
spdk_vfu_register_endpoint_ops(&vfu_virtio_blk_ops)
TAILQ_INSERT_TAIL(&g_pci_device_ops, pci_ops, link)
struct spdk_vfu_endpoint_ops vfu_virtio_blk_ops = {
.name = "virtio_blk",
.init = vfu_virtio_blk_endpoint_init,
vfu_virtio_endpoint_setup(&blk_endpoint->virtio, endpoint, basename, endpoint_name, &virtio_blk_ops)
snprintf(path, PATH_MAX, "%s%s_bar4", basename, endpoint_name)
open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)
ftruncate(virtio_endpoint->devmem_fd, VIRTIO_PCI_BAR4_LENGTH)
virtio_endpoint->doorbells = mmap(NULL, VIRTIO_PCI_NOTIFICATIONS_LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, virtio_endpoint->devmem_fd, VIRTIO_PCI_NOTIFICATIONS_OFFSET)
virtio_endpoint->virtio_ops = *ops
virtio_endpoint->num_queues = VIRTIO_DEV_MAX_VQS -> 64
virtio_endpoint->qsize = VIRTIO_VQ_DEFAULT_SIZE -> 128
.get_device_info = vfu_virtio_blk_get_device_info,
.get_vendor_capability = vfu_virtio_get_vendor_capability,
.post_memory_add = vfu_virtio_post_memory_add,
virtio_dev_map_vq(dev, &dev->vqs[i])
phys_addr = ((((uint64_t)vq->desc_hi) << 32) | vq->desc_lo)
vfu_virtio_map_q(dev, &vq->desc, phys_addr, len)
addr = spdk_vfu_map_one(virtio_endpoint->endpoint, phys_addr, len, mapping->sg, &mapping->iov, PROT_READ | PROT_WRITE)
vfu_virtio_map_q(dev, &vq->avail, phys_addr, len)
vfu_virtio_map_q(dev, &vq->used, phys_addr, len)
.pre_memory_remove = vfu_virtio_pre_memory_remove,
.reset_device = vfu_virtio_pci_reset_cb,
.quiesce_device = vfu_virtio_quiesce_cb,
.destruct = vfu_virtio_blk_endpoint_destruct,
.attach_device = vfu_virtio_attach_device,
for (j = 0; j <= vq->qsize; j++)
req = vfu_virtio_vq_alloc_req(virtio_endpoint, vq)
endpoint->virtio_ops.alloc_req(endpoint, vq)
req->indirect_sg = virtio_req_to_sg_t(req, VIRTIO_DEV_MAX_IOVS)
req->dev = dev
STAILQ_INSERT_TAIL(&vq->free_reqs, req, link)
.detach_device = vfu_virtio_detach_device,
};
I/O processing flow:
virtio-blk request operation types:
#define VIRTIO_BLK_T_IN 0 // read: moves data from the block device into the virtio VQ
#define VIRTIO_BLK_T_OUT 1 // write: takes data out of the virtio VQ and writes it to the block device
#define VIRTIO_BLK_T_FLUSH 4
#define VIRTIO_BLK_T_GET_ID 8
#define VIRTIO_BLK_T_GET_LIFETIME 10
#define VIRTIO_BLK_T_DISCARD 11
#define VIRTIO_BLK_T_WRITE_ZEROES 13
#define VIRTIO_BLK_T_SECURE_ERASE 14
vfu_virito_dev_process_split_ring
virtio_endpoint->virtio_ops.exec_request(virtio_endpoint, vq, req)
vfu_virtio_blk_vring_poll
struct vfu_virtio_dev *dev = blk_endpoint->virtio.dev
count = vfu_virito_dev_process_packed_ring(dev, vq)
req = vfu_virtio_dev_get_req(virtio_endpoint, vq) -> init req
virtio_dev_packed_iovs_setup(dev, vq, vq->last_avail_idx, desc, req)
if (virtio_vring_packed_desc_is_indirect(current_desc))
desc_table = virtio_vring_packed_desc_to_iov(dev, current_desc, req->indirect_sg, req->indirect_iov)
spdk_vfu_map_one(virtio_endpoint->endpoint, desc->addr, desc->len, sg, iov, PROT_READ | PROT_WRITE)
vfu_addr_to_sgl(endpoint->vfu_ctx, (void *)(uintptr_t)addr, len, sg, 1, prot) -> takes a guest physical address range and fills an array of scatter/gather entries that can be individually mapped into the process's virtual memory; a single linear guest physical span may have to be split into multiple scatter/gather regions because of how the memory is mapped. vfu_setup_device_dma() must have been called before using this function
vfu_sgl_get(endpoint->vfu_ctx, sg, iov, 1, 0)
virtio_vring_packed_desc_to_iov(dev, desc, virtio_req_to_sg_t(req, req->iovcnt), &req->iovs[req->iovcnt])
virtio_endpoint->virtio_ops.exec_request(virtio_endpoint, vq, req)
struct vfu_virtio_ops virtio_blk_ops = {
.get_device_features = virtio_blk_get_supported_features,
.alloc_req = virtio_blk_alloc_req,
blk_req = calloc(1, sizeof(*blk_req) + dma_sg_size() * (VIRTIO_DEV_MAX_IOVS + 1))
.free_req = virtio_blk_free_req,
.exec_request = virtio_blk_process_req, -> execute the I/O request
iov = &req->iovs[0]
hdr = iov->iov_base
case VIRTIO_BLK_T_IN -> read
spdk_bdev_readv(blk_endpoint->bdev_desc, blk_endpoint->io_channel, &req->iovs[1], iovcnt, hdr->sector * 512, payload_len, blk_request_complete_cb, blk_req)
case VIRTIO_BLK_T_OUT -> write
spdk_bdev_writev(blk_endpoint->bdev_desc, blk_endpoint->io_channel, &req->iovs[1], iovcnt, hdr->sector * 512, payload_len, blk_request_complete_cb, blk_req)
...
case VIRTIO_BLK_T_FLUSH
spdk_bdev_flush(blk_endpoint->bdev_desc, blk_endpoint->io_channel, 0, flush_bytes, blk_request_complete_cb, blk_req)
.get_config = virtio_blk_get_device_specific_config,
.start_device = virtio_blk_start,
blk_endpoint->io_channel = spdk_bdev_get_io_channel(blk_endpoint->bdev_desc)
blk_endpoint->ring_poller = SPDK_POLLER_REGISTER(vfu_virtio_blk_vring_poll, blk_endpoint, 0)
.stop_device = virtio_blk_stop,
};
libvfio-user server and client source code analysis
Starting the server:
server.c -> main
vfu_create_ctx
vfu_ctx->tran = &tran_sock_ops
vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_ERR_IRQ, 1)
vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_REQ_IRQ, 1)
vfu_ctx->tran->init(vfu_ctx)
vfu_setup_log(vfu_ctx, _log, verbose ? LOG_DEBUG : LOG_ERR)
vfu_pci_init(vfu_ctx, VFU_PCI_TYPE_CONVENTIONAL, PCI_HEADER_TYPE_NORMAL, 0)
vfu_pci_config_space_t *cfg_space
case VFU_PCI_TYPE_PCI_X_1
size = PCI_CFG_SPACE_SIZE -> 256
cfg_space = calloc(1, size)
vfu_pci_set_id(vfu_ctx, 0xdead, 0xbeef, 0xcafe, 0xbabe)
vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR0_REGION_IDX, sizeof(time_t), &bar0_access, VFU_REGION_FLAG_RW, NULL, 0, -1, 0)
reg->cb = cb
umask(0022)
tmpfd = mkstemp(template)
unlink(template)
ftruncate(tmpfd, server_data.bar1_size)
server_data.bar1 = mmap(NULL, server_data.bar1_size, PROT_READ | PROT_WRITE, MAP_SHARED, tmpfd, 0)
vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR1_REGION_IDX, server_data.bar1_size, &bar1_access, VFU_REGION_FLAG_RW, bar1_mmap_areas, 2, tmpfd, 0)
copyin_mmap_areas(reg, mmap_areas, nr_mmap_areas)
reg_info->mmap_areas = malloc(size)
memcpy(reg_info->mmap_areas, mmap_areas, size)
vfu_setup_device_migration_callbacks(vfu_ctx, &migr_callbacks)
vfu_ctx->migration = init_migration(callbacks, &ret)
migr->pgsize = sysconf(_SC_PAGESIZE)
migr->callbacks = *callbacks
vfu_setup_device_reset_cb(vfu_ctx, &device_reset)
vfu_setup_device_dma(vfu_ctx, &dma_register, &dma_unregister)
vfu_ctx->dma = dma_controller_create(vfu_ctx, MAX_DMA_REGIONS, MAX_DMA_SIZE)
dma = malloc(offsetof(dma_controller_t, regions) + max_regions * sizeof(dma->regions[0]))
memset(dma->regions, 0, max_regions * sizeof(dma->regions[0]))
dma->dirty_pgsize = 0
vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_INTX_IRQ, 1)
vfu_realize_ctx(vfu_ctx)
vfu_ctx->pci.config_space = calloc(1, cfg_reg->size)
vfu_attach_ctx(vfu_ctx) -> vfu_ctx->tran->attach(vfu_ctx)
do
vfu_run_ctx(vfu_ctx)
do
err = get_request(vfu_ctx, &msg)
should_exec_command(vfu_ctx, msg->hdr.cmd)
handle_request(vfu_ctx, msg)
handle_region_access(vfu_ctx, msg)
region_access
ret = pci_config_space_access(vfu_ctx, buf, count, offset, is_write)
do_reply
if (ret == -1 && errno == EINTR)
ret = vfu_irq_trigger(vfu_ctx, 0)
eventfd_write(vfu_ctx->irqs->efds[subindex], val)
do_dma_io(vfu_ctx, &server_data, 1, false)
sg = alloca(dma_sg_size())
vfu_addr_to_sgl
dma_addr_to_sgl(vfu_ctx->dma, dma_addr, len, sgl, max_nr_sgs, prot) -> takes a linear DMA address span and returns a DMA-ready scatter/gather list; a single linear span may have to be split into multiple scatter/gather regions because of how the memory is mapped
dma_init_sg
sg->dma_addr = region->info.iova.iov_base
sg->offset = dma_addr - region->info.iova.iov_base
_dma_addr_sg_split
dma_init_sg
vfu_sgl_write
vfu_dma_transfer
memcpy(rbuf + sizeof(*dma_req), data + count, dma_req->count)
ret = vfu_ctx->tran->send_msg(vfu_ctx, msg_id , VFIO_USER_DMA_WRITE, rbuf,
vfu_sgl_mark_dirty(vfu_ctx, sg, 1)
dma_sgl_mark_dirty(vfu_ctx->dma, sgl, cnt)
vfu_sgl_put(vfu_ctx, sg, &iov, 1)
crc1 = rte_hash_crc(buf, sizeof(buf), 0)
do_dma_io(vfu_ctx, &server_data, 0, true)
Migration callbacks:
const vfu_migration_callbacks_t migr_callbacks = {
.version = VFU_MIGR_CALLBACKS_VERS,
.transition = &migration_device_state_transition,
.read_data = &migration_read_data,
.write_data = &migration_write_data
memcpy(server_data->bar1 + write_start, buf, length_in_bar1)
memcpy((char *)&server_data->bar0 + write_start, buf + length_in_bar1, length_in_bar0)
};
Socket transport ops:
struct transport_ops tran_sock_ops = {
.init = tran_sock_init,
ts->listen_fd = socket(AF_UNIX, SOCK_STREAM, 0) -> bind -> listen
.get_poll_fd = tran_sock_get_poll_fd,
.attach = tran_sock_attach,
ts->conn_fd = accept(ts->listen_fd, NULL, NULL)
tran_negotiate(vfu_ctx, &ts->client_cmd_socket_fd)
recv_version(vfu_ctx, &msg_id, &client_version, &twin_socket_supported)
vfu_ctx->tran->recv_msg(vfu_ctx, &msg)
.get_request_header = tran_sock_get_request_header,
.recv_body = tran_sock_recv_body,
.reply = tran_sock_reply,
.recv_msg = tran_sock_recv_msg,
tran_sock_recv_alloc(ts->conn_fd, &msg->hdr, false, NULL, &msg->in.iov.iov_base, &msg->in.iov.iov_len)
tran_sock_recv(sock, hdr, is_reply, msg_id, NULL, NULL) -> tran_sock_recv_fds
get_msg(hdr, sizeof(*hdr), fds, nr_fds, sock, 0)
recvmsg(sock_fd, &msg, sock_flags)
recv(sock, data, len, MSG_WAITALL)
.send_msg = tran_sock_send_msg,
maybe_print_cmd_collision_warning
tran_sock_msg(fd, msg_id, cmd, send_data, send_len, hdr, recv_data, recv_len) -> tran_sock_msg_fds
tran_sock_msg_iovec
.detach = tran_sock_detach,
.fini = tran_sock_fini
};
Starting the client:
client.c -> main
void *dirty_pages = malloc(dirty_pages_size) -> 24B
dirty_pages_control = (void *)(dirty_pages_feature + 1)
sock = init_sock(argv[optind]) -> sock -> connect
negotiate(sock, &server_max_fds, &server_max_data_xfer_size, &pgsize)
send_version(sock)
recv_version(sock, server_max_fds, server_max_data_xfer_size, pgsize)
ret = access_region(sock, 0xdeadbeef, false, 0, &ret, sizeof(ret))
op = VFIO_USER_REGION_READ
tran_sock_msg_iovec -> tran_sock_send_iovec
get_device_info(sock, &client_dev_info)
get_device_regions_info(sock, &client_dev_info)
get_device_region_info
do_get_device_region_info
send_device_reset(sock)
map_dma_regions(sock, dma_regions, nr_dma_regions)
tran_sock_msg_iovec(sock, 0x1234 + i, VFIO_USER_DMA_MAP,
irq_fd = configure_irqs(sock) -> VFIO_USER_DEVICE_GET_IRQ_INFO -> VFIO_USER_DEVICE_SET_IRQS
access_bar0(sock, &t)
wait_for_irq(irq_fd) -> read(irq_fd, &val, sizeof(val))
handle_dma_io(sock, dma_regions, nr_dma_regions)
handle_dma_write
c = pwrite(dma_regions[i].fd, data, dma_access.count, offset)
tran_sock_send(sock, msg_id, true, VFIO_USER_DMA_WRITE, &dma_access, sizeof(dma_access))
handle_dma_read
get_dirty_bitmap
nr_iters = migrate_from
migrate_to
fork()
ret = execvp(_argv[0], _argv)
sock = init_sock(sock_path) -> reconnect to new server
set_migration_state(sock, device_state)
write_migr_data -> tran_sock_msg_iovec(sock, msg_id--, VFIO_USER_MIG_DATA_WRITE,
dst_crc = rte_hash_crc(buf, bar1_size, 0)
init_val = rte_hash_crc_8byte(*(const uint64_t *)pd, init_val) -> crc32c_2words(data, init_val)
term1 = CRC32_UPD(crc, 7) -> static const uint32_t crc32c_tables[8][256]
map_dma_regions
configure_irqs
wait_for_irq
handle_dma_io
DMA map handling:
handle_dma_map
dma_controller_add_region -> MOCK_DEFINE(dma_controller_add_region)
region->info.iova.iov_base = (void *)dma_addr
dma_map_region(dma, region)
mmap_len = ROUND_UP(region->info.iova.iov_len, region->info.page_size) -> 4096
mmap_base = mmap(NULL, mmap_len, region->info.prot, MAP_SHARED, region->fd, offset)
madvise(mmap_base, mmap_len, MADV_DONTDUMP)
region->info.vaddr = mmap_base + (region->offset - offset)
vfu_ctx->dma_register(vfu_ctx, &vfu_ctx->dma->regions[ret].info) -> dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
struct server_data *server_data = vfu_get_private(vfu_ctx)
server_data->regions[idx].iova = info->iova
Live migration device states:
/* Analogous to enum vfio_device_mig_state */
enum vfio_user_device_mig_state {
VFIO_USER_DEVICE_STATE_ERROR = 0,
VFIO_USER_DEVICE_STATE_STOP = 1,
VFIO_USER_DEVICE_STATE_RUNNING = 2,
VFIO_USER_DEVICE_STATE_STOP_COPY = 3,
VFIO_USER_DEVICE_STATE_RESUMING = 4,
VFIO_USER_DEVICE_STATE_RUNNING_P2P = 5,
VFIO_USER_DEVICE_STATE_PRE_COPY = 6,
VFIO_USER_DEVICE_STATE_PRE_COPY_P2P = 7,
VFIO_USER_DEVICE_NUM_STATES = 8,
};
Server initialization
DMA flow
BAR space access
Client initialization
Config space read
Interrupt flow
References
SPDK IPU OFFLOAD NVME: https://www.sniadeveloper.org/sites/default/files/SDC/2022/pdfs/SNIA-SDC22-Harris-SPDK-and-Infrastructure-Offload.pdf
libvfio-user git repo: https://github.com/nutanix/libvfio-user.git
Author: 晓兵 (ssbandjl)
Blog: https://cloud.tencent.com/developer/user/5060293/articles | https://logread.cn | https://blog.csdn.net/ssbandjl | https://www.zhihu.com/people/ssbandjl/posts
https://chattoyou.cn (feedback/comments)
DPU column: https://cloud.tencent.com/developer/column/101987
Contact: anyone interested in DPU/SmartNIC, offload, networking, storage acceleration, or security isolation technologies is welcome to join the DPU technical discussion group.