Linux Networking Stack: Sending Data

2022-04-18 19:40:09

This blog post explains how computers running the Linux kernel send packets, as well as how to monitor and tune each component of the networking stack as packets flow from user programs to network hardware.

It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.

This blog post will hopefully serve as a reference to anyone looking to do this.

General advice on monitoring and tuning the Linux networking stack

As mentioned in our previous article, the Linux network stack is complex and there is no one-size-fits-all solution for monitoring or tuning. If you truly want to tune the network stack, you will have no choice but to invest a considerable amount of time, effort, and money into understanding how the various parts of the networking system interact.

Many of the example settings provided in this blog post are used solely for illustrative purposes and are not a recommendation for or against a certain configuration or default setting. Before adjusting any setting, you should develop a frame of reference around what you need to be monitoring to notice a meaningful change.

Adjusting networking settings while connected to the machine over a network is dangerous; you could very easily lock yourself out or completely take out your networking. Do not adjust these settings on production machines; instead, make adjustments on new machines and rotate them into production, if possible.

Overview

For reference, you may want to have a copy of the device data sheet handy. This post will examine the Intel I350 Ethernet controller, controlled by the igb device driver. You can find that data sheet (warning: LARGE PDF) here for your reference.

The high-level path network data takes from a user program to a network device is as follows:

  1. Data is written using a system call (like sendto, sendmsg, et. al.).
  2. Data passes through the socket subsystem on to the socket’s protocol family’s system (in our case, AF_INET).
  3. The protocol family passes data through the protocol layers which (in many cases) arrange the data into packets.
  4. The data passes through the routing layer, populating the destination and neighbour caches along the way (if they are cold). This can generate ARP traffic if an ethernet address needs to be looked up.
  5. After passing through the protocol layers, packets reach the device agnostic layer.
  6. The output queue is chosen using XPS (if enabled) or a hash function.
  7. The device driver’s transmit function is called.
  8. The data is then passed on to the queue discipline (qdisc) attached to the output device.
  9. The qdisc will either transmit the data directly if it can, or queue it up to be sent during the NET_TX softirq.
  10. Eventually the data is handed down to the driver from the qdisc.
  11. The driver creates the needed DMA mappings so the device can read the data from RAM.
  12. The driver signals the device that the data is ready to be transmitted.
  13. The device fetches the data from RAM and transmits it.
  14. Once transmission is complete, the device raises an interrupt to signal transmit completion.
  15. The driver’s registered IRQ handler for transmit completion runs. For many devices, this handler simply triggers the NAPI poll loop to start running via the NET_RX softirq.
  16. The poll function runs via a softIRQ and calls down into the driver to unmap DMA regions and free packet data.

This entire flow will be examined in detail in the following sections.

The protocol layers examined below are the IP and UDP protocol layers. Much of the information presented will serve as a reference for other protocol layers, as well.

Detailed Look

This blog post will be examining the Linux kernel version 3.13.0 with links to code on GitHub and code snippets throughout this post, much like the companion post.

Let’s begin by examining how protocol families are registered in the kernel and used by the socket subsystem, then we can proceed to receiving data.

Protocol family registration

What happens when you run a piece of code like this in a user program to create a UDP socket?

sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

In short, the Linux kernel looks up a set of functions exported by the UDP protocol stack that deal with many things, including sending and receiving network data. To understand exactly how this works, we have to look into the AF_INET address family code.

The Linux kernel executes the inet_init function early during kernel initialization. This function registers the AF_INET protocol family, the individual protocol stacks within that family (TCP, UDP, ICMP, and RAW), and calls initialization routines to get protocol stacks ready to process network data. You can find the code for inet_init in ./net/ipv4/af_inet.c.

https://github.com/torvalds/linux/blob/v3.13/net/ipv4/af_inet.c#L1678-L1804

static int __init inet_init(void)
{
  struct inet_protosw *q;
  struct list_head *r;
  int rc = -EINVAL;

  BUILD_BUG_ON(sizeof(struct inet_skb_parm) > FIELD_SIZEOF(struct sk_buff, cb));

  sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
  if (!sysctl_local_reserved_ports)
    goto out;

  rc = proto_register(&tcp_prot, 1);
  if (rc)
    goto out_free_reserved_ports;

  rc = proto_register(&udp_prot, 1);
  if (rc)
    goto out_unregister_tcp_proto;

  rc = proto_register(&raw_prot, 1);
  if (rc)
    goto out_unregister_udp_proto;

  rc = proto_register(&ping_prot, 1);
  if (rc)
    goto out_unregister_raw_proto;

  /*
   *  Tell SOCKET that we are alive...
   */

  (void)sock_register(&inet_family_ops);

#ifdef CONFIG_SYSCTL
  ip_static_sysctl_init();
#endif

  /*
   *  Add all the base protocols.
   */

  if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
    pr_crit("%s: Cannot add ICMP protocol\n", __func__);
  if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
    pr_crit("%s: Cannot add UDP protocol\n", __func__);
  if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
    pr_crit("%s: Cannot add TCP protocol\n", __func__);
#ifdef CONFIG_IP_MULTICAST
  if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0)
    pr_crit("%s: Cannot add IGMP protocol\n", __func__);
#endif

  /* Register the socket-side information for inet_create. */
  for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
    INIT_LIST_HEAD(r);

  for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
    inet_register_protosw(q);

  /*
   *  Set the ARP module up
   */

  arp_init();

  /*
   *  Set the IP module up
   */

  ip_init();

  tcp_v4_init();

  /* Setup TCP slab cache for open requests. */
  tcp_init();

  /* Setup UDP memory threshold */
  udp_init();

  /* Add UDP-Lite (RFC 3828) */
  udplite4_register();

  ping_init();

  /*
   *  Set the ICMP layer up
   */

  if (icmp_init() < 0)
    panic("Failed to create the ICMP control socket.\n");

  /*
   *  Initialise the multicast router
   */
#if defined(CONFIG_IP_MROUTE)
  if (ip_mr_init())
    pr_crit("%s: Cannot init ipv4 mroute\n", __func__);
#endif
  /*
   *  Initialise per-cpu ipv4 mibs
   */

  if (init_ipv4_mibs())
    pr_crit("%s: Cannot init ipv4 mibs\n", __func__);

  ipv4_proc_init();

  ipfrag_init();

  dev_add_pack(&ip_packet_type);

  rc = 0;
out:
  return rc;
out_unregister_raw_proto:
  proto_unregister(&raw_prot);
out_unregister_udp_proto:
  proto_unregister(&udp_prot);
out_unregister_tcp_proto:
  proto_unregister(&tcp_prot);
out_free_reserved_ports:
  kfree(sysctl_local_reserved_ports);
  goto out;
}

fs_initcall(inet_init);

The AF_INET protocol family exports a structure that has a create function. This function is called by the kernel when a socket is created from a user program:

https://github.com/torvalds/linux/blob/d8ec26d7f8287f5788a494f56e8814210f0e64be/net/ipv4/af_inet.c#L992

static const struct net_proto_family inet_family_ops = {
  .family = PF_INET,
  .create = inet_create,
  .owner  = THIS_MODULE,
};

The inet_create function takes the arguments passed to the socket system call and searches the registered protocols to find a set of operations to link to the socket. Take a look:

https://github.com/torvalds/linux/blob/v3.13/net/ipv4/af_inet.c#L267

/* Look for the requested type/protocol pair. */
lookup_protocol:
  err = -ESOCKTNOSUPPORT;
  rcu_read_lock();
  list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {

    err = 0;
    /* Check the non-wild match. */
    if (protocol == answer->protocol) {
      if (protocol != IPPROTO_IP)
        break;
    } else {
      /* Check for the two wild cases. */
      if (IPPROTO_IP == protocol) {
        protocol = answer->protocol;
        break;
      }
      if (IPPROTO_IP == answer->protocol)
        break;
    }
    err = -EPROTONOSUPPORT;
  }

Later, answer, which holds a reference to a particular protocol stack, has its ops field copied into the socket structure:

https://github.com/torvalds/linux/blob/d8ec26d7f8287f5788a494f56e8814210f0e64be/net/ipv4/af_inet.c#L316

sock->ops = answer->ops;

You can find the structure definitions for all of the protocol stacks in af_inet.c. Let’s take a look at the TCP and UDP protocol structures:

https://github.com/torvalds/linux/blob/v3.13/net/ipv4/af_inet.c#L998-L1020

/* Upon startup we insert all the elements in inetsw_array[] into
 * the linked list inetsw.
 */
static struct inet_protosw inetsw_array[] =
{
  {
    .type =       SOCK_STREAM,
    .protocol =   IPPROTO_TCP,
    .prot =       &tcp_prot,
    .ops =        &inet_stream_ops,
    .no_check =   0,
    .flags =      INET_PROTOSW_PERMANENT |
            INET_PROTOSW_ICSK,
  },

  {
    .type =       SOCK_DGRAM,
    .protocol =   IPPROTO_UDP,
    .prot =       &udp_prot,
    .ops =        &inet_dgram_ops,
    .no_check =   UDP_CSUM_DEFAULT,
    .flags =      INET_PROTOSW_PERMANENT,
  },

  {
    .type =       SOCK_DGRAM,
    .protocol =   IPPROTO_ICMP,
    .prot =       &ping_prot,
    .ops =        &inet_dgram_ops,
    .no_check =   UDP_CSUM_DEFAULT,
    .flags =      INET_PROTOSW_REUSE,
  },

  {
    .type =       SOCK_RAW,
    .protocol =   IPPROTO_IP,  /* wild card */
    .prot =       &raw_prot,
    .ops =        &inet_sockraw_ops,
    .no_check =   UDP_CSUM_DEFAULT,
    .flags =      INET_PROTOSW_REUSE,
  }
};

In the case of IPPROTO_UDP, an ops structure is linked into place which contains functions for various things, including sending and receiving data:

https://github.com/torvalds/linux/blob/v3.13/net/ipv4/af_inet.c#L935-L960

const struct proto_ops inet_dgram_ops = {
  .family       = PF_INET,
  .owner       = THIS_MODULE,
  .release     = inet_release,
  .bind       = inet_bind,
  .connect     = inet_dgram_connect,
  .socketpair     = sock_no_socketpair,
  .accept       = sock_no_accept,
  .getname     = inet_getname,
  .poll       = udp_poll,
  .ioctl       = inet_ioctl,
  .listen       = sock_no_listen,
  .shutdown     = inet_shutdown,
  .setsockopt     = sock_common_setsockopt,
  .getsockopt     = sock_common_getsockopt,
  .sendmsg     = inet_sendmsg,
  .recvmsg     = inet_recvmsg,
  .mmap       = sock_no_mmap,
  .sendpage     = inet_sendpage,
#ifdef CONFIG_COMPAT
  .compat_setsockopt = compat_sock_common_setsockopt,
  .compat_getsockopt = compat_sock_common_getsockopt,
  .compat_ioctl     = inet_compat_ioctl,
#endif
};
EXPORT_SYMBOL(inet_dgram_ops);

and a protocol-specific structure, prot, which contains function pointers to all of the internal UDP protocol stack functions. For the UDP protocol, this structure is called udp_prot and is exported by ./net/ipv4/udp.c:

https://github.com/torvalds/linux/blob/v3.13/net/ipv4/udp.c#L2171-L2203

struct proto udp_prot = {
  .name       = "UDP",
  .owner       = THIS_MODULE,
  .close       = udp_lib_close,
  .connect     = ip4_datagram_connect,
  .disconnect     = udp_disconnect,
  .ioctl       = udp_ioctl,
  .destroy     = udp_destroy_sock,
  .setsockopt     = udp_setsockopt,
  .getsockopt     = udp_getsockopt,
  .sendmsg     = udp_sendmsg,
  .recvmsg     = udp_recvmsg,
  .sendpage     = udp_sendpage,
  .backlog_rcv     = __udp_queue_rcv_skb,
  .release_cb     = ip4_datagram_release_cb,
  .hash       = udp_lib_hash,
  .unhash       = udp_lib_unhash,
  .rehash       = udp_v4_rehash,
  .get_port     = udp_v4_get_port,
  .memory_allocated  = &udp_memory_allocated,
  .sysctl_mem     = sysctl_udp_mem,
  .sysctl_wmem     = &sysctl_udp_wmem_min,
  .sysctl_rmem     = &sysctl_udp_rmem_min,
  .obj_size     = sizeof(struct udp_sock),
  .slab_flags     = SLAB_DESTROY_BY_RCU,
  .h.udp_table     = &udp_table,
#ifdef CONFIG_COMPAT
  .compat_setsockopt = compat_udp_setsockopt,
  .compat_getsockopt = compat_udp_getsockopt,
#endif
  .clear_sk     = sk_prot_clear_portaddr_nulls,
};
EXPORT_SYMBOL(udp_prot);

Now, let’s turn to a user program that sends UDP data to see how udp_sendmsg is called in the kernel!

Sending network data via a socket

A user program wants to send UDP network data and so it uses the sendto system call, maybe like this:

ret = sendto(socket, buffer, buflen, 0, &dest, sizeof(dest));

This system call passes through the Linux system call layer and lands in this function in ./net/socket.c:

https://github.com/torvalds/linux/blob/v3.13/net/socket.c#L1756-L1803

/*
 *  Send a datagram to a given address. We move the address into kernel
 *  space and check the user space data area is readable before invoking
 *  the protocol.
 */

SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
    unsigned int, flags, struct sockaddr __user *, addr,
    int, addr_len)
{
  struct socket *sock;
  struct sockaddr_storage address;
  int err;
  struct msghdr msg;
  struct iovec iov;
  int fput_needed;

  if (len > INT_MAX)
    len = INT_MAX;
  sock = sockfd_lookup_light(fd, &err, &fput_needed);
  if (!sock)
    goto out;

  iov.iov_base = buff;
  iov.iov_len = len;
  msg.msg_name = NULL;
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = NULL;
  msg.msg_controllen = 0;
  msg.msg_namelen = 0;
  if (addr) {
    err = move_addr_to_kernel(addr, addr_len, &address);
    if (err < 0)
      goto out_put;
    msg.msg_name = (struct sockaddr *)&address;
    msg.msg_namelen = addr_len;
  }
  if (sock->file->f_flags & O_NONBLOCK)
    flags |= MSG_DONTWAIT;
  msg.msg_flags = flags;
  err = sock_sendmsg(sock, &msg, len);

out_put:
  fput_light(sock->file, fput_needed);
out:
  return err;
}

The SYSCALL_DEFINE6 macro unfolds into a pile of macros, which in turn, set up the infrastructure needed to create a system call with 6 arguments (hence DEFINE6). One of the results of this is that inside the kernel, system call function names have sys_ prepended to them.

The system call code for sendto arranges the data in a way that the lower layers will be able to handle, then calls sock_sendmsg. In particular, it takes the destination address passed into sendto and packs it into a structure. Let’s take a look:

iov.iov_base = buff;
  iov.iov_len = len;
  msg.msg_name = NULL;
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = NULL;
  msg.msg_controllen = 0;
  msg.msg_namelen = 0;
  if (addr) {
    err = move_addr_to_kernel(addr, addr_len, &address);
    if (err < 0)
      goto out_put;
    msg.msg_name = (struct sockaddr *)&address;
    msg.msg_namelen = addr_len;
  }

This code copies addr, passed in from the user program, into the kernel data structure address, which is then embedded into a struct msghdr structure as msg_name. This is similar to what a userland program would do if it were calling sendmsg instead of sendto. The kernel provides this conversion because both sendto and sendmsg call down to sock_sendmsg.

sock_sendmsg, __sock_sendmsg, and __sock_sendmsg_nosec

sock_sendmsg performs some error checking before calling __sock_sendmsg, which does its own error checking before calling __sock_sendmsg_nosec. __sock_sendmsg_nosec passes the data deeper into the socket subsystem:

https://github.com/torvalds/linux/blob/d8ec26d7f8287f5788a494f56e8814210f0e64be/net/socket.c#L622

static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,
               struct msghdr *msg, size_t size)
{
  struct sock_iocb *si = kiocb_to_siocb(iocb);

  si->sock = sock;
  si->scm = NULL;
  si->msg = msg;
  si->size = size;

  return sock->ops->sendmsg(iocb, sock, msg, size);
}

As seen in the previous section explaining socket creation, the sendmsg function registered to this socket ops structure is inet_sendmsg.

inet_sendmsg

As you may have guessed from the name, this is a generic function provided by the AF_INET protocol family. This function starts by calling sock_rps_record_flow to record the last CPU that the flow was processed on; this is used by Receive Packet Steering. Next, this function looks up the sendmsg function on the socket’s internal protocol operations structure and calls it:

https://github.com/torvalds/linux/blob/v3.13/net/ipv4/af_inet.c#L935-L960

int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
     size_t size)
{
  struct sock *sk = sock->sk;

  sock_rps_record_flow(sk);

  /* We may need to bind the socket. */
  if (!inet_sk(sk)->inet_num && !sk->sk_prot->no_autobind &&
      inet_autobind(sk))
    return -EAGAIN;

  return sk->sk_prot->sendmsg(iocb, sk, msg, size);
}
EXPORT_SYMBOL(inet_sendmsg);

When dealing with UDP, sk->sk_prot->sendmsg above is udp_sendmsg as exported by the UDP protocol layer, via the udp_prot structure we saw earlier. This function call transitions from the generic AF_INET protocol family on to the UDP protocol stack.

UDP protocol layer

udp_sendmsg

The udp_sendmsg function can be found in ./net/ipv4/udp.c. The entire function is quite long, so we’ll examine pieces of it below. Follow the source link if you’d like to read it in its entirety.

https://github.com/torvalds/linux/blob/v3.13/net/ipv4/udp.c#L845-L1088

int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
    size_t len)
{
  struct inet_sock *inet = inet_sk(sk);
  struct udp_sock *up = udp_sk(sk);
  struct flowi4 fl4_stack;
  struct flowi4 *fl4;
  int ulen = len;
  struct ipcm_cookie ipc;
  struct rtable *rt = NULL;
  int free = 0;
  int connected = 0;
  __be32 daddr, faddr, saddr;
  __be16 dport;
  u8  tos;
  int err, is_udplite = IS_UDPLITE(sk);
  int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
  int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
  struct sk_buff *skb;
  struct ip_options_data opt_copy;
  /* ... */

UDP corking

After variable declarations and some basic error checking, one of the first things udp_sendmsg does is check if the socket is “corked”. UDP corking is a feature that allows a user program to request that the kernel accumulate data from multiple calls to send into a single datagram before sending it. There are two ways to enable this option in your user program:

  1. Use the setsockopt system call and pass UDP_CORK as the socket option.
  2. Pass MSG_MORE as one of the flags when calling send, sendto, or sendmsg from your program.

The code from udp_sendmsg checks up->pending to determine if the socket is currently corked, and if so, it proceeds directly to appending data. We’ll see how data is appended later.

int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t len)
{

  /* variables and error checking ... */

  fl4 = &inet->cork.fl.u.ip4;
  if (up->pending) {
          /*
           * There are pending frames.
           * The socket lock must be held while it's corked.
           */
          lock_sock(sk);
          if (likely(up->pending)) {
                  if (unlikely(up->pending != AF_INET)) {
                          release_sock(sk);
                          return -EINVAL;
                  }
                  goto do_append_data;
          }
          release_sock(sk);
  }
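
The effect of corking is visible from userspace: with UDP_CORK set, multiple sends are merged into one datagram that is only transmitted when the cork is removed. A sketch over loopback, assuming a Linux host:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/udp.h>   /* UDP_CORK */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Receiver bound to an ephemeral loopback port. */
    int rx = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    assert(bind(rx, (struct sockaddr *)&addr, sizeof(addr)) == 0);
    socklen_t alen = sizeof(addr);
    assert(getsockname(rx, (struct sockaddr *)&addr, &alen) == 0);

    int tx = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    /* Cork the socket: subsequent sends accumulate instead of transmitting. */
    int on = 1;
    assert(setsockopt(tx, IPPROTO_UDP, UDP_CORK, &on, sizeof(on)) == 0);
    assert(sendto(tx, "Hello", 5, 0, (struct sockaddr *)&addr, sizeof(addr)) == 5);
    assert(sendto(tx, "World", 5, 0, (struct sockaddr *)&addr, sizeof(addr)) == 5);

    /* Uncork: the accumulated data is flushed as a single datagram. */
    int off = 0;
    assert(setsockopt(tx, IPPROTO_UDP, UDP_CORK, &off, sizeof(off)) == 0);

    char buf[64];
    ssize_t n = recv(rx, buf, sizeof(buf) - 1, 0);
    buf[n] = '\0';
    printf("got %zd bytes: %s\n", n, buf);

    close(tx);
    close(rx);
    return 0;
}
```

Passing MSG_MORE on each send except the last has the same effect without the setsockopt calls.
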

Get the UDP destination address and port

Next, the destination address and port are determined from one of two possible sources:

  1. The socket itself has the destination address stored because the socket was connected at some point.
  2. The address is passed in via an auxiliary structure, as we saw in the kernel code for sendto.

Here’s how the kernel deals with this:

  /*
   *      Get and verify the address.
   */
  if (msg->msg_name) {
          struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
          if (msg->msg_namelen < sizeof(*usin))
                  return -EINVAL;
          if (usin->sin_family != AF_INET) {
                  if (usin->sin_family != AF_UNSPEC)
                          return -EAFNOSUPPORT;
          }

          daddr = usin->sin_addr.s_addr;
          dport = usin->sin_port;
          if (dport == 0)
                  return -EINVAL;
  } else {
          if (sk->sk_state != TCP_ESTABLISHED)
                  return -EDESTADDRREQ;
          daddr = inet->inet_daddr;
          dport = inet->inet_dport;
          /* Open fast path for connected socket.
             Route will not be used, if at least one option is set.
           */
          connected = 1;
  }

Yes, that is a TCP_ESTABLISHED in the UDP protocol layer! The socket states, for better or worse, use TCP state descriptions.

Recall earlier that we saw how the kernel arranges a struct msghdr structure on behalf of the user when the user program calls sendto. The code above shows how the kernel parses that data back out in order to set daddr and dport.

If no destination address was supplied in the struct msghdr, the destination address and port are retrieved from the socket itself (which must be connected), and the connected fast-path flag is set.

In either case daddr and dport will be set to the destination address and port.

Socket transmit bookkeeping and timestamping

Next, the source address, device index, and any timestamping options which were set on the socket (like SOCK_TIMESTAMPING_TX_HARDWARE, SOCK_TIMESTAMPING_TX_SOFTWARE, SOCK_WIFI_STATUS) are retrieved and stored:

ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
sock_tx_timestamp(sk, &ipc.tx_flags);

https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-sending-data/
