When you need to send small data packets over TCP, the design of your Winsock application is especially critical. A design that does not take into account the interaction of delayed acknowledgment, the Nagle algorithm, and Winsock buffering can drastically affect performance. This article discusses these issues through two case studies and derives a series of recommendations for sending small data packets efficiently from a Winsock application.
When a Microsoft TCP stack receives a data packet, a 200-ms delay timer starts. When an ACK is eventually sent, the delay timer is reset and will start another 200-ms delay when the next data packet is received. To increase efficiency in both Internet and intranet applications, the Microsoft TCP stack uses the following criteria to decide when to send an ACK for received data packets:
- If the second data packet is received before the delay timer expires, the ACK is sent.
- If there are data to be sent in the same direction as the ACK before the second data packet is received and the delay timer expires, the ACK is piggybacked with the data segment and sent immediately.
- When the delay timer expires, the ACK is sent.
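The three criteria above can be sketched as a toy model. This is only an illustration of the rules as stated in this article, not the stack's actual implementation; the function name and its parameters are hypothetical.

```python
DELAYED_ACK_TIMEOUT_MS = 200  # delay timer length stated in the text


def ack_send_time_ms(first_packet_ms, second_packet_ms=None, reply_data_ms=None):
    """Return the time at which the receiver sends its ACK.

    All arguments are times in milliseconds; None means "never happens".
    Toy model of the three criteria above, not real stack behavior.
    """
    deadline = first_packet_ms + DELAYED_ACK_TIMEOUT_MS
    candidates = [deadline]  # criterion 3: the delay timer expires
    # Criteria 1 and 2: a second packet, or outgoing data to piggyback on,
    # triggers the ACK early if it shows up before the timer expires.
    for event in (second_packet_ms, reply_data_ms):
        if event is not None and first_packet_ms <= event < deadline:
            candidates.append(event)
    return min(candidates)


print(ack_send_time_ms(0))                       # lone packet
print(ack_send_time_ms(0, second_packet_ms=30))  # second packet arrives early
print(ack_send_time_ms(0, reply_data_ms=5))      # ACK piggybacked on reply data
```

A lone packet with nothing to piggyback on is the only case that waits out the full 200 ms, which is the situation both case studies run into.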
To keep small data packets from congesting the network, the Microsoft TCP stack enables the Nagle algorithm by default. The algorithm coalesces small data buffers from multiple send calls and delays sending them until an ACK for the previously sent data packet is received from the remote host. The following are two exceptions to the Nagle algorithm:
- If the stack has coalesced a data buffer larger than the Maximum Transmission Unit (MTU), a full-sized packet is sent immediately without waiting for the ACK from the remote host. On an Ethernet network, the MTU is 1500 bytes, which leaves 1460 bytes of TCP payload per packet after the 20-byte IP header and the 20-byte TCP header.
- The TCP_NODELAY socket option is applied to disable the Nagle algorithm so that the small data packets are delivered to the remote host without delay.
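As a minimal sketch of the second exception, the snippet below sets TCP_NODELAY on a TCP socket. It uses Python's standard socket module rather than the Winsock C API, and the helper name make_low_latency_socket is hypothetical.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns TCP_NODELAY on, so small sends are not held
    # back while waiting for the ACK of the previous packet.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("Nagle disabled:", nodelay != 0)
sock.close()
```

The option only changes behavior on the side that sets it, so it has to be applied on whichever end sends the small packets.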
To optimize performance at the application layer, Winsock copies data buffers from application send calls to a Winsock kernel buffer. Then, the stack uses its own heuristics (such as the Nagle algorithm) to determine when to actually put the packet on the wire. You can change the amount of Winsock kernel buffer allocated to the socket using the SO_SNDBUF option (it is 8K by default). If necessary, Winsock can buffer significantly more than the SO_SNDBUF buffer size. In most cases, a send completion in the application only indicates that the data buffer in the application send call has been copied to the Winsock kernel buffer; it does not indicate that the data has hit the network medium. The only exception is when you disable Winsock buffering by setting SO_SNDBUF to 0.
Winsock uses the following rules to indicate a send completion to the application (depending on how the send is invoked, the completion notification could be the function returning from a blocking call, signaling an event or calling a notification function, and so forth):
- If the socket is still within SO_SNDBUF quota, Winsock copies the data from the application send and indicates the send completion to the application.
- If the socket is beyond SO_SNDBUF quota and there is only one previously buffered send still in the stack kernel buffer, Winsock copies the data from the application send and indicates the send completion to the application.
- If the socket is beyond SO_SNDBUF quota and there is more than one previously buffered send in the stack kernel buffer, Winsock copies the data from the application send. Winsock does not indicate the send completion to the application until the stack completes enough sends to bring the socket back within SO_SNDBUF quota, or until only one buffered send remains outstanding.
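The three completion rules can be written down as a small decision function. This is a sketch of the rules exactly as listed, under the assumption that "within quota" means the bytes already buffered are below SO_SNDBUF. It covers only the buffered case (a quota greater than zero), and the function name and parameters are hypothetical.

```python
def send_completes_immediately(buffered_bytes, sndbuf_quota, buffered_sends):
    """Decide whether Winsock would indicate completion right away.

    buffered_bytes: bytes already held in the kernel buffer for this socket
    sndbuf_quota:   the SO_SNDBUF value (must be > 0 in this model)
    buffered_sends: number of previously buffered sends still outstanding
    """
    if sndbuf_quota <= 0:
        raise ValueError("this model only covers buffered sends (SO_SNDBUF > 0)")
    if buffered_bytes < sndbuf_quota:
        return True   # rule 1: still within quota
    if buffered_sends <= 1:
        return True   # rule 2: beyond quota, only one send buffered
    return False      # rule 3: completion is deferred


print(send_completes_immediately(4096, 8192, 3))    # within quota
print(send_completes_immediately(10000, 8192, 1))   # beyond quota, one send
print(send_completes_immediately(12000, 8192, 2))   # beyond quota, two sends
```

The second rule is why Winsock can end up buffering more than SO_SNDBUF: one additional send is always accepted even when the quota is already exceeded.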
Case Study 1
Overview:
A Winsock TCP client needs to send 10000 records to a Winsock TCP server to store in a database. The size of the records varies from 20 bytes to 100 bytes long. To simplify the application logic, the design is as follows:
- The client does blocking send only. The server does blocking recv only.
- The client socket sets the SO_SNDBUF to 0 so that each record goes out in a single data segment.
- The server calls recv in a loop. The buffer posted in recv is 200 bytes so that each record can be received in one recv call.
Performance:
During testing, the developer finds that the client can send only five records per second to the server. The 10000 records, at most about 976 KB of data (10000 * 100 / 1024), take more than half an hour to send to the server.
Analysis:
Because the client does not set the TCP_NODELAY option, the Nagle algorithm forces the TCP stack to wait for an ACK before it can send another packet on the wire. However, the client has disabled Winsock buffering by setting the SO_SNDBUF option to 0. Therefore, the 10000 send calls have to be sent and acknowledged individually. Each ACK is delayed by 200 ms because the following occurs on the server's TCP stack:
- When the server gets a packet, its 200-ms delay timer starts.
- The server does not need to send anything back, so the ACK cannot be piggybacked.
- The client will not send another packet unless the previous packet is acknowledged.
- The delay timer on the server expires and the ACK is sent back.
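The measured throughput follows directly from this cycle: one record goes out per delayed ACK. The short calculation below reproduces the figures from the Performance section using only numbers given in the text; the function names are hypothetical.

```python
ACK_DELAY_MS = 200        # delayed-ACK timer from the text
RECORD_COUNT = 10_000
MAX_RECORD_BYTES = 100


def records_per_second(ack_delay_ms):
    # One record is released per ACK, and each ACK waits out the full timer.
    return 1000 / ack_delay_ms


def transfer_minutes(record_count, ack_delay_ms):
    # Total time if every record costs one full delayed-ACK interval.
    return record_count * ack_delay_ms / 1000 / 60


print(records_per_second(ACK_DELAY_MS))
print(round(transfer_minutes(RECORD_COUNT, ACK_DELAY_MS), 1))
print(round(RECORD_COUNT * MAX_RECORD_BYTES / 1024, 1))
```

This works out to 5 records per second and roughly 33 minutes for at most about 977 KB of data, which matches the "more than half an hour" observed by the developer.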
How to Improve:
There are two problems with this design. First, there is the delay timer issue. The client needs to be able to send two packets to the server within 200 ms. Because the client uses the Nagle algorithm by default, it should just use the default Winsock buffering and not set SO_SNDBUF to 0. Once the TCP stack has coalesced a buffer larger than the Maximum Transmission Unit (MTU), a full-sized packet is sent immediately without waiting for the ACK from the remote host. Second, this design calls send once for each small record, which is not very efficient. In this case, the developer might want to pad each record to 100 bytes and send 80 records at a time from one client send call. To let the server know how many records will be sent in total, the client might want to start off the communication with a fixed-size header containing the number of records to follow.
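A minimal sketch of the suggested batching follows. The 100-byte padding and 80 records per send come from the text; the 4-byte big-endian count header, the NUL padding byte, and all function names are assumptions made for illustration.

```python
import struct

RECORD_SIZE = 100        # pad every record to this size, as the text suggests
RECORDS_PER_SEND = 80    # 80 * 100 = 8000 bytes, just under the 8K default
HEADER = struct.Struct("!I")  # assumed header: record count, 4 bytes, network order


def pad_record(record: bytes) -> bytes:
    """Pad one record with NUL bytes up to RECORD_SIZE."""
    if len(record) > RECORD_SIZE:
        raise ValueError("record longer than RECORD_SIZE")
    return record.ljust(RECORD_SIZE, b"\x00")


def build_send_buffers(records):
    """Return the buffers to pass to send, one per call, header first."""
    buffers = [HEADER.pack(len(records))]
    for start in range(0, len(records), RECORDS_PER_SEND):
        batch = records[start:start + RECORDS_PER_SEND]
        buffers.append(b"".join(pad_record(r) for r in batch))
    return buffers


records = [b"record-%d" % i for i in range(10_000)]
buffers = build_send_buffers(records)
print(len(buffers))      # header plus the number of batches
print(len(buffers[1]))   # size of one full batch in bytes
```

Each element of buffers would then be handed to a single send call, so the 10000 records need 126 calls instead of 10000. NUL padding only works if records never end in NUL bytes themselves; a length prefix per record would avoid that restriction.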
Case Study 2
Overview:
A Winsock TCP client application opens two connections with a Winsock TCP server application that provides a stock quote service. The first connection is used as a command channel to send the stock symbol to the server. The second connection is used as a data channel to receive the stock quote. After the two connections have been established, the client sends a stock symbol to the server through the command channel and waits for the stock quote to come back through the data channel. It sends the next stock symbol request to the server only after the first stock quote has been received. Neither the client nor the server sets the SO_SNDBUF or TCP_NODELAY options.
Performance:
During testing, the developer finds the client could only get five quotes per second.
Analysis:
This design only allows one outstanding stock quote request at a time. The first stock symbol is sent to the server through the command channel (connection) and a response is immediately sent back from the server to the client over the data channel (connection). Then, the client immediately sends the second stock symbol request and the send returns immediately as the request buffer in the send call is copied to the Winsock kernel buffer. However, the client TCP stack cannot send the request from its kernel buffer immediately because the first send over the command channel is not acknowledged yet. After the 200-ms delay timer at the server command channel expires, the ACK for the first symbol request comes back to the client. Then, the second quote request is successfully sent to the server after being delayed for 200-ms. The quote for the second stock symbol comes back immediately through the data channel because, at this time, the delay timer at the client data channel has expired. An ACK for the previous quote response is received by the server. (Remember that the client could not send a second stock quote request for 200-ms, thus giving time for the delay timer on the client to expire and send an ACK to the server.) As a result, the client gets the second quote response and can issue another quote request, which is subject to the same cycle.
How to Improve:
The two-connection (channel) design is unnecessary here. If you use only one connection for the stock quote request and response, the ACK for the quote request can be piggybacked on the quote response and come back immediately. To further improve performance, the client could "multiplex" multiple stock quote requests into one send call to the server, and the server could likewise "multiplex" multiple quote responses into one send call to the client. If the design with two unidirectional channels is really necessary for some reason, both sides should set the TCP_NODELAY option so that the small packets can be sent immediately without having to wait for an ACK for the previous packet.
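One way to "multiplex" several requests into one send call is to frame them in a single buffer. The sketch below assumes a newline-terminated ASCII format, which the article does not specify; the function names and the example symbols are hypothetical.

```python
def encode_requests(symbols):
    """Join several stock symbols into one newline-terminated buffer."""
    return "".join(symbol + "\n" for symbol in symbols).encode("ascii")


def decode_requests(buffer: bytes):
    """Split received bytes into complete symbols plus any partial tail.

    TCP is a byte stream, so a recv call may end in the middle of a
    symbol; the leftover bytes must be kept and prepended to the next recv.
    """
    *complete, leftover = buffer.split(b"\n")
    return [item.decode("ascii") for item in complete], leftover


payload = encode_requests(["MSFT", "IBM", "INTC"])
print(payload)
print(decode_requests(payload))
print(decode_requests(b"MSFT\nIB"))
```

The client would pass payload to one send call on the single connection, and the server could answer with all quotes in one buffer framed the same way.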
Recommendations:
While these two case studies are fabricated, they help to illustrate some worst case scenarios. When you design an application that involves extensive small data segment sends and recvs, you should consider the following guidelines:
- If the data segments are not time critical, the application should coalesce them into a larger data block to pass to a send call. Because the send buffer is likely to be copied to the Winsock kernel buffer, the buffer should not be too large. A little less than 8K is usually effective. As long as the Winsock kernel gets a block larger than the MTU, it will send out multiple full-sized packets and a last packet with whatever is left. Except for the last packet, the sending side will not be hit by the 200-ms delay timer. The last packet, if it happens to be an odd-numbered packet, is still subject to the delayed acknowledgment algorithm. If the sending stack then gets another block larger than the MTU, it can still bypass the Nagle algorithm.
- If possible, avoid socket connections with unidirectional data flow. Communications over unidirectional sockets are more easily impacted by the Nagle and delayed acknowledgment algorithms. If the communication follows a request and a response flow, you should use a single socket to do both sends and recvs so that the ACK can be piggybacked on the response.
- If all the small data segments have to be sent immediately, set TCP_NODELAY option on the sending end.
- Unless you want to guarantee a packet is sent on the wire when a send completion is indicated by Winsock, you should not set the SO_SNDBUF to zero. In fact, the default 8K buffer has been heuristically determined to work well for most situations and you should not change it unless you have tested that your new Winsock buffer setting gives you better performance than the default. Also, setting SO_SNDBUF to zero is mostly beneficial for applications that do bulk data transfer. Even then, for maximum efficiency you should use it in conjunction with double buffering (more than one outstanding send at any given time) and overlapped I/O.
- If the data delivery does not have to be guaranteed, use UDP.
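The first guideline can be sketched as a small application-level coalescing buffer. The class name SendCoalescer, its methods, and the 8000-byte threshold (chosen as "a little less than 8K") are assumptions for illustration; in a real program the send callable would typically be a connected socket's sendall method.

```python
class SendCoalescer:
    """Collect small segments and hand them to `send` in larger blocks."""

    def __init__(self, send, flush_threshold=8000):
        self._send = send                 # callable that takes one bytes object
        self._threshold = flush_threshold
        self._pending = bytearray()

    def write(self, segment: bytes):
        # Flush first if this segment would push the block past the threshold.
        if self._pending and len(self._pending) + len(segment) > self._threshold:
            self.flush()
        self._pending += segment
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        """Send whatever is pending as one block."""
        if self._pending:
            self._send(bytes(self._pending))
            self._pending.clear()


sent = []                                 # stands in for a real socket
coalescer = SendCoalescer(sent.append)
for _ in range(200):
    coalescer.write(b"x" * 50)            # 200 small 50-byte segments
coalescer.flush()                         # push out the final partial block
print([len(block) for block in sent])
```

Here 200 small writes turn into two send calls of 8000 and 2000 bytes. The explicit flush at the end matters: without it the last partial block would sit in the application buffer indefinitely.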
REFERENCES
For more information about delayed acknowledgment and the Nagle algorithm, see the following: Braden, R. (1989). RFC 1122, Requirements for Internet Hosts -- Communication Layers. Internet Engineering Task Force.