The Transmission Control Protocol (TCP) is a stream-oriented network protocol built on top of the Internet Protocol (IP). TCP guarantees that network packets are received in the order they were sent, without duplication. This is done through TCP's built-in mechanisms of acknowledgment and retransmission.
TCP does not require each IP packet to be acknowledged before sending the next one. How much data (IP packets) can be “in-flight” before being acknowledged by the other side is called the “TCP window.” Originally, the TCP window could have a maximum size of 65,535 bytes. But the advent of modern networks made this limit impractical.
Nowadays, data is transmitted at much higher speeds than when TCP was invented. Data circulating within a data center can easily reach 100 Gb/s, making the small TCP window a bottleneck for efficient network utilization.
This is why TCP window scaling was added as an official TCP extension in RFC 1323 back in 1992. This extension allows for much larger TCP windows, up to approximately 1 GB in size. In this article, we will explore how TCP window scaling works, along with its advantages and disadvantages.
First, let’s dive a bit deeper into TCP windows and how they work in practice.
There are actually two TCP windows, one for each side of the stream. One side of the stream maintains a “send counter” and an “acknowledgment counter.” The send counter counts how many bytes have been sent, while the acknowledgment counter informs the other side of how many bytes have been received. The other side of the TCP connection maintains the same counters. Both counters are 32-bit fields embedded in the TCP packet headers.
The following diagram shows the left peer sending 90 bytes in three packets and the right peer acknowledging the reception of those 90 bytes by setting its acknowledgment counter to the sequence number sent by the left peer:
Fig 1. How TCP windows work
In the above example, the window was 90 bytes. In the original TCP specifications, the maximum number of bytes that could be sent without receiving an acknowledgment was 65,535.
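The counter arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the starting sequence number of 1, the three 30-byte payloads, and the function names are assumptions, and the sketch assumes every packet arrives in order without loss.

```python
SEQ_MODULUS = 2 ** 32  # sequence and acknowledgment numbers are 32-bit fields


def send_segments(initial_seq, payload_sizes):
    """Return (sequence number, length) for each segment sent back to back."""
    segments = []
    seq = initial_seq
    for size in payload_sizes:
        segments.append((seq, size))
        seq = (seq + size) % SEQ_MODULUS  # counters wrap around at 2^32
    return segments


def cumulative_ack(segments):
    """Acknowledgment covering every segment: the next byte the receiver expects."""
    last_seq, last_size = segments[-1]
    return (last_seq + last_size) % SEQ_MODULUS


# Hypothetical values: the left peer starts at sequence number 1 and sends
# three 30-byte packets (90 bytes in total) before any acknowledgment arrives.
segments = send_segments(initial_seq=1, payload_sizes=[30, 30, 30])
ack = cumulative_ack(segments)
print(segments)  # [(1, 30), (31, 30), (61, 30)]
print(ack)       # 91
```

A single acknowledgment number of 91 confirms all 90 bytes at once, which is why the sender does not need one acknowledgment per packet.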
In the case of a fast and reliable network, the original TCP window limit is a huge bottleneck. On a fast network of 100 Gb/s, the original maximum window size of 65,535 bytes amounts to only about 5 microseconds' worth of traffic. This means the other side would have to acknowledge the received data at least every 5 microseconds to fully utilize the available network bandwidth.
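To see where a figure of this order comes from, the sketch below computes how long it takes to put one full window on the wire. It assumes the link speed is expressed in bits per second (100 gigabits per second here) and ignores header overhead; the function name is illustrative.

```python
def window_drain_time_us(window_bytes, link_bits_per_second):
    """Time, in microseconds, needed to transmit one full window of data."""
    window_bits = window_bytes * 8
    return window_bits / link_bits_per_second * 1_000_000


# Assumption: a 100 Gb/s link and the original 65,535-byte maximum window.
t = window_drain_time_us(65_535, 100e9)
print(f"{t:.2f} microseconds")  # 5.24 microseconds
```

Once that interval has elapsed with no acknowledgment received, the sender has to stop and wait, leaving the link idle.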
Unfortunately, this is unlikely to happen given the short period of time and because the operating system will be extremely busy processing incoming data and most likely won’t send acknowledgments in a timely manner.
Transmissions are usually uneven. In most cases, one side of the TCP connection acts as a client that sends a query, and the other side responds with the data. So in practice, in a TCP connection, the vast majority of the data flows unidirectionally, and the receiving side will have to send empty packets just to acknowledge the received data, which is inefficient.
For high-volume intra-datacenter traffic, there are Ethernet jumbo frames. They can carry about 9,000 bytes of data, while regular Ethernet frames carry about 1,500 bytes. With jumbo frames, the TCP window limit can be reached after only 7 packets. Again, the constant need for acknowledgment from the receiving side will slow down the network traffic.
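The packet count can be checked with a short calculation. The sketch below uses the approximate payload sizes quoted above and ignores IP and TCP header overhead; the function name is an illustrative choice.

```python
MAX_UNSCALED_WINDOW = 65_535  # original maximum TCP window, in bytes


def full_frames_per_window(window_bytes, payload_bytes_per_frame):
    """Number of completely filled frames that fit in one window."""
    return window_bytes // payload_bytes_per_frame


jumbo_frames = full_frames_per_window(MAX_UNSCALED_WINDOW, 9_000)
regular_frames = full_frames_per_window(MAX_UNSCALED_WINDOW, 1_500)
print(jumbo_frames)    # 7
print(regular_frames)  # 43
```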
TCP window scaling is a TCP extension defined in RFC 1323. The two parties negotiate it while the connection is being established, using the first two packets they exchange. These two packets have the “synchronize” TCP flag set (usually abbreviated as the SYN flag). The window scale is a multiplier applied to the original TCP window. The multiplier is a power of 2 and can range from 2^0 (i.e., no change in the TCP window) to 2^14.
If the initial SYN packet sent by the client includes the window scaling extension, even with a scale of 0, the other party is allowed to negotiate window scaling. Therefore, even if the client side doesn’t want to increase its own TCP window size, it is advised to include a window scaling extension with a zero scale so that the server side can negotiate its own TCP window scaling.
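The negotiation rule can be sketched as a small function. It is a simplified model and not an actual TCP stack: `None` stands for a SYN packet that carries no window scaling extension, and the function and constant names are illustrative. The behavior it models (scaling applies only when both SYN packets carry the extension, and a scale above 14 is treated as 14) follows RFC 7323, the successor to RFC 1323.

```python
from typing import Optional, Tuple

MAX_SHIFT = 14  # largest allowed exponent: a multiplier of 2^14


def negotiate_window_scaling(
    client_shift: Optional[int], server_shift: Optional[int]
) -> Tuple[int, int]:
    """Return the exponents applied to the (client's, server's) advertised windows.

    None means that side's SYN packet did not include the extension.
    """
    if client_shift is None or server_shift is None:
        # Scaling is enabled only if both SYN packets carry the extension.
        return (0, 0)
    return (min(client_shift, MAX_SHIFT), min(server_shift, MAX_SHIFT))


# Client sends a zero scale: its own window is unchanged, but the server
# may still scale the window it advertises (here by 2^7 = 128).
print(negotiate_window_scaling(0, 7))     # (0, 7)

# Client omits the extension: the server's scale is ignored as well.
print(negotiate_window_scaling(None, 7))  # (0, 0)
```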
The maximum TCP window size that can be achieved using TCP window scaling is for a scaling factor of 2^14=16,384, which would give a window size of 16,384 x 65,535 = 1,073,725,440 bytes, or approximately 1 GB. It should be noted that the TCP counters are 32 bits in size, so they can still be used without change because they can hold a maximum value of 4 GB.
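The arithmetic above is easy to reproduce. The following sketch computes the largest window for a given scale exponent and confirms that it still fits in a 32-bit counter; the names are illustrative.

```python
MAX_WINDOW_FIELD = 65_535  # largest value of the original 16-bit window field
MAX_SHIFT = 14             # largest allowed exponent


def max_window(shift):
    """Largest window, in bytes, for a multiplier of 2^shift."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("scale exponent must be between 0 and 14")
    return MAX_WINDOW_FIELD << shift  # same as 65,535 * 2**shift


for shift in (0, 7, 14):
    print(shift, max_window(shift))
# 0 65535
# 7 8388480
# 14 1073725440

# The largest scaled window still fits well within a 32-bit counter.
print(max_window(MAX_SHIFT) < 2 ** 32)  # True
```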
If you are running Linux, you can check whether the TCP window scaling extension is enabled by running the following command:
$ cat /proc/sys/net/ipv4/tcp_window_scaling
This will show “1” if it is enabled, or “0” if not.
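If you prefer to run this check from a script, for example as part of a monitoring job, a minimal Python sketch could look like the following. The function name is illustrative, and returning `None` when the file cannot be read (for instance on a non-Linux system) is a design choice made for this example.

```python
from pathlib import Path

PROC_PATH = "/proc/sys/net/ipv4/tcp_window_scaling"


def window_scaling_enabled(path=PROC_PATH):
    """Return True or False based on the setting, or None if it cannot be read."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        return None  # file missing or unreadable
    return value == "1"


state = window_scaling_enabled()
if state is None:
    print("TCP window scaling setting not available")
else:
    print("TCP window scaling enabled:", state)
```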
To enable TCP window scaling temporarily, run the following command:
$ sysctl net.ipv4.tcp_window_scaling=1
Here is how to enable it permanently:
$ echo net.ipv4.tcp_window_scaling=1 \
    > /etc/sysctl.d/50-tcp-window-scaling.conf \
    && sysctl --system
The benefits of TCP window scaling seem clear and you might be tempted to enable it everywhere. However, there are circumstances where TCP window scaling will actually cause more problems than it solves, especially when the network is unreliable or slow.
In this case, it’s better to have a smaller window; otherwise, there will be too many retransmissions, and the performance of the TCP connection can drop dramatically, to the point where the vast majority of the traffic consists of retransmissions. On a network that is already slow or unreliable, this extra traffic only makes matters worse.
As long as the network you are using is reliable and the number of dropped or delayed packets is low, enabling TCP window scaling is a good idea and will significantly improve transmission rates. You should, however, use a small window on unreliable or congested networks to avoid TCP triggering too many retransmissions and congesting the network further.