For years I've been a big fan of Mark Russinovich's "The case of" blog posts. So I decided to do a similar post this time. A couple of months ago my team was troubleshooting an issue related to slow responses from a 3rd party API. This particular API is located in the US and our code runs in Australia. Typically we saw response times of a few hundreds of milliseconds (which includes time to establish connection, round-trip to a different continent and back plus processing time). Everything worked well until suddenly one day our monitoring systems picked up a significant increase in request processing time. It looked like this:
message time | elapsed time (ms) |
12/18/2015 10:15:31.938 +1100 | 10085 |
12/18/2015 10:15:24.107 +1100 | 10114 |
12/18/2015 10:15:17.490 +1100 | 9924 |
12/18/2015 10:15:11.704 +1100 | 9991 |
12/18/2015 10:15:05.796 +1100 | 9953 |
12/18/2015 10:14:50.723 +1100 | 9964 |
12/18/2015 10:14:49.815 +1100 | 9911 |
12/18/2015 10:14:40.021 +1100 | 10140 |
12/18/2015 10:14:29.147 +1100 | 10151 |
12/18/2015 10:14:28.646 +1100 | 9937 |
Everything still worked fine but instead of sub-second responses we saw requests taking 9-10 seconds to complete. Further investigation was required. We performed the usual troubleshooting steps but still could not figure out what was going on there. We had to go deeper and deeper in our analysis - eventually all the way to the network packet capture. In fact, it's the packet capture that gave us the first hint of what the problem was. We saw a lot of the TCP retransmissions.
Two things were clear for us now.
Firstly, we noticed that only the SYN packets had delivery problems and had to be retransmitted. SYN packet is the first packet of a 3 packet "handshake" used to establish a TCP/IP connection. We saw that once the connection was established there were no more retransmissions during the session/data transfer.
Secondly, we could see where all those extra seconds were coming from!
See how there is a 3 seconds difference between the initial SYN packet (packet 22921) and the retransmission in line 23003 which is 3 seconds after (103.53… and then 106.54…)
And then we retransmit again 6 seconds later (packet 23087).
After that the connection is finally established but we’ve just lost 3+6=9 seconds during the TCP handshake.
Another interesting observation was that when we retransmit for the second time (packet 23087) we remove the ECN and CWR flags.
Firstly, we noticed that only the SYN packets had delivery problems and had to be retransmitted. SYN packet is the first packet of a 3 packet "handshake" used to establish a TCP/IP connection. We saw that once the connection was established there were no more retransmissions during the session/data transfer.
Secondly, we could see where all those extra seconds were coming from!
See how there is a 3 seconds difference between the initial SYN packet (packet 22921) and the retransmission in line 23003 which is 3 seconds after (103.53… and then 106.54…)
And then we retransmit again 6 seconds later (packet 23087).
After that the connection is finally established but we’ve just lost 3+6=9 seconds during the TCP handshake.
Another interesting observation was that when we retransmit for the second time (packet 23087) we remove the ECN and CWR flags.
We performed several packet captures and it became clear that our SYN packets were not reaching the destination and we had to retransmit them (or their SYN/ACK packets were not reaching us)
SYN packet retransmission (at least on Windows) by default works like this:
“The retransmission timer is initialized to three seconds when a TCP connection is established. However, it is adjusted on the fly to match the characteristics of the connection by using Smoothed Round Trip Time (SRTT) calculations as described in RFC793. The timer for a given segment is doubled after each retransmission of that segment. By using this algorithm, TCP tunes itself to the normal delay of a connection”
This is where we get 3 seconds (initial retransmission delay) plus 6 seconds (3 seconds doubled for the second retransmission).
Also given that “Max SYN Retransmissions” is set to 2, the system will only retransmit the SYN packet twice hence the ~9 seconds delay we see in the worst cases. The initial retransmission timer value is set in the "Initial RTO" parameter (see the screenshot below). To test this theory we decided to change this value from 3 seconds to 2 seconds:
This can be achieved by running this command:
netsh int tcp set global initialRto=2000
Once this change went live, straight away we saw request processing time decreasing from 9-10 seconds down to ~6 seconds. We knew we were on the right track.
Another suspicious finding (as mentioned above) was that most of the SYN packets with the ECN and CWR flags were dropped while SYN packets without these flags were going through.
ECN (Explicit Congestion Notification) is an interesting protocol extension defined in the RFC 3168. In the TCP/IP world the standard way for the receiver to "notify" sender of network congestion is to drop packets. This behaviour obviously can have a significant impact on the overall network performance. ECN (when supported and negotiated by both ends) allows signalling/notification of network congestion to happen without dropping packets.
Windows had ECN for TCP support since Windows Server 2008 and Vista (but it was disabled by default). But it is enabled in Windows 2012. (Linux passively supports ECN - will negotiate if asked by the other end)
ECN support has improved significantly since the introduction 15 years ago but apparently some issues still exist.
The next step for us was to try to disable ECN to see if this was the culprit.
ECN capability can be turned off by executing this command:
ECN capability can be turned off by executing this command:
netsh int tcp set global ecncapability=disabled
Once this change was applied, all TCP Retransmissions disappeared and request processing time was back to a few hundred milliseconds.
We contacted the API vendor and they reassured us that their end had proper ECN support. The fact that not all of the SYN packets with ECN flag had this issue (but most of them), led us to believe that we saw a "Path-dependent connectivity dependency" as described on slide 6. This is also indirectly supported by the fact that some of the BGP routes changed roughly around the same time when we started experiencing this issue.
We were happy to see this issue resolved. Hope this blog post will help someone in a similar situation.
Keywords: Max SYN Retransmissions, maxsynretransmissions, slow connection, ECN, ecncapability, TCP retransmission
No comments:
Post a Comment