Sunday, December 18, 2016

My first months in America - part 2

First of all I would like to say thank you to all my readers - I received very positive feedback after I published the first part of my initial impressions after the first 2 months of living in the US. Here is part 2, where I continue covering various topics related to what I find interesting, strange, and amusing here in the US based on my international experiences.

I briefly covered the climate in the opening remarks of part 1. But what surprised me most here is the notion of a micro-climate. I haven't experienced anything remotely similar before. San Francisco is located next to the ocean, and the climate there is strongly affected by it. The further inland you move, the more continental the climate becomes (hotter in summer and colder in winter). The Bay Area is also surrounded by hills, which act like walls, partly isolating the area from the large atmospheric flows. This is all normal (the ocean is a huge heat accumulator after all) but what makes it fascinating is the extremely short distances (in my view) over which we can observe large temperature variations. Several times, when driving 15-20km towards San Francisco, I have seen the temperature drop by 15 degrees. I am wearing a t-shirt in the car and wondering why people in SF are wearing jackets. This is truly a MICRO-climate.
Another impressive weather feature over here is fog. Fog is very common around SF and quite often it spreads deeper into the Bay Area crawling at the bottom of the valleys between the hills surrounding this area.

By Brocken Inaglory - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=7908286

Back to the cars ;) Most of the cars are equipped with radios capable of receiving SiriusXM. These are radio stations delivered via satellite. There are lots of stations to choose from and the sound quality is good, but it requires a paid subscription. Without it you only get a demo channel.

When you start a car here all (most?) of them display an annoying warning/popup (before you can start using the navigation screen). The signs of a country run by the lawyers - no one wants a liability.

A navigation system warning

Paper money. After 12 years in Australia I was so used to plastic banknotes that returning to paper-based banknotes felt strange. From the longevity and usability perspective I am a big fan of plastic banknotes.

You can have hundreds of TV channels here in the US. There are different packages - some of them containing 500+ TV channels. But it is hard to find anything interesting to watch - especially if you have one of the basic packages only. You get a bunch of news channels, a few sports channels (ESPN, TNT - where one can occasionally catch a good game of basketball), lots of religious channels... There is even a special channel for dogs! Apparently "Scientifically developed. Pup approved". You can watch some local games (Raiders, Warriors) but if you follow a different team - tough luck! They will tease you in the TV guide but then you will quickly discover that these games are "blacked out" in your region. If you want to watch proper sports you will need to buy a special package to watch NBA or NHL games. Same applies to movies - you will need CINE or HBO to watch anything decent. And if you combine a few packages together it can become quite expensive.
With regard to the news channels - Fox News has a strong bias towards the Republican party (the right-wing, conservative party), while a whole bunch of other channels (ABC, CNN, NBC) tend to support the Democrats/liberals more. It was amazing to watch the election campaign and switch between the channels to see the same issue being presented from totally different angles.

Guns. Where do I even start?!
The biggest disconnect is how normal this looks to the majority of people born and raised in the US versus people who came from other parts of the world. To better explain this phenomenon I should start with the 2nd amendment of the United States Constitution:

A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.
And this is the foundation that gives individuals the "right to bear arms" (i.e. to legally own guns). Gun ownership is very common - spearheaded by the powerful NRA (National Rifle Association) lobby. You can even judge this by this photo I took in the bookshop - SO many gun related magazines!

Various gun magazines
Many people I've spoken to do have guns at home. Some of them even attend shooting ranges on a regular basis. There are a lot of nuances though. First of all - gun laws vary by state - e.g. California laws are more restrictive compared to some other (mostly Southern) states (Utah, Arizona, Texas, Louisiana etc). There are other limiting factors like age restrictions, the size of a magazine (i.e. assault weapons containing 20 or more rounds of ammunition), whether it was designed to be equipped with a silencer, etc

There are 2 types of licences - Concealed Carry (where a gun should not be visible to the public - in some cases even partially including a contour of the weapon in the back pocket) and Open Carry.
I've heard an interesting consideration - if you carry a gun visible to anyone in (say) a shop and robbers enter/target this shop - who are they going to shoot first? Correct - a person carrying a gun.

Shootings in public places (especially schools) are terrifying. Teenagers (sometimes as a response to bullying) shoot their classmates. Some people would say "how come that guns are so accessible to kids" while others would complain "if only a teacher had a gun to take down the shooter faster to reduce the number of victims"

In essence, the logic goes like this - if the bad guys can have guns, what can we do about it to protect ourselves? In many other countries the answer is naturally to limit the accessibility of various weapons to make it harder for the criminals to gain access to those weapons, to have better policing and to prevent weapons from being stolen from the policemen, etc.

In the US it is the opposite - in order to defend yourself against people with weapons you need to have a weapon too.

I'd like to quote an old Colt Manufacturing ad slogan:

"God may have made men, but Samuel Colt made them equal"

This is a very divisive and contentious issue - on one side pro-gun groups like the NRA and Gun Owners of America push for more lenient gun ownership laws while groups like The Brady Campaign and The Coalition to Stop Gun Violence advocate for much stricter rules. My observation is that Republicans are generally more pro-gun while Democrats are usually more inclined to favor more gun controls.

I remember watching the local (!) news in Miami last year for a week and (I am not exaggerating) - every single day they were reporting shootings. This sounds very normal for the Americans but this is WAY too crazy for most of the foreigners including myself.

But this is nothing compared to what happens in Chicago. The Chicago Tribune does an excellent job tracking Chicago shootings. They keep an up-to-date map and stats:

Courtesy of Chicago Tribune - shooting stats

Multiple shootings each day! And this is only one city. Unbelievable.

Ok, let's talk about something different now.

Sales tax is not included in the displayed price. Coming from Australia I find this crazy and unfair to consumers. I really like the fact that in Australia you pay the amount displayed. I also like that in Australia price tags (in addition to the full amount) display the price per unit (e.g. the price per liter, per 100g etc) - this makes it easy to compare packages of different size/volume.

And in addition to not displaying tax there are also tips. I have to say that I disagree with the compulsory nature of tips here in the US (although I don't mind giving a tip for really good service). Lots of services (restaurants, cafes, taxis etc) expect you to tip. Printed checks contain a special field for tips that you are supposed to fill in manually when you sign the check. Some more advanced electronic facilities have special tipping buttons pre-configured in the interface. Tips usually start at 10% (minimum) and go up to 15-20%. A quick calculation rule - strip the last digit (which gives you roughly 10%) and then double the amount. I understand that in some cases tips are included/expected as part of the salary but I like how other countries deal with it (somehow avoiding this inconvenience and ambiguity of tips).
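The quick calculation rule can be written down as a tiny Python sketch. The function name is my own, and I am assuming the rule is applied to the whole-dollar part of the bill; it is only a rough mental shortcut that lands near the 20% end of the usual range.

```python
def quick_tip(bill_dollars: float) -> int:
    """Rough tip estimate: drop the last digit of the whole-dollar
    amount (about 10%) and then double it (about 20%)."""
    whole_dollars = int(bill_dollars)      # ignore the cents
    ten_percent = whole_dollars // 10      # "strip the last digit"
    return ten_percent * 2                 # double it


print(quick_tip(47.80))
```

For a $47.80 bill this gives a tip of $8, a little under an exact 20%.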

Square. It is a simple magstripe card reader. Square credit card payment processing system is used by many small businesses here in the US. Any business owner can go to the Square web site and request a free reader. It can be inserted into a standard 3.5mm headset jack (both iOS and Android phones are supported). It is simply brilliant. Cards can be processed anywhere (there is even an offline mode). Customer can sign with their finger on the phone screen.

All rights belong to Square - screen grab from https://squareup.com/reader

Square charges 2.75% per swipe. Absolutely brilliant in its simplicity.

Crazy lane splitting. I've seen bad lane splitting in Australia too. Lane filtering is allowed in Australia (when cars are not moving at all or move slowly with the speed below 30km/h). Lane splitting (i.e. speeds higher than 30km/h) is illegal in Australia. It is also illegal in all US states except California. What I've seen here in California is plain crazy. You can drive on a freeway at 65mph (~105km/h) and be passed as if you were stationary by a motorcycle, driving between the lanes navigating around cars at crazy speeds. Surely they feel a lot of air turbulence and changing conditions passing cars at that speed. I understand the appeal to beat traffic but this is a dangerous practice in my view.

Christmas and New Year time is fast approaching. And it feels right in the Northern hemisphere. It is winter, it is cold. And I really like the way Americans decorate the streets, shopping malls, their houses and even their cars. It does feel festive. And from what I've heard many buy real Christmas trees. This brings back my childhood memories - the smell of a real fir tree...

I would like to wish Merry Christmas and a Happy New Year to all my readers.




Sunday, October 16, 2016

My first 2 months in America - part 1

It's been 2 months since I left Melbourne, Australia and moved to California, USA. 2 months is not enough to fully understand and get a feeling of a new country but it is certainly enough to make some observations. I wanted to capture these first impressions while they are still fresh.
California (Bay Area) is surprisingly very similar to Melbourne both from the climate and landscape perspectives - very similar temperatures, drought-like conditions during the summer months, ocean proximity, hills surrounding the area.
But there are certainly some differences too. For some reason a large part of these differences revolves around cars and driving for me. So let's begin:

After 12 years of driving on the left side of the road in Australia, I am back to driving on the right side. It is fairly easy to adjust, just need to keep thinking when making turns for the first couple of weeks. What made it easier for me is that I was driving a European car in Australia with controls (wipers, indicators) already in the same arrangement, so I avoided the usually inevitable wipers instead of indicators when making a turn.

Petrol stations!
First of all, it is not "petrol" anymore. It's "gas". The engineer in me cringes but I do realise that it is a contraction of "gasoline".

Octane numbers are different. The reason for this is that Australia uses RON (Research Octane Number) while the US uses (R+M)/2, an average of RON and MON (Motor Octane Number). In Australia we had 91, 95, 98 (plus you could get a 100 racing grade up until recently). Here in the US it's 87, 89, 91.

The actual process of how you buy petrol is different too. In Australia you put the nozzle in and start pumping petrol straight away. You can either fill a full tank or use one of the presets. Then you go inside and pay to a person. In US you can pay to a person too but the usual way is to swipe your card right at the pump. And the biggest surprise - a requirement to enter your postcode (at the pump!) when using a credit card (even when using a debit card in a credit card mode). Apparently this is to reduce the amounts of fraud and stop people from using stolen credit cards but I found this unusual. For a debit card you will just be asked for a pin number.

Driving style had its own share of surprises. It is very common here for drivers to change lanes and turn without signalling. They just turn. Speeding is another common issue. As an example - the speed limit on the freeways is 65mph but hardly anyone drives at that speed. I estimate that on average people will exceed this limit by 10-15mph (driving at ~75mph or ~120km/h).

The Stop signs are there but the vast majority of people don't stop. They just slow down and roll through. This is called a "California stop".

Another mildly confusing finding was how the right of way is implemented in California when crossing an intersection. The rules are similar with one exception: "yield to the vehicle or bicycle that arrives first". This is so arbitrary and quite confusing to me. I am used to certain road rules. And here you need to pay attention to who arrives to the intersection first (even by a second or two) because some of the car moves may surprise you. People may turn in front of you even if you are driving straight.

I liked the idea of slip lanes in Australia that allow a car to turn without entering the intersection. Slip lanes do exist here in US but they are not very common. Instead, there is a rule that allows cars to turn right on the red signal. So far I have to overcome a sort of psychological barrier every time I execute a turn like this.

Carpool lanes - or officially high-occupancy vehicle (HOV) lanes - are very similar to transit lanes (T2, T3) in Australia. You can use these lanes during certain (peak) hours if you have 2 or more people in your car. And unlike Australia, there are special overhead cameras that monitor all passing cars. Another difference is that you can also pay a fee to use these lanes if you travel alone and need a faster commute (similar to the paid roads with the eTag). The local version of the eTag is called FasTrak.

When I first saw Australian roads I was pleasantly surprised by the quality of the road surface and the surrounding infrastructure. I can't say the same thing about the roads in California. The road surface on the major freeways is uneven, with patches and cracks. And there is rubbish on the sides of these roads.

Many people use an app called Waze to navigate around. It's based on a crowd sourced model with the actual drivers providing updates about police sightings, objects on the road and various other hazards.

The fuel economy is measured in miles per gallon (MPG), which is an inverse approach to the usual litres per 100km.
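To make the "inverse" relationship concrete, here is a small conversion sketch in Python. The function name is my own; the constants are the standard definitions of the US gallon and the mile, so this assumes US (not imperial) gallons.

```python
LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344


def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert US miles per gallon to litres per 100 km."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    km_per_litre = mpg * KM_PER_MILE / LITRES_PER_US_GALLON
    return 100.0 / km_per_litre


print(round(mpg_to_l_per_100km(30), 2))
```

Because one value is the reciprocal of the other (scaled by about 235.2), a higher MPG figure means a lower litres per 100km figure.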

And speaking about "strange" measurement units - a car's power is measured in horsepower (HP), not kilowatts, but this was OK for me because the same unit is used in Russia. But "pound-foot" (lb·ft) used to measure torque instead of Newton meters is a complete mystery to me. I don't "feel" these values.
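For anyone else who doesn't "feel" these values, a short Python sketch of the conversions. The function names are my own, and I am assuming mechanical (imperial) horsepower here; metric horsepower is slightly smaller (about 735.5 W).

```python
KW_PER_MECHANICAL_HP = 0.745699872   # assumption: mechanical (imperial) hp
NM_PER_LB_FT = 1.3558179483          # newton-metres per pound-foot


def hp_to_kw(hp: float) -> float:
    """Convert mechanical horsepower to kilowatts."""
    return hp * KW_PER_MECHANICAL_HP


def lb_ft_to_nm(lb_ft: float) -> float:
    """Convert pound-feet of torque to newton-metres."""
    return lb_ft * NM_PER_LB_FT


print(round(hp_to_kw(300), 1), round(lb_ft_to_nm(300), 1))
```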

I guess that's enough talking about cars.

Let's talk about finances. Credit rating(s) is such a visible and important part of one's financial life. These ratings do exist in Australia too and are used to assess the borrowing power (creditworthiness), various risks etc but it's all kind of hidden. But not in the US. There are credit bureau agencies that keep track of your credit history, there are apps that can display your credit score, and every time you do anything remotely related to money/finances you can be sure your credit score will be examined at that point. This leads to situations where people actively work on improving their score. This also created a bizarre (in my view) type of credit card called a "secured credit card". In Australia you can usually have a bank (debit) card (where you use your own money) or a credit card (where you use the bank's money for a period of time for free - 44 or 55 days - and then you need to repay the money or you are going to incur interest on the amount owed to the bank). Both types exist in the US too but you cannot get a proper credit card if your credit score is low. A secured credit card can be used in these situations as a way to repair/improve your credit score. It uses your own money under the hood but acts as a credit card. E.g. you can put down a $500 deposit and the bank will issue you a secured credit card with the same amount allocated to your "credit line". So how is it different to a standard debit card (which can also act as a VISA card)? Apparently the difference is that when you use your debit card it only affects you and your bank. It helps build your relationship with the bank but this is where it stops. With the secured credit card your activity (late or on-time repayments, the amount owed etc) is fed/reported to the credit bureaus, which directly affects your credit score.
So the theory is that if you need to improve your credit score then it's a low risk for the bank to issue you such a card (after all it's your own money) but all sensitive operations are tracked, which allows you to demonstrate that you are a sensible type when it comes to managing finances and ultimately improves your creditworthiness.

Well, this is it so far. I will continue writing about my US experiences as I tackle and learn new things.
Stay tuned!

Sunday, March 13, 2016

What is wrong with this code?

Every now and then I run training sessions for the dev teams, where we go through small code samples and I ask the audience a question - "what's wrong with this piece of code?". Sometimes it could be a trivial security blunder that leads to a SQL injection or XSS. In other cases the answer could be less obvious (e.g. a security weakness). And sometimes the question should really be "What CAN go wrong with this code" depending on certain implementation details. The purpose of these training sessions is to raise security awareness by demonstrating real security issues that I came across during my career (originating from multiple sources - code reviews, bug bounties, social media etc).

I would like to share these samples with you and hopefully together we can make the Internet a safer place. Please let me know if you decide to use any of these samples as part of your own training sessions - I'd be very keen to know how it goes and to receive any feedback.

Note for security professionals - these examples are very simple. They are not meant to be hard, they are just a starting point and in most cases I seek a nearly immediate response from the audience.

"Forewarned is forearmed!"

Question 1

"SELECT ItemID, CONVERT(varchar(20), SubmitDate ,6) As SubmitDate FROM tblItem
WHERE LoginID = " & CStr(getCookie("Myapp", "iUserID")) & "
ORDER BY SubmitDate desc"

Answer 1

Potential SQL injection via getCookie("Myapp", "iUserID"). Cookies = untrusted input. Avoid constructing dynamic SQL statements (string concatenation) - this coding style often leads to SQL injections.

Also it looks like iUserID is an Integer. I always recommend (where possible) to constrain input for length, range, format, and expected data type (i.e. where we know expected type upfront). By wrapping getCookie() in either CInt() or CLng() before feeding it into the SQL statement we can essentially eliminate the risk of SQL injection. 

A second issue to consider (especially if iUserID is an integer) is that it might be possible to supply another user ID and bypass security controls to gain access to someone else's data.
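Both recommendations (constrain the input type and avoid string concatenation) can be illustrated with a short sketch. The original sample is classic ASP; this version is in Python with sqlite3 purely for illustration, the function names are my own, and the table is the one from the question.

```python
import sqlite3


def parse_user_id(raw_cookie: str) -> int:
    """Constrain the untrusted cookie value to a short run of ASCII digits."""
    if not (raw_cookie.isascii() and raw_cookie.isdigit()) or len(raw_cookie) > 9:
        raise ValueError("iUserID must be a positive integer")
    return int(raw_cookie)


def items_for_user(conn: sqlite3.Connection, raw_cookie: str) -> list:
    """Run the query with a bound parameter instead of string concatenation."""
    user_id = parse_user_id(raw_cookie)
    cursor = conn.execute(
        "SELECT ItemID, SubmitDate FROM tblItem "
        "WHERE LoginID = ? ORDER BY SubmitDate DESC",
        (user_id,),
    )
    return cursor.fetchall()
```

Note that this only addresses the injection. The second issue (supplying someone else's ID) still needs a server-side check that the ID belongs to the authenticated user.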

Question 2

Code behind:
if(Request.QueryString.Get("ver") != null)
  ItemVersion.Text = Request.QueryString.Get("ver").ToString();
Page:
<asp:label id="ItemVersion" runat="server"></asp:label>

Answer 2

XSS (cross-site scripting) via the "ver" parameter.
Label.Text is unfortunately unsafe - it does not HTML-encode the assigned value, so the value is rendered as HTML markup. An attacker can supply a value/payload similar to this: ver=<marquee>xss</marquee> or ver=<script>alert(1)</script>
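The general fix is to encode untrusted values before they are written into the page. A minimal sketch of the idea in Python (not the ASP.NET API; the function name and the span markup are my own illustration):

```python
import html


def render_version_label(ver: str) -> str:
    """Encode the untrusted query string value before placing it in HTML."""
    return '<span id="ItemVersion">' + html.escape(ver, quote=True) + "</span>"


print(render_version_label("<script>alert(1)</script>"))
```

After encoding, the payload is displayed as text instead of being interpreted as markup.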


Question 3

From /myapp/logout.asp
' Get User ID. if found then log the user out of My App
if getCookie("Myapp", "iUserID")<>"" then
  [skipped]
  Set mySession = Server.CreateObject("Myapp.Session")
  mySession.SessionID = getCookie("Myapp", "SessionID")
  if mySession.Delete(connStrDB) then
   writeCookie "Myapp", "", "iUserId"
   writeCookie "Myapp", "", "sUserName"
   Response.Buffer
   SafeRedirect "/myapp/logout.asp?Success=True"
   Response.End
  [skipped]
  end if
end if

Answer 3

It is possible to cause a denial of service - to log out any session. getCookie("Myapp", "SessionID") is untrusted input and the code never checks that this SessionID actually belongs to this user (iUserID). If session IDs are easy to guess (e.g. integers) then it is trivial to iterate through and log out all active users.
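A sketch of the missing ownership check, in Python for illustration. The in-memory dictionary and both function names are hypothetical, and I am assuming the user ID comes from server-side authenticated state rather than from another cookie.

```python
import secrets


def new_session_id() -> str:
    """Generate a session ID that is impractical to guess or iterate."""
    return secrets.token_urlsafe(32)


def delete_session(sessions: dict, session_id: str, authenticated_user_id: int) -> bool:
    """Delete a session only if it belongs to the authenticated user.

    `sessions` maps session ID -> owning user ID.
    """
    owner = sessions.get(session_id)
    if owner is None or owner != authenticated_user_id:
        return False  # unknown session or someone else's session
    del sessions[session_id]
    return True
```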


Question 4

sUploadedFileName = Mid(fUpload.UserFilename, InstrRev(fUpload.UserFilename, "\") + 1)
If InStr(sUploadedFileName, ".gif") <= 0 And Instr(sUploadedFileName, ".jpg") <= 0 Then
            DisplayErrorPage "Only accept GIF or JPEG files, please try again"
            Response.End
End If

Answer 4

The idea is to allow uploading only *.gif and *.jpg files. But it is possible to upload any file as long as it has ".jpg" or ".gif" somewhere in the name. E.g. MyEvilFile.jpg.SomeOtherText.MyExt

Weak validation of this kind often leads to hackers being able to upload executable files (or shells)  - especially if uploaded "images" are accessible from the web (i.e. the Upload directory is under the web root) - e.g. http://mysite.com/UploadedImages/MyEvilFile.jpg.php
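A stricter check looks only at the final extension and compares it against a whitelist. This is a sketch in Python with names of my own choosing; an extension check alone is still not a complete defence (the file content should be validated too).

```python
import os

ALLOWED_EXTENSIONS = {".gif", ".jpg"}


def is_allowed_image_name(user_filename: str) -> bool:
    """Accept the upload only if the last extension is on the whitelist."""
    # Keep only the file name, whichever path separator the client used.
    base_name = user_filename.replace("\\", "/").rsplit("/", 1)[-1]
    extension = os.path.splitext(base_name)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```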


Question 5

sCustMediaDir = sCustHomeRoot & "\" & Replace(getCookie("Myapp", "sCustName"), " ", "_") & "\media\"

sNewFileName = sCustMediaDir & sUploadedFileName

fUpload.Form("dlgFile").SaveAs sNewFileName

Answer 5

The problem is in the way we construct the sCustMediaDir string. sCustName is untrusted input and it can contain anything (including paths like "..\..\mypath\"). As a minimum this bug allows the attacker to overwrite files that belong to other customers (by changing the sCustName cookie value to "..\customer2"). Also if file system permissions for the web application user are weak and allow writing outside of the sCustHomeRoot directory then it could be possible to create or overwrite other files on this drive (the "\media\" part will make it less useful though).
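One way to defend against this is to restrict the name to a safe character set and then confirm that the resolved path is still under the customer root. A Python sketch under those assumptions (the function name and the allowed character set are my own choices):

```python
import os
import re

SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def customer_media_dir(home_root: str, cust_name: str) -> str:
    """Build the customer's media directory, rejecting traversal attempts."""
    folder = cust_name.replace(" ", "_")
    if not SAFE_NAME.fullmatch(folder):
        raise ValueError("invalid customer name")
    root = os.path.realpath(home_root)
    candidate = os.path.realpath(os.path.join(root, folder, "media"))
    # Defence in depth: the resolved path must still sit under the root.
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError("path escapes the customer root")
    return candidate
```

Better still, the directory could be derived from the authenticated customer record on the server instead of from a cookie.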


Question 6

'if website address contains http:// strip it out
if inStr(strWebsiteAddress, "http://") > 0 then
                strWebsiteAddress = Right(strWebsiteAddress, len(strWebsiteAddress) - 7)
End if

Answer 6

This attempt to strip out "http://" can be bypassed by supplying "http://http://mysite.com"
It is also worth noting that "https://" is not stripped out and potentially can be used as a bypass too.
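A more robust version only strips a scheme that appears at the very start of the string and repeats until nothing is left to strip. A Python sketch (the function name is my own; the original is VBScript):

```python
def strip_scheme(address: str) -> str:
    """Remove any leading http:// or https:// prefixes, case-insensitively."""
    prefixes = ("http://", "https://")
    stripped = True
    while stripped:
        stripped = False
        lowered = address.lower()
        for prefix in prefixes:
            if lowered.startswith(prefix):
                address = address[len(prefix):]
                stripped = True
                break
    return address
```

Unlike the original, this does not blindly cut the first 7 characters when "http://" appears somewhere in the middle of the value.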


Question 7

http://mysite.com/embed.aspx?frameSrc=/mypath/campaign1.htm

inside embed.aspx
<iframe id="theFrame" name="theFrame" width="780" scrolling="auto" frameborder="no" border="0" scrolling="no" src="<% Response.Write(System.Web.HttpUtility.UrlEncode(Request.QueryString["frameSrc"])); %>" ></iframe>

Answer 7

An attempt is made to load local content (relative path) into an iframe. Unfortunately a developer here forgot that the value in frameSrc is untrusted and can be controlled by the attacker.

E.g. we can supply an external malicious page http://mysite.com/embed.aspx?frameSrc=//www.externalevilsite.com/evilpage.php which will be rendered/embedded into the web page. This approach can be leveraged in phishing campaigns etc

I would recommend to avoid referencing pages by name/URL directly and instead have a whitelist or a resource map, where each allowed page should have a corresponding ID associated with it.

E.g. if we have an internal map matching /mypath/campaign1.htm to "myresource123" then we can request it in a safe way as http://mysite.com/embed.aspx?resourceID=myresource123
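The resource map idea can be sketched in a few lines of Python. The dictionary and the function name are hypothetical; the single entry is the example from the text.

```python
from typing import Optional

# Server-side whitelist: resource ID -> local page that may be embedded.
RESOURCE_MAP = {
    "myresource123": "/mypath/campaign1.htm",
}


def frame_src_for(resource_id: str) -> Optional[str]:
    """Return the whitelisted local path, or None for an unknown ID."""
    return RESOURCE_MAP.get(resource_id)
```

Since the attacker can only choose among IDs that the server already knows, an external URL can never end up in the iframe.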


Question 8

isAdministrator = (getCookie("Myapp",  "iEditedBy") = "1")

Answer 8

A classic insecure cookie handling vulnerability. The presence of a cookie iEditedBy with a value of "1" means you are an admin! And apparently these types of issues are quite common.


Question 9

' check the incoming remote address

if 0 < Instr(Request.ServerVariables("REMOTE_ADDR"), "10.11.12") then
 ' all OK
 [skipped]
else
 ' not allowed
 errorMessage = ERROR_PREFIX & " IP address is not allowed"
 displayErrorMessage(errorMessage)
End if

Answer 9

An attempt is made to only allow IP addresses from the 10.11.12.0/24 range (from 10.11.12.0 to 10.11.12.255). Unfortunately this filter is a simple substring match, so it will also allow IP addresses that follow the pattern xxx.10.11.12, as well as addresses such as 210.11.12.5 or 10.11.120.7, which is most likely undesired.
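The safer approach is to parse the address and test network membership instead of matching substrings. A sketch using Python's standard ipaddress module (the function name is my own):

```python
import ipaddress

ALLOWED_NETWORK = ipaddress.ip_network("10.11.12.0/24")


def is_allowed(remote_addr: str) -> bool:
    """Return True only for addresses inside 10.11.12.0/24."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a valid IP address at all
    return address in ALLOWED_NETWORK
```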


Question 10

http://www.somesite.com/Search.asp?query=SELECT+cname%2C+sname%2C+description%2C+pid%2C+picture+FROM+tblCategory+c%2C+tblSubcategory+s%2C+tblItem+i+WHERE+i.cid%3D5+AND+i.sid%3D31+AND+i.cid%3Dc.id+AND+i.sid%3Ds.id+ORDER+BY+cname%2C+sname%2C+description%2C+pid

Answer 10

This is just for the giggles. But this is a real example (real web site) that I came across a few years ago. I wouldn't even call it a SQL injection. It is more than that. These guys allow anyone to execute any SQL statement on their web site. And surprisingly they are not alone. I see this approach time and time again in the old ASP and PHP based web sites.

In fact, you can run a Google search (aka Google dork) similar to this one to see what I mean: inurl:"query="+inurl:SELECT+inurl:FROM+inurl:WHERE+inurl:"order by"

Another web site had 2 separate parameters for the "where" clause and the "order by" part of the query but the "idea" remains the same:

www.somesite.com/mypath?where_clause=+item_status_id+in+(select+status_id+from+item_statuses+where+category_id=1)+and+item_type_id+in+(1,2)&order_by_clause=ORDER+BY+create_date+DESC


Question 11

       customerIDs = Convert.ToString(Request.QueryString[CustomerIDParam]);
       if (customerIDs == null || !Regex.IsMatch(customerIDs, "[0-9,]+"))

Answer 11

The weakness is that this RegEx checks that we have digits or commas but it doesn’t prevent an attacker from entering other characters (like an apostrophe as an example) as long as there IS at least one digit or comma.

Once accepted, these values are fed into a SQL query. The defence in depth principle dictates that we should try to defend our systems at each level. The application level is certainly capable of performing some input parameter validation. In order to fix this particular weakness we can make the regex tighter by anchoring it to the start and the end of the string:
       customerIDs = Convert.ToString(Request.QueryString[CustomerIDParam]);
       if (customerIDs == null || !Regex.IsMatch(customerIDs, "^[0-9,]+$"))
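The difference between the two patterns is easy to demonstrate. This sketch uses Python's re module, where re.search behaves like an unanchored IsMatch (a match anywhere in the string is enough) and re.fullmatch requires the whole string to match; the function names are my own.

```python
import re

ID_LIST = re.compile(r"[0-9,]+")


def loose_check(value: str) -> bool:
    """Unanchored: passes if ANY part of the value is digits or commas."""
    return ID_LIST.search(value) is not None


def strict_check(value: str) -> bool:
    """Anchored: passes only if the WHOLE value is digits and commas."""
    return ID_LIST.fullmatch(value) is not None


payload = "1' OR '1'='1"
print(loose_check(payload), strict_check(payload))
```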


Question 12

www.somesite.com/mypath?stdTextCol=&linkTextCol=&p1imageLoc=http://db1.img.somesite.com/i/123.jpg&p1thumbImageLoc=http://db2abcde01:83/i/456.jpg

Answer 12

I can see several potential issues with this request URL:

  1. We are allowing images to be loaded from a different domain (p1imageLoc and p1thumbImageLoc parameters). An attacker can supply their image to alter the look of the web site (and potentially use this in a phishing style attack)
  2. Look at the difference in the way the image server is specified in p1imageLoc and p1thumbImageLoc. In the first case this is just a normal domain name but in the second case we see an internal server name followed by a non-standard port (!!!). What happens next really depends on the implementation.
As a minimum we leak information that an attacker might find useful (as part of their reconnaissance effort). They now know that there is an internal server called db2abcde01 that runs a web service on port 83.

But the situation can actually be worse. This URL can potentially give an attacker a leg into the internal network (again - depending on how much information is actually returned back to the attacker - e.g. as part of the detailed error messages).

E.g. an attacker may try to perform a port scan by iterating through the port numbers (83, 84, 85 etc)

Or try a different URI scheme (ftp://, file://, telnet:// or even svn:// ;) )

Or execute an admin request (db2abcde01:83/admin/SensitiveOperation) - either unauthenticated by themselves or embedding this URL somewhere waiting for a logged in person with admin privileges to inadvertently execute this request.

Or an attacker may try to find other servers on the internal network. What if they try db2abcde02? Or SuperSecretServer01?  
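If image URLs really have to be passed as parameters, the server can at least validate them against an allowlist of schemes, hosts and ports. A Python sketch, assuming (my assumption) that db1.img.somesite.com is the only legitimate image host:

```python
from urllib.parse import urlsplit

ALLOWED_IMAGE_HOSTS = {"db1.img.somesite.com"}  # assumed allowlist
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_PORTS = {None, 80, 443}                 # None = port not specified


def is_safe_image_url(url: str) -> bool:
    """Accept only http(s) URLs on whitelisted hosts and standard ports."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        return False
    return (
        parts.scheme in ALLOWED_SCHEMES
        and parts.hostname in ALLOWED_IMAGE_HOSTS
        and port in ALLOWED_PORTS
    )
```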


Final words

12 questions should be enough for the first blog post of the series. I've got a lot more examples and I am sure this is not the last post of this kind. I might even try something different next time. I can post just questions (avoiding the most trivial ones) and let the audience try their "hacker" thinking and then publish the answers a week later. What do you think?

Thursday, March 3, 2016

The case of slow API connections and TCP retransmission

For years I've been a big fan of Mark Russinovich's "The case of" blog posts. So I decided to do a similar post this time. A couple of months ago my team was troubleshooting an issue related to slow responses from a 3rd party API. This particular API is located in the US and our code runs in Australia. Typically we saw response times of a few hundred milliseconds (which includes the time to establish a connection, the round-trip to a different continent and back, plus processing time). Everything worked well until suddenly one day our monitoring systems picked up a significant increase in request processing time. It looked like this:

message time                     elapsed time (ms)
12/18/2015 10:15:31.938 +1100    10085
12/18/2015 10:15:24.107 +1100    10114
12/18/2015 10:15:17.490 +1100    9924
12/18/2015 10:15:11.704 +1100    9991
12/18/2015 10:15:05.796 +1100    9953
12/18/2015 10:14:50.723 +1100    9964
12/18/2015 10:14:49.815 +1100    9911
12/18/2015 10:14:40.021 +1100    10140
12/18/2015 10:14:29.147 +1100    10151
12/18/2015 10:14:28.646 +1100    9937

Everything still worked fine but instead of sub-second responses we saw requests taking 9-10 seconds to complete. Further investigation was required. We performed the usual troubleshooting steps but still could not figure out what was going on there. We had to go deeper and deeper in our analysis - eventually all the way to the network packet capture. In fact, it's the packet capture that gave us the first hint of what the problem was. We saw a lot of the TCP retransmissions.



Two things were clear for us now.
Firstly, we noticed that only the SYN packets had delivery problems and had to be retransmitted. SYN packet is the first packet of a 3 packet "handshake" used to establish a TCP/IP connection. We saw that once the connection was established there were no more retransmissions during the session/data transfer.

Secondly, we could see where all those extra seconds were coming from!
Notice the 3 second gap between the initial SYN packet (packet 22921) and its first retransmission (packet 23003): the timestamps are 103.53… and then 106.54…

And then we retransmit again 6 seconds later (packet 23087).
After that the connection is finally established but we’ve just lost 3+6=9 seconds during the TCP handshake.

Another interesting observation was that when we retransmit for the second time (packet 23087) we remove the ECN and CWR flags.

We performed several packet captures and it became clear that our SYN packets were not reaching  the destination and we had to retransmit them (or their SYN/ACK packets were not reaching us)

SYN packet retransmission (at least on Windows) by default works like this:

“The retransmission timer is initialized to three seconds when a TCP connection is established. However, it is adjusted on the fly to match the characteristics of the connection by using Smoothed Round Trip Time (SRTT) calculations as described in RFC793. The timer for a given segment is doubled after each retransmission of that segment. By using this algorithm, TCP tunes itself to the normal delay of a connection”

This is where we get 3 seconds (initial retransmission delay) plus 6 seconds (3 seconds doubled for the second retransmission).

Also given that “Max SYN Retransmissions” is set to 2, the system will only retransmit the SYN packet twice hence the ~9 seconds delay we see in the worst cases. The initial retransmission timer value is set in the "Initial RTO" parameter (see the screenshot below). To test this theory we decided to change this value from 3 seconds to 2 seconds:



This can be achieved by running this command:
netsh int tcp set global initialRto=2000

Once this change went live, straight away we saw request processing time decreasing from 9-10 seconds down to ~6 seconds. We knew we were on the right track.
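The arithmetic behind these numbers can be sketched in a few lines of Python. This is only a rough model of the behaviour quoted above (the timer doubles after each retransmission), not how the Windows TCP stack is actually implemented; the function name `syn_handshake_delay` is made up for illustration.

```python
def syn_handshake_delay(initial_rto_s, max_syn_retransmissions):
    """Seconds spent waiting until the last SYN retransmission is sent.

    Assumes the retransmission timer starts at initial_rto_s and doubles
    after every retransmission, as described in the quote above.
    """
    total = 0.0
    rto = initial_rto_s
    for _ in range(max_syn_retransmissions):
        total += rto   # wait for the current timer to expire
        rto *= 2       # timer doubles for the next attempt
    return total


# Default settings: Initial RTO = 3 s, Max SYN Retransmissions = 2
print(syn_handshake_delay(3.0, 2))  # 3 + 6 = 9.0

# After changing Initial RTO to 2 s
print(syn_handshake_delay(2.0, 2))  # 2 + 4 = 6.0
```

The two results line up with the ~9 seconds we lost originally and the ~6 seconds we saw after the change.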

Another suspicious finding (as mentioned above) was that most of the SYN packets with the ECN and CWR flags were dropped while SYN packets without these flags were going through.

ECN (Explicit Congestion Notification) is an interesting protocol extension defined in RFC 3168. In the TCP/IP world the standard way for the network to signal congestion to the sender is to drop packets. This behaviour obviously can have a significant impact on the overall network performance. ECN (when supported and negotiated by both ends) allows congestion to be signalled without dropping packets.
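To make the flags a bit more concrete, here is a small Python sketch that decodes the flags byte of a TCP header. The bit values come from RFC 793 and RFC 3168. I am assuming that the flag shown as "ECN" in our capture is the ECE (ECN-Echo) bit; the helper names are made up for this example.

```python
# Bit positions in the TCP header flags byte (RFC 793, RFC 3168).
TCP_FLAGS = {
    0x01: "FIN", 0x02: "SYN", 0x04: "RST", 0x08: "PSH",
    0x10: "ACK", 0x20: "URG", 0x40: "ECE", 0x80: "CWR",
}


def decode_flags(flags):
    """Return the names of all flags set in the given flags byte."""
    return [name for bit, name in sorted(TCP_FLAGS.items()) if flags & bit]


def is_ecn_setup_syn(flags):
    """An ECN-setup SYN has SYN, ECE and CWR set and ACK clear (RFC 3168)."""
    wanted = 0x02 | 0x40 | 0x80
    return (flags & (wanted | 0x10)) == wanted


print(decode_flags(0xC2))      # ['SYN', 'ECE', 'CWR'] - SYN asking for ECN
print(is_ecn_setup_syn(0xC2))  # True
print(is_ecn_setup_syn(0x02))  # False - plain SYN without ECN
```

In these terms, the first two SYN attempts in our capture were ECN-setup SYNs and the final retransmission was a plain SYN.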

Windows has supported ECN for TCP since Windows Server 2008 and Vista (where it was disabled by default). In Windows Server 2012 it is enabled by default. (Linux supports ECN passively - it will negotiate ECN if asked by the other end.)

ECN support has improved significantly since its introduction 15 years ago but apparently some issues still exist.

The next step for us was to try to disable ECN to see if this was the culprit.
ECN capability can be turned off by executing this command:

netsh int tcp set global ecncapability=disabled



Once this change was applied, all TCP Retransmissions disappeared and request processing time was back to a few hundred milliseconds.

We contacted the API vendor and they reassured us that their end had proper ECN support. The fact that most (but not all) of the SYN packets with the ECN flag had this issue led us to believe that we saw a "Path-dependent connectivity dependency" as described on slide 6. This is also indirectly supported by the fact that some of the BGP routes changed at roughly the same time as we started experiencing this issue.

We were happy to see this issue resolved. Hope this blog post will help someone in a similar situation.

Keywords: Max SYN Retransmissions, maxsynretransmissions, slow connection, ECN, ecncapability, TCP retransmission

Tuesday, January 12, 2016

Resilience - Part 3 - The Aerospace Industry Approach

In Part 2 we discussed SLAs. As an IT engineer I think of a 99.95% SLA (for a single instance) as a pretty good one. But you probably already know by now that I like to compare IT with the aerospace industry. As a passenger - would you be happy if, say, a flight computer in your jet was allowed to malfunction for ~21 minutes in a given month? I'd be scared. And our experience tells us that this is not the case in the real world. So how do they achieve high availability in the aerospace industry?

Aerospace industry approach

I wanted to write about this for quite some time now. In fact, I've been thinking about it since May 2015 after reading a fascinating presentation by Peter Seiler and Bin Hu called "Design and Analysis of Safety Critical Systems". I reached out to Peter Seiler (Assistant Professor from the Department of Aerospace Engineering and Mechanics, University of Minnesota) and asked for permission to reuse some of the slides from this presentation. Thank you Peter!

Let's take the Boeing 777-200 as an example (Boeing's first fly-by-wire aircraft). "Fly by wire" is defined by the Dictionary of Aeronautical Terms as: 

Fly-by-wire (FBW) is a system that replaces the conventional manual flight controls of an aircraft with an electronic interface. The movements of flight controls are converted to electronic signals transmitted by wires (hence the fly-by-wire term), and flight control computers determine how to move the actuators at each control surface to provide the ordered response. The fly-by-wire system also allows automatic signals sent by the aircraft's computers to perform functions without the pilot's input, as in systems that automatically help stabilize the aircraft, or prevent unsafe operation of the aircraft outside of its performance envelope.
To put it simply - pilots (their controls) are no longer directly connected to the control surfaces (ailerons, rudder etc). Instead pilot actions are sent to the computers, which then "move" the control surfaces accordingly. This arrangement obviously makes these flight control computers critical to the overall safety of the aircraft.



Consequently, the reliability requirements are very strict: less than 10^-9 catastrophic failures per hour.

Modern flight control systems are very complex but at their heart they have a simple classic feedback loop:


Simple indeed. But with just a single flight computer we would probably only achieve availability similar to the cloud instances from Part 2 - a far cry from the required reliability targets.

What can we do to increase reliability? Based on the information from Part 1, we know that we can add redundant components to improve the fault tolerance of the system.

In the IT world, if we add one extra server we can have a 2-node cluster (with an active/active or active/passive arrangement). If we add more nodes then it can get more complicated. One of the approaches is the Majority Node Set. The triple modular redundancy approach is very common in the aerospace industry. By having 3 redundant components we arrive at the classic "Triplex" architecture:

Instead of one we have 3 independent components for each critical subsystem. These components are involved in the voting process to work out the correct result/action. 
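For fellow IT people, the voting idea can be sketched in Python like this. It is a toy 2-out-of-3 majority voter, certainly not the algorithm used in a real flight control system. The second function assumes that the three components fail independently with the same probability, which is exactly the assumption the design principles discussed in this post are there to protect. All names are made up for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value that a majority of the components agree on, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    needed = len(outputs) // 2 + 1   # 2 out of 3 for a triplex
    return value if count >= needed else None


def triplex_failure_probability(p):
    """Probability that at least 2 of 3 independent components fail.

    Assumes each component fails with the same probability p and that
    failures are independent of each other.
    """
    return 3 * p**2 * (1 - p) + p**3


print(majority_vote([10, 10, 11]))   # 10 - one faulty component is outvoted
print(majority_vote([10, 11, 12]))   # None - no majority, cannot decide

# A component failing 1 time in 1000 becomes roughly 3 in a million as a triplex.
print(triplex_failure_probability(0.001))
```

The voter masks any single faulty component, and under the independence assumption the combined failure probability drops from p to roughly 3p².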

But even this is not enough and aerospace engineers go further. The 1996 "Triple-Triple Redundant 777 Primary Flight Computer" paper by Y.C. (Bob) Yeh describes 5 principles of Boeing's 777 FBW (fly by wire) design philosophy/design constraints related to safety:
  1. Common Mode/Common Area Faults
  2. Separation of FBW Components
  3. FBW Functional Separation
  4. Dissimilarity
  5. FBW Effect on Structure

This is where we see the triple modular redundancy evolving into the triple-triple architecture. We have 3 similar/identical channels (left, centre, right)...



For obvious reasons physical electrical wires for different channels will be located as far from each other as possible (to satisfy the second "separation of components" principle).

... and 3 dissimilar lanes in each channel (one in command, the other 2 functioning as monitors). 



I was fascinated by the dissimilarity principle. There were many methods used in 777 architecture to satisfy this principle but the IT engineer in me was really impressed by this particular approach:

Dissimilar Microprocessor and Compilers (with Common software)

Or quoting [Yeh, 96]:

The microprocessors are considered to be the most complex hardware devices. The INTEL 80486, Motorola 68040 and AMD 29050 microprocessors were selected for the PFCs (Primary Flight Computers - DK). The dissimilar microprocessors lead to dissimilar interface hardware circuitries and dissimilar ADA compilers.

How cool is that?!!!

Intel 80486 (that powered PCs around the world in the 90s), 68040 (Macintosh Quadra 700 anyone?)... ah, memories!

So the designers selected 3 different CPU architectures. This means 3 different versions of machine code. So we need 3 different ADA compilers (one for each platform) to provide triple dissimilarity. Wow... What an incredible level of assurance this approach provides!

When I was reading about this approach I was also contemplating the idea of having 3 independent groups of software developers implementing the same project requirements to avoid (or at least reduce the probability of) producing the same bugs... 

Anyway, I hope you enjoyed this overview. While researching this topic I have certainly felt a lot of respect for aerospace designers, architects, and engineers. They produce highly reliable systems that make it safe for all of us to fly. And we (IT people) can certainly learn a few tricks there (especially for mission critical systems).



Resilience - Part 2 - SLAs explained

SLAs

In Part 1 we covered the basics. Now let's talk about real world situations. In the IT world we often talk about Service Level Agreements (SLAs).

An SLA is an agreement/document that describes the expected level of service (including specific metrics used to measure provided quality of service and potentially penalties for not meeting the expectations i.e. not achieving agreed levels of service).

SLAs can be both formal (e.g. between an external 3rd party service provider and a client) and informal (e.g. between 2 internal departments or teams within the organisation)

Service providers usually offer a choice of different levels of service quality/uptime. It is natural for customers to expect to pay more for higher levels of system availability.

How do we specify SLAs?

E.g. you might get a server running in a data centre and your hosting provider will promise a 99.9% SLA. This is a typical SLA for a single server setup. But what does it mean? How reliable is this server going to be? "99.9%" ("three nines") means that the hosting provider guarantees that this server will be up and running (i.e. will be available) 99.9% of the time. 

If we take a "standard" month that consists of 30 days then all these "nines" can be translated in real terms of downtime as:
SLA                      Downtime
99% (two nines)          7 hours 12 minutes
99.9% (three nines)      43 minutes 12 seconds
99.95%                   21 minutes 36 seconds
99.99% (four nines)      4 minutes 19 seconds
99.999% (five nines)     26 seconds
99.9999% (six nines)     3 seconds
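If you prefer to check these numbers yourself, the conversion is easy to sketch in Python. The function name `allowed_downtime` is made up, and the 30-day month is the same assumption as in the table above; results are rounded to whole seconds.

```python
from datetime import timedelta


def allowed_downtime(sla_percent, days=30):
    """Maximum downtime per period for a given SLA, rounded to whole seconds."""
    period_seconds = days * 24 * 60 * 60
    downtime_seconds = round(period_seconds * (100 - sla_percent) / 100)
    return timedelta(seconds=downtime_seconds)


for sla in (99, 99.9, 99.95, 99.99, 99.999, 99.9999):
    print(f"{sla}% -> {allowed_downtime(sla)}")

# 99% -> 7:12:00
# 99.9% -> 0:43:12
# 99.95% -> 0:21:36
# 99.99% -> 0:04:19
# 99.999% -> 0:00:26
# 99.9999% -> 0:00:03
```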

You can use a very convenient uptime calculator if you want to experiment with some other numbers.

To give you a few examples let's see what some of the most popular cloud providers commit to. For simplicity let's check the SLAs for single instances/VMs


                      AWS                                        Azure
Service commitment    99.95% during any monthly billing cycle    99.95%
Service credit        <99.95% - 10%                              <99.95% - 10%
                      <99% - 30%                                 <99% - 25%
The actual SLA        AWS SLA                                    Azure VM SLA

As a side note - it is also interesting to see how AWS and Azure define "downtime" or being "unavailable".

AWS: 
"Unavailable" and "Unavailability" mean:
For Amazon EC2, when all of your running instances have no external connectivity.

Azure:
Downtime - The total accumulated minutes that are part of Maximum Available Minutes that have no External Connectivity.

So both vendors define being unavailable as having no external connectivity.

I'd like to mention another consideration that I was made aware of while visiting Telstra's GSOC in Melbourne. Imagine if a telco dropped just 1 packet in a whole month. So a particular client just hasn't received one single packet. The telco might think their availability was nearly 100% for that month. But from the client's perspective this same situation may result in a very different outcome. Some (especially old legacy) systems cannot tolerate a single packet loss and enter an error state. In order to recover, engineers on the client side might be forced to go through an hour-long process of restarting their systems in a predefined order. Just think about it - a single lost packet can cause an hour-long outage on the client side (straight away - this client won't be able to achieve a 99.9% SLA for that month). This may sound like an extreme case but trust me - these things do happen in the real world.
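A quick back-of-the-envelope check of that last claim, as a Python sketch (the function name is made up, and I am assuming the same 30-day month as before):

```python
def achieved_availability(downtime_minutes, days=30):
    """Availability in percent for a period with the given total downtime."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


# One hour-long outage in a 30-day month
availability = achieved_availability(60)
print(f"{availability:.3f}%")   # 99.861%
print(availability >= 99.9)     # False - the 99.9% target is missed
```

So a single one-hour outage already uses up more than the 43 minutes 12 seconds that "three nines" allows for the whole month.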