Tuesday, January 12, 2016

Resilience - Part 3 - The Aerospace Industry Approach

In Part 2 we've discussed SLAs. As an IT engineer I think of a 99.95% SLA (for a single instance) as a pretty good one. But you probably already know by now that I like to compare IT with the aerospace industry. As a passenger - would you be happy if say a flight computer in your jet was allowed to malfunction for ~21 minutes in a given month? I'd be scared. And our experience tells us that this is not the case in real world. So how do they achieve high availability in the aerospace industry?

Aerospace industry approach

I wanted to write about this for quite some time now. In fact, I've been thinking about it since May 2015 after reading a fascinating presentation by Peter Seiler and Bin Hu called "Design and Analysis of Safety Critical Systems". I reached out to Peter Seiler (Assistant Professor from the Department of Aerospace Engineering and Mechanics, University of Minnesota) and asked for permission to reuse some of the slides from this presentation. Thank you Peter!

Let's take Boeing 777-200 as an example (Boeing's first fly-by-wire aircraft). "Fly by wire" is defined by the Dictionary of Aeronautical Terms as: 

Fly-by-wire (FBW) is a system that replaces the conventional manual flight controls of an aircraft with an electronic interface. The movements of flight controls are converted to electronic signals transmitted by wires (hence the fly-by-wire term), and flight control computers determine how to move the actuators at each control surface to provide the ordered response. The fly-by-wire system also allows automatic signals sent by the aircraft's computers to perform functions without the pilot's input, as in systems that automatically help stabilize the aircraft, or prevent unsafe operation of the aircraft outside of its performance envelope.
To put it simply - pilots (their controls) are no longer directly connected to the control surfaces (ailerons, rudder etc). Instead pilot actions are sent to the computers, which then "move" the control surfaces accordingly. This arrangement obviously makes these flight control computers critical to the overall safety of the aircraft.

Consequently, the reliability requirements are very strict: less than 10-9 catastrophic failures per hour.

Modern flight control systems are very complex but at their heart they have a simple classic feedback loop:

Simple indeed. But with just a single flight computer we will probably be only achieving availability similar to the cloud instances above - far cry from the required reliability targets.

What can we do to increase reliability? Based on the information from Part 1, we know that we can add redundant components to improve the fault tolerance of the system.

In IT world if we add one extra server we can have a 2-node cluster (with the active/active or active/passive arrangements). If we add more nodes then it can get more complicated. One of the approaches is the Majority Node Set. The triple modular redundancy approach is very common in the aerospace industry. By having 3 redundant components we arrive to the classic "Triplex" architecture:

Instead of one we have 3 independent components for each critical subsystem. These components are involved in the voting process to work out the correct result/action. 

But even this is not enough and aerospace engineers go further. The 1996 "Triple-Triple Redundant 777 Primary Flight Computer" paper by Y.C. (Bob) Yeh describes 5 principles of Boeing's 777 FBW (fly by wire) design philosophy/design constraints related to safety:
  1. Common Mode/Common Area Faults
  2. Separation of FBW Components
  3. FBW Functional Separation
  4. Dissimilarity
  5. FBW Effect on Structure

This is where we see the triple modular redundancy evolving into the triple-triple architecture. We have 3 similar/identical channels (left, centre, right)...

For obvious reasons physical electrical wires for different channels will be located as far from each other as possible (to satisfy the second "separation of components" principle).

... and 3 dissimilar lanes in each channel (one in command, the other 2 functioning as monitors). 

I was fascinated by the dissimilarity principle. There were many methods used in 777 architecture to satisfy this principle but the IT engineer in me was really impressed by this particular approach:

Dissimilar Microprocessor and Compilers (with Common software)

Or quoting [Yeh, 96]:

The microprocessors are considered to be the most complex hardware devices. The INTEL 80486, Motorola 68040 and AMD 29050 microprocessors were selected for the PFCs (Primary Flight Computers - DK). The dissimilar microprocessors lead to dissimilar interface hardware circuitries and dissimilar ADA compilers.

How cool is that?!!!

Intel 80486 (that powered PCs around the world in the 90s), 68040 (Macintosh Quadra 700 anyone?)... ah, memories!

So the designers selected 3 different CPU architectures. This means 3 different versions of machine code. So we need 3 different ADA compilers for each platform to provide triple dissimilarity. Wow... What an incredible level of assurance this approach provides!

When I was reading about this approach I was also contemplating an idea of having 3 independent groups of software developers implementing the same project requirements to avoid (or at least reduce the probability of) producing the same bugs... 

Anyway, I hope you enjoyed this overview. While researching this topic I have certainly felt a lot of respect to aerospace designers, architects, and engineers. They produce highly reliable systems that make it safe for all of us to fly. And we (IT people) can certainly learn a few tricks there (especially in the mission critical systems).

Resilience - Part 2 - SLAs explained


In Part 1 we've covered the basics. Now let's talk about the real world situations. In IT world we often talk about Service Level Agreements (SLAs).

An SLA is an agreement/document that describes the expected level of service (including specific metrics used to measure provided quality of service and potentially penalties for not meeting the expectations i.e. not achieving agreed levels of service).

SLAs can be both formal (e.g. between an external 3rd party service provider and a client) and informal (e.g. between 2 internal departments or teams within the organisation)

Service providers usually have an option to choose different levels of service quality/uptime. It is natural for customers to expect to pay more for higher levels of system availability.

How do we specify SLAs?

E.g. you might get a server running in a data centre and your hosting provider will promise a 99.9 SLA. This is a typical SLA for a single server setup. But what does it mean? How reliable is this server going to be? "99.9" ("three nines") means that the hosting provider guarantees that this server will be up and running (i.e. will be available) 99.9% of the time. 

If we take a "standard" month that consists of 30 days then all these "nines" can be translated in real terms of downtime as:
99% (two nines)7 hours 12 minutes
99.9% (three nines)43 minutes 12 seconds
99.95%21 minutes 36 seconds
99.99% (four nines)4 minutes 19 seconds
99.999% (five nines)26 seconds
99.9999% (six nines)3 seconds

You can use a very convenient uptime calculator if you want to experiment with some other numbers.

To give you a few examples let's see what some of the most popular cloud providers commit to. For simplicity let's check the SLAs for single instances/VMs

Service commitment99.95% during any monthly billing cycle99.95%
Service credit<99.95% - 10%
<99% - 30%
<99.95% - 10%
<99% - 25%
The actual SLAAWS SLAAzure VM SLA

As a side note - it is also interesting to note how AWS and Azure define "downtime" or being "unavailable".

"Unavailable" and "Unavailability" mean:
For Amazon EC2, when all of your running instances have no external connectivity.

Downtime - The total accumulated minutes that are part of Maximum Available Minutes that have no External Connectivity.

So both vendors define being unavailable as having no external connectivity.

I'd like to mention another consideration that I was made aware of while visiting Telstra's GSOC in Melbourne. Imagine if a telco dropped just 1 packet in a whole month. So a particular client just hasn't received one single packet. The telco might think their availability was nearly 100% for that month. But from the client's perspective this same situation may result in a very different outcome. Some (especially old legacy) systems cannot tolerate a single packet loss and enter the error state. In order to recover, engineers on the client site might be forced to go through an hour long process of restarting their systems in a predefined order to recover from this failure. Just think about it - a single lost packet can cause an hour long outage on the client side (straight away - this client won't be able to achieve a 99.9% SLA for that month). This may sound like an extreme case but trust me - these things do happen in the real world.