Dmitry Kulshitsky: Resilience - Part 1

Introduction

The longer I work in IT the more I become fascinated by the fact that we don't focus enough on system resilience. And to make it clearer, when I talk about resilience in the context of IT (especially around Programming, Network/Systems engineering, DevOps, Security) I am pretty happy with this definition:

Resilience is "the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation"

And like I promised in one of my earlier posts, I would love to explore the similarities and differences in approaches to engineering for resilience in IT vs Aerospace industry. Typically the impact of the system failure in space is higher. LOM (loss of mission) and LOC (loss of crew) terms may sound like tech jargon to an outsider but these are real people, whose lives depend on the systems' ability to survive various failure conditions. And even mission costs themselves are usually measured in tens or hundreds of millions of dollars. Having said that, some of the IT failures can result in catastrophic consequences too (think SCADA as an example).

But before we dive deeper, I would like to cover some basics. Some of you may find this post boring. There will be even some maths. And I remember this quote from Stephen Hawking's "A Brief History of Time": "Someone told me that each equation I included in the book would halve the sales"

And yet I believe that it is important to get the foundation right. So let's get started.

Reliability and probability of failure events

Many systems can be logically viewed as a set of connected elements/parts/components that form this system. Those parts may fail with various probabilities and the overall system reliability (or a probability of the whole system failure) depends on and can be calculated from the reliability of its components.

2 classic (and simplest) scenarios to consider - series and parallel systems

Credit: Wikipedia

Series systems

In the series systems a failure of ANY components results in (overall) system failure.

A proper scientific way to express this statement:

P[system failure] = 1 − P[system survival] = 1− P[X₁ ∩ X₂ ∩ ... ∩ X_n]

where

P[system failure] - probability of system failure

P[system survival] - probability that system survives/remains operational. It's also called system reliability

P(X_i) - probability that component X_i remains operational.

Probability values lay in the 0..1 range

If for simplicity we ignore the common mode failure - i.e. all components are independent and a failure of one component doesn't affect the reliability of another component - then we can simply multiply the individual probabilities to get the probability of system failure:

P[system failure] = 1 − P[system survival] = 1− P(X₁)P(X₂) ... P(X_n)

If all probabilities are the same then the formula looks even simpler:

P[system failure] = 1 − (1 − P)ⁿ

Example: A rocket with 2 engines or a 2 node shard. Each component remains operational with the probability 0.9 - what is the overall system reliability?

P[system survival] = 0.9*0.9 = 0.81 (or 81%)

As you can see the overall series system reliability is lower than the reliability of its components.

If we consider different probabilities (say 0.9 and 0.8) then

0.9*0.8 = 0.72

we can deduce an even stronger statement (the "weakest link" principle): the overall series system reliability is less than the reliability of the least reliable component.

It is also important to note that by increasing the number of components in the series system we reduce overall system reliability (hello microservices/multi-tier applications)

Parallel systems

In the parallel systems ALL components must fail for the whole system to fail.

The corresponding mathematical formula looks like this:
P[system survival] = 1 - P[system failure] = 1 - P[F₁ ∩ F₂ ∩ ... ∩ F_n]

where

P(F_i) - probability of failure of component F_i.

And in a simpler case, where all components are independent:
P[system survival] = 1 - P[system failure] = 1 - P(F₁)P(F₂) ... P(F_n)

Example: A rocket with 2 engines and an engine-out capability or a 2 disk RAID1 array (we assume that these components are independent - e.g. an explosion of one engine cannot affect the remaining engine). Each component has a 0.1 probability of failure (i.e. reliability=0.9) - what is the reliability of the overall system?

Reliability = P[system survival] = 1 - (1-0.9)*(1-0.9) = 1 - 0.1*0.1 = 0.99 (or 99%)

Again, if we consider different probabilities (say 0.9 and 0.8) then
Reliability = P[system survival] = 1 - 0.1*0.2 = 0.98 (or 98%)

Opposite to series system - in parallel system the overall system reliability increases as the number of components increases. I.e. adding more components increases overall reliability.

We also know this approach as redundancy.

Notice that overall reliability increases as we increase reliability of a component. The most reliable component has the largest impact on reliability (because it - being most reliable - would most likely fail last)

Consider the following example:
P1=0.6, P2=0.8, P3=0.9

Reliability = 1 - 0.4*0.2*0.1 = 0.992 (99.2%)

By improving P1 from 0.6 to say 0.7 we achieve a 0.2% improvement
Reliability = 1 - 0.3*0.2*0.1 = 0.994 (99.4%)

On contrary if we only improve P3 from 0.9 to 0.95
Reliability = 1 - 0.4*0.2*0.05 = 0.996 (99.6%) - a 0.4% improvement

Improving reliability of the most reliable component delivers better results - an important fact to know when designing/optimising parallel systems. In series systems we can achieve better outcomes by improving reliability of the least reliable component.

"k out of n" systems

Series and parallel systems are 2 simplest scenarios. A slightly more complicated case is "k out of n" systems. These are systems that fail if k or more components fail. E.g. an airplane with 4 engines that can fly with 1 engine failure (but if 2 engines fail then it can't continue the flight). Or a RAID 6 disk array - it can continue its operations (in recovery mode) "in the presence of any two concurrent disk failures". It's easy to see that with k=1 we have a series system and with k=n we have a parallel system.

In the simplest scenario (independent components with the identical reliability R)

Source:Reliawiki

where
n is the total number of components in the system
k is the minimum number of units required for system to remain operational
R is the reliability of each component

The expression in round brackets is a binomial coefficient that can be calculated as
n! / [ r! * (n-r)!]

Imagine a RAID 6 array that consists of 6 disks (n=6, k=4 as up to 2 disks can fail) with each disk having reliability of 85% (I'd like to cheat and reuse the Reliawiki's example here)

Then the array's reliability can be calculated as

Source: Reliawiki

As an exercise try to calculate reliability of Falcon-9's first stage. It contains 9 identical Merlin 1D engines and during a certain part of ascent it can lose 1 or 2 engines and still reach space (for simplicity use R=0.9). The actual reliability of the engine is higher. I don't have current stats but a few months ago there were 90 Merlin-1D engines flown with 1 in-flight engine failure, which gives us reliability estimation of 0.98(8)

Don't over-engineer it at the component level - focus on the overall desired outcomes

It might be tempting to keep adding more and more components to continue improving reliability. But we need to be careful here. Additional components come with a cost. Inevitably we need to spend more money. There is also another cost involved - more weight/volume required (which is a critical factor for space missions).

It is also very important to understand how our system fits (as a component itself) into the global system. Know the context/full picture - don't over-engineer your system as there might be other compensating controls in place that would help us achieve desired reliability goals.

Consider the following scenario:
We work on a space transportation system. It is going to be human-rated (i.e. it will carry people to space). So it needs to be very reliable. A reliability of 0.999 (1 failure in a 1000 missions) is considered acceptable for this project. And our part of the project is to build the launch vehicle. The initial reaction might be to achieve (realistically) maximum possible reliability. But there are other systems that are part of this space transportation system and one of them will be an escape system (the one that carries crew to safety away from the failing launch vehicle)

Source: Wikipedia

And if we consider how these 2 systems together form a larger system, we might arrive to a different conclusion. E.g. it might be easier/more efficient/more viable to focus on higher reliability of the escape system.

I will use a great example from "Modern Engineering for Design of Liquid-Propellant Rocket Engines" by Dieter K. Huzel and David H. Huang:

Reliability		Flight safety
Spacecraft and launch vehicle	Escape system	Probability of crew survival
0.50	0.998	0.999
0.90	0.99	0.999
0.999	0.00	0.999

See how the main goal of flight safety of 0.999 ("three nines") can be achieved by 3 VERY different approaches.

Case 1: we have a really bad (but presumably simple/cheap to build) launch vehicle - it is going to fail every second launch!!! But it is OK because our escape system is extremely reliable. We may not see many missions reaching orbit but our crew will be safely returned back to Earth.

Case 2: a better launch vehicle (will fail in 1 out of 10 launches) and a decent escape system (optimum reliability) deliver the same "three nines" flight safety with a higher degree of confidence in mission success.

Case 3: another extreme - our launch vehicle is SO reliable that we don't even need an escape system at all. With the 0.999 reliability we can entrust our crew to this launch vehicle alone (not sure if the crew is going to appreciate it though - there is a certain psychological aspect knowing there IS a backup plan).

Without knowing anything about the escape system it would be impossible to properly design our part of the project (the launch vehicle). We would end up either over-engineering it or not providing adequate reliability.

Conclusion (TL;DR)

The overall series system reliability is lower than the reliability of its components (1)

The overall series system reliability is less than the reliability of the least reliable component (1.1)
By increasing the number of components in the series system we reduce overall system reliability (1.2)

In parallel system the overall system reliability increases as the number of components increases (2)

... but this comes with the increase in costs (money, weight/volume etc)
The most reliable component has the largest impact on reliability (2.1)

When optimising/improving overall system reliability the most efficient way is to focus on

the least reliable component in series systems (3.1)
the most reliable component in parallel systems (3.2)
Don't over-engineer it at the component level. Focus on the overall desired outcomes. (3.3)

Dmitry Kulshitsky

Monday, August 31, 2015

Resilience - Part 1 - Introduction