Monday, August 31, 2015

Resilience - Part 1 - Introduction

Introduction

The longer I work in IT, the more I am struck by how little we focus on system resilience. And to make it clearer: when I talk about resilience in the context of IT (especially around programming, network/systems engineering, DevOps, and security) I am pretty happy with this definition:
Resilience is "the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation"
And as I promised in one of my earlier posts, I would love to explore the similarities and differences in approaches to engineering for resilience in IT vs the aerospace industry. Typically the impact of a system failure in space is higher. LOM (loss of mission) and LOC (loss of crew) may sound like tech jargon to an outsider, but these terms refer to real people whose lives depend on the systems' ability to survive various failure conditions. And even mission costs themselves are usually measured in tens or hundreds of millions of dollars. Having said that, some IT failures can result in catastrophic consequences too (think SCADA as an example).

But before we dive deeper, I would like to cover some basics. Some of you may find this post boring. There will even be some maths. And I remember this quote from Stephen Hawking's "A Brief History of Time": "Someone told me that each equation I included in the book would halve the sales".

And yet I believe that it is important to get the foundation right. So let's get started.

Reliability and probability of failure events

Many systems can be logically viewed as a set of connected components. Those components may fail with various probabilities, and the overall system reliability (or the probability of the whole system failing) depends on and can be calculated from the reliability of its components.

The 2 classic (and simplest) scenarios to consider are series and parallel systems.
Credit: Wikipedia

Series systems

In series systems, a failure of ANY component results in (overall) system failure.

A proper scientific way to express this statement:
P[system failure] = 1 − P[system survival] = 1− P[X1 ∩ X2 ∩ ... ∩ Xn]

where
P[system failure] - probability of system failure
P[system survival] - probability that system survives/remains operational. It's also called system reliability
P(Xi) - probability that component Xi remains operational.
Probability values lie in the 0..1 range

If for simplicity we ignore the common mode failure - i.e. all components are independent and a failure of one component doesn't affect the reliability of another component - then we can simply multiply the individual probabilities to get the probability of system failure:

P[system failure] = 1 − P[system survival] = 1− P(X1)P(X2) ... P(Xn)

If all components have the same reliability R then the formula looks even simpler:
P[system failure] = 1 − R^n

Example: a rocket with 2 engines (both must work) or a 2-node shard. Each component remains operational with probability 0.9 - what is the overall system reliability?

 P[system survival] = 0.9*0.9 = 0.81 (or 81%)

As you can see the overall series system reliability is lower than the reliability of its components.

If we consider different probabilities (say 0.9 and 0.8) then
0.9*0.8 = 0.72
and we can deduce an even stronger statement (the "weakest link" principle): the overall series system reliability is lower than the reliability of the least reliable component.

It is also important to note that by increasing the number of components in a series system we reduce the overall system reliability (hello, microservices/multi-tier applications).
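If you want to play with these numbers, here is a minimal Python sketch (my own illustration, not from any particular library) that multiplies component reliabilities to get the series system reliability:

from functools import reduce

def series_reliability(reliabilities):
    # Series system: it survives only if every component survives
    return reduce(lambda a, b: a * b, reliabilities, 1.0)

print(series_reliability([0.9, 0.9]))   # 0.81 - the 2-engine rocket example
print(series_reliability([0.9, 0.8]))   # 0.72 - below the weakest component
print(series_reliability([0.99] * 10))  # ~0.904 - more components, lower reliability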

Parallel systems

In parallel systems, ALL components must fail for the whole system to fail.

The corresponding mathematical formula looks like this:
P[system survival] = 1 - P[system failure] = 1 - P[F1 ∩ F2 ∩ ... ∩ Fn]
where
P(Fi) - probability of failure of component Fi.

And in a simpler case, where all components are independent:
P[system survival] = 1 - P[system failure] = 1 - P(F1)P(F2) ... P(Fn)

Example: A rocket with 2 engines and an engine-out capability or a 2 disk RAID1 array (we assume that these components are independent - e.g. an explosion of one engine cannot affect the remaining engine). Each component has a 0.1 probability of failure (i.e. reliability=0.9) - what is the reliability of the overall system?

Reliability = P[system survival] = 1 - (1-0.9)*(1-0.9) = 1 - 0.1*0.1 = 0.99 (or 99%)

Again, if we consider different probabilities (say 0.9 and 0.8) then
Reliability = P[system survival] = 1 - 0.1*0.2 = 0.98 (or 98%)

In contrast to series systems, in parallel systems the overall reliability increases as the number of components increases - adding more components makes the system more reliable.

We also know this approach as redundancy.

Notice also that overall reliability increases as we increase the reliability of any component, and that the most reliable component has the largest impact on overall reliability (being the most reliable, it is the one most likely to fail last).

Consider the following example:
P1=0.6, P2=0.8, P3=0.9

Reliability = 1 - 0.4*0.2*0.1 = 0.992 (99.2%)

By improving P1 from 0.6 to, say, 0.7 we achieve a 0.2 percentage point improvement:
Reliability = 1 - 0.3*0.2*0.1 = 0.994 (99.4%)

In contrast, if we only improve P3 from 0.9 to 0.95:
Reliability = 1 - 0.4*0.2*0.05 = 0.996 (99.6%) - a 0.4 percentage point improvement

Improving the reliability of the most reliable component delivers better results - an important fact to know when designing/optimising parallel systems. In series systems, we achieve better outcomes by improving the reliability of the least reliable component.
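The parallel case is just as easy to sanity-check in code - a minimal Python sketch (again my own illustration, assuming independent components):

def parallel_reliability(reliabilities):
    # Parallel system: it fails only if every component fails
    p_all_fail = 1.0
    for r in reliabilities:
        p_all_fail *= (1.0 - r)
    return 1.0 - p_all_fail

print(parallel_reliability([0.6, 0.8, 0.9]))   # 0.992 - the baseline above
print(parallel_reliability([0.7, 0.8, 0.9]))   # 0.994 - improved the least reliable
print(parallel_reliability([0.6, 0.8, 0.95]))  # 0.996 - improved the most reliable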

"k out of n" systems

Series and parallel systems are the 2 simplest scenarios. A slightly more complicated case is "k out of n" systems - systems that remain operational as long as at least k of their n components are working. E.g. an airplane with 4 engines that can fly with 1 engine failure (but not with 2) is a "3 out of 4" system. Or a RAID 6 disk array - it can continue its operations (in recovery mode) "in the presence of any two concurrent disk failures", i.e. "n-2 out of n". It's easy to see that with k=n we have a series system and with k=1 we have a parallel system.

In the simplest scenario (independent components with identical reliability R) the system reliability is:

R[system] = SUM from r=k to n of [ C(n, r) * R^r * (1 - R)^(n - r) ]

Source: Reliawiki
where
n is the total number of components in the system
k is the minimum number of units required for system to remain operational
R is the reliability of each component

C(n, r) is the binomial coefficient, which can be calculated as
C(n, r) = n! / [ r! * (n - r)! ]

Imagine a RAID 6 array that consists of 6 disks (n=6, k=4, as up to 2 disks can fail), with each disk having a reliability of 85% (I'd like to cheat and reuse Reliawiki's example here)

Then the array's reliability can be calculated as

R = C(6, 4) * 0.85^4 * 0.15^2 + C(6, 5) * 0.85^5 * 0.15 + C(6, 6) * 0.85^6
  = 0.1762 + 0.3993 + 0.3771
  = 0.9527 (or 95.27%)

Source: Reliawiki

As an exercise, try to calculate the reliability of Falcon 9's first stage. It contains 9 identical Merlin 1D engines, and during a certain part of ascent it can lose 1 or 2 engines and still reach space (for simplicity use R=0.9). The actual reliability of the engine is higher: I don't have current stats, but a few months ago there were 90 Merlin 1D engines flown with 1 in-flight engine failure, which gives a reliability estimate of 0.98(8).
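If you would like to check your answer, the formula above fits in a few lines of Python (the function names are mine, for illustration only):

from math import factorial

def binomial(n, r):
    # n! / [ r! * (n - r)! ]
    return factorial(n) // (factorial(r) * factorial(n - r))

def k_out_of_n_reliability(k, n, reliability):
    # Probability that at least k of the n independent components survive
    return sum(binomial(n, i) * reliability**i * (1 - reliability)**(n - i)
               for i in range(k, n + 1))

print(k_out_of_n_reliability(4, 6, 0.85))  # ~0.9527 - the RAID 6 example above
print(k_out_of_n_reliability(7, 9, 0.9))   # ~0.947 - the Falcon 9 exercise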


Don't over-engineer it at the component level - focus on the overall desired outcomes

It might be tempting to keep adding more and more components to continue improving reliability. But we need to be careful here: additional components come at a cost. Inevitably we need to spend more money, and there is another cost involved too - more weight/volume required (a critical factor for space missions).

It is also very important to understand how our system fits (as a component itself) into the global system. Know the context/full picture - don't over-engineer your system as there might be other compensating controls in place that would help us achieve desired reliability goals.

Consider the following scenario:
We work on a space transportation system. It is going to be human-rated (i.e. it will carry people to space), so it needs to be very reliable. A reliability of 0.999 (1 failure in 1000 missions) is considered acceptable for this project, and our part of the project is to build the launch vehicle. The initial reaction might be to aim for the (realistically) maximum possible reliability. But there are other systems that make up this space transportation system, and one of them is the escape system (the one that carries the crew to safety, away from a failing launch vehicle).

Source: Wikipedia
And if we consider how these 2 systems together form a larger system, we might arrive at a different conclusion. E.g. it might be easier/more efficient/more viable to focus on higher reliability of the escape system.

I will use a great example from "Modern Engineering for Design of Liquid-Propellant Rocket Engines" by Dieter K. Huzel and David H. Huang:

Reliability                                        Flight safety
Spacecraft and launch vehicle    Escape system     Probability of crew survival
0.50                             0.998             0.999
0.90                             0.99              0.999
0.999                            0.00              0.999

See how the main goal - flight safety of 0.999 ("three nines") - can be achieved by 3 VERY different approaches.

Case 1: we have a really bad (but presumably simple/cheap to build) launch vehicle - it is going to fail every second launch! But that is OK, because our escape system is extremely reliable. We may not see many missions reach orbit, but our crew will be returned safely to Earth.

Case 2: a better launch vehicle (will fail in 1 out of 10 launches) and a decent escape system (optimum reliability) deliver the same "three nines" flight safety with a higher degree of confidence in mission success.

Case 3: another extreme - our launch vehicle is SO reliable that we don't need an escape system at all. With 0.999 reliability we can entrust our crew to the launch vehicle alone (not sure the crew would appreciate it though - there is a certain psychological comfort in knowing there IS a backup plan).

Without knowing anything about the escape system it would be impossible to properly design our part of the project (the launch vehicle). We would end up either over-engineering it or not providing adequate reliability.
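The arithmetic behind the table is easy to verify: the crew is lost only when the launch vehicle fails AND the escape system fails too. A quick Python sketch (my own, mirroring the parallel-system formula above):

def crew_survival(r_vehicle, r_escape):
    # Crew is lost only when both the launch vehicle and the escape system fail
    return 1 - (1 - r_vehicle) * (1 - r_escape)

for rv, re in [(0.50, 0.998), (0.90, 0.99), (0.999, 0.00)]:
    print(crew_survival(rv, re))  # ~0.999 in all three cases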

Conclusion (TL;DR)

  • The overall series system reliability is lower than the reliability of its components (1)
    • The overall series system reliability is less than the reliability of the least reliable component (1.1)
    • By increasing the number of components in the series system we reduce overall system reliability (1.2)
  • In a parallel system the overall reliability increases as the number of components increases (2)
    • ... but this comes with the increase in costs (money, weight/volume etc)
    • The most reliable component has the largest impact on reliability (2.1)
  • When optimising/improving overall system reliability the most efficient way is to focus on
    • the least reliable component in series systems (3.1)
    • the most reliable component in parallel systems (3.2)
    • Don't over-engineer it at the component level. Focus on the overall desired outcomes. (3.3)







Thursday, July 30, 2015

NVIDIA driver problem after Windows 10 upgrade

This is a quick post to help those experiencing the same issue.

I have just performed an in-place upgrade of my home PC from Windows 8.1 to Windows 10. Everything went well during the installation phase, but when I finally logged in, only one monitor was working, at a lower-than-usual default resolution. Hmmm, OK. I went to the device manager and noticed this:


This is clearly a video driver issue: for some reason Windows recognised the card but failed to install a proper driver. To fix this issue I went to NVIDIA's web site, downloaded the latest Windows 10 64-bit driver and ran the installer.

To my surprise (once the package was extracted to my local disk)  I was greeted by this error message - "NVIDIA Installer failed":



Puzzled, I ran a Windows update process but the system didn't detect any new device drivers.
Then I decided to try to install the driver manually.

Right-click the video card in the device manager and select "Update driver software..."

Click "Browser my computer for driver software". Click "Browse" and navigate to the directory where the nVidia driver installer extracted its files to:


In my case the path was: C:\NVIDIA\DisplayDriver\353.62\Win10_64\International\Display.Driver

Click OK and the driver will be installed.
Now run the NVIDIA installer again (it will work this time) and proceed with the normal installation to make sure you get the other required pieces from this package.


Reboot and you will finally have all your monitors detected and running with the proper screen resolution.

Hope this will save some time for people experiencing the same issue during the upgrade.


Saturday, July 18, 2015

Disabled Adobe Flash browser plugin? This might not be enough

If you follow IT news, I am sure you have heard about the Hacking Team leak. As part of the analysis of the leaked material, we learnt about several exploits that relied on 0-day vulnerabilities. Adobe Flash alone had 3 separate vulnerabilities revealed within the first few days, and Adobe had to rush out 2 patches one after another to fix them (and further improve security by hardening sensitive areas of the code - thanks to Google's Project Zero).

It didn't take long for criminals to add these (now public) exploits to so-called exploit kits for the purpose of spreading malware. The risk was high enough for Mozilla Firefox and Google Chrome to automatically disable the Flash plugin until patches addressing those vulnerabilities were made available.

I am sure you (being security conscious) went and disabled the Flash plugin even before it was done automatically by some of the vendors. So your Internet Explorer Add-ons list looks similar to this (notice status=disabled):


And your Chrome list of plugins (chrome://plugins) resembles this:

These are good security measures. But is this enough? Apparently not. What we've done is disable the Flash plugin in these particular browsers. But Flash itself is still well and truly present in the system. And I can demonstrate this to you. Windows has a built-in utility called HTML Help (hh.exe). Its main purpose is to display help files, but it can also open remotely stored documents - including HTML pages. So it can act as a browser. Here is what I was able to observe on my system:


I went to Adobe's Flash test page and opened it in IE (top left). As expected, the plugin couldn't run because it has been disabled (see the Manage Add-ons window in the bottom-left corner). And yet when I opened the same test URL in HH, Flash was right there. And this is a problem. Yes, by disabling Flash in the main browsers we have significantly reduced the risk, but we have not eliminated it.
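If you want to reproduce this test yourself, it is a single command (the URL below is Adobe's Flash version-check page as it existed at the time of writing):

hh.exe https://www.adobe.com/software/flash/about/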

There are other applications that can embed Flash content and hence still expose you to the risk of having malicious code executed on your machine. In fact, a team from Fortinet has just posted a short story on their blog that demonstrates this scenario. They describe an experiment where they were able to execute Flash (and "compromise" the machine by launching the calculator application) by embedding Flash exploit code into a Microsoft Office document (PPT) and into an Adobe Reader PDF document.

Completely uninstalling Flash from the system might sound like a better option. Alas, some applications embed their own copy of Flash. I know of 2 such applications - Google Chrome and Adobe Reader. Please let me know if you are aware of any others.

In the meantime, install the latest version of Flash if you need it - and uninstalling Flash altogether is an even better option. Apparently (according to Brian Krebs), it is not that hard to survive without Flash these days. Stay safe!

Thursday, July 16, 2015

RC4 No More

Background

RC4 (Rivest Cipher 4) is a stream cipher. It was designed by Ronald Rivest (of RSA) in 1987. RC4 was (and still is) a commonly used cipher in many software packages. It was also used in wireless standards such as WEP (Wired Equivalent Privacy) and WPA/TKIP.

What is wrong with RC4?

RC4 was a good cipher, but it is time to move on. In 2015, RC4 is weak.

In 2001, Itsik Mantin and Adi Shamir showed

"a major statistical weakness in RC4, which makes it trivial to distinguish between short outputs of RC4 and random strings by analyzing their second bytes. This weakness can be used to mount a practical ciphertext-only attack on RC4 in some broadcast applications, in which the same plaintext is sent to multiple recipients under different keys".
This meant that practical plaintext recovery attacks on RC4 were possible (at least in theory). But until 2013 SSL and TLS ciphers based on RC4 were considered more or less secure and were widely used. Research data from Microsoft suggests that in 2013 almost 43% of web sites either required or preferred the use of RC4.

Several groups focused their research on WEP, with more weaknesses (attributed to RC4) revealed in 2004 (the KoreK and Chopchop attacks) and 2007 (the PTW attack).

In 2011 a group of researchers presented 9 new exploitable correlations in RC4. They demonstrated a practical attack against WEP - a key could be recovered by capturing only 9800 encrypted packets (requiring less than 20 seconds).

In March 2013 another group of researchers found a new attack "that allows an attacker to recover a limited amount of plaintext from a TLS connection when RC4 encryption is used". 

This particular attack made WPA/TKIP weak too. WPA2 has essentially become the only recommended option.

From this moment on, many software vendors recommended reducing our reliance on RC4: on clients it was recommended to disable it, and on servers some companies deprioritised the RC4 ciphers or (the brave ones) disabled them altogether.

At the end of 2013, Microsoft published a KB article with a patch and recommendations on how to disable RC4 via registry settings. This patch did not apply to Windows 8.1 and Windows Server 2012 R2, as they already included the functionality to restrict the use of RC4 - i.e. RC4 is not offered in the first handshake.

In March 2015 we saw a new attack against RC4 in TLS that focussed on recovering user passwords. And although it was more efficient than the previous versions it was still not very practical in real terms. 

The latest announcement (hence the "RC4 No More" in the title) comes from Mathy Vanhoef and Frank Piessens. Their RC4 NOMORE attack "exposes weaknesses in this RC4 encryption algorithm".
We require only 9⋅2^27 requests, and can make a victim generate 4450 requests per second. This means our attack takes merely 75 hours to execute
An attack that only needs 75 hours? That's VERY practical!

And another quote specifically in relation to WPA-TKIP

We can break a WPA-TKIP network within an hour. More precisely, after successfully executing the attack, an attacker can decrypt and inject arbitrary packets sent towards a client. In general, any protocol using RC4 should be considered vulnerable
Hmmm, a scary paragraph, right? If you want to feel safe using your WiFi connection, do not use the TKIP variants - use only the AES ones. WPA2-AES is the best option so far.

It is also worth noting that the early attacks (2001) were passive - an attacker just listened, collected data/packets, and performed analysis. The latest attacks require active interaction (sending packets) between the attacker and the victim, which makes this type of attack quite noisy.

Where to from now?

In short - RC4 must be disabled everywhere.

I would like to provide some practical recommendations.

Clients/Browsers
Some vendors (e.g. Microsoft, Mozilla) have been advocating disabling RC4 support since 2013. If you are still using Windows 8 or below, install the patch (KB2868725) and apply the corresponding registry settings.

Internet Explorer 11: does not offer RC4 ciphers in the initial SSL handshake (meaning that most likely another, non-RC4, cipher will be negotiated with the server). Note: IE11 CAN fall back to RC4 if the initial handshake was unsuccessful - a relatively rare (3.9%) scenario involving systems that can ONLY support RC4. So I would say that if you are on Windows 8.1/IE11 you don't need to do anything special.

Mozilla Firefox: 
  • Navigate to about:config
  • Search for RC4 (i.e. for entries like this one - security.ssl3.ecdh_ecdsa_rc4_128_sha)
  • Disable all those RC4 entries (double-click the line to set value to false)


Chrome:
Chrome allows you to selectively disable specific ciphers via a command line parameter. You will need to launch Chrome with this parameter (the easiest and most convenient way to achieve this is to update the shortcut you use to launch Chrome):

 --cipher-suite-blacklist=0x0004,0x0005,0xc007,0xc011,0x0066,0xc00c,0xc002

I took a full list of supported cipher IDs from this article and selected those with RC4 in their names.

You can check which ciphers are supported by your browser using these 2 links:

You should not see any RC4 ciphers on the results page.

Servers/Network equipment/Load balancers/Firewalls etc
Review your currently enabled cipher suites and ideally remove all RC4 ciphers. If, for any reason, you cannot do this, then deprioritise the RC4 ciphers (i.e. move them down the list) to increase the chance of clients negotiating a non-RC4 cipher with your server.

If your system is accessible from the Internet, you can use the brilliant Qualys SSL Labs SSL Server Test to check which ciphers are enabled and in which order they will be negotiated with the clients. 
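For internal systems that the SSL Labs test can't reach, you can do a quick spot check yourself. Here is a minimal Python sketch (the hostname is a placeholder; note that this only shows the cipher the server negotiates with this particular client, not the full server-side list):

import socket
import ssl

host = "internal.example.com"  # placeholder - replace with your server
ctx = ssl.create_default_context()
ctx.set_ciphers("ALL")  # offer a broad client cipher list so we see what the server picks
with ctx.wrap_socket(socket.create_connection((host, 443)), server_hostname=host) as s:
    print(s.cipher())  # (cipher name, protocol version, secret bits)

If the cipher name contains "RC4", the server is still negotiating RC4 with modern clients.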

If RC4 ciphers are detected you will see this message


Make the Internet a safer place - disable the RC4 ciphers today!






Tuesday, June 23, 2015

Subresource Integrity is coming to the modern browsers near you

Great news - just a couple of months ago (on the 5th of May) the W3C delivered a working draft of the Subresource Integrity (SRI) specification. From the abstract:

This specification defines a mechanism by which user agents may verify that a fetched resource has been delivered without unexpected manipulation.
Why am I excited about this announcement? Well, the key phrase here is "verify that ... delivered without manipulation". This feature alone is not a panacea for all the bad stuff happening on the Internet, but it is an excellent defence-in-depth measure that (in many cases) doesn't cost much time and effort to implement.

Many web sites embed resources like CSS or JavaScript files. Sometimes those resources are hosted on 3rd-party web sites. E.g. you may find it easier to reference a Bootstrap CSS file from http://www.bootstrapcdn.com/ or the latest version of jQuery from http://code.jquery.com/jquery-git2.min.js. But what if one of the sites hosting those resources gets compromised? The security of your web site will be affected.

We have also seen cases of content delivered across the wire being modified on the way to the end users (e.g. ISPs or WiFi hotspot operators injecting ads or governments stealing credentials). By supporting SSL/TLS and loading websites via HTTPS we can protect the content of the web pages. SRI helps to further improve security by allowing a server to supply an additional piece of information to the client (browser) to ensure that this particular resource hasn't been modified/tampered with.

This additional piece of information is officially called "integrity metadata". It is just a base64-encoded hash of the resource. The specification says that servers/clients MUST support SHA-2 hashes (i.e. SHA-256, SHA-384, SHA-512) and MAY support other cryptographic hash functions. By supplying a hash we can (almost - aside from a hash collision scenario) guarantee that the resource hasn't been modified since the moment the hash was generated.

Note: if an attacker controls the web server then he/she can produce valid hashes.

Now, putting it all together - this is what it will most likely look like:

<script src="https://analytics-r-us.com/v1.0/include.js"
        integrity="sha256-SDfwewFAE...wefjijfE"
        crossorigin="anonymous"></script>

Here we can see a standard script tag that embeds an external include.js JavaScript file, plus the newly introduced "integrity" attribute, which specifies a SHA-256 hash of the include.js file. The client (browser) now has the ability to download this resource, recalculate the hash and compare the result with the value supplied in the integrity attribute. If the two values don't match, the resource is discarded (it can't be trusted).
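Generating the integrity value yourself is straightforward - it is just the base64-encoded raw digest of the file, prefixed with the name of the hash algorithm. A minimal Python sketch (the file name is illustrative):

import base64
import hashlib

with open("include.js", "rb") as f:
    data = f.read()

digest = hashlib.sha256(data).digest()  # the raw (binary) digest, not the hex string
print("sha256-" + base64.b64encode(digest).decode())

The same approach works for SHA-384/SHA-512 - just swap the hash function and the prefix.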

It will also be possible to specify multiple hash values for the same resource

<script src="hello_world.js"
   integrity="sha256-+MO/YqmqPm/BYZwlDkir51GTc9Pt9BvmLrXcRRma8u8=
              sha512-rQw3wx1psxXzqB8TyM3nAQlK2RcluhsNwxmcqXE2YbgoDW735o8TPmIR4uWpoxUERddvFwjgRSGw7gNPCwuvJg=="
   crossorigin="anonymous"></script>

In this scenario the client will be able to choose the strongest hash function it supports.

Note: some examples that you might find on the Internet use an older syntax (notice the "ni" part that stands for "named information" as defined in RFC6920):

integrity="ni:///sha-256;C6CB9UYIS9UJeqinPHWTHVqh_E1uhG5Twh-Y5qFQmYg?ct=application/javascript"

Around January 2015 the specification was updated to adopt the same hash format as CSP Level 2, so the "ni" prefix is no longer required.

In addition to link (css) and script tags the future versions of SRI will support other types of resources - e.g. file downloads referenced in A tags or even iframes.

From the information I have, it looks like SRI will be fully supported in Firefox v.42.
It is currently "under consideration" for Microsoft Edge. Most likely it won't be implemented in IE11.

In conclusion I would like to share 2 links with you:

SRI hash generator - will make it easier to calculate hashes

W3C SRI test - will run the test and show how well SRI is supported in your browser of choice.



Wednesday, May 20, 2015

KCodes NetUSB vulnerability (CVE-2015-3036) and a short-term fix for TP-Link

There was a vulnerability disclosed by SEC Consult earlier today that affects a significant number of SOHO routers. NetUSB is a technology that provides "USB over IP" functionality. It was developed by a company called KCodes and has since been adopted by many popular network device manufacturers (including Netgear and TP-Link).

NetUSB runs as a Linux kernel driver. When it is enabled, it launches a server on TCP port 20005 (typically accessible on the LAN only). I have already seen some reports claiming that certain (mis)configurations expose port 20005 on the WAN side (i.e. to the Internet) as well. And this is bad news, because according to the advisory NetUSB suffers from a remote stack buffer overflow. And because it is a kernel driver, a remote attacker exploiting this vulnerability can gain admin privileges on the affected device.

The AES keys used by NetUSB are static - that's not great, as it makes them useless: they won't stop attackers, because the keys are already known to them. And all an attacker needs to do to trigger the overflow is send a computer name longer than 64 bytes. This feels like the 90s all over again.

If port 20005 is not accessible from the outside, the risk is reduced, but the network is still vulnerable to attacks from the inside.

I've got my hands on one of the affected models - TP-Link Archer D9.

A quick test to connect to port 20005: 
telnet 192.168.0.1 20005

reveals that it does indeed listen on port 20005 in the default configuration (i.e. I was able to connect).
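If you have several devices to check, the same test is easy to script. A minimal Python sketch (the address is my router's; adjust for your network):

import socket

def port_open(host, port, timeout=2.0):
    # Returns True if a TCP connection to host:port succeeds
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("192.168.0.1", 20005))  # True means NetUSB is listening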

A web management interface has this section under USB Management -> Print server:


As you can see by default the Print Server is turned on.

Let's click the Stop button...



... and try to connect again:

Ah, much better.

I am not sure if this approach fully mitigates the issue, but it certainly improves the overall situation.
An updated firmware version with a fix from TP-Link is expected around the 25th of May. Until then, I recommend stopping the Print Server.

Sunday, May 17, 2015

Integers - when size does matter

I have just read about an issue affecting all Boeing 787 airplanes:
"We have been advised by Boeing of an issue identified during laboratory testing. The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing GCU to go into failsafe mode. If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.
Wow, this is scary - especially the "regardless of flight phase" bit. I've done some research, and it turns out that the probability of a given aircraft remaining powered for 248 days in a row is VERY low.

In the same document, the FAA (as an interim measure) adds a requirement:
This AD requires a repetitive maintenance task for electrical power deactivation
This essentially means that each plane must be periodically powered off (obviously more often than every 248 days). We all recognise this pattern - a periodic application restart or server reboot when dealing with misbehaving applications (memory leaks etc.)

In fact, several IT-relevant aspects of this story caught my attention.

The "magic" 248 days number

A 32-bit signed integer (i.e. we can only use 31 bits for the value) can store a maximum value of 2,147,483,647.
2,147,483,647 / (24 hours * 60 min * 60 sec) = 24,855 - i.e. a counter incremented once per second would overflow after 24,855 days.
Many sensors connected to the ARINC-429 bus have a 100Hz data sampling rate.

Dividing 24,855 days by 100 we get 248.55 days needed to overflow this integer.
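Or, as a quick back-of-the-envelope check in Python:

INT32_MAX = 2**31 - 1     # 2,147,483,647
TICK_HZ = 100             # counter incremented 100 times per second
SECONDS_PER_DAY = 24 * 60 * 60

print(INT32_MAX / TICK_HZ / SECONDS_PER_DAY)  # ~248.55 days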

We have many examples in IT where integer overflows cause all sorts of trouble (ranging from availability to security issues).

But I wanted to mention one issue that's worth keeping an eye on. Have you ever seen an error message like this?
Server: Msg 8115, Level 16, State 1, Line 1
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.
SQL Server will generate this error when it detects an IDENTITY column overflow.

I used to use a script very similar to the one provided by Red Gate. Give it a go - who knows, you might discover an identity column approaching the limit and prevent an outage.

If the four <redundant devices> were powered up at the same time...

This is another interesting issue. If we have N redundant devices but they all share the same common (time-based) problem, then there is a chance that the fault will occur across all N devices at the same time. This means that a fault escalates into a system-wide failure (i.e. an outage).

It is quite common in IT to implement a staggered approach when introducing changes (OS patches, code rollout etc) - "touching" all systems at the same time is not desirable.

There are certain events that bring application restarts and server reboots back in "sync". E.g.
  • A predictable Patch Tuesday and default settings will result in many computers applying patches and rebooting around 3 AM (in a given time zone). Note: Microsoft decided to move away from the monthly cycle and release patches as they become available - this is a good move in my view.
  • A vulnerability in a common library might force many systems administrators to patch affected software almost at the same time around the world. Plus SSL certificates might need to be reissued too - we've seen an April spike caused by Heartbleed 

This is something that we usually don't think about when patching our systems. So take it into consideration next time you apply patches to a multi-node cluster or a group of network devices operating in HA mode.