Dmitry Kulshitsky: Integers - when size does matter

I have just read about an issue affecting all Boeing 787 airplanes:

"We have been advised by Boeing of an issue identified during laboratory testing. The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing GCU to go into failsafe mode. If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase."

Wow, this is scary - especially the "regardless of flight phase" bit. I've done some research and it turns out that the probability is VERY low for a given aircraft to remain powered for 248 days in a row.

In the same document the FAA adds a requirement as an interim measure:

"This AD requires a repetitive maintenance task for electrical power deactivation"

This essentially means that each plane must be powered off periodically (obviously at intervals shorter than 248 days). We all recognise this pattern: a periodic application restart or server reboot to deal with misbehaving applications (memory leaks etc).

In fact, several points in this story are relevant to IT and caught my attention.

The "magic" 248 days number

A 32-bit signed integer (one bit holds the sign, so only 31 bits are left for the value) can store a maximum value of 2,147,483,647.

2,147,483,647 / (24 hours * 60 min * 60 sec) = 24,855 days, if the counter were incremented once per second.

Many sensors connected to the ARINC-429 bus have a 100 Hz data sampling rate.

Dividing 24,855 by 100 (a counter incremented 100 times per second) we get 248.55 days needed to overflow this integer.
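The arithmetic above is easy to check with a few lines of Python. This is only a sketch of my guess at the mechanism (a signed 32-bit counter ticking 100 times per second); the function names are made up for illustration, and nothing here comes from the actual GCU software.

```python
import ctypes

INT32_MAX = 2**31 - 1            # 2,147,483,647
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def days_until_overflow(ticks_per_second: int, max_value: int = INT32_MAX) -> float:
    """Days of continuous running before a counter starting at 0 exceeds max_value."""
    return max_value / ticks_per_second / SECONDS_PER_DAY


def wrap_int32(value: int) -> int:
    """Show what a signed 32-bit variable would hold for this value (two's complement)."""
    return ctypes.c_int32(value).value


print(round(days_until_overflow(100), 2))  # 248.55
print(wrap_int32(INT32_MAX + 1))           # -2147483648
```

The second line of output shows why an overflow is nasty: one tick past the maximum, the counter does not stop, it jumps to a large negative number.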

We have many examples in IT where integer overflows cause all sorts of trouble (ranging from availability to security).

But I wanted to mention one issue that's worth keeping an eye on. Have you ever seen an error message like this?

Server: Msg 8115, Level 16, State 1, Line 1
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

SQL Server will generate this error when it detects an IDENTITY column overflow.

To find identity columns that are getting close to the limit, I used to use a script that looked very similar to the one provided by Red-gate. Give it a go - who knows, you might be able to discover an identity column approaching the limit and prevent an outage.
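The idea behind such a check is simple: compare the current identity value with the maximum of the column's data type. Below is a minimal Python sketch of that calculation. It is not the Red-gate script; the function names are hypothetical, and it assumes an identity that starts near zero and counts upwards by 1. The current value itself would have to come from the database (for example via IDENT_CURRENT in SQL Server).

```python
# Maximum values of the SQL Server integer types usable for IDENTITY columns.
SQL_INT_MAX = {
    "tinyint": 255,
    "smallint": 2**15 - 1,
    "int": 2**31 - 1,
    "bigint": 2**63 - 1,
}


def identity_usage(current_value: int, data_type: str = "int") -> float:
    """Fraction of the positive range already consumed (assumes increment of 1)."""
    return current_value / SQL_INT_MAX[data_type]


def days_left(current_value: int, rows_per_day: int, data_type: str = "int") -> float:
    """Rough days until overflow at a constant insert rate (an assumption)."""
    return (SQL_INT_MAX[data_type] - current_value) / rows_per_day


# Hypothetical table: identity at 2 billion, growing by 1 million rows a day.
print(f"{identity_usage(2_000_000_000):.1%} used")
print(f"{days_left(2_000_000_000, 1_000_000):.0f} days left")
```

Anything above, say, 80% used is worth a closer look well before the error message shows up.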

If the four <redundant devices> were powered up at the same time...

This is another interesting issue. If we have N redundant devices that all share the same time-based problem, and they were all started at the same time, then the fault will hit all N devices at the same time. Redundancy does not help in this case: the fault escalates to a system-wide failure (i.e. an outage).

It is quite common in IT to implement a staggered approach when introducing changes (OS patches, code rollout etc) - "touching" all systems at the same time is not desirable.
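The same staggering can be applied to restarts, so that redundant nodes never have the same uptime. Here is a minimal sketch, assuming a fixed restart cycle that is spread evenly across the nodes; the function and node names are hypothetical.

```python
from datetime import timedelta


def staggered_offsets(node_names: list[str], cycle: timedelta) -> dict[str, timedelta]:
    """Give each node its own offset within the restart cycle.

    With N nodes, consecutive restarts are cycle / N apart, so no two
    nodes start their uptime counters at the same moment.
    """
    step = cycle / len(node_names)
    return {name: step * i for i, name in enumerate(node_names)}


# Four redundant nodes, each restarted every 120 days.
offsets = staggered_offsets(["node-a", "node-b", "node-c", "node-d"], timedelta(days=120))
for name, offset in offsets.items():
    print(name, "restart offset:", offset.days, "days")
```

With this schedule the four nodes are always 30 days apart in uptime, so a time-based fault would hit them one by one rather than all at once.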

There are certain events that bring application restarts and server reboots back in "sync". E.g.

A predictable Patch Tuesday and default settings will result in many computers applying patches and rebooting around 3 AM (in a given time zone). Note: Microsoft decided to move away from the monthly cycle and release patches as they become available - this is a good move in my view.
A vulnerability in a common library might force many systems administrators around the world to patch affected software at almost the same time. Plus SSL certificates might need to be reissued too - we saw a spike in April 2014 caused by Heartbleed.

This is something that we usually don't think about when patching our systems. So take this into consideration next time you apply patches to a multi-node cluster or a group of network devices operating in HA mode.

1 comment:

Unknown, May 19, 2015 at 4:47 AM
“Hey folks – this is your captain speaking. I hope you had a pleasant flight so far… ummm, well, nothing to worry, just letting you know we will have to run a quick reboot procedure. Essentially, we’ll turn the plane off and on again – real quick, I promise. We’re running on Vista, happens all the time. During the procedure we will glide for these few brief moments, again no biggie, it’s been done before*. Most importantly remain seated with your belts fastened and DO NOT PANIC. Oh, one more thing, if you have to use the toilet – do it now, as you can’t flush it without the power. Okey dokey, we will be kicking off the procedure in few moments, after which we will be serving food, delicious chicken or beef vindaloo. This was your captain speaking…”

Sunday, May 17, 2015
