"We have been advised by Boeing of an issue identified during laboratory testing. The software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, causing GCU to go into failsafe mode. If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase."
Wow, this is scary - especially the "regardless of flight phase" bit. I've done some research and it turns out that the probability is VERY low for a given aircraft to remain powered for 248 days in a row.
In the same document the FAA (as an interim measure) adds a requirement:
This AD requires a repetitive maintenance task for electrical power deactivation
This essentially means that each plane must be periodically powered off (obviously at intervals shorter than 248 days). We all recognise this pattern: a periodic application restart or server reboot when dealing with misbehaving applications (memory leaks, etc.).
In fact, several aspects of this story are relevant to IT and caught my attention.
The "magic" 248 days number
A 32-bit signed integer (one bit holds the sign, so only 31 bits hold the value) can store a maximum value of 2,147,483,647.
If the counter were incremented once per second: 2,147,483,647 / (24 hours * 60 min * 60 sec) = 24,855 days.
Many sensors connected to the ARINC-429 bus have a 100 Hz data sampling rate.
A counter incremented at 100 Hz overflows 100 times sooner: 24,855 / 100 = 248.55 days.
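The arithmetic behind the 248-day figure can be checked with a few lines of Python. This is only a sketch: it assumes, as the reasoning above does, a signed 32-bit counter incremented at 100 Hz. The real GCU implementation is not public, and the names used here are illustrative.

```python
INT32_MAX = 2**31 - 1            # largest value of a signed 32-bit integer
TICK_RATE_HZ = 100               # assumed counter rate (one tick every 10 ms)
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(max_value, ticks_per_second):
    """Days of continuous counting before a counter exceeds max_value."""
    return max_value / ticks_per_second / SECONDS_PER_DAY


print(round(days_until_overflow(INT32_MAX, 1)))                # 24855
print(round(days_until_overflow(INT32_MAX, TICK_RATE_HZ), 2))  # 248.55
```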
We have many examples in IT where integer overflows cause all sorts of trouble (ranging from availability to security).
But I wanted to mention one issue that's worth keeping an eye on. Have you ever seen an error message like this?
Server: Msg 8115, Level 16, State 1, Line 1 Arithmetic overflow error converting IDENTITY to data type int. Arithmetic overflow occurred.
SQL Server will generate this error when it detects an IDENTITY column overflow.
I used to use a script that looked very similar to the one provided by Redgate. Give it a go - who knows, you might discover an identity column approaching the limit and prevent an outage.
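The idea behind such a check is simple enough to sketch. The Python below is not the script mentioned above, just a hypothetical illustration: it takes (table, column, type, last value) rows and flags identity columns that have used more than a given fraction of their type's range. In practice the rows could come from the `sys.identity_columns` catalog view; here they are made-up sample data, and the sketch assumes identity values that start near 1 and grow upwards.

```python
# Upper bounds of the SQL Server integer types commonly used for IDENTITY columns.
TYPE_MAX = {
    "tinyint": 255,
    "smallint": 2**15 - 1,
    "int": 2**31 - 1,
    "bigint": 2**63 - 1,
}


def identity_usage(rows, threshold=0.8):
    """Return (table, column, fraction_used) for columns at or above threshold."""
    flagged = []
    for table, column, type_name, last_value in rows:
        used = last_value / TYPE_MAX[type_name]
        if used >= threshold:
            flagged.append((table, column, round(used, 3)))
    # Most exhausted columns first.
    return sorted(flagged, key=lambda r: r[2], reverse=True)


# Hypothetical sample data: (table, column, type, last identity value).
sample = [
    ("orders", "id", "int", 2_000_000_000),
    ("users", "id", "int", 1_500_000),
    ("events", "id", "bigint", 5_000_000_000),
    ("flags", "id", "tinyint", 250),
]

for row in identity_usage(sample):
    print(row)
```

With the sample data, only `flags.id` and `orders.id` are reported; the `bigint` column is nowhere near its limit even at five billion rows.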
If the four <redundant devices> were powered up at the same time...
This is another interesting issue. If we have N redundant devices but they all share the same (time-based) defect, there is a chance that the fault will hit all N devices at the same time. The fault then escalates to a system-wide failure (i.e. an outage).
It is quite common in IT to implement a staggered approach when introducing changes (OS patches, code rollout etc) - "touching" all systems at the same time is not desirable.
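A small sketch shows why staggering matters for a time-based fault. It assumes every node fails after the same fixed uptime (248 days here, borrowed from the directive) and is never restarted; the node names, dates and the one-week gap are hypothetical.

```python
from datetime import datetime, timedelta

UPTIME_LIMIT = timedelta(days=248)  # continuous-power limit quoted above


def staggered_boot_times(nodes, first_boot, gap):
    """Give each node its own boot time, `gap` apart, instead of one shared time."""
    return {node: first_boot + i * gap for i, node in enumerate(nodes)}


def failure_times(boot_times, limit=UPTIME_LIMIT):
    """Time at which each node would hit the uptime limit if never restarted."""
    return {node: boot + limit for node, boot in boot_times.items()}


nodes = ["gcu-1", "gcu-2", "gcu-3", "gcu-4"]  # hypothetical names
start = datetime(2015, 1, 1)

simultaneous = failure_times({node: start for node in nodes})
staggered = failure_times(staggered_boot_times(nodes, start, timedelta(days=7)))

print(len(set(simultaneous.values())))  # 1: all four nodes fail at the same moment
print(len(set(staggered.values())))     # 4: at most one node fails at any moment
```

Staggering does not remove the defect, but it turns a total outage into a series of single-node faults that redundancy can absorb.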
However, certain events bring application restarts and server reboots back into "sync". For example:
- A predictable Patch Tuesday and default settings will result in many computers applying patches and rebooting around 3 AM (in a given time zone). Note: Microsoft decided to move away from the monthly cycle and release patches as they become available - this is a good move in my view.
- A vulnerability in a common library might force many system administrators around the world to patch the affected software at almost the same time. SSL certificates might need to be reissued too - we saw a spike in April 2014 caused by Heartbleed.
This is something that we usually don't think about when patching our systems. So take this into consideration next time you apply patches to a multi-node cluster or a group of network devices operating in HA mode.