Monday, April 27, 2015

Engineer for resilience

I actually didn't plan to write this blog post. I was reading my Twitter feed when I spotted a discussion about the Indian web site TRAI, which had publicly exposed several thousand e-mail addresses (from net neutrality related e-mails). Another comment said that the web site was down because of the number of people trying to check whether their addresses were in the published lists. And then I noticed this screenshot:

Makes sense - a massive (and unexpected) spike in workload and it is understandable that the backend database server couldn't handle the load. I am obviously speculating here but it doesn't really matter for the purpose of this discussion.

Before we move on - displaying detailed error messages like this one (containing a stack trace, .NET version etc.) is bad from a security perspective. Troy Hunt has a great (and very detailed) explanation of how to set up custom error pages properly.

But what really caught my attention was this line:
   banner.proc_Display_banner(String param) +26
A "display banner" function tries to perform an action against the database. This request fails and the whole page "explodes" and displays a detailed error page. I will speculate again - a banner may mean 2 things in my view. It is either a topbar of the web site or an embedded advertisement of some sort.

In both cases I would argue that this banner is most likely not essential for the core functionality of the web site. And that brings me to the topic that I am very passionate about - engineering for resilience. I've seen it many times - it is way too easy to just throw an exception and give up at this point (hoping that this exception will bubble up and will be handled somewhere - or maybe not).

Think about it - a function call to display a banner fails and takes down the whole page. Or we might even argue - takes down the whole web site as the error happens on the homepage. What would you prefer - an operational web site that lacks a topbar (or doesn't display an advertisement) or a web site that is broken?

For some reason I see this time and time again - developers focus on the "success" story - i.e. the code does what it's supposed to do in ideal conditions, when all databases/services/endpoints are available. And they don't consider scenarios where some of the building blocks their code relies on become unavailable. It's convenient to just assume that this database will always be there, right? And that will be the case 99.9% of the time. An SLA of "three nines" is not ideal but not unheard of (especially for single instance setups). 99.9% also means roughly 43 minutes of downtime in a 30-day month. How will your code behave during those 43 minutes? The key point I'd like to make - we need to expect failures and engineer for resilience. As we move from monolithic systems to distributed/microservices based architectures, we will inevitably rely more and more on various external (to our code) APIs, endpoints and databases. And usually we don't have any information about the availability of a particular endpoint when we are about to call it.
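The monthly downtime figure above follows from simple arithmetic. Here is a small sketch of the calculation, assuming a 30-day month; the function name `downtime_minutes` is made up for this illustration:

```python
def downtime_minutes(availability_percent: float, days: float = 30) -> float:
    """Minutes of downtime allowed by an availability target over `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


# "Three nines" over a 30-day month comes to about 43.2 minutes.
print(round(downtime_minutes(99.9), 1))
```

Each extra nine divides the budget by ten, so 99.99% leaves only a little over 4 minutes a month.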

There is nothing wrong with failures per se. You pick up the phone, call your colleague and receive a busy signal. In a way it is a failure. You failed to connect and invite your colleague for a coffee. But is it a problem? No, as most likely you will "retry" a few minutes later.

Retry is a great (and simple) way to recover from failure. Just retry the last failed call and see if it finishes successfully this time. Limit the number of retries to make sure you don't create an endless retry loop and that you don't overload/DoS a system that might already be struggling.
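A bounded retry could look roughly like the sketch below. The helper name `retry`, its parameters and the doubling delay between attempts are my assumptions for illustration, not something taken from the web site in question; the growing delay is one common way to avoid hammering a struggling backend.

```python
import time


def retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is used up.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    Re-raises the last error once the attempts are exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: let the caller decide what to do next
            sleep(base_delay * 2 ** (attempt - 1))
```

Usage would be something like `retry(lambda: load_banner_from_db("top"))`, where `load_banner_from_db` stands for whatever (hypothetical) call talks to the database. In real code you would normally catch only the errors that are worth retrying (timeouts, refused connections) rather than every exception.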

So what can we do if we retried but were not able to achieve the desired outcome?
Consider degrading gracefully. (Think - going for a coffee alone, without your colleague.) In many cases it is possible to reduce the functionality of a system - to simplify it somehow - without losing that functionality completely. E.g. in the case of a banner we could:
  • Replace a banner that we wanted to display (and couldn't retrieve from the database) with a hardcoded version (that doesn't require access to the database).
  • Replace this banner with a blank image of the right size.
  • Don't display the module that failed at all. This might break the HTML layout but the main page will still be alive.

Wrapping the database call in a try-catch block would have allowed the developer to catch this failure condition and decide what to do next - how to handle this failure scenario.
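To make this concrete, here is a minimal sketch of the idea in Python (the site in question runs on .NET, so this only illustrates the pattern). All names here (`render_banner`, `render_page`, `FALLBACK_BANNER`) are hypothetical, and `load_banner` stands for whatever function performs the database call.

```python
import logging

logger = logging.getLogger("banner")

# Hypothetical hardcoded banner that needs no database access.
FALLBACK_BANNER = '<div class="banner">Welcome</div>'


def render_banner(load_banner, fallback=FALLBACK_BANNER):
    """Return the banner HTML, degrading to `fallback` if loading fails."""
    try:
        return load_banner()
    except Exception:
        # Record the failure for the operators, but keep the page alive.
        logger.exception("Banner lookup failed; serving fallback")
        return fallback


def render_page(load_banner, body_html):
    """Build the page; a broken banner never takes the body down with it."""
    return render_banner(load_banner) + "\n" + body_html
```

Passing `fallback=""` corresponds to the third option above (not displaying the failed module at all). The important part is that the failure is logged rather than silently swallowed, so somebody still finds out that the database is in trouble.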

I will write a complementary blog post about how "engineering for resilience" is done in space technology - the approach, ideas and principles that can teach us IT people a few things.

Please make sure you consider all possible outcomes - no matter how improbable they might be. Databases go down, processes run out of threads and memory, network failures/packet drops happen. We need to be prepared for all of these scenarios. Code that can survive failures is a sign of a high class/experienced developer.
