Tuesday, January 12, 2016

Resilience - Part 2 - SLAs explained


In Part 1 we covered the basics. Now let's talk about real-world situations. In the IT world we often talk about Service Level Agreements (SLAs).

An SLA is an agreement or document that describes the expected level of service, including the specific metrics used to measure the quality of service provided and, potentially, the penalties for not achieving the agreed levels of service.

SLAs can be both formal (e.g. between an external 3rd party service provider and a client) and informal (e.g. between 2 internal departments or teams within the organisation).

Service providers usually offer their customers a choice of service quality/uptime levels. It is natural for customers to expect to pay more for higher levels of system availability.

How do we specify SLAs?

For example, you might get a server running in a data centre, and your hosting provider will promise a 99.9% SLA. This is a typical SLA for a single-server setup. But what does it mean? How reliable is this server going to be? "99.9%" ("three nines") means that the hosting provider guarantees that this server will be up and running (i.e. will be available) 99.9% of the time.

If we take a "standard" month of 30 days, these "nines" translate into the following amounts of allowed downtime:
Availability            Downtime per 30-day month
99% (two nines)         7 hours 12 minutes
99.9% (three nines)     43 minutes 12 seconds
99.95%                  21 minutes 36 seconds
99.99% (four nines)     4 minutes 19 seconds
99.999% (five nines)    26 seconds
99.9999% (six nines)    3 seconds
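
The figures above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `allowed_downtime` and the choice to round to whole seconds are my own assumptions for illustration, not part of any provider's SLA.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, days: int = 30) -> timedelta:
    """Return the downtime permitted over `days` days at the given availability.

    The result is rounded to the nearest whole second (an assumption made
    here so that the output is easy to read).
    """
    period_seconds = days * 24 * 60 * 60
    downtime_seconds = period_seconds * (100 - availability_percent) / 100
    return timedelta(seconds=round(downtime_seconds))


if __name__ == "__main__":
    # The availability levels discussed in this post.
    for pct in (99, 99.9, 99.95, 99.99, 99.999, 99.9999):
        print(f"{pct}% -> {allowed_downtime(pct)}")
```

Passing a different `days` value (e.g. 31 or 365) shows how the downtime budget changes with the length of the measurement period.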

You can use a very convenient uptime calculator if you want to experiment with some other numbers.

To give you a few examples, let's see what some of the most popular cloud providers commit to. For simplicity, let's check the SLAs for single instances/VMs.

                     AWS                                        Azure
Service commitment   99.95% during any monthly billing cycle    99.95%
Service credit       <99.95% - 10%                              <99.95% - 10%
                     <99% - 30%                                 <99% - 25%
The actual SLA       AWS SLA                                    Azure VM SLA
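
To make the credit tiers listed above concrete, here is a small sketch that looks up the service credit for a measured monthly uptime. It assumes that the first set of figures belongs to AWS and the second to Azure (matching the order of the SLA links), and it ignores everything else in the real SLA documents (claim procedures, exclusions and so on). The names `CREDIT_TIERS` and `service_credit` are hypothetical.

```python
# Credit tiers as (uptime threshold in %, credit in %), taken from the
# comparison above. Assumption: first set of figures = AWS, second = Azure.
# Tiers are ordered from the lowest threshold up, so the first match
# is the largest applicable credit.
CREDIT_TIERS = {
    "AWS": [(99.0, 30), (99.95, 10)],
    "Azure": [(99.0, 25), (99.95, 10)],
}


def service_credit(provider: str, monthly_uptime_percent: float) -> int:
    """Return the service credit (in % of the bill) for a measured uptime."""
    for threshold, credit in CREDIT_TIERS[provider]:
        if monthly_uptime_percent < threshold:
            return credit
    return 0  # The commitment was met, so no credit is due.


if __name__ == "__main__":
    for uptime in (99.97, 99.9, 98.5):
        for provider in CREDIT_TIERS:
            print(f"{provider} at {uptime}% uptime: {service_credit(provider, uptime)}% credit")
```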

As a side note, it is also interesting to see how AWS and Azure define "downtime" or being "unavailable".

AWS:
"Unavailable" and "Unavailability" mean:
For Amazon EC2, when all of your running instances have no external connectivity.

Azure:
Downtime - The total accumulated minutes that are part of Maximum Available Minutes that have no External Connectivity.

So both vendors define being unavailable as having no external connectivity.

I'd like to mention another consideration that I was made aware of while visiting Telstra's GSOC in Melbourne. Imagine that a telco dropped just 1 packet in a whole month, so a particular client never received that one packet. The telco might think its availability was nearly 100% for that month. But from the client's perspective the same situation may have a very different outcome. Some systems (especially old legacy ones) cannot tolerate the loss of a single packet and enter an error state. To recover, engineers at the client site might be forced to go through an hour-long process of restarting their systems in a predefined order. Just think about it - a single lost packet can cause an hour-long outage on the client side. Straight away this client won't be able to achieve a 99.9% SLA for that month, because a 30-day month at 99.9% allows only 43 minutes 12 seconds of downtime. This may sound like an extreme case but trust me - these things do happen in the real world.
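
The arithmetic behind the lost-packet scenario is easy to check. The sketch below converts an amount of downtime into an availability percentage for a 30-day month; the function name `availability_percent` is a hypothetical helper, and the one-hour figure is simply the recovery time from the example above.

```python
def availability_percent(downtime_minutes: float, days: int = 30) -> float:
    """Return availability (in %) for a period of `days` days with the given downtime."""
    total_minutes = days * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    # One hour-long recovery triggered by a single lost packet.
    client_view = availability_percent(60)
    print(f"Client-side availability: {client_view:.3f}%")
    print(f"Meets 99.9%? {client_view >= 99.9}")
```

A single 60-minute outage leaves the client at roughly 99.86% for the month, which is below the 99.9% target even if nothing else goes wrong.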
