Most likely you have already experienced system downtime, either on an application you have worked on or on some service that you consume. It has happened to Amazon, Netflix, Microsoft, Salesforce, etc. How long did you have to wait? How long did your users?
If you’re building an application and you ask your boss (or your client) what’s the percentage of time the application should be working correctly, most likely you’ll get an answer like “always” or “100% (or more)”.
Even though we don’t want bad things to happen, they surely will: bugs, attacks, power outages, natural disasters, etc. are all scenarios that might affect a system. Expecting them not to happen is naive; it’s better to think about and plan for failure, since it is inevitable.
Availability is the capability of an application to be available after some problem occurs. Again, we are not saying there will be no problems; we are asking how effectively we will be able to recover from them. This means that we need to a) identify the potential points of failure and b) create a strategy to prevent an error from becoming a failure that affects the user (in other words, a tree falling in the forest when no one is around makes no sound).
If a tree falls in a forest…
Going back to the question regarding the percent of time the application needs to work correctly, it is important to understand how it translates into downtime.
|Availability|Downtime per year|
|99%|3d 15h 36m|
|99.9%|8h 45m 36s|
|99.99%|52m 34s|
|99.999%|5m 15s|
This means that if your target is 99.99% you need to be able to recover in a little less than an hour if you have one incident a year, in less than 30 minutes each if you have two, and so on. It is important to note that these downtime values don’t include planned outages, like when updating the application, patching the OS, migrating the database, etc.
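The arithmetic behind these figures is simple enough to sketch in a few lines of Python. This is only an illustration: the function names are mine, and it assumes a 365-day year and that the yearly budget is split evenly between incidents.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_per_year(availability_pct: float) -> timedelta:
    """Return the yearly downtime allowed by an availability target (in %)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


def budget_per_incident(availability_pct: float, incidents: int) -> timedelta:
    """Split the yearly downtime budget evenly across a number of incidents."""
    if incidents < 1:
        raise ValueError("incidents must be at least 1")
    return downtime_per_year(availability_pct) / incidents


# Print the yearly budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_per_year(target)}")
```

Calling `budget_per_incident(99.99, 2)` gives the recovery time you can afford per incident if you expect two incidents in a year.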
Service Level Agreements
Having a target availability value (%) leads to the question: what happens if it is not reached? This is where SLAs come into play when working with third parties. An SLA is an agreement between two parties that declares the promised uptime and the penalty or credit applied if that promise is not met. For example, Microsoft declares in their SLA for Azure App Service that you might be eligible for a credit of 10% if they do not maintain an availability of at least 99.95%, and of 25% if it is below 99%.
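To make the credit tiers concrete, here is a minimal sketch that maps a measured availability to a credit. The thresholds are the ones quoted above; the function names and the idea of measuring over a 30-day month are my own assumptions, and the actual SLA document is the authority on how uptime is calculated.

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Return the measured availability (in %) over a period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100 * (total_minutes - downtime_minutes) / total_minutes


def service_credit_pct(measured_availability_pct: float) -> int:
    """Map a measured availability to a credit, using the tiers quoted in the text."""
    if measured_availability_pct < 99.0:
        return 25
    if measured_availability_pct < 99.95:
        return 10
    return 0


# Hypothetical example: 90 minutes of downtime in a 30-day month.
minutes_in_month = 30 * 24 * 60
measured = availability_pct(minutes_in_month, 90)
print(f"{measured:.3f}% -> {service_credit_pct(measured)}% credit")
```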
The “easiest” problem to think of is handling application exceptions properly (On Error Resume Next, anyone?), but there are many other threats that can affect your application’s availability:
- DDoS attacks taking your application server down.
- SQL injection wiping out your database.
- Floods, hurricanes and other natural disasters causing a massive power outage.
Even though these are security or external concerns, they produce the same result: making your application unavailable. Users don’t care about why, but you should; you need to think of ways to prevent this from happening. There are many possible strategies to use, but they can mainly be categorized as:
- Prevention: Ensure you are thinking of potential errors before they appear; this means handling exceptions properly, identifying single points of failure, using fault trees to identify potential error chains, removing elements that might cause a problem, etc.
- Detection: Ensure you are aware of the error at the proper moment; monitoring tools like Nagios are a good example.
- Correction: Once the problem has arisen, know what is needed to solve it, such as restoring a backup, moving to a different server, turning on a power plant, etc.
(If you are working with Microsoft Azure, you have Security Center to help you identify threats across the three categories.)
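As an illustration of the detection category, here is a minimal sketch of a health monitor that raises an alert only after several consecutive failed checks, so a single blip does not page anyone. The class name, the `probe` callable and the threshold of three are all hypothetical; a real tool such as Nagios does far more than this.

```python
from typing import Callable


class HealthMonitor:
    """Track consecutive failed health checks and decide when to alert."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.probe = probe  # hypothetical check, returns True when healthy
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True if an alert should be raised."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that crashes is treated as a failed check.
            healthy = False
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


# Simulated probe results: one healthy check, three failures, then recovery.
simulated = iter([True, False, False, False, True])
monitor = HealthMonitor(lambda: next(simulated), failure_threshold=3)
alerts = [monitor.check() for _ in range(5)]
print(alerts)
```

In practice the probe would be something like an HTTP request to a status endpoint, and `check` would run on a schedule.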
Backups always succeed, it’s Restores that fail. Test them. http://t.co/nzoti3LSur
— Scott Hanselman (@shanselman) July 3, 2014
As you can see, this means that you may need to think about more than just your code. For example, what happens if your application handles all the exceptions properly but there is a hardware failure on your server or data center? Or someone has the database credentials and runs “delete from table” with no filter clause on your production server? Even though they seem like extreme cases, I’ve seen them happen before.
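The restore advice above can be rehearsed on a small scale. The following sketch uses Python’s built-in `sqlite3` module with in-memory databases; the `orders` table and the helper names are made up for the example, and a real drill would of course run against your actual backup tooling.

```python
import sqlite3


def copy_database(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy every page of `source` into a fresh in-memory database."""
    target = sqlite3.connect(":memory:")
    source.backup(target)
    return target


def row_count(conn: sqlite3.Connection) -> int:
    """Count the rows in the hypothetical `orders` table."""
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


# Set up a toy "production" database.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
live.executemany("INSERT INTO orders (total) VALUES (?)", [(10.0,), (25.5,), (7.25,)])
live.commit()

# Take a backup and remember what it should contain.
backup = copy_database(live)
expected_rows = row_count(live)

# The incident: a delete with no filter clause.
live.execute("DELETE FROM orders")
live.commit()

# The restore drill: restore into a scratch database and verify the contents.
restored = copy_database(backup)
print(row_count(live), row_count(restored), row_count(restored) == expected_rows)
```

The point of the drill is the last step: a backup only counts once you have restored it somewhere and checked that the data is really there.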
As I’ve mentioned in other posts, architecture is about balance, and this is no exception. Aiming for an availability target implies applying strategies that come at a cost. For example, high availability typically means five nines (99.999%), which allows a little more than five minutes of total downtime per year. Reaching it requires implementing several strategies that will increase the cost of the project. There are also trade-offs: you cannot aim for higher availability without impacting other quality attributes, like modifiability or maintainability.
Not every system requires that level of uptime; I think many of them can do with two or three nines. There is also the context to consider: a payroll system failing on payday is not the same as it failing at any other moment.
So next time you are discussing availability with a client, be sure to ask them “how many nines?”.