Recently machinehead asked me how to build high availability. For the less technically inclined, high availability is a measure of the reliability of a system, sometimes expressed as "five nines" or 99.999% uptime. Basically, it means the system has a total downtime of no more than about five minutes per year.
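To see where the "five minutes" comes from, here is a quick back-of-the-envelope calculation. It is only a sketch: the function name is mine, and I am assuming an average year of 365.25 days.

```python
# Minutes in an average year (assumption: 365.25 days).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(availability)
        print(f"{availability}% -> {budget:.2f} minutes/year")
```

Five nines works out to a little over five minutes a year, and each extra nine cuts the budget by a factor of ten.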
Having done a course on distributed systems, I decided to take a crack at it. There are three main issues that need to be addressed:
Hardware
Everyone thinks high availability means only hardware. It's not. It's only the beginning. Firstly, you need at least two of everything, located geographically apart. For example, most people believe RAID is insurance against data loss. But RAID only protects against one or two drive failures. What happens if the PSU shorts and pumps 240V instead of 5V into the drives? That will fry every disk in the array. Plus, rebuilding a disk takes time, which isn't high availability. The duplicates should also be located geographically apart to protect against fire, earthquake, tsunami, plane crash, etc.
Secondly, you need smooth, automatic switchover in case of a crash. For example, use a heartbeat to monitor the servers and change the IP at the DNS server to point to the failover server. This keeps the switchover smooth, and you don't need to make any other server aware of the failure.
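Here is a rough sketch of the heartbeat idea. Everything in it is made up for illustration: the FailoverMonitor class, the three-second timeout, and the two addresses (taken from the IP ranges reserved for documentation). The actual DNS update is left as a comment, since that depends entirely on your DNS setup.

```python
class FailoverMonitor:
    """Tracks heartbeats and decides which server clients should reach."""

    def __init__(self, primary, standby, timeout_seconds):
        self.primary = primary
        self.standby = standby
        self.timeout = timeout_seconds
        self.last_beat = {}  # server address -> time of its last heartbeat
        self.active = primary

    def heartbeat(self, server, now):
        """Record that `server` reported in at time `now` (seconds)."""
        self.last_beat[server] = now

    def is_alive(self, server, now):
        """A server is alive if it reported in within the timeout window."""
        last = self.last_beat.get(server)
        return last is not None and now - last <= self.timeout

    def check(self, now):
        """Return the address that should be serving traffic at time `now`."""
        if not self.is_alive(self.active, now):
            other = self.standby if self.active == self.primary else self.primary
            if self.is_alive(other, now):
                # A real setup would update the DNS record here.
                self.active = other
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor("192.0.2.10", "198.51.100.10", timeout_seconds=3.0)
    monitor.heartbeat("192.0.2.10", now=0.0)
    monitor.heartbeat("198.51.100.10", now=0.0)
    print(monitor.check(now=1.0))  # primary is still healthy
    monitor.heartbeat("198.51.100.10", now=4.0)  # only the standby reports in
    print(monitor.check(now=5.0))  # primary has gone silent, standby takes over
```

Time is passed in explicitly rather than read from the clock, which keeps the sketch easy to test. Note that the monitor only switches if the other server is itself alive, so it never fails over to a dead machine.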
Software
Software should be built from the ground up with high availability in mind, meaning it should be scalable and clusterable. The best way to do this is to make it stateless. For example, when you click on "2" to go to the second results page on Google, the second page doesn't necessarily have to be processed by the same server that produced the first page. It can be done by any server. This is the power of stateless.
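A toy illustration of the stateless idea, and certainly not how Google does it: the request carries everything needed (the query and the page number), so the handler keeps no memory of earlier requests. The INDEX list, the handle_request function and the server names are all invented for the example.

```python
# Stand-in for a search index that every server can read.
INDEX = [f"result-{i}" for i in range(1, 26)]


def handle_request(server_name, query, page, page_size=10):
    """Build one results page using only what is in the request itself."""
    matches = [item for item in INDEX if query in item]
    start = (page - 1) * page_size
    return {
        "served_by": server_name,
        "page": page,
        "results": matches[start:start + page_size],
    }


if __name__ == "__main__":
    first = handle_request("server-a", "result", page=1)
    # Page 2 goes to a different server, which needs nothing from server-a.
    second = handle_request("server-b", "result", page=2)
    print(first["served_by"], first["results"][0])
    print(second["served_by"], second["results"][0])
```

Because no server holds session state, a load balancer can send each click to whichever machine is up, and losing one machine loses nothing the others cannot recompute.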
Data
Data is the hardest problem to solve. If you fragment and replicate the data for performance and scalability, you need to address sync issues. How would you lock and commit across multiple partitions? How would you detect deadlocks? Two-phase locking and two-phase commit are not easy answers. eBay takes down the site every Monday from 12 to 4 AM to archive sold items so as to keep the fragments small (smaller fragments mean faster searching by the database). Yes, these 4 hours are "planned" downtime, and no, planned downtime still counts against high availability.
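For the curious, here is the happy path of two-phase commit as a bare-bones, in-memory sketch. The Participant class and the two_phase_commit function are my own illustration, with each participant standing in for one data partition. It deliberately leaves out the hard parts (timeouts, a crashed coordinator, durable logs), which is exactly why it is not an easy answer.

```python
class Participant:
    """One data partition taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a partition that votes "no"
        self.staged = None
        self.data = {}

    def prepare(self, key, value):
        """Phase 1: stage the write and vote on whether it can be committed."""
        if not self.can_commit:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        """Phase 2: make the staged write permanent."""
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        """Phase 2 (failure): throw away any staged write."""
        self.staged = None


def two_phase_commit(participants, key, value):
    """Commit on every participant, or on none of them."""
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


if __name__ == "__main__":
    a, b = Participant("partition-a"), Participant("partition-b")
    print(two_phase_commit([a, b], "item-1", "sold"))
    broken = Participant("partition-c", can_commit=False)
    print(two_phase_commit([a, broken], "item-2", "sold"))
```

The first call commits on both partitions. The second aborts everywhere because one partition voted no, so partition-a never sees item-2.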