Simplifying cloud: Reliability

This content is 14 years old and may not reflect reality today nor the author’s current opinion. Please keep its age in mind as you read it.

Reliability in cloud computing is a very simple concept which I’ve explained in many presentations but never actually documented:

Traditional legacy IT systems consist of relatively unreliable software (Microsoft Exchange, Lotus Notes, Oracle, etc.) running on relatively reliable hardware (Dell, HP, IBM servers, Cisco networking, etc.). Unreliable software is not designed for failure and thus any fluctuations in the underlying hardware platform (including power and cooling) typically result in partial or system-wide outages. In order to deliver reliable service using unreliable software you need to use reliable hardware, typically employing lots of redundancy (dual power supplies, dual NICs, RAID arrays, etc.). In summary:

unreliable software
unreliable hardware

Cloud computing platforms typically prefer to build reliability into the software such that it can run on cheap commodity hardware. The software is designed for failure and assumes that components will misbehave or go away from time to time (which will always be the case, regardless of how much you spend on reliability – the more you spend the lower the chance but it will never be zero). Reliability is typically delivered by replication, often in the background (so as not to impair performance). Multiple copies of data are maintained such that if you lose any individual machine the system continues to function (in the same way that if you lose a disk in a RAID array the service is uninterrupted). Large scale services will ideally also replicate data in multiple locations, such that if a rack, row of racks or even an entire datacenter were to fail then the service would still be uninterrupted. In summary:

unreliable software
unreliable hardware

Asked for a quote for Joe Weinman’s upcoming Cloudonomics: The Business Value of Cloud Computing book, I said:

The marginal cost of reliable hardware is linear while the marginal cost of reliable software is zero.

That is to say, once you’ve written reliability into your software you can scale out with cheap hardware without spending more on reliability per unit, while if you’re using reliable hardware then each unit needs to include reliability (typically in the form of redundant components), which quickly gets very expensive.

The other two permutations are ineffective:

Unreliable software on unreliable hardware gives an unreliable system. That’s why you should never try to install unreliable software like Microsoft Exchange, Lotus Notes, Oracle etc. onto unreliable hardware like Amazon EC2:

unreliable software
unreliable hardware

Finally, reliable software on reliable hardware gives a reliable but inefficient and expensive system. That’s why you’re unlikely to see reliable software like Cassandra running on reliable platforms like VMware with brand name hardware:

reliable software
reliable hardware

Google enjoyed a significant competitive advantage for many years by using commodity components with a revolutionary proprietary software stack including components like the distributed Google File System (GFS). You can still see Google’s original hand-made racks built with motherboards laid on cork board at their Mountain View campus and the computer museum (per image above), but today’s machines are custom made by ODMs and are a lot more advanced. Meanwhile Facebook have decided to focus on their core competency (social networking) and are actively commoditising “unreliable” web scale hardware (by way of the Open Compute Project) and software (by way of software releases, most notably the Cassandra distributed database which is now used by services like Netflix).

The challenge for enterprises today is to adopt cheap reliable software so as to enable the transition away from expensive reliable hardware. That’s easier said than done, but my advice to them is to treat this new technology as another tool in the toolbox and use the right tool for the job. Set up cloud computing platforms like Cassandra and OpenStack and look for “low-hanging fruit” to migrate first, then deal with the reticent applications once the “center of gravity” of your information technology systems has moved to cloud computing architectures.

P.S. Before the server huggers get all pissy about my using the term “relatively unreliable software”, this is a perfectly valid way of achieving a reliable system — just not a cost effective one now “relatively reliable software” is here.