Cloud computing – Running your stuff in someone else’s data center.
It was supposed to be the Valhalla of systems management. For many who do not care about uptime, it is. The real sex appeal of the whole thing is that you can save tens of thousands of dollars in capital expenses for hardware and network purchases, and then potentially hundreds of thousands of dollars in employment costs, by outsourcing hardware and network maintenance to a third party. The trade-off, of course, is that you lose control of your own network and must rely on a third-party underpinning contract for your company’s SLA.
For some companies, this is not a problem – especially startups that want to prove a product with little investment money, or SMBs that just do not want to host in a data center. Those companies can accept the hit to uptime in exchange for a known MRC (monthly recurring charge), or at least one that scales linearly. There is the initial cost of creating your server images and figuring out the nuances of the hosting solution, but after that it should be a cakewalk, right?
Well, many companies are finding out that the “Cloud” is not all that it is cracked up to be – at least not yet.
SPOFs
The first thing that any good Systems Administrator will think of is the SPOF (Single Point of Failure). In the Cloud, a SPOF is not thought of in the traditional sense of a single server failure, database failure or network device failure. Why would it be, since you are paying to host in a virtualized environment? The SPOFs we talk about in the Cloud are service-based SPOFs. A service-based SPOF is one in which the entire service provided by the third party goes down. That means all customers using that service, or a group of customers using services in a single data center, are down for the duration of the outage. This is evident from the several well-known outages of Amazon’s S3 storage service, which have taken down companies like SmugMug for extended periods. These types of outages range from the preventable to the inevitable, but most of them come with the territory when you have a 99.9% SLA, which is your underpinning contract with Amazon. That means that you can _never_ offer a higher SLA for your own service, because that contract is the weakest link in the chain.
A 99.9% SLA means a single service can be down (hard down) for no more than 43 minutes per month. That sounds pretty reasonable, right? Well, look further and you will see the rub. Amazon, for instance, makes a “reasonable effort” to guarantee 99.9% uptime. If you dip below that, all the way down to 99% for the month, they will give you a 10% credit on your monthly fees as compensation. Anything lower than 99% nets you a 25% credit on your monthly bill for the time lost. That 0.9%, however, is a big change. Going from 99.9% to 99% takes you from 43 minutes per month to 432 minutes per month; that is 7.2 hours of downtime. Those 7.2 hours of downtime will net you a 10% savings on your monthly bill, but because these services are priced far below what you may be charging your customers, your loss will most likely be significantly higher than the 10% savings over that 7.2 hour period.
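To make that arithmetic concrete, here is a minimal Python sketch. It assumes a 30-day month (which is where the 43 and 432 minute figures come from) and uses the credit tiers exactly as described above. The function names, the $1,000 monthly bill and the $500 per hour of lost revenue are made-up illustrations, not Amazon's numbers or anyone's real bill.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per month that a given uptime percentage permits."""
    return MINUTES_PER_MONTH * (1 - uptime_percent / 100)


def service_credit_percent(uptime_percent):
    """Credit tiers as described in the text: 10% below 99.9%, 25% below 99%."""
    if uptime_percent >= 99.9:
        return 0
    if uptime_percent >= 99.0:
        return 10
    return 25


def net_loss(monthly_bill, revenue_per_hour, downtime_hours, credit_percent):
    """Revenue lost during the outage minus the credit you get back."""
    lost_revenue = revenue_per_hour * downtime_hours
    credit = monthly_bill * credit_percent / 100
    return lost_revenue - credit


if __name__ == "__main__":
    for uptime in (99.9, 99.0):
        minutes = allowed_downtime_minutes(uptime)
        print(f"{uptime}% uptime allows {minutes:.1f} minutes of downtime")

    # Hypothetical figures: $1,000 monthly bill, $500 of revenue lost per hour.
    hours_down = allowed_downtime_minutes(99.0) / 60
    loss = net_loss(1000, 500, hours_down, service_credit_percent(99.0))
    print(f"Net loss after the credit: ${loss:,.2f}")
```

With those hypothetical inputs the credit is $100 against roughly $3,600 of lost revenue, which is the whole point: the credit scales with what you pay the provider, not with what the outage costs you.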
Since Amazon offers many services as part of its AWS cloud platform, you have many ways in which your company can die. The more services you use, the greater the number of SPOFs you have. Want to use their EBS (Elastic Block Store) as your database storage? You can have 43 minutes of downtime there as well – except _that_ 43 minutes of downtime will probably take your entire application stack down. That means you will still pay for the S3 and EC2 hosting costs but will get a 10% discount only on the EBS portion of the bill.
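A rough way to see how the SPOFs stack up is to multiply the availabilities together. The sketch below assumes that your application needs every service to be up at once, that the services fail independently, and that each one delivers exactly 99.9%. All three are simplifying assumptions for illustration, and the function names are my own.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def composite_availability(availabilities):
    """Availability of a stack that needs every listed service up at once.

    Assumes the services fail independently of each other.
    """
    return prod(availabilities)


def downtime_minutes(availability):
    """Expected minutes of downtime per month for a given availability (0..1)."""
    return MINUTES_PER_MONTH * (1 - availability)


if __name__ == "__main__":
    # Assumption: each service sits right at 99.9%.
    stack = {"EC2": 0.999, "S3": 0.999, "EBS": 0.999}
    combined = composite_availability(stack.values())
    print(f"Combined availability: {combined * 100:.2f}%")
    print(f"Downtime budget: {downtime_minutes(combined):.1f} minutes per month")
```

Three services at 99.9% each work out to about 99.7% combined, or roughly 129 minutes of downtime a month instead of 43.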
The more you slice it, the more you see that it does not make sense to put your entire enterprise on a cloud platform quite yet – from the SPOF point of view.
Network Obfuscation
The first thing a good Network Administrator will think of is probably “How can I manage the interfaces for my devices?” The true answer, for now at least, is – you cannot. Companies like Amazon and ServePath give you some limited tools to control your ACLs and host entries, but most of these companies are severely lacking in the broad network support that many of us have come to rely on. Things like intra-VLAN ACLs and complicated application layer rules just are not there. You are also at the whim of whatever NLB (network load balancing) method the company chooses for you for redundancy.
If you are used to setting up iRules for your F5 or have a layer 7 switch currently, do not expect AWS or GoGrid to meet your expectations for flexibility and features on the network level. Instead, many companies are choosing to invest more money in doing these types of tasks on the software level than on the network level. While that methodology has a life of its own, and its own merits as well, nothing beats an appliance that is dedicated to a task. You could write an application to take the place of the iRules feature of an F5, but I will bet you a paycheck it will take longer, cost more money and will not be as reliable as the appliance version with SSL accelerators.
In addition, simple things (I say that with my tongue firmly planted in my cheek) like DDoS attacks can go untreated for hours while services crash around them. Take, for instance, Bitbucket’s latest outage, which involved a DDoS attack and Amazon’s EBS offering. One could say they should have been using a network monitoring solution, or that Amazon will learn from this and build better tools – both of which are true. Both options, however, are significantly limited by _what_ you can access and when. If your sites are being DDoS’d over the network and you need to connect to the machines over that same network, you are a little assed out, are you not? Conversely, if your servers are in your own data center, the simplest option is to take the outage, block the traffic at the edge or firewall, and then deal with the servers on your LAN. Cloud offerings do not currently handle these issues gracefully. An iLO-type out-of-band approach would be very nice – and it might be coming – but we will need to see a bit of movement before we are ready to move our services over.
The simple fact is, cloud computing is a commodity-based service with commodity options. Those options are growing every day, but some of the significant limitations are still right in the face of the best admins around.
The better idea, as far as I can tell, is to use AWS and GoGrid as a supplement to your existing services. If you look at cloud computing as a very cheap, somewhat reliable, commodity-based scaling platform, you can easily use these services to spin up another fleet of servers based on load. Why not use the F5 that you already have and add new nodes on demand in the remote data center? Why not use AWS or GoGrid as your DR site? Why not use GoGrid or AWS or GoogleApps to prove that new Proof of Concept system without shelling out cash for new servers? These are some of the ideas that I think Cloud computing is good for _now_, and ones that make sense _now_ for companies that need an extra percent of uptime in their services.
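As a rough illustration of the "spin up more servers based on load" idea, here is a small Python sketch that works out how many extra cloud nodes to add behind an existing load balancer. Everything in it is hypothetical: the function name, the per-node capacity and the 20% headroom are assumptions for illustration, and it does not call any real AWS, GoGrid or F5 API.

```python
import math


def cloud_nodes_needed(requests_per_sec, local_nodes, capacity_per_node,
                       headroom_percent=20):
    """Extra cloud nodes needed so total capacity covers load plus headroom.

    capacity_per_node is the requests per second one node can serve, and it
    is assumed to be the same for local and cloud nodes.
    """
    required_capacity = requests_per_sec * (100 + headroom_percent) / 100
    total_nodes = math.ceil(required_capacity / capacity_per_node)
    # Never return a negative count when the local fleet already has room.
    return max(0, total_nodes - local_nodes)


if __name__ == "__main__":
    # Hypothetical fleet: 4 local nodes, each good for 200 requests per second.
    for load in (500, 1000, 2500):
        extra = cloud_nodes_needed(load, local_nodes=4, capacity_per_node=200)
        print(f"{load} req/s: add {extra} cloud node(s)")
```

The local fleet stays the baseline and the cloud only absorbs the overflow, so a provider outage costs you burst capacity rather than the whole service.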
With the new VPN and SDC services from both AWS and GoogleApps, things are getting better, but even those options are zeroing in on the practical uses that I outlined above – not on hosting your entire company in the cloud.
As with anything, too much of a single thing is bad and excess is never a good policy to follow. A good IT shop should never put all of its eggs in one basket (from backup solutions to a single technology play), so why would you let your entire company ride on hosted solutions? Well, that is, unless you build your company around a poor SLA model and just factor that into the product.
http://smugmug.wordpress.com/2008/07/20/amazon-s3-outage-causes-smugmug-outage/
http://blog.bitbucket.org/2009/10/04/on-our-extended-downtime-amazon-and-whats-coming/
http://www.theregister.co.uk/2008/02/15/amazon_s3_outage_feb_2008/
http://www.gogrid.com/index.v5.php
http://aws.amazon.com/