You may have read recently that Amazon suffered service outages of varying severity in its cloud offering, lasting a full 24 hours. Many sites, including some belonging to large organizations, had serious performance issues. Many businesses that rely on cloud offerings hosted on the Amazon infrastructure were unable to transact their business properly for a full business day.
I'm not really interested in why this happened, or what happened. I have long been nervous that too many are jumping too quickly to cloud offerings without understanding the limitations, challenges, and realities. The simple fact is that this outage showed there are very real risks to businesses that have moved to the cloud. It also showed that many cloud customers did not have proper disaster recovery plans in place, mostly because they were led to believe (or just did believe) that cloud computing was exceptionally robust and that they should not and would not have significant failures.
Another lesson learned is about terms of service and uptime guarantees. The fact is, despite the large outage, Amazon did not break its service level agreement! It is critical to really understand what the SLA is, what it is actually promising you, and what your responsibilities are under the agreement. In many cases, the SLA covers specific requirements, such as the ability to connect to the platform 99.9% of the time, but does not cover the actual functionality of the solution itself. That was basically the issue with Amazon: though folks could not run their business applications, Amazon was not in breach of its SLAs. I bet a lot of folks were angry about that, but at the same time, we HAVE to understand what these SLAs are and what they are not. After all, it is your business at stake here!
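To put a number like 99.9% in perspective, here is a rough sketch of the arithmetic. The function names and the one-year measurement period are my own assumptions for illustration; they are not taken from Amazon's actual SLA, which may define and measure availability quite differently.

```python
# Illustrative only: the measurement period and both helper names are
# assumptions, not terms from any real provider's SLA.

HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def allowed_downtime_hours(sla_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given availability target."""
    return (1 - sla_percent / 100) * period_hours


def measured_availability(downtime_hours: float, period_hours: float) -> float:
    """Availability percentage actually delivered, given hours of downtime."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    budget = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
    delivered = measured_availability(24, HOURS_PER_YEAR)
    print(f"99.9% over a year allows about {budget:.2f} hours of downtime")
    print(f"A 24-hour outage in a year works out to {delivered:.3f}% availability")
```

The point of the exercise is that the percentage only means something once you know what is being measured. If the guarantee covers only the ability to connect to the platform, a day in which your application is unusable may not count against that figure at all.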
On the whole, I still think cloud computing is coming our way, and is something that should be part of all strategic IT discussions. BUT... more importantly, I firmly believe that CONTROL is an important part of IT. It is critical to understand that YOU, the owner of a business, are going to pay the price for large failures of infrastructure, not the cloud provider. So you must ensure that you can control your environment and the process of recovering it should something go wrong. If you are deploying in the cloud, make sure you have images of those servers that you can spin up locally as virtual machines. Always have redundancy in your plans. Amazon is one of the most credible companies in the world for this type of service, and yet even they can experience crippling downtime.
At the end of the day, PLAN any cloud deployment carefully and ensure that you have answers to the following questions:
1. If my cloud provider goes down, can I continue to run my business?
2. How long will it take me to recover my data and applications in the worst-case scenario?
3. What is the provider's actual service level agreement, and what recourse do I have, if any, if they break it?
Many have gone with cloud computing to save money. I wonder if a number of companies, after a day of downtime, are still feeling like they have saved money? I wonder if any businesses were irreparably harmed? I wonder how many may rethink their cloud strategy at this point?
Go slow. Plan carefully. Be prepared.
That way you can make sure you continue to experience Happy Computing!