Wednesday 2 October 2013

Cloud–Azure Downtime: A Reality, and It Hurts

Cloud is a technological innovation that has been accepted into the mainstream. The cloud platform is constantly evolving and is still in its infancy; it will take some time to mature. The non-functional requirements (the "-ilities") need a lot more data and understanding, and most of them come with very one-sided T&Cs and the fine print “Conditions Apply”.

Cloud downtime is a tricky topic and needs a lot more understanding. It's only when we run into a real situation that we start reading the fine print.

Microsoft Azure Compute ran into hot water on 27 Sep 2013 at 6:34 AM UTC in the North Central US datacenters. The status documentation mentions Partial Performance Degradation and states: “The repair steps have been successfully executed and validated. Full Compute functionality has been restored in the affected North Central US sub-region. We apologize for any inconvenience this has caused our customers.”

What does the Cloud Services SLA on the Microsoft site say?

Cloud Services, Virtual Machines and Virtual Network

  • For Cloud Services, we guarantee that when you deploy two or more role instances in different fault and upgrade domains, your Internet facing roles will have external connectivity at least 99.95% of the time.
  • For all Internet facing Virtual Machines that have two or more instances deployed in the same Availability Set, we guarantee you will have external connectivity at least 99.95% of the time.
  • For Virtual Network, we guarantee a 99.9% Virtual Network Gateway availability
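To put those percentages in perspective, here is a rough back-of-the-envelope calculation of how much downtime each figure still allows. It is only a sketch: the 30-day month is my assumption, and the SLA document defines its own measurement period and how downtime is counted.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime that still fit inside the SLA for a period.

    The 30-day default is an assumption for illustration; the real SLA
    defines its own measurement period.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    figures = [
        ("Cloud Services / Virtual Machines", 99.95),
        ("Virtual Network Gateway", 99.9),
    ]
    for name, sla in figures:
        minutes = allowed_downtime_minutes(sla)
        print(f"{name}: {sla}% -> about {minutes:.1f} minutes per 30-day month")
```

So a 99.95% guarantee still leaves room for roughly 21.6 minutes of lost connectivity in a 30-day month, and 99.9% for roughly 43.2 minutes, before the SLA is even breached.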

What is the SLA around Fault Domain?

We all tend to think of the Fault Domain as God-sent, but in actuality a Fault Domain is a set of computers on a rack in a single data center, and that data center is itself liable to go down. If the entire data center goes down, there is no failover. Moreover, even if the downtime is unplanned and exceeds the documented limits, there is no region-wise failover.

<Correction> Fault Domains don't exist across data centers, as pointed out by Wood. Thanks.</Correction>

For more documentation on Fault Domains, refer here.

Can one find out which fault domains their instances are running in?

No. Apparently MSFT doesn't indicate the fault domain for the instances, at least not on the management portal. However, the Windows Azure SDK provides some properties you can use to query fault domain and upgrade domain information. The RoleInstance class has a property called FaultDomain that you can read to find out in which fault domain your role instance is running. There is a catch though: querying the FaultDomain property will return either 1 (one) or 2 (two). This is because you are entitled to only 2 fault domains for your application. If your application is deployed across more fault domains, you will not be able to determine this using the FaultDomain property.
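The FaultDomain property itself is read from .NET code inside the role, so I am not reproducing that call here. What follows is only a sketch, in Python, of what one might do with the values once they have been collected: group the instances by the reported fault domain and check that they are spread over at least two. The instance names and values are made up for illustration.

```python
from collections import defaultdict


def group_by_fault_domain(instances):
    """Map each reported fault domain value to the instance IDs placed in it."""
    groups = defaultdict(list)
    for instance_id, fault_domain in instances:
        groups[fault_domain].append(instance_id)
    return dict(groups)


def spans_multiple_fault_domains(instances):
    """True when the instances are spread over at least two fault domains."""
    return len(group_by_fault_domain(instances)) >= 2


# Hypothetical (instance ID, FaultDomain value) pairs, not real output.
reported = [("WebRole_IN_0", 1), ("WebRole_IN_1", 2), ("WebRole_IN_2", 1)]

if __name__ == "__main__":
    print(group_by_fault_domain(reported))
    print(spans_multiple_fault_domains(reported))
```

A check like this only tells you the instances sit on different racks; it says nothing about surviving a data center level problem.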

Can one really rely on the Fault Domain?

My take is that, given the limited documentation, Fault Domain is more of an abstract term that features in the SLA and T&C. Consider the recent increase in Compute and SQL Reporting downtime:

  • Sept 27, 2013: North Central US, 4:36 AM, 5:15 AM, 6:34 AM.
  • Sept 28, 2013: Compute: Partial Performance Degradation [North Central US]
    • 28 Sep 2013 5:15 PM UTC
    • 28 Sep 2013 5:00 PM UTC
    • 28 Sep 2013 4:20 PM UTC
    • 28 Sep 2013 3:20 PM UTC
  • Sept 29, 2013: Compute, Storage and SQL Reporting: Performance Degradation [Southeast Asia]
    • 29 Sep 2013 4:44 PM UTC
    • 29 Sep 2013 4:07 PM UTC
    • 29 Sep 2013 2:07 PM UTC
    • 29 Sep 2013 1:07 PM UTC

The string of episodes around Compute did impact a lot of customers in multiple regions. Most of them, I presume, had fault domains (minimum 2 instances), and still this resulted in disruption of service. My take: if downtime can result in business loss, then don’t depend on fault domains; look for alternatives.

What alternatives does the customer have in case they don’t want to rely on Fault Domains?

Azure does provide Traffic Manager, which is currently in CTP. It can be used, but please evaluate a fully working POC before adopting it as an option.

Traffic Manager allows you to load balance incoming traffic across multiple hosted Windows Azure services whether they’re running in the same datacenter or across different datacenters around the world. By effectively managing traffic, you can ensure high performance, availability and resiliency of your applications. Traffic Manager provides you a choice of three load balancing methods: performance, failover, or round robin.

Use Traffic Manager to:

Ensure high availability for your applications

Traffic Manager enables you to improve the availability of your critical applications by monitoring your hosted services in Windows Azure and providing automatic failover capabilities when a service goes down.

Run responsive applications

Windows Azure allows you to run services in datacenters located around the globe. By serving end-users with the hosted service that is closest to them in terms of network latency, Traffic Manager can improve the responsiveness of your applications and content delivery times.
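To make the three load balancing methods mentioned above (performance, failover, round robin) a little more concrete, here is a toy model in Python. It is purely illustrative and is not how Traffic Manager is implemented: the Endpoint class, the endpoint names, the health flags and the latency numbers are all my own assumptions.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Endpoint:
    name: str          # hypothetical hosted service name
    healthy: bool      # result of some monitoring probe (assumed)
    latency_ms: float  # latency seen by the caller (made-up numbers)


def pick_failover(endpoints):
    """Failover: first healthy endpoint in priority (list) order."""
    for ep in endpoints:
        if ep.healthy:
            return ep
    return None


def pick_performance(endpoints):
    """Performance: healthy endpoint with the lowest latency."""
    healthy = [ep for ep in endpoints if ep.healthy]
    return min(healthy, key=lambda ep: ep.latency_ms) if healthy else None


def make_round_robin(endpoints):
    """Round robin: rotate through the healthy endpoints on each call."""
    counter = count()

    def pick():
        healthy = [ep for ep in endpoints if ep.healthy]
        if not healthy:
            return None
        return healthy[next(counter) % len(healthy)]

    return pick


# Hypothetical deployment with the primary data center marked as down.
endpoints = [
    Endpoint("north-central-us", healthy=False, latency_ms=40.0),
    Endpoint("south-central-us", healthy=True, latency_ms=55.0),
    Endpoint("southeast-asia", healthy=True, latency_ms=210.0),
]

if __name__ == "__main__":
    print("failover    ->", pick_failover(endpoints).name)
    print("performance ->", pick_performance(endpoints).name)
    rr = make_round_robin(endpoints)
    print("round robin ->", [rr().name for _ in range(3)])
```

In this made-up scenario the unhealthy primary is skipped by all three methods, which is the behaviour one would want to verify in a POC.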

Note: The customer pays for the extra set of instances running in a different data centre. Data synchronization has to be built over and above Azure Sync. Deployments from the customer's development environment have to go to both the production and the failover site simultaneously. All of this entails additional costs for the customer.

What are the dates for Traffic Manager GA?

No dates have been announced publicly by MSFT. The assumption is early March 2014.

What does the SLA for Compute mention?

Find the Compute SLA here. The important part of the document is the SLA Exclusions:

a. SLA Exclusions

i. This SLA and any applicable Service Levels do not apply to any performance or availability issues:

1. Due to factors outside Microsoft’s reasonable control (for example, a network or device failure at the Customer site or between the Customer and our data center). <Important> Does a natural disaster classify as beyond reasonable control? </Important>

2. That resulted from Customer’s or third party hardware or software. This includes VPN devices that have not been tested and found to be compatible by Microsoft. The list of compatible VPN devices is available at http://msdn.microsoft.com/en-us/library/windowsazure/jj156075.aspx.

3. When Customer uses versions of operating systems in either Virtual Machines or Cloud Services that have not been tested and found to be compatible by Microsoft. The Virtual Machines list of compatible Microsoft software and Windows versions is available at http://support.microsoft.com/kb/2721672. The Virtual Machines list of compatible Linux software and versions is available at http://support.microsoft.com/kb/2805216. The Cloud Services list of compatible operating systems is available at http://msdn.microsoft.com/en-us/library/ee924680.aspx.

4. That resulted from actions or inactions of Customer or third parties;

5. Caused by Customer’s use of the Service after Microsoft advised Customer to modify its use of the Service, if Customer did not modify its use as advised;

6. During Previews (e.g., technical previews, betas, as determined by Microsoft);

Or

7. Attributable to the acts or omissions of Customer or Customer’s employees, agents, contractors, or vendors, or anyone gaining access to Microsoft’s Service by means of Customer’s passwords or equipment.

This is a lot of legal jargon, but my 2 cents: if your business cannot afford downtime, please read the SLAs. Considering that cloud is still evolving, the SLAs can only get better. The customer will have to look for options other than MSFT.

Additional Notes

How to find out the Fault Domain: http://msdn.microsoft.com/en-us/library/microsoft.windowsazure.serviceruntime.roleinstance.faultdomain.aspx

Microsoft Azure Real-time Service Status: http://www.windowsazure.com/en-us/support/service-dashboard/