Microsoft’s public cloud service, Azure faced outage in India, starting Monday, 18th May 2020, evening. Several users reported problems in accessing the cloud service. Between 6:11 PM IST on 18 May 2020 and 2:00 PM IST on 19 May 2020, customers experienced difficulties connecting to resources hosted in Central India. According to a summary of the impact on the company’s Azure Status website, a number of storage scale units had gone offline, impacting Virtual Machines and other Azure services with dependencies on these.
A tweet from the Twitter handle, @Azuresupport, the official Microsoft Azure account for improving customer experience reads: “Engineers are investigating an issue impacting Service Availability in central India.”
The outage will particularly hit hard coming at a time when businesses are increasingly relying upon cloud for business continuity and resiliency amid the COVID-19 crisis. Cloud has been central to organizations’ BCP strategy with the agility and quick scalability it offers to meet evolving requirements, including Work From Home (WFH). Needless to say, the outage will have a far greater impact considering the increased workloads, including mission critical workloads, on cloud in light of COVID-19.
As of 4:59 AM IST on 19th May 2020, the status on Azure Status (https://status.azure.com/en-gb/status) states that the team had been able to recover storage and networking infrastructure and were starting the recovery of dependent compute resources.
As per the Downdetector website, that gives details of live outages, the issues peaked between 7:00 – 8:00 PM IST and as of 4:00 AM IST reports of problems were still coming in, though less.
As per the preliminary root cause analysis done by Microsoft, a power issue with the regional utility power provider caused a Central India datacenter to transfer to generator power. This transition to generators worked as designed for all infrastructure systems except for a number of air handling units in two of the datacenters internal zones (Colos). As a result, internal temperatures for these two Colos rose above operational thresholds, alerts were triggered, and automation began shutting down network and storage resources to protect data.
The company stated that in its mitigation response exercise, the engineers undertook various workstreams to bring back connectivity. First, technicians isolated the unhealthy air handling units and restored power, returning temperatures to operational levels. Once temperatures were below threshold, technicians started to physically recover Storage scale units. With Storage and Networking restored, dependent Compute scale units began to recover. As Compute nodes became healthy, reliant Virtual Machines and other dependent Azure services also recovered.
The company will be conducting a formal root cause analysis for the incident, which will be published within three business days.
Meanwhile, several Microsoft Azure users have been posting their grievances around the outage on Twitter with #AzureShutdown, #AzureCrash and #AzureIndiaFail.
Not to miss on the opportunity a cloud company even pitched for its services as an alternative.
The outage also inspired a few memes like the one below.
(Image Courtesy: www.botmetric.com)