Azure Business Continuity Technical Guidance

Authors: Patrick Wickline, Jason Roth

Contributors & Reviewers: Luis Carlos Vargas Herring, Drew McDaniel, David Magar, Ganesh Srinivasan, Milan Gada, Nir Mashkowski, Harsh Mittal, Sasha Nosov, Selcin Turkarslan, Cephas Lin, Cheryl McGuire, Bill Mathers, Mandi Ohlinger, Sidney Higa, Michael Green, Heidi Steen, Matt Winkler, Shayne Burgess, Larry Franks, Brad Severtson, Yavor Georgiev, Glenn Gailey, Tim Ammann, Ruppert Koch, Seth Manheim, Abhinav Gupta, Steve Danielson, Corey Sanders, John Deutscher

Introduction

Meeting high availability and disaster recovery requirements requires two types of knowledge: 1) detailed technical understanding of a cloud platform’s capabilities and 2) how to properly architect a distributed service. This paper covers the former: the capabilities and limitations of the Azure platform with respect to business continuity. Although it also touches on architecture and design patterns, that is not the focus. The reader should consult the material in the Additional Resources section for design guidance.

The information is organized into the following sections:

  • 1. Recovery from local failures: Physical hardware (for example, drives, servers, and network devices) can fail, and resources can be exhausted when load spikes. This section describes the capabilities Azure provides to maintain high availability under these conditions.

  • 2. Recovery from loss of an Azure region: Widespread failures are rare but possible. Entire regions can become isolated due to network failures, or be physically damaged due to natural disasters. This section explains how to use Azure’s capabilities to create applications that span geographically diverse regions.

  • 3. Recovery from on-premises to Azure: The cloud significantly alters the economics of disaster recovery, making it possible for organizations to use Azure to establish a second site for recovery. This can be done at a fraction of the cost of building and maintaining a secondary datacenter. This section explains the capabilities Azure provides for extending an on-premises datacenter to the cloud.

  • 4. Recovery from data corruption or accidental deletion: Applications can have bugs that corrupt data, and operators can incorrectly delete important data. This section explains what Azure provides for backing up data and restoring to a previous point in time.

  • 5. Additional Resources: Other important resources covering availability and disaster recovery in Azure.

1. Recovery from local failures

There are two primary threats to application availability: the failure of devices, such as drives and servers, and the exhaustion of critical resources, such as compute under peak load conditions. Azure provides a combination of resource management, elasticity, load balancing, and partitioning to enable high availability under these circumstances. Some of these features are performed automatically for all cloud services; however, in some cases the application developer must do some additional work to benefit from them.

Compute (PaaS)

All cloud services hosted by Azure are collections of one or more web or worker roles. One or more instances of a given role can run concurrently. The number of instances is determined by configuration. Role instances are monitored and managed with a component called the Fabric Controller (FC). The FC detects and responds to both software and hardware failure automatically.

  • Every role instance runs in its own virtual machine (VM) and communicates with its FC through a guest agent (GA). The GA collects resource and node metrics, including VM usage, status, logs, resource usage, exceptions, and failure conditions. The FC queries the GA at configurable intervals, and reboots the VM if the GA fails to respond.

  • In the event of hardware failure, the associated FC moves all affected role instances to a new hardware node and reconfigures the network to route traffic there.

To benefit from these features, developers should ensure that all service roles avoid storing state on the role instances. Instead, all persistent data should be accessed from durable storage, such as Azure Storage Services or Azure SQL Database. This allows requests to be handled by any role instance. It also means that role instances can go down at any time without creating inconsistencies in the transient or persistent state of the service.

The requirement to store state external to the roles has several implications. It implies, for example, that all related changes to an Azure Storage table should be made in a single Entity Group Transaction, if possible. Of course, it is not always possible to make all changes in a single transaction. Special care must be taken to ensure that role instance failures do not cause problems when they interrupt long running operations that span two or more updates to the persistent state of the service. If another role attempts to retry such an operation, it should anticipate and handle the case where the work was partially completed.

For example, in a service that partitions data across multiple stores, if a worker role goes down while relocating a shard, the relocation of the shard may not complete, or may be repeated from its inception by a different worker role, potentially causing orphaned data or data corruption. To prevent problems, long running operations must be idempotent (i.e., repeatable without side effect) and/or incrementally restartable (i.e., able to continue from the most recent point of failure).

  • To be idempotent, a long running operation should have the same effect no matter how many times it is executed, even when it is interrupted during execution.

  • To be incrementally restartable, a long running operation should consist of a sequence of smaller atomic operations, and it should record its progress in durable storage, so that each subsequent invocation picks up where its predecessor stopped.

Finally, all long running operations should be invoked repeatedly until they succeed. For example, a provisioning operation might be placed in an Azure queue, and removed from the queue by a worker role only when it succeeds. Garbage collection might be necessary to clean up data created by interrupted operations.
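
The following Python sketch illustrates the two properties described above. It is a minimal illustration, not Azure SDK code: the ProgressStore class, the copy_rows and run_until_success functions, and the crash_at parameter are hypothetical names, and a local JSON file stands in for durable storage. Each row is written with an upsert (so repeating a step has the same effect), and the checkpoint is recorded after every step (so a retry resumes where the previous attempt stopped).

```python
import json
import os
import tempfile


class ProgressStore:
    """Hypothetical stand-in for durable storage; a local JSON file is used here."""

    def __init__(self, path):
        self.path = path

    def load(self):
        # Return the index of the next row to process (0 if nothing is recorded).
        if not os.path.exists(self.path):
            return 0
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)["next_index"]

    def save(self, next_index):
        # Write to a temporary file and rename it, so a crash cannot leave
        # a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"next_index": next_index}, f)
        os.replace(tmp, self.path)


def copy_rows(rows, target, store, crash_at=None):
    """Copy rows into target, resuming from the recorded checkpoint.

    crash_at is a test hook that simulates a role instance failure.
    """
    for i in range(store.load(), len(rows)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated role instance failure")
        key, value = rows[i]
        target[key] = value  # upsert: repeating this step has the same effect
        store.save(i + 1)    # record progress after each small atomic step


def run_until_success(operation, max_attempts=5):
    """Invoke the operation repeatedly until it succeeds; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            operation()
            return attempt
        except RuntimeError:
            continue
    raise RuntimeError("operation did not succeed after %d attempts" % max_attempts)
```

If a crash occurs between the upsert and the checkpoint write, the next attempt repeats that one row, which is harmless only because the write is an upsert. That is why the two properties are usually combined.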

Elasticity

The initial number of instances running for each role is determined in each role’s configuration. Administrators should initially configure each of the roles to run with two or more instances based on expected load. But role instances can easily be scaled up or down as usage patterns change. This can be done with the Azure Portal, Windows PowerShell, the Service Management API, or third-party tools. The FC automatically provisions any new instances and inserts them into the load balancer for that role.

With Azure Auto-Scale (Preview), you can enable Azure to automatically scale your roles based on load. Automatic scaling can also be programmatically built-in and configured for a cloud service using a framework like the Auto-Scaling Application Block (WASABi).

Partitioning

FCs use two types of partitions: upgrade domains and fault domains.

  • An upgrade domain is used to upgrade a service’s role instances in groups. Azure deploys service instances into multiple upgrade domains. For an in-place upgrade, the FC brings down all the instances in one upgrade domain, upgrades them, and then restarts them before moving to the next upgrade domain. This approach prevents the entire service from being unavailable during the upgrade process.

  • A fault domain defines potential points of hardware or network failure. For any role with more than one instance, the FC ensures that the instances are distributed across multiple fault domains, in order to prevent isolated hardware failures from disrupting service. All exposure to server and cluster failure in Azure is governed by fault domains.

Per the Azure SLA, Microsoft guarantees that when two or more web role instances are deployed to different fault and upgrade domains, they will have external connectivity at least 99.95% of the time. Unlike upgrade domains, there is no way to control the number of fault domains. Azure automatically allocates fault domains and distributes role instances across them. At least the first two instances of every role are placed in different fault and upgrade domains in order to ensure that any role with at least two instances will satisfy the SLA. This is represented in the following diagram.

Fault Domain Isolation (Simplified View)

Load Balancing

All inbound traffic to a web role passes through a stateless load balancer, which distributes client requests among the role instances. Individual role instances do not have public IP addresses, and are not directly addressable from the Internet. Web roles are stateless, so that any client request can be routed to any role instance. A StatusCheck event is raised every 15 seconds. This can be used to indicate if the role is ready to receive traffic, or is busy and should be taken out of the load balancer rotation.

Virtual Machines (IaaS)

Azure Virtual Machines differ from PaaS compute roles in several respects in relation to high availability. In some instances, you must do additional work to ensure high availability.

Disk Durability

Unlike PaaS role instances, data stored on Virtual Machine drives is persistent even when the virtual machine is relocated. Azure Virtual Machines use VM Disks that exist as blobs in Azure Storage. Because of the availability characteristics of Azure Storage, the data stored on a Virtual Machine’s drives is also highly available. Note that the D: drive is the exception to this rule. The D: drive is actually physical storage on the rack server that hosts the VM, and its data will be lost if the VM is recycled. The D: drive is intended for temporary storage only.

Partitioning

Azure natively understands the tiers in a PaaS application (Web role and Worker role) and thus can properly distribute them across fault and upgrade domains. In contrast, the tiers in an IaaS application must be manually defined using availability sets. Availability sets are required for an SLA under IaaS.

Availability Sets for Windows Azure VMs

In the diagram above the IIS tier and the SQL tier are assigned to different Availability Sets. This ensures that all instances of each tier have hardware redundancy by distributing them across fault domains, and are not taken down during an update.

Load Balancing

If the VMs should have traffic distributed across them, you must group the VMs in a cloud service and load balance across a specific TCP or UDP endpoint. For more information, see Load Balancing Virtual Machines. If the VMs receive input from another source (for example, a queuing mechanism), then a load balancer is not required. The load balancer uses a basic health check to determine if traffic should be sent to the node. It is also possible to create your own probes to implement application-specific health metrics that determine if the VM should receive traffic.
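
As a rough illustration of an application-specific probe, the following Python sketch exposes an HTTP endpoint that reports healthy only when every registered check passes. It assumes an HTTP probe that treats a 200 response as healthy and any other response as unhealthy. The function names, the checks dictionary, and the port are hypothetical and are not part of any Azure API.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def probe_status(checks):
    """Return (status_code, body): 200 only when every check passes."""
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        return 503, "unhealthy: " + ", ".join(sorted(failed))
    return 200, "ok"


def make_handler(checks):
    """Build a request handler class bound to the given health checks."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = probe_status(checks)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep frequent probe requests out of the log

    return ProbeHandler


def serve(checks, port=8080):
    """Serve the probe endpoint until the process is stopped (port is an assumption)."""
    HTTPServer(("", port), make_handler(checks)).serve_forever()
```

A check is any zero-argument callable that returns True or False, for example a function that verifies that the VM can reach its database.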

Storage

Azure Storage is the baseline durable data service for Azure, providing blob, table, queue, and VM Disk storage. It uses a combination of replication and resource management to provide high availability within a single data center. The Azure Storage availability SLA guarantees that at least 99.9% of the time correctly formatted requests to add, update, read and delete data will be successfully and correctly processed, and that storage accounts will have connectivity to the internet gateway.

Replication

Data durability for Azure Storage is facilitated by maintaining multiple copies of all data on different drives located across fully independent physical storage sub-systems within the region. Data is replicated synchronously and all copies are committed before the write is acknowledged. Azure Storage is strongly consistent, meaning that reads are guaranteed to reflect the most recent writes. In addition, copies of data are continually scanned to detect and repair bit rot, an often overlooked threat to the integrity of stored data. Services benefit from replication just by using Azure Storage. No additional work is required by the service developer for recovery from a local failure.

Resource Management

Storage accounts created after June 7th, 2012 can grow to up to 200 TB (the previous maximum was 100 TB). If additional space is required, applications must be designed to leverage multiple storage accounts.
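
One simple way to spread data across several storage accounts is to derive the account from a hash of a partition key. The Python sketch below shows the idea under that assumption. The pick_account function and the account names in the usage note are hypothetical, and hashing is only one of several possible placement schemes.

```python
import hashlib


def pick_account(key, accounts):
    """Map a partition key to one of the configured storage accounts.

    The mapping is deterministic, so the same key always resolves to the
    same account as long as the account list does not change.
    """
    if not accounts:
        raise ValueError("at least one storage account is required")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(accounts)
    return accounts[index]
```

For example, pick_account("customer-42", ["acct0", "acct1", "acct2"]) always returns the same one of the three names. Note that adding or removing an account changes the mapping for most keys, so a design that expects the account list to grow needs a migration plan or a lookup table instead.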

VM Disks

A Virtual Machine’s VM Disk is stored as a page blob in Azure Storage, giving it all the same durability and scalability properties as blob storage. This design makes the data on a Virtual Machine’s disk persistent even if the server running the VM fails and the VM must be restarted on another server.

Database

SQL Database

Microsoft Azure SQL Database provides database-as-a-service, allowing applications to quickly provision, insert data into, and query relational databases. It provides many of the familiar SQL Server features and functionality, while abstracting the burden of hardware, configuration, patching and resiliency.

Note

Azure SQL Database does not provide 1:1 feature parity with SQL Server, and is intended to fulfill a different set of requirements uniquely suited to cloud applications (elastic scale, database-as-a-service to reduce maintenance costs, and so on). For more information, see Data Series: SQL Server in Azure Virtual Machine vs. SQL Database.

Replication

Azure SQL Database provides built-in resiliency to node-level failure. All writes into a database are automatically replicated to two or more background nodes using a quorum commit technique (the primary and at least one secondary must confirm that the activity is written to the transaction log before the transaction is deemed successful and returns). In the case of node failure the database automatically fails over to one of the secondary replicas. This causes a transient connection interruption for client applications. For this reason all Microsoft Azure SQL Database clients must implement some form of transient connection handling. For more information, see Using the Transient Fault Handling Application Block with SQL Azure.
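
The following Python sketch shows one common form of transient connection handling: retrying with exponential backoff and jitter. It is a generic illustration, not the Transient Fault Handling Application Block. The TransientError class, the with_retry function, and the default delays are hypothetical; a real client would map its driver's failover-related errors to the retryable case.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are expected to clear on retry."""


def with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
               sleep=time.sleep):
    """Run operation, retrying on TransientError with exponential backoff.

    The sleep parameter is injectable so that tests do not have to wait.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from many clients after a failover.
            sleep(delay + random.uniform(0, delay / 2))
```

The operation passed in should open a fresh connection on each attempt, because the connection that was interrupted by the failover is no longer usable.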

Resource Management

Each database, when created, is configured with an upper size limit. The currently available maximum size is 150 GB. When a database hits its upper size limit, it rejects additional INSERT or UPDATE commands (querying and deleting data is still possible).

Within a database, Microsoft Azure SQL Database uses a fabric to manage resources. However, instead of a fabric controller, it uses a ring topology to detect failures. Every replica in a cluster has two neighbors, and is responsible for detecting when they go down. When a replica goes down, its neighbors trigger a Reconfiguration Agent (RA) to recreate it on another machine. Engine throttling is provided to ensure that a logical server does not use too many resources on a machine, or exceed the machine’s physical limits.

Elasticity

If the application requires more than the 150 GB database limit, it must implement a scale-out approach. Scaling out with Microsoft Azure SQL Database is done by manually partitioning data, also known as sharding, across multiple Azure SQL Databases. This scale-out approach provides the opportunity to achieve near linear cost growth with scale. Elastic growth or capacity on demand can grow with incremental costs as needed because databases are billed based on the average actual size used per day, not based on maximum possible size.
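
Manual sharding requires the application to decide which database holds a given row. The Python sketch below shows one possible routing scheme, a range-based shard map, under the assumption that rows are partitioned by a numeric key such as a customer ID. The ShardMap class and the database names are hypothetical.

```python
import bisect


class ShardMap:
    """Route a numeric key to a database by key range.

    boundaries[i] is the smallest key owned by databases[i + 1];
    keys below boundaries[0] belong to databases[0].
    """

    def __init__(self, boundaries, databases):
        if len(databases) != len(boundaries) + 1:
            raise ValueError("need exactly one more database than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be in ascending order")
        self.boundaries = list(boundaries)
        self.databases = list(databases)

    def database_for(self, key):
        # bisect_right counts how many boundaries are <= key, which is
        # exactly the index of the database that owns the key.
        return self.databases[bisect.bisect_right(self.boundaries, key)]
```

With boundaries [1000, 2000] and databases ["db0", "db1", "db2"], keys below 1000 go to db0, keys from 1000 through 1999 go to db1, and the rest go to db2. A range map makes it straightforward to split a shard that approaches the size limit by adding a boundary and moving only the affected rows.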

SQL Server 2014 on Virtual Machines (IaaS)

By installing SQL Server 2014 on Azure Virtual Machines (IaaS), you can take advantage of the traditional availability features of SQL Server, such as AlwaysOn Availability Groups or database mirroring. Note that Azure VMs, storage, and networking have different operational characteristics than an on-premises, non-virtualized IT infrastructure. A successful implementation of a HADR SQL Server solution in Azure requires that you understand these differences and design your solution to accommodate them.

High availability nodes in an availability set

When you implement a high availability solution in Azure, the availability set in Azure enables you to place the high availability nodes into separate fault domains and upgrade domains. To be clear, the availability set is an Azure concept. It is a best practice that you should follow to make sure that your databases are indeed highly available, whether you are using AlwaysOn Availability Groups, database mirroring, or otherwise. If you do not follow this best practice, you may be under the false assumption that your system is highly available, but in reality your nodes can all fail simultaneously because they happen to be placed in the same fault domain in the Azure datacenter. This recommendation is less applicable to log shipping. Because log shipping is a disaster recovery feature, you should ensure that the servers are running in separate Azure datacenter locations (regions). By definition, these datacenter locations are separate fault domains.

For Azure VMs to be placed in the same availability set, you must deploy them in the same cloud service. Only nodes in the same cloud service can participate in the same availability set. In addition, the VMs should be in the same VNet to ensure that they maintain their IPs even after service healing, thus avoiding DNS update times.

Azure-Only: High Availability Solutions

You can have a high availability solution for your SQL Server databases in Azure using AlwaysOn Availability Groups or database mirroring.

The following diagram demonstrates the architecture of AlwaysOn Availability Groups running in Azure Virtual Machines. This diagram was taken from the in-depth article on this subject, High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines.

AlwaysOn Availability Groups in Windows Azure

You can also automatically provision an AlwaysOn Availability Group deployment end-to-end on Azure VMs by using the AlwaysOn template in the Microsoft Azure Portal. For more information, see SQL Server AlwaysOn Offering in Microsoft Azure Portal Gallery.

The following diagram demonstrates the use of Database Mirroring on Azure Virtual Machines. It was also taken from the in-depth topic, High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines.

Database Mirroring in Windows Azure

Note

In both architectures, a domain controller is required. However, with Database Mirroring it is possible to use server certificates to eliminate the need for a domain controller.

Other Azure Platform Services

Azure Cloud Services are built on Azure, so they benefit from the platform capabilities previously described to recover from local failures. In some cases, there are specific actions that you can take to increase the availability for your specific scenario.

Access Control Service (Availability)

Access Control Service (ACS) 2.0 takes backups of all namespaces once per day and stores them in a secure offsite location. When ACS operation staff determines there has been an unrecoverable data loss at one of ACS’s regional data centers, ACS will attempt to recover customers’ subscriptions by restoring the most recent backup. Because backups are taken once per day, up to 24 hours of data loss may occur. For more information, see Access Control Service (Disaster Recovery).

Service Bus (Availability)

To mitigate against a temporary outage of Azure Service Bus, consider creating a durable client-side queue. This temporarily uses an alternate, local storage mechanism to store messages that cannot be added to the Service Bus queue. The application can decide how to handle the temporarily stored messages after the service is restored. For more information, see Insulating Service Bus Applications Against Service Bus Outages and Disasters. For disaster recovery guidance, see Service Bus (Disaster Recovery).
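
The following Python sketch shows one way a durable client-side queue could work. It is an illustration under stated assumptions, not Service Bus SDK code: the DurableClientQueue class is hypothetical, the send callable stands in for whatever function delivers a message to the Service Bus queue, an outage is assumed to surface as ConnectionError, and a local JSON-lines file serves as the alternate storage mechanism.

```python
import json
import os
import tempfile


class DurableClientQueue:
    """Spool messages to a local file while the remote queue is unreachable."""

    def __init__(self, send, spool_path):
        self.send = send              # callable that delivers one message
        self.spool_path = spool_path  # local file used during an outage

    def publish(self, message):
        """Send the message, or store it locally if the service is down."""
        try:
            self.send(message)
            return True
        except ConnectionError:
            with open(self.spool_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(message) + "\n")
                f.flush()
                os.fsync(f.fileno())  # make sure the message reaches the disk
            return False

    def drain(self):
        """Resend spooled messages in order; return how many were delivered."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, "r", encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        sent = 0
        for message in pending:
            try:
                self.send(message)
            except ConnectionError:
                break  # still down: keep the rest for the next drain
            sent += 1
        # Rewrite the spool with only the undelivered messages.
        tmp = self.spool_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for message in pending[sent:]:
                f.write(json.dumps(message) + "\n")
        os.replace(tmp, self.spool_path)
        return sent
```

This sketch gives at-least-once delivery: if the process stops after a message is sent but before the spool file is rewritten, that message is sent again on the next drain, so receivers should tolerate duplicates. It also does not preserve ordering between spooled messages and messages published after the service recovers.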

Mobile Services (Availability)

There are two availability considerations for Azure Mobile Services. First, regularly back up the Azure SQL Database associated with your mobile service. Second, back up the mobile service scripts. For more information, see Recover your mobile service in the event of a disaster. If Mobile Services experiences a temporary outage, you might have to temporarily use an alternate Azure datacenter. For more information, see Mobile Services (Disaster Recovery).

HDInsight (Availability)

The data associated with HDInsight is stored by default in Azure Blob Storage, which has the high availability and durability properties specified by Azure Storage. The multi-node processing associated with Hadoop MapReduce jobs is done on a transient Hadoop Distributed File System (HDFS) that is provisioned when needed by HDInsight. Results from a MapReduce job are also stored by default in Azure Blob Storage, so that the processed data is durable and remains highly available after the Hadoop cluster is deprovisioned. For more information, see HDInsight (Disaster Recovery).

Checklist: Local Failures

Service/Area Checklist

Compute (PaaS)

  • Configure at least two instances for each role.

  • Persist state in durable storage, not on role instances.

  • Correctly handle the StatusCheck event.

  • Wrap related changes in transactions when possible.

  • Verify that worker role tasks are idempotent and restartable.

  • Continue to invoke operations until they succeed.

  • Consider autoscaling strategies.

Virtual Machines (IaaS)

  • Do not use the D: drive for persistent storage.

  • Group machines in a service tier into an availability set.

  • Configure load balancing and optional probes.

Storage

  • Use multiple storage accounts when data or bandwidth exceeds quotas.

SQL Database

  • Implement a retry policy to handle transient errors.

  • Use partitioning/sharding as a scale out strategy.

SQL Server 2014 on Virtual Machines (IaaS)

  • Follow the previous recommendations for Virtual Machines.

  • Use SQL Server high availability features, such as AlwaysOn.

Access Control Service (Availability)

  • No additional availability steps required for local failures.

Service Bus (Availability)

  • Consider creating a durable client-side queue as a backup.

Mobile Services (Availability)

  • Regularly back up the Azure SQL Database associated with mobile services.

  • Back up mobile services scripts.

HDInsight (Availability)

  • No additional availability steps required for local failures.

2. Recovery from loss of an Azure region

Azure is divided physically and logically into units called regions. A region consists of one or more datacenters in close proximity. At the time of this writing, Azure has eight regions (4 in North America, 2 in Asia, and 2 in Europe).

Under rare circumstances, facilities in an entire region can become inaccessible (for example, due to network failures) or be lost entirely (for example, due to natural disasters). This section explains Azure’s capabilities for creating applications that are distributed across regions. Regions are designed to minimize the possibility that a failure in one region could affect other regions.

Compute (PaaS)

Resource Management

Distributing compute instances across regions is accomplished by creating a separate cloud service in each target region and publishing the deployment package to each cloud service. However, note that distributing traffic across cloud services in different regions must be implemented by the application developer or with a traffic management service.

Determining the number of spare role instances to deploy in advance for disaster recovery is an important aspect of capacity planning. Having a full-scale secondary deployment ensures that capacity is already available when needed; however, this effectively doubles the cost. A common pattern is to have a small secondary deployment just large enough to run critical services. We recommend creating at least a small secondary deployment, both to reserve capacity, and for testing configuration of the secondary environment.

Note

The subscription quota is not a capacity guarantee. The quota is simply a credit limit. To guarantee capacity, the required number of roles must be defined in the service model and the roles must be deployed.

Load Balancing

Load balancing traffic across regions requires a traffic management solution. Azure provides Azure Traffic Manager. You can also take advantage of third-party services that provide similar traffic management capabilities.

Strategies

Many alternative strategies are available for implementing distributed compute across regions. These must be tailored to the specific business requirements and circumstances of the application. At a high level the approaches can be divided into 3 categories:

  • Redeploy on disaster: In this approach the application is redeployed from scratch at the time of disaster. This is appropriate for non-critical applications that don’t require a guaranteed recovery time.

  • Warm Spare (Active/Passive): A secondary hosted service is created in an alternate region, and roles are deployed to guarantee minimal capacity; however, the roles don’t receive production traffic. This approach is useful for applications which have not been designed to distribute traffic across regions.

  • Hot Spare (Active/Active): The application is designed to receive production load in multiple regions. The cloud services in each region might be configured for higher capacity than required for DR purposes. Alternatively, the cloud services might scale out as necessary at the time of a disaster and failover. This approach requires substantial investment in application design but has significant benefits including low and guaranteed recovery time, continuous testing of all recovery locations, and efficient usage of capacity.

A complete discussion of distributed design is outside the scope of this document. The following papers provide detailed guidance on these scenarios.

Virtual Machines (IaaS)

Recovery of IaaS VMs is similar to PaaS Compute recovery in many respects; however, there are important differences because an IaaS VM consists of both the VM and the VM Disk.

  • Use the Blob Copy API to duplicate VM Disks: In order to create VMs in multiple regions the VM Disk must be copied to the alternate region. Because VM Disks are just blobs this can be accomplished using the standard blob copy API.

  • Separate the Data disk from the OS disk: An important consideration for IaaS VMs is that you cannot change the OS disk without recreating the VM. This is not a problem if your recovery strategy is to redeploy after disaster. However, it might be a problem if you are using the Warm Spare approach to reserve capacity. To implement this properly you must have the correct OS disk deployed to both the primary and secondary locations and the application data must be stored on a separate drive. If possible use a standard OS configuration that can be provided on both locations. After a failover you must then attach the data drive to your existing IaaS VMs in the secondary DC. Use the CopyBlob API to copy snapshots of the data disk(s) to a remote site.

  • Potential consistency issues after a geo-failover of multiple VM Disks: VM Disks are implemented as Azure Storage blobs, and have the same geo-replication characteristics (see below). VM Disks are guaranteed to be in a crash-consistent state after a geo-failover; however, there are no guarantees of consistency across disks, because geo-replication is asynchronous and each disk is replicated independently. This could cause problems in some cases (for example, in the case of disk striping). Additional work might be required to restore consistency after a geo-failover in these cases. To ensure correctness of backups, a backup product such as Data Protection Manager should be used to back up and restore application data.

Storage

Recovery using Geo Redundant Storage of Blob, Table, Queue and VM Disk Storage

In Azure, blobs, tables, queues, and VM Disks are all geo-replicated by default. This is referred to as Geo Redundant Storage (GRS). GRS replicates storage data to a paired datacenter hundreds of miles away within a specific geographic region. GRS is designed to provide additional durability in case there is a major data center disaster. Microsoft controls when failover occurs, and failover is limited to major disasters in which the original primary location is deemed unrecoverable in a reasonable amount of time. Under some scenarios this can be several days. Data is typically replicated within a few minutes, although the synchronization interval is not yet covered by an SLA.

In the event of a geo-failover, there will be no change to how the account is accessed (the URL and account key will not change). However, the storage account will be in a different region after failover, which could impact applications that require regional affinity with their storage account. Even for services and applications that do not require a storage account in the same data center, the cross-datacenter latency and bandwidth charges might be a compelling reason to move traffic to the failover region temporarily. This could factor into an overall disaster recovery strategy.

In addition to automatic failover provided by GRS, Azure has introduced a service that gives you read access to the copy of your data in the secondary storage location. This is called Read Access - Geo Redundant Storage (RA-GRS).

For more information about both GRS and the RA-GRS preview, see Azure Storage Redundancy Options and Read Access Geo Redundant Storage.

Geo-Replication Region Mappings:

It is important to know where your data is geo-replicated to in order to know where to deploy the other instances of your data which require regional affinity with your storage. The following table shows the primary and secondary location pairings:

Primary                 Secondary
North Central US        South Central US
South Central US        North Central US
East US                 West US
West US                 East US
US East 2               Central US
Central US              US East 2
North Europe            West Europe
West Europe             North Europe
South East Asia         East Asia
East Asia               South East Asia
East China              North China
North China             East China
Japan East              Japan West
Japan West              Japan East
Brazil South            South Central US
Australia East          Australia Southeast
Australia Southeast     Australia East

Geo-Replication Pricing:

Geo-replication is included in current pricing for Azure Storage. This is called Geo Redundant Storage. If you do not want your data geo-replicated, you can disable geo-replication for your account. This is called Locally Redundant Storage, and is charged at a discounted price compared with geo-replicated storage.

Determining if a geo-failover has occurred

If a geo-failover occurs, this will be posted to the Azure Service Health Dashboard. However, applications can implement an automated means of detecting this by monitoring the geo-region for their storage account. This can be used to trigger other recovery operations, such as activation of compute resources in the geo-region to which their storage moved. This information can be queried from the Service Management API by using Get Storage Account Properties. The relevant properties are:

<GeoPrimaryRegion>primary-region</GeoPrimaryRegion>
<StatusOfPrimary>[Available|Unavailable]</StatusOfPrimary>
<LastGeoFailoverTime>DateTime</LastGeoFailoverTime>
<GeoSecondaryRegion>secondary-region</GeoSecondaryRegion>
<StatusOfSecondary>[Available|Unavailable]</StatusOfSecondary>
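
The following Python sketch shows how an application might compare two snapshots of these properties to detect a geo-failover. It only parses XML text and does not call the Service Management API. The function names, the Properties wrapper element, and the sample region and timestamp values are illustrative assumptions; only the five property names come from the list above.

```python
import xml.etree.ElementTree as ET

GEO_FIELDS = (
    "GeoPrimaryRegion",
    "StatusOfPrimary",
    "LastGeoFailoverTime",
    "GeoSecondaryRegion",
    "StatusOfSecondary",
)


def parse_geo_properties(xml_text):
    """Extract the geo-replication properties from a response body."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        # Drop any XML namespace prefix such as "{http://...}Name".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in GEO_FIELDS:
            found[local_name] = (element.text or "").strip()
    return found


def failover_detected(previous, current):
    """Report a failover when the primary region or failover time changed."""
    watched = ("GeoPrimaryRegion", "LastGeoFailoverTime")
    return any(previous.get(name) != current.get(name) for name in watched)


# Illustrative snapshots; the wrapper element and all values are made up.
BEFORE = """<Properties>
  <GeoPrimaryRegion>East US</GeoPrimaryRegion>
  <StatusOfPrimary>Available</StatusOfPrimary>
  <GeoSecondaryRegion>West US</GeoSecondaryRegion>
  <StatusOfSecondary>Available</StatusOfSecondary>
</Properties>"""

AFTER = """<Properties>
  <GeoPrimaryRegion>West US</GeoPrimaryRegion>
  <StatusOfPrimary>Available</StatusOfPrimary>
  <LastGeoFailoverTime>2014-01-01T00:00:00Z</LastGeoFailoverTime>
  <GeoSecondaryRegion>East US</GeoSecondaryRegion>
  <StatusOfSecondary>Unavailable</StatusOfSecondary>
</Properties>"""
```

A monitoring task could store the last parsed snapshot in durable storage, poll periodically, and start recovery operations when failover_detected returns True.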

VM Disks and geo-failover

  • As discussed in the section on VM Disks, there are no guarantees for data consistency across VM disks after a failover. To ensure correctness of backups, a backup product such as Data Protection Manager should be used to back up and restore application data.

Database

SQL Database (Disaster Recovery)

Recovery of Azure SQL Databases can be achieved by taking advantage of Point in Time Restore for the Basic, Standard, or Premium tiers. For more information, see Azure SQL Database Backup and Restore.

In addition to using Point in Time Restore, you can manually export the database to an Azure Storage blob using the Azure SQL Database Import/Export service. This can be implemented in three ways:

  • Export to a blob using a storage account in a different datacenter.

  • Export to a blob using a storage account in the same datacenter (and rely on Azure Storage geo-replication to the separate datacenter).

  • Import to your on-premises SQL Server.

For implementation details see the MSDN article Business Continuity in Azure SQL Database.

SQL Server on Virtual Machines (Disaster Recovery)

There are two recommended options for recovering a SQL Server database running on IaaS to an alternate Azure datacenter: cross-region AlwaysOn Availability Groups or backup and restore with storage blobs.

It is also possible to use database mirroring, but this feature will be removed in a future version of SQL Server. When using database mirroring for disaster recovery, you must have the principal and mirror servers running in different Azure datacenters. This means that you must deploy using server certificates, because an Active Directory domain cannot span multiple Azure datacenters without routing traffic through an on-premises network. The following diagram illustrates this setup.

Database Mirroring (DR) in Windows Azure

The following diagram demonstrates standard backup and restore with Azure storage blobs.

Backup to Blob in Windows Azure

For more information, see High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines.

Other Azure Platform Services

When attempting to run your cloud service in multiple Azure regions, you must consider the implications for each of your dependencies. In the following sections, the service-specific guidance assumes that you must use the same Azure service in an alternate Azure datacenter. This involves both configuration and data-replication tasks.

Note that in some cases, these steps can help to mitigate a service-specific outage rather than an entire datacenter event. From the application perspective, a service-specific outage might be just as limiting and would require temporarily migrating the service to an alternate Azure region.

Access Control Service (Disaster Recovery)

The Access Control Service (ACS) uses a unique namespace name that does not span Azure regions. ACS 2.0 takes backups of all namespaces once per day and stores them in a secure offsite location. In the case of a disaster, the ACS operations staff may attempt to recover customers’ subscriptions in a remote Azure region using the most recent backup. Because backups are taken daily, up to 24 hours of data may be lost. There is no SLA for regional failover, and the recovery time can be several days depending on the scenario.

To use ACS in an alternate region, customers must configure an ACS namespace in that region. ACS 2.0 customers concerned about the potential for data loss are encouraged to review the ACS 2.0 Management Service. This interface allows administrators to manage their namespaces and to import and extract all relevant data. By using this interface, ACS customers can develop custom backup and restore solutions for a higher level of data consistency than is currently offered by ACS. For other availability considerations, see Access Control Service (Availability).

Service Bus (Disaster Recovery)

Like ACS, Service Bus uses a unique namespace that does not span Azure regions. The first requirement is therefore to set up the necessary Service Bus namespaces in the alternate region. However, there are also considerations for the durability of the queued messages. There are several strategies for replicating messages across Azure regions. For the details on these replication strategies and other disaster recovery strategies, see Insulating Service Bus Applications Against Service Bus Outages and Disasters. For other availability considerations, see Service Bus (Availability).

Web Sites (Disaster Recovery)

To migrate an Azure Web Site to a secondary Azure region, you must have a backup of the website available for publishing. If the outage does not involve the entire Azure datacenter, it might be possible to use FTP to download a recent backup of the site content. Then create a new Web Site in the alternate region, unless you have already done so to reserve capacity. Publish the site to the new region, and make any necessary configuration changes. These changes could include database connection strings or other region-specific settings. If necessary, add the site’s SSL certificate and change the DNS CNAME record so that the custom domain name points to the redeployed Azure Web Site URL.

Mobile Services (Disaster Recovery)

In the secondary Azure region, create a backup mobile service for your application. Restore the Azure SQL Database to the alternate region as well. Then use Azure command-line tools to move the mobile service to the alternate region, and configure the mobile service to use the restored database. For more information on this process, see Recover your mobile service in the event of a disaster. For other availability considerations, see Mobile Services (Availability).

HDInsight (Disaster Recovery)

The data associated with HDInsight is stored by default in Azure Blob Storage. HDInsight requires that a Hadoop cluster processing MapReduce jobs be co-located in the same region as the storage account that contains the data being analyzed. Provided that you use the geo-replication feature available to Azure Storage, you can access your data in the secondary region to which it was replicated if the primary region is no longer available. You can create a new Hadoop cluster in that region and continue processing the data. For other availability considerations, see HDInsight (Availability).

SQL Reporting (Disaster Recovery)

At this time, recovering from the loss of an Azure region requires multiple SQL Reporting instances in different Azure regions. These SQL Reporting instances should access the same data, and that data should have its own recovery plan in the event of a disaster. You can also maintain external backup copies of the RDL file for each report.

Media Services (Disaster Recovery)

Azure Media Services has a different recovery approach for encoding and streaming. Typically, streaming is more critical during a regional outage. To prepare for this, you should have a Media Services account in two different Azure regions. The encoded content should be located in both regions. During a failure, you can redirect the streaming traffic to the alternate region. Encoding can be performed in any Azure region. If encoding is time-sensitive, for example during live event processing, you must be prepared to submit jobs to an alternate datacenter during failures.

Virtual Network (Disaster Recovery)

Configuration files provide the quickest way to set up a virtual network in an alternate Azure region. After configuring the virtual network in the primary Azure region, export the virtual network settings for the current network to a network configuration file. In the event of an outage in the primary region, restore the virtual network from the stored configuration file. Then configure other cloud services, virtual machines, or cross-premises settings to work with the new virtual network.

Checklist: Disaster Recovery

Service/Area Checklist

Compute (PaaS)

  • Create a cross-region disaster recovery strategy.

  • Understand trade-offs in reserving capacity in alternate regions.

  • Use traffic routing tools, such as Azure Traffic Manager (CTP).

Virtual Machines (IaaS)

  • Copy the blob VM disk resources to an alternate datacenter.

  • Perform regular backups of the VM disk or disk contents.

Storage

  • Do not disable geo-replication of storage resources.

  • Understand alternate region for geo-replication in the event of failover.

  • Create custom backup strategies for user-controlled failover strategies.

SQL Database (Disaster Recovery)

  • Use Azure SQL Database Point in Time Restore.

  • Export Azure SQL Database to blob storage.

  • Create a disaster recovery plan based on previous storage considerations.

SQL Server on Virtual Machines (Disaster Recovery)

  • Use cross-region AlwaysOn Availability Groups or database mirroring.

  • Alternatively, use backup and restore to blob storage.

Access Control Service (Disaster Recovery)

  • Configure an ACS namespace in an alternate region.

  • Use the ACS 2.0 Management Service to create custom backup solutions.

Service Bus (Disaster Recovery)

  • Configure a Service Bus namespace in an alternate region.

  • Consider custom replication strategies for messages across regions.

Web Sites (Disaster Recovery)

  • Maintain web site backups outside of the primary region.

  • If outage is partial, attempt to retrieve current site with FTP.

  • Plan to deploy web site to new or existing web site in an alternate region.

  • Plan configuration changes for both application and DNS CNAME records.

Mobile Services (Disaster Recovery)

  • Create a backup mobile service in an alternate region.

  • Manage backups of the associated Azure SQL Database to restore during failover.

  • Use Azure command-line tools to move mobile service.

HDInsight (Disaster Recovery)

  • Create a new Hadoop cluster in the region with replicated data.

SQL Reporting (Disaster Recovery)

  • Maintain an alternate SQL Reporting instance in a different region.

  • Maintain a separate plan to replicate the target to that region.

Media Services (Disaster Recovery)

  • Create a Media Services account in an alternate region.

  • Encode the same content in both regions to support streaming failover.

  • Submit encoding jobs to an alternate region in the event of an outage.

Virtual Network (Disaster Recovery)

  • Use exported virtual network settings to recreate it in another region.

3. Recovery from on-premises to Azure

Azure provides a comprehensive set of services for enabling extension of an on-premises datacenter to Azure for high availability and disaster recovery purposes:

  • Networking: With a virtual private network, you can securely extend your on-premises network to the cloud.

  • Compute: Customers using Hyper-V on-premises can “lift and shift” existing VMs to Azure.

  • Storage: StorSimple extends your file system to Azure Storage. The Azure Backup service provides backup for files and Azure SQL Databases to Azure Storage.

  • Database Replication: With SQL Server 2014 Availability Groups, you can implement high availability and disaster recovery for your on-premises data.

Networking

Azure Virtual Network enables you to create a logically isolated section in Azure and securely connect it to your on-premises datacenter or a single client machine using an IPsec connection. Virtual Network makes it easy for you to take advantage of Azure’s scalable, on-demand infrastructure while providing connectivity to data and applications on-premises, including systems running on Windows Server, mainframes and UNIX. See here for more information.

Compute

Customers using Hyper-V on-premises can “lift and shift” existing VMs to Azure and to service providers running Windows Server 2012, without making changes to the VM or converting VM formats. For more information, see Manage Disks and Images.

Storage

There are several options for using Azure as a backup site for on-premises data.

StorSimple

StorSimple securely and transparently integrates cloud storage for on-premises applications and offers a single appliance that delivers high-performance tiered local and cloud storage, live archiving, cloud-based data protection and disaster recovery. For more information, see StorSimple -- Cloud-integrated Storage -- What & Why.

Azure Backup Services

Azure Backup Services enables cloud backups using the familiar backup tools in Windows Server 2012, Windows Server 2012 Essentials, and System Center 2012 Data Protection Manager. These tools provide a workflow for backup management that is independent of the storage location of the backups, whether a local disk or Azure Storage. After data is backed up to the cloud, authorized users can easily recover backups to any server.

With incremental backups, only changes to files are transferred to the cloud. This helps to efficiently use storage space, reduce bandwidth consumption, and support point-in-time recovery of multiple versions of the data. You can also choose to use additional features, such as data retention policies, data compression, and data transfer throttling. Using Azure as the backup location has the obvious advantage that the backups are automatically “offsite”. This eliminates the extra requirements to secure and protect onsite backup media. For more information, see Azure Backup Overview and Using DPM with Azure Backup.
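To illustrate the idea behind incremental backups, the following Python sketch selects only the files whose content has changed since the previous run by comparing content digests against a stored manifest. It is a conceptual illustration only and does not describe how Azure Backup detects or transfers changes internally; the names file_digest and changed_files and the manifest format are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(root, manifest):
    """Return {relative_path: digest} for files that are new or whose
    content differs from the digest recorded in the previous manifest."""
    changed = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        digest = file_digest(path)
        if manifest.get(relative) != digest:
            changed[relative] = digest
    return changed
```

After each run, the returned entries would be merged into the manifest so that the next run transfers only later changes.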

Database

You can have a disaster recovery solution for your SQL Server databases in a hybrid-IT environment using AlwaysOn Availability Groups, database mirroring, log shipping, and backup and restore with Azure blob storage. All of these solutions use SQL Server running on Azure Virtual Machines.

AlwaysOn Availability Groups can be used in a hybrid-IT environment where database replicas exist both on-premises and in the cloud. This is shown in the following diagram, taken from the in-depth topic High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines.

AlwaysOn Availability Groups in Hybrid IT

Database Mirroring can also span on-premises servers and the cloud in a certificate-based setup. The following diagram illustrates this concept.

Database Mirroring in Hybrid IT

Log shipping can be used to synchronize an on-premises database with a SQL Server database in an Azure Virtual Machine.

Log Shipping in Hybrid IT

Finally, you can back up an on-premises database directly to Azure Blob Storage.

Backup to Blob in Hybrid IT

For more information, see High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines and Backup and Restore for SQL Server in Azure Virtual Machines.
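As a rough sketch of the backup-to-blob option, the following Python helper builds a Transact-SQL BACKUP DATABASE ... TO URL statement. It assumes a SQL Server version that supports backup to URL and a SQL Server credential for the storage account that has already been created; the helper names and the account, container, and credential values in the usage note are hypothetical.

```python
def quote_identifier(name):
    """Bracket-quote a SQL Server identifier, escaping closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value):
    """Single-quote a string literal, escaping embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def backup_to_url_statement(database, account, container, blob_name, credential):
    """Build a BACKUP DATABASE ... TO URL statement for a blob target."""
    url = f"https://{account}.blob.core.windows.net/{container}/{blob_name}"
    return (
        f"BACKUP DATABASE {quote_identifier(database)} "
        f"TO URL = {quote_literal(url)} "
        f"WITH CREDENTIAL = {quote_literal(credential)};"
    )
```

For example, backup_to_url_statement("Sales", "contosostore", "backups", "Sales.bak", "BackupCredential") produces a statement that can be run from any SQL Server client against the on-premises instance.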

Checklist: On-Premises Scenarios

Service/Area Checklist

Networking

  • Use Virtual Network to securely connect on-premises to the cloud.

Compute

  • Relocate VMs between Hyper-V and Azure.

Storage

  • Take advantage of StorSimple services for using cloud storage.

  • Use Azure Backup Services.

Database

  • Consider using SQL Server on Azure VMs as the backup.

  • Set up AlwaysOn Availability Groups.

  • Configure certificate-based Database Mirroring.

  • Use log shipping.

  • Back up the on-premises database to Azure blob storage.

4. Recovery from data corruption or accidental deletion

This scenario covers recovery of data that was corrupted or accidentally deleted because of application or operator error.

Storage

Note that while Azure Storage provides data resiliency through automated replicas, this does not prevent your application code (or developers/users) from corrupting data through accidental or unintended deletion, update, and so on. Maintaining data fidelity in the face of application or user error requires more advanced techniques, such as copying the data to a secondary storage location with an audit log. Developers can take advantage of the blob snapshot capability, which can create read-only point-in-time snapshots of blob contents. This can be used as the basis of a data-fidelity solution for blobs.

Blob and Table Storage Backup

While blobs and tables are highly durable, they always represent the current state of the data. Recovery from unwanted modification or deletion of data may require restoring data to a previous state. This can be achieved by taking advantage of the capabilities provided by Azure to store and retain point-in-time copies.

For Azure Blobs, you can perform point-in-time backups using the blob snapshot feature. For each snapshot, you are only charged for the storage required to store the differences within the blob since the last snapshot state. The snapshots are dependent on the existence of the original blob they are based on, so a copy operation to another blob or even another storage account is advisable to ensure that backup data is properly protected against accidental deletion. For Azure Tables, you can make point-in-time copies to a different table or to Azure Blobs. More detailed guidance and examples of performing application-level backups of tables and blobs can be found here:

Database

There are several business continuity (backup, restore) options available for Azure SQL Database. Databases can be copied via the Database Copy functionality or the DAC Import/Export Service. Database Copy provides transactionally consistent results, while a bacpac (through the import/export service) does not. Both of these options run as queue-based services within the datacenter and do not currently provide a time-to-completion SLA.

Note

The database copy and import/export services place a significant degree of load on the source database, and can trigger resource contention or throttling events (described in the following section on Shared Resources and Throttling).

SQL Database Backup

Point-in-time backups for Microsoft Azure SQL Database are achieved with the Azure SQL Database Copy command. You can use this command to create a transactionally-consistent copy of a database on the same logical database server or to a different server. In either case, the database copy is fully functional and completely independent of the source database. Each copy you create represents a point-in-time recovery option. You can recover the database state completely by renaming the new database with the source database name. Alternatively, you can recover a specific subset of data from the new database by using Transact-SQL queries. For additional details about SQL Database, see Business Continuity in Azure SQL Database.
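The copy-and-rename approach described above can be sketched as follows. The Python helpers generate the Transact-SQL statements for creating a timestamped copy (CREATE DATABASE ... AS COPY OF) and for later swapping the copy in under the original name (ALTER DATABASE ... MODIFY NAME). The naming scheme, the choice to rename the damaged database aside rather than drop it, and the function names are assumptions made for illustration.

```python
from datetime import datetime


def quote_identifier(name):
    """Bracket-quote a SQL identifier, escaping closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def copy_statement(source_db, taken_at, source_server=None):
    """Return (copy_name, T-SQL) for a timestamped database copy.
    Pass source_server when copying from a different logical server."""
    copy_name = f"{source_db}_copy_{taken_at:%Y%m%dT%H%M%SZ}"
    source = quote_identifier(source_db)
    if source_server is not None:
        source = f"{quote_identifier(source_server)}.{source}"
    statement = f"CREATE DATABASE {quote_identifier(copy_name)} AS COPY OF {source};"
    return copy_name, statement


def restore_by_rename_statements(source_db, copy_name):
    """Rename the damaged database aside, then give the copy its name."""
    retired = f"{source_db}_retired"
    return [
        f"ALTER DATABASE {quote_identifier(source_db)} "
        f"MODIFY NAME = {quote_identifier(retired)};",
        f"ALTER DATABASE {quote_identifier(copy_name)} "
        f"MODIFY NAME = {quote_identifier(source_db)};",
    ]
```

The copy runs asynchronously on the service, so a real script would wait for the copy to complete before relying on it as a recovery point.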

SQL Server on Virtual Machines Backup (IaaS)

For SQL Server on IaaS there are two options: traditional backups and log shipping. Using traditional backups enables you to restore to a specific point in time, but the recovery process is slow. Restoring traditional backups requires starting with an initial full backup, and then applying any backups taken after that. The second option is to configure a Log Shipping session to delay the restore of log backups (for example, by two hours). This provides a window to recover from errors made on the primary.
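The recovery window created by a delayed restore can be reasoned about with a short sketch. The Python functions below assume that each log backup is restored on the secondary only after the configured delay has elapsed since the backup was taken; the function names and the conservative bound used in can_recover_from are illustrative assumptions, not SQL Server behavior guarantees.

```python
from datetime import datetime, timedelta


def logs_ready_to_restore(log_backup_times, now, restore_delay):
    """Return the log backups old enough to be restored on the secondary."""
    cutoff = now - restore_delay
    return [taken for taken in sorted(log_backup_times) if taken <= cutoff]


def can_recover_from(error_time, now, restore_delay):
    """True while the secondary cannot yet contain the erroneous change.
    Any log backup that includes the error was taken at or after
    error_time, so it is not restored before error_time + restore_delay."""
    return now - error_time < restore_delay
```

With the two-hour delay from the example above, an error made at 10:30 can still be avoided on the secondary at 12:00, whereas an error made at 09:00 may already have been applied.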

Other Azure Platform Services

Some Azure platform services store information in a user-controlled storage account or Azure SQL Database. If the account or storage resource is deleted or corrupted, this could cause serious errors with the service. In these cases, it is important to maintain backups that would enable you to recreate these resources if they were deleted or corrupted.

For Azure Web Sites and Azure Mobile Services, you must back up and maintain the associated databases. For Azure Media Services and Virtual Machines, you must maintain the associated Azure Storage account and all resources in that account. For example, for Virtual Machines, you must back up and manage the VM disks in Azure blob storage.

Checklist: Data Corruption or Accidental Deletion

Service/Area Checklist

Storage

  • Regularly back up critical storage resources.

  • Consider using the snapshot feature for blobs.

Database

  • Create point-in-time backups using the Database Copy command.

SQL Server on Virtual Machines Backup (IaaS)

  • Use traditional backup and restore techniques.

  • Create a delayed log shipping session.

Web Sites

  • Back up and maintain the associated database, if any.

Mobile Services

  • Back up and maintain the associated database.

Media Services

  • Back up and maintain the associated storage resources.

Virtual Machines

  • Back up and maintain the VM disks in blob storage.

5. Additional Resources

Failsafe: Guidance for Resilient Cloud Architectures: Guidance for building resilient cloud architectures, guidance for implementing those architectures on Microsoft technologies, and recipes for implementing these architectures for specific scenarios.

Disaster Recovery and High Availability for Azure Applications: A detailed overview of availability and disaster recovery. It covers the challenge of manual replication for reference and transactional data. The final sections provide summaries of different types of DR topologies that span Azure datacenters for the highest level of availability.

Business Continuity in Azure SQL Database: Focuses exclusively on Azure SQL Database techniques for availability, which primarily center on backup and restore strategies. If you use Azure SQL Database in your cloud service, you should review this paper and its related resources.

High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines: Discusses the availability options open to you when you use Infrastructure-as-a-Service (IaaS) to host your database services. It discusses AlwaysOn Availability Groups, Database Mirroring, Log Shipping, and Backup/Restore. Note that there are also several tutorials in the same section that show how to use these techniques.

Best Practices for the Design of Large-Scale Services on Azure Cloud Services: Focuses on developing highly scalable cloud architectures. Many of the techniques that you employ to improve scalability also improve availability. Also, if your application cannot scale under increased load, then scalability becomes an availability issue.

Backup and Restore for SQL Server in Azure Virtual Machines