AKS Cluster Backup and Restore

Why worry about Backup and Restore?

When dealing with data, always keep in mind what would happen, both to you and to your customer, if that data were lost and no longer available. As you can imagine, the outcome is rarely pleasant. In cloud computing, concepts such as Disaster Recovery and High Availability are discussed and put into practice for this very reason. If any data storage in an AKS cluster were to fail, we need a backup of the data disk from which we can restore and continue running normally.

What resources should I be backing up?

The only resources that need to be backed up from an AKS resource group are the mounted Persistent Storage disk resources. OS Disk resources (often labeled aks-agentpool-NUMBER-1_ID) and other components such as VMs, NSGs, and Load Balancers do not need to be snapshotted, as they do not contain any data we need. For an example, see the picture below.

[Figure: AKS Cluster]

What format should the data be backed up in?

The backed-up data should be a snapshot of the desired disk.

How do I go about backing up my AKS Cluster?

As there is currently no way to natively back up persistent storage in Azure, you must manually snapshot each disk you wish to back up. Ideally, this would be done by writing a Snapshot Persistent Storage Disks task where applicable. Refer to Dynamically create and use a persistent volume with Azure disks in Azure Kubernetes Service [1] for more information and steps. Keep in mind that you will need the subscription ID and resource group name of the appropriate AKS resource group, as well as the kubectl tool.
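A snapshot task of this kind could be sketched in Python as below. This is a minimal sketch, not the task itself: it only builds the Azure CLI arguments for `az snapshot create`, and the helper names, the snapshot naming scheme, and the sample values are assumptions made for illustration.

```python
from datetime import date


def disk_resource_id(subscription_id, resource_group, disk_name):
    """Build the Azure resource ID of a managed disk."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/disks/{disk_name}"
    )


def snapshot_name_for(disk_name, day):
    """Hypothetical naming scheme: <disk>-snap-YYYYMMDD."""
    return f"{disk_name}-snap-{day:%Y%m%d}"


def snapshot_command(resource_group, disk_id, snapshot_name):
    """Return the Azure CLI argument list that snapshots one disk."""
    return [
        "az", "snapshot", "create",
        "--resource-group", resource_group,
        "--name", snapshot_name,
        "--source", disk_id,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own subscription, group, and disk.
    disk_id = disk_resource_id("<subscription-id>", "<resource-group>", "<disk-name>")
    name = snapshot_name_for("<disk-name>", date.today())
    print(" ".join(snapshot_command("<resource-group>", disk_id, name)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Azure CLI is installed and logged in. Calling it once per persistent disk covers the "each disk" requirement.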

How do I go about restoring data I have backed up?

Likewise, as there is currently no way to natively restore persistent storage in Azure, you must manually restore each disk from its individual snapshot.
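As a rough illustration of a manual restore, the sketch below builds the Azure CLI arguments that create a new managed disk from an existing snapshot with `az disk create --source`. The function names and placeholder values are assumptions. Attaching the restored disk back to a workload is a separate step that is not shown.

```python
def snapshot_resource_id(subscription_id, resource_group, snapshot_name):
    """Build the Azure resource ID of a disk snapshot."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/snapshots/{snapshot_name}"
    )


def restore_command(resource_group, new_disk_name, snapshot_id):
    """Return the Azure CLI argument list that creates a disk from a snapshot."""
    return [
        "az", "disk", "create",
        "--resource-group", resource_group,
        "--name", new_disk_name,
        "--source", snapshot_id,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own identifiers.
    snap_id = snapshot_resource_id("<subscription-id>", "<resource-group>", "<snapshot-name>")
    print(" ".join(restore_command("<resource-group>", "<restored-disk-name>", snap_id)))
```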

Resources:

  1. https://docs.microsoft.com/en-us/azure/aks/azure-disks-dynamic-pv

Azure cost saving opportunities

Cost-effectiveness is one of the advertised benefits of cloud computing. However, a survey of 100 IT decision-makers [1] in companies with 500 or more employees, conducted by NetEnrich, found that the top cloud computing issues are:

  • Security (68%)
  • Cost overruns (59%)
  • Cost of recruiting cloud professionals (48%)

What can you do about it?

Manage resources appropriately:

  • Shut down environments during weekends, when no one is using them.
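A scheduled job could make the weekend decision with logic along these lines. This is a minimal sketch: the function names are hypothetical, and the actual start and stop calls for your resources are left out.

```python
from datetime import datetime


def should_be_running(now):
    """True Monday to Friday, False on Saturday and Sunday."""
    return now.weekday() < 5  # Monday is 0, Sunday is 6


def planned_action(now, is_running):
    """Return "start", "stop", or None if the environment is already in the right state."""
    wanted = should_be_running(now)
    if wanted and not is_running:
        return "start"
    if not wanted and is_running:
        return "stop"
    return None


if __name__ == "__main__":
    print(planned_action(datetime.now(), is_running=True))
```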

For development and test environments, do you need a full set of data in your databases?

  • You can reduce data volumes when refreshing databases from production, so that development and test work with a smaller dataset.
  • Right-size or downsize development resources that are underutilized.

Remove Azure resources that are no longer needed.

Receive a discount on your Azure services by purchasing resource reservations (savings can be up to 72%).

Azure Cosmos DB overview

Azure Cosmos DB should be used when:

  • An Azure SQL Database is not a feasible option;
  • The solution is globally distributed: data can be replicated to the regions from which users access it, so it is served quickly with low latency;
  • Low latency is needed: Cosmos DB guarantees less than 10 milliseconds latency at the 99th percentile for reads and writes for all consistency levels;
  • Horizontal scalability is needed: increased load is handled by adding more servers to the cluster;
  • High availability is needed: Cosmos DB provides 99.999% availability for both reads and writes for multi-region accounts with multi-region writes;
  • A multi-model database service is needed: document store, graph DBMS, key-value store, columnar store.

Azure SQL database general recommendations

1) When designing a solution that leverages an Azure SQL Database, create a single database with multiple schemas instead of multiple databases within a solution.

2) Based on testing various configurations for different workloads and data volumes, the recommended configurations are:

  • For Meta Stores, use a Standard S0 database (10 DTUs);
  • For simple applications, use Standard S1 databases (20 DTUs) in Development and Test environments. Use Premium P1 databases (125 DTUs) only in Production;
  • For medium-complexity applications, use Standard S3 databases (100 DTUs) in Development and Test environments. Use Premium P1 databases (125 DTUs) only in Production;
  • For Business-Critical applications, use Gen5 (16 vCores) for complex workloads in Production.
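The recommendations above can be captured as a small lookup table, for example to validate environment definitions in a deployment script. This is only a sketch. The workload and environment keys are hypothetical labels, and applying S0 to Meta Stores in every environment is an assumption, since the list does not distinguish environments for them.

```python
# Tiers taken from the recommendation list; keys are illustrative labels.
RECOMMENDED_TIERS = {
    "meta_store": {
        "dev_test": "Standard S0 (10 DTUs)",    # assumption: same tier everywhere
        "production": "Standard S0 (10 DTUs)",
    },
    "simple": {
        "dev_test": "Standard S1 (20 DTUs)",
        "production": "Premium P1 (125 DTUs)",
    },
    "medium": {
        "dev_test": "Standard S3 (100 DTUs)",
        "production": "Premium P1 (125 DTUs)",
    },
    "business_critical": {
        "production": "Gen5 (16 vCores)",       # only Production is covered
    },
}


def recommend_tier(workload, environment):
    """Look up the recommended tier, or raise ValueError if none is listed."""
    try:
        return RECOMMENDED_TIERS[workload][environment]
    except KeyError:
        raise ValueError(
            f"no recommendation for workload={workload!r}, environment={environment!r}"
        ) from None


if __name__ == "__main__":
    print(recommend_tier("simple", "dev_test"))
```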

Other ways? Azure SQL with auto-pause settings

“Azure SQL Database serverless automatically scales compute for single databases based on workload demand and bills for compute used per second. Serverless also provides an option to automatically pause the database during inactive usage periods when only storage costs are billed” [2], says Microsoft. But how does it work in real life?

To check this, I will create two similar Azure SQL Databases with only one difference: one has auto-pause enabled and the other has it disabled.

For the database with auto-pause enabled, I chose the following configuration settings:

[Figure: Auto-pause delay]

Wait some time until the results appear in the dashboard.

[Figure: Azure SQL Serverless, Auto-Pause Disabled]
[Figure: Azure SQL Serverless, Auto-Pause Enabled]

Cost comparison:

Although this was just a simple test, the results exceeded all expectations. The cost saving is more than 57%.
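A saving percentage of this kind compares the cost of the two databases over the same period. The sketch below shows the arithmetic. The sample amounts are made up for illustration and are not the figures from this test.

```python
def saving_percent(cost_without_pause, cost_with_pause):
    """Relative saving of the auto-pause database against the always-on one, in percent."""
    if cost_without_pause <= 0:
        raise ValueError("baseline cost must be positive")
    return (cost_without_pause - cost_with_pause) / cost_without_pause * 100


if __name__ == "__main__":
    # Hypothetical amounts in the same currency and billing period.
    print(f"{saving_percent(100.0, 40.0):.1f}%")
```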

Considerations

Serverless will not work for all cases:

  • IOPS are limited;
  • A bad serverless implementation can actually increase your costs;
  • Application code needs to be adapted to serverless (retry logic);
  • SSMS connectivity can keep the database awake, which costs money;
  • There is no way to force a pause state.
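The retry logic mentioned in the list matters because the first connection to a paused database can fail while it resumes. A minimal sketch of such a wrapper follows. It assumes the failure surfaces as `ConnectionError`, which is a stand-in: a real database driver raises its own exception types, and those should be passed in `retry_on`.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=1.0,
                    retry_on=(ConnectionError,), sleep=time.sleep):
    """Run operation(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_query():
        # Simulates a database that needs two failed attempts before it is awake.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("database is resuming")
        return "ok"

    print(call_with_retry(flaky_query, base_delay=0.1))
```

The `sleep` parameter exists so that the waiting can be replaced in tests. In application code the default is usually fine.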

Resources:

  1. https://www.globenewswire.com/news-release/2019/01/17/1701128/0/en/Enterprise-IT-Focused-on-Moving-More-Workloads-to-Cloud-in-2019.html
  2. https://azure.microsoft.com/en-us/updates/update-to-azure-sql-database-serverless-providing-even-greater-price-optimization/