Benjamin Clements is President of Strategic Online Systems, an IT consulting firm and full-service managed service provider (MSP) located in Collierville, TN. Strategic Online Systems specializes in Microsoft and Veeam solutions and is a DataON partner. Benjamin has over 21 years of CIO consulting experience.
When it comes to a business continuity plan, it’s important to start with the big picture. In addition, we need to know not just what to do, but why to do it.
The old fable tells of a hen that is valuable because it lays golden eggs. We need to look at everything around the data that makes up the system processes. That data needs to be easily usable and must maintain its integrity, whether it is in production or archived at a DR site.
We also need to evaluate our practical risk management and address potential threats such as physical disasters, data corruption, hardware failure, user damage, and ransomware or malicious attacks. But where do we start?
To build a business continuity plan, we need to do four things:
- Choose the right technologies
- Choose the right partners
- Choose the right architecture
- Follow best practices
1/ Choose the Right Technologies
When it comes to a business continuity plan, it is important to find technologies that prioritize data integrity, even above performance. That means building in infrastructure redundancy and always expecting hardware failures.
Multiple copies of data should be sent to multiple locations, integrating the 3-2-1-0 rule into the strategy. This means having 3 copies of your data:
- 1st copy being your live data on production storage
- 2nd copy being a backup of this data
- 3rd copy being the copy of the backup data
These copies should exist on at least 2 different storage mediums or storage technologies, with 1 stored offsite, to ensure 0 data loss in the event of hardware failure, data corruption, or catastrophic outages. The 3-2-1-0 rule can be accomplished with Veeam Backup & Replication software and Microsoft Azure Stack HCI.
Figure 1. 3-2-1-0 Rule
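As a rough illustration, the countable parts of the rule (3 copies, 2 mediums, 1 offsite) can be checked against a written-down backup plan. The sketch below is only an assumption about how such a plan might be recorded; the `DataCopy` fields and the `check_321` helper are hypothetical and are not part of Veeam or Azure Stack HCI. The "0" is treated as the outcome the other three numbers aim for, so it is not checked here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str      # hypothetical label, e.g. "production"
    medium: str    # storage medium or technology holding this copy
    offsite: bool  # True if the copy lives away from the primary site


def check_321(copies):
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies of the data")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 storage mediums")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


# Example plan: live data, an onsite backup, and an offsite copy of the backup.
plan = [
    DataCopy("production", "hci-cluster", offsite=False),
    DataCopy("onsite-backup", "backup-repository", offsite=False),
    DataCopy("offsite-backup", "backup-repository", offsite=True),
]
print(check_321(plan))
```

Running the same check on a plan with the offsite copy removed reports both the missing third copy and the missing offsite location, which is the kind of gap worth catching on paper before a failure exposes it.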
The technology must be able to create authentication air gaps between production clusters and backups so that user credentials from the production side can’t be used on the backup repository. This can be done by creating a separate, non-trusted Active Directory or some other authentication for the backup repository. This prevents a ransomware attack from infiltrating the air-gapped backup, so the exposure is contained.
We also need the ability to concurrently spin up a VM from multiple points in time. This enables us to validate the condition of servers and data, for comparison purposes.
The technology should also support a hybrid cloud model. This doesn’t work for every application, as certain industries have indicated that their applications work more efficiently on-prem. However, hybrid cloud is here to stay, if for nothing else than flexibility. Even if you are only on-prem, with the right partner you can keep your cloud options open and integrate cloud workloads seamlessly, even if it’s just one workload at a time.
Finally, it should also support the latest hardware technologies like NVMe flash and RDMA networking. Not only does this improve production workloads, but it also improves backup transfer times that can allow for more frequent and less disruptive backups.
2/ Choose the Right Partners
DataON brings together the latest technology from industry leaders such as Microsoft, Veeam, Intel, Mellanox, and Western Digital.
Microsoft provides industry-leading hyper-converged infrastructure with Azure Stack HCI. It’s been named a visionary in Gartner’s Magic Quadrant for HCI, and features the same Hyper-V based software-defined compute, storage, and networking as Azure Stack. It also supports the latest data center technologies such as Intel® Optane™, NVMe SSDs, persistent memory, and RDMA networking.
Veeam® is a leader in backup and recovery solutions with its Backup & Replication™ software. It delivers availability for all your cloud, virtual and physical workloads, with a simple-by-design single management console.
Intel powers Azure Stack HCI with Intel® Xeon® Scalable processors, Intel® Optane™ storage & persistent memory, and networking. It also partners with DataON and Microsoft for validated Intel Select Solutions for Azure Stack HCI, designed to help businesses take advantage of new technologies faster with tested pre-defined configurations.
DataON also utilizes Mellanox switches and RDMA networking for 25GbE to 100GbE low latency networking and Western Digital for its high capacity storage solutions.
DataON is also a Microsoft Cloud Service Provider (CSP) and offers VM data migration (Hyper-V to Hyper-V or VMware ESXi to Hyper-V), onsite support, and Azure services. DataON can provide customized service packages to help your IT team quickly implement and deploy their infrastructure and accelerate the learning curve.
3/ Choose the Right Architecture
Now that we’ve identified the right technologies and partners to use, what kind of architecture should we use?
- Hyper-converged infrastructure with Microsoft Azure Stack HCI. In recent years, proprietary SANs have relied on RAID card caches to receive one copy of data for quick write acknowledgement, a technique that speeds write performance at the cost of initially having only one copy of the data. However, Microsoft made it a priority not to lose data with Azure Stack HCI. Instead of taking shortcuts, Microsoft designed its solution to write multiple copies of your data before acknowledgement. Built to anticipate and survive multiple failures through its shared-nothing architecture, the software-defined storage component (Storage Spaces Direct) of Azure Stack HCI helps get clusters back to full redundancy whenever possible with two-way or three-way mirror fault tolerance. It is constantly looking to rebalance data so there are three copies everywhere, and that happens automatically. That way, if any two server nodes go down in the primary cluster, there won’t be any data loss, and it will stay up and running without having to switch over to a DR site or rely on backups. Instead of relying on shortcuts, Azure Stack HCI supports the latest technology, being one of the first HCI solutions to support NVMe and RDMA networking to achieve incredible performance.
- Business continuity, backups and disaster recovery with Veeam Backup & Replication. Veeam’s Backup & Replication delivers availability for all your workloads and data, no matter if it’s in the cloud or on-prem, virtual or physical. It has built-in connectors to send data to multiple locations, whether it’s to your on-prem equipment, a CSP partner, or directly to the cloud with Azure. Veeam’s flexibility allows you to design your infrastructure to incorporate multiple authentication air gaps so that user credentials are not shared between the production side and the backup repository. Veeam provides the ability to concurrently spin up these VMs from a point in time on demand, even multiple times. And you can spin them up directly from the repository.
- Monitoring & management with Microsoft Windows Admin Center and DataON MUST. Windows Admin Center is Microsoft’s new locally deployed, browser-based app. You can use it to simplify server management, work with hybrid solutions, and streamline hyper-converged management. DataON extends Windows Admin Center’s capabilities with its MUST monitoring and management extension. It gives you the ability to see what’s going on inside the DataON hardware and provides full management of it. If there’s an issue with the cluster, or a server or drive fails, MUST’s alert services send out automated e-mail to administrators, reducing response and resolution times. DataON has a few upcoming features that work with Azure, like a subscription call home service that integrates with Azure Analytics, and a diagnostics deployment tool that allows system administrators to pull their configuration and disk mapping from Azure.
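To make the two-way versus three-way mirror trade-off mentioned above more concrete, here is a small back-of-the-envelope sketch. It is a simplification under stated assumptions: it only counts full copies, and it ignores cluster quorum, reserve capacity, and cache drives. The `mirror_profile` helper is hypothetical and is not a Storage Spaces Direct API.

```python
def mirror_profile(copies, raw_tb):
    """Summarize a mirrored volume that keeps `copies` full copies of the data.

    copies: 2 for a two-way mirror, 3 for a three-way mirror
    raw_tb: raw capacity dedicated to the volume, in terabytes
    """
    if copies < 2:
        raise ValueError("a mirror needs at least two copies")
    return {
        # One copy must survive, so every other copy can be lost.
        "failures_tolerated": copies - 1,
        # Every byte is stored `copies` times.
        "usable_tb": raw_tb / copies,
    }


for copies in (2, 3):
    print(copies, mirror_profile(copies, raw_tb=90))
```

The three-way mirror gives up capacity (one third of raw is usable instead of one half) in exchange for surviving a second failure, which is the priority on data integrity over efficiency that this architecture is built around.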
4/ Architecture Best Practices with Veeam and Azure Stack HCI
When coming up with a DR plan, we need to ask five questions to assess our risk priorities.
- How likely is the threat? Is the business or organization in an area where natural disasters could damage the primary data center? Do they allow hundreds of users domain admin access to the VHDX files, which raises the likelihood of accidentally deleted VMs? Is their data valuable enough to make them a likely target of a ransomware attack?
- How visible is the threat? Or, to put it better, how invisible is it? How long would it take a customer to discover that one of their servers was compromised by a virus or malware? If they had an archival system that they use only every three months or so, it might be too late by the time they saw that the server was corrupted. Therefore, we need to note that they may need very long archival retention for that particular VM or application.
- How potentially devastating can the threat be? Some systems, while important, can take downtime. For example, if you lose an Active Directory server but have three or more AD servers remaining to absorb the load, then losing the first AD server isn’t devastating. But if you have only one DHCP server that your 911 call center relies on, then you need to realize the importance of that DHCP server and proactively plan for it.
- How quickly do they need to recover from the threat? A 911 call center can’t afford downtime. Therefore, high availability at the infrastructure level may not be enough. In that case, we may have to think about leveraging application-level HA solutions such as SQL availability groups.
- What related services are there? For example, suppose we’ve identified that we need to protect our customer’s SQL servers, but there’s also an e-commerce front-end. We need to ensure that it has the same level of availability as their SQL servers. Therefore, we need to make sure that every related application has the same resiliency and backup frequency.
With these questions answered, we can better figure out how to protect our data.
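The last question, about related services, lends itself to a quick sanity check: every application in a related group should inherit the strictest backup frequency found in that group. The sketch below is one possible way to work that out, assuming intervals are written down in minutes. The service names and the `align_backup_intervals` helper are hypothetical, and every service named in a pair must also appear in the interval table.

```python
def align_backup_intervals(intervals_min, related):
    """Give every service the shortest backup interval found in its group.

    intervals_min: service name -> current backup interval in minutes
    related: list of (service_a, service_b) pairs that depend on each other
    """
    # Build an undirected graph of related services.
    neighbors = {name: set() for name in intervals_min}
    for a, b in related:
        neighbors[a].add(b)
        neighbors[b].add(a)

    aligned = {}
    seen = set()
    for start in intervals_min:
        if start in seen:
            continue
        # Collect the whole related group with a simple depth-first walk.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        # The most frequent schedule in the group applies to all members.
        strictest = min(intervals_min[name] for name in group)
        for name in group:
            aligned[name] = strictest
    return aligned


current = {"sql": 15, "ecommerce-frontend": 240, "archive": 1440}
print(align_backup_intervals(current, [("sql", "ecommerce-frontend")]))
```

In this example the e-commerce front-end is pulled up to the SQL servers' 15-minute schedule, while the unrelated archive system keeps its daily backup.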
When it comes to designing the DR architecture, we like to identify four quadrants:
- the production cluster
- the on-prem repository
- the off-prem repository
- the off-prem DR cluster
We need to integrate the 3-2-1-0 rule in our production cluster.
Each quadrant will need to meet certain criteria. For example, in the production cluster, the solution must write three copies, or in an on-prem repository, there must be an authentication air gap, and so on.
Figure 2. Veeam Replication from Backup to VM Design
- Quadrant 1: Production cluster (Azure Stack HCI). A production cluster running Azure Stack HCI should write three copies in three separate nodes before acknowledgement, for data integrity purposes. Therefore, for this implementation, the hardware cluster must consist of three or more server nodes. Although Azure Stack HCI fully supports two-node clusters, that is outside the scope of this blog. The cluster must survive hardware failures, rebalance the data quickly, and integrate into Azure (if the customer decides to use the cloud).
- Quadrant 2: On-prem backup repository (Veeam Backup & Replication). Next, the on-prem backup repository needs to be on the same physical network as the production cluster so there’s minimal disruption of production systems during backups and the customer can get fast recovery of data. There also needs to be an authentication air gap, accomplished in this case by placing the repository on its own workgroup with different login and user credentials.
- Quadrant 3: Off-prem backup repository (Veeam Backup & Replication). The off-prem backup repository created by Veeam serves as a backup for the on-prem repository at a separate physical location. This protects data integrity if the primary data center is compromised. The off-prem repository also serves as long-term archival and provides a third authentication air gap. Therefore, if someone did something really foolish like opening an e-mail on this server, whatever malicious corruption they open will be contained here, and it won’t corrupt the other quadrants.
- Quadrant 4: Off-prem DR Cluster (Azure Stack HCI and Veeam Backup & Replication). The off-prem DR cluster is built on an Azure Stack HCI solution, while Veeam Backup & Replication provides the tools to put these VMs on a DR cluster ready to start. Veeam also provides flexibility in the hardware, allowing the customer to choose how performant the hardware will be on the DR side. This means that the DR site doesn’t have to be a one-to-one hardware match if they don’t require the same performance in a DR situation.
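The authentication air gap criterion in the quadrants above comes down to one rule: the production cluster and each backup repository must authenticate against different realms (a separate Active Directory or workgroup). A minimal sketch of that check follows. The realm names and the `shared_realms` helper are hypothetical, and leaving the DR cluster out of the comparison is an assumption made for this example.

```python
def shared_realms(quadrants):
    """Return pairs of quadrants that authenticate against the same realm.

    quadrants: quadrant name -> authentication realm (AD domain or workgroup)
    An empty result means every quadrant listed has its own air gap.
    """
    names = sorted(quadrants)
    clashes = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Domain and workgroup names are compared case-insensitively.
            if quadrants[first].lower() == quadrants[second].lower():
                clashes.append((first, second))
    return clashes


layout = {
    "production": "CORP.LOCAL",
    "onprem-repo": "WG-BACKUP1",
    "offprem-repo": "WG-BACKUP2",
}
print(shared_realms(layout))
```

If a repository were later joined to the production domain for convenience, the check would flag that pair, which is exactly the situation where compromised production credentials could reach the backups.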
A Real-World Business Continuity and Hardware Refresh Project
Partnering with DataON, we recently implemented a complete hardware refresh and business continuity solution for a large city government in Texas. It included a new primary cluster and data migration with Azure Stack HCI and a disaster recovery solution with Veeam Backup & Replication.
It was important that during our risk assessment, we considered all the different vulnerabilities, and followed best practices to develop the right backup solution for the customer.
Here is a brief summary of what we did and how we implemented it:
Figure 3. Backup & DR Solution for a large city government in Texas
- The DataON professional service team arrived onsite and replaced their legacy three-tier SAN infrastructure with two primary Microsoft Azure Stack HCI clusters and implemented one onsite backup repository with Veeam Backup & Replication. The first was an all-NVMe flash cluster to deliver high performance, and the second was a hybrid cluster for more storage, using NVMe flash as a first tier for better performance than a typical SATA/SAS HDD-only solution. All were cabled, and the network was set up with Mellanox RDMA networking. Because malware is on everyone’s minds right now, we used an authentication air gap between the primary cluster and the backup repository to make sure that we contained any damage malware could do.
- We then deployed an Azure Stack HCI cluster with hybrid storage for failover and another Veeam backup repository at their DR site, a COLO a hundred and fifty miles away. We did this for several reasons. First, the primary data center is vulnerable to hurricanes, so in addition to the natural disaster itself, they need to be ready for possible subsequent flooding and power outages. The COLO is located much farther from the major threats to ensure that service would continue. Second, we once again used the authentication air gap with a different credential to make sure the second Veeam repository would not be affected by malware. If a ransomware attack were to happen, the organization would be able to recover all its data.
- Finally, we migrated their VM and application data from VMware ESXi running on a proprietary SAN to Azure Stack HCI. We used a variety of tools to reduce downtime and disruption of VMs during the migration.
- We worked with the local government IT team to identify their backup schedules, archival requirements, and DR needs. We gave them tools to do failover testing and validation for individual applications. We then demonstrated how to test various workloads on the DR cluster without disrupting the corresponding production workload.
I hope this helps you start asking the right questions and gives you a feel for what can be accomplished today without locking yourself out of future expansion, all the while preparing yourself for the hybrid cloud.