As an ArcGIS Enterprise deployment grows in complexity, disaster recovery requires additional consideration. Planning for it requires insight into the disparate systems that form the deployment architecture. As in many technical scenarios, there is no one-size-fits-all approach to backing up the core and dependent systems in a deployment.
The following provides a framework for increasing the likelihood of a successful restore during a disaster recovery event. Organizations can adopt these practices to define standard operating procedures for their ArcGIS Enterprise deployment as part of a Business Continuity/Disaster Recovery (BC/DR) plan.
Best practices for creating backups
Review the following best practices for creating backups of your ArcGIS Enterprise organization and any referenced data sources.
Back up ArcGIS Enterprise
An ArcGIS Enterprise organization is composed of the Portal for ArcGIS site, all federated ArcGIS Server sites and their associated data, and the data contained within the ArcGIS Data Store. The components can be backed up using the included Web GIS Disaster Recovery (WebGISDR) tool, or by using third-party tools for machine-based and image-based backups.
The WebGISDR tool is a command line utility included with Portal for ArcGIS that is used to back up the organization's content and data, federated ArcGIS Server site information, and data contained within the relational and tile cache data stores. This tool is particularly useful for maintaining consistency in the components of a base deployment as well as in any additional federated sites, though it requires a functional deployment to perform the recovery.
The following should be considered outside the WebGISDR backup process:
- Federated ArcGIS Mission Server or ArcGIS Notebook Server sites—If you have either of these, create backups by following the instructions in the ArcGIS Mission Server documentation and the ArcGIS Notebook Server documentation.
- Spatiotemporal big data store, graph store, and object store backups—If you have any of these ArcGIS Data Store types registered with the hosting server, create backups of each using the ArcGIS Data Store backupdatastore utility.
- ArcGIS GeoEvent Server site configuration—Manage the backup of your ArcGIS GeoEvent Server configuration using the backup configuration file.
Most virtualization platforms can take snapshots of running virtual machines, which allows for low recovery time objectives. While snapshots are useful, they are not considered durable backups as part of a larger BC/DR plan.
For backups taken before or during a maintenance window, the low recovery time objective that snapshots provide is a good reason to use them when available. However, the underlying data-tier components of both Portal for ArcGIS and ArcGIS Data Store are not integrated with third-party backup methods, so taking a live backup of a running database carries a level of risk. To minimize this risk, take snapshots and image-based backups after stopping the services for the running ArcGIS Enterprise components.
In the case of architectures that use a file share to host the shared portal content directory or the configuration store and root directories for the ArcGIS Server sites, it is important to consider the consistency of backups of those locations when using third-party backup tools such as virtual machine snapshots or image-based backups. For example, if an administrator is rolling back following an unsuccessful Portal for ArcGIS upgrade by recovering a snapshot, the content directory may have been altered by the upgrade process and would no longer be consistent with the information contained within the database on the recovered instance. To minimize these effects when using third-party tools, the backups should be taken during an outage window when no content is being published or edited within the organization. This includes both the ArcGIS Enterprise components as well as any associated file share.
The ArcGIS Data Store can be backed up separately from the other components to minimize data loss in the event of a failure in that component. Running scheduled backups of relational and tile cache data stores can occur outside of the schedule for the WebGISDR utility and other backup tools.
Back up referenced data sources
ArcGIS Server can serve content from many sources including enterprise geodatabases, registered file shares, and cloud stores. These external data sources should be included in the disaster recovery plan for a deployment. It is recommended that you follow vendor instructions for taking backups or replicating data to another location.
Enterprise geodatabases and relational databases that contain data served by referenced services should be backed up according to the recovery point objectives of each organization by using the tools provided by the relational database vendors. Because this data is referenced by ArcGIS Server services, the consistency of the published services can potentially get out of sync with the back-end database tables if the recovery of the database is performed independently of the sites that contain the published services. This makes it important to align the schedule of backups across all components in the ArcGIS Enterprise deployment.
Network file shares can use either image-based or file system-based backup tools to package the data and then transfer to a durable storage solution that exists outside of the failure domain of the deployment.
Cloud stores should be backed up or replicated to another region to allow you to recover their contents. The replicated stores can also be deployed using archive or cold storage to reduce overall cost.
When to back up
How often a backup is taken depends on several factors, the most important of which is how long the backup takes to complete. Since backup processes can impact system resource utilization, full backups are typically scheduled outside of major business hours. For different backup types, the frequency with which the system is backed up can vary across an ArcGIS Enterprise deployment.
For example, a production enterprise geodatabase may be backed up incrementally every 15 minutes for a low recovery point objective. The most important data should be stored within this database instance to reduce the amount of potential data loss. For an ArcGIS Enterprise deployment with many referenced services and static content, the frequency with which backups can be taken might be daily or weekly, while deployments with heavy utilization of hosted feature services and frequent web map and application creation should target a shorter time between backups.
Validate backups
Backups should be monitored for successful completion, and administrators should be alerted when a failure occurs. For the WebGISDR tool, the exit code from running the script indicates whether a backup completed successfully: zero represents a successful backup, while any non-zero code indicates a failure. Several alerting tools can be integrated to send email or SMS notifications to the team responsible for backup integrity. Many third-party backup tools provide similar functionality or can be integrated with other services to provide alerts.
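The exit-code check described above can be sketched in a few lines of Python. This is a minimal illustration, not an Esri-provided script. The command is passed in by the caller, and the example invocation in the comment (script name, arguments, and properties file path) is an assumption to verify against the WebGISDR tool documentation for your version. The notify_administrators function is a hypothetical placeholder for whatever alerting tool you use.

```python
import subprocess
import sys


def run_backup(command):
    """Run a backup command and interpret its exit code.

    `command` is a list of arguments, for example (an assumption, adapt to
    your installation):
        ["/path/to/webgisdr.sh", "--export", "--file", "/path/to/webgisdr.properties"]

    Returns (succeeded, exit_code, stderr_text).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    # Zero means success; any non-zero exit code is treated as a failure.
    return result.returncode == 0, result.returncode, result.stderr


def notify_administrators(message):
    # Hypothetical placeholder: replace with your email, SMS, or
    # monitoring integration.
    print(f"ALERT: {message}", file=sys.stderr)


def run_and_alert(command):
    """Run the backup and alert administrators if it fails."""
    succeeded, exit_code, stderr_text = run_backup(command)
    if not succeeded:
        notify_administrators(
            f"Backup failed with exit code {exit_code}: {stderr_text.strip()}"
        )
    return succeeded
```

A scheduler can call run_and_alert with the command you already use to create backups, so the alert is sent by the same job that runs the backup.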
Another important aspect of validating an organization's BC/DR plan is to run a restore drill on a semiregular cadence. This helps administrators ensure that in the case of a disaster, they are prepared to restore from the functional backups and validate the restore plan described below.
How long to keep backup files
Deciding how long to keep backup files depends on the amount of free disk space you have and how much flexibility you require for recovery options. If you won't need to restore to a time before the last full backup, you can keep the last full backup and the incremental backups created since then.
Incremental backups created with the WebGISDR tool are cumulative; you can apply the most recent incremental backup to the last full backup. Therefore, at minimum, you need to retain the last full backup and the most recent incremental backup created since that full backup.
You can also move a few sets of older backups to another location, such as storage media. That way, if you discover that key data and services were deleted prior to the last full backup, you'll still have the files available.
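The minimum retention rule above (keep the last full backup and the most recent incremental backup created since then) can be expressed as a small selection function. This is a sketch under stated assumptions: the Backup record, its field names, and the "full" and "incremental" labels are hypothetical. The WebGISDR tool does not produce this structure, so you would populate it from your own inventory of backup files.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    name: str          # hypothetical backup file name
    created: datetime  # when the backup was created
    kind: str          # "full" or "incremental" (labels assumed here)


def minimum_retention_set(backups):
    """Return the backups that must be kept at minimum: the last full
    backup and, if one exists, the most recent incremental backup
    created after that full backup."""
    fulls = [b for b in backups if b.kind == "full"]
    if not fulls:
        return []
    last_full = max(fulls, key=lambda b: b.created)
    later_incrementals = [
        b for b in backups
        if b.kind == "incremental" and b.created > last_full.created
    ]
    keep = [last_full]
    if later_incrementals:
        keep.append(max(later_incrementals, key=lambda b: b.created))
    return keep
```

Anything outside this set is a candidate for archiving to another location rather than immediate deletion, since older sets are what allow you to recover content deleted before the last full backup.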
Note:
The WebGISDR utility records the software versions of the ArcGIS Enterprise components when you create a backup. The deployment to which you restore must be at the same version it was when you created the backup. Additionally, you must restore to the same type of operating system. For example, you cannot create a backup of an ArcGIS Enterprise deployment on Linux and restore it to Windows machines.
Best practices for restoring your organization
Review the following best practices for restoring your ArcGIS Enterprise organization using the backups you have created.
What to restore
When an administrator has several backup types at their disposal, it is possible to restore components in a more granular fashion than reverting the entire deployment. If a map or image service's cache is deleted, only those files need to be recovered from a backup. Similarly, if a table is accidentally dropped from an enterprise geodatabase, that database can be recovered without affecting other components.
If bad edits are made to a hosted feature layer and the data needs to be rolled back, an administrator has the option to restore only the relational data store without restoring the entire ArcGIS Enterprise deployment. This reduces the impact the restore has on other data stored within the database, but if there were hosted services created during that time, it could cause the ArcGIS Server site to become inconsistent with the restored database tables and require manual cleanup and republishing of the affected services.
Other times, there may be a significant outage, such as the loss of a data center or cloud region, that requires restoration of the entire ArcGIS Enterprise deployment as well as any external data sources. This is the most extreme example and requires adequate planning to ensure complete functionality of the restored environment.
How to restore
When an ArcGIS Enterprise deployment experiences a widespread outage, there are multiple recovery options that depend on the types of backups available. Replication to a nearby site using the WebGISDR utility is the most effective way to reduce the time to recover the deployment, while having a cold standby site available to spin up and restore can facilitate recovery drills as well as reduce the overall time to recovery.
When deciding on the path to recovery, the option with the shortest recovery point and time objectives should be attempted first. This would allow the fastest feedback on the level of success of the restore. Having an administrator comfortable with the backup strategy who has tested restores regularly in the past can also shorten the time taken to recover in a disaster scenario.
Since ArcGIS Enterprise has multiple tiers across internal and external components, the order in which those components are restored influences the stability of the deployment following a restore. All referenced data sources, including database instances and external file shares, should be made available first. Verify that they are accessible from the ArcGIS Enterprise environment before restoring the ArcGIS Enterprise machines and components.
Once the surrounding dependencies are in place, the ArcGIS Enterprise deployment should be restored to a consistent state. This is to avoid scenarios in which the hosting server site may have a hosted feature service published but the relational data store is missing the dependent data table, or the organization may have an item for a service that is no longer present in one of the federated sites.
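The second kind of inconsistency described above, an item that points to a service that no longer exists, can be detected by comparing two lists. The following is a minimal sketch, assuming you have already gathered the service URLs referenced by the organization's items and the service URLs present on the federated sites. How you collect those lists is outside the scope of this example, and the function name and return keys are illustrative only.

```python
def find_inconsistencies(item_service_urls, site_service_urls):
    """Compare the service URLs referenced by organization items with the
    service URLs that exist on the federated sites.

    Both arguments are iterables of URL strings gathered by the caller.
    """
    referenced = set(item_service_urls)
    present = set(site_service_urls)
    return {
        # Items whose service is missing from every federated site.
        "items_without_service": sorted(referenced - present),
        # Services that no item in the organization refers to.
        "services_without_item": sorted(present - referenced),
    }
```

Two empty lists in the result suggest that the organization and the federated sites were restored to the same point in time. Non-empty lists identify the items or services that need manual cleanup.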
Post-restore validation
Once a restore operation is complete, validation should be performed for business-critical data and widespread functionality of the ArcGIS Enterprise deployment. This can be accomplished by creating checklists for business centers and departments to verify their most important content or by automated scripting. Approaching this validation by using automated scripts allows for greater confidence that the restore was successful in less time than a manual verification of items and services.
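As one possible form of automated validation, the sketch below requests a list of business-critical URLs and reports any that do not answer with HTTP 200. The URL list is something you would maintain yourself, and the function names are illustrative. Because some endpoints can return a 200 status with an error described in the response body, treat this as a first-pass reachability check and extend it to inspect the body where needed.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a GET request, or None if the
    server could not be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def validate_endpoints(urls, fetch=http_status):
    """Return a dict of URL -> status for every URL that did not return 200.

    `fetch` can be replaced with another callable for testing.
    """
    failures = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures
```

An empty result means every listed endpoint responded. Each department's checklist of important content can be turned into one such URL list.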
Automating backup and restore operations
It's recommended that you create backups on a regular basis to guard against significant data loss and reduce downtime. How often you'll create backups will be determined by your organization's recovery point objective (RPO), which defines how much data loss is acceptable. For example, if your organization can't tolerate more than 12 hours of data loss, you'll define a schedule that creates backups at a cadence shorter than 12 hours.
Creating and restoring backups can be automated in Linux using a cron job or any other scheduling software. Keep in mind that the amount of data in your organization also affects how often you can create backups and how quickly you can restore them. Test how long an operation takes before setting up your scheduled task to ensure that each backup or restore operation completes before the next attempt is made.
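The two scheduling constraints described in this section can be checked before a schedule is put in place: the backup interval must be shorter than the RPO, and a backup must finish before the next one starts. The following is a minimal sketch. The function name and messages are illustrative, and the measured duration is a value you obtain from your own test run.

```python
from datetime import timedelta


def check_schedule(rpo, interval, measured_duration):
    """Return a list of problems with a proposed backup schedule.

    rpo               -- maximum tolerable data loss
    interval          -- time between scheduled backup starts
    measured_duration -- how long a test backup took to complete
    """
    problems = []
    if interval >= rpo:
        problems.append("interval is not shorter than the RPO")
    if measured_duration >= interval:
        problems.append("backup may still be running when the next one starts")
    return problems
```

For example, with a 12-hour RPO, a backup every 6 hours that takes 2 hours to complete returns an empty list, while a backup every hour that takes 2 hours is flagged because runs would overlap.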
Additionally, you should determine whether your backup or restore operations are succeeding. The WebGISDR tool supports an output file that records the results of the operation as JSON, which can be parsed to determine where the backup is located, whether any components failed, and how long each component took. This file can be integrated into your backup and restore logic to notify administrators of any failures or action items.
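Parsing the output file might look like the following sketch. The JSON key names used here (backupLocation, components, name, status, durationSeconds) are hypothetical and chosen only for illustration. Inspect an output file produced by your version of the WebGISDR tool and substitute the keys it actually contains.

```python
import json


def summarize_result(json_text):
    """Summarize the JSON results of a backup or restore operation.

    All key names below are hypothetical placeholders; replace them with
    the keys found in your actual output file.
    """
    result = json.loads(json_text)
    components = result.get("components", [])
    # Any component whose status is not "success" is reported as failed.
    failed = [c["name"] for c in components if c.get("status") != "success"]
    durations = {c["name"]: c.get("durationSeconds") for c in components}
    return {
        "location": result.get("backupLocation"),
        "failed": failed,
        "durations": durations,
    }
```

A non-empty failed list is the natural trigger for notifying administrators, and the durations can be tracked over time to confirm that operations still complete before the next scheduled run.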