By actively monitoring your ArcGIS Enterprise organization, you can improve system uptime, identify service performance issues or outages, and proactively adjust the resources allocated across participating machines to run the underlying applications. Monitoring solutions can provide active checks for commonly used endpoints and alert the appropriate contacts when responses fall outside of expected tolerances. In addition, you can use them to collect historical information that can be corroborated with system and software logs during root cause analysis or postmortem investigations.
While you can use ArcGIS Monitor to monitor your ArcGIS Enterprise organization, there are also third-party tools that allow you to achieve similar results. The information below is a starting point for how to integrate monitoring solutions with ArcGIS Enterprise.
In general, there are two perspectives from which enterprise applications can be monitored: resource utilization and user experience.
Resource utilization is a familiar concept to systems administrators: it covers the characteristics of the machines and supporting infrastructure that run the enterprise software. These metrics typically scale proportionally with the volume of users accessing the platform, but some workflows may also cause significant spikes in utilization.
Alternatively, user experience monitoring generally reflects how the client connects and interacts with front-end applications and is more familiar to business analysts and GIS administrators. These metrics are useful in determining baseline response times for a variety of requests, which can then be used to establish thresholds at which administrative teams should be alerted. There are also aspects of the user experience that require consideration outside of response times, such as SSL certificate expiration.
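One such aspect, certificate expiration, can be checked actively from the command line. The sketch below assumes the openssl CLI is available, and the hostname in the usage comment is a placeholder for your own portal or web adaptor endpoint:

```shell
# Sketch: report the expiration date of the TLS certificate presented by an
# endpoint. Hostname and port are supplied by the caller; the example
# hostname below is a placeholder, not a real ArcGIS Enterprise URL.
cert_expiry() {
  host="$1"; port="${2:-443}"
  echo | openssl s_client -servername "$host" -connect "$host:$port" 2>/dev/null \
    | openssl x509 -noout -enddate
}

# Example (placeholder hostname):
# cert_expiry gis.example.com 443
```

An alerting job could parse the notAfter date from this output and warn the administrative team when, for example, fewer than 30 days remain.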
The subsections below describe monitoring a system from a resource utilization perspective.
When monitoring machines in an ArcGIS Enterprise deployment from a resource utilization perspective, metrics to track include the following:
- Processor—When a participating machine's processor spikes or reaches 100 percent capacity, compute requests are backlogged, which may cause a delay in the return of information. This applies to any running process when experiencing a burst in activity.
- Physical memory—When physical memory approaches 100 percent utilization, running processes may crash as they attempt to expand into additional memory space. This is mitigated by the presence of virtual memory.
- Virtual memory—Virtual memory provides a buffer between the physical memory of a machine and the underlying storage. It uses part of the underlying storage to exchange data out of physical memory while keeping it more readily accessible than loading directly from disk. Adverse effects due to virtual memory exhaustion are less common with Linux systems; however, it is important to also monitor swap usage.
- Committed memory—System committed memory capacity is the sum of the physical memory of a machine plus the virtual memory size at a given point in time. Since virtual memory can grow, the committed memory limit can change over time. A machine approaching 100 percent committed memory utilization indicates that both physical and virtual memory are being exhausted and more resources are needed.
- Disk volume available space—Running out of disk space on the system, application, or data volumes can have significant consequences for both the running operating system and any applications that depend on those volumes. Monitor available space to ensure that systems do not run out of disk space and to detect significant increases in used space, which can indicate anomalous publishing events.
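On Linux, each of the metrics above can be sampled with standard utilities. A quick one-off snapshot, assuming a typical distribution, might look like the following:

```shell
# One-off snapshot of the metrics described above (standard Linux tools;
# exact output formats vary slightly by distribution).

# Processor: 1-, 5-, and 15-minute load averages
uptime

# Physical and swap (virtual) memory, in human-readable units
free -h

# Committed memory: Linux tracks the commit limit and the currently
# committed address space in /proc/meminfo
grep -E 'CommitLimit|Committed_AS' /proc/meminfo

# Disk volume available space for mounted filesystems
df -h
```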
As you monitor your system, keep in mind that network bottlenecks, though becoming rarer in enterprise-grade network environments, can affect the optimal response times for ArcGIS Enterprise components. This becomes more likely in a multimachine environment, where internal requests are exchanged among all ArcGIS Enterprise components and other registered data sources and file services.
When possible, divide the processor and memory into a per-process listing to determine which process is spiking during a given time. When using this level of granularity in monitoring, the command line portion of the process can be used to distinguish ArcGIS Enterprise internal components from each other or from real-time antivirus scanning, for example.
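As a sketch, this per-process breakdown can be produced with ps, including the full command line for telling processes apart (requires procps, which is standard on most Linux distributions):

```shell
# Top 10 processes by CPU, with full command lines so that ArcGIS Enterprise
# components can be distinguished from each other and from other software
# such as real-time antivirus scanning.
ps -eo pid,pcpu,pmem,etime,args --sort=-pcpu | head -n 11

# The same listing ordered by memory consumption instead
ps -eo pid,pcpu,pmem,etime,args --sort=-pmem | head -n 11
```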
Monitor not only the machines on which ArcGIS Enterprise components are installed but also any file servers and database instances that the deployment may depend on for proper operation. ArcGIS Enterprise applications typically start at their lowest resource consumption levels. As applications are accessed and used, their resource consumption scales proportionately with demand.
Collect resource metrics
While not included with most base Linux distributions by default, there are a number of software packages that allow interrogation and collection of machine resource metrics. Collect the resource utilization metrics mentioned in the previous section, at a minimum, for all machines in the deployment by adding them as counters for the chosen software. During service degradations or outages, you can increase the frequency of polling to gain additional insight into the processes and events that precede the outage conditions.
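As a minimal illustration of such a collector, the script below appends one timestamped CSV row per run and is intended to be scheduled (for example, from cron) so that polling frequency can be adjusted. The output path and the specific metrics sampled are assumptions; a production deployment would normally use a dedicated agent such as collectd, Telegraf, or sar instead:

```shell
#!/bin/sh
# Single-sample metric collector sketch: appends a timestamped CSV row of
# CPU idle %, memory used %, and root-volume use %. Schedule with cron and
# shorten the interval during service degradations for additional insight.
OUT=/var/tmp/resource_metrics.csv

# Write a header row on first run
[ -f "$OUT" ] || echo "timestamp,cpu_idle_pct,mem_used_pct,disk_used_pct" > "$OUT"

ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# vmstat's second sample reflects current activity; the "id" column is idle CPU
cpu_idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
mem_used=$(free | awk 'NR==2 {printf "%.1f", $3/$2*100}')
disk_used=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')

echo "$ts,$cpu_idle,$mem_used,$disk_used" >> "$OUT"
```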
Analyze resource metrics
Once you have chosen a collection tool and captured resource utilization data for your machines, you can analyze resource metrics. Consider the following when analyzing resource metrics:
- The life span of the issue—Understanding whether the occurrence was an isolated event or a long-term trend will help you determine the best path forward in most situations. A short-term spike in resource utilization tends to occur with an immediate demand in specific services, such as adding a newly released dashboard or web app or adding a department to the portal. Longer-term growth toward the current utilization can indicate increasing popularity of the platform and its associated services or applications. Short-term spikes may or may not recur, so the context surrounding those events is important in determining whether additional resources are needed to increase the long-term stability of the deployment.
- The processes consuming most of the system's resources—From a Portal for ArcGIS and ArcGIS Data Store perspective, utilization should scale almost linearly with the number of users on the platform and use of hosted services, respectively. When considering ArcGIS Server, scaling of dedicated services and use of hosted services are the two major factors in resource utilization. Dedicated services can be tuned in an ArcGIS Server site to reduce overall resource utilization, but that may not be adequate when demand reaches its peak over time.
- The distribution of roles—Distributing roles across multiple machines in an ArcGIS Enterprise deployment allows for a more careful resource adjustment for each component as well as increased granularity of understanding when issues arise. Increasing resources for only the relational data store or hosting server machines may be more strategic than increasing resources for a single-machine based enterprise deployment. You can make adjustments to the current site architecture through join site operations to move from a single machine to a distributed architecture in an established deployment.
Now that you can identify, track, and analyze machine resource metrics, you can address unexpected system responses. This may mean increasing assigned processor resources, assigning or installing more RAM, or increasing disk space. Before taking action, you must understand the best practices for resolving resource utilization issues.
Processor utilization
Before increasing the assigned processor resources of the machines encountering high processor utilization, determine whether it is an ArcGIS Enterprise component or other software on the system that is causing the spikes in utilization. Security software with real-time scanning enabled can increase processor utilization during normal web server and database operations. If this is the case, alert your cybersecurity team based on the observed behavior. For virtual machines, the underlying host may be overprovisioned, which can lead to a performance bottleneck that is not detectable from within the virtual machines.
Physical memory utilization
When physical memory utilization approaches 100 percent, the machines may require more RAM assigned or installed. As described above, separating workloads onto dedicated machines allows for more granular resource allocation and reduces resource contention, but you can also increase memory on the existing machines. Note that as physical memory is exhausted, the available virtual memory may be exhausted as well.
Virtual and committed memory utilization
Virtual and committed memory utilization typically demonstrate the same patterns when reaching 100 percent utilization. Virtual memory allows for processes to use more memory than is available on a system and typically scales automatically to a threshold value unless set statically by the system administrator responsible for the provisioned machines. You may be able to increase virtual memory by modifying system settings if there is adequate disk space to extend the page file.
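On Linux, you can inspect the current swap configuration and, if disk space permits, extend it with a swap file. The size and path below are illustrative only, and the commented commands require root privileges:

```shell
# Show configured swap devices/files and overall swap usage
swapon --show
free -h

# Sketch: add a 4 GiB swap file (illustrative size and path; run as root,
# and confirm the filesystem supports swap files before using in production)
# fallocate -l 4G /swapfile
# chmod 600 /swapfile
# mkswap /swapfile
# swapon /swapfile
```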
Disk volume available space
Disk space exhaustion is one of the most unpredictable failure modes in an ArcGIS Enterprise deployment. Files can be blanked or truncated when update attempts cannot complete, which can prevent the software from starting properly. First, search for large files that can be moved to a registered data store or other location. If you cannot remove unneeded files, increase the disk space. You can also migrate system directories to separate storage, such as the content directory for a Portal for ArcGIS site or the cache directory for ArcGIS Server.
To see the top 25 files by size (in bytes) for the specified directory, <directory>, run this command:
sudo find <directory> -type f -printf '%s %p\n' | sort -nr | head -25
Running this command against the root volume can take a long time, so it is recommended that you specify a child directory instead.
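A complementary view lists the largest subdirectories rather than individual files, which can point you toward the directory tree responsible for the growth. This sketch assumes GNU coreutils; DIR defaults to /var/tmp purely for illustration, and you may need to prefix the du command with sudo if the directory is not readable by your account:

```shell
# Largest subdirectories (up to two levels deep) under DIR, staying on one
# filesystem (-x) so that mounted volumes are not crossed.
DIR="${DIR:-/var/tmp}"
du -xh --max-depth=2 "$DIR" 2>/dev/null | sort -rh | head -25
```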