The Critical Information Your System Monitoring Tools Are Missing

Effective IT operations management depends on a robust set of tools which provide insight into the status, behavior and health of critical systems. The information generated by such tools enables support personnel to make decisions and take action in order to ensure optimal service delivery. Meaningful, contextual, and actionable intelligence provided in real-time can help prevent costly outages, ensure SLA compliance and increase customer satisfaction. In essence, the effectiveness of any IT organization is only as good as the data it receives from the relevant tools.

Enterprise applications generate a large volume of raw operational data in the form of event logs. This information is typically stored in delimited text files or temporary databases and retrieved only in the event of a service interruption when additional information is needed to perform root cause analysis. Most systems provide various levels of event logging, from information messages to critical failure alerts, which taken together provide a holistic picture of what is happening within an application at any given point in time. Events are the "voice" of an application, the method by which it communicates how it is performing, what problems may be occurring and which functional areas may need attention.

As valuable as event information can be, it is often difficult to derive actionable intelligence from it. Applications such as Microsoft SharePoint, which require multiple servers in a shared farm environment, can log thousands of informational, warning and critical events every minute, scattered across multiple servers in numerous geographic locations. These events provide a clear picture of what is actually happening within the application, but they must first be gathered, correlated, filtered and analyzed before the information they contain can be used in any meaningful way. Unlike system counters, which provide simple numerical output that can easily be charted against established thresholds, event data is comprised of multiple data types, including verbose text, timestamps and categorical metadata. Without a tool designed specifically to extract, analyze and interpret data from operational events, all the intelligence that can be gained from them is unavailable to the service personnel and decision makers who need it the most.
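To see why event data resists simple threshold charting, consider what it takes just to turn raw log lines into structured records. The sketch below assumes a hypothetical tab-delimited format with a timestamp, severity level, category and free-text message; real SharePoint trace logs differ in detail, so treat the field layout as illustrative only:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogEvent:
    timestamp: datetime   # when the event was raised
    level: str            # severity: Information, Warning, Critical, ...
    category: str         # functional area that logged the event
    message: str          # verbose free-text description

def parse_line(line: str) -> LogEvent:
    """Split one tab-delimited log line into a structured event record."""
    ts, level, category, message = line.rstrip("\n").split("\t", 3)
    return LogEvent(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
                    level, category, message)

# Two illustrative log lines (format and content are assumptions):
sample = [
    "2014-03-01 09:15:02\tInformation\tTimer\tJob completed",
    "2014-03-01 09:15:03\tWarning\tRendering\tControl took 9500 ms to render",
]
events = [parse_line(l) for l in sample]
# Filter out low-value informational chatter, keep high-value events:
high_value = [e for e in events if e.level != "Information"]
```

Even this trivial example mixes timestamps, categorical fields and free text in a single record, which is exactly the kind of data that cannot be plotted against a numeric threshold the way a CPU counter can.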

Consider the following scenario: an organization running Microsoft SharePoint begins receiving reports from users that page response times have become abnormally long. The problem is intermittent at first but quickly escalates until the entire user base is impacted. Within a few hours, the system becomes unusable and critical business processes go offline. System performance counters are within normal operating parameters – CPU utilization, memory consumption, query execution time, cache hit ratio, web server throughput, and so on. No error messages are being exposed to the user or captured in the operating system event logs. After eliminating all external factors, such as network congestion, hardware failure, and the like, support escalates the issue to the application administrators, who begin troubleshooting. They gather the log files from each server and begin sifting through tens of thousands of event messages. With so much data to analyze and no specific error message or identifier to guide them, the search takes hours – meanwhile work cannot be performed, money is lost, SLA parameters are exceeded and stress levels rapidly increase. After much investigation, administrators finally discover a pattern of warning messages in the application logs from an improperly configured page component. Removing the component from the master template resolves the issue and service is restored.

In this all-too-common situation, the information required to solve the problem was there all along in the application log files, but it was never captured, escalated or acted upon until after an outage had occurred. By relying upon traditional server monitoring tools, operations personnel were missing the key data they required in order to respond to a critical situation; in fact, had they been alerted when the event pattern initially became abnormal, they could potentially have prevented the outage altogether, saving the company a great deal of money and themselves a significant amount of time. They simply didn't have the right tools to give them the right information at the right time.

This problem can be solved through the implementation of an Operational Analytics solution. OA tools are designed to gather, process, correlate, filter and evaluate application event data, isolating valuable chunks of information from streams of informational messages and assessing functional health in real time. By inspecting vast amounts of event data, OA software can identify behavioral patterns and determine when deviations from the baseline occur, alerting support personnel to potential issues and enabling them to take proactive measures in order to ensure system continuity. The behavioral profiles built in this way can be leveraged to establish health metrics, visualize trends, create dashboards, simplify troubleshooting, generate actionable reports and perform various types of historical analysis.
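The baseline-deviation idea can be sketched very simply: learn what a "normal" volume of warning events per interval looks like, then flag intervals that fall far outside it. The sample counts and the three-standard-deviations rule below are illustrative assumptions, not a description of any particular OA product:

```python
from statistics import mean, stdev

def build_baseline(history):
    """Learn a baseline from warning counts per minute over a normal period."""
    return mean(history), stdev(history)

def is_anomalous(count, baseline, k=3.0):
    """Flag an interval whose count exceeds the baseline mean by k std devs."""
    mu, sigma = baseline
    return count > mu + k * sigma

# Ten minutes of "normal" warning counts (illustrative numbers):
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
baseline = build_baseline(history)

normal_minute = is_anomalous(5, baseline)    # within the usual range
bad_minute = is_anomalous(40, baseline)      # the pattern from the scenario
```

In the outage scenario above, a rule of this shape would have fired as soon as the warning-message pattern from the misconfigured component pushed per-minute counts outside the learned range, hours before users were impacted.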

An effective OA solution provides the following key benefits:

  • Efficiency – By aggregating event data from multiple servers into a single repository, filtering out low-value events and identifying patterns in event instances, the system can accelerate problem resolution by providing support personnel with information on current operating trends, instant access to specific event details and real-time alerts when high-value events occur.
  • Accuracy – Each server has an event profile unique to its role in the farm (web server, application server, database); likewise, each farm has a unique aggregate event profile that may differ from other farms in the environment (development, staging, production). These profiles are analyzed to determine a baseline of "normal" behavior and capture any functional deviation, resulting in a weighted score that is a more accurate reflection of application health than standardized thresholds.
  • Visibility – Event data is more comprehensive than SNMP-based alerts and performance counters. An easily accessible repository of core application event messages provides a wealth of information that can be used to perform root cause analysis, identify emerging trends, create historical comparisons, and support critical upgrade, enhancement or resource allocation decisions.
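The "weighted score" described under Accuracy can be illustrated with a small sketch. The severity weights, level names and scoring rule here are assumptions chosen for the example; an actual product would derive its weights and baselines from the learned per-server profiles:

```python
# Illustrative severity weights: informational events carry no weight,
# critical events dominate the score.
WEIGHTS = {"Information": 0.0, "Warning": 1.0, "Error": 5.0, "Critical": 20.0}

def server_score(current: dict, baseline: dict) -> float:
    """Weighted excess of current event counts over this server's baseline.

    0.0 means the server matches its normal profile for its farm role;
    larger values mean a heavier-than-normal skew toward severe events.
    """
    score = 0.0
    for level, weight in WEIGHTS.items():
        excess = current.get(level, 0) - baseline.get(level, 0)
        score += weight * max(excess, 0)   # only penalize extra events
    return score

def farm_score(servers: dict) -> float:
    """Aggregate farm health: sum of (current, baseline) scores per server."""
    return sum(server_score(cur, base) for cur, base in servers.values())

# A web server logging far more warnings than its role's baseline profile:
web = server_score({"Information": 900, "Warning": 120},
                   {"Information": 880, "Warning": 15})
```

Because each server is compared against its own role-specific baseline rather than a one-size-fits-all threshold, a chatty application server and a quiet database server can both score 0.0 when healthy, which is the point of the accuracy benefit above.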


OA tools are a key element in a comprehensive event management strategy but they are only one part of the overall picture. System monitoring tools are also essential and when properly implemented the two work in concert – system monitoring focuses on "when" and "what", whereas operational analytics strives to answer "why", "where", and "how". Both provide necessary, but altogether different, information – implementation of one or the other in isolation will result in a less than complete event management solution and a limited view of overall system health. Organizations that learn how to leverage both toolsets will realize the benefits of increased system stability, streamlined service operations and greater operational efficiency.

Operational Analytics for SharePoint
  
  
Copyright © 2014. All Rights Reserved.

BinaryWave Inc. | 611 S. Main St. | Suite 400 | Grapevine, TX 76051 | (888) 387-1197