[0001] This application claims the benefit of U.S. Provisional Appl. No. 60/289,923, filed May 9, 2001, the disclosure of which is hereby incorporated by reference. The disclosure of U.S. application Ser. No. 09/484,686, filed Jan. 17, 2000, is bodily incorporated herein to facilitate an understanding of certain embodiments of the present invention.
[0002] The present invention relates to software tools and services for testing, monitoring and analyzing the operation of web-based and other transactional servers.
[0003] A variety of commercially-available software tools exist for assisting companies in testing the performance and functionality of their web-based transactional servers and associated applications prior to deployment. Examples of such tools include the LoadRunner®, WinRunner® and Astra QuickTest® products of Mercury Interactive Corporation, the assignee of the present application.
[0004] Using these products, a user can record or otherwise create a test script which specifies a sequence of user interactions with the transactional server. The user may also optionally specify certain expected responses from the transactional server, which may be added to the test script as verification points. For example, the user may record a session with a web-based travel reservation system during which the user searches for a particular flight, and may then define one or more verification points to check for an expected flight number, departure time or ticket price.
[0005] Test scripts generated through this process are “played” or “executed” to simulate the actions of users—typically prior to deployment of the component being tested. During this process, the testing tool monitors the performance of the transactional server, including determining the pass/fail status of any verification points. Multiple test scripts may be replayed concurrently to simulate the load of a large number of users. Using an automation interface of the LoadRunner product, it is possible to dispatch test scripts to remote computers for execution.
[0006] The results of the test are typically communicated to the user through a series of reports that are accessible through the user interface of the testing tool. The reports may contain, for example, graphs or charts of the observed response times for various types of transactions. Performance problems discovered through the testing process may be corrected by programmers or system administrators.
[0007] A variety of tools and services also exist that allow web site operators to monitor the post-deployment performance of their web sites. For example, hosted monitoring services now exist which use automated agents to access a web site at regular intervals throughout the day. The agents measure the time required to perform various web site functions, and report the results to a server provided by Keynote Systems. The owner or operator of the web site can access this server using a web browser to view the collected performance data on a city-by-city or other basis. Other types of existing monitoring tools include log analysis tools that process access logs generated by web servers, and packet sniffing tools that monitor traffic to and from the web server. Further, using the LoadRunner ActiveTest service of Mercury Interactive Corporation, companies can load test their web sites and other systems over the Internet prior to deployment.
[0008] A significant problem with existing monitoring tools and services is that they often fail to detect problems that are dependent upon the attributes of typical end users, such as the user's location, PC configuration, ISP (Internet Service Provider), or Internet router. For example, with some web site monitoring services, the web site operator can monitor the web site only from the agent computers and locations made available by the service provider; as a result, the service may not detect a performance problem seen by the most frequent users of the system (e.g., members of a customer service department who access the web site through a particular ISP, or who use a particular PC configuration).
[0009] Even when such attribute-specific problems are detected, existing tools and services often fail to identify the specific attributes that give rise to the problem. For example, a monitoring service may indicate that web site users in a particular city are experiencing long delays, but may fail to reveal that the problem is experienced only by users that access the site through a particular router. Without such additional information, system administrators may not be able to isolate and correct such problems.
[0010] Another significant problem with existing tools and services is that they do not provide an adequate mechanism for monitoring the current status of the transactional server, and for promptly notifying system administrators when a problem occurs. For example, existing tools and services typically do not report a problem until many minutes or hours after the problem has occurred. As a result, many end users may experience the problem before a system administrator becomes aware of the problem.
[0011] Another significant problem with prior tools and services is that they generally do not provide a mechanism for identifying the source of a performance problem. For instance, a web site monitoring service may determine that users are currently experiencing unusually long response times, but typically will not be capable of determining the source of the problem. Thus, a system administrator may be required to review significant quantities of measurement data, and/or conduct additional testing, to pinpoint the source or cause of the detected problem.
[0012] The present invention addresses these and other problems by providing a software system and method for monitoring the post-deployment operation of a web site system or other transactional server. In a preferred embodiment, the system includes an agent component (“agent”) that simulates the actions of actual users of the transactional server while monitoring and reporting the server's performance. In accordance with one aspect of the invention, the agent is adapted to be installed on selected computers (“agent computers”) to be used for monitoring, including computers of actual end users. For example, the agent could be installed on selected end-user computers within the various offices or organizations from which the transactional server is commonly accessed. Once the agent component has been installed, the agent computers can be remotely programmed (typically by the operator of the transactional server) using a controller component (“controller”). The ability to flexibly select the computers to be used for monitoring purposes, and to use actual end-user computers for monitoring, greatly facilitates the task of detecting problems associated with the attributes of typical end users.
[0013] In accordance with another aspect of the invention, the controller provides a user interface and various functions for a user to remotely select the agent computer(s) to include in a monitoring session, assign attributes to such computers (such as the location, organization, ISP and/or configuration of each computer), and assign transactions and execution schedules to such computers. The execution schedules may be periodic or repetitive schedules (e.g., every hour, Monday through Friday), so that the transactional server is monitored on a continuous or near-continuous basis. The controller preferably represents the monitoring session on the display screen as an expandable tree in which the transactions and execution schedules are represented as children of the corresponding computers. Once a monitoring session has been defined, the controller dispatches the transactions and execution schedules to the respective agent computers over the Internet or other network. The controller also preferably includes functions for the user to record and edit transactions, and to define alert conditions for generating real-time alert notifications. The controller may optionally be implemented as a hosted application on an Internet or intranet site, in which case users may be able to remotely set up monitoring sessions using an ordinary web browser.
[0014] During the monitoring session, each agent computer executes its assigned transactions according to its assigned execution schedule, and generates performance data that indicates one or more characteristics of the transactional server's performance. The performance data may include, for example, the server response time and pass/fail status of each transaction execution event. The pass/fail status values may be based on verification points (expected server responses) that are defined within the transactions. The agent computers preferably report the performance data associated with a transaction immediately after transaction execution, so that the performance data is available substantially in real-time for viewing and generation of alert notifications. In the preferred embodiment, the performance data generated by the various agent computers is aggregated in a centralized database which is remotely accessible through a web-based reports server. The reports server provides various user-configurable charts and graphs that allow the operator of the transactional server to view the performance data associated with each transaction.
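By way of illustration only, the generation of performance data for a single transaction execution event may be sketched as follows. The record layout, the function names, and the canned server response are hypothetical and are not taken from the described embodiment; a verification point is assumed here to be an expected text string within the server response.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExecutionEvent:
    """One transaction execution event as reported by an agent (hypothetical layout)."""
    transaction: str
    host: str
    response_time_s: float
    passed: bool


def run_transaction(name: str, host: str,
                    send_request: Callable[[], str],
                    verification_points: List[str]) -> ExecutionEvent:
    """Execute one transaction, time it, and evaluate its verification points."""
    start = time.perf_counter()
    response = send_request()  # stand-in for replaying the recorded user interactions
    elapsed = time.perf_counter() - start
    # Assumption: a verification point passes when the expected text appears in the response.
    passed = all(expected in response for expected in verification_points)
    return ExecutionEvent(name, host, elapsed, passed)


def fake_reservation_server() -> str:
    # Hypothetical canned response used in place of a real transactional server.
    return "Flight UA100 departs 08:15, price $240"


event = run_transaction("search flight", "agent-ny-01",
                        fake_reservation_server, ["UA100", "08:15"])
```

In this sketch the resulting record carries the response time and pass/fail status, and could be reported immediately after execution.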
[0015] In accordance with another aspect of the invention, the reports server generates reports which indicate the performance of the transactional server separately for the various operator-specified attributes. Using this feature, the user can, for example, view and compare the performance of the transactional server as seen from different operator-specified locations (e.g., New York, San Francisco, and U.K.), organizations (e.g., accounting, marketing, and customer service departments), ISPs (e.g., Sprint, AOL and Earthlink), or other attribute type. The user may also have the option to filter out data associated with particular attributes and/or transactions (e.g., exclude data associated with AOL customers), and to define new attribute types (e.g., modem speed or operating system) for partitioning the performance data. The ability to monitor the performance data according to the operator-specified attributes greatly facilitates the task of isolating and correcting attribute-dependent performance problems.
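The partitioning and filtering of performance data by attribute may be illustrated with the following minimal sketch. The event dictionaries, the field names, and the function `average_by_attribute` are assumptions made for this example and do not represent the actual reports server implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical execution events, each tagged with operator-assigned attributes.
events = [
    {"transaction": "login", "location": "New York", "isp": "Sprint", "response_time_s": 2.0},
    {"transaction": "login", "location": "New York", "isp": "AOL", "response_time_s": 6.0},
    {"transaction": "login", "location": "San Francisco", "isp": "Sprint", "response_time_s": 3.0},
    {"transaction": "login", "location": "San Francisco", "isp": "AOL", "response_time_s": 5.0},
]


def average_by_attribute(events, attribute, exclude=None):
    """Average response time per value of `attribute`, skipping excluded attribute values."""
    exclude = exclude or {}
    groups = defaultdict(list)
    for e in events:
        # Filter: drop the event if any of its attributes matches an excluded value.
        if any(e.get(key) in values for key, values in exclude.items()):
            continue
        groups[e[attribute]].append(e["response_time_s"])
    return {value: mean(times) for value, times in groups.items()}


by_isp = average_by_attribute(events, "isp")
by_location_without_aol = average_by_attribute(events, "location", exclude={"isp": {"AOL"}})
```

Here `by_isp` compares the ISPs directly, while `by_location_without_aol` shows the per-location view after data associated with AOL has been filtered out.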
[0016] In accordance with another aspect of the invention, the performance data is monitored substantially in real-time (preferably by the controller) to check for any user-defined alert conditions. When such an alert condition is detected, a notification message may be sent by email, pager, or other communications method to an appropriate person. The alert conditions may optionally be specific to a particular location, organization, ISP, or other attribute. For example, a system administrator responsible for an Atlanta branch office may request to be notified when a particular problem (e.g., average response time exceeds a particular threshold) is detected by computers in that office. In the preferred embodiment, upon receiving an alert notification, the administrator can use a standard web browser to access the reports server and view the details of the event or events that triggered the notification.
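An attribute-specific alert condition of the type described above may be sketched as follows. The `AlertCondition` structure, the threshold semantics, and the message text are hypothetical; delivery of the message by email or pager is not shown.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class AlertCondition:
    """Hypothetical alert definition: average response time above a threshold."""
    name: str
    threshold_s: float
    attribute: Optional[str] = None  # e.g. "location"; None means all agent computers
    value: Optional[str] = None      # e.g. "Atlanta"


def check_alert(cond: AlertCondition, events: List[Dict]) -> Optional[str]:
    """Return a notification message if the condition is met, otherwise None."""
    relevant = [e["response_time_s"] for e in events
                if cond.attribute is None or e.get(cond.attribute) == cond.value]
    if not relevant:
        return None
    avg = mean(relevant)
    if avg > cond.threshold_s:
        return (f"ALERT {cond.name}: average response time {avg:.1f}s "
                f"exceeds {cond.threshold_s:.1f}s")
    return None


events = [
    {"location": "Atlanta", "response_time_s": 12.0},
    {"location": "Atlanta", "response_time_s": 14.0},
    {"location": "Boston", "response_time_s": 2.0},
]
atlanta_alert = check_alert(AlertCondition("slow-atlanta", 10.0, "location", "Atlanta"), events)
boston_alert = check_alert(AlertCondition("slow-boston", 10.0, "location", "Boston"), events)
```

Only the Atlanta condition produces a message in this example, because only the Atlanta agent computers observe an average above the threshold.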
[0017] In accordance with another aspect of the invention, the agent computers may be programmed to capture sequences of screen displays during transaction execution, and to transmit these screen displays to the reports server for viewing when a transaction fails. This feature allows the user to view the sequence of events, as “seen” by an agent, that led to the error condition.
[0018] In accordance with another feature of the invention, an agent computer may be programmed to launch a network monitor component when the path delay between the agent computer and the transactional server exceeds a preprogrammed threshold. Upon being launched, the network monitor component determines the delays currently being experienced along each segment of the network path. The measured segment delays are reported to personnel (preferably through the reports server), and may be used to detect various types of network problems. In accordance with another aspect of the invention, one or more of the agent computers may be remotely programmed to scan or crawl the monitored web site periodically to check for broken links (links to inaccessible objects). When broken links are detected, they may be reported by email, through the reports server, or by other means.
[0019] In accordance with another aspect of the invention, an agent computer may be programmed to measure time durations between predefined events that occur during transaction execution. The measured time durations are preferably reported to a centralized database, and may be used to display a break down of time involved in execution of the transaction into multiple components, such as, for example, network time and server time. Other time components that may be calculated and displayed include DNS resolution time, connection time, client time, and server/network overlap.
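One possible way of deriving time components from timestamps of predefined events is sketched below. The event names and the assignment of each interval to a component are assumptions made purely for illustration; server/network overlap is not modeled.

```python
def breakdown(ts):
    """Split a transaction into time components from event timestamps (milliseconds).

    Assumed attribution of intervals (hypothetical):
      start -> dns_done          : DNS resolution time
      dns_done -> connected      : connection time
      connected -> request_sent  : client time
      request_sent -> first_byte : server time
      first_byte -> last_byte    : network time
    """
    return {
        "dns_resolution": ts["dns_done"] - ts["start"],
        "connection": ts["connected"] - ts["dns_done"],
        "client": ts["request_sent"] - ts["connected"],
        "server": ts["first_byte"] - ts["request_sent"],
        "network": ts["last_byte"] - ts["first_byte"],
    }


# Hypothetical timestamps recorded by an agent during one transaction.
timestamps = {"start": 0, "dns_done": 40, "connected": 100,
              "request_sent": 105, "first_byte": 405, "last_byte": 655}
components = breakdown(timestamps)
total_ms = sum(components.values())
```

Because the intervals are contiguous in this sketch, the components sum to the total transaction time.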
[0020] In accordance with another aspect of the invention, a server agent component is configured to monitor server resource utilization parameters concurrently with the monitoring of transaction response times, or other response times, by a client-side agent. The server agent component is preferably located local to the monitored transactional server. The performance data generated by the client and server agents is aggregated in a centralized database that is remotely accessible through a web reports server. The reports server provides various user-configurable charts, tables and graphs displaying the response times and server resource utilization parameters, and provides functions for facilitating an evaluation of whether a correlation exists between changes in the response times and changes in values of specific server resource utilization parameters. Using this feature, a user can identify the server-side sources of performance problems seen by end users.
[0021] In accordance with another aspect of the invention, a root cause analysis (RCA) system is provided that automatically analyzes performance data collected by agents to locate performance degradations, and to identify lower level parameters (such as server resource parameters) that are correlated with such degradations. In a preferred embodiment, the RCA system analyzes the performance data to detect performance or quality degradations in specific parameter measurements (e.g., a substantial increase in average transaction response times). Preferably, this analysis is initially performed on the measurement data of relatively high level performance parameters—such as transaction response times—that indicate or strongly reflect the performance of the transactional server as seen by end users.
[0022] To evaluate the potential sources or causes of a detected performance degradation, a set of predefined dependency rules is used to identify additional, lower level parameters (e.g., network response time, server time, DNS lookup time, etc.) associated with specific potential causes or sources of the performance degradation. The measurements taken over the relevant time period for each such lower level parameter are analyzed to generate a severity grade indicative of whether that parameter likely contributed to or is correlated with the higher level performance degradation. For instance, the RCA process may determine that “server time” was unusually high during a time period in which the performance degradation occurred, indicating that the server itself was the likely source of the degradation in end user performance. This process may be performed recursively, where applicable, to drill down to even lower level parameters (such as specific server resource parameters) indicative of more specific causes of the performance degradation.
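The recursive use of dependency rules and severity grades may be illustrated with the following sketch. The rule table, the parameter names, the grading formula (ratio of the mean during the degraded period to the baseline mean), and the threshold are all assumptions chosen for the example, not the grading method of the preferred embodiment.

```python
from statistics import mean

# Hypothetical dependency rules: each parameter maps to its lower level parameters.
DEPENDENCY_RULES = {
    "transaction_response_time": ["network_time", "server_time", "dns_time"],
    "server_time": ["cpu_utilization", "memory_utilization"],
}


def severity_grade(baseline, degraded):
    """Assumed grade: mean during the degraded period divided by the baseline mean."""
    return mean(degraded) / mean(baseline)


def analyze(parameter, measurements, threshold=1.5, depth=0):
    """Return (depth, parameter, grade) for each suspect lower level parameter, recursively."""
    findings = []
    for child in DEPENDENCY_RULES.get(parameter, []):
        baseline, degraded = measurements[child]
        grade = severity_grade(baseline, degraded)
        if grade >= threshold:
            findings.append((depth + 1, child, round(grade, 2)))
            # Drill down only below parameters that themselves look degraded.
            findings.extend(analyze(child, measurements, threshold, depth + 1))
    return findings


# Hypothetical (baseline, degraded-period) measurements for each lower level parameter.
measurements = {
    "network_time": ([1.0, 1.0], [1.1, 1.1]),
    "server_time": ([2.0, 2.0], [6.0, 6.0]),
    "dns_time": ([0.1, 0.1], [0.1, 0.1]),
    "cpu_utilization": ([30.0, 30.0], [90.0, 90.0]),
    "memory_utilization": ([50.0, 50.0], [55.0, 55.0]),
}
findings = analyze("transaction_response_time", measurements)
```

In this example the analysis flags server time at the first level and then CPU utilization beneath it, while network time, DNS time and memory utilization remain below the threshold.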
[0023] The results of the RCA analysis are preferably presented in an expandable tree in which collections of related measurements are represented by nodes, and in which parent-child relationships between the nodes indicate predefined dependencies between performance parameters. The nodes are color coded, or otherwise displayed, to indicate performance or quality levels of the respective sets of measurements they represent. The tree thus reveals correlations between performance degradations in different parameters (e.g., server time and CPU utilization), allowing users to efficiently identify root causes of performance problems.
[0024] A distributed monitoring tool and associated methods that embody the various inventive features will now be described with reference to the following drawings:
[0025]-[0048] [Brief descriptions of the drawings not reproduced.]
[0049] Various inventive features will now be described with reference to a distributed monitoring tool and service for monitoring transactional servers. Although these features are described as part of a common monitoring system, those skilled in the art will recognize that many of these features can be practiced or used independently of others. In addition, the inventive features can be implemented differently than described herein, and/or within a different type of system (such as a load testing tool or service). Accordingly, the following description is intended only to illustrate certain embodiments of the invention, and not to limit the scope of the invention. The scope of the invention is defined only by the appended claims.
[0050] Throughout the following description, it will be assumed that the transactional server being monitored is a web-based system that is accessible via the Internet. It will be recognized, however, that the inventive methods and features can also be used to monitor other types of transactional servers and devices, including those that use proprietary protocols or are accessible only to internal users of a particular organization. For example, the underlying methodology can also be used to monitor internal intranets, two-tier client/server systems, SAP R/3 systems, and other types of distributed systems.
[0051] The description of the preferred embodiments is arranged within the following sections and subsections:
I. OVERVIEW
II. TERMINOLOGY
III. ARCHITECTURE AND GENERAL OPERATION
IV. CONTROLLER UI AND SESSION SETUP
V. PERFORMANCE REPORTS
VI. DATA FLOW AND DATABASE CONTENT
VII. ADDITIONAL FEATURES FOR DETECTING AND REPORTING PROBLEMS
VIII. ADDITIONAL FEATURES FOR DETERMINING THE SOURCE OF DETECTED PROBLEMS
    A. TRANSACTION BREAKDOWN
    B. SERVER RESOURCE MONITORING
    C. DETERMINATION OF NETWORK HOP DELAYS
    D. AUTOMATED ROOT CAUSE ANALYSIS OF PERFORMANCE DATA
        1. RCA SYSTEM USER INTERFACE
        2. ARCHITECTURE AND GENERAL OPERATION
        3. ROOT CAUSE ANALYSIS METHODS
            a. MEASURING AND GRADING THE MEASUREMENT VALUES
            b. EXPANDING THE EVALUATION OF SUB-METRICS
        4. AUTOMATED RECONFIGURATION OF TRANSACTIONAL SERVER
[0052] I. Overview
[0053]
[0054] As further depicted by
[0055] The agent
[0056] For convenience, the computers
[0057] The controller
[0058] The web reports server
[0059] As described below, one important feature of the monitoring tool involves the ability of the user to monitor server performance according to operator-selected attributes of the agent computers
[0060] Another important feature involves the ability of the user to assign execution schedules to particular agent machines
[0061] II. Terminology
[0062] To facilitate an understanding of the invention, the following terminology will be used throughout the remaining description:
[0063] The term “distributed monitoring session” or “distributed session” refers to a monitoring session in which multiple agent computers
[0064] The term “agent group” refers to the group of agent computers
[0065] The term “agent” refers either to the agent component
[0066] The term “attribute” refers to a particular characteristic or property of a host or agent computer, such as the location, organization, ISP, or configuration of the computer.
[0067] The term “transactional server” refers to a multi-user system which responds to requests from users to perform one or more tasks or “transactions,” such as viewing account information, placing an order, performing a search, or viewing and sending electronic mail. The term “operator” refers generally to a business entity that is responsible for the operation of the transactional server (typically the owner).
[0068] The term “testcase” refers generally to a computer representation of the transaction(s) to be performed by a particular computer to monitor a transactional server. In the preferred embodiment, the testcases include conventional test scripts (either in textual or executable form) that are “played” by the agent computers
[0069] The terms “parameter” and “metric” refer generally to a type or a definition of measurement.
[0070] III. Architecture and General Operation
[0071] In a preferred embodiment, the agent
[0072] The agents
[0073] Preferably, the agent group is selected so as to encompass a representative cross section of client attributes. For example, one or more agent computers
[0074] In addition, a monitoring service provider entity, such as the entity that operates the reports server
[0075] Where the agents
[0076] Further, rather than using agents that execute transactions, passive agents may be used to monitor interactions between actual end-users and the transactional server
[0077] As illustrated in
[0078] The controller
[0079] The controller
[0080] As depicted in
[0081] As indicated above, the reports server
[0082] The report generation component
[0083] IV. Controller UI and Session Setup
[0084]
[0085] The controller's menu, the top level of which is shown in
[0086] To create a new monitoring session, the user selects PROFILE/NEW, which causes the controller
[0087] In the preferred embodiment, the user can freely define what constitutes a “transaction” for monitoring purposes. For example, the user can start recording a user session, record any number of user interactions with the server (form submissions, page requests, etc.), stop recording, and then store the result as a transaction under a user-specified name (e.g., “browse catalog”). In addition, during subsequent editing of the transaction, the user can optionally divide the transaction into multiple smaller transactions or make other modifications. The transactions can also include accesses to multiple web sites. Preferably, the transactions are defined by the user with sufficient granularity to facilitate identification of performance bottlenecks. For example, the user may wish to create a separate transaction for each of the primary applications deployed on the transactional server
[0088] The transactions included within the session may optionally include special non-destructive or “synthetic” transactions that do not change the state of the transactional server
[0089] As illustrated by the “Select Computers” screen in
[0090] When the user selects the EDIT button (
[0091] The attributes that are assigned to the agent computers can be used to separately view the transactional server's performance as monitored by a particular attribute group (group of computers that share a particular attribute or set of attributes). For example, the user can view a graph of the response times measured by all agent computers with the location attribute “San Jose” or the ISP attribute “Sprint.” Example reports are shown in FIGS.
[0092] When the user selects the NEXT button from the Select Computers screen, an “Assign Transactions” screen (
[0093] When the user selects the NEXT button from the Assign Transactions screen, an “Assign Schedules” screen appears (
[0094] The execution schedules may be selected so as to provide continuous or near-continuous monitoring of the transactional server
[0095] The Setup Wizard may optionally provide one or more functions (not illustrated) for assisting users in setting up continuous or near-continuous monitoring sessions. For example, as the schedules are being assigned to agent computers, the wizard could automatically detect and display the “gaps” (periods of time during which the transactional server is not being monitored) in the cumulative execution schedule. The Setup Wizard could also provide an option to automatically generate an execution schedule which fills-in these gaps. In addition, a function could be provided for ensuring that at least two agent computers
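The detection of “gaps” in a cumulative execution schedule may be sketched as follows. Representing each assigned schedule as a (start hour, end hour) interval within a single day is a simplifying assumption made for this example, as is the function name `find_gaps`.

```python
def find_gaps(schedules, day_start=0, day_end=24):
    """Return the intervals within [day_start, day_end) not covered by any schedule."""
    gaps = []
    cursor = day_start  # end of the coverage seen so far
    for start, end in sorted(schedules):
        if start > cursor:
            gaps.append((cursor, start))  # unmonitored period before this schedule begins
        cursor = max(cursor, end)
    if cursor < day_end:
        gaps.append((cursor, day_end))
    return gaps


# Hypothetical schedules assigned to three agent computers (hours of the day).
assigned = [(0, 8), (6, 12), (14, 20)]
gaps = find_gaps(assigned)
```

A schedule that fills these gaps could then be generated directly from the returned intervals.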
[0096] When the user selects the FINISH button (
[0097] With the session open within the controller's console (
[0098] As illustrated by
[0099] The Alerts Wizard may also provide an option (not illustrated) to be notified when certain types of transactions fail, and/or when failures are detected within particular attribute groups. Using this option, a user can request to be notified whenever a problem is detected which falls within the user's respective area of responsibility. For example, a system administrator responsible for a particular business process may be notified when a transaction that corresponds to that business process fails; to avoid being notified of general failures, this notification may be made contingent upon other types of transactions completing successfully. Other example uses of this feature include: notifying an ISP administrator when a threshold number of agent computers using that ISP are unable to access the transactional server (optionally contingent upon the transactional server being accessible from other ISPs); and notifying a system administrator responsible for a particular office when a threshold number of agent computers
[0100] In other embodiments, the various functions of the controller
[0101] In one embodiment, the controller
[0102] In embodiments in which the controller
[0103] In yet other embodiments, the controller
[0104] V. Performance Reports
[0105] FIGS.
[0106] The graphs indicate various aspects of the transactional server's performance as monitored over a particular time frame (the current day in this example). The first graph
[0107] As illustrated in
[0108] With further reference to FIGS.
[0109] The “Filters” option
[0110] The Graph List option
[0111]
[0112] As will be apparent from the foregoing examples, the ability to separately view and filter the performance data based on the attributes of the agent computers, including operator-specified attributes, greatly simplifies the task of identifying attribute-specific problems. Although specific attribute types are shown in the example reports, it should be understood that the illustrated features can be applied to other types of attributes, including user assigned attribute types.
[0113] The reports server
[0114] VI. Data Flow and Database Content
[0115] The general flow of information between components during the setup and execution of a typical monitoring session will now be described with reference to FIGS.
[0116]
[0117] Table 1 summarizes, for one example embodiment, the tables that are created in the sessions database
TABLE 1
EXAMPLE DATABASE SCHEMA

TABLE NAME | DESCRIPTION
Groups | Contains the names of all agent computers and their associated properties.
Transactions | Contains a listing of the transactions, by name, with each assigned a numerical transaction ID. For each transaction, the table contains the thresholds used for evaluating response times (e.g., less than 20 sec. = OK, from 20 to 30 sec. = poor, etc.).
Status | Contains a listing of the available transaction statuses (e.g., Pass = 0, Fail = 1, etc.).
Ranks | Contains a listing of the threshold criteria names (e.g., 1 = OK, 2 = Warning, etc.).
Properties | For each property defined by the user, a table is created that assigns a numerical ID to the set of members of that property (e.g., the “organizations” table might include the entries R&D = 1, Marketing = 2, etc.).
Event Meter | Contains the results of each transaction execution event. Each transaction execution event is represented by a record which contains the following data: record ID (increases sequentially with each new execution event), transaction ID, result (status value), date/time, response time in seconds, and properties of agent computer (location, organization, etc.).
Alarms Definitions | Contains definitions of events that trigger alarms.
Alarms | Stores a log of triggered alarm conditions.
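A portion of the example schema of Table 1 may be sketched using an in-memory SQLite database, as shown below. The column names, data types, and sample rows are assumptions made for illustration; the table descriptions above do not specify them, and the sessions database need not be implemented with SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY, name TEXT, ok_below_s REAL, poor_below_s REAL);
CREATE TABLE status (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE event_meter (
    record_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- increases with each execution event
    transaction_id INTEGER REFERENCES transactions(id),
    result INTEGER REFERENCES status(id),
    executed_at TEXT,
    response_time_s REAL,
    location TEXT,
    organization TEXT);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO transactions VALUES (1, 'browse catalog', 20, 30)")
conn.executemany("INSERT INTO status VALUES (?, ?)", [(0, "Pass"), (1, "Fail")])
conn.executemany(
    "INSERT INTO event_meter (transaction_id, result, executed_at, response_time_s,"
    " location, organization) VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 0, "2001-05-09 10:00", 12.0, "New York", "R&D"),
     (1, 1, "2001-05-09 10:05", 34.0, "New York", "Marketing"),
     (1, 0, "2001-05-09 10:10", 8.0, "San Jose", "R&D")])

# Average response time per location, as a report might request it.
rows = conn.execute(
    "SELECT location, AVG(response_time_s) FROM event_meter"
    " GROUP BY location ORDER BY location").fetchall()
```

Storing the agent computer's properties with each execution event, as in the Event Meter table, allows such per-attribute queries without further joins.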
[0118] As depicted by the downward arrow in
[0119]
[0120] As further depicted by
[0121] wait for message from a Vuser (agent)
[0122] route message to web reports server via API call
[0123] ApmApi_reportTransaction(transaction, host, status, value)
[0124] route message to alarms engine
[0125] go back to wait
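The message routing steps listed above may be sketched as a simple loop. The function `ApmApi_reportTransaction` below is only a local stand-in bearing the name of the API call mentioned above; its body, the message format, the alarms engine stub, and the use of a `None` sentinel to end the loop are assumptions made for this example.

```python
import queue

reported = []     # records handed to the (stubbed) web reports server API
alarm_inbox = []  # messages handed to the (stubbed) alarms engine


def ApmApi_reportTransaction(transaction, host, status, value):
    # Stand-in for the API call named above; a real call would reach the web reports server.
    reported.append((transaction, host, status, value))


def route_to_alarms_engine(message):
    # Hypothetical hand-off to the alarms engine.
    alarm_inbox.append(message)


def routing_loop(messages):
    while True:
        message = messages.get()  # wait for message from a Vuser (agent)
        if message is None:       # assumed sentinel so the example terminates
            break
        # Route message to web reports server via API call.
        ApmApi_reportTransaction(message["transaction"], message["host"],
                                 message["status"], message["value"])
        # Route message to alarms engine, then go back to wait.
        route_to_alarms_engine(message)


inbox = queue.Queue()
inbox.put({"transaction": "login", "host": "agent-ny-01", "status": 0, "value": 2.4})
inbox.put(None)
routing_loop(inbox)
```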
[0126] Various alternatives to the data flow process shown in
[0127]
[0128] VII. Additional Features for Detecting and Reporting Problems
[0129] Three optional features for detecting and reporting error conditions and performance problems will now be described. All three of these features are preferably implemented in part through executable code of the agent component
[0130] The first such feature involves having the agent computers
[0131]
[0132] Once the transaction is finished, the agent
[0133] A second feature that may be incorporated into the agent
[0134] A third feature involves the ability of the agents
[0135] VIII. Additional Features for Determining the Source of Detected Problems
[0136] Upon determining that a performance problem exists with the deployed transactional server
[0137] Briefly, using a transaction breakdown feature (shown in
[0138] A. Transaction Breakdown
[0139] The transaction breakdown feature will now be described with reference to FIGS.
[0140] For example, if the worst performing location is New York, the user may select a location-specific drill down link
[0141] Thus, after determining, for example, from the performance summary report