Hitachi

JP1 Version 12 Infrastructure Management: Getting Started


5.2 Analyzing problems by analyzing correlations in performance data

Identify the cause of a problem by analyzing correlations in performance data. This method is useful when you cannot identify a problem with the system, for example, because the status of the IT base system has not changed. This manual first describes how to identify the targets of analysis based on the performance data of each resource, and then describes how to analyze the cause of a problem.

Before you begin

The examples in the procedure below assume the following context: An administrator receives a notification from a customer that a particular application is slow. The administrator checks the information in the E2E View window and the Event Analysis View window, but cannot find any resource whose status indicates a warning or an error. As a result, the administrator cannot identify the problem.

Procedure

  1. Change the base point of the analysis to the virtual machine on which the application is running, and then open the E2E View window.

  2. Check the configuration information of devices related to the virtual machine.

    The problem cannot be identified, because there are no resources whose statuses indicate the occurrence of a warning or an error.

  3. Click the Performance Analysis View button to analyze the problem by analyzing correlations in performance data.

    [Figure]

    In the Select Resource area of the Performance Analysis View window, configuration information about devices related to the virtual machine that was used as the base point of analysis is displayed. Performance graphs of the main metrics of the virtual machine are displayed in the Performance Graph Preview area.

    [Figure]
  4. In the Performance Graph Preview area, check the lines in the performance graphs of the virtual machine. Look for a graph that might indicate the cause of the problem.

    [Figure]
  5. To check the details of a graph, click the arrow on the right to enlarge the graph in the Detailed Analysis Area area.

    The enlarged graph indicates that the performance of virtual machine vm100 with respect to the Virtual Disk Write Latency metric seems to have deteriorated during a certain period in the past.

    You can use the enlarged graph in the Detailed Analysis Area in operations (such as overlaying graphs and searching for similar graphs) for the purpose of analysis.

    Note

    You can compare performance graphs in JP1/OA against performance information managed by other software (such as data in JP1/PFM reports) to see whether there are any correlations. To do so, import the information output by the other software. The information can be used in the same ways that performance information from JP1/OA can be used: for example, in overlaying graphs or searching for similar graphs.

    For details about the types of external data that can be imported, see the JP1/Operations Analytics Configuration and Administration Guide.

  6. To start analysis, from the Select Resource area, select the hypervisor ESX001 on which virtual machine vm100 is running.

    [Figure]

    In the Performance Graph Preview area, the performance graphs of the main metrics of hypervisor ESX001 are also displayed.

  7. In the performance graphs of hypervisor ESX001, look for a metric that might be related to the metric of the comparison source.

    The Disk Number of Writes Requests metric of hypervisor ESX001 is of the same type as the Virtual Disk Write Latency metric of virtual machine vm100. Enlarge the graph of the Disk Number of Writes Requests metric of hypervisor ESX001 in the Detailed Analysis Area area for further analysis.

  8. In the Detailed Analysis Area area, compare the lines in the enlarged graphs by overlaying one graph on top of the other.

    1. Of the graphs of hypervisor ESX001, select a graph that shows changes, and then click the Overlay the Graph button.

      [Figure]

      The graph of hypervisor ESX001 is overlaid with the graph of virtual machine vm100.

    2. Check the overlaid graphs.

      Although the lines of the overlaid graphs look similar for a certain period of time, this alone is still insufficient to determine that the problem is caused by Disk Number of Writes Requests of hypervisor ESX001.

  9. Repeat steps 7 and 8 for other metrics that might be relevant until you find the bottleneck.

    Because the cause of the problem cannot be identified even after investigating other relevant metrics, the scope of the investigation is expanded to include all virtual machines running on hypervisor ESX001.

    However, because the Select Resource area displays only resources that are related to virtual machine vm100 (which was used as the base point of analysis), it would take time to check all virtual machines by repeatedly overlaying their graphs. For this reason, use correlations in performance data to search for performance graphs.

  10. Search for graphs that are similar to the graph of virtual machine vm100.

    1. To include all virtual machines on hypervisor ESX001 within the scope of the search, add the applicable virtual machines to the Select Resource area.

      Click the Add Resources button and add the virtual machines required for the search.

    2. Select the graph of virtual machine vm100 and click the Find Similar Graphs button.

      [Figure]

      The search conditions to be set are displayed.

    3. Set the search conditions, and then click the Search button.

      Tip

      The correlation coefficient is a value that indicates the degree of correlation between two lines in a graph. A coefficient close to 1 indicates a strong positive correlation between the lines, whereas a coefficient close to -1 indicates a strong negative correlation between the lines (the lines have opposing trends). Specify this search condition as an absolute value. Metrics for which the absolute value of the correlation coefficient is equal to or greater than the specified value will be returned as search results.

      The default value is 0.7. In this case, metrics for which the absolute value of the correlation coefficient is equal to or greater than 0.7 (in other words, in the range from -1.0 to -0.7 or in the range from 0.7 to 1.0) will be returned as search results.

    4. The search results are displayed as a list of resource metrics in descending order of the absolute values of the correlation coefficient.

      The list includes the CPU Use metric of virtual machine vm102.

    5. Select each metric, in descending order of the absolute values of the correlation coefficient, and then check whether the lines in the graph of the metric are similar to those in the graph of virtual machine vm100 (the comparison source).

      [Figure]

      When the CPU Use metric of virtual machine vm102 is selected in the list, the graph of the selected metric is displayed, overlaid with the comparison-source graph.

      By examining the overlaid graphs, you can see that the two graphs are very similar. As a result, this metric can be identified as a possible cause of the problem.

      Note

      Despite a high correlation coefficient, the comparison-source graph and the compared graph sometimes do not appear to be correlated. This can happen because the graphs are displayed differently depending on the units of each metric.

    6. If you want to keep this graph while you check other metrics in the list of search results for similar graphs, click the Move this graph to the Analysis Area button. The graph will be enlarged and displayed in the Detailed Analysis Area area.

  11. Repeat the previous steps for other metrics, in descending order of the absolute values of the correlation coefficient, and check for similar graphs until you have narrowed down the possible causes of the problem.

    The graph of the CPU Use metric of virtual machine vm102 is most similar to the comparison-source graph. Based on this result, virtual machine vm102 is identified as the likely cause of the problem.

  12. Continue to analyze the problem of virtual machine vm102.

    From among the graphs that are displayed in the Detailed Analysis Area area, select the graph of the CPU Use metric of virtual machine vm102, and then click the Change the base point and open a new E2E View button.

    [Figure]

    The E2E View window appears and virtual machine vm102 is set as the base point.

    Tip

    For a small-scale business system, you can analyze the cause of the problem by continuing the analysis in the Performance Analysis View window. To continue the analysis, click the Change the base point and open a new Performance Analysis View button.
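
The similar-graph search in step 10 can be illustrated with a minimal Python sketch. It assumes that the correlation coefficient is the Pearson coefficient; this manual does not state which formula JP1/OA uses. The function names, the sample values, and the Free Memory metric are hypothetical. The sketch keeps only metrics whose absolute coefficient is at least the threshold (0.7 by default) and lists them in descending order of the absolute value, as described in the Tip in step 10.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2 samples)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        # A flat line has no defined correlation; this sketch treats it as 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


def find_similar(source, candidates, threshold=0.7):
    """Return (name, coefficient) pairs with |coefficient| >= threshold,
    sorted in descending order of |coefficient|."""
    scored = [(name, pearson(source, series)) for name, series in candidates.items()]
    hits = [(name, r) for name, r in scored if abs(r) >= threshold]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical sample data, not taken from JP1/OA.
vm100_write_latency = [5, 6, 5, 20, 35, 30, 8, 6]
candidates = {
    "vm102 / CPU Use": [10, 12, 11, 60, 95, 90, 20, 12],
    "vm101 / CPU Use": [30, 31, 29, 30, 32, 30, 31, 29],
    "vm103 / Free Memory": [80, 78, 79, 40, 15, 20, 70, 77],
}

for name, r in find_similar(vm100_write_latency, candidates):
    print(f"{name}: {r:+.2f}")
```

With this sample data, the vm102 and vm103 series pass the default threshold (one with a positive coefficient, one with a negative coefficient), and the nearly flat vm101 series is filtered out. Comparing absolute values is what allows a metric with an opposing trend to appear in the results.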

Next steps

Check the configuration information in the E2E View window, where virtual machine vm102 is set as the base point of analysis, and then continue to analyze the impact and severity of the problem.
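
The Note in step 10 says that two graphs can have a high correlation coefficient but not look correlated, because of the metric units. The following Python sketch shows one way to see this outside JP1/OA. It is an illustration under stated assumptions, not a description of how JP1/OA draws graphs: the function name and the sample values are hypothetical. The two series have the same shape but very different magnitudes, so on a shared axis the second line would look almost flat. Rescaling each series to the range 0 to 1 makes the matching shape visible.

```python
def min_max_normalize(series):
    """Rescale a series to the range 0..1 so that its shape can be compared
    with series that use different units."""
    low, high = min(series), max(series)
    if high == low:
        # A flat line has no shape to preserve; map every point to 0.
        return [0.0 for _ in series]
    return [(value - low) / (high - low) for value in series]


# Hypothetical sample data, not taken from JP1/OA.
# Same shape, but the second series is ten times smaller in magnitude.
series_a = [5, 6, 5, 20, 35, 30, 8, 6]
series_b = [0.5, 0.6, 0.5, 2.0, 3.5, 3.0, 0.8, 0.6]

for a, b in zip(min_max_normalize(series_a), min_max_normalize(series_b)):
    print(f"{a:.3f}  {b:.3f}")
```

After normalization, the two columns match to within floating-point rounding, even though the raw values differ by a factor of ten.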