3.3.1 Example of setup for predictive error detection in the performance of monitored services and the corrective action support methodology
This subsection explains, by way of example, how to evaluate and set up predictive error detection in the performance of monitored services and the corrective action support methodology, as discussed in 1.1.2 Monitoring service status. The example shows how to perform the evaluation and setup under the given conditions.
Organization of this subsection:
(1) Prerequisites
(2) Defining SLOs from the SLA
(3) Setting up monitoring items
(1) Prerequisites
The following are the conditions for this setup example:
- There is a service level agreement (SLA) regarding service quality (the service level) between the service's outsourcing company (service provider) and the outsourced contractor (data center). The data center is required to maintain the service level based on the SLA.
- The outsourced services are registered as monitored services as shown below, and monitoring of the monitored services has been stopped:
  - Service group Group01: Service01 to Service03
  - Service group Group02: Service04 and Service05
  - Service group Group03: Service06
  - Service group Group04: Service07
The following figure shows the relationship among the personnel involved in this task.
Figure 3‒30: Relationship among personnel involved in predictive error detection in the performance of monitored services and the corrective action support methodology (setup example)

- Person who monitors all services
  Determines the SLO for each monitoring item based on the SLA, and then sets up the monitoring items in the Settings window.
- Outsourcing company's agent
  This person is in charge of providing the services outsourced under the agreement; the person who monitors all services is responsible for managing their service level.
(2) Defining SLOs from the SLA
- Tasks required for setting up monitoring items in ITSLM
The person who monitors all services checks the SLA and derives SLO thresholds from it. Because the SLA includes requirements that the achievement rate for response performance be 95% or higher and that service availability be 99.8% or higher, the person who monitors all services defines the SLOs as follows:
- Average response time: 3,000 milliseconds
- Throughput: 800 count/second
- Error rate: 1.0%
Because warning signs of service performance errors must be detected and handled, the person who monitors all services also decides to perform out-of-range value detection in addition to threshold-based monitoring using the SLOs.
- Results of the tasks
Because SLOs have been defined, the person who monitors all services decides to set up monitoring items for each monitored service.
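As an illustration of how monitoring against such SLO thresholds works, the check below compares one interval of measured performance with the three SLOs defined above. This is a minimal sketch under assumed data structures, not ITSLM's implementation; only the threshold values come from this example.

```python
# Hypothetical sketch of threshold-based SLO monitoring (not ITSLM internals).
# Threshold values are taken from the SLOs defined in this example.
SLO_THRESHOLDS = {
    "avg_response_ms": 3000,   # average response time SLO (milliseconds)
    "throughput_cps": 800,     # throughput SLO (count/second)
    "error_rate_pct": 1.0,     # error rate SLO (percent)
}

def check_slo(measured: dict) -> list:
    """Return the monitoring items whose measured values violate the SLO."""
    violations = []
    # Average response time and error rate violate when they EXCEED the SLO;
    # throughput violates when it FALLS BELOW the SLO.
    if measured["avg_response_ms"] > SLO_THRESHOLDS["avg_response_ms"]:
        violations.append("avg_response_ms")
    if measured["throughput_cps"] < SLO_THRESHOLDS["throughput_cps"]:
        violations.append("throughput_cps")
    if measured["error_rate_pct"] > SLO_THRESHOLDS["error_rate_pct"]:
        violations.append("error_rate_pct")
    return violations

# Example: only the average response time exceeds its SLO here.
print(check_slo({"avg_response_ms": 3200,
                 "throughput_cps": 950,
                 "error_rate_pct": 0.4}))  # ['avg_response_ms']
```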
(3) Setting up monitoring items
- Tasks in ITSLM
The person who monitors all services decides to log in to ITSLM - Manager to display the Settings window and set up monitoring items for the monitored services based on the defined SLOs.
The following shows a setup example of monitoring items for the monitored services based on the SLOs.
Figure 3‒31: Setup example of monitoring items for the monitored services based on the SLOs

This example sets up monitoring items for service Service01 of service group Group01. The following shows the settings for the monitoring items.
- SLO monitor settings
Table 3‒11: Example settings under SLO monitor settings

  Check box   Item name       Threshold   Check box   Trend monitoring (hours)
  Selected    Avg. response   3000        Selected    5
  Selected    Throughput      800         Selected    5
  Selected    Error rate      1.0         --          --
Under SLO monitor settings, the SLO definition items are specified as thresholds, and then trend monitoring is set up for average response time and throughput so as to promptly detect any error in the performance of a monitored service.
A potential service performance error must be detected at least five hours in advance because other personnel must be contacted to take corrective action in the event of a service performance error. For this reason, trend monitoring is set to 5 hours.
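The idea of raising a warning five hours before a threshold is actually crossed can be sketched as a linear extrapolation of recent measurements. This is an assumed model for illustration only; ITSLM's actual trend-monitoring algorithm is not described here, and the function name and data layout are hypothetical.

```python
# Hypothetical sketch of trend monitoring as least-squares extrapolation
# (an assumption, not ITSLM's algorithm).
def predict_violation(samples, threshold: float, window_hours: float,
                      rising: bool = True) -> bool:
    """samples: list of (hour, value) pairs, oldest first.
    Returns True if the least-squares trend line reaches `threshold`
    within `window_hours` after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return False  # no time spread, no trend to extrapolate
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    last_t = samples[-1][0]
    # Project the trend line `window_hours` past the last sample.
    projected = mean_v + slope * (last_t + window_hours - mean_t)
    return projected >= threshold if rising else projected <= threshold

# Average response time climbing ~150 ms/hour toward the 3,000 ms threshold:
history = [(0, 2000), (1, 2150), (2, 2300), (3, 2450)]
print(predict_violation(history, threshold=3000, window_hours=5))  # True
```

For throughput, which violates its SLO by falling rather than rising, the same sketch would be called with `rising=False`.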
- Error Predict. settings
Table 3‒12: Example settings under Error Predict. settings

  Days in baseline calculation: 20
  Days till start: 5

  Check box   Item name       Sensitivity   Correlated item
  Selected    Avg. response   High          Throughput
  Selected    Throughput      High          --
  Selected    Error rate      High          --
Under Error Predict. settings, 20 days' worth of service performance is to be used to calculate the baseline for performing monitoring based on typical service performance. Days till start is set to 5 because it was requested that monitoring be started five days later.
Out-of-range value detection is to be performed for all monitoring items. The sensitivity is set to high so that any service performance that veers from the baseline will be detected quickly. Out-of-range value detection with multiple monitoring items combined is also to be performed to improve the precision of out-of-range value detection.
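The baseline-and-sensitivity idea can be sketched as a per-time-slot mean and standard deviation computed from past days, with the sensitivity setting controlling the width of the tolerance band. The model, the band widths, and all names below are assumptions for illustration, not ITSLM internals.

```python
# Hypothetical sketch of out-of-range value detection against a baseline
# (an assumed model, not ITSLM's implementation).
import statistics

# Assumed mapping: higher sensitivity = narrower tolerance band.
SENSITIVITY_BAND = {"High": 2.0, "Medium": 3.0, "Low": 4.0}

def build_baseline(history):
    """history: one list of values per past day (one value per time slot).
    Returns a (mean, stdev) pair for each time slot across all days."""
    return [(statistics.mean(col), statistics.pstdev(col))
            for col in zip(*history)]

def out_of_range(baseline, slot: int, value: float, sensitivity: str) -> bool:
    """True if `value` deviates from the baseline for `slot` by more than
    the sensitivity-scaled standard deviation."""
    mean, stdev = baseline[slot]
    return abs(value - mean) > SENSITIVITY_BAND[sensitivity] * stdev

# Example: 4 days of history with 3 time slots each (the scenario above uses
# 20 days' worth of service performance to calculate the baseline).
history = [[100, 200, 300], [102, 198, 301], [98, 202, 299], [100, 200, 300]]
baseline = build_baseline(history)
print(out_of_range(baseline, slot=0, value=110, sensitivity="High"))  # True
```

Combining items, as with the Throughput correlated item for Avg. response above, would mean flagging an out-of-range condition only when both items deviate together, which reduces false positives.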
- Results of the tasks
Once setup has been completed for service Service01 of service group Group01, the person who monitors all services proceeds to set up monitoring items for the remaining monitored services in the same manner.
After setup has been completed for all monitored services, the person who monitors all services decides to perform monitoring. For an example of execution of monitoring, see 4.6.1 Example of execution for predictive error detection in the performance of monitored services and the corrective action support methodology.