3.1.5 Using availability monitoring for checking the availability of services (working with Performance Management)

Availability monitoring is supported when ITSLM is linked with Performance Management.

This subsection explains availability monitoring.

Organization of this subsection

(1) About availability monitoring
(2) Availability items that can be output to reports
(3) Reporting criteria
(4) Criteria for determining that performance has returned to normal
(5) Supplementary information
(6) Related topics

(1) About availability monitoring

Availability monitoring is a method for checking whether monitored services are running smoothly.

PFM - Agent for Service Response is used for monitoring the availability of monitored services. You can monitor the availability of monitored services even when no users are accessing them.

The following figure shows how availability monitoring works.

Figure 3‒21: How availability monitoring works

You can check the current availability of services in the Home window or the Real-time Monitor window. If a monitored service has stopped, an error is displayed in these windows. The following shows an example in which an error is displayed in a window.

Figure 3‒22: Example in which an error is displayed in a window (availability monitoring)

(2) Availability items that can be output to reports

For the monitored services whose availability is being monitored, you can output availability items to reports. The availability items are metrics used to evaluate availability. Availability monitoring enables you to output the following availability items to reports:

Service availability
MTTR (mean time to recovery)
MTBF (mean time between failures)

The following table provides details about the availability items that can be output to reports by availability monitoring.

Table 3‒8: Definition of availability items and formulas
No.	Evaluation metric (SLO)	Definition	Formula
1	Service availability	Percentage of the time during the report interval that the service was running	Service availability (%) = `A` / (`A` + `B`) × 100, where `A` = Total operational period during the report interval (minutes) and `B` = Total error period during the report interval (minutes)
2	MTTR (mean time to recovery)	Average time required from the occurrence of an error to recovery from the error during the report interval	Mean time to recovery (minutes) = `B` / `C`, where `B` = Total error period during the report interval (minutes) and `C` = Number of times errors occurred during the report interval
3	MTBF (mean time between failures)	Average time from one error recovery to the occurrence of the next error during the report interval	Mean time between failures (minutes) = `A` / `C`, where `A` = Total operational period during the report interval (minutes) and `C` = Number of times errors occurred during the report interval

Legend:

Report interval: Total length of time subject to reporting that is obtained from the start time and period entered by the user in the Report area of the Report window.

Operational period: Period from the time normal operation of the monitored service was verified to the time a stoppage of the monitored service was detected or monitoring was stopped.

Error period: Period from the time a stoppage of the monitored service was detected to the time normal operation of the monitored service was verified or monitoring was stopped.
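The formulas in Table 3‒8 can be sketched in a few lines of Python. This is only an illustration and not part of ITSLM: the function name `availability_items` and the dictionary keys are hypothetical, and returning `None` when no errors occurred (`C` = 0) or when no time was monitored is an assumption, because this subsection does not state what is reported in those situations.

```python
def availability_items(operational_min, error_min, error_count):
    """Compute the three availability items from A, B, and C in Table 3-8.

    operational_min: A, total operational period (minutes)
    error_min:       B, total error period (minutes)
    error_count:     C, number of times errors occurred
    """
    a, b, c = operational_min, error_min, error_count
    monitored = a + b
    return {
        # Service availability (%) = A / (A + B) x 100
        # (None when nothing was monitored is an assumption of this sketch)
        "service_availability_pct": a / monitored * 100 if monitored > 0 else None,
        # Mean time to recovery (minutes) = B / C
        # (None when C = 0 is an assumption of this sketch)
        "mttr_min": b / c if c > 0 else None,
        # Mean time between failures (minutes) = A / C
        "mtbf_min": a / c if c > 0 else None,
    }


# Hypothetical one-day report interval: 1,380 operational minutes,
# 60 error minutes, and 2 errors.
items = availability_items(operational_min=1380, error_min=60, error_count=2)
print(items)
```

For these hypothetical values, the service availability is about 95.8%, the mean time to recovery is 30 minutes, and the mean time between failures is 690 minutes.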

The following three cases show how availability monitoring calculates the availability items.

Case where the monitored service is stopped at the time reporting begins or at the time reporting ends

If the monitored service stopped due to an error before the report start time, the report start time is used as the time the monitored service stopped for the purpose of calculating the availability items.

If a stopped monitored service is still stopped at the report end time, the report end time is used as the end of the error period for the purpose of calculating the availability items.

The following figure shows an example in which the monitored service is already stopped at the report start time and is stopped at the report end time.

The availability items for this example are calculated as follows:

Service availability = (T₃ - T₁)/{(T₃ - T₁) + (T₁ - T_M) + (T_N - T₃)}

= (T₃ - T₁)/(T_N - T_M)

Mean time to recovery = {(T₁ - T_M) + (T_N - T₃)}/2

Mean time between failures = (T₃ - T₁)/2
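The handling of the report boundaries can be sketched as clipping each error period to the report interval before the formulas are applied. The following Python example is a minimal sketch under stated assumptions: the helper `clamp`, the representation of periods as `(start, end)` tuples in minutes, and all numeric values are hypothetical, and monitoring is assumed to run for the whole report interval.

```python
def clamp(period, report_start, report_end):
    """Clip a (start, end) period to the report interval; None if nothing remains."""
    start = max(period[0], report_start)
    end = min(period[1], report_end)
    return (start, end) if start < end else None


# Hypothetical times in minutes. T_M and T_N are the report start and end.
T_M, T_1, T_3, T_N = 0, 30, 150, 180

# The first error began 20 minutes before the report start, and the second
# error is still in effect 60 minutes after the report end.
raw_errors = [(-20, T_1), (T_3, 240)]

clipped = (clamp(e, T_M, T_N) for e in raw_errors)
errors = [p for p in clipped if p is not None]

B = sum(end - start for start, end in errors)  # total error period
A = (T_N - T_M) - B  # total operational period (assumes no monitoring gaps)
C = len(errors)      # number of errors

print(A / (A + B) * 100, B / C, A / C)
```

With these values, the clipped error periods are `(0, 30)` and `(150, 180)`, which matches the terms (T₁ - T_M) and (T_N - T₃) in the formulas above. The service availability is about 66.7%, the mean time to recovery is 30 minutes, and the mean time between failures is 60 minutes.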

Case where the report interval contains periods of time during which monitoring is not performed

If the report interval contains periods of time during which monitoring is not performed, those periods are not included in the calculation of availability items because availability is not checked during those periods.

If an error period contains within it a period of time during which monitoring is not being performed, that error period is treated as two error periods separated by the interval when monitoring was not being performed.

The following figure shows an example in which the report interval contains periods of time during which monitoring is not performed.

The availability items for this example are calculated as follows:

Service availability = (T₁ - T_M)/{(T₁ - T_M) + (T₃ - T₂) + (T_N - T₄)}

Mean time to recovery = {(T₃ - T₂) + (T_N - T₄)}/2

Mean time between failures = (T₁ - T_M)/2
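The rule that an unmonitored period splits an error period in two can be sketched as an interval subtraction. The following Python example is an illustration only: the helper `subtract`, the `(start, end)` tuples in minutes, and the timeline values are hypothetical and do not reproduce the figure.

```python
def subtract(period, gaps):
    """Split a (start, end) period around every (start, end) gap that overlaps it."""
    pieces = [period]
    for gap_start, gap_end in gaps:
        next_pieces = []
        for start, end in pieces:
            if gap_end <= start or end <= gap_start:
                next_pieces.append((start, end))  # no overlap with this gap
                continue
            if start < gap_start:
                next_pieces.append((start, gap_start))  # part before the gap
            if gap_end < end:
                next_pieces.append((gap_end, end))  # part after the gap
        pieces = next_pieces
    return pieces


# Hypothetical timeline in minutes.
report_start, report_end = 0, 240
gaps = [(120, 150)]   # monitoring was not performed during this period
error = (90, 200)     # one stoppage that spans the unmonitored period

error_periods = subtract(error, gaps)
B = sum(end - start for start, end in error_periods)   # total error period
unmonitored = sum(end - start for start, end in gaps)
A = (report_end - report_start) - unmonitored - B      # total operational period
C = len(error_periods)                                 # counted as two errors

print(error_periods, A / (A + B) * 100, B / C, A / C)
```

Here the single stoppage is counted as the two error periods `(90, 120)` and `(150, 200)`, and the 30 unmonitored minutes appear in neither `A` nor `B`. The service availability is about 61.9%, the mean time to recovery is 40 minutes, and the mean time between failures is 65 minutes.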

Case where the report interval contains periods of time during which information acquisition failed

If availability information cannot be acquired for a period of time during monitoring because a communication error occurred or because PFM - Agent for Service Response was not running, the availability acquired from PFM - Agent for Service Response immediately before the interval for which there is no availability information is assumed to continue.

The following figure shows an example in which the report interval contains periods of time during which information acquisition failed:

The availability items for this example are calculated as follows:

Service availability = {(T₂ - T_M) + (T_N - T₄)}/{(T₂ - T_M) + (T_N - T₄) + (T₄ - T₂)}

Mean time to recovery = (T₄ - T₂)/1

Mean time between failures = {(T₂ - T_M) + (T_N - T₄)}/1
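The assumption that the last acquired availability continues can be sketched as a carry-forward over a series of measurement results. The following Python example is a minimal sketch: the function `carry_forward`, the status strings, the use of `None` for a failed acquisition, and the five-minute spacing are all hypothetical. A missing value with no earlier result stays `None`, because this subsection does not describe that situation.

```python
def carry_forward(samples):
    """Replace missing statuses (None) with the last status that was acquired."""
    filled = []
    last = None
    for time, status in samples:
        if status is None:
            status = last  # the previous availability is assumed to continue
        else:
            last = status
        filled.append((time, status))
    return filled


# Hypothetical measurement results: (time in minutes, status).
# Acquisition failed at minutes 10 and 15.
samples = [
    (0, "normal"),
    (5, "normal"),
    (10, None),
    (15, None),
    (20, "error"),
    (25, "normal"),
]

filled = carry_forward(samples)
print(filled)
```

In this sketch, minutes 10 and 15 are treated as normal because the last result acquired before the failure was normal, so that time counts toward the operational period.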

(3) Reporting criteria

When a monitored service that is subject to availability monitoring is stopped, an error is reported. If either of the following criteria is satisfied, the monitored service is treated as being stopped:

The first measurement result obtained after monitoring started is an error.
The previous measurement result was normal and the current measurement result is an error.

If monitoring is stopped, the measurement results that have been obtained so far are reset. Therefore, if monitoring stops while the monitored service is stopped and the measurement result obtained after monitoring is restarted is an error, another stoppage of the monitored service is reported.
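The reporting criteria and the effect of the reset can be sketched as a comparison of the previous and current measurement results. The following Python example is an illustration only: the function names, the status strings, and the `RESTART` marker are hypothetical, and `None` stands for "no previous result" after monitoring starts or restarts.

```python
RESTART = "restart"  # hypothetical marker: monitoring was stopped and restarted


def stop_reported(previous, current):
    """Return True if the current result causes a stoppage to be reported."""
    if current != "error":
        return False
    # Criterion 1: first result after monitoring started (no previous result).
    # Criterion 2: the previous result was normal.
    return previous is None or previous == "normal"


def stop_reports(events):
    """Return the positions in events at which a stoppage is reported."""
    previous = None
    reports = []
    for index, event in enumerate(events):
        if event == RESTART:
            previous = None  # results obtained so far are reset
            continue
        if stop_reported(previous, event):
            reports.append(index)
        previous = event
    return reports


print(stop_reports(["normal", "error", "error", RESTART, "error"]))  # → [1, 4]
```

The stoppage is reported once at position 1, is not repeated at position 2 while the error continues, and is reported again at position 4 because the restart discarded the earlier results.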

(4) Criteria for determining that performance has returned to normal

If both of the following criteria are satisfied, the monitored service is determined to have recovered from the stoppage and returned to normal:

An error had occurred at the time of the previous measurement result.
The measurement result for the current time is normal.

If monitoring is stopped, the measurement results that have been obtained so far are reset. Therefore, if monitoring stops while the monitored service is stopped, recovery is not reported even if the measurement result obtained after monitoring is restarted is normal.
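The recovery criteria can be sketched in the same way. In the following Python example, the function name `recovery_reported` and the status strings are hypothetical, and `None` represents the state after a reset, in which there is no previous measurement result.

```python
def recovery_reported(previous, current):
    """Return True only if the previous result was an error and the current one is normal."""
    return previous == "error" and current == "normal"


# The service recovers while monitoring keeps running: recovery is reported.
continuous = recovery_reported("error", "normal")

# Monitoring was stopped during the stoppage, so the earlier error result was
# reset. The first normal result after the restart has no previous result.
after_restart = recovery_reported(None, "normal")

print(continuous, after_restart)  # → True False
```

Because the second call has no previous error result to compare against, no recovery is reported, which corresponds to the behavior described above for a restart during a stoppage.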

(5) Supplementary information

When PFM - Agent for Service Response is used for monitoring, a stoppage of a monitored service is reported whether it was caused by an error or by planned termination, because the difference between these two causes cannot be distinguished.

Therefore, stop the monitoring before you perform planned termination on a monitored service that is being monitored for availability.

Availability monitoring starts immediately after availability information is received from PFM - Agent for Service Response. If monitoring of a target service is stopped before availability information is received for the first time after monitoring started, availability monitoring is treated as not having started during that period. In such a case, information about the start and stop of the monitored service is not output to the service availability overview in the report.

(6) Related topics
