3.1.5 Using availability monitoring for checking the availability of services (working with Performance Management)
Availability monitoring is supported when ITSLM is linked with Performance Management.
This subsection explains availability monitoring.
(1) About availability monitoring
Availability monitoring is a method for checking whether monitored services are running smoothly.
PFM - Agent for Service Response is used for monitoring the availability of monitored services. You can monitor the availability of monitored services even when no users are accessing them.
The following figure shows how availability monitoring works.
You can check the current availability of services in the Home window or the Real-time Monitor window. If a monitored service has stopped, an error is displayed in these windows. The following shows an example in which an error is displayed in a window.
(2) Availability items that can be output to reports
For the monitored services whose availability is being monitored, you can output availability items to reports. The availability items are metrics used to evaluate availability. Availability monitoring enables you to output the following availability items to reports:
- Service availability
- MTTR (mean time to recovery)
- MTBF (mean time between failures)
The following table provides details about the availability items that can be output to reports by availability monitoring.
No. | Evaluation metric (SLO) | Definition | Formula |
---|---|---|---|
1 | Service availability | Percentage of the time during the report interval that the service was running | Service availability (%) = A / (A + B) × 100, where A = total operational period during the report interval (minutes) and B = total error period during the report interval (minutes) |
2 | MTTR (mean time to recovery) | Average time required from the occurrence of an error to recovery from the error during the report interval | Mean time to recovery (minutes) = B / C, where B = total error period during the report interval (minutes) and C = number of times errors occurred during the report interval |
3 | MTBF (mean time between failures) | Average time from one error recovery to the occurrence of the next error during the report interval | Mean time between failures (minutes) = A / C, where A = total operational period during the report interval (minutes) and C = number of times errors occurred during the report interval |
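The three formulas in the table can be sketched in a few lines of Python. This is only an illustration and not part of the product: the function and field names are hypothetical, and returning `None` when a denominator is zero is an assumption, because the table does not say what is reported in that situation.

```python
from typing import NamedTuple, Optional


class AvailabilityItems(NamedTuple):
    service_availability_pct: Optional[float]
    mttr_minutes: Optional[float]
    mtbf_minutes: Optional[float]


def availability_items(operational_min: float,
                       error_min: float,
                       error_count: int) -> AvailabilityItems:
    """Apply the table's formulas: A = operational_min, B = error_min, C = error_count."""
    total = operational_min + error_min
    # Assumption: a zero denominator yields None instead of raising an error.
    availability = 100 * operational_min / total if total > 0 else None
    mttr = error_min / error_count if error_count > 0 else None
    mtbf = operational_min / error_count if error_count > 0 else None
    return AvailabilityItems(availability, mttr, mtbf)


# One-day report interval (1,440 minutes) with two errors totalling 60 minutes.
items = availability_items(operational_min=1380, error_min=60, error_count=2)
print(items)
```

With these inputs, service availability is about 95.83%, MTTR is 30 minutes, and MTBF is 690 minutes.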
The following describes how availability monitoring calculates the availability items in three cases.
- Case where the monitored service is stopped at the time reporting begins or at the time reporting ends
If the monitored service stopped due to an error before the report start time, the report start time is used as the time the monitored service stopped for the purpose of calculating the availability items.
If a stopped monitored service is still stopped at the report end time, the report end time is used as the end of the stoppage for the purpose of calculating the availability items.
The following figure shows an example in which the monitored service is already stopped at the report start time and is stopped at the report end time.
The availability items for this example are calculated as follows:
Service availability = (T3 - T1)/{(T3 - T1) + (T1 - TM) + (TN - T3)}
= (T3 - T1)/(TN - TM)
Mean time to recovery = {(T1 - TM) + (TN - T3)}/2
Mean time between failures = (T3 - T1)/2
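The clipping rule for this case can be illustrated with a short Python sketch. It is not the product's implementation: the function names and the representation of an error period as a `(stopped_at, recovered_at)` pair in minutes are assumptions made for the example. It also assumes a non-empty report interval, non-overlapping error periods, and uninterrupted monitoring, and it multiplies availability by 100 as in the table definition.

```python
from typing import List, Optional, Tuple

# (stopped_at, recovered_at) in minutes; recovered_at is None while still stopped.
ErrorPeriod = Tuple[float, Optional[float]]


def clip_error_periods(errors: List[ErrorPeriod],
                       report_start: float,
                       report_end: float) -> List[Tuple[float, float]]:
    """Clip each error period to the report interval."""
    clipped = []
    for stopped_at, recovered_at in errors:
        # A stop before the report start counts from the report start.
        start = max(stopped_at, report_start)
        # A stoppage still in effect at the report end counts up to the report end.
        end = report_end if recovered_at is None else min(recovered_at, report_end)
        if start < end:
            clipped.append((start, end))
    return clipped


def items_for_interval(errors: List[ErrorPeriod],
                       report_start: float,
                       report_end: float):
    clipped = clip_error_periods(errors, report_start, report_end)
    b = sum(end - start for start, end in clipped)   # total error period
    c = len(clipped)                                 # number of errors
    a = (report_end - report_start) - b              # total operational period
    availability = 100 * a / (a + b)
    mttr = b / c if c else None
    mtbf = a / c if c else None
    return availability, mttr, mtbf


# Hypothetical timeline: TM = 0, T1 = 10, T3 = 70, TN = 100.
# The first error began 20 minutes before the report start; the second never recovers.
result = items_for_interval([(-20, 10), (70, None)], report_start=0, report_end=100)
print(result)  # (60.0, 20.0, 30.0)
```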
- Case where the report interval contains periods of time during which monitoring is not performed
If the report interval contains periods during which monitoring is not performed, those periods are excluded from the calculation of availability items because availability is not checked during them.
If an error period contains a period during which monitoring is not performed, that error period is treated as two error periods separated by the period without monitoring.
The following figure shows an example in which the report interval contains periods of time during which monitoring is not performed.
The availability items for this example are calculated as follows:
Service availability = (T1 - TM)/{(T1 - TM) + (T3 - T2) + (TN - T4)}
Mean time to recovery = {(T3 - T2) + (TN - T4)}/2
Mean time between failures = (T1 - TM)/2
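As a rough illustration of this case, the following Python sketch models the report interval as a list of `(start, end, state)` segments. The segment model, the state names `"up"`, `"down"`, and `"unmonitored"`, and the function name are assumptions introduced for the example. Unmonitored segments contribute to neither A nor B, and an error that resumes after an unmonitored segment is counted as a new error.

```python
from typing import List, Optional, Tuple

# (start, end, state) in minutes; state is "up", "down", or "unmonitored".
Segment = Tuple[float, float, str]


def items_with_unmonitored(segments: List[Segment]):
    a = 0.0   # total operational period
    b = 0.0   # total error period
    c = 0     # number of errors
    previous_state: Optional[str] = None
    for start, end, state in sorted(segments):
        if state == "up":
            a += end - start
        elif state == "down":
            b += end - start
            # A "down" segment that follows anything other than "down"
            # (including an unmonitored gap) counts as a separate error.
            if previous_state != "down":
                c += 1
        previous_state = state
    availability = 100 * a / (a + b) if a + b > 0 else None
    mttr = b / c if c else None
    mtbf = a / c if c else None
    return availability, mttr, mtbf


# Hypothetical timeline: TM = 0, T1 = 30, T2 = 40, T3 = 60, T4 = 70, TN = 100.
timeline = [
    (0, 30, "up"),
    (30, 40, "unmonitored"),
    (40, 60, "down"),
    (60, 70, "unmonitored"),   # splits the error into two error periods
    (70, 100, "down"),
]
result = items_with_unmonitored(timeline)
print(result)  # (37.5, 25.0, 15.0)
```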
- Case where the report interval contains periods of time during which information acquisition failed
If availability information cannot be acquired for a period during monitoring, because a communication error occurred or because PFM - Agent for Service Response was not running, the availability status acquired from PFM - Agent for Service Response immediately before that period is assumed to continue throughout it.
The following figure shows an example in which the report interval contains periods of time during which information acquisition failed:
The availability items for this example are calculated as follows:
Service availability = {(T2 - TM) + (TN - T4)}/{(T2 - TM) + (TN - T4) + (T4 - T2)}
Mean time to recovery = (T4 - T2)/1
Mean time between failures = {(T2 - TM) + (TN - T4)}/1
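The carry-forward rule can be sketched as follows. This is a simplified model, not the product's logic: it assumes availability arrives as timestamped samples, that each state holds until the next sample, and that a failed acquisition is represented by `None`. All names are hypothetical.

```python
from typing import List, Optional, Tuple

# (time in minutes, state); state is "up", "down", or None if acquisition failed.
Sample = Tuple[float, Optional[str]]


def items_with_carry_forward(samples: List[Sample], report_end: float):
    a = 0.0   # total operational period
    b = 0.0   # total error period
    c = 0     # number of errors
    last_state: Optional[str] = None
    for index, (time, state) in enumerate(samples):
        if state is None:
            # Acquisition failed: assume the last acquired state continues.
            state = last_state
        next_time = samples[index + 1][0] if index + 1 < len(samples) else report_end
        duration = next_time - time
        if state == "up":
            a += duration
        elif state == "down":
            b += duration
            if last_state != "down":
                c += 1
        last_state = state
    availability = 100 * a / (a + b) if a + b > 0 else None
    mttr = b / c if c else None
    mtbf = a / c if c else None
    return availability, mttr, mtbf


# Hypothetical timeline: TM = 0, error at T2 = 20, no information from 40 to 80,
# normal again at T4 = 80, TN = 100.
samples = [(0, "up"), (20, "down"), (40, None), (60, None), (80, "up")]
result = items_with_carry_forward(samples, report_end=100)
print(result)  # (40.0, 60.0, 40.0)
```

Because the error state is carried across the gap, the example yields a single error period from 20 to 80 rather than two.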
(3) Reporting criteria
When a monitored service that is subject to availability monitoring is stopped, an error is reported. If either of the following criteria is satisfied, the monitored service is treated as being stopped:
- The first measurement result obtained after monitoring started was an error.
- The previous measurement result was normal and the current measurement result is an error.
If monitoring is stopped, the measurement results that have been obtained so far are reset. Therefore, if monitoring stops while the monitored service is stopped and the measurement result obtained after monitoring is restarted is an error, the error is reported again as a new stoppage of the monitored service.
(4) Criteria for determining that performance has returned to normal
If both the following criteria are satisfied, the monitored service is determined to have recovered from the stoppage and returned to normal:
- The previous measurement result was an error.
- The current measurement result is normal.
If monitoring is stopped, the measurement results that have been obtained so far are reset. Therefore, if monitoring stops while the monitored service is stopped, recovery is not reported even if the measurement result obtained after monitoring is restarted is normal.
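The reporting and recovery criteria in (3) and (4) amount to a small state machine over consecutive measurement results. The following Python sketch shows one way to express them, including the reset that occurs when monitoring stops. The class, method, and event names are hypothetical and are not taken from the product.

```python
from typing import Optional


class AvailabilityNotifier:
    """Tracks the previous measurement result and decides what to report."""

    def __init__(self) -> None:
        # None means no result has been obtained since monitoring (re)started.
        self.previous_normal: Optional[bool] = None

    def stop_monitoring(self) -> None:
        # Stopping monitoring resets the results obtained so far.
        self.previous_normal = None

    def on_measurement(self, is_normal: bool) -> Optional[str]:
        event = None
        if not is_normal and self.previous_normal in (None, True):
            # First result is an error, or normal -> error: report a stoppage.
            event = "stopped"
        elif is_normal and self.previous_normal is False:
            # Error -> normal: report recovery.
            event = "recovered"
        self.previous_normal = is_normal
        return event


notifier = AvailabilityNotifier()
events = [notifier.on_measurement(r) for r in [False, False, True, True, False]]
print(events)  # ['stopped', None, 'recovered', None, 'stopped']

# Monitoring stops while the service is stopped, then restarts.
notifier.stop_monitoring()
print(notifier.on_measurement(True))  # None: recovery is not reported
```

If the first result after the restart had been an error instead, `on_measurement(False)` would return `"stopped"`, which corresponds to the stoppage being reported again.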
(5) Supplementary information
- When PFM - Agent for Service Response is used for monitoring, a stoppage of a monitored service is reported whether it was caused by an error or by planned termination, because the two causes cannot be distinguished.
Therefore, stop monitoring before you perform planned termination of a monitored service that is being monitored for availability.
- Availability monitoring starts immediately after availability information is received from PFM - Agent for Service Response. If monitoring of a target service is stopped before availability information is received for the first time after monitoring started, availability monitoring is treated as not having started during that period. In such a case, information about the start and stop of the monitored service is not output to the service availability overview in the report.