Nonstop Database, HiRDB Version 9 System Operation Guide


26.1.3 Standby-less system switchover facilities

If an error occurs in HiRDB while it is performing business processing, the system switches over to another unit and an active back-end server takes over the business processing. This is called the standby-less system switchover facility. Unlike with the standby system switchover facility, you do not need to prepare a standby system unit for the standby-less system switchover facility.

The standby-less system switchover facility consists of the following two facilities:

  • Standby-less system switchover (1:1) facility
  • Standby-less system switchover (effects distributed) facility

A standby-less system switchover facility is applicable to back-end server units of a HiRDB parallel server configuration. It cannot be applied to a unit that contains any server other than back-end servers.

Organization of this subsection
(1) Standby-less system switchover (1:1) facility
(2) Standby-less system switchover (effects distributed) facility

(1) Standby-less system switchover (1:1) facility

The standby-less system switchover (1:1) facility switches from a unit in which an error has occurred to a pre-designated back-end server unit, which assumes the processing. That is, there is a one-to-one relationship between the original unit and the unit to which processing is switched in the event of an error.

A back-end server that releases a process when an error occurs is called a normal BES, and a back-end server that takes over the process is called an alternate BES. Also, the unit of the normal BESs is called the normal BES unit, and the unit of the alternate BESs is called the alternate BES unit. The following figure provides an overview of the standby-less system switchover (1:1) facility.

Figure 26-3 Overview of the standby-less system switchover (1:1) facility

[Figure]

Explanation
  • Both BES1 and BES2 are usually performing business processing.
  • The portion of the processing assumed by the alternate BES is called the alternate portion, and the action of taking over that processing by the alternate portion is called alternating.
  • After the error is resolved, the normal BES unit starts and the processing that the alternate BES had assumed switches back to the normal BES. In this way, the processing returns to normal status. This resumption of the original processing is called switching the system back.
    Hint
    The following points compare these concepts in the standby-less system switchover (1:1) facility with the primary system and other corresponding concepts in the standby system switchover facility.
    • Think of the normal BES unit as the primary system, and the alternate BES unit as the secondary system.
    • During normal operation, think of the normal BES unit as the running system and the alternate portion as the standby system. In the alternating unit after alternation, think of the alternate portion as the running system, and think of the normal BES unit as the standby system.
    Reference note
    Because a unit that is running is used as the target for the system switchover, a standby server machine is not needed. Therefore, for standby-less system switchovers, IP addresses are not transferred.
(a) Conditions

To use the standby-less system switchover (1:1) facility, all of the following conditions must be satisfied:

(b) Advantages of the standby-less system switchover facility

The standby-less system switchover (1:1) facility provides the following advantages over the standby system switchover facility:

The following table lists the resources that are needed when a standby system unit is standing by, and after a system switchover is performed.

Table 26-5 Resources needed when a standby system unit is standing by and after a system switchover is performed

Item | HiRDB system server processes | HiRDB server processes | Shared memory for unit controller | Shared memory for lock pool | Shared memory for global buffers
Standby-less system switchover (1:1) facility | Yes#1 | --#2, #3 | Yes#4 | Yes | --#5
Standby-less system switchover (effects distributed) facility | [Figure]#6, #7 | --#3, #8 | Yes#9 | Yes | --#10
Standby system switchover facility: User server hot standby | No | Yes | No | No | No
Standby system switchover facility: Rapid system switchover facility | Yes | Yes | Yes | Yes | Yes
Standby system switchover facility: Other | No | No | No | No | No

Legend:
Yes: The resource is allocated when a standby system unit is standing by, and it is also used after a system switchover.
No: The resource is allocated at the time of a system switchover, when it becomes the running system.
[Figure]: Some resources are allocated and used after a system switchover, when the standby system becomes the running system.
--: The resource is not allocated.

#1
Some processes of system server processing generate processes while they are standing by. Because other system servers share system server processes of the alternate BES unit, no resources are needed specifically for the alternate portion.

#2
The maximum number of back-end server processes is the value for pd_max_bes_process of the alternate BES. This value is the sum of the alternating processes and the non-alternating processes. Therefore, the number of users able to connect after a system switchover might be limited.

#3
If the value of pd_process_count (the number of resident processes) and the number of back-end server processes already activated when a system switchover is performed are less than the value of pd_max_bes_process, additional back-end server processes can be activated. Make sure that the operating system parameters are set so that enough processes, virtual memory, ports, and other resources will be available after a system switchover is performed. Note also that activating additional back-end server processes might cause a temporary drop in performance after a system switchover is performed.

#4
Shared memory is allocated for the alternate portion when the alternate BES unit is started.

#5
The global buffers used by alternate BESs are shared during alternation. Therefore, these buffers are not allocated after a system switchover. For details about allocation of global buffers during alternation, see 26.4.3(5)Defining global buffers (standby-less system switchover (1:1) facility).

#6
Because system server processes are shared on a unit-by-unit basis with the accepting units, no resources are required exclusively for the guest areas.

#7
A system server process for a back-end server generates a process when it becomes the running system.

#8
The maximum permissible number of server processes in a unit after a system switchover is normally defined as the combined total of the number of processes for back-end servers and the number of processes for guest BESs (pd_ha_max_server_process).

#9
When an accepting unit is started, shared memory is allocated for the guest areas.

#10
This is shared when the global buffer normally used by the back-end server is shared with the accepting unit. Therefore, it is not allocated after a system switchover. For details about sharing a global buffer, see 26.5.2(5)Defining global buffers (standby-less system switchover (effects distributed) facility).

For details about a back-end server's resource usage status when the standby-less system switchover (effects distributed) facility is used, see 26.1.3(2) Standby-less system switchover (effects distributed) facility.

(c) Rules for defining normal BES units and alternate BES units

The following explains the rules for defining normal BES units and alternate BES units:

Figure 26-4 shows examples of valid configurations of a normal BES unit and an alternate BES unit. Figure 26-5 shows examples of invalid configurations.

Figure 26-4 Examples of valid configurations of a normal BES unit and an alternate BES unit

[Figure]

An alternate BES is defined with the -c option in the pdstart operand. The following examples show the pdstart operand specification for Examples 1 and 2 in Figure 26-4.

Example 1

 
pdstart -t BES -s bes11 -u UNT1 -c bes21
pdstart -t BES -s bes21 -u UNT2
 

Explanation
-s bes11: Specifies a normal BES.
-c bes21: Specifies an alternate BES.

Example 2

 
pdstart -t BES -s bes11 -u UNT1 -c bes21
pdstart -t BES -s bes12 -u UNT1 -c bes22
pdstart -t BES -s bes21 -u UNT2
pdstart -t BES -s bes22 -u UNT2
 

Explanation
-s bes11, -s bes12: Specifies a normal BES.
-c bes21, -c bes22: Specifies an alternate BES.
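
As a rough illustration of how the -s, -u, and -c options relate, the following Python sketch reads pdstart operand lines such as those in Example 2 and derives the alternate BES and alternate BES unit that each normal BES is paired with. It assumes that every option takes exactly one value, as in the examples above. The function names are hypothetical and are not part of HiRDB.

```python
import shlex


def parse_pdstart(lines):
    """Return one {option: value} dict per pdstart operand line."""
    servers = []
    for line in lines:
        tokens = shlex.split(line)
        if not tokens or tokens[0] != "pdstart":
            continue  # skip blank or unrelated lines
        # Assumption: options and values strictly alternate (-t BES -s bes11 ...).
        servers.append(dict(zip(tokens[1::2], tokens[2::2])))
    return servers


def alternate_pairs(lines):
    """Map each normal BES to (alternate BES, alternate BES unit)."""
    servers = parse_pdstart(lines)
    unit_of = {s["-s"]: s["-u"] for s in servers}
    return {s["-s"]: (s["-c"], unit_of.get(s["-c"]))
            for s in servers if "-c" in s}


def unit_pairs(lines):
    """Map each normal BES unit to the set of alternate BES units it uses."""
    servers = parse_pdstart(lines)
    unit_of = {s["-s"]: s["-u"] for s in servers}
    pairs = {}
    for s in servers:
        if "-c" in s:
            pairs.setdefault(s["-u"], set()).add(unit_of.get(s["-c"]))
    return pairs


EXAMPLE_2 = [
    "pdstart -t BES -s bes11 -u UNT1 -c bes21",
    "pdstart -t BES -s bes12 -u UNT1 -c bes22",
    "pdstart -t BES -s bes21 -u UNT2",
    "pdstart -t BES -s bes22 -u UNT2",
]

print(alternate_pairs(EXAMPLE_2))
print(unit_pairs(EXAMPLE_2))
```

For Example 2, unit_pairs reports a single alternate BES unit (UNT2) for UNT1, which is consistent with the one-to-one unit relationship described for this facility.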

Figure 26-5 Examples of invalid configurations of a normal BES unit and an alternate BES unit

[Figure]

(2) Standby-less system switchover (effects distributed) facility

(a) Overview

When an error occurs, the standby-less system switchover (effects distributed) facility distributes processing requests intended for the back-end servers in the unit where the error occurred to multiple running system units, where these processing requests can be executed. The standby-less system switchover (effects distributed) facility does not require standby server machines or standby system units, and thus uses system resources more efficiently. After an error occurs, the processing workload increases at each unit that assumes server processing for the failed node. As a result, transaction-processing performance might be impacted negatively. However, because the processing requests intended for the servers in the failed unit are shared and executed by multiple units, the additional load per unit is kept low and degradation of system performance is minimized.

The standby-less system switchover (effects distributed) facility switches over back-end servers by distributing them, and the switchover destinations can be distributed among multiple units. Moreover, if an error occurs in a unit to which the original unit was switched, processing can be switched again to another running system unit, where it can continue. This is called a multi-step system switchover. Multi-step system switchovers cannot be performed in a system that uses the standby-less system switchover (1:1) facility. Therefore, if an error occurs in a unit to which processing was switched for the standby-less system switchover (1:1) facility, processing for the failed unit cannot be assumed and continued elsewhere.

The standby-less system switchover (effects distributed) facility is appropriate for a system whose resources must always be used efficiently, and in which performance degradation in the event of an error must be minimized.

In the standby-less system switchover (effects distributed) facility, a back-end server defined in the original unit is called a host BES, and a back-end server that is accepted by another unit is called a guest BES. The unit where the host BESs are defined is called the regular unit, and the unit where a guest BES is located is called the accepting unit. All accepting units must be defined as an HA group. The back-end server resources that correspond to a guest BES constitute a guest area.

The following figure provides an overview of the standby-less system switchover (effects distributed) facility (distributed workload transfer and multi-step system switchover).

Figure 26-6 Overview of the standby-less system switchover (effects distributed) facility (distributed workload transfer and multi-step system switchover)

[Figure]

(b) Conditions

To use the standby-less system switchover (effects distributed) facility, all of the following conditions must be satisfied:

(c) Resource usage status

The following table lists and describes the usage status of back-end server resources when the standby-less system switchover (effects distributed) facility is applied.

Table 26-6 Usage status of back-end server resources when the standby-less system switchover (effects distributed) facility is applied

Back-end server type | Back-end server status | Resource usage status
Host BES | Accepting status | An area of the size required by the back-end server's definition is created.
Host BES | Running | An area of the size required by the back-end server's definition is used.
Guest BES | Accepting status | For each resource, a guest area of the largest resource size is created in the guest server.
Guest BES | Running | Within the prepared guest area, an area that matches the size required by the back-end server's definition is used.

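
The guest BES rows of Table 26-6 can be pictured with a small Python sketch. It assumes that "the largest resource size" means the maximum, per resource, across the definitions of the back-end servers that the unit might accept. The resource names and sizes are hypothetical values chosen only for illustration.

```python
def guest_area_sizes(definitions):
    """Per resource, take the largest size found in any BES definition."""
    resources = sorted({name for sizes in definitions.values() for name in sizes})
    return {r: max(sizes.get(r, 0) for sizes in definitions.values())
            for r in resources}


def used_in_guest_area(area, sizes):
    """Portion of the prepared guest area that one running guest BES uses."""
    return {r: min(sizes.get(r, 0), area[r]) for r in area}


# Hypothetical resource sizes for two back-end server definitions.
DEFINITIONS = {
    "BES1": {"lock_pool": 300, "server_shared_memory": 2000},
    "BES2": {"lock_pool": 500, "server_shared_memory": 1500},
}

area = guest_area_sizes(DEFINITIONS)
print(area)
print(used_in_guest_area(area, DEFINITIONS["BES1"]))
```

Under this reading, a guest area prepared in accepting status is always large enough for whichever back-end server is eventually accepted, and a running guest BES uses only the part that its own definition requires.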
(d) Operation of the standby-less system switchover (effects distributed) facility

When the standby-less system switchover (effects distributed) facility is used and an error occurs in a regular unit, that unit's host BESs are moved automatically to various accepting units, where they perform their processing as guest BESs. If a BES at the unit where the error occurs is itself a guest BES, it also is moved automatically to an accepting unit, where it continues to perform processing as a guest BES. As is the case with the standby system switchover facility, no intervention is required from the HiRDB administrator.

The following table lists the various types of errors that can occur, and whether a system switchover occurs when the standby-less system switchover (effects distributed) facility is used.

Table 26-7 System switchover depending on the cause of the error when the standby-less system switchover (effects distributed) facility is used

Cause of the error | Unit or server status: Starting or Stopping | Unit or server status: Running (Starting or Stopping) | Unit or server status: Running (Running)
Slow-down detected | Not applicable | Unit terminates abnormally. System switchover occurs. | Unit terminates abnormally. System switchover occurs.
System log full | Not applicable | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur.
Database path error | Not applicable | Unit terminates abnormally. System switchover occurs (only the first time). | Unit terminates abnormally. System switchover occurs (only the first time).
Back-end server terminated forcibly | Back-end server terminates abnormally. System switchover does not occur. | Back-end server terminates abnormally. System switchover does not occur. | Back-end server terminates abnormally. System switchover does not occur.
System terminated forcibly | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur.
System failure | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover occurs.

In the event of a system switchover, the host BESs and any guest BESs that are running in the unit are switched over to other units. The back-end servers might be switched to different destinations.
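
For quick reference in operational scripts, Table 26-7 can be expressed as a lookup. The Python sketch below is only a transcription of the table under assumed key names. The three status keys follow the order of the table's status columns, None stands for "Not applicable", and the names SWITCHOVER and switchover_occurs are hypothetical.

```python
# Status columns, in the order in which they appear in Table 26-7.
STATUSES = (
    "starting_or_stopping",
    "running/starting_or_stopping",
    "running/running",
)

# True: system switchover occurs. False: it does not. None: not applicable.
SWITCHOVER = {
    "slow-down detected": (None, True, True),
    "system log full": (None, False, False),
    # For a database path error, a switchover occurs only the first time.
    "database path error": (None, True, True),
    "back-end server terminated forcibly": (False, False, False),
    "system terminated forcibly": (False, False, False),
    "system failure": (False, False, True),
}


def switchover_occurs(cause, status):
    """Look up whether Table 26-7 indicates a system switchover."""
    return SWITCHOVER[cause][STATUSES.index(status)]


print(switchover_occurs("system failure", "running/running"))
print(switchover_occurs("system log full", "running/running"))
```

The lookup makes it easy to see that a full system log never triggers a switchover, and that a system failure does so only in the last status column.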

The standby-less system switchover (effects distributed) facility also switches systems automatically when multiple errors occur. If an error occurs in an accepting unit after an error had occurred in a regular unit, the back-end servers of the primary system and the guest BESs running in the failed accepting unit move to remaining running system units and perform their processing as guest BESs. In this case, no intervention is required from the HiRDB administrator. The move destination of each back-end server is determined by the HA Monitor definition (the cluster software definition when Hitachi HA Toolkit Extension is used).

When a unit runs out of its free guest area, the standby-less system switchover (effects distributed) facility cancels the accepting status of all guest BESs that are not running. The acceptability of a guest area is not affected by the operation of the host BES. The table below shows automatic cancellation and resetting of acceptability depending on the free space in the guest area.

When acceptability is reset automatically, all servers that are acting as running systems in other units within the HA group enter accepting status. During this process, even those back-end servers whose acceptability was stopped intentionally by entry of a command (monsbystp or pdstop -q -s back-end-server-name) also become accepting. If the number of BESs that can be accepted within an HA group is exceeded, resulting in reduced-mode operation, any server that is stopped is not returned to accepting status.

Table 26-8 Automatic cancellation and resetting of acceptability depending on the free space in the guest area

Unused guest area in the unit | Guest BES acceptability: Guest BESs active in other units | Guest BES acceptability: Guest BESs inactive in other units
Disappeared | Canceled automatically | No change (cancellation underway)
Generated | Reset automatically | No change

Note
This table excludes the situation in which the monsbystp command or the pdstop command (pdstop -u accepting-unit-ID -s server-ID) was used in order to intentionally cancel acceptability.
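
The rule in Table 26-8 can be restated as a small Python function. This is a sketch of the table only, with hypothetical names, and it does not cover acceptability that was canceled intentionally by a command.

```python
def acceptability_change(unused_guest_area, active_in_other_unit):
    """Return the Table 26-8 outcome for one guest BES.

    unused_guest_area: "disappeared" or "generated" (state of the unit's
        unused guest area).
    active_in_other_unit: True if the guest BES is active in another unit.
    """
    table = {
        ("disappeared", True): "canceled automatically",
        ("disappeared", False): "no change (cancellation underway)",
        ("generated", True): "reset automatically",
        ("generated", False): "no change",
    }
    return table[(unused_guest_area, active_in_other_unit)]


print(acceptability_change("disappeared", True))
print(acceptability_change("generated", False))
```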
(e) Examples of system switchovers using the standby-less system switchover (effects distributed) facility

Example of a system switchover during normal operation
The figure below shows an example of a system switchover during normal operation.
When an error occurs at hostA, BES1 moves to unt2 and executes processing as a guest BES. BES2 moves to unt3 and executes processing as a guest BES.

Figure 26-7 Example of a system switchover during normal operation

[Figure]

Example of a system switchover at a host that has accepted guest BESs
The figure below shows an example of a system switchover at a host that has accepted guest BESs. In this example, after a server machine has been restored but before it is reactivated, an error occurs in another server machine.
If an error occurs in hostA while BES5 is executing processing at unt1 as a guest BES, the individual back-end servers behave as follows:
  • BES1 moves to unt2 and executes processing as a guest BES.
  • BES2 moves to unt3 and executes processing as a guest BES.
  • BES5 returns to unt3 and executes processing as a host BES.

    Figure 26-8 Example of a system switchover at a host that has accepted guest BESs

    [Figure]

Unbalanced unit loading
Whether workloads will become unbalanced among the units after a system switchover depends on the priorities assigned to the standby systems in the cluster software definition. If the priorities of the standby systems are set appropriately, workloads will be balanced even after multiple units have terminated abnormally in any combination.
However, if a system switchover occurs as a result of an error in another unit after a unit has recovered from an error, workloads might become unbalanced. For this reason, you must check the allocation of servers after a system switchover occurs. Use the pdls -d ha or pdls -d svr command to check the allocation of the servers.
If the server allocation is unbalanced, we recommend that you use a planned system switchover to modify the allocation of the servers.
Additionally, as soon as possible after a unit has recovered from an error, return each guest BES to its regular unit so that it runs as a host BES again. This helps prevent server allocation from becoming unbalanced. For example, in Figure 26-8 above, hosts have accepted guest BESs. If BES5 and BES6 are returned to unt3 before the error occurs in hostA, the error in hostA will not result in unbalanced server allocation.

System switchover when a series of errors occurs (when all back-end servers are in accepting status)
The following figure shows an example of a system switchover when a series of errors occurs.

Figure 26-9 Example of a system switchover when a series of errors occurs

[Figure]

Explanation
An error occurs at hostA, so BES1 is moved to unt2 and BES2 is moved to unt3, where they perform processing as guest BESs. Because the maximum number of BESs that can be accepted by unt2 and unt3 is four each, they can now each accept three more servers. In other words, all back-end servers running at all other units can be accepted.
If an error occurs subsequently at hostB, BES1, BES3, and BES4 running in unt2 all move to unt3 and perform processing as guest BESs. No back-end server stops.

System switchover when a series of errors occurs (when the number of BESs that can be accepted is insufficient)
The following figure shows an example of a system switchover when a series of errors occurs but the number of BESs that can be accepted is insufficient.

Figure 26-10 Example of a system switchover when a series of errors occurs but the number of BESs that can be accepted is insufficient

[Figure]

Explanation
An error occurs in hostA, so BES1 is moved to unt2 and BES2 is moved to unt3, where they perform processing as guest BESs. Because the maximum number of BESs that can be accepted by unt2 and unt3 is two each, they can now each accept only one more server. In other words, not all the back-end servers running at other units can be accepted.
If an error occurs subsequently at hostB, only BES3 running in unt2 moves to unt3 and executes processing as a guest BES. BES1 and BES4 stop.
If it is critical that processing continue at back-end servers even when a series of errors occurs, you must set an appropriately large value for the maximum number of BESs that can be accepted.

System switchover when the number of BESs that can be accepted is insufficient
Whether a back-end server in a unit is switched to another unit or is stopped when an error occurs is determined by the system switchover priority that is assigned to each back-end server. This order is determined by an action of the cluster software.
In this example, only BES3 was moved. However, depending on the action of the cluster software, BES1 or BES4 might have been moved.

Example of the action to take when multiple errors occur while the number of BESs that can be accepted is insufficient
The following figure shows an example of the action to take when an error occurs while the number of BESs that can be accepted is insufficient.

Figure 26-11 Example of the action to take when multiple errors occur while the number of BESs that can be accepted is insufficient

[Figure]

Explanation
hostB has recovered from an error and unt2 starts. As a result, BES4 starts as the running system in unt2 and BES1 can be accepted.
Next, when the monact command is entered for BES1 in unt2, BES1 begins processing as a running system BES (when Hitachi HA Toolkit Extension is used, an online command of the cluster software is used).

Example of how to avoid a shortage in the number of BESs that can be accepted
If you cannot set a large value for the number of BESs that can be accepted, you must correct errors as soon as possible to prevent server stoppage as a result of a shortage in the number of BESs that can be accepted.
The following figure shows an example of how to avoid a shortage in the number of BESs that can be accepted.

Figure 26-12 Example of how to avoid a shortage in the number of BESs that can be accepted

[Figure]

Explanation
hostA has recovered from an error and unt1 starts. As a result, BES1 and BES2 start as standby systems in unt1, so that BES3, BES4, BES5, and BES6 become accepting. Therefore, even if an error occurs in hostB, BES1 and BES4 can continue processing at hostA while BES3 can continue processing at hostC.

System switchover when a series of errors occurs (when no BES can be accepted)
The following figure shows an example in which a system switchover cannot be executed when a series of errors occurs.

Figure 26-13 Example in which a system switchover cannot be executed when a series of errors occurs

[Figure]

Explanation
An error occurs in hostA, so BES1 is moved to unt2 and BES2 is moved to unt3, where they perform processing as guest BESs. Because the maximum number of BESs that can be accepted by unt2 and unt3 is one each, neither of them can now accept any more servers.
If an error occurs subsequently at hostB, BES1, BES3, and BES4 stop.
If it is critical that processing continue at back-end servers even when a series of errors occurs, you must set an appropriately large value for the maximum number of BESs that can be accepted. The action to take when no more servers can be accepted and the method of avoiding server stoppage are the same as indicated above for the case in which the number of BESs that can be accepted is insufficient.
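
The three series-of-errors examples (Figures 26-9, 26-10, and 26-13) differ only in the maximum number of BESs that each unit can accept. The following Python sketch replays them as a simplified model, not as a description of HiRDB internals. It assumes the layout used in the examples (BES1 and BES2 in unt1, BES3 and BES4 in unt2, BES5 and BES6 in unt3). The destination preferences (PREFS) and the switchover order (ORDER) are hypothetical stand-ins for the priorities that the cluster software definition would supply.

```python
# Regular unit of each host BES, following the layout in the examples.
HOME = {"BES1": "unt1", "BES2": "unt1", "BES3": "unt2",
        "BES4": "unt2", "BES5": "unt3", "BES6": "unt3"}

# Hypothetical destination priorities per BES (cluster software definition).
PREFS = {"BES1": ["unt2", "unt3"], "BES2": ["unt3", "unt2"],
         "BES3": ["unt3", "unt1"], "BES4": ["unt1", "unt3"],
         "BES5": ["unt1", "unt2"], "BES6": ["unt2", "unt1"]}

# Hypothetical order in which back-end servers are switched over.
ORDER = ["BES3", "BES1", "BES4", "BES2", "BES5", "BES6"]


def simulate(max_guests, failures):
    """Return {BES: unit it runs in, or None if stopped} after the failures."""
    location = dict(HOME)
    up = set(HOME.values())
    for failed in failures:
        up.discard(failed)
        moving = [bes for bes in ORDER if location[bes] == failed]
        for bes in moving:
            location[bes] = None  # stopped unless a destination accepts it
            for dest in PREFS[bes]:
                # Guests are BESs running in dest whose regular unit differs.
                guests = sum(1 for other, unit in location.items()
                             if unit == dest and HOME[other] != dest)
                if dest in up and guests < max_guests:
                    location[bes] = dest
                    break
    return location


for limit in (4, 2, 1):
    print(limit, simulate(limit, ["unt1", "unt2"]))
```

With these assumed priorities, a limit of four leaves every back-end server running in unt3 after both errors, a limit of two moves only BES3 while BES1 and BES4 stop, and a limit of one stops BES1, BES3, and BES4. A different ORDER would change which server is moved in the limit-of-two case, which mirrors the dependence on the cluster software described above.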