25.1.2 Standby-less system switchover facilities

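There are two types of system switchover. One is standby system switchover; the standby system switchover facility was discussed above. The other is standby-less system switchover, which consists of two facilities: the standby-less system switchover (1:1) facility and the standby-less system switchover (effects distributed) facility.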

A standby-less system switchover facility can be applied only to a HiRDB/Parallel Server's back-end servers; it cannot be applied to a unit that contains servers other than back-end servers.

In contrast to the standby system switchover facility, a standby-less system switchover facility does not require that standby system units be prepared. When an error occurs, instead of switching over to a standby system unit, the system is switched over to another unit on the running system so that the work processing is taken over by an active back-end server. This is the function of the standby-less system switchover facilities.

Organization of this subsection
(1) Standby-less system switchover (1:1) facility
(2) Standby-less system switchover (effects distributed) facility

(1) Standby-less system switchover (1:1) facility

The standby-less system switchover (1:1) facility switches from a unit in which an error has occurred to a pre-designated back-end server unit that assumes the processing; i.e., there is a one-to-one relationship between the original unit and the unit to which processing is switched in the event of an error.

A back-end server that releases a process when an error occurs is called a normal BES, and a back-end server that takes over the process is called an alternate BES. Also, the unit of the normal BESs is called the normal BES unit, and the unit of the alternate BESs is called the alternate BES unit. Figure 25-2 provides an overview of the standby-less system switchover (1:1) facility.

Figure 25-2 Overview of the standby-less system switchover (1:1) facility

[Figure]

Explanation
  • BES1 and BES2 are both usually performing work processing.
  • If an error occurs on the normal BES unit (UNT1), the system is switched over and its processing is taken over by the alternate BES. The portion of the processing assumed by the alternate BES is called the alternate portion, and the act of taking over that processing by the alternate portion is called alternating.
  • After the error is resolved, the normal BES unit is started and the processing that the alternate BES assumed is switched back to the normal BES. In this way, this processing returns to normal status; this resumption of the original processing is called switching the system back.
    Hint
    The concepts of the primary system in the standby system switchover facility and in the standby-less system switchover (1:1) facility are compared below:
    • Think of the primary system as the normal BES unit, and think of the secondary system as the alternate BES unit.
    • During normal operation, think of the normal BES unit as the running system and think of the alternate portion as the standby system. In the alternating unit after alternation, think of the alternate portion as the running system, and think of the normal BES unit as the standby system.
    Reference note
    Because a unit that is running is used as the target for system switchover, a standby server machine is not needed. Therefore, in the case of standby-less system switchover, IP addresses are not transferred.
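The alternating and switching-back behavior described above can be sketched as a small state model. This is only an illustration: the class and method names are hypothetical and are not part of HiRDB; the server names BES1 and BES2 follow Figure 25-2.

```python
from dataclasses import dataclass


@dataclass
class OneToOnePair:
    """One normal BES and its pre-designated alternate BES (hypothetical model)."""
    normal_bes: str
    alternate_bes: str
    alternating: bool = False  # True while the alternate portion has taken over

    def active_server(self) -> str:
        """Return the BES that currently executes the normal BES's processing."""
        return self.alternate_bes if self.alternating else self.normal_bes

    def on_normal_unit_error(self) -> None:
        """Error on the normal BES unit: the alternate BES takes over (alternating)."""
        self.alternating = True

    def switch_back(self) -> None:
        """Normal BES unit restarted after recovery: processing returns to the normal BES."""
        self.alternating = False


pair = OneToOnePair(normal_bes="BES1", alternate_bes="BES2")
pair.on_normal_unit_error()
print(pair.active_server())  # BES2
pair.switch_back()
print(pair.active_server())  # BES1
```

The model tracks only who executes the processing; it does not model resources, IP addresses, or the cluster software.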
(a) Conditions

All the following conditions must be satisfied to use the standby-less system switchover (1:1) facility:

(b) Advantages of the standby-less system switchover facility

The standby-less system switchover (1:1) facility provides the following advantages over the standby system switchover facility:

Table 25-1 lists the resources that are needed when a standby system unit is standing by and after system switchover is performed.

Table 25-1 Resources needed when a standby unit is standing by and after system switchover is performed

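Item | HiRDB system server processes | HiRDB server processes | Shared memory for unit controller | Shared memory for lock pool | Shared memory for global buffer
Standby-less system switchover (1:1) facility | Yes (note 1) | [Figure] (notes 2, 3) | Yes (note 4) | Yes | [Figure] (note 5)
Standby-less system switchover (effects distributed) facility | [Figure] (notes 6, 7) | [Figure] (notes 3, 8) | Yes (note 9) | Yes | [Figure] (note 10)
Standby system switchover facility: User server host standby | No | Yes | No | No | No
Standby system switchover facility: Rapid system switchover facility | Yes | Yes | Yes | Yes | Yes
Standby system switchover facility: All others | No | No | No | No | No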
Legend:
Yes: Resource is allocated while on standby and is also used after system switchover.
No: Resource is allocated at the time of system switchover, when it becomes the running system.
[Figure]: Some resources are allocated and used after system switchover, when the standby system becomes the running system.
[Figure]: Resource is not secured.

1 Some processes of system server processing generate processes while they are standing by. Because other system servers share system server processes of the alternate BES unit, no resources are needed specifically for the alternate portion.

2 The maximum number of back-end server processes is the value for pd_max_bes_process of the alternate BES. This value is the sum of the alternating processes and the non-alternating processes. Therefore, only a limited number of users may be able to connect after a system switchover.

3 If the value of pd_process_count (the number of resident processes) and the number of back-end server processes already activated when system switchover was performed is less than the value of pd_max_bes_process, additional back-end server processes can be activated. Be sure to set the operating system parameters so that enough processes, virtual memory, ports, and so on are available after system switchover is performed. Note also that activating additional back-end server processes may cause a temporary drop in performance after system switchover has been performed.

4 Shared memory of the alternate portion is secured when the alternate BES unit starts.

5 The global buffers used by alternate BESs are shared when alternating processes. Therefore, these buffers are not secured after system switchover occurs. For details about allocation of global buffers during alternating, see 25.5.7 Definition of global buffers (standby-less system switchover (1:1) facility only).

6 Because system server processes are shared on a unit-by-unit basis with the accepting units, no resources are required exclusively for the guest areas.

7 A system server process for a back-end server generates a process when it becomes the running system.

8 The maximum permissible number of HiRDB server (back-end server) processes in a unit after system switchover can normally be defined as the combined total of the number of processes for each back-end server and the number of processes for the guests (pd_ha_max_server_process).

9 When an accepting unit is started, shared memory is allocated for the guest areas.

10 The global buffer normally used by the back-end server is shared with the accepting unit. Therefore, it is not allocated after system switchover. For details about sharing a global buffer, see 25.5.8 Definition of global buffers (standby-less system switchover (effects distributed) facility only).

For details about a back-end server's resource usage status when the standby-less system switchover (effects distributed) facility is used, see 25.1.2(2) Standby-less system switchover (effects distributed) facility.
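The process limits described in notes 2 and 8 are simple sums, sketched below. The function names and all numeric values are hypothetical; only pd_max_bes_process and pd_ha_max_server_process come from the text above.

```python
def unit_process_limit(bes_process_limits, pd_ha_max_server_process):
    """Note 8: per-BES process counts in the unit plus the allowance for guests."""
    return sum(bes_process_limits.values()) + pd_ha_max_server_process


def processes_left_for_alternating(pd_max_bes_process, non_alternating_in_use):
    """Note 2: pd_max_bes_process covers both alternating and non-alternating processes."""
    return max(0, pd_max_bes_process - non_alternating_in_use)


# Hypothetical values chosen only for illustration.
print(unit_process_limit({"bes1": 50, "bes2": 30}, 40))  # 120
print(processes_left_for_alternating(100, 70))  # 30
```

The second function shows why only a limited number of users may be able to connect after a system switchover: whatever the non-alternating processing already uses is no longer available to the alternating processing.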

(c) Rules for defining normal BES units and alternate BES units

The rules for defining normal BES units and alternate BES units are explained below.

Figure 25-3 shows examples of valid configurations of a normal BES unit and alternate BES unit. Figure 25-4 shows examples of invalid configurations.

Figure 25-3 Examples of valid configurations of a normal BES unit and an alternate BES unit

[Figure]

An alternate BES is defined with the -c option in the pdstart operand. Example specifications of the pdstart operand are shown in Examples 1 and 2 below.

Example 1

pdstart -t BES -s bes11 -u UNT1 -c bes21
pdstart -t BES -s bes21 -u UNT2

Explanation
-s bes11: Specifies a normal BES.
-c bes21: Specifies an alternate BES.

Example 2

pdstart -t BES -s bes11 -u UNT1 -c bes21
pdstart -t BES -s bes12 -u UNT1 -c bes22
pdstart -t BES -s bes21 -u UNT2
pdstart -t BES -s bes22 -u UNT2

Explanation
-s bes11, -s bes12: Specifies a normal BES.
-c bes21, -c bes22: Specifies an alternate BES.
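As a rough aid for reviewing such definitions, the following sketch parses pdstart operand lines and checks only the one-to-one unit relationship described at the start of this subsection. It is not a HiRDB tool: the function names are hypothetical, only the -t, -s, -u, and -c options shown in the examples are recognized, and the full definition rules illustrated in Figures 25-3 and 25-4 are not covered.

```python
import shlex


def parse_pdstart(lines):
    """Parse pdstart operand lines into {server: {"unit": ..., "alternate": ...}}."""
    servers = {}
    for line in lines:
        tokens = shlex.split(line)
        if not tokens or tokens[0] != "pdstart":
            continue
        # Options appear as flag/value pairs after the operand name.
        opts = dict(zip(tokens[1::2], tokens[2::2]))
        if opts.get("-t") != "BES":
            continue
        servers[opts["-s"]] = {"unit": opts["-u"], "alternate": opts.get("-c")}
    return servers


def unit_pairs(servers):
    """Return {normal BES unit: set of alternate BES units}."""
    pairs = {}
    for name, info in servers.items():
        alt = info["alternate"]
        if alt is None:
            continue
        if alt not in servers:
            raise ValueError(f"alternate BES {alt} for {name} is not defined")
        pairs.setdefault(info["unit"], set()).add(servers[alt]["unit"])
    return pairs


def is_one_to_one(servers):
    """True if each normal BES unit maps to exactly one distinct alternate BES unit."""
    pairs = unit_pairs(servers)
    if any(len(units) != 1 for units in pairs.values()):
        return False
    targets = [next(iter(units)) for units in pairs.values()]
    return len(targets) == len(set(targets))


example2 = [
    "pdstart -t BES -s bes11 -u UNT1 -c bes21",
    "pdstart -t BES -s bes12 -u UNT1 -c bes22",
    "pdstart -t BES -s bes21 -u UNT2",
    "pdstart -t BES -s bes22 -u UNT2",
]
print(is_one_to_one(parse_pdstart(example2)))  # True
```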

Figure 25-4 Examples of invalid configurations of a normal BES unit and an alternate BES unit

[Figure]

(2) Standby-less system switchover (effects distributed) facility

(a) Overview

When an error occurs, the standby-less system switchover (effects distributed) facility distributes processing requests intended for the back-end servers in the unit where the error occurred to multiple running units, where these processing requests can be executed. The standby-less system switchover (effects distributed) facility does not require standby server machines or standby units, and thus uses system resources more efficiently. After an error occurs, the processing workload increases at each unit that assumes server processing for the failed node; as a result, transaction-processing performance may be impacted negatively. However, because the processing requests intended for the servers in the failed unit are shared and executed by multiple units, the additional load per unit is kept low and degradation of system performance is minimized.

The standby-less system switchover (effects distributed) facility switches over back-end servers by distributing them, and the switchover destinations can be distributed among multiple units. Moreover, if an error occurs in a unit to which the original unit was switched, switching can be performed again to other running units, where processing can be continued; this is called multi-step system switchover. Multi-step system switchover cannot be performed in a system that uses the standby-less system switchover (1:1) facility; if an error occurs at a unit to which processing was switched in the case of the standby-less system switchover (1:1) facility, processing for the failed unit cannot be assumed and continued elsewhere.

The standby-less system switchover (effects distributed) facility is appropriate for a system whose resources must always be used efficiently and in which performance degradation because of an error must be minimized.

In the standby-less system switchover (effects distributed) facility, a back-end server defined in the original unit is called a host BES, and a back-end server that is accepted by another unit is called a guest BES. The unit where the host BESs are defined is called the regular unit, and the unit where a guest BES is located is called the accepting unit. All accepting units must be defined as an HA group. The back-end server resources that correspond to a guest BES constitute a guest area.

Figure 25-5 provides an overview of the standby-less system switchover (effects distributed) facility (distributed workload transfer and multi-step system switchover).

Figure 25-5 Overview of the standby-less system switchover (effects distributed) facility (distributed workload transfer and multi-step system switchover)

[Figure]

(b) Conditions

All the following conditions must be satisfied to use the standby-less system switchover (effects distributed) facility:

(c) Resource usage status

Table 25-2 shows the usage status of back-end server resources when the standby-less system switchover (effects distributed) facility is being applied.

Table 25-2 Usage status of back-end server resources when the standby-less system switchover (effects distributed) facility is applied

Back-end server type | Back-end server status | Resource usage status
Host BES | Accepting status | An area of the size required by the back-end server's definition is created.
Host BES | Running | An area of the size required by the back-end server's definition is used.
Guest BES | Accepting status | For each resource, a guest area of the largest resource size is created in the guest server.
Guest BES | Running | Within the prepared guest area, an area that matches the size required by the back-end server's definition is used.
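The guest-area sizing in Table 25-2 can be illustrated with a short sketch. It assumes that "the largest resource size" means the maximum, per resource, over the definitions of the back-end servers that the unit may accept; that interpretation, the function name, the resource names, and the sizes are all assumptions made for illustration.

```python
def guest_area_sizes(bes_definitions):
    """For each resource, size a guest area as the largest size any BES definition requires."""
    sizes = {}
    for resources in bes_definitions.values():
        for resource, size in resources.items():
            sizes[resource] = max(sizes.get(resource, 0), size)
    return sizes


# Hypothetical per-BES resource requirements (arbitrary units).
definitions = {
    "bes1": {"lock_pool": 100, "global_buffer": 500},
    "bes2": {"lock_pool": 200, "global_buffer": 300},
}
area = guest_area_sizes(definitions)
print(area)  # {'lock_pool': 200, 'global_buffer': 500}

# A running guest BES uses only the part of the guest area that its own definition requires.
unused = {r: area[r] - definitions["bes1"][r] for r in area}
print(unused)  # {'lock_pool': 100, 'global_buffer': 0}
```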
(d) Operation of the standby-less system switchover (effects distributed) facility

When the standby-less system switchover (effects distributed) facility is used and an error occurs in a regular unit, that unit's host BESs are moved automatically to various accepting units, where they execute their processing as guest BESs. If a BES at the unit where the error occurs is itself a guest BES, it is also moved automatically to an accepting unit, where it continues to execute processing as a guest BES. As is the case with the standby system switchover facility, no intervention is required from the HiRDB administrator.

Table 25-3 lists the various types of errors that can occur and whether or not system switchover occurs when standby-less system switchover (effects distributed) is used.

Table 25-3 System switchover depending on error cause when standby-less system switchover (effects distributed) facility is used

Unit's status | Starting or Stopping | Running | Running
Server's status | Starting or Stopping | Starting or Stopping | Running
Slow-down detected | Not applicable | Unit terminates abnormally. System switchover occurs. | Unit terminates abnormally. System switchover occurs.
System log full | Not applicable | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur.
Database path error | Not applicable | Unit terminates abnormally. System switchover occurs (only the first time). | Unit terminates abnormally. System switchover occurs (only the first time).
Back-end server terminated forcibly | Back-end server terminates abnormally. System switchover does not occur. | Back-end server terminates abnormally. System switchover does not occur. | Back-end server terminates abnormally. System switchover does not occur.
System terminated forcibly | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur.
System failure | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover does not occur. | Unit terminates abnormally. System switchover occurs.
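For quick reference, Table 25-3 can be encoded as a lookup. The sketch below assumes that the table's three columns are, in order: unit starting or stopping; unit running with the server starting or stopping; unit running with the server running. The function name and the lowercase cause strings are hypothetical.

```python
# None means "not applicable"; True/False tells whether system switchover occurs.
# Columns: (unit starting/stopping, unit running + server starting/stopping,
#           unit running + server running)
SWITCHOVER = {
    "slow-down detected": (None, True, True),
    "system log full": (None, False, False),
    "database path error": (None, True, True),  # switchover only the first time
    "back-end server terminated forcibly": (False, False, False),
    "system terminated forcibly": (False, False, False),
    "system failure": (False, False, True),
}


def switchover_occurs(cause, unit_running, server_running):
    """Look up whether system switchover occurs for an error cause and status."""
    if not unit_running:
        column = 0  # the server status is not distinguished in this column
    elif not server_running:
        column = 1
    else:
        column = 2
    return SWITCHOVER[cause][column]


print(switchover_occurs("system failure", unit_running=True, server_running=True))   # True
print(switchover_occurs("system log full", unit_running=True, server_running=True))  # False
```

The lookup does not track the "only the first time" condition for database path errors; a caller would have to remember whether a switchover for that cause has already taken place.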

In the event of system switchover, the host BESs and any guest BESs that are running in the unit are switched over to other units. The back-end servers may be switched to different destinations.

The standby-less system switchover (effects distributed) facility switches systems automatically when various types of errors occur. If an error occurs in an accepting unit after an error had occurred in a regular unit, the back-end servers of the primary system and the guest BESs running in the failed accepting unit move to remaining running units and execute their processing as guest BESs; no intervention is required from the HiRDB administrator. The move destination of each back-end server is determined by the HA monitor definition (cluster software definition when Hitachi HA Toolkit Extension is used).

When a unit runs out of its free guest area, the standby-less system switchover (effects distributed) facility cancels the accepting status of all guest BESs that are not running. The acceptability of a guest area is not affected by the operation of the host BES. Table 25-4 shows automatic cancellation and resetting of acceptability depending on the free space in the guest area.

When acceptability is reset automatically, all servers that are acting as running systems in other units within the HA group enter accepting status. During this process, even those back-end servers whose acceptability was stopped intentionally by entry of a command (monsbystp or pdstop -q -s back-end-server-name) also become accepting. If the number of BESs that can be accepted within an HA group is exceeded, resulting in reduced-mode operation, any server that is stopped is not returned to accepting status.

Table 25-4 Automatic cancellation and resetting of acceptability depending on the free space in the guest area

Unused guest area in the unit | Guest BES acceptability: guest BESs active in other units | Guest BES acceptability: guest BESs inactive in other units
Disappeared | Cancelled automatically | No change (being cancelled)
Generated | Reset automatically | No change
Note
This table excludes the situation in which the monsbystp command or the pdstop command (pdstop -u accepting-unit-ID -s server-ID) was used to cancel acceptability intentionally.
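The decision in Table 25-4 can be written as a small function. This is a sketch only; the function name and the string values are hypothetical, and intentional cancellation by command is excluded, as in the table.

```python
def acceptability_action(unused_guest_area, active_in_other_unit):
    """Return the change in a guest BES's acceptability per Table 25-4.

    unused_guest_area: "disappeared" or "generated" (state of the unit's free guest area)
    active_in_other_unit: True if the guest BES is active in another unit
    """
    if unused_guest_area not in ("disappeared", "generated"):
        raise ValueError(f"unknown event: {unused_guest_area}")
    if not active_in_other_unit:
        return "no change"
    if unused_guest_area == "disappeared":
        return "cancelled automatically"
    return "reset automatically"


print(acceptability_action("disappeared", True))  # cancelled automatically
print(acceptability_action("generated", True))    # reset automatically
```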
(e) Examples of system switchover using the standby-less system switchover (effects distributed) facility
Example of system switchover during normal operations
Figure 25-6 shows an example of system switchover during a time of normal operations.
When an error occurs at hostA, BES1 moves to unt2 and executes processing as a guest BES; BES2 moves to unt3 and executes processing as a guest BES.

Figure 25-6 Example of system switchover during normal operations

[Figure]
Example of system switchover at a host that has accepted guest BESs
Figure 25-7 shows an example of system switchover at a host that has accepted guest BESs. In this example, after a server machine has been restored but before it is reactivated, an error occurs in another server machine.
If an error occurs in hostA while BES5 is executing processing at unt1 as a guest BES, the individual back-end servers behave as follows:
  • BES1 moves to unt2 and executes processing as a guest BES.
  • BES2 moves to unt3 and executes processing as a guest BES.
  • BES5 returns to unt3 and executes processing as a host BES.

    Figure 25-7 Example of system switchover at a host that has accepted guest BESs

    [Figure]

Unbalanced unit loading
Whether the workloads will become unbalanced among the units following system switchover depends on the priorities assigned to the standby systems in the cluster software definition. If the priorities of the standby systems are set appropriately, workloads will be balanced even after multiple units have terminated abnormally in any combination.
However, if system switchover occurs as a result of an error in another unit after a unit has been recovered from an error, the workloads may become unbalanced. For this reason, you should check the allocation of servers after system switchover. Use the pdls -d ha or pdls -d svr command to check the allocation of servers.
If the server allocation is unbalanced, you should use planned system switchover to modify the allocation of the servers.
Additionally, as soon as possible after a unit has been recovered from an error, you should return each BES to the unit where it is defined. This helps prevent server allocation from becoming unbalanced. In the example shown in Figure 25-7, hosts have accepted guest BESs; if BES5 and BES6 are returned to unt3 before the error occurs in hostA, the error in hostA will not result in unbalanced server allocation.
System switchover when a series of errors occurs (when all back-end servers are in accepting status)
Figure 25-8 shows an example of system switchover when a series of errors occurs.

Figure 25-8 Example of system switchover when a series of errors occurs

[Figure]
Explanation
An error occurs at hostA, so BES1 is moved to unt2 and BES2 is moved to unt3, where they execute processing as guest BESs. Because the maximum number of BESs that can be accepted by unt2 and unt3 is 4 each, they can now each accept three more servers. In other words, all back-end servers running at all other units can be accepted.
If an error occurs subsequently at hostB, BES1, BES3, and BES4 running in unt2 all move to unt3 and execute processing as guest BESs. No back-end server stops.
System switchover when a series of errors occurs (when the number of BESs that can be accepted is insufficient)
Figure 25-9 shows an example of system switchover when a series of errors occurs but the number of BESs that can be accepted is insufficient.

Figure 25-9 Example of system switchover when a series of errors occurs but the number of BESs that can be accepted is insufficient

[Figure]
Explanation
An error occurs in hostA, so BES1 is moved to unt2 and BES2 is moved to unt3, where they execute processing as guest BESs. Because the maximum number of BESs that can be accepted by unt2 and unt3 is 2 each, they can now each accept only one more server. In other words, not all the back-end servers running at other units can be accepted.
If an error occurs subsequently at hostB, only BES3 running in unt2 moves to unt3 and executes processing as a guest BES; BES1 and BES4 stop.
If it is critical that processing continue at back-end servers even when a series of errors occurs, you must set an appropriately large value for the maximum number of BESs that can be accepted.
System switchover when the number of BESs that can be accepted is insufficient
Whether a back-end server in a unit is switched to another unit or is stopped when an error occurs is determined by the priority for system switchover that is assigned to each back-end server. This order is determined by an action of the cluster software.
In this example, only BES3 was moved. However, depending on the action of the cluster software, BES1 or BES4 might have been moved.
Example of the action to take when an error occurs while the number of BESs that can be accepted is insufficient
Figure 25-10 shows an example of the action to take when an error occurs while the number of BESs that can be accepted is insufficient.

Figure 25-10 Example of the action to take when an error occurs while the number of BESs that can be accepted is insufficient

[Figure]
Explanation
HostB is recovered from an error and unt2 starts. As a result, BES4 starts as the running system in unt2 and BES1 can be accepted.
Next, when the monact command is entered for BES1 in unt2, BES1 begins processing as a guest BES (when Hitachi HA Toolkit Extension is used, an online command of the cluster software is used).
Example of how to avoid a shortage in the number of BESs that can be accepted
If you cannot set a large value for the number of BESs that can be accepted, you must correct errors as soon as possible in order to prevent server stoppage as a result of a shortage in the number of BESs that can be accepted.
Figure 25-11 shows an example of how to avoid a shortage in the number of BESs that can be accepted.

Figure 25-11 Example of how to avoid a shortage in the number of BESs that can be accepted

[Figure]
Explanation
HostA is recovered from an error and unt1 starts. As a result, BES1 and BES2 start as standby systems in unt1, so that BES3, BES4, BES5, and BES6 become accepting. Therefore, even if an error occurs in hostB, BES1 and BES4 can continue processing at hostA while BES3 can continue processing at hostC.
System switchover when a series of errors occurs (when no BES can be accepted)
Figure 25-12 shows an example in which system switchover cannot be executed when a series of errors occurs.

Figure 25-12 Example in which system switchover cannot be executed when a series of errors occurs

[Figure]
Explanation
An error occurs in hostA, so BES1 is moved to unt2 and BES2 is moved to unt3, where they execute processing as guest BESs. Because the maximum number of BESs that can be accepted by unt2 and unt3 is 1 each, neither of them can now accept any more servers.
If an error occurs subsequently at hostB, BES1, BES3, and BES4 stop.
If it is critical that processing continue at back-end servers, even when a series of errors occurs, you must set an appropriately large value for the maximum number of BESs that can be accepted. The action to take when no more servers can be accepted and the method of avoiding server stoppage are the same as indicated above for the case where the number of BESs that can be accepted is insufficient.
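The three series-of-errors cases (Figures 25-8, 25-9, and 25-12) differ only in the maximum number of BESs that each unit can accept. The following sketch simulates them under stated assumptions: the destination priorities and the order in which servers are switched are hypothetical stand-ins for the cluster software definition and its actions, all function names are invented for illustration, and switching a BES back to its own unit is not modeled.

```python
# Unit in which each BES is defined (hostA runs unt1, hostB unt2, hostC unt3).
HOME = {"BES1": "unt1", "BES2": "unt1", "BES3": "unt2",
        "BES4": "unt2", "BES5": "unt3", "BES6": "unt3"}

# Hypothetical destination priorities per BES.
PRIORITIES = {"BES1": ["unt2", "unt3"], "BES2": ["unt3", "unt2"],
              "BES3": ["unt3", "unt1"], "BES4": ["unt3", "unt1"],
              "BES5": ["unt1", "unt2"], "BES6": ["unt2", "unt1"]}


def guest_count(location, unit):
    """Number of BESs running in `unit` that are defined in another unit."""
    return sum(1 for bes, where in location.items()
               if where == unit and HOME[bes] != unit)


def fail_unit(location, up, failed, max_guests, order):
    """Fail one unit; move its running BESs in `order`, or stop them if there is no room."""
    up.discard(failed)
    stopped = []
    for bes in order:
        if location.get(bes) != failed:
            continue
        for target in PRIORITIES[bes]:
            if target in up and guest_count(location, target) < max_guests:
                location[bes] = target  # runs there as a guest BES
                break
        else:
            location[bes] = None  # no accepting unit has a free guest area
            stopped.append(bes)
    return stopped


def run_series(max_guests):
    """Error at hostA (unt1), then at hostB (unt2); return the BESs stopped by each error."""
    location = dict(HOME)
    up = {"unt1", "unt2", "unt3"}
    first = fail_unit(location, up, "unt1", max_guests, ["BES1", "BES2"])
    second = fail_unit(location, up, "unt2", max_guests, ["BES3", "BES1", "BES4"])
    return first, second


print(run_series(4))  # ([], [])
print(run_series(2))  # ([], ['BES1', 'BES4'])
print(run_series(1))  # ([], ['BES3', 'BES1', 'BES4'])
```

With a different switchover order in the second step, a server other than BES3 would take the last free guest area when the limit is 2, which matches the remark that the outcome depends on the action of the cluster software.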