9.3.2 Setting the timers for monitoring the cluster

The EADS servers in the cluster mutually send heartbeats to notify one another that they are operating normally within the cluster.

EADS servers also detect communication errors by monitoring communication with the other EADS servers during command execution and by monitoring the time taken from the start to the completion of command execution.

Approach: Shortening a monitoring interval lets EADS detect communication errors more quickly. Lengthening it prevents timeouts from occurring frequently. Set each timer with this trade-off in mind.

Organization of this subsection

(1) Sending heartbeats and checking for live servers
(2) Starting the cluster
(3) Running cluster operations
(4) EADS server isolation processing
(5) Cluster recovery processing
(6) Cluster scale-out processing (adding EADS servers)
(7) Complementary processing of the history of update operations

(1) Sending heartbeats and checking for live servers

For details about monitoring the cluster by sending heartbeats, see 2.9 Monitoring a cluster.

The following figure shows the timers for heartbeat transmission and the check for live servers.

[Figure]

The letters in the figure correspond to the explanations provided in the following subsections of 9.3.3 Timeout-related parameters:

(c): 9.3.3(1)(c) eads.failureDetector.heartbeat.interval

(d): 9.3.3(1)(d) eads.failureDetector.heartbeat.timeout

(e): 9.3.3(1)(e) eads.failureDetector.connection.timeout

(f): 9.3.3(1)(f) eads.failureDetector.read.timeout


(2) Starting the cluster

The following figure shows the timers used when the EADS servers are started, based on an example of executing the ezstart command.

[Figure]

The letter in the figure corresponds to the explanation provided in the following subsection of 9.3.3 Timeout-related parameters:

(i): 9.3.3(1)(i) eads.admin.boot.timeout

The EADS server in the cluster that has the smallest EADS server ID (as specified in the cluster properties) is started first. The EADS servers started sequentially thereafter receive heartbeats from the first EADS server and participate in the cluster.

The first EADS server that was started updates its cluster information based on the heartbeats received from the other EADS servers. This updated cluster information is then shared in the cluster.

(a) If the cluster properties differ from those of other EADS servers

An active EADS server adds a hash of its cluster properties to the heartbeats that it sends.

The EADS server that receives a heartbeat checks the hash value. If the hash values do not match, startup of that EADS server fails.
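The check described above can be sketched as follows. This is only an illustration: the hash algorithm (SHA-256), the canonical form (sorted key=value lines), and the property names are assumptions made for this example and are not taken from EADS.

```python
import hashlib
from typing import Dict


def properties_hash(props: Dict[str, str]) -> str:
    """Hash a set of properties in key-sorted form so that ordering does not matter.

    SHA-256 and the key=value canonical form are assumptions for this sketch.
    """
    canonical = "\n".join(f"{key}={props[key]}" for key in sorted(props))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hashes_match(local_props: Dict[str, str], heartbeat_hash: str) -> bool:
    """Return True if the hash carried by a heartbeat matches the local properties."""
    return properties_hash(local_props) == heartbeat_hash


# Hypothetical property names, used only to exercise the functions.
sender = {"example.cluster.size": "5", "example.server.1.address": "10.0.0.1"}
same_as_sender = {"example.server.1.address": "10.0.0.1", "example.cluster.size": "5"}
different = {"example.cluster.size": "3", "example.server.1.address": "10.0.0.1"}

print(hashes_match(same_as_sender, properties_hash(sender)))  # startup can continue
print(hashes_match(different, properties_hash(sender)))       # startup fails
```

Comparing a hash rather than the full property set keeps each heartbeat small while still revealing any difference in the cluster properties.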

(b) If an EADS server already participating in the cluster is shut down while the cluster is starting

If an EADS server that is already participating in the cluster is shut down while the cluster is starting, that EADS server is isolated. If this happens, the other EADS servers stop their start processing and startup fails.

When at least half of the EADS servers in the cluster are shut down, a timeout occurs.

(c) If an EADS server that is not yet participating in the cluster is shut down while the cluster is starting

If an EADS server that is not yet participating in the cluster is shut down while the cluster is starting, every EADS server that has already started times out, because start processing cannot be completed for all EADS servers.


(3) Running cluster operations

The following figure shows the timers used for running the cluster by executing the eztool command.

[Figure]

The letters in the figure correspond to the explanations provided in the following subsections of 9.3.3 Timeout-related parameters:

(a): 9.3.3(2)(a) eads.command.connection.timeout

(b): 9.3.3(2)(b) eads.command.common.read.timeout (#1)

(c): 9.3.3(2)(c) eads.command.common.execution.timeout (#2)

#1: If the eads.command.subcommand-name.read.timeout parameter is specified in the command properties, the value of that parameter is used instead.
#2: If the eads.command.subcommand-name.execution.timeout parameter is specified in the command properties, the value of that parameter is used instead.
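The precedence described in notes #1 and #2 can be sketched as a simple lookup. The function name, the dictionary representation of the command properties, the subcommand name othercmd, and the values are hypothetical. Only the parameter name patterns come from this subsection.

```python
from typing import Dict, Optional


def resolve_timeout(props: Dict[str, str], subcommand: str, kind: str) -> Optional[str]:
    """Pick the timeout for a subcommand; kind is "read" or "execution".

    A subcommand-specific parameter takes precedence over the common one.
    Returns None if neither is specified (default values are not modeled here).
    """
    specific_key = f"eads.command.{subcommand}.{kind}.timeout"
    common_key = f"eads.command.common.{kind}.timeout"
    if specific_key in props:
        return props[specific_key]
    return props.get(common_key)


# Illustrative values only; units and defaults are not taken from the product.
command_props = {
    "eads.command.common.read.timeout": "5000",
    "eads.command.common.execution.timeout": "60000",
    "eads.command.isolate.read.timeout": "20000",
}

print(resolve_timeout(command_props, "isolate", "read"))        # specific value wins
print(resolve_timeout(command_props, "isolate", "execution"))   # falls back to common
print(resolve_timeout(command_props, "othercmd", "read"))       # falls back to common
```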


(4) EADS server isolation processing

The following figure shows the flow of EADS server isolation processing and the relation with timers:

[Figure]

The letters in the figure correspond to explanations provided in the following subsections of 9.3.3 Timeout-related parameters:

(c): 9.3.3(3)(c) eads.client.clusterInfo.update.interval

(k): 9.3.3(1)(k) eads.admin.operation.isolate.gracefulstop.waitTime

If you use the eztool isolate command to isolate an EADS server, you can use the eads.admin.operation.isolate.gracefulstop.waitTime parameter in the server properties to specify the time allowed from completion of the cluster information update operation to completion of isolation processing. If the value of the eads.client.clusterInfo.update.interval parameter in the client properties is smaller than this wait time, the EADS server is isolated only after the EADS client has finished updating its cluster information.

Note that the eads.admin.operation.isolate.gracefulstop.waitTime parameter has no effect if the EADS server is isolated by cluster monitoring.
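The relation between the two parameters can be checked before you run eztool isolate. The following is a minimal sketch. The function name is hypothetical, and it assumes that both values are expressed in the same unit.

```python
def client_updates_before_isolation(graceful_stop_wait_time: int,
                                    client_update_interval: int) -> bool:
    """Return True if the client's cluster information update interval is
    shorter than the server's graceful-stop wait time.

    graceful_stop_wait_time: eads.admin.operation.isolate.gracefulstop.waitTime
    client_update_interval:  eads.client.clusterInfo.update.interval
    Both arguments are assumed to use the same unit.
    """
    return client_update_interval < graceful_stop_wait_time


# Illustrative values only.
print(client_updates_before_isolation(graceful_stop_wait_time=10, client_update_interval=5))
print(client_updates_before_isolation(graceful_stop_wait_time=5, client_update_interval=10))
```

When the check returns False, the EADS client might not have refreshed its cluster information by the time the EADS server is isolated.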

For details about the complementary processing of the history of update operations, see 9.3.2(7) Complementary processing of the history of update operations.


(5) Cluster recovery processing

The following figure shows the timers used when the cluster is recovered.

[Figure]

The letters in the figure correspond to the explanations provided in the following subsections of 9.3.3 Timeout-related parameters:

(m): 9.3.3(1)(m) eads.transfer.timeout

(n): 9.3.3(1)(n) eads.transfer.interval

Data is sent in units of 10 kilobytes until the size specified in the eads.transfer.datasize parameter in the server properties for the EADS server subject to recovery is reached. For example, if a size of 25 kilobytes is specified, 30 kilobytes of data will be sent.
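In other words, the amount actually sent is the specified size rounded up to the next multiple of 10 kilobytes. The following sketch reproduces that arithmetic. The function name is hypothetical, and it assumes that the specified size is given as a whole number of kilobytes.

```python
UNIT_KB = 10  # data is sent in units of 10 kilobytes


def transferred_kilobytes(datasize_kb: int) -> int:
    """Round the specified size up to a whole number of 10-kilobyte units."""
    units = -(-datasize_kb // UNIT_KB)  # ceiling division
    return units * UNIT_KB


print(transferred_kilobytes(25))  # 30, as in the example above
print(transferred_kilobytes(30))  # 30, already a multiple of the unit
```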

During restoration processing, the active EADS servers send data to the EADS server to be restored in order to recover data consistency.

Therefore, note the following:

Restoring an EADS server takes at least as long as the time required to obtain the data.
The EADS server that sends data is affected in proportion to the CPU resources and network bandwidth that are allocated for sending data.
If an EADS server cannot keep up because it must perform both data operations and restoration processing, it might place data operations on hold to avoid a memory shortage.

Reference note: If you will be restoring disk caches and two-way caches, specify the size of the data to be transmitted during restoration processing in the eads.cache.disk.transfer.datasize parameter in the cache properties. Also specify the data transmission interval during restoration processing in the eads.cache.disk.transfer.interval parameter.

Even while the data is being updated, the isolated EADS server can be restored to the cluster with the data integrity recovered. For the general procedure for restoring one or more isolated EADS servers, see 12.2.1 If one or more EADS servers are isolated.

For details about the complementary processing of the history of update operations, see 9.3.2(7) Complementary processing of the history of update operations.


(6) Cluster scale-out processing (adding EADS servers)

Scaling out the cluster (adding EADS servers to the cluster) uses the same timers as recovering the cluster.

For details about these timers, see 9.3.2(5) Cluster recovery processing, reading "recovery" as "scale-out".


(7) Complementary processing of the history of update operations

During EADS server isolation processing, restoration processing, scale-out processing, and locking, the EADS servers check their histories of update operations with each other. If any difference is detected, complementary processing of the history of update operations is performed. This ensures the consistency of the order in which data is written.

Complementary processing of the history of update operations consists of the following two processes:

Complementary processing of the history of update operations on the remote EADS server
Complementary processing of the history of update operations on the local EADS server

Here, the local EADS server refers to the following EADS server:

For isolation processing: The EADS server (copy-destination EADS server) that takes over the processing of the EADS server to be isolated
For restoration and scale-out processing: The EADS server to be restored or the EADS server to be added during scale-out processing

The following figure shows the flow of complementary processing of the history of update operations and the relation with timers:

[Figure]

The letter in the figure corresponds to the explanation provided in the following subsection of 9.3.3 Timeout-related parameters:

(o): 9.3.3(1)(o) eads.replication.fillgap.copy.timeout

Complementary processing of the history of update operations on the remote EADS server

The history of update operations on the local EADS server is sent to each EADS server.
If the history of update operations on the local EADS server is different from that on a remote EADS server, the local EADS server sends the history of update operations to the remote EADS server. At this time, the EADS server sends the amount of data specified in the eads.replication.fillgap.copy.datasize parameter in the server properties.
The history of update operations on the remote EADS server is complemented based on the history of update operations sent from the local EADS server.

The history of update operations on the remote EADS server is complemented as shown below.

Complementary processing of the history of update operations on the remote EADS server is repeated as many times as there are differences in the history of update operations for the copy-destination EADS servers.

Complementary processing of the history of update operations on the local EADS server

A request for complementary processing of the history of update operations is sent to each EADS server to check whether the history of update operations on the local EADS server differs from that on the other EADS servers.
Consensus processing of the history of update operations is performed in response to the request.

If the consensus processing does not finish within the time specified by the eads.replication.consensus.timeout parameter in the server properties, a timeout occurs, and then the consensus processing is performed again. This process is repeated until a consensus is built.
Via consensus processing, the remote EADS server sends the history of update operations to the local EADS server.
The history of update operations on the local EADS server is complemented based on the history of update operations sent by each EADS server.

The history of update operations on the local EADS server is complemented as shown in the following diagram:

Complementary processing of the history of update operations on the local EADS server is repeated as many times as there are differences in the history of update operations on the local EADS server.
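The retry behavior of consensus processing (a timeout followed by another attempt, repeated until a consensus is built) can be sketched as a loop. All names below are hypothetical. The attempt callable stands in for one round of consensus processing and is assumed to raise TimeoutError when the time specified by eads.replication.consensus.timeout elapses.

```python
from typing import Callable, List, Tuple


def build_consensus(attempt: Callable[[float], List[str]],
                    consensus_timeout: float) -> Tuple[List[str], int]:
    """Repeat consensus processing until one attempt finishes within the timeout.

    Returns the agreed history and the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return attempt(consensus_timeout), attempts
        except TimeoutError:
            continue  # timed out: perform consensus processing again


def make_flaky_attempt(failures: int, history: List[str]) -> Callable[[float], List[str]]:
    """Build a stand-in attempt that times out `failures` times, then succeeds."""
    state = {"remaining": failures}

    def attempt(timeout: float) -> List[str]:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TimeoutError(f"no consensus within {timeout} seconds")
        return history

    return attempt


history, attempts = build_consensus(make_flaky_attempt(2, ["update-1", "update-2"]), 1.0)
print(history, attempts)
```

Because the loop has no upper bound on the number of attempts, a timeout value that is too small for the environment leads to repeated retries rather than to a failure.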

Complementary processing of the history of update operations might be performed more than once for one operation of restoration, isolation, or scale-out processing. The maximum number of times complementary processing can be performed is as follows: (data multiplicity - 1) × (number of caches).

The number of threads that perform complementary processing of the history of update operations simultaneously is (the number of redundant copies of data plus the original) - 1. If the number of redundant copies of data plus the original is 1, the number of threads is 1.
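The two quantities above can be worked through as follows. This sketch reads the operator between the two terms of the maximum-count formula as multiplication. The function names and sample values are hypothetical.

```python
def max_complementary_runs(data_multiplicity: int, number_of_caches: int) -> int:
    """Maximum number of times complementary processing can be performed
    for one restoration, isolation, or scale-out operation."""
    return (data_multiplicity - 1) * number_of_caches


def simultaneous_threads(copies_including_original: int) -> int:
    """Number of threads used simultaneously for complementary processing.

    copies_including_original is the number of redundant copies of data plus
    the original; the result never drops below 1.
    """
    return max(1, copies_including_original - 1)


# Illustrative values: data multiplicity of 3 and 4 caches.
print(max_complementary_runs(3, 4))  # (3 - 1) x 4 = 8
print(simultaneous_threads(3))       # 3 - 1 = 2
print(simultaneous_threads(1))       # floor of 1 applies
```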
