Loading data into a row-partitioned table (using the parallel loading facility)

The performance of the parallel loading facility, such as its allocation of the input data, might be affected by the environment in which data loading is executed. For this reason, in some execution environments data loading might not be any faster even when you use the parallel loading facility. Therefore, try using the parallel loading facility and measure how much time data loading takes before you start actual operations. If this trial shows that using the parallel loading facility will shorten data loading time, use the parallel loading facility. If the facility does not deliver the expected processing performance, load data in table units or divide the input data file and load data in RDAREA units. For details about creating a divided-input data file, see 6.1.7 Loading data into a row-partitioned table (Creating divided-input data files).

The conditions that affect the parallel loading facility's performance are described below.

(a) Row partitioning among servers

The parallel loading facility is advantageous when you are partitioning rows among servers. Parallel loading can also be executed on tables with row partitioning on a single server. But data loading is faster when it is divided among servers, since processing of a single pdload command to store data can monopolize the servers.

(b) Allocation of input data

During parallel loading, multiple pdload commands read data from a single input data file. Processing time can be shortened at this point by allocating data so as to increase read processing efficiency. As the following figure shows, allocating input data evenly across partitioning keys results in efficient read processing.

Figure 6-5 Allocation of input data vs. read efficiency

[Figure]

If data from a table that uses row partitioning on one server is unloaded by a pdrorg command to a single file, the unloaded data file will demonstrate lopsided data allocation, as shown in the figure. For this reason, if these tables are used as input data files, performance might decline despite the use of parallel loading.

(c) Formats for input data

The parallel loading facility allows data loading in all the data formats that can be loaded using the pdload command. However, whenever you are working with large quantities of data, we recommend using the binary format for input data. The DAT and fixed-size data formats both require that the format be converted when the data is stored in a database. This might increase CPU usage and decrease the performance of parallel processing of data loading.

(d) LOB creation types

When LOB columns and abstract data type columns that have LOB parameters are defined in the table you are processing, we recommend specifying f as the LOB creation type (the -k option) for the data load. A file is prepared for each LOB data item as the LOB input file for data loads that specify f. For this reason, there is no duplicate reading of LOB data when data loads are executed in parallel. This reduces input and output.

When d is specified as the LOB creation type, all LOB data is stored in the input data file. This generates processing to skip reading of LOB data that is not a target of processing when data is loaded in RDAREA units. This increase in processing can weaken performance.

(6) Operating examples

This section provides operating examples that use parallel loading.

In Operating example 1, LOB columns are defined.
In Operating example 2, a non-partitioning key index is defined.

The conditions for these operating examples are as follows.

Table 6-2 Parallel loading operating example conditions

Operating example	Table definition		Index definition		Input data format	Index creation method	Log acquisition mode
Operating example	Row partitioning	LOB column	Partition-ing key index	Non-partitioning key index	Input data format	Index creation method	Log acquisition mode
Operating example 1	Y	Y	Y	N	Binary	Batch index creation mode	Pre-update log acquisition mode
Operating example 2	Y	N	Y	Y	Binary	Batch index creation mode	Pre-update log acquisition mode

Legend: Y: Defined; N: Not defined

(a) Operating example 1 (defining LOB columns)

Data is loaded to TBL2, which defines LOB columns, using the parallel loading facility. For the LOB creation type (-k option), f is specified and a file is created for each LOB data item.

Table and index definitions

Table definition

 
CREATE TABLE "TBL2"(
  col001    int not null,
  col002    varchar(20),
  col003    blob(1M) in ((LOB01),(LOB02),(LOB03)),
  col004    decimal(10,3)
) fix hash hashf by col001
in (RDUSER11,RDUSER21,RDUSER31);

Index definition

 
CREATE INDEX "INDX2" ON "TBL2"(col001,col002)
in ((RDUSER12),(RDUSER22),(RDUSER32));

Data load procedure

Use the pdhold command to shut down the RDAREA to which you plan to load data.

 
pdhold -r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32

Create the pdparaload command control statement file.
source: Specify the server name, input data file and error information file. For a HiRDB parallel server configuration, always specify the server name.
lobdata: Specify the LOB input file.
idxwork: Specifies the storage directory for the index information file.
sort: Specifies the storage directory of the work file for sorting.
report: Specifies the processing results file name.

 
source fes01:/users/data/input_file error=/users/rep/error_file
lobdata /users/data/lob
idxwork bes01 /users/work
idxwork bes02 /users/work
idxwork bes03 /users/work
sort bes01 /users/work
sort bes02 /users/work
sort bes03 /users/work
report file=/users/rep/result_file

Use the pdparaload command to load data.
-d: Load data in creation mode.
-b: Use a binary input data file.
-k: Specify the LOB creation type. Create a file for each LOB data item (f).
-i: Specify the batch index creation mode (c).
-l: Specify the pre-update log acquisition mode (p).
pdparaload -d -b -k f -i c -l p "TBL2" /users/cntl/control_file

Use the pdcopy command to make a backup of the RDAREA.

 
pdcopy -m /users/rdarea/mast/mast01 -M r -p /users/pdcopy/list/list01
-b /users/pdcopy/backup/backup01
-r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32

Use the pdrels command to release the RDAREA from shutdown status.

 
pdrels -r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32

(b) Operating example 2 (defining a non-partitioning key index)

Load data to TBL3, which defines a non-partitioning key index, using the parallel loading facility. Since a non-partitioning key index is being defined, batch creation of the index using the pdrorg command is required after you execute the pdparaload command. For batch creation of the index, you specify the index information file output by the pdparaload command.

Table and index definitions

Table definition

 
CREATE TABLE "TBL3"(
  col001    int not null,
  col002    varchar(20),
  col003    char(32),
  col004    decimal(10,3)
) partitioned by col001
in ((RDUSER11) 10000000, (RDUSER21) 20000000, (RDUSER31));

Index definition (partitioning key index)

 
CREATE INDEX "INDX3" ON "TBL3" (col001,col002)
in ((RDUSER12),(RDUSER22),(RDUSER32));

Index definition (non-partitioning key index)

 
CREATE INDEX "INDX4" ON "TBL3" (col003)
in ((RDUSER13));

Data load procedure

Use the pdhold command to shut down the RDAREA to which you plan to load data.

 
pdhold -r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13

Create the pdparaload command control statement file.
source: Specify the input data file and the error information file. For a HiRDB parallel server configuration, always specify the server name.
idxwork: Specify the storage directory for the index information file. After data loading, execute batch creation of the index using the index information file created here.
sort: Specify the storage directory of the work file for sorting.
report: Specify the processing results file name.
source /users/data/input_file error=/users/rep/error_file idxwork /users/work sort /users/work report file=/users/rep/result_file
Use the pdparaload command to load data.
-d: Load data in creation mode.
-b: Use a binary input data file.
-i: Specify the batch index creation mode (c).
-l: Specify the pre-update log acquisition mode (p).
pdparaload -d -b -i c -l p "TBL3" /users/cntl/control_file

Create the pdrorg command control statement file.
In the index statement, specify the index information file output by the pdparaload command.

 
index "INDX4" RDUSER13 "/users/work/INDEX-INDX4-RDUSER13-faaG4MnMf"
index "INDX4" RDUSER13 "/users/work/INDEX-INDX4-RDUSER13-caa34EnEc"
index "INDX4" RDUSER13 "/users/work/INDEX-INDX4-RDUSER13-caa06MnMc"
sort /users/work
report file=/users/rep/result_file2

With the pdrorg command, execute batch creation of the index.

 
pdrorg -k ixmk -l p -t "TBL3" /users/cntl/control_rorg

Use the pdcopy command to make a backup of the RDAREA.

 
pdcopy -m /users/rdarea/mast/mast01 -M r -p /users/pdcopy/list/list01
-b /users/pdcopy/backup/backup01
-r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13

Use the pdrels command to release the RDAREA from shutdown status.

 
pdrels -r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13

(7) What to do when a command error occurs

There are two basic types of errors that can occur when the pdparaload command is executed, and two corresponding ways to handle them, as follows.

Errors when executing data loads in RDAREA units (pdload command)
In this case, data fails to load into some RDAREAs. After handling the error for those RDAREAs that experienced errors, re-execute the data load in RDAREA units with the pdload command.
Errors before data loads in RDAREA units (pdload command) begin
In this case, data loads to all RDAREAs fail. After handling the error, re-execute the pdparaload command.

This section describes details about errors generated when the pdparaload command is executed and their handling. The handling procedure when an error occurs when the pdparaload command is executed is shown below.

Figure 6-6 Handling procedure when an error occurs during pdparaload command execution

[Figure]

Note:: Sections (a) through (e) in the figure correspond to (a) through (e) in the figure.

(a) Identify error timing

Identify when the error occurred. Check whether the following two messages have been output.

KFPL00791-I
KFPL00792-I

Both messages have been output: An error occurred during execution of a data load in RDAREA units. This means data loads to some RDAREAs have failed. In this case, handle the error by starting with step (b) of the procedure. Thereafter, re-execute the data load in RDAREA units with the pdload command.

One of the messages has not been output: An error occurred before data loading in RDAREA units began. This means data loads to all RDAREAs have failed. In this case, handle as directed by the error message output directly prior to the KFPL00794-I message. Thereafter, re-execute the pdparaload command.

(b) Identify RDAREAs that require handling

Extract the KFPL00793-I message from the messages output to the OS console that executed the pdparaload command or to the syslog file, and then identify the RDAREAs that require handling. When extracting the message, use the following as search keywords.

Message ID (KFPL00793-I)
Authorization identifier
Table identifier

Message extraction example

 
KFPL00793-I Pdload execution abnormal terminated, table=user1."T1", RDAREA=RDAREA1, return code=8, pdload process id=1385412 (1212478)
   :
KFPL00793-I Pdload execution abnormal terminated, table=user1."T1", RDAREA=RDAREA2, return code=8, pdload process id=1385433 (1212478)

Description: Extract the KFPL00793-I message based on the authorization identifier (user1) and the table identifier (T1). This can identify the RDAREAs (RDAREA1 and RDAREA2) where the data load failed.

Take one of the following corrective actions for the identified RDAREA, depending on the return code of the KFPL00793-I message.

Return code is 4

Check the error information file output by the pdload command and take corrective action for the error. The error information file is specified by the error option of the source statement of the pdparaload command (the RDAREA name is automatically appended to the file name). When nothing is specified, the following directory and file name are used to create the file.

The error information file storage directory
1. The directory specified in the pd_tmp_directory operand
2. If the above directory is not specified, the directory specified in environment variable TMPDIR
3. If the above directory is not specified, the /tmp directory
The error information file name
The file name that includes the process ID and message ID (KFPL00793-I) identified in (c)

After corrective action is taken, re-execute the data load in units of RDAREAs using the procedure of step (e).

Return code is 8: Identify the error as instructed in steps (c) and (d) and take corrective action.; Then, re-execute the data load in units of RDAREAs using the procedure of step (e).

(c) Identify the IDs of the processes that were executed

From the message log file, identify the IDs of the pdload control process and server process that the pdparaload command executed. Extract the process IDs by the following procedure.

Output the message log from pdparaload command execution start (KFPL00791-I message) to termination (KFPL00794-I message) to a file using the pdcat command.
Extract the KFPL00711-I message, which indicates the start of the process, from the content output in step 1. To extract the message, use the following search keywords.
- Message ID (KFPL00711-I)
- Authorization identifier
- Table identifier
- RDAREA name identified in step (b)

Message extraction example

 
1572976 2010/11/04 15:55:48 0mload1 lod KFPL00711-I pdloadm started, table=user1."T1", RDAREA=RDAREA1
1728690 2010/11/04 15:55:48 bes1    lod KFPL00711-I pdbes started, table=user1."T1", RDAREA=RDAREA1

Description

Extract the KFPL00711-I message based on the authorization identifier (user1), the table identifier (T1) and the RDAREA name (RDAREA1). This can identify the IDs of the processes (1572976, 1728690) that generated the error.

In this case, these IDs identify the following processes.

The process name of 1572976 is pdloadm, so it is the pdload control process.
The process name of 1728690 is pdbes, so it is the back-end server process.

(d) Identify the error content

Identify the nature of the error from the message log file. Extract the error message that requires handling from the message log file acquired in step 1 of (c) using the process IDs identified in step (c) as search keywords.

Message extraction example (pdload control process)

 
1572976 2010/11/04 15:55:48 0mload1 lod KFPL00711-I pdloadm started, table=user1."T1", RDAREA=RDAREA1
1572976 2010/11/04 15:55:48 0mload1 lod KFPL00704-I Pdload terminated, return code=8

Message extraction example (back-end server process)

 
1728690 2010/11/04 15:55:48 bes1 lod KFPL00711-I pdbes started, table=user1."T1", RDAREA=RDAREA1
1728690 2010/11/04 15:55:48 bes1 lod KFPL00709-I Error information file was created, file=/tmp/ERROR-4cd258f41728690
1728690 2010/11/04 15:55:48 bes1 lod KFPL00709-I Lobmid file was created, file=/tmp/LOBMID-T1-4cd258f41728690
1728690 2010/11/04 15:55:48 bes1 lod KFPL00702-I Pdload started, table=user1."T1", generation=0
1728690 2010/11/04 15:55:48 bes1 lod KFPL00710-I Index information file assigned, index=user1."T1NX", RDAREA="USER01", file=/tmp/INDEX-T1NX-USER01-GN0-daan_ylid
1728690 2010/11/04 15:55:48 bes1 lod KFPL00723-I 0 rows loaded, table=user1."T1", RDAREA="USER01"
1728690 2010/11/04 15:55:48 bes1 lod KFPLxxxxx-E YYYYYY

Description: Extract messages based on the process IDs (1572976, 1728690). Here, a back-end server process has output the error message (KFPLxxxxx-E). Take corrective action as indicated by the error message.

After corrective action is taken, re-execute the data load in units of RDAREAs according to the procedure of step (e).

(e) Re-execute the data load in units of RDAREAs (pdload command)

Re-execute the data load in units of RDAREAs using the pdload command. Here, we recommend re-using the pdload command control statement generated by the pdparaload command. The storage directory and file name of the pdload command control statement file generated by the pdparaload command are as follows. Create the pdload control statement file based on these.

The pdload control statement storage directory
1. The directory specified in the pd_tmp_directory operand
2. If the above directory is not specified, the directory specified in environment variable TMPDIR
3. If the above directory is not specified, the /tmp directory
The pdload control statement file name
LOD_CTL_authorization-ID_table-ID_RDAREA-NAME

Note

When re-executing the data load in units of RDAREAs, specify divermsg=off in the option statement of pdload.

When loading data in RDAREA units, error information will be output when there is line data in the input data that does not match the RDAREA storage conditions, and the pdload command will terminate with a return code of 4. If divermsg=off is specified, output of this error data information is suppressed, and the pdload command can terminate with a return code of 0.

The following is an example of the pdload command control statements and command lines when re-executing a data load in units of RDAREAs.

Control statement example

 
option divermsg=off
source "RDUSER02" fes01:/users/data/input_file error=/users/rep/error_file_RDUSER02
lobdata /users/data/lob
idxwork bes02 /users/work
sort bes02 /users/work
report file=/users/rep/result_file_RDUSER02

Description

Specifying divermsg=off in the option statement suppresses output of error data information.
Specify the name of the RDAREA that is the target of the source statement for data loading in units of RDAREAs.

Command line example

 
pdload -d -b -k f -i c -l p "TBL2" "/users/tmp/LOD_CTL_USER02_ TBL2_RDUSER02"

6.1.6 Loading data into a row-partitioned table (using the parallel loading facility)

(1) Advantages

(2) Preparing for command execution

(a) Calculate the resources used

(b) Prepare the input data

(3) Operating procedures

(4) Limitations

(5) Conditions that might affect performance