6.1.6 Loading data into a row-partitioned table (using the parallel loading facility)

The parallel loading facility executes, in parallel, data loads from a single input data file to multiple RDAREAs that constitute a row-partitioned table. Executing a pdparaload command allows data to be loaded to multiple RDAREAs at once.

The following figure shows an overview of the parallel loading facility.

Figure 6-3 Overview of the parallel loading facility

[Figure]

Description
When the user executes a pdparaload command, pdload commands equal to the number of RDAREAs that constitute the row-partitioned table are executed automatically. At this time, the pdparaload command generates the pdload command control statement file.
Each pdload command accesses the input data file, extracts the relevant data, and stores it in the row-partitioned table within the RDAREA.
Organization of this subsection
(1) Advantages
(2) Preparing for command execution
(3) Operating procedures
(4) Limitations
(5) Conditions that might affect performance
(6) Operating examples
(7) What to do when a command error occurs

(1) Advantages

Using the parallel loading facility offers the following two advantages.

The parallel loading facility thus offers the dual advantage of fast processing from data loads by RDAREA and easy data loading by table.

(2) Preparing for command execution

Do the following prior to executing parallel loading.

(a) Calculate the resources used

With parallel loading, multiple pdload commands execute in parallel. For example, when a table is divided into three RDAREAs for storage, three pdload commands are executed simultaneously, so three portions of resources are needed. Factor this in when you calculate the amount of resources that will be necessary.

(b) Prepare the input data

Prepare the data to be input to the table as an input data file. The pdparaload command uses a single input data file.

(3) Operating procedures

The figure below shows the operating procedures for parallel loading.

Figure 6-4 Parallel loading operating procedure

[Figure]

#1
This step might not be necessary depending on the specification for the database's update log acquisition mode (the -l option).
#2
This step is required if n (index information output mode) or x (index information output suppression mode) was specified as the index creation method (the -i option). It is also necessary if c (batch index creation mode) was specified and if a non-partitioning key index has been defined.

(4) Limitations

The following limitations apply when data is loaded using the parallel loading facility:

pdparaload command options and control statements can be specified in the same manner as pdload command options and control statements, with some exceptions. For details about what can be specified, see the section on pdparaload in the manual HiRDB Version 9 Command Reference.

(5) Conditions that might affect performance

The performance of the parallel loading facility, such as its allocation of the input data, might be affected by the environment in which data loading is executed. For this reason, in some execution environments data loading might not be any faster even when you use the parallel loading facility. Therefore, try using the parallel loading facility and measure how much time data loading takes before you start actual operations. If this trial shows that using the parallel loading facility will shorten data loading time, use the parallel loading facility. If the facility does not deliver the expected processing performance, load data in table units or divide the input data file and load data in RDAREA units. For details about creating a divided-input data file, see 6.1.7 Loading data into a row-partitioned table (Creating divided-input data files).

The conditions that affect the parallel loading facility's performance are described below.

(a) Row partitioning among servers

The parallel loading facility is advantageous when you are partitioning rows among servers. Parallel loading can also be executed on tables with row partitioning on a single server. But data loading is faster when it is divided among servers, since processing of a single pdload command to store data can monopolize the servers.

(b) Allocation of input data

During parallel loading, multiple pdload commands read data from a single input data file. Processing time can be shortened at this point by allocating data so as to increase read processing efficiency. As the following figure shows, allocating input data evenly across partitioning keys results in efficient read processing.

Figure 6-5 Allocation of input data vs. read efficiency

[Figure]

If data from a table that uses row partitioning on one server is unloaded by a pdrorg command to a single file, the unloaded data file will demonstrate lopsided data allocation, as shown in the figure. For this reason, if these tables are used as input data files, performance might decline despite the use of parallel loading.

(c) Formats for input data

The parallel loading facility allows data loading in all the data formats that can be loaded using the pdload command. However, whenever you are working with large quantities of data, we recommend using the binary format for input data. The DAT and fixed-size data formats both require that the format be converted when the data is stored in a database. This might increase CPU usage and decrease the performance of parallel processing of data loading.

(d) LOB creation types

When LOB columns and abstract data type columns that have LOB parameters are defined in the table you are processing, we recommend specifying f as the LOB creation type (the -k option) for the data load. A file is prepared for each LOB data item as the LOB input file for data loads that specify f. For this reason, there is no duplicate reading of LOB data when data loads are executed in parallel. This reduces input and output.

When d is specified as the LOB creation type, all LOB data is stored in the input data file. This generates processing to skip reading of LOB data that is not a target of processing when data is loaded in RDAREA units. This increase in processing can weaken performance.

(6) Operating examples

This section provides operating examples that use parallel loading.

The conditions for these operating examples are as follows.

Table 6-2 Parallel loading operating example conditions

Operating exampleTable definitionIndex definitionInput data formatIndex creation methodLog acquisition mode
Row partitioningLOB columnPartition-ing key indexNon-partitioning key index
Operating example 1YYYNBinaryBatch index creation modePre-update log acquisition mode
Operating example 2YNYYBinaryBatch index creation modePre-update log acquisition mode
Legend
Y: Defined
N: Not defined

(a) Operating example 1 (defining LOB columns)

Data is loaded to TBL2, which defines LOB columns, using the parallel loading facility. For the LOB creation type (-k option), f is specified and a file is created for each LOB data item.

[Figure]Table and index definitions
Table definition

CREATE TABLE "TBL2"(
 col001    int not null,
 col002    varchar(20),
 col003    blob(1M) in ((LOB01),(LOB02),(LOB03)),
 col004    decimal(10,3)
) fix hash hashf by col001
in (RDUSER11,RDUSER21,RDUSER31);

Index definition

CREATE INDEX "INDX2" ON "TBL2"(col001,col002)
in ((RDUSER12),(RDUSER22),(RDUSER32));

[Figure]Data load procedure
  1. Use the pdhold command to shut down the RDAREA to which you plan to load data.

    pdhold -r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32

  2. Create the pdparaload command control statement file.
    source: Specify the server name, input data file and error information file. For HiRDB/Parallel Server, always specify the server name.
    lobdata: Specify the LOB input file.
    idxwork: Specifies the storage directory for the index information file.
    sort: Specifies the storage directory of the work file for sorting.
    report: Specifies the processing results file name.

    source fes01:c:\users\data\input_file error=c:\users\rep\error_file
    lobdata c:\users\data\lob
    idxwork bes01 c:\users\work
    idxwork bes02 c:\users\work
    idxwork bes03 c:\users\work
    sort bes01 c:\users\work
    sort bes02 c:\users\work
    sort bes03 c:\users\work
    report file=c:\users\rep\result_file

  3. Use the pdparaload command to load data.
    -d: Load data in creation mode.
    -b: Use a binary input data file.
    -k: Specify the LOB creation type. Create a file for each LOB data item (f).
    -i: Specify the batch index creation mode (c).
    -l: Specify the pre-update log acquisition mode (p).

    pdparaload -d -b -k f -i c -l p "TBL2" c:\users\cntl\control_file

  4. Use the pdcopy command to make a backup of the RDAREA.

    pdcopy -m c:\users\rdarea\mast\mast01 -M r -p c:\users\pdcopy\list\list01
    -b c:\users\pdcopy\backup\backup01
    -r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32

  5. Use the pdrels command to release the RDAREA from shutdown status.

    pdrels -r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32

(b) Operating example 2 (defining a non-partitioning key index)

Load data to TBL3, which defines a non-partitioning key index, using the parallel loading facility. Since a non-partitioning key index is being defined, batch creation of the index using the pdrorg command is required after you execute the pdparaload command. For batch creation of the index, you specify the index information file output by the pdparaload command.

[Figure]Table and index definitions
Table definition

CREATE TABLE "TBL3"(
 col001    int not null,
 col002    varchar(20),
 col003    char(32),
 col004    decimal(10,3)
) partitioned by col001
in ((RDUSER11) 10000000​, (RDUSER21) 20000000​, (RDUSER31));

Index definition (partitioning key index)

CREATE INDEX "INDX3" ON "TBL3" (col001,col002)
in ((RDUSER12),(RDUSER22),(RDUSER32));

Index definition (non-partitioning key index)

CREATE INDEX "INDX4" ON "TBL3" (col003)
in ((RDUSER13));

Data load procedure
  1. Use the pdhold command to shut down the RDAREA to which you plan to load data.

    pdhold -r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13

  2. Create the pdparaload command control statement file.
    source: Specify the input data file and the error information file. For HiRDB/Parallel Server, always specify the server name.
    idxwork: Specify the storage directory for the index information file. After data loading, execute batch creation of the index using the index information file created here.
    sort: Specify the storage directory of the work file for sorting.
    report: Specify the processing results file name.

    source c:\users\data\input_file error=c:\users\rep\error_file
    idxwork c:\users\work
    sort c:\users\work
    report file=c:\users\rep\result_file

  3. Use the pdparaload command to load data.
    -d: Load data in creation mode.
    -b: Use a binary input data file.
    -i: Specify the batch index creation mode (c).
    -l: Specify the pre-update log acquisition mode (p).

    pdparaload -d -b -i c -l p "TBL3" c:\users\cntl\control_file

  4. Create the pdrorg command control statement file.
    In the index statement, specify the index information file output by the pdparaload command.

    index "INDX4" RDUSER13 "c:\users\work\INDEX-INDX4-RDUSER13-faaG4MnMf"
    index "INDX4" RDUSER13 "c:\users\work\INDEX-INDX4-RDUSER13-caa34EnEc"
    index "INDX4" RDUSER13 "c:\users\work\INDEX-INDX4-RDUSER13-caa06MnMc"
    sort c:\users\work
    report file=c:\users\rep\result_file2

  5. With the pdrorg command, execute batch creation of the index.

    pdrorg -k ixmk -l p -t "TBL3" c:\users\cntl\control_rorg

  6. Use the pdcopy command to make a backup of the RDAREA.

    pdcopy -m c:\users\rdarea\mast\mast01 -M r -p c:\users\pdcopy\list\list01
    -b c:\users\pdcopy\backup\backup01
    -r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13

  7. Use the pdrels command to release the RDAREA from shutdown status.

    pdrels -r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13

(7) What to do when a command error occurs

There are two basic types of errors that can occur when the pdparaload command is executed, and two corresponding ways to handle them, as follows.

This section describes details about errors generated when the pdparaload command is executed and their handling. The handling procedure when an error occurs when the pdparaload command is executed is shown below.

Figure 6-6 Handling procedure when an error occurs during pdparaload command execution

[Figure]

Note:
Sections (a) through (e) in the figure correspond to (a) through (e) in the figure.
(a) Identify error timing

Identify when the error occurred. Check whether the following two messages have been output.

Both messages have been output
An error occurred during execution of a data load in RDAREA units. This means data loads to some RDAREAs have failed. In this case, handle the error by starting with step (b) of the procedure. Thereafter, re-execute the data load in RDAREA units with the pdload command.
One of the messages has not been output
An error occurred before data loading in RDAREA units began. This means data loads to all RDAREAs have failed. In this case, handle as directed by the error message output directly prior to the KFPL00794-I message. Thereafter, re-execute the pdparaload command.
(b) Identify RDAREAs that require handling

Extract the KFPL00793-I message from the messages output to the command prompt that executed the pdparaload command or to the event log, and then identify the RDAREAs that require handling. When extracting the message, use the following as search keywords.

Message extraction example

KFPL00793-I Pdload execution abnormal terminated, table=user1."T1", RDAREA=RDAREA1, return code=8, pdload process id=1385412​ (1212478​)
  :
KFPL00793-I Pdload execution abnormal terminated, table=user1."T1", RDAREA=RDAREA2, return code=8, pdload process id=1385433​ (1212478​)

Description
Extract the KFPL00793-I message based on the authorization identifier (user1) and the table identifier (T1). This can identify the RDAREAs (RDAREA1 and RDAREA2) where the data load failed.

Take one of the following corrective actions for the identified RDAREA, depending on the return code of the KFPL00793-I message.

Return code is 4
Check the error information file output by the pdload command and take corrective action for the error. The error information file is specified by the error option of the source statement of the pdparaload command (the RDAREA name is automatically appended to the file name). When nothing is specified, the following directory and file name are used to create the file.
  • The error information file storage directory
    1. The directory specified in the pd_tmp_directory operand
    2. If the above directory is not specified, the directory specified in system environment variable TMP
    3. If the above directory is not specified, the %PDDIR%\tmp directory
  • The error information file name
    The file name that includes the process ID and message ID (KFPL00793-I) identified in (c)
After corrective action is taken, re-execute the data load in units of RDAREAs using the procedure of step (e).
Return code is 8
Identify the error as instructed in steps (c) and (d) and take corrective action.
Then, re-execute the data load in units of RDAREAs using the procedure of step (e).
(c) Identify the IDs of the processes that were executed

From the message log file, identify the IDs of the pdload control process and server process that the pdparaload command executed. Extract the process IDs by the following procedure.

  1. Output the message log from pdparaload command execution start (KFPL00791-I message) to termination (KFPL00794-I message) to a file using the pdcat command.
  2. Extract the KFPL00711-I message, which indicates the start of the process, from the content output in step 1. To extract the message, use the following search keywords.
    • Message ID (KFPL00711-I)
    • Authorization identifier
    • Table identifier
    • RDAREA name identified in step (b)
Message extraction example

1572976​ 2010/11/04 15:55:48 0mload1 lod KFPL00711-I pdloadm started, table=user1."T1", RDAREA=RDAREA1
1728690​ 2010/11/04 15:55:48 bes1    lod KFPL00711-I pdbes started, table=user1."T1", RDAREA=RDAREA1

Description
Extract the KFPL00711-I message based on the authorization identifier (user1), the table identifier (T1) and the RDAREA name (RDAREA1). This can identify the IDs of the processes (1572976, 1728690) that generated the error.
In this case, these IDs identify the following processes.
  • The process name of 1572976 is pdloadm, so it is the pdload control process.
  • The process name of 1728690 is pdbes, so it is the back-end server process.
(d) Identify the error content

Identify the nature of the error from the message log file. Extract the error message that requires handling from the message log file acquired in step 1 of (c) using the process IDs identified in step (c) as search keywords.

Message extraction example (pdload control process)

1572976​ 2010/11/04 15:55:48 0mload1 lod KFPL00711-I pdloadm started, table=user1."T1", RDAREA=RDAREA1
1572976​ 2010/11/04 15:55:48 0mload1 lod KFPL00704-I Pdload terminated, return code=8

Message extraction example (back-end server process)

1728690​ 2010/11/04 15:55:48 bes1 lod KFPL00711-I pdbes started, table=user1."T1", RDAREA=RDAREA1
1728690​ 2010/11/04 15:55:48 bes1 lod KFPL00709-I Error information file was created, file=c:/tmp/ERROR-4cd258f41728690​
1728690​ 2010/11/04 15:55:48 bes1 lod KFPL00709-I Lobmid file was created, file=c:/tmp/LOBMID-T1-4cd258f41728690​
1728690​ 2010/11/04 15:55:48 bes1 lod KFPL00702-I Pdload started, table=user1."T1", generation=0
1728690​ 2010/11/04 15:55:48 bes1 lod KFPL00710-I Index information file assigned, index=user1."T1NX", RDAREA="USER01", file=c:/tmp/INDEX-T1NX-USER01-GN0-daan_ylid
1728690​ 2010/11/04 15:55:48 bes1 lod KFPL00723-I 0 rows loaded, table=user1."T1", RDAREA="USER01"
1728690​ 2010/11/04 15:55:48 bes1 lod KFPLxxxxx-E YYYYYY

Description
Extract messages based on the process IDs (1572976, 1728690). Here, a back-end server process has output the error message (KFPLxxxxx-E). Take corrective action as indicated by the error message.

After corrective action is taken, re-execute the data load in units of RDAREAs according to the procedure of step (e).

(e) Re-execute the data load in units of RDAREAs (pdload command)

Re-execute the data load in units of RDAREAs using the pdload command. Here, we recommend re-using the pdload command control statement generated by the pdparaload command. The storage directory and file name of the pdload command control statement file generated by the pdparaload command are as follows. Create the pdload control statement file based on these.

The following is an example of the pdload command control statements and command lines when re-executing a data load in units of RDAREAs.

Control statement example

option divermsg=off
source "RDUSER02" fes01:c:\users\data\input_file error=\users\rep\error_file_RDUSER02_101
lobdata c:\users\data\lob
idxwork bes02 c:\users\work
sort bes02 c:\users\work
report file=c:\users\rep\result_file_RDUSER02_101

Description
  • Specifying divermsg=off in the option statement suppresses output of error data information.
  • Specify the name of the RDAREA that is the target of the source statement for data loading in units of RDAREAs.
Command line example

pdload -d -b -k f -i c -l p "TBL2" "c:\users\tmp\LOD_CTL_USER02_ TBL2_101"