6.1.6 Loading data into a row-partitioned table (using the parallel loading facility)
The parallel loading facility executes, in parallel, data loads from a single input data file to multiple RDAREAs that constitute a row-partitioned table. Executing a pdparaload command allows data to be loaded to multiple RDAREAs at once.
The following figure shows an overview of the parallel loading facility.
Figure 6-3 Overview of the parallel loading facility
![[Figure]](figure/zu060025.gif)
- Description
- When the user executes a pdparaload command, pdload commands equal to the number of RDAREAs that constitute the row-partitioned table are executed automatically. At this time, the pdparaload command generates the pdload command control statement file.
- Each pdload command accesses the input data file, extracts the relevant data, and stores it in the row-partitioned table within the RDAREA.
- Organization of this subsection
- (1) Advantages
- (2) Preparing for command execution
- (3) Operating procedures
- (4) Limitations
- (5) Conditions that might affect performance
- (6) Operating examples
- (7) What to do when a command error occurs
(1) Advantages
Using the parallel loading facility offers the following two advantages.
- Data loads can take less time to process.
When pdparaload is executed, multiple pdload commands are executed in parallel. For this reason, data loading takes less processing time than loading data by table using a single pdload command.
- Operation is easy, since there is a single input data file and a single command.
Through the use of divided-input data files, you can also execute the pdload command in parallel and load data by RDAREA. However, this can become complicated because you need to divide the data file and execute the pdload command many times. When you use the parallel loading facility, you do not need to divide the input data file by RDAREA. Moreover, you only need to execute the command once.
The parallel loading facility thus offers the dual advantage of fast processing from data loads by RDAREA and easy data loading by table.
(2) Preparing for command execution
Do the following prior to executing parallel loading.
(a) Calculate the resources used
With parallel loading, multiple pdload commands execute in parallel. For example, when a table is divided into three RDAREAs for storage, three pdload commands are executed simultaneously, so three portions of resources are needed. Factor this in when you calculate the amount of resources that will be necessary.
(b) Prepare the input data
Prepare the data to be input to the table as an input data file. The pdparaload command uses a single input data file.
(3) Operating procedures
The figure below shows the operating procedures for parallel loading.
Figure 6-4 Parallel loading operating procedure
![[Figure]](figure/zu060028.gif)
- #1
- This step might not be necessary depending on the specification for the database's update log acquisition mode (the -l option).
- #2
- This step is required if n (index information output mode) or x (index information output suppression mode) was specified as the index creation method (the -i option). It is also necessary if c (batch index creation mode) was specified and if a non-partitioning key index has been defined.
(4) Limitations
The following limitations apply when data is loaded using the parallel loading facility:
- Data cannot be loaded in tables that define flexible hash partitioning.
- Data loads that specify a synchronization point cannot be performed.
- NOWAIT searches cannot be conducted on tables that are executing parallel loading.
- LOB column structure base tables and LOB data cannot be loaded separately.
- Input data files that use tape devices as the media cannot be used.
pdparaload command options and control statements can be specified in the same manner as pdload command options and control statements, with some exceptions. For details about what can be specified, see the section on pdparaload in the manual HiRDB Version 9 Command Reference.
(5) Conditions that might affect performance
The performance of the parallel loading facility, such as its allocation of the input data, might be affected by the environment in which data loading is executed. For this reason, in some execution environments data loading might not be any faster even when you use the parallel loading facility. Therefore, try using the parallel loading facility and measure how much time data loading takes before you start actual operations. If this trial shows that using the parallel loading facility will shorten data loading time, use the parallel loading facility. If the facility does not deliver the expected processing performance, load data in table units or divide the input data file and load data in RDAREA units. For details about creating a divided-input data file, see 6.1.7 Loading data into a row-partitioned table (Creating divided-input data files).
The conditions that affect the parallel loading facility's performance are described below.
(a) Row partitioning among servers
The parallel loading facility is advantageous when you are partitioning rows among servers. Parallel loading can also be executed on tables with row partitioning on a single server. But data loading is faster when it is divided among servers, since processing of a single pdload command to store data can monopolize the servers.
(b) Allocation of input data
During parallel loading, multiple pdload commands read data from a single input data file. Processing time can be shortened at this point by allocating data so as to increase read processing efficiency. As the following figure shows, allocating input data evenly across partitioning keys results in efficient read processing.
Figure 6-5 Allocation of input data vs. read efficiency
![[Figure]](figure/zu060026.gif)
If data from a table that uses row partitioning on one server is unloaded by a pdrorg command to a single file, the unloaded data file will demonstrate lopsided data allocation, as shown in the figure. For this reason, if these tables are used as input data files, performance might decline despite the use of parallel loading.
(c) Formats for input data
The parallel loading facility allows data loading in all the data formats that can be loaded using the pdload command. However, whenever you are working with large quantities of data, we recommend using the binary format for input data. The DAT and fixed-size data formats both require that the format be converted when the data is stored in a database. This might increase CPU usage and decrease the performance of parallel processing of data loading.
(d) LOB creation types
When LOB columns and abstract data type columns that have LOB parameters are defined in the table you are processing, we recommend specifying f as the LOB creation type (the -k option) for the data load. A file is prepared for each LOB data item as the LOB input file for data loads that specify f. For this reason, there is no duplicate reading of LOB data when data loads are executed in parallel. This reduces input and output.
When d is specified as the LOB creation type, all LOB data is stored in the input data file. This generates processing to skip reading of LOB data that is not a target of processing when data is loaded in RDAREA units. This increase in processing can weaken performance.
(6) Operating examples
This section provides operating examples that use parallel loading.
- In Operating example 1, LOB columns are defined.
- In Operating example 2, a non-partitioning key index is defined.
The conditions for these operating examples are as follows.
Table 6-2 Parallel loading operating example conditions
Operating example | Table definition | Index definition | Input data format | Index creation method | Log acquisition mode |
---|
Row partitioning | LOB column | Partition-ing key index | Non-partitioning key index |
---|
Operating example 1 | Y | Y | Y | N | Binary | Batch index creation mode | Pre-update log acquisition mode |
Operating example 2 | Y | N | Y | Y | Binary | Batch index creation mode | Pre-update log acquisition mode |
- Legend
- Y: Defined
- N: Not defined
(a) Operating example 1 (defining LOB columns)
Data is loaded to TBL2, which defines LOB columns, using the parallel loading facility. For the LOB creation type (-k option), f is specified and a file is created for each LOB data item.
Table and index definitions- Table definition
CREATE TABLE "TBL2"(
col001 int not null,
col002 varchar(20),
col003 blob(1M) in ((LOB01),(LOB02),(LOB03)),
col004 decimal(10,3)
) fix hash hashf by col001
in (RDUSER11,RDUSER21,RDUSER31); |
- Index definition
CREATE INDEX "INDX2" ON "TBL2"(col001,col002)
in ((RDUSER12),(RDUSER22),(RDUSER32)); |
Data load procedure- Use the pdhold command to shut down the RDAREA to which you plan to load data.
pdhold -r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32 |
- Create the pdparaload command control statement file.
source: Specify the server name, input data file and error information file. For HiRDB/Parallel Server, always specify the server name.
lobdata: Specify the LOB input file.
idxwork: Specifies the storage directory for the index information file.
sort: Specifies the storage directory of the work file for sorting.
report: Specifies the processing results file name.
source fes01:c:\users\data\input_file error=c:\users\rep\error_file
lobdata c:\users\data\lob
idxwork bes01 c:\users\work
idxwork bes02 c:\users\work
idxwork bes03 c:\users\work
sort bes01 c:\users\work
sort bes02 c:\users\work
sort bes03 c:\users\work
report file=c:\users\rep\result_file |
- Use the pdparaload command to load data.
-d: Load data in creation mode.
-b: Use a binary input data file.
-k: Specify the LOB creation type. Create a file for each LOB data item (f).
-i: Specify the batch index creation mode (c).
-l: Specify the pre-update log acquisition mode (p).
pdparaload -d -b -k f -i c -l p "TBL2" c:\users\cntl\control_file |
- Use the pdcopy command to make a backup of the RDAREA.
pdcopy -m c:\users\rdarea\mast\mast01 -M r -p c:\users\pdcopy\list\list01
-b c:\users\pdcopy\backup\backup01
-r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32 |
- Use the pdrels command to release the RDAREA from shutdown status.
pdrels -r LOB01,LOB02,LOB03,RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32 |
(b) Operating example 2 (defining a non-partitioning key index)
Load data to TBL3, which defines a non-partitioning key index, using the parallel loading facility. Since a non-partitioning key index is being defined, batch creation of the index using the pdrorg command is required after you execute the pdparaload command. For batch creation of the index, you specify the index information file output by the pdparaload command.
Table and index definitions- Table definition
CREATE TABLE "TBL3"(
col001 int not null,
col002 varchar(20),
col003 char(32),
col004 decimal(10,3)
) partitioned by col001
in ((RDUSER11) 10000000, (RDUSER21) 20000000, (RDUSER31)); |
- Index definition (partitioning key index)
CREATE INDEX "INDX3" ON "TBL3" (col001,col002)
in ((RDUSER12),(RDUSER22),(RDUSER32)); |
- Index definition (non-partitioning key index)
CREATE INDEX "INDX4" ON "TBL3" (col003)
in ((RDUSER13)); |
- Data load procedure
- Use the pdhold command to shut down the RDAREA to which you plan to load data.
pdhold -r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13 |
- Create the pdparaload command control statement file.
source: Specify the input data file and the error information file. For HiRDB/Parallel Server, always specify the server name.
idxwork: Specify the storage directory for the index information file. After data loading, execute batch creation of the index using the index information file created here.
sort: Specify the storage directory of the work file for sorting.
report: Specify the processing results file name.
source c:\users\data\input_file error=c:\users\rep\error_file
idxwork c:\users\work
sort c:\users\work
report file=c:\users\rep\result_file |
- Use the pdparaload command to load data.
-d: Load data in creation mode.
-b: Use a binary input data file.
-i: Specify the batch index creation mode (c).
-l: Specify the pre-update log acquisition mode (p).
pdparaload -d -b -i c -l p "TBL3" c:\users\cntl\control_file |
- Create the pdrorg command control statement file.
In the index statement, specify the index information file output by the pdparaload command.
index "INDX4" RDUSER13 "c:\users\work\INDEX-INDX4-RDUSER13-faaG4MnMf"
index "INDX4" RDUSER13 "c:\users\work\INDEX-INDX4-RDUSER13-caa34EnEc"
index "INDX4" RDUSER13 "c:\users\work\INDEX-INDX4-RDUSER13-caa06MnMc"
sort c:\users\work
report file=c:\users\rep\result_file2 |
- With the pdrorg command, execute batch creation of the index.
pdrorg -k ixmk -l p -t "TBL3" c:\users\cntl\control_rorg |
- Use the pdcopy command to make a backup of the RDAREA.
pdcopy -m c:\users\rdarea\mast\mast01 -M r -p c:\users\pdcopy\list\list01
-b c:\users\pdcopy\backup\backup01
-r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13 |
- Use the pdrels command to release the RDAREA from shutdown status.
pdrels -r RDUSER11,RDUSER21,RDUSER31,RDUSER12,RDUSER22,RDUSER32,RDUSER13 |
(7) What to do when a command error occurs
There are two basic types of errors that can occur when the pdparaload command is executed, and two corresponding ways to handle them, as follows.
- Errors when executing data loads in RDAREA units (pdload command)
In this case, data fails to load into some RDAREAs. After handling the error for those RDAREAs that experienced errors, re-execute the data load in RDAREA units with the pdload command.
- Errors before data loads in RDAREA units (pdload command) begin
In this case, data loads to all RDAREAs fail. After handling the error, re-execute the pdparaload command.
This section describes details about errors generated when the pdparaload command is executed and their handling. The handling procedure when an error occurs when the pdparaload command is executed is shown below.
Figure 6-6 Handling procedure when an error occurs during pdparaload command execution
![[Figure]](figure/zu060029.gif)
- Note:
- Sections (a) through (e) in the figure correspond to (a) through (e) in the figure.
(a) Identify error timing
Identify when the error occurred. Check whether the following two messages have been output.
- Both messages have been output
- An error occurred during execution of a data load in RDAREA units. This means data loads to some RDAREAs have failed. In this case, handle the error by starting with step (b) of the procedure. Thereafter, re-execute the data load in RDAREA units with the pdload command.
- One of the messages has not been output
- An error occurred before data loading in RDAREA units began. This means data loads to all RDAREAs have failed. In this case, handle as directed by the error message output directly prior to the KFPL00794-I message. Thereafter, re-execute the pdparaload command.
(b) Identify RDAREAs that require handling
Extract the KFPL00793-I message from the messages output to the command prompt that executed the pdparaload command or to the event log, and then identify the RDAREAs that require handling. When extracting the message, use the following as search keywords.
- Message ID (KFPL00793-I)
- Authorization identifier
- Table identifier
- Message extraction example
KFPL00793-I Pdload execution abnormal terminated, table=user1."T1", RDAREA=RDAREA1, return code=8, pdload process id=1385412 (1212478)
:
KFPL00793-I Pdload execution abnormal terminated, table=user1."T1", RDAREA=RDAREA2, return code=8, pdload process id=1385433 (1212478) |
- Description
- Extract the KFPL00793-I message based on the authorization identifier (user1) and the table identifier (T1). This can identify the RDAREAs (RDAREA1 and RDAREA2) where the data load failed.
Take one of the following corrective actions for the identified RDAREA, depending on the return code of the KFPL00793-I message.
- Return code is 4
- Check the error information file output by the pdload command and take corrective action for the error. The error information file is specified by the error option of the source statement of the pdparaload command (the RDAREA name is automatically appended to the file name). When nothing is specified, the following directory and file name are used to create the file.
- The error information file storage directory
1. The directory specified in the pd_tmp_directory operand
2. If the above directory is not specified, the directory specified in system environment variable TMP
3. If the above directory is not specified, the %PDDIR%\tmp directory
- The error information file name
The file name that includes the process ID and message ID (KFPL00793-I) identified in (c)
- After corrective action is taken, re-execute the data load in units of RDAREAs using the procedure of step (e).
- Return code is 8
- Identify the error as instructed in steps (c) and (d) and take corrective action.
- Then, re-execute the data load in units of RDAREAs using the procedure of step (e).
(c) Identify the IDs of the processes that were executed
From the message log file, identify the IDs of the pdload control process and server process that the pdparaload command executed. Extract the process IDs by the following procedure.
- Output the message log from pdparaload command execution start (KFPL00791-I message) to termination (KFPL00794-I message) to a file using the pdcat command.
- Extract the KFPL00711-I message, which indicates the start of the process, from the content output in step 1. To extract the message, use the following search keywords.
- Message ID (KFPL00711-I)
- Authorization identifier
- Table identifier
- RDAREA name identified in step (b)
- Message extraction example
1572976 2010/11/04 15:55:48 0mload1 lod KFPL00711-I pdloadm started, table=user1."T1", RDAREA=RDAREA1
1728690 2010/11/04 15:55:48 bes1 lod KFPL00711-I pdbes started, table=user1."T1", RDAREA=RDAREA1 |
- Description
- Extract the KFPL00711-I message based on the authorization identifier (user1), the table identifier (T1) and the RDAREA name (RDAREA1). This can identify the IDs of the processes (1572976, 1728690) that generated the error.
- In this case, these IDs identify the following processes.
- The process name of 1572976 is pdloadm, so it is the pdload control process.
- The process name of 1728690 is pdbes, so it is the back-end server process.
(d) Identify the error content
Identify the nature of the error from the message log file. Extract the error message that requires handling from the message log file acquired in step 1 of (c) using the process IDs identified in step (c) as search keywords.
- Message extraction example (pdload control process)
1572976 2010/11/04 15:55:48 0mload1 lod KFPL00711-I pdloadm started, table=user1."T1", RDAREA=RDAREA1
1572976 2010/11/04 15:55:48 0mload1 lod KFPL00704-I Pdload terminated, return code=8 |
- Message extraction example (back-end server process)
1728690 2010/11/04 15:55:48 bes1 lod KFPL00711-I pdbes started, table=user1."T1", RDAREA=RDAREA1
1728690 2010/11/04 15:55:48 bes1 lod KFPL00709-I Error information file was created, file=c:/tmp/ERROR-4cd258f41728690
1728690 2010/11/04 15:55:48 bes1 lod KFPL00709-I Lobmid file was created, file=c:/tmp/LOBMID-T1-4cd258f41728690
1728690 2010/11/04 15:55:48 bes1 lod KFPL00702-I Pdload started, table=user1."T1", generation=0
1728690 2010/11/04 15:55:48 bes1 lod KFPL00710-I Index information file assigned, index=user1."T1NX", RDAREA="USER01", file=c:/tmp/INDEX-T1NX-USER01-GN0-daan_ylid
1728690 2010/11/04 15:55:48 bes1 lod KFPL00723-I 0 rows loaded, table=user1."T1", RDAREA="USER01"
1728690 2010/11/04 15:55:48 bes1 lod KFPLxxxxx-E YYYYYY |
- Description
- Extract messages based on the process IDs (1572976, 1728690). Here, a back-end server process has output the error message (KFPLxxxxx-E). Take corrective action as indicated by the error message.
After corrective action is taken, re-execute the data load in units of RDAREAs according to the procedure of step (e).
(e) Re-execute the data load in units of RDAREAs (pdload command)
Re-execute the data load in units of RDAREAs using the pdload command. Here, we recommend re-using the pdload command control statement generated by the pdparaload command. The storage directory and file name of the pdload command control statement file generated by the pdparaload command are as follows. Create the pdload control statement file based on these.
- The pdload control statement storage directory
1. The directory specified in the pd_tmp_directory operand
2. If the above directory is not specified, the directory specified in system environment variable TMP
3. If the above directory is not specified, the %PDDIR%\tmp directory
- The pdload control statement file name
LOD_CTL_authorization-ID_table-ID_RDAREA-ID
- Note
- When re-executing the data load in units of RDAREAs, specify divermsg=off in the option statement of pdload.
- When loading data in RDAREA units, error information will be output when there is line data in the input data that does not match the RDAREA storage conditions, and the pdload command will terminate with a return code of 4. If divermsg=off is specified, output of this error data information is suppressed, and the pdload command can terminate with a return code of 0.
The following is an example of the pdload command control statements and command lines when re-executing a data load in units of RDAREAs.
- Control statement example
option divermsg=off
source "RDUSER02" fes01:c:\users\data\input_file error=\users\rep\error_file_RDUSER02_101
lobdata c:\users\data\lob
idxwork bes02 c:\users\work
sort bes02 c:\users\work
report file=c:\users\rep\result_file_RDUSER02_101 |
- Description
- Specifying divermsg=off in the option statement suppresses output of error data information.
- Specify the name of the RDAREA that is the target of the source statement for data loading in units of RDAREAs.
- Command line example
pdload -d -b -k f -i c -l p "TBL2" "c:\users\tmp\LOD_CTL_USER02_ TBL2_101" |