4.1.1 Specification format and rules for COPY statements
Output the table search results as data in PARQUET format or CSV format to the following external storage.
-
Object storage on the foreign server
-
File storage
- Organization of this subsection
(1) Specification format
COPY-statement ::= COPY (query-expression-body [ORDER-BY-clause] [LIMIT-clause])
TO output-location-string
[SERVER foreign-server-name]
[OPTIONS (COPY-option [,COPY-option]...)]
output-location-string ::= character-string-literal
COPY-option ::= {output-data-format-specification
|output-data-format-option-specification
|output-data-compression-format-specification}
output-data-format-specification ::= FORMAT {PARQUET|CSV}
output-data-format-option-specification ::= {delimiter-character-specification|header-specification}
delimiter-character-specification ::= DELIMITER_CHAR delimiting-character
header-specification ::= HEADER_SPEC {YES|NO}
delimiting-character ::= character-string-literal
output-data-compression-format-specification ::= COMPRESSION {ZSTD|SNAPPY|GZIP|LZ4|BROTLI|NONE}
(2) Explanation of specification format
- ● query-expression-body [ORDER-BY-clause] [LIMIT-clause]
-
- query-expression-body:
-
Specifies a query expression body to be used to retrieve the data to be output to the file. For details about the query expression body, see (b) query-expression-body in (2) Explanation of specification format in 7.1.1 Specification format and rules for query expressions.
- ORDER-BY-clause:
-
Specify if you want to sort the search results of the query expression body in ascending or descending order. If the ORDER BY clause is omitted, the order of the search results of the query expression body is not guaranteed. For details about the ORDER BY clause, see (2) Explanation of specification format in 4.5.1 Specification format and rules for the SELECT statement.
- LIMIT-clause:
-
Specifies the maximum number of rows that will be retrieved from the results of the query expression body. For details about the LIMIT clause, see 7.9 LIMIT clause.
- ● TO output-location-string
-
Specifies the output destination of the table search result data and the prefix of the output file name in the form of a character string literal. For details about character string literals, see 6.3 Literals. The following is an example:
-
When outputting data to object storage
Specify the output destination of the table search result data in S3 URI format.
- Example 1: When specifying the output directory and the prefix of the file name
-
TO 's3://data-bucket/ftbl/T1/res_sql'
In this example, res_sql is specified as the prefix of the file name. In this case, files such as the following are output under thes3://data-bucket/ftbl/T1/ directory.
-
res_sql-1234-1-1-1.parquet
-
res_sql-1234-1-1-2.parquet
-
- Example 2: When specifying only the output directory
-
TO 's3://data-bucket/ftbl/T1/'
In this example, no prefix is specified for the file name. In this case, files such as the following are output under the s3://data-bucket/ftbl/T1/ directory.
-
adb-1234-1-1-1.parquet
-
adb-1234-1-1-2.parquet
If the prefix of the file name is not specified as shown here, the beginning string of the file name will be adb.
-
-
When outputting data to file storage
Specify the output destination of the table search result data in FILE URI format.
- Example 1: When specifying the output directory and the prefix of the file name
-
TO 'file:///data/ftbl/T1/res_sql'
In this example, res_sql is specified as the prefix of the file name. In this case, files such as the following are output under the/data/ftbl/T1/ directory.
-
res_sql-1234-1-1-1.parquet
-
res_sql-1234-1-1-2.parquet
-
- Example 2: When specifying only the output directory
-
TO 'file:///data/ftbl/T1/'
In this example, no prefix is specified for the file name. In this case, files such as the following are output under the /data/ftbl/T1/ directory.
-
adb-1234-1-1-1.parquet
-
adb-1234-1-1-2.parquet
If the prefix of the file name is not specified as shown here, the beginning string of the file name will be adb.
-
- Specification rules for the output location string
-
-
If the output destination of the table search result data is object storage, specify s3:// at the beginning of the output location string. If the output destination of the table search result data is file storage, specify file://.
-
If a / is specified at the end of the output location string, it is considered that no prefix for the file name is specified. In that case, adb is assumed as the prefix of the file name.
-
The output location string must not exceed 1,024 bytes.
-
When outputting data to object storage, if there is no bucket name specified in the output location string, the COPY statement will result in an error.
-
- Naming rules for files to output data
-
The names of files to output data are determined according to the following conventions.
-
When outputting data in PARQUET format
prefix-seq1-seq2-seq3-seq4.parquet
-
When outputting data in CSV format
prefix-seq1-seq2-seq3-seq4.csv
Explanation:
-
prefix
The prefix of the file name specified in the output location string. If no prefix is specified in the output location string, the prefix adb is assumed.
-
seq1
The transaction ID of the transaction that executed the COPY statement (a zero-suppressed number with a maximum of 20 digits).
-
seq2
The SQL statement sequence number in the connection that executed the COPY statement (a zero-suppressed number with a maximum of 20 digits).
-
seq3
The real thread number of the real thread that output the file (a sequence number that starts from 1 and is assigned for each SQL statement, a zero-suppressed number with a maximum of 4 digits).
-
seq4
The output sequence number of the file output within the real thread (a sequence number starting from 1, a zero-suppressed number with a maximum of 11 digits).
-
-
- ● SERVER foreign-server-name
-
Specify this option when outputting data to object storage. Do not specify this option when outputting data to file storage.
Specifies the name of the foreign server to which the file will be output. You cannot specify a foreign server name that does not exist.
If the foreign server cannot be accessed with the authentication information specified when defining the foreign server, the COPY statement will result in an error.
- ● OPTIONS (COPY-option [,COPY-option]…)
-
COPY-option ::= {output-data-format-specification |output-data-format-option-specification |output-data-compression-format-specification}The COPY option specifies the format of the output data, etc.
The same type of COPY option cannot be specified, such as specifying the output data format specification twice.
- output-data-format-specification:
-
output-data-format-specification ::= FORMAT {PARQUET|CSV}Specifies whether data should be output in PARQUET or CSV format.
-
PARQUET
Output the data in PARQUET format.
If the output data format specification is omitted, data in PARQUET format is output.
-
CSV
Output the data in CSV format.
-
- output-data-format-option-specification:
-
output-data-format-option-specification ::= {delimiter-character-specification|header-specification} delimiter-character-specification ::= DELIMITER_CHAR delimiting-character header-specification ::= HEADER_SPEC {YES|NO}The output data format option specification allows you to specify the delimiting characters and whether to output headers.
This option cannot be specified when outputting data in PARQUET format.
-
delimiter-character-specification
delimiter-character-specification ::= DELIMITER_CHAR delimiting-character
Specifies the delimiting character that separates each piece of data in CSV format in the form of a character string literal. For details about character string literals, see 6.3 Literals. The following is an example of specifying ":" as a delimiting character.
DELIMITER_CHAR ':'
The following specification rules apply:
-
Specify a 1-byte character string as the delimiting character.
-
If the delimiting character specification is omitted, a comma (,) is assumed as the delimiter character specification.
-
To define a single quotation mark (') as a delimiting character, specify two single quotation marks as the delimiting character.
-
-
header-specification
header-specification ::= HEADER_SPEC {YES|NO}Specifies whether or not to output a header on the first row of the CSV format output data.
YES: Outputs a header. If the header specification is omitted, YES is assumed.
NO: A header is not output.
The header outputs the retrieval item column name for each column derived from the result of the query expression body, separated by a delimiting character. For details about retrieval item column names, see (4) Rules in 4.5.1 Specification format and rules for the SELECT statement.
-
- output-data-compression-format-specification:
-
output-data-compression-format-specification ::= COMPRESSION {ZSTD|SNAPPY|GZIP|LZ4|BROTLI|NONE}Specifies the compression format of the file in which the data is output.
This option cannot be specified when outputting data in CSV format.
-
ZSTD
Compresses the file in ZSTD format. If the output data compression format specification is omitted, the file is compressed in ZSTD format.
-
SNAPPY
Compresses the file in SNAPPY format.
-
GZIP
Compresses the file in GZIP format.
-
LZ4
Compresses the file in LZ4 format.
-
BROTLI
Compresses the file in BROTLI format.
-
NONE
Does not compress the file.
-
(3) Privileges required at execution
To execute the COPY statement, all of the following privileges are required:
-
CONNECT privilege
-
COPY privilege for all tables that are to be specified in the query expression body and ORDER BY clause
(4) Rules
-
A COPY statement will result in an error if the column derived from the result of the query expression body is of any of the following data types.
-
When outputting PARQUET format data
TIME(0), TIME(12), TIMESTAMP(0) WITHOUT TIME ZONE, TIMESTAMP(12) WITHOUT TIME ZONE, TIMESTAMP(0) WITH TIME ZONE, TIMESTAMP(12) WITH TIME ZONE
If the data type of a column derived from the result of the query expression body is ARRAY, and its element data type is one of the above, the COPY statement will return an error.
In addition, if the data type of a column derived from the result of the query expression body is STRUCT, and any of its field data types are among those listed above, the COPY statement will return an error.
-
When outputting CSV format data
TIME(12), TIMESTAMP(12) WITHOUT TIME ZONE, TIMESTAMP(12) WITH TIME ZONE, UUID, ARRAY, STRUCT
-
-
If all of the following conditions are met, the COPY statement will result in an error.
-
Columns derived from the result of a query expression body have CHARACTER, VARCHAR, STRING, ARRAY#, or STRUCT# type columns
-
The data in the above columns contain characters that cannot be expressed in UTF8
- #:
-
If the element data type is CHARACTER, VARCHAR, or STRING, and the data of that element data type contains characters that cannot be expressed in UTF8, the COPY statement will result in an error.
In addition, if the field data type is CHARACTER, VARCHAR, or STRING, and the data of that field data type contains characters that cannot be expressed in UTF8, the COPY statement will result in an error.
-
-
When outputting data in CSV format, the COPY statement will result in an error if all of the following conditions are met.
-
Columns derived from the result of the query expression body have BINARY, VARBINARY, ARRAY#, or STRUCT# type columns
-
The data in the above columns contain binary values that cannot be expressed in UTF8 as characters
- #:
-
If the element data type is BINARY or VARBINARY, and the data of that element data type contains binary values that cannot be expressed as characters in UTF8, the COPY statement will result in an error.
-
In addition, if the field data type is BINARY or VARBINARY, and the data of that field data type contains binary values that cannot be expressed as characters in UTF8, the COPY statement will result in an error.
-
-
If all of the following conditions are met, the COPY statement will result in an error.
-
There is a TIMESTAMP(9) WITHOUT TIME ZONE, TIMESTAMP(9) WITH TIME ZONE, ARRAY#, or STRUCT# column in the columns derived from the result of the query expression body
-
The above column contains data outside the range 1677-09-21 00:12:43.145224192 to 2262-04-11 23:47:16.854775807
- #:
-
If the element data type is TIMESTAMP(9) WITHOUT TIME ZONE or TIMESTAMP(9) WITH TIME ZONE, and the data of that element data type contains data outside the range of 1677-09-21 00:12:43.145224192 to 2262-04-11 23:47:16.854775807, the COPY statement will result in an error.
In addition, if the field data type is TIMESTAMP(9) WITHOUT TIME ZONE or TIMESTAMP(9) WITH TIME ZONE, and the data of that field data type contains data outside the range of 1677-09-21 00:12:43.145224192 to 2262-04-11 23:47:16.854775807, the COPY statement will result in an error.
-
-
The total number of tables, derived tables, table function derived tables, and collection derived tables specified in the COPY statement cannot exceed 2,048. For details, see (4) Rules in 4.5.1 Specification format and rules for the SELECT statement.
-
If the set operations specified in the COPY statement are all UNION, a maximum of 1,023 set operations can be specified. However, if the specified set operations include EXCEPT or INTERSECT, no more than 63 set operations can be specified.
-
A maximum of 63 FULL OUTER JOIN (or FULL JOIN) can be specified in the COPY statement.
-
The name of each column output by the COPY statement will be the same as the derived column name of the query expression body. If the column name is not set in the derived column name, the column name will be EXPnnnn_NO_NAME (nnnn is an unsigned integer from 0001 to 4000). For details about derived column names see 6.9 Derived column names.
Note that if the column derived from the query expression body result is of STRUCT type, the name of each field in the data output by the COPY statement will be the same as the field name defined in the STRUCT type column.
-
When outputting data in PARQUET format, the data type of each column of output data is determined based on the data type of the HADB of each column derived from the result of the query expression body of the COPY statement. The following table shows the correspondence between the data type of HADB and the data type of PARQUET format data.
Table 4‒1: Correspondence between HADB data types and PARQUET format data types HADB data type
Data type of data in PARQUET format
SMALLINT#1
INT(bitWidth=16, isSigned=true)
INTEGER#2
INT(bitWidth=32, isSigned=true)
BIGINT
INT(bitWidth=64, isSigned=true)
DECIMAL(m,n)#3
DECIMAL(precision=m,scale=n)
NUMERIC(m,n)#3
REAL
FLOAT
DOUBLE PRECISION
DOUBLE
FLOAT
CHARACTER
STRING
VARCHAR
STRING
BINARY
BYTE_ARRAY
VARBINARY
DATE
DATE
TIME(3)
TIME(timeUnit=milliseconds)
TIME(6)
TIME(timeUnit=microseconds)
TIME(9)
TIME(timeUnit=nanoseconds)
TIMESTAMP(3) WITHOUT TIME ZONE
TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=false)
TIMESTAMP(6) WITHOUT TIME ZONE
TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=false)
TIMESTAMP(9) WITHOUT TIME ZONE
TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=false)
TIMESTAMP(3) WITH TIME ZONE
TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=true)
TIMESTAMP(6) WITH TIME ZONE
TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=true)
TIMESTAMP(9) WITH TIME ZONE
TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=true)
BOOLEAN
BOOLEAN
UUID
UUID
ARRAY
LIST
STRUCT
STRUCT
- #1
-
If the integer data type format is in a legacy format, SMALLINT corresponds to INT(bitWidth=32, isSigned=true).
- #2
-
If the integer data type format is in a legacy format, INTEGER corresponds to INT(bitWidth=64, isSigned=true).
- #3
-
The precision is m and the scaling is n.
-
When outputting data in CSV format, the data type of each column of output data is determined as follows.
-
The HADB data type of each column derived from the result of the query expression body of the COPY statement is converted to the Apache Arrow data type and the data type is determined on that basis.
The following table shows the correspondence between HADB data types and Apache Arrow data types.
Table 4‒2: Correspondence between HADB data types and Apache Arrow data types HADB data type
Apache Arrow data types
SMALLINT#1
INT16
INTEGER#2
INT32
BIGINT
INT64
DECIMAL(m,n)#3
DECIMAL128(m,n)
NUMERIC(m,n)#3
REAL
FLOAT64
DOUBLE PRECISION
FLOAT
CHARACTER
STRING
VARCHAR
STRING
DATE
DATE32
TIME(0)
TIME32(timeUnit=seconds)
TIME(3)
TIME32(timeUnit=milliseconds)
TIME(6)
TIME64(timeUnit=microseconds)
TIME(9)
TIME64(timeUnit=nanoseconds)
TIMESTAMP(0) WITHOUT TIME ZONE
TIMESTAMP(timeUnit=seconds, isAdjustedToUTC=false)
TIMESTAMP(3) WITHOUT TIME ZONE
TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=false)
TIMESTAMP(6) WITHOUT TIME ZONE
TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=false)
TIMESTAMP(9) WITHOUT TIME ZONE
TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=false)
TIMESTAMP(0) WITH TIME ZONE
TIMESTAMP(timeUnit=seconds, isAdjustedToUTC=true)
TIMESTAMP(3) WITH TIME ZONE
TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=true)
TIMESTAMP(6) WITH TIME ZONE
TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=true)
TIMESTAMP(9) WITH TIME ZONE
TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=true)
BINARY
BINARY
VARBINARY
BOOLEAN
BOOL
- #1
-
If the integer data type format is in a legacy format, SMALLINT corresponds to INT32.
- #2
-
If the integer data type format is in a legacy format, INTEGER corresponds to INT64.
- #3
-
The precision is m and the scaling is n.
-
(5) Examples
- Example 1
-
Output the data in columns C1 and C2 of table T1 to the s3://data/ directory on the object storage of the foreign server SERVER1. Output the data in PARQUET format.
COPY (SELECT "C1","C2" FROM "T1") TO 's3://data/' SERVER "SERVER1" OPTIONS (FORMAT PARQUET)Example of the name of the output file:
-
adb-1234-1-1-1.parquet
-
adb-1234-1-1-2.parquet
-
adb-1234-1-2-1.parquet
-
adb-1234-1-3-1.parquet
-
- Example 2
-
Output the data of columns C1 and C2 of table T1 to the /data/ directory on the file storage. Output the data in PARQUET format. Also, specify res_sql as the prefix of the file.
COPY (SELECT "C1","C2" FROM "T1") TO 'file:///data/res_sql' OPTIONS (FORMAT PARQUET)Example of the name of the output file:
-
res_sql-1234-1-1-1.parquet
-
res_sql-1234-1-1-2.parquet
-
res_sql-1234-1-2-1.parquet
-
res_sql-1234-1-3-1.parquet
-
(6) Notes
-
The execution result of a COPY statement may be output to multiple files.
-
If an ORDER BY clause is specified in the COPY statement, the execution result of a COPY statement may be output to multiple files. If output to multiple files, the data will be sorted for each file.
- Example:
-
Assume the following COPY statement is executed. Column C1 contains sequential data of INTEGER type ranging from 1 to 3,000.
COPY (SELECT "C1","C2" FROM "T1" ORDER BY "C1") TO 's3://data/res_sql' SERVER "SERVER1" OPTIONS (FORMAT PARQUET)Assume that the execution results are output to the following three files.
-
res_sql-1234-1-7-1.parquet
This file contains sorted data for column C1, ranging from 1 to 1,000.
-
res_sql-1234-1-7-2.parquet
This file contains sorted data for column C1, ranging from 1,001 to 2,000.
-
res_sql-1234-1-7-3.parquet
This file contains sorted data for column C1, ranging from 2,001 to 3,000.
The underlined portion above represents a sequential number starting from 1. The data is stored sorted according to this sequential number.
-
-
If an ORDER BY clause is specified in the COPY statement, a work table might be created. If the size of the work table DB area where the work table will be created has not been estimated correctly, it might result in performance degradation. For details about estimating the size of the work table DB area, see the HADB Setup and Operation Guide. For details about work tables, see Considerations when executing an SQL statement that creates work tables in the HADB Application Development Guide.
-
If an error occurs when executing a COPY statement, the file to which the data is output remains created (is not deleted).