Hitachi

Hitachi Advanced Database SQL Reference


4.1.1 Specification format and rules for COPY statements

Output the table search results as data in PARQUET format or CSV format to the following external storage.

Organization of this subsection

(1) Specification format

COPY-statement ::= COPY (query-expression-body [ORDER-BY-clause] [LIMIT-clause])
               TO output-location-string
               [SERVER foreign-server-name]
               [OPTIONS (COPY-option [,COPY-option]...)]

  output-location-string ::= character-string-literal

  COPY-option ::= {output-data-format-specification
                  |output-data-format-option-specification
                  |output-data-compression-format-specification}
    output-data-format-specification ::= FORMAT {PARQUET|CSV}
 
    output-data-format-option-specification ::= {delimiter-character-specification|header-specification}
      delimiter-character-specification ::= DELIMITER_CHAR delimiting-character
      header-specification ::= HEADER_SPEC {YES|NO}
        delimiting-character ::= character-string-literal
 
    output-data-compression-format-specification ::= COMPRESSION {ZSTD|SNAPPY|GZIP|LZ4|BROTLI|NONE}

(2) Explanation of specification format

query-expression-body [ORDER-BY-clause] [LIMIT-clause]
query-expression-body:

Specifies a query expression body to be used to retrieve the data to be output to the file. For details about the query expression body, see (b) query-expression-body in (2) Explanation of specification format in 7.1.1 Specification format and rules for query expressions.

ORDER-BY-clause:

Specify if you want to sort the search results of the query expression body in ascending or descending order. If the ORDER BY clause is omitted, the order of the search results of the query expression body is not guaranteed. For details about the ORDER BY clause, see (2) Explanation of specification format in 4.5.1 Specification format and rules for the SELECT statement.

LIMIT-clause:

Specifies the maximum number of rows that will be retrieved from the results of the query expression body. For details about the LIMIT clause, see 7.9 LIMIT clause.

TO output-location-string

Specifies the output destination of the table search result data and the prefix of the output file name in the form of a character string literal. For details about character string literals, see 6.3 Literals. The following is an example:

  • When outputting data to object storage

    Specify the output destination of the table search result data in S3 URI format.

    Example 1: When specifying the output directory and the prefix of the file name
    TO 's3://data-bucket/ftbl/T1/res_sql'

    In this example, res_sql is specified as the prefix of the file name. In this case, files such as the following are output under thes3://data-bucket/ftbl/T1/ directory.

    • res_sql-1234-1-1-1.parquet

    • res_sql-1234-1-1-2.parquet

    Example 2: When specifying only the output directory
    TO 's3://data-bucket/ftbl/T1/'

    In this example, no prefix is specified for the file name. In this case, files such as the following are output under the s3://data-bucket/ftbl/T1/ directory.

    • adb-1234-1-1-1.parquet

    • adb-1234-1-1-2.parquet

    If the prefix of the file name is not specified as shown here, the beginning string of the file name will be adb.

  • When outputting data to file storage

    Specify the output destination of the table search result data in FILE URI format.

    Example 1: When specifying the output directory and the prefix of the file name
    TO 'file:///data/ftbl/T1/res_sql'

    In this example, res_sql is specified as the prefix of the file name. In this case, files such as the following are output under the/data/ftbl/T1/ directory.

    • res_sql-1234-1-1-1.parquet

    • res_sql-1234-1-1-2.parquet

    Example 2: When specifying only the output directory
    TO 'file:///data/ftbl/T1/'

    In this example, no prefix is specified for the file name. In this case, files such as the following are output under the /data/ftbl/T1/ directory.

    • adb-1234-1-1-1.parquet

    • adb-1234-1-1-2.parquet

    If the prefix of the file name is not specified as shown here, the beginning string of the file name will be adb.

Specification rules for the output location string
  • If the output destination of the table search result data is object storage, specify s3:// at the beginning of the output location string. If the output destination of the table search result data is file storage, specify file://.

  • If a / is specified at the end of the output location string, it is considered that no prefix for the file name is specified. In that case, adb is assumed as the prefix of the file name.

  • The output location string must not exceed 1,024 bytes.

  • When outputting data to object storage, if there is no bucket name specified in the output location string, the COPY statement will result in an error.

Naming rules for files to output data

The names of files to output data are determined according to the following conventions.

  • When outputting data in PARQUET format

    prefix-seq1-seq2-seq3-seq4.parquet
  • When outputting data in CSV format

    prefix-seq1-seq2-seq3-seq4.csv

Explanation:

  • prefix

    The prefix of the file name specified in the output location string. If no prefix is specified in the output location string, the prefix adb is assumed.

  • seq1

    The transaction ID of the transaction that executed the COPY statement (a zero-suppressed number with a maximum of 20 digits).

  • seq2

    The SQL statement sequence number in the connection that executed the COPY statement (a zero-suppressed number with a maximum of 20 digits).

  • seq3

    The real thread number of the real thread that output the file (a sequence number that starts from 1 and is assigned for each SQL statement, a zero-suppressed number with a maximum of 4 digits).

  • seq4

    The output sequence number of the file output within the real thread (a sequence number starting from 1, a zero-suppressed number with a maximum of 11 digits).

SERVER foreign-server-name

Specify this option when outputting data to object storage. Do not specify this option when outputting data to file storage.

Specifies the name of the foreign server to which the file will be output. You cannot specify a foreign server name that does not exist.

If the foreign server cannot be accessed with the authentication information specified when defining the foreign server, the COPY statement will result in an error.

OPTIONS (COPY-option [,COPY-option])
COPY-option ::= {output-data-format-specification
                |output-data-format-option-specification
                |output-data-compression-format-specification}

The COPY option specifies the format of the output data, etc.

The same type of COPY option cannot be specified, such as specifying the output data format specification twice.

output-data-format-specification:
output-data-format-specification ::= FORMAT {PARQUET|CSV}

Specifies whether data should be output in PARQUET or CSV format.

  • PARQUET

    Output the data in PARQUET format.

    If the output data format specification is omitted, data in PARQUET format is output.

  • CSV

    Output the data in CSV format.

output-data-format-option-specification:
output-data-format-option-specification ::= {delimiter-character-specification|header-specification}
    delimiter-character-specification ::= DELIMITER_CHAR delimiting-character
    header-specification ::= HEADER_SPEC {YES|NO}

The output data format option specification allows you to specify the delimiting characters and whether to output headers.

This option cannot be specified when outputting data in PARQUET format.

  • delimiter-character-specification

    delimiter-character-specification ::= DELIMITER_CHAR delimiting-character

    Specifies the delimiting character that separates each piece of data in CSV format in the form of a character string literal. For details about character string literals, see 6.3 Literals. The following is an example of specifying ":" as a delimiting character.

    DELIMITER_CHAR ':'

    The following specification rules apply:

    • Specify a 1-byte character string as the delimiting character.

    • If the delimiting character specification is omitted, a comma (,) is assumed as the delimiter character specification.

    • To define a single quotation mark (') as a delimiting character, specify two single quotation marks as the delimiting character.

  • header-specification

    header-specification ::= HEADER_SPEC {YES|NO}

    Specifies whether or not to output a header on the first row of the CSV format output data.

    YES: Outputs a header. If the header specification is omitted, YES is assumed.

    NO: A header is not output.

    The header outputs the retrieval item column name for each column derived from the result of the query expression body, separated by a delimiting character. For details about retrieval item column names, see (4) Rules in 4.5.1 Specification format and rules for the SELECT statement.

output-data-compression-format-specification:
output-data-compression-format-specification ::= COMPRESSION {ZSTD|SNAPPY|GZIP|LZ4|BROTLI|NONE}

Specifies the compression format of the file in which the data is output.

This option cannot be specified when outputting data in CSV format.

  • ZSTD

    Compresses the file in ZSTD format. If the output data compression format specification is omitted, the file is compressed in ZSTD format.

  • SNAPPY

    Compresses the file in SNAPPY format.

  • GZIP

    Compresses the file in GZIP format.

  • LZ4

    Compresses the file in LZ4 format.

  • BROTLI

    Compresses the file in BROTLI format.

  • NONE

    Does not compress the file.

(3) Privileges required at execution

To execute the COPY statement, all of the following privileges are required:

(4) Rules

  1. A COPY statement will result in an error if the column derived from the result of the query expression body is of any of the following data types.

    • When outputting PARQUET format data

      TIME(0), TIME(12), TIMESTAMP(0) WITHOUT TIME ZONE, TIMESTAMP(12) WITHOUT TIME ZONE, TIMESTAMP(0) WITH TIME ZONE, TIMESTAMP(12) WITH TIME ZONE

      If the data type of a column derived from the result of the query expression body is ARRAY, and its element data type is one of the above, the COPY statement will return an error.

      In addition, if the data type of a column derived from the result of the query expression body is STRUCT, and any of its field data types are among those listed above, the COPY statement will return an error.

    • When outputting CSV format data

      TIME(12), TIMESTAMP(12) WITHOUT TIME ZONE, TIMESTAMP(12) WITH TIME ZONE, UUID, ARRAY, STRUCT

  2. If all of the following conditions are met, the COPY statement will result in an error.

    • Columns derived from the result of a query expression body have CHARACTER, VARCHAR, STRING, ARRAY#, or STRUCT# type columns

    • The data in the above columns contain characters that cannot be expressed in UTF8

    #:

    If the element data type is CHARACTER, VARCHAR, or STRING, and the data of that element data type contains characters that cannot be expressed in UTF8, the COPY statement will result in an error.

    In addition, if the field data type is CHARACTER, VARCHAR, or STRING, and the data of that field data type contains characters that cannot be expressed in UTF8, the COPY statement will result in an error.

  3. When outputting data in CSV format, the COPY statement will result in an error if all of the following conditions are met.

    • Columns derived from the result of the query expression body have BINARY, VARBINARY, ARRAY#, or STRUCT# type columns

    • The data in the above columns contain binary values that cannot be expressed in UTF8 as characters

    #:

    If the element data type is BINARY or VARBINARY, and the data of that element data type contains binary values that cannot be expressed as characters in UTF8, the COPY statement will result in an error.

    In addition, if the field data type is BINARY or VARBINARY, and the data of that field data type contains binary values that cannot be expressed as characters in UTF8, the COPY statement will result in an error.

  4. If all of the following conditions are met, the COPY statement will result in an error.

    • There is a TIMESTAMP(9) WITHOUT TIME ZONE, TIMESTAMP(9) WITH TIME ZONE, ARRAY#, or STRUCT# column in the columns derived from the result of the query expression body

    • The above column contains data outside the range 1677-09-21 00:12:43.145224192 to 2262-04-11 23:47:16.854775807

    #:

    If the element data type is TIMESTAMP(9) WITHOUT TIME ZONE or TIMESTAMP(9) WITH TIME ZONE, and the data of that element data type contains data outside the range of 1677-09-21 00:12:43.145224192 to 2262-04-11 23:47:16.854775807, the COPY statement will result in an error.

    In addition, if the field data type is TIMESTAMP(9) WITHOUT TIME ZONE or TIMESTAMP(9) WITH TIME ZONE, and the data of that field data type contains data outside the range of 1677-09-21 00:12:43.145224192 to 2262-04-11 23:47:16.854775807, the COPY statement will result in an error.

  5. The total number of tables, derived tables, table function derived tables, and collection derived tables specified in the COPY statement cannot exceed 2,048. For details, see (4) Rules in 4.5.1 Specification format and rules for the SELECT statement.

  6. If the set operations specified in the COPY statement are all UNION, a maximum of 1,023 set operations can be specified. However, if the specified set operations include EXCEPT or INTERSECT, no more than 63 set operations can be specified.

  7. A maximum of 63 FULL OUTER JOIN (or FULL JOIN) can be specified in the COPY statement.

  8. The name of each column output by the COPY statement will be the same as the derived column name of the query expression body. If the column name is not set in the derived column name, the column name will be EXPnnnn_NO_NAME (nnnn is an unsigned integer from 0001 to 4000). For details about derived column names see 6.9 Derived column names.

    Note that if the column derived from the query expression body result is of STRUCT type, the name of each field in the data output by the COPY statement will be the same as the field name defined in the STRUCT type column.

  9. When outputting data in PARQUET format, the data type of each column of output data is determined based on the data type of the HADB of each column derived from the result of the query expression body of the COPY statement. The following table shows the correspondence between the data type of HADB and the data type of PARQUET format data.

    Table 4‒1: Correspondence between HADB data types and PARQUET format data types

    HADB data type

    Data type of data in PARQUET format

    SMALLINT#1

    INT(bitWidth=16, isSigned=true)

    INTEGER#2

    INT(bitWidth=32, isSigned=true)

    BIGINT

    INT(bitWidth=64, isSigned=true)

    DECIMAL(m,n)#3

    DECIMAL(precision=m,scale=n)

    NUMERIC(m,n)#3

    REAL

    FLOAT

    DOUBLE PRECISION

    DOUBLE

    FLOAT

    CHARACTER

    STRING

    VARCHAR

    STRING

    BINARY

    BYTE_ARRAY

    VARBINARY

    DATE

    DATE

    TIME(3)

    TIME(timeUnit=milliseconds)

    TIME(6)

    TIME(timeUnit=microseconds)

    TIME(9)

    TIME(timeUnit=nanoseconds)

    TIMESTAMP(3) WITHOUT TIME ZONE

    TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=false)

    TIMESTAMP(6) WITHOUT TIME ZONE

    TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=false)

    TIMESTAMP(9) WITHOUT TIME ZONE

    TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=false)

    TIMESTAMP(3) WITH TIME ZONE

    TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=true)

    TIMESTAMP(6) WITH TIME ZONE

    TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=true)

    TIMESTAMP(9) WITH TIME ZONE

    TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=true)

    BOOLEAN

    BOOLEAN

    UUID

    UUID

    ARRAY

    LIST

    STRUCT

    STRUCT

    #1

    If the integer data type format is in a legacy format, SMALLINT corresponds to INT(bitWidth=32, isSigned=true).

    #2

    If the integer data type format is in a legacy format, INTEGER corresponds to INT(bitWidth=64, isSigned=true).

    #3

    The precision is m and the scaling is n.

  10. When outputting data in CSV format, the data type of each column of output data is determined as follows.

    • The HADB data type of each column derived from the result of the query expression body of the COPY statement is converted to the Apache Arrow data type and the data type is determined on that basis.

    The following table shows the correspondence between HADB data types and Apache Arrow data types.

    Table 4‒2: Correspondence between HADB data types and Apache Arrow data types

    HADB data type

    Apache Arrow data types

    SMALLINT#1

    INT16

    INTEGER#2

    INT32

    BIGINT

    INT64

    DECIMAL(m,n)#3

    DECIMAL128(m,n)

    NUMERIC(m,n)#3

    REAL

    FLOAT64

    DOUBLE PRECISION

    FLOAT

    CHARACTER

    STRING

    VARCHAR

    STRING

    DATE

    DATE32

    TIME(0)

    TIME32(timeUnit=seconds)

    TIME(3)

    TIME32(timeUnit=milliseconds)

    TIME(6)

    TIME64(timeUnit=microseconds)

    TIME(9)

    TIME64(timeUnit=nanoseconds)

    TIMESTAMP(0) WITHOUT TIME ZONE

    TIMESTAMP(timeUnit=seconds, isAdjustedToUTC=false)

    TIMESTAMP(3) WITHOUT TIME ZONE

    TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=false)

    TIMESTAMP(6) WITHOUT TIME ZONE

    TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=false)

    TIMESTAMP(9) WITHOUT TIME ZONE

    TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=false)

    TIMESTAMP(0) WITH TIME ZONE

    TIMESTAMP(timeUnit=seconds, isAdjustedToUTC=true)

    TIMESTAMP(3) WITH TIME ZONE

    TIMESTAMP(timeUnit=milliseconds, isAdjustedToUTC=true)

    TIMESTAMP(6) WITH TIME ZONE

    TIMESTAMP(timeUnit=microseconds, isAdjustedToUTC=true)

    TIMESTAMP(9) WITH TIME ZONE

    TIMESTAMP(timeUnit=nanoseconds, isAdjustedToUTC=true)

    BINARY

    BINARY

    VARBINARY

    BOOLEAN

    BOOL

    #1

    If the integer data type format is in a legacy format, SMALLINT corresponds to INT32.

    #2

    If the integer data type format is in a legacy format, INTEGER corresponds to INT64.

    #3

    The precision is m and the scaling is n.

(5) Examples

Example 1

Output the data in columns C1 and C2 of table T1 to the s3://data/ directory on the object storage of the foreign server SERVER1. Output the data in PARQUET format.

COPY (SELECT "C1","C2" FROM "T1")
    TO 's3://data/'
    SERVER "SERVER1"
    OPTIONS (FORMAT PARQUET)

Example of the name of the output file:

  • adb-1234-1-1-1.parquet

  • adb-1234-1-1-2.parquet

  • adb-1234-1-2-1.parquet

  • adb-1234-1-3-1.parquet

Example 2

Output the data of columns C1 and C2 of table T1 to the /data/ directory on the file storage. Output the data in PARQUET format. Also, specify res_sql as the prefix of the file.

COPY (SELECT "C1","C2" FROM "T1")
    TO 'file:///data/res_sql'
    OPTIONS (FORMAT PARQUET)

Example of the name of the output file:

  • res_sql-1234-1-1-1.parquet

  • res_sql-1234-1-1-2.parquet

  • res_sql-1234-1-2-1.parquet

  • res_sql-1234-1-3-1.parquet

(6) Notes