Hitachi

Hitachi Advanced Database SQL Reference


7.15.3 ADB_CSVREAD function

Converts the data in a CSV file into a table format such that the data can be retrieved by the HADB server.

Note
  • For an overview of retrieving data from CSV files, see Retrieving data from CSV files in the HADB Setup and Operation Guide.

  • For details about the operations for retrieving data from CSV files, see Handling of data retrieval from CSV files in the HADB Setup and Operation Guide.

Organization of this subsection

(1) Specification format

ADB_CSVREAD-function ::=
    [MASTER.]ADB_CSVREAD(CSV-file-path-name-specification,function-option-specification)
 
        CSV-file-path-name-specification ::= multiset-value-expression
 
        function-option-specification ::= 'function-option[;function-option]...[;]'
          function-option ::= {compression-format-option|specification-column-option
                              |binary-string-format-option|enclosing-character-specification-option
                              |delimiting-character-specification-option}

(2) Explanation of specification format

CSV-file-path-name-specification:

Specifies the path names of the CSV files containing the input data for the ADB_CSVREAD function. The path names are specified in the form of a multiset value expression. For details about multiset value expressions, see 7.16 Multiset value expression.

The following rules apply:

  • The data type of the result of the multiset value expression must be character string data.

  • Absolute paths must be specified for the path names of the CSV files specified in CSV-file-path-name-specification.

    This rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

  • Existing files must be specified in the path names of the CSV files specified in CSV-file-path-name-specification.

    This rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

function-option-specification:

Specifies one or more of the following options to the ADB_CSVREAD function:

  • The compression format option

  • The specification column option

  • The binary string format option

  • The enclosing character specification option

  • The delimiting character specification option

The details of each option are described in (3) Compression format option and the following subsections.

The following rules apply:

  • function-option-specification is specified in the form of a character string literal. For details about the description format of character string literals, see 6.3.2 Description format of literals.

  • The entire function-option-specification must be enclosed in single quotation marks (').

  • If multiple function options are specified, they must be separated by semicolons (;).

  • Function options can be specified in any order.

  • Function options must be unique.

  • The half-width lowercase letters specified in function options are treated as half-width uppercase letters. However, for enclosing characters and delimiting characters, half-width lowercase letters and half-width uppercase letters are distinguished.

  • A separator can be specified before and after each option and special character (,, -, :, ;, =, NL, CR, half-width space, and full-width space).

(3) Compression format option

The compression format option specifies the compression format of the CSV files. The compression format option cannot be omitted.

(a) Specification format

COMPRESSION_FORMAT= {GZIP | NONE}

(b) Explanation of specification format

GZIP:

Specify this keyword if the CSV file is compressed in GZIP format.

NONE:

Specify this keyword if the CSV file is not compressed.

(4) Specification column option

The specification column option specifies the field data numbers of field data in CSV files. Field data numbers are numbers representing the order of the field data in the CSV file. Field data number 1 (field 1 data) corresponds to data in the first field in the records, field data number 2 (field 2 data) corresponds to data in the second field, and so on.

Example:

[Figure]

The ADB_CSVREAD function extracts the field data corresponding to the field data number specified here.

(a) Specification format

FIELD_NUM=filed-data-number-specification[,filed-data-number-specification]...

(b) Explanation of specification format

filed-data-number-specification:

Specifies the field data numbers of the field data to be extracted.

To specify multiple field data numbers, separate them with commas (,). You can also specify a range of field data numbers such as 1-5.

Example:

FIELD_NUM=3            ...1
FIELD_NUM=1,3,4,6      ...2
FIELD_NUM=1,3-5,8-10   ...3
  1. The field data in field 3 will be extracted.

  2. The field data in fields 1, 3, 4, and 6 will be extracted.

  3. The field data in fields 1, 3 to 5, and 8 to 10 will be extracted.

(c) Rules

  1. The number of columns of field data to be extracted must be the same as the number of columns in the table function column list.

    Example:

    [Figure]

    For details about table function column lists, see (2) Explanation of specification format in 7.11.1 Specification format for table references.

  2. Specify field data numbers as unsigned integer literals from 1 to 30,000.

  3. Field data numbers must be unique and cannot overlap.

    Examples that generate errors:

    FIELD_NUM=1,2,2 ← Duplication of 2 causes an error.
    FIELD_NUM=1,1-3 ← Duplication of 1 causes an error.
  4. No more than 1,000 columns of field data can be targeted.

    Example that generates an error:

    FIELD_NUM=1-1001 ← Having too many (1,001) field data columns targeted causes an error.
  5. An error results if there is no field data corresponding to a field data number. For example, the code fragments below generate an error when there are five columns of field data.

    Examples that generate errors:

    FIELD_NUM=6
    FIELD_NUM=1-7

    Because there are only five columns of field data, you cannot specify field data number 6 or higher.

    Note that this rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

  6. If specification-column-option is omitted, the field data numbers from 1 to the number of columns in the table function column list are assumed. In cases where there is no field data corresponding to a field data number, the null value is stored in the table function derived table.

    Example:

    SELECT * FROM TABLE(ADB_CSVREAD(MULTISET['/dir/file.csv.gz'],
                                   'COMPRESSION_FORMAT = GZIP;'))
                    AS "T1" ("C1" INTEGER,"C2" INTEGER,"C3" INTEGER,
                             "C4" INTEGER,"C5" INTEGER)

    The underlined portion is the table function column list specification.

    [Figure]

    There are three columns of field data in the CSV file, but five columns in the table function column list. As a result, null values are stored in columns C4 and C5 of the table function derived table.

(5) Binary string format option

The binary string format option specifies the format of binary data (BINARY or VARBINARY) in the CSV file.

(a) Specification format

BINARY_STRING_FORMAT=filed-data-number-specification:binary-format-specification
                  [,filed-data-number-specification:binary-format-specification]...
 
  binary-format-specification ::= {HEX | BIN}

(b) Explanation of specification format

filed-data-number-specification:

Specifies the field data numbers of the binary data in the CSV file. For the field data number specification rules, see (4) Specification column option.

The field data numbers specified here must be among the field data numbers specified in the specification column option (FIELD_NUM).

Example:

FIELD_NUM=1-5;BINARY_STRING_FORMAT=1:BIN,4-5:HEX;

If the specification column option is omitted, specify integers that are less than or equal to the number of columns in the table function derived table.

binary-format-specification:

Specifies the format of the binary data.

HEX:

Specify this when the binary data is in hexadecimal format.

BIN:

Specify this when the binary data is in binary format.

The following is an example of specifying the binary string format option.

Example:
SELECT * FROM TABLE(ADB_CSVREAD(MULTISET['/dir/file.csv.gz'],
                               'COMPRESSION_FORMAT = GZIP;
                                BINARY_STRING_FORMAT=3:BIN,4:HEX;'))
                AS "T1" ("C1" INTEGER,"C2" INTEGER,
                         "C3" BINARY(1),"C4" BINARY(1),"C5" BINARY(1))

The underlined portion is the binary string format option specification.

[Figure]

Explanation

  • The field data in columns C3 to C5 is binary data.

  • Because the binary data in column C3 is in binary format, BIN is specified in the binary format specification.

  • Because the binary data in column C4 is in hexadecimal format, HEX is specified in the binary format specification.

  • Because the binary data in column C5 is in hexadecimal format, the binary format specification can be omitted (HEX is the default value).

(c) Rules

  1. When binary data columns are specified in the table function derived table and the binary string format option is omitted, the following specifications are assumed:

    • The field data numbers corresponding to the binary data in the table function derived table are assumed for the field data number specification

    • HEX is assumed for the binary format specification

  2. The data types of the columns of the table function derived table corresponding to the field data numbers must be binary (BINARY or VARBINARY).

(6) Enclosing character specification option

The enclosing character specification option specifies the enclosing character that is used to enclose field data items in the CSV file.

(a) Specification format

ENCLOSING_CHAR={enclosing-character | NONE}

(b) Explanation of specification format

enclosing-character:

Specifies the enclosing character that is to be used to enclose field data items in the CSV file. You can specify a single-byte character for enclosing-character.

Note the following points concerning the enclosing character:

  • Characters such as the following are not suitable as the enclosing character because they are likely to overlap with characters in the field data in the CSV file:

    Sign (+, -), forward slash (/), colon (:), period (.), |, \, [, ], (, ), {, }, ~

  • Do not specify the same character for enclosing-character as the character used for the separator. The separator character will not be recognized as an enclosing character (it is treated as the separator). As a result, if you specify the separator character as the enclosing character, there is a risk of unintended consequences as in the following example.

    Example where a single-byte space, which is the separator, is specified as the enclosing character (Δ represents a single-byte space)

    '...;ENCLOSING_CHAR=∆;'

    In this example, HADB assumes that the semicolon (;) is specified as the enclosing character.

NONE:

Specify NONE if no enclosing character is used in the field data in the CSV file.

Important

Do not specify NONE if there are newline characters or the same character as that specified for the delimiting character in the field data. Specifying NONE might produce unintended consequences.

  • If the field data contains a newline character, the characters to the newline character will be treated as a single line of data.

  • If the field data contains the same character as that specified for the delimiting character, it will be treated as a delimiting character, not field data.

(c) Rules

  1. If enclosing-character-specification-option is omitted, the double quotation mark (") is assumed as the enclosing character.

  2. The following characters are not permitted as the enclosing character:

    • Space, tab, asterisk (*), newline (0x0A), carriage return (0x0D)

    • The delimiting character specified in delimiting-character-specification-option

  3. To define a single quotation mark (') as an enclosing character, specify two single quotation marks. The specification is as follows:

    ENCLOSING_CHAR=''

(7) Delimiting character specification option

The delimiting character specification option specifies the delimiting character that is used to delimit field data items in the CSV file.

(a) Specification format

DELIMITER_CHAR={delimiting-character | TAB | SP}

(b) Explanation of specification format

delimiting-character:

Specifies the delimiting character that is to be used to delimit field data items in the CSV file. You can specify a single-byte character for delimiting-character.

Note the following points concerning the delimiting character:

  • Characters such as the following are not suitable as the delimiting character because they are likely to overlap with characters in the field data in the CSV file:

    Sign (+, -), forward slash (/), colon (:), period (.), |, \, [, ], (, ), {, }, ~

  • Do not specify the same character for delimiting-character as the character used for the separator. The separator character will not be recognized as a delimiting character (it is treated as the separator). As a result, if you specify the separator character as the delimiting character, there is a risk of unintended consequences as in the following example.

    Example where a single-byte space, which is the separator, is specified as the delimiting character (Δ represents a single-byte space)

    '...;DELIMITER_CHAR=∆;'

    In this example, HADB assumes that the semicolon (;) is specified as the delimiting character.

TAB:

Specify TAB when the field data in the CSV file is delimited by tabs.

SP:

Specify SP when the field data in the CSV file is delimited by spaces.

(c) Rules

  1. If delimiting-character-specification-option is omitted, the comma (,) is assumed as the delimiting character.

  2. The following characters are not permitted as the delimiting character:

    • Alphabetic characters (A to Z, a to z), digits (0 to 9), underscore (_), double quotation marks ("), space, tab, asterisk (*), newline (0x0A), carriage return (0x0D)

    • The enclosing character specified in enclosing-character-specification-option

  3. To define a single quotation mark (') as a delimiting character, specify two single quotation marks. The specification is as follows:

    DELIMITER_CHAR=''

(8) Rules

(a) Rules for the ADB_CSVREAD function

If the result of the multiset value expression specified in the CSV file path name specification is the empty set, the result of the table function derived table will be the empty set.

(b) Rules for CSV files

  1. Each CSV file must be one of the following types:

    • Files compressed in GZIP format by using the gzip command of the OS

    • Output data files exported in GZIP format by using the adbexport command

    • CSV files that are not compressed

    This rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

  2. The HADB administrator must have read privileges for the CSV files. Grant the HADB administrator read and execute privileges to the directories where the CSV files are stored.

    This rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

  3. Leading or trailing spaces around CSV file path names are removed before the files are processed.

    Examples:

    '[Figure]/dir/file.csv.gz''/dir/file.csv.gz'

    '/dir/file.csv.gz[Figure]''/dir/file.csv.gz'

    '[Figure]/dir/file.csv.gz [Figure]''/dir/file.csv.gz'

    '[Figure]/dir/fiΔ le.csv.gz[Figure]''/dir/fiΔle.csv.gz'

    Δ: Space

    Important

    Do not specify a CSV file path name that begins or ends with spaces. If you specify a path name that begins or ends with spaces, the spaces are removed. Therefore, the specified path name might be changed to an unintended path name.

  4. The length of the path name of the CSV file must not exceed 510 bytes, excluding leading and trailing spaces around the path name.

    This rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

(c) Rules for CSV file formats

  1. Each line of the CSV file corresponds to one row of the table function derived table. Lines are terminated with the newline character X'0A' (LF), X'0D0A' (CRLF), or X'00'.

  2. Specify the delimiting character to delimit field data items.

  3. A character string surrounded by the enclosing character is treated as field data.

  4. The data in the CSV file must use the character encoding specified in the environment variable ADBLANG.

  5. Do not specify the EOF control character in the CSV file.

  6. When specifying an enclosing character, specify the delimiting character and enclosing character contiguously, with no spaces between them. Spaces between the delimiting character and enclosing character will be treated as field data. As a result, the enclosing character might be treated as part of the field data, or an error might be generated due to invalid specification of the enclosing character.

    Note that this rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

  7. To specify the enclosing character inside field data, write it twice in a row.

    Example when the enclosing character is a single quotation mark ('):

    'AB''CD' (field data) → AB'CD (data stored in the table function derived table)

  8. When specifying the enclosing character as the first character of field data (excluding leading single-byte spaces or tabs), do not omit the first enclosing character.

    Example when the enclosing character is a single quotation mark ('):

    '''AB' (field data) → 'AB (data stored in the table function derived table)

  9. To specify the delimiting character inside field data, you must surround the field data with the enclosing character. Otherwise, the character will be treated as a delimiting character rather than part of the field data, which might cause an error due to the fact that the specified field no longer exists.

    Examples with double quotation marks (") as the enclosing character and the comma (,) as the delimiting character:

    1,"foo,bar",3

    In the above example, three columns of field data are recognized: 1, foo,bar, and 3.

    1,foo,bar,3

    In the above example, four columns of field data are recognized: 1, foo, bar, and 3.

    Note that this rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

  10. The table below shows examples of field data character strings and the corresponding data stored in the table function derived table. In these examples, the comma (,) is used as the delimiting character.

    Field data character string

    Data stored in the table function derived table

    With double quotation marks (") specified in the enclosing character specification option

    With NONE specified in the enclosing character specification option

    ABC,DEF

    • ABC

    • DEF

    • ABC

    • DEF

    "ABC""","DEF"

    • ABC"

    • DEF

    • "ABC"""

    • "DEF"

    "ABC,DEF"

    • ABC,DEF

    • "ABC

    • DEF"

    "ABC,DEF

    Error

    • "ABC

    • DEF

  11. The field data in the CSV file is converted to the data corresponding to the data type of the respective column of the table function derived table. The data types of the columns of the table function derived table must therefore be compatible with the description format of the field data. For details about the field data description rules, see (4) Storage assignment to a table function derived table (in the case of the ADB_CSVREAD function) in 6.2.2 Data types that can be converted, assigned, and compared.

    Note that this rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

(9) Notes

  1. When an SQL statement in which the ADB_CSVREAD function is specified is executed, the HADB server opens the CSV file to read data. The CSV file must therefore not be edited during execution of the SQL statement.

  2. No field data is extracted into any columns specified in the table function column list that do not affect the retrieval results (the columns not used in the query). The specifications pertaining to such columns are automatically removed from the SQL statement.

    Example:

    [Figure]

    In the above example, columns C2 and C5 do not affect the retrieval results. Before execution, the SELECT statement is therefore converted to a statement in which the specifications pertaining to columns C2 and C5 are removed. Specifically, the portions that are shaded in blue are deleted when the SELECT statement is run.

    Note
    • The following specifications are targeted for removal:

      • Columns specified in the table function column list

      • Specifications of field data numbers in the specification column option

      • Binary string format option specifications

    • Once the specifications pertaining to the extraneous columns are removed, only the field data for the remaining columns is targeted for extraction. Furthermore, only the field data for the remaining columns is subject to the rules pertaining to CSV files described above.

    • If the specifications of all the columns in the table function derived table are targeted for removal, the specifications of all the columns corresponding to the field data numbers specified in the specification column option will be removed, except for the lowest-numbered one.

  3. Some of the rules for specifying the ADB_CSVREAD function are checked during execution of the SQL statement (not during preprocessing of the SQL statement). The descriptions of these rules above include the sentence This rule is checked during execution of the SQL statement, not during preprocessing of the SQL statement.

(10) Examples

Example 1

Extract the following data from a CSV file (/dir/file.csv.gz) compressed in GZIP format:

  • Customer ID (USERID)

  • Customer name (NAME)

  • Age (AGE)

SELECT "USERID","NAME","AGE"
    FROM TABLE(ADB_CSVREAD(MULTISET ['/dir/file.csv.gz'],
                          'COMPRESSION_FORMAT=GZIP;'))
         AS "USERSLIST" ("USERID" CHAR(5),
                         "NAME" VARCHAR(100),
                         "AGE" INTEGER,
                         "COUNTRY" VARCHAR(100),
                         "INFORMATION" VARBINARY(10))

The underlined portion indicates the specification of the ADB_CSVREAD function.

[Figure]

Example 2

Extract the following data from a CSV file (/dir/file.csv.gz) compressed in GZIP format:

  • Customer name (NAME)

  • Country of origin (COUNTRY)

  • Various bit flags (INFORMATION)

SELECT "NAME","COUNTRY",BIN("INFORMATION")
    FROM TABLE(ADB_CSVREAD(MULTISET ['/dir/file.csv.gz'],
                          'COMPRESSION_FORMAT=GZIP;
                           FIELD_NUM=2,4,5;
                           BINARY_STRING_FORMAT=5:BIN;
                           ENCLOSING_CHAR=";
                           DELIMITER_CHAR=,;'))
         AS "USERSLIST" ("NAME" VARCHAR(100),
            "COUNTRY" VARCHAR(100),
            "INFORMATION" VARBINARY(10))

The underlined portion indicates the specification of the ADB_CSVREAD function.

[Figure]

Example 3

Extract the following data from a CSV file (/dir/file.csv):

  • Customer ID (USERID)

  • Customer name (NAME)

  • Age (AGE)

SELECT "USERID","NAME","AGE"
    FROM TABLE(ADB_CSVREAD(MULTISET ['/dir/file.csv'],
                          'COMPRESSION_FORMAT=NONE;'))
         AS "USERSLIST" ("USERID" CHAR(5),
                         "NAME" VARCHAR(100),
                         "AGE" INTEGER,
                         "COUNTRY" VARCHAR(100),
                         "INFORMATION" VARBINARY(10))

The underlined portion indicates the specification of the ADB_CSVREAD function.

[Figure]