Hitachi

Hitachi Advanced Database Setup and Operation Guide


5.2.2 Criteria for selecting row store tables and column store tables

This section explains the criteria for selecting whether to define a table as a row store table, or as a column store table.

Organization of this subsection

(1) Restrictions on column store tables

The following operational restrictions apply to column store tables. If operational issues could occur as a result of any of these restrictions, define the table as a row store table.

(2) Criteria for selecting whether to define a table as a row store table or column store table

Use the following criteria to determine whether to define a table as a row store table or a column store table:

Table 5‒1: Criteria for selecting row store tables and column store tables

Manner of data retrieval from table or salient characteristic of table being defined

Suitable table type

Manner of data retrieval from table

Data is retrieved in the following manner:

  • Data retrieval that accesses data at the row level is common

    This applies to situations where an asterisk (*) is specified in the selection expression of a SELECT statement, and where the selection expression specifies almost every column.

  • Operations are common that retrieve data by narrowing down the search range using B-tree indexes

Row store table

Data is retrieved in the following manner:

  • Searches often target data in a specific column in its entirety without significantly narrowing the search range

  • The data in a specific column within a specific range (such as a specific month or year) is frequently accessed

  • There are frequent calculations (such as computing averages and totals) in relation to the value data in a specific column

  • Grouping data in specific columns on a frequent basis

Column store table

Operation performed for the table to be defined

  • If data is added, updated, or deleted for the table:

    For the INSERT, UPDATE, and DELETE statements, processing performance is superior when the processing target is a row store table. If the INSERT, UPDATE, or DELETE statement is executed for a column store table, the retrieval performance might be degraded# after the statement is executed. Therefore, if you use SQL statements to add, update, and delete data as part of routine tasks, define a row store table.

Row store table

Size of table being defined

  • When the table will store a relatively small amount of data

    This applies to tables that store a relatively small amount of data, and to tables into which a relatively small amount of data is imported in a single import operation. As a general rule, the size of the input data file to be imported into such a table will be less than 1 GB.

    Note that the size of the data in a row store table is approximately twice the size of the input data file. However, this varies according to the nature of the data being imported.

    Normally, the data in a row store table is larger than the same data in a column store table.

Row store table

  • When the table will store a large amount of data

    This applies to tables that store a large amount of data, and to tables into which a large amount of data is imported in a single import operation. As a general rule, such a table will have an input data file size of 1 GB or more.

    Note that the size of the data in a column store table is approximately half the input data file size. However, this varies according to the nature of the data being imported.

    Normally, the data in a column store table is smaller than the same data in a row store table.

Column store table

Use in a data warehouse

  • When the table is joined by a primary key with tables that store data to be analyzed

    This applies to dimension tables and master tables in a data warehouse.

Row store table

  • When the table stores data to be analyzed

    This applies to fact tables in a data warehouse, and tables that store log data and sensor data.

    We particularly recommend that you define a table as a column store table when it will store several gigabytes or more of data to be analyzed (data that is searched by narrowing the search range using a range index).

    The size of the data to be analyzed can be estimated based on the size of the input data file.

    If the size of the data to be analyzed does not exceed several hundred megabytes, we recommend that you define the table that stores data to be analyzed as a row store table.

Column store table

#

You can prevent degradation of retrieval performance for column store tables by enabling the updated-row columnizing facility. For details about the updated-row columnizing facility, see 11.18 Using the updated-row columnizing facility (maintaining the retrieval performance for column store tables).

Important

A table suited to definition as a column store table is also generally best defined as a multi-chunk table. For this reason, we recommend that you define these tables as multi-chunk column store tables. For details about the points to consider when defining a table as a multi-chunk table, see 5.2.4 Points to consider in defining a multi-chunk table.

To define a table as a row store table, either specify ROW in the STORAGE FORMAT option in the CREATE TABLE statement, or omit the STORAGE FORMAT option altogether. To define a table as a column store table, specify COLUMN in the STORAGE FORMAT option in the CREATE TABLE statement.

For details about the CREATE TABLE statement, see Definition SQL in the manual HADB SQL Reference.

(3) B-tree indexes used when retrieving data from column store tables

A B-tree index defined for a column store table can be used to search that column store table only in limited circumstances. The following are typical cases in which a B-tree index is used for a column store table. We recommend that you define a B-tree index for a column store table only when the range of the processing target can be narrowed as shown in these cases.

In cases such as accessing data for a specific year or month, we recommend that you define range indexes and use them to narrow the search range.

For details about the rules that govern the use of B-tree indexes when retrieving data from column store tables, see B-tree indexes and text indexes used during execution of SQL statements in the HADB Application Development Guide.

Note

The updated-row columnizing facility is not applied to column store tables for which B-tree indexes are defined. For details about the updated-row columnizing facility, see 11.18 Using the updated-row columnizing facility (maintaining the retrieval performance for column store tables).

(4) Column-data compression types for column store tables

When importing data into a column store table, the data in each column is compressed using one of the compression types listed in the following table. This means that the same data will be smaller in a column store table than in a row store table. When importing data, the HADB server automatically selects one of the compression types in the following table based on the data being imported.

Table 5‒2: Column-data compression types for column store tables

No.

Compression type

Description

1

Non-compression (NONE)

The data is stored uncompressed.

2

Run-length encoding (RUNLENGTH)

When the same data value is repeated, the data is stored as the data value and number of consecutive occurrences. This compression type is effective on data where the same value occurs in consecutive elements.

Compression example: AAAABB4A2B

3

Delta encoding (DELTA)

Data is stored as the first data value and the subsequent differences between adjacent data values. This compression type is effective when data is concentrated around a specific value.

Compression example: 100,99,101,103100,-1,2,2

4

Delta run-length encoding (DELTA_RUNLENGTH)

This compression type applies run-length encoding to delta-encoded data. This compression type is effective on data that is incremented by a certain value (data that follows an arithmetic progression). This compression type is effective on data such as slip numbers that are incremented by one each time.

5

Dictionary encoding (DICTIONARY)

Data that appears often is stored in a dictionary, and the data in the table contains only index references to the dictionary. This compression type is effective on data such as character string data with low cardinality (low uniqueness of data).

The maximum dictionary size is 16 kilobytes.

As shown in the preceding table, there are compression methods that eliminate redundancy among data values, and those that make use of the differences between values. This means that compression is less effective when applied to data that has little repetition or little difference between data items, such as character string data and binary data of 128 bytes and longer.

Note
  • Upon completing data import, HADB server outputs the compression type it selected in a message.

  • The compression type is determined when data is imported. Therefore, if the pattern of data changes while data is being imported into a table, the compression type might change even within the same column.

  • HADB server selects the compression type at the level of the individual threads that store the data being imported.

  • HADB server calculates the compression rate for the first 16 megabytes of column data to be stored, and selects the compression type with the best compression rate. If a poor compression rate means the size of the compressed data is the same as or larger than the uncompressed data, the HADB server selects non-compression as the compression type.

Note that you can also specify the compression type for each column in a column store table when performing the following operations. In this case, the column data will always be compressed using the specified compression type.

■Relationship between a column store table for which the UPDATE statement or DELETE statement is executed and the compression type

In the case of column store tables that have an update as a prerequisite, we recommend specifying a compression type other than a delta compression type#1. Do not let HADB automatically select the compression type#2.

Reason:

When an UPDATE statement or DELETE statement is executed on data in column-store format, HADB extends the length of all columns for all the targeted row data. When extending the length for DELTA compression, HADB needs to calculate the data values from the beginning of the data to the data subject to the lengthening. As a result, processing to extend the length of the columns takes longer than other compression methods. Therefore, the time required to execute the UPDATE statement or DELETE statement takes longer when a DELTA compression method is used for a column store table.

#1

In the COMPRESSION TYPE specification in a CREATE TABLE statement, we recommend specifying a compression type other than DELTA and specifying something other than automatic selection (AUTO).

#2

If the compression type specification is omitted, or if AUTO is specified for the compression type specification, HADB automatically selects the compression method. As a result, HADB might apply a DELTA compression method.

(5) Examples of when to use row store tables and column store tables

This subsection explains when you might use a row store table and when you might use a column store table, based on the following operational example:

Operational example

Company A intends to operate a data warehouse for the purpose of analyzing the sales data collected from its sales offices. The analysis axes for sales data are dates, sales offices, products, and customers. Company A decided to implement the following star schema as the schema model for its data warehouse.

Figure 5‒2: Data warehouse schema model for analysis of Company A's sales data

[Figure]

In the preceding schema, the SALES table is a fact table. The PRODUCTS table, CUSTOMERS table, OFFICES table, and CALENDAR table are dimension tables. The following explains the data stored in each table:

When activity such as compiling monthly sales data and analyzing sales at the customer and office level takes place in relation to the preceding schema model, the following forms of data retrieval are likely to occur frequently:

Therefore, the SALES table is defined as a column store table. A range index is defined for the column that stores dates of sale.

The PRODUCTS table, CUSTOMERS table, OFFICES table, and CALENDAR table are likely to be joined using the foreign keys stored in the SALES table. Because these tables contain a relatively small number of data items, they are defined as row store tables. A B-tree index is defined for columns that correspond to primary keys.