5.2.2 Criteria for selecting row store tables and column store tables : Hitachi Advanced Database Setup and Operation Guide

(1) Restrictions on column store tables

The following operational restrictions apply to column store tables. If operational issues could occur as a result of any of these restrictions, define the table as a row store table.

You cannot define a text index for a column store table.
You cannot define a column store table as an archivable multi-chunk table.
You cannot define a column store table as a FIX table.

To Page Top

(2) Criteria for selecting whether to define a table as a row store table or column store table

Use the following criteria to determine whether to define a table as a row store table or a column store table:

Table 5‒1: Criteria for selecting row store tables and column store tables
Manner of data retrieval from table or salient characteristic of table being defined		Suitable table type
Manner of data retrieval from table	Data is retrieved in the following manner: Data retrieval that accesses data at the row level is common This applies to situations where an asterisk (*) is specified in the selection expression of a `SELECT` statement, and where the selection expression specifies almost every column. Operations are common that retrieve data by narrowing down the search range using B-tree indexes	Row store table
Manner of data retrieval from table	Data is retrieved in the following manner: Searches often target data in a specific column in its entirety without significantly narrowing the search range The data in a specific column within a specific range (such as a specific month or year) is frequently accessed There are frequent calculations (such as computing averages and totals) in relation to the value data in a specific column Grouping data in specific columns on a frequent basis	Column store table
Operation performed for the table to be defined	If data is added, updated, or deleted for the table: For the `INSERT`, `UPDATE`, and `DELETE` statements, processing performance is superior when the processing target is a row store table. If the `INSERT`, `UPDATE`, or `DELETE` statement is executed for a column store table, the retrieval performance might be degraded^# after the statement is executed. Therefore, if you use SQL statements to add, update, and delete data as part of routine tasks, define a row store table.	Row store table
Size of table being defined	When the table will store a relatively small amount of data This applies to tables that store a relatively small amount of data, and to tables into which a relatively small amount of data is imported in a single import operation. As a general rule, the size of the input data file to be imported into such a table will be less than 1 GB. Note that the size of the data in a row store table is approximately twice the size of the input data file. However, this varies according to the nature of the data being imported. Normally, the data in a row store table is larger than the same data in a column store table.	Row store table
Size of table being defined	When the table will store a large amount of data This applies to tables that store a large amount of data, and to tables into which a large amount of data is imported in a single import operation. As a general rule, such a table will have an input data file size of 1 GB or more. Note that the size of the data in a column store table is approximately half the input data file size. However, this varies according to the nature of the data being imported. Normally, the data in a column store table is smaller than the same data in a row store table.	Column store table
Use in a data warehouse	When the table is joined by a primary key with tables that store data to be analyzed This applies to dimension tables and master tables in a data warehouse.	Row store table
Use in a data warehouse	When the table stores data to be analyzed This applies to fact tables in a data warehouse, and tables that store log data and sensor data. We particularly recommend that you define a table as a column store table when it will store several gigabytes or more of data to be analyzed (data that is searched by narrowing the search range using a range index). The size of the data to be analyzed can be estimated based on the size of the input data file. If the size of the data to be analyzed does not exceed several hundred megabytes, we recommend that you define the table that stores data to be analyzed as a row store table.	Column store table

#: You can prevent degradation of retrieval performance for column store tables by enabling the updated-row columnizing facility. For details about the updated-row columnizing facility, see 11.18 Using the updated-row columnizing facility (maintaining the retrieval performance for column store tables).

Important: A table suited to definition as a column store table is also generally best defined as a multi-chunk table. For this reason, we recommend that you define these tables as multi-chunk column store tables. For details about the points to consider when defining a table as a multi-chunk table, see 5.2.4 Points to consider in defining a multi-chunk table.

To define a table as a row store table, either specify ROW in the STORAGE FORMAT option in the CREATE TABLE statement, or omit the STORAGE FORMAT option altogether. To define a table as a column store table, specify COLUMN in the STORAGE FORMAT option in the CREATE TABLE statement.

For details about the CREATE TABLE statement, see Definition SQL in the manual HADB SQL Reference.

To Page Top

(3) B-tree indexes used when retrieving data from column store tables

A B-tree index defined for a column store table can be used to search that column store table only in limited circumstances. The following are typical cases in which a B-tree index is used for a column store table. We recommend that you define a B-tree index for a column store table only when the range of the processing target can be narrowed as shown in these cases.

Case where an index that will be used for index specification is specified when an SQL statement is executed
Case where the minimum value is obtained by specifying the set function MIN or the maximum value is obtained by specifying the set function MAX
Case where the UPDATE or DELETE statement is executed by specifying a search condition

In cases such as accessing data for a specific year or month, we recommend that you define range indexes and use them to narrow the search range.

For details about the rules that govern the use of B-tree indexes when retrieving data from column store tables, see B-tree indexes and text indexes used during execution of SQL statements in the HADB Application Development Guide.

Note: The updated-row columnizing facility is not applied to column store tables for which B-tree indexes are defined. For details about the updated-row columnizing facility, see 11.18 Using the updated-row columnizing facility (maintaining the retrieval performance for column store tables).

To Page Top

(4) Column-data compression types for column store tables

When importing data into a column store table, the data in each column is compressed using one of the compression types listed in the following table. This means that the same data will be smaller in a column store table than in a row store table. When importing data, the HADB server automatically selects one of the compression types in the following table based on the data being imported.

Table 5‒2: Column-data compression types for column store tables
No.	Compression type	Description
1	Non-compression (`NONE`)	The data is stored uncompressed.
2	Run-length encoding (`RUNLENGTH`)	When the same data value is repeated, the data is stored as the data value and number of consecutive occurrences. This compression type is effective on data where the same value occurs in consecutive elements. Compression example: `AAAABB` → `4A2B`
3	Delta encoding (`DELTA`)	Data is stored as the first data value and the subsequent differences between adjacent data values. This compression type is effective when data is concentrated around a specific value. Compression example: `100,99,101,103` → `100,-1,2,2`
4	Delta run-length encoding (`DELTA_RUNLENGTH`)	This compression type applies run-length encoding to delta-encoded data. This compression type is effective on data that is incremented by a certain value (data that follows an arithmetic progression). This compression type is effective on data such as slip numbers that are incremented by one each time.
5	Dictionary encoding (`DICTIONARY`)	Data that appears often is stored in a dictionary, and the data in the table contains only index references to the dictionary. This compression type is effective on data such as character string data with low cardinality (low uniqueness of data). The maximum dictionary size is 16 kilobytes.

As shown in the preceding table, there are compression methods that eliminate redundancy among data values, and those that make use of the differences between values. This means that compression is less effective when applied to data that has little repetition or little difference between data items, such as character string data and binary data of 128 bytes and longer.

Note

Upon completing data import, HADB server outputs the compression type it selected in a message.
The compression type is determined when data is imported. Therefore, if the pattern of data changes while data is being imported into a table, the compression type might change even within the same column.
HADB server selects the compression type at the level of the individual threads that store the data being imported.
HADB server calculates the compression rate for the first 16 megabytes of column data to be stored, and selects the compression type with the best compression rate. If a poor compression rate means the size of the compressed data is the same as or larger than the uncompressed data, the HADB server selects non-compression as the compression type.

Note that you can also specify the compression type for each column in a column store table when performing the following operations. In this case, the column data will always be compressed using the specified compression type.

Using the CREATE TABLE statement to define a column store table
Using the ALTER TABLE statement to add a column to a column store table

■Relationship between a column store table for which the UPDATE statement or DELETE statement is executed and the compression type

In the case of column store tables that have an update as a prerequisite, we recommend specifying a compression type other than a delta compression type^#1. Do not let HADB automatically select the compression type^#2.

Reason:: When an UPDATE statement or DELETE statement is executed on data in column-store format, HADB extends the length of all columns for all the targeted row data. When extending the length for DELTA compression, HADB needs to calculate the data values from the beginning of the data to the data subject to the lengthening. As a result, processing to extend the length of the columns takes longer than other compression methods. Therefore, the time required to execute the UPDATE statement or DELETE statement takes longer when a DELTA compression method is used for a column store table.
#1: In the COMPRESSION TYPE specification in a CREATE TABLE statement, we recommend specifying a compression type other than DELTA and specifying something other than automatic selection (AUTO).
#2: If the compression type specification is omitted, or if AUTO is specified for the compression type specification, HADB automatically selects the compression method. As a result, HADB might apply a DELTA compression method.

To Page Top

(5) Examples of when to use row store tables and column store tables

This subsection explains when you might use a row store table and when you might use a column store table, based on the following operational example:

Operational example

Company A intends to operate a data warehouse for the purpose of analyzing the sales data collected from its sales offices. The analysis axes for sales data are dates, sales offices, products, and customers. Company A decided to implement the following star schema as the schema model for its data warehouse.

Figure 5‒2: Data warehouse schema model for analysis of Company A's sales data

In the preceding schema, the SALES table is a fact table. The PRODUCTS table, CUSTOMERS table, OFFICES table, and CALENDAR table are dimension tables. The following explains the data stored in each table:

SALES table

The SALES table stores the past 10 years of sales data. This consists of such data as product IDs of sold products, the number of products sold, unit prices, customer IDs, IDs of offices whose sales contributed to the totals, and the dates on which sales occurred. Because approximately 100,000 items of data are stored per day, this table will ultimately store approximately 360,000,000 items of sales data.

The data stored as product IDs, customer IDs, office IDs, and sales dates in this table serve as primary keys (foreign keys) for the PRODUCTS, CUSTOMERS, OFFICES, and CALENDAR tables respectively.
PRODUCTS table

The PRODUCTS table stores data about the products handled by Company A (such as product IDs, names, and categories).
CUSTOMERS table

The CUSTOMERS table stores data about the customers of Company A (such as their customer IDs, names, industries, and addresses).
OFFICES table

The OFFICES table stores data about the sales offices of Company A (such as office IDs, names, and addresses).
CALENDAR table

The CALENDAR table stores dimension data related to dates and times (date, month, business quarter, year).

When activity such as compiling monthly sales data and analyzing sales at the customer and office level takes place in relation to the preceding schema model, the following forms of data retrieval are likely to occur frequently:

Retrieving all data in the SALES table
Retrieving data in the SALES table for a specific period (such as the current month or the previous month)

Therefore, the SALES table is defined as a column store table. A range index is defined for the column that stores dates of sale.

The PRODUCTS table, CUSTOMERS table, OFFICES table, and CALENDAR table are likely to be joined using the foreign keys stored in the SALES table. Because these tables contain a relatively small number of data items, they are defined as row store tables. A B-tree index is defined for columns that correspond to primary keys.

To Page Top