Hitachi

Hitachi Advanced Database Setup and Operation Guide


2.17.1 Correction search

This subsection describes correction search for text data.

(1) Overview of correction search

Correction search is a function that searches text data while ignoring the differences between uppercase and lowercase letters, between half-width and full-width characters, and between Japanese hiragana and katakana characters. For example, if you specify Hitachi as the search string, the function finds not only Hitachi but also HITACHI and [Figure] in the same search. Correction search is useful for text data that has notational inconsistencies, because a single search behaves as if you had executed an SQL statement that combines multiple search conditions with the OR operator.

Notational inconsistencies are likely to arise in text data created by multiple people, such as call center log records and product usage applications. Correction search is convenient when you want to retrieve from such text data all records that include a specific keyword (such as the name of a company, product, or person in charge).

The following figure shows an example of correction search.

Figure 2‒58: Example of correction search

[Figure]

Explanation

If Hitachi is specified as a search condition, the function retrieves all character strings that include the word Hitachi, irrespective of character case or width.
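The behavior described above can be approximated outside HADB with a short Python sketch. This is not how HADB implements correction search; it uses Unicode NFKC normalization and case folding as a stand-in, and the function names and sample records are hypothetical.

```python
import unicodedata


def fold_case_and_width(text):
    """Return a comparison key that ignores letter case and character width."""
    # NFKC maps full-width letters and digits to their half-width forms.
    # casefold() then removes the uppercase/lowercase distinction.
    return unicodedata.normalize("NFKC", text).casefold()


def contains_ignoring_notation(haystack, needle):
    """True if needle occurs in haystack once both are folded."""
    return fold_case_and_width(needle) in fold_case_and_width(haystack)


# Hypothetical sample records with notational inconsistencies.
records = [
    "Call from Hitachi",
    "HITACHI support desk",
    "Ｈｉｔａｃｈｉ order form",
    "Other vendor",
]
matches = [r for r in records if contains_ignoring_notation(r, "Hitachi")]
```

In this sketch, one folded comparison takes the place of several OR-combined conditions, one per spelling variant.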

To perform correction search, use the scalar function CONTAINS. For details about the scalar function CONTAINS, see CONTAINS in Character string functions (acquisition of character string information) in Scalar Functions in the manual HADB SQL Reference.

If you perform correction search, defining a text index that supports correction search reduces the number of pages that must be loaded, which improves table retrieval performance.

Important

The correction search function can be used if the character encoding used on the HADB server is Unicode (the value specified for the environment variable ADBLANG is UTF8). It cannot be used if the character encoding is Shift-JIS (the value specified for the environment variable ADBLANG is SJIS).
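A client-side script could check this precondition before issuing a correction search. The following is a minimal sketch; the function name is hypothetical, and treating an unset ADBLANG as unsupported is an assumption, because this section does not state a default value.

```python
import os


def correction_search_supported(environ=None):
    """Return True only when ADBLANG is UTF8 (Unicode encoding)."""
    env = os.environ if environ is None else environ
    # Assumption: an unset ADBLANG is treated as "not supported".
    return env.get("ADBLANG") == "UTF8"
```

For example, correction_search_supported({"ADBLANG": "SJIS"}) returns False.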

(2) Rules of correction search

The following table describes the rules of correction search.

Table 2‒21: Rules of correction search

No.

Character type

Rule

Example of correction search

1

Alphabetic character

Correction search ignores the differences among the following character types:

  • Uppercase letter

  • Lowercase letter

  • Half-width alphabetic character

  • Full-width alphabetic character

If max is specified as a search condition, the correction search function searches for the following character strings:

[Figure]

2

Number

Correction search ignores the differences between the following character types:

  • Half-width number

  • Full-width number

3

  • Hiragana character

  • Katakana character

▪ Japanese hiragana and katakana characters

Correction search ignores the differences between the following character types:

  • Hiragana character

  • Katakana character

▪ Full-width and half-width katakana characters

Correction search ignores the differences between the following character types:

  • Full-width katakana character

  • Half-width katakana character

▪ Japanese dakuten and handakuten marks

If one of the following characters is followed by a full-width or half-width dakuten mark (voiced sound mark), the correction search function treats the character and the mark together as a single character.

  • Hiragana character

  • Full-width katakana character

  • Half-width katakana character

The same rule also applies to a full-width or half-width handakuten mark (semi-voiced sound mark).

▪ Japanese youon and sokuon signs

The correction search function equates Japanese youon and sokuon signs to their corresponding regular-sized characters.

▪ Japanese ombiki sign

The correction search function does not ignore the ombiki sign (prolonged sound mark). It treats the sign as an ordinary character.

▪ Example 1

[Figure]

▪ Example 2

[Figure]

▪ Example 3

[Figure]

4

Diacritical marks

The correction search function ignores the difference between the following character types:

  • Characters with a diacritical mark (such as umlaut)

  • Characters without a diacritical mark

The correction search function assumes the following characters to be the same:

[Figure]

Therefore, if MAX is specified as a search condition, the correction search function searches for the following character strings:

[Figure]

5

Single character representing a specific character string

If the correction search function encounters a single character that represents a specific character string, the function expands the character into that character string and treats the two as the same.

The correction search function assumes the following characters to be the same:

[Figure]
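Rules 1, 2, 4, and 5 can be approximated with Unicode normalization, as in the following sketch. It is an illustration only and not HADB's implementation. The function name is hypothetical, and the characters used in the usage note (the umlaut and the kg unit symbol) are illustrative choices, not the contents of the figures in the table.

```python
import unicodedata

# Combining dakuten and handakuten. These are kept, because the kana rules
# treat a voiced kana and its unvoiced counterpart as different characters.
KANA_VOICING_MARKS = {"\u3099", "\u309A"}


def fold_marks_and_compat(text):
    """Fold case, width, diacritical marks, and single-character strings."""
    # NFKD expands compatibility characters (full-width forms, unit symbols)
    # and splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) or ch in KANA_VOICING_MARKS
    ]
    # NFC recomposes kana with the voicing marks that were kept.
    return unicodedata.normalize("NFC", "".join(kept)).casefold()
```

With this key, MÄX and ｍａｘ both fold to max, １２３ folds to 123, and the single character ㎏ folds to kg. A voiced kana such as が keeps its mark and therefore stays distinct from か.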

For hiragana, full-width katakana, and half-width katakana characters, a character with a dakuten or handakuten mark and the same character without the mark are treated as different characters, so the correction search function does not ignore the difference between them. The following shows examples. In the following combinations of characters, both characters in each combination carry the same (dakuten or handakuten) mark, so they are treated as the same character.

On the other hand, [Figure], [Figure], and [Figure] do not have the same (dakuten or handakuten) mark. Therefore, these characters are assumed to be different. For example, if [Figure] is specified as a search condition, [Figure] and [Figure] can be retrieved. However, [Figure], [Figure], [Figure], and [Figure] cannot be retrieved because the correction search function does not ignore the difference among them.
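The kana rules can be sketched in Python as follows. This is an approximation built on Unicode normalization and is not HADB's implementation. The function name and the list of small kana are assumptions made for illustration.

```python
import unicodedata

# Stand-alone full-width dakuten/handakuten mapped to their combining forms,
# so that normalization can attach them to the preceding kana.
_SPACING_TO_COMBINING = str.maketrans({"\u309B": "\u3099", "\u309C": "\u309A"})

# Small (youon and sokuon) katakana mapped to regular-sized katakana.
# Assumption: only these common small kana are covered.
_SMALL_TO_REGULAR = str.maketrans("ァィゥェォッャュョヮ", "アイウエオツヤユヨワ")


def kana_key(text):
    """Return a comparison key that follows the kana rules in this sketch."""
    text = text.translate(_SPACING_TO_COMBINING)
    # NFKC widens half-width katakana and composes kana + voicing mark.
    text = unicodedata.normalize("NFKC", text)
    folded = []
    for ch in text:
        cp = ord(ch)
        if 0x3041 <= cp <= 0x3096:  # hiragana block
            ch = chr(cp + 0x60)     # same letter in the katakana block
        folded.append(ch)
    # The ombiki sign is left untouched, so it still counts as a character.
    return "".join(folded).translate(_SMALL_TO_REGULAR)
```

Under this key, か, カ, and ｶ are equal, and が, ガ, and ｶﾞ are equal, but か and が differ because only one of them carries a dakuten mark. Similarly, きゃ matches きや, while サーバ does not match サバ because the ombiki sign is kept.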

Correction search uses sort codes, which are the codes defined in the ISO/IEC 14651:2011 standard for sorting and comparing characters. Characters that have the same sort code are treated as the same character.

Note
  • For symbols and other characters that are treated as having no sort code, the bytecode is used for sorting and comparison.

  • The sort code is used if SORTCODE(simple-string-specification) is specified as notation-correction-search-specification of the scalar function CONTAINS.
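The comparison scheme described above (sort codes, with a fallback to the bytecode) can be illustrated with a toy model. The table below is hand-made and hypothetical; real sort codes come from the ISO/IEC 14651 data. Reading "bytecode" as the character's raw UTF-8 bytes is also an assumption.

```python
# Hypothetical toy table: characters that share a sort code compare as equal.
TOY_SORT_CODES = {
    "a": 1, "A": 1, "ａ": 1, "Ａ": 1,
    "b": 2, "B": 2, "ｂ": 2, "Ｂ": 2,
}


def element_key(ch):
    """Use the sort code if one exists, otherwise the raw UTF-8 bytes."""
    code = TOY_SORT_CODES.get(ch)
    if code is not None:
        return ("sort", code)
    # Assumption: characters without a sort code are compared byte for byte.
    return ("bytes", ch.encode("utf-8"))


def comparison_key(text):
    """Build a per-character key sequence for equality comparison."""
    return tuple(element_key(ch) for ch in text)
```

In this model, ab and ＡB compare as equal because each pair of characters shares a sort code. The symbols ★ and ☆ are not in the toy table, so they are compared by their bytes and remain different.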