Converting the character codes of extracted data

When you extract or import data between systems that use different character code sets, you can convert the extracted data to the character codes used by the target system.

This subsection describes the character code sets that can be converted. For details about how to use this function, see 4.2.3 Additional data extraction and import functions.

Organization of this subsection: (1) Data types supported for conversion of character codes; (2) Convertible character code sets; (3) Character code set conversion ranges; (4) Converting from SJIS to EUC; (5) Converting from SJIS to UTF-8; (6) Converting from EUC to SJIS; (7) Converting from EUC to UTF-8; (8) Converting from UTF-8 to SJIS or EUC

(1) Data types supported for conversion of character codes

For extracted data, the following data types are supported for conversion of character codes:

CHAR¹
VARCHAR¹
MCHAR¹
MVARCHAR¹
NCHAR¹
NVARCHAR¹
SGMLTEXT¹,²

¹ You can use the null value information file to exclude desired columns from code conversion. For details about how to specify the null value information file, see 4.2.4 Contents of files specified with the xtrep command.

² If you only create a file and do not import the data into a HiRDB table, SGMLTEXT data is treated as having the BLOB attribute and is therefore excluded from code conversion. In such a case, you can use one of the following methods to convert the codes of such data:

- In the null value information file, specify CODECONV.
  For details about the specification method, see 4.2.4 Contents of files specified with the xtrep command.
- In the import information file, specify the SGMLTEXT type.
  For details about the specification method, see 4.2.4 Contents of files specified with the xtrep command.
  If you are using the created file as the input file of pdload, use this method to convert the codes.

(2) Convertible character code sets

Source character code set	Target character code set
	SJIS	EUC	UTF-8
SJIS	--	Y	Y
EUC	Y	--	Y
UTF-8	Y	Y	--

Legend:

Y: Can be converted.

--: Not applicable.
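As a rough illustration of the combinations in the table above, the following Python sketch re-encodes a byte string between the three code sets. The codec names `shift_jis`, `euc_jp`, and `utf-8` belong to Python's standard library, and the helper `convert` is a hypothetical name; none of them are part of the product, and their handling of Gaiji and undefined codes may differ from the rules described in this subsection.

```python
# Illustrative only: Python's built-in codecs stand in for the product's conversion.
CODECS = {"SJIS": "shift_jis", "EUC": "euc_jp", "UTF-8": "utf-8"}

def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from the source code set to the target code set."""
    if source == target:
        # The table marks same-to-same combinations as not applicable.
        raise ValueError("source and target code sets must differ")
    return data.decode(CODECS[source]).encode(CODECS[target])

sample = "日本語".encode("shift_jis")
for target in ("EUC", "UTF-8"):
    print(target, convert(sample, "SJIS", target).hex())
```

Decoding to Unicode and re-encoding is only one way to model the conversion; it is used here because it keeps the sketch short.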

(3) Character code set conversion ranges

Figure 3-8 shows the conversion range for the SJIS character code set.

Figure 3-8 SJIS character code set conversion range

[Figure]

Figure 3-9 shows the conversion range for the EUC character code set.

Figure 3-9 EUC character code set conversion range

[Figure]

Table 3-9 shows the conversion range for the UTF-8 character code set.

Table 3-9 UTF-8 character code set conversion range

1st byte	2nd byte	3rd byte	Conversion rule
0x00-0x7F	--	--	Recognize as a 1-byte code, and convert to the corresponding code.
0x80-0xBF	0x00-0xFF	--	Convert to 0x20.
0x80-0xBF	--	--	Convert to 0x20.
0xC2-0xDE	0x80-0xFF	--	Recognize as a 2-byte code, and convert to the corresponding code.
	Other than above	--	Convert according to the specification of the XTUNDEF environment variable.
	--	--	Recognize as an incomplete code, and skip without converting it.
0xDF	0x80-0xBF	--	Recognize as a 2-byte code, and convert to the corresponding code.
	Other than above	--	Convert according to the specification of the XTUNDEF environment variable.
	--	--	Recognize as an incomplete code, and skip without converting it.
0xE0	0xA0-0xFF	0x80-0xFF	Recognize as a 3-byte code, and convert to the corresponding code.
		Other than above	Convert according to the specification of the XTUNDEF environment variable.
		--	Recognize as an incomplete code, and skip without converting it.
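	Other than above	0x80-0xFF	Convert according to the specification of the XTUNDEF environment variable.
		Other than above	Convert according to the specification of the XTUNDEF environment variable.
		--	Recognize as an incomplete code, and skip without converting it.
	--	--	Recognize as an incomplete code, and skip without converting it.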
0xE1-0xEE	0x80-0xFF	0x80-0xFF	Recognize as a 3-byte code, and convert to the corresponding code.
		Other than above	Convert according to the specification of the XTUNDEF environment variable.
		--	Recognize as an incomplete code, and skip without converting it.
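	Other than above	0x80-0xFF	Convert according to the specification of the XTUNDEF environment variable.
		Other than above	Convert according to the specification of the XTUNDEF environment variable.
		--	Recognize as an incomplete code, and skip without converting it.
	--	--	Recognize as an incomplete code, and skip without converting it.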
0xEF	0x80-0xBF	0x80-0xBF	Recognize as a 3-byte code, and convert to the corresponding code.
		Other than above	Convert according to the specification of the XTUNDEF environment variable.
		--	Recognize as an incomplete code, and skip without converting it.
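	Other than above	0x80-0xBF	Convert according to the specification of the XTUNDEF environment variable.
		Other than above	Convert according to the specification of the XTUNDEF environment variable.
		--	Recognize as an incomplete code, and skip without converting it.
	--	--	Recognize as an incomplete code, and skip without converting it.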
Other than above	--	--	Convert according to the specification of the XTUNDEF environment variable.

Legend:

--: The character codes are not converted.
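The 3-byte rows of Table 3-9 can be read as a small decision procedure. The following Python sketch is one possible reading, not the product's implementation: the function name `classify_three_byte` and the result labels are hypothetical, and the sketch assumes that the rows for 0xE1-0xEE follow the same pattern as the rows for 0xE0 and 0xEF.

```python
def _ranges(first: int):
    """Return the (second byte, third byte) ranges that Table 3-9 accepts."""
    if first == 0xE0:
        return range(0xA0, 0x100), range(0x80, 0x100)
    if 0xE1 <= first <= 0xEE:
        return range(0x80, 0x100), range(0x80, 0x100)
    if first == 0xEF:
        return range(0x80, 0xC0), range(0x80, 0xC0)
    raise ValueError("not a 3-byte lead byte covered by this sketch")

def classify_three_byte(seq: bytes) -> str:
    """Classify a sequence that starts with a 3-byte lead byte.

    'convert'    : recognized as a 3-byte code
    'undefined'  : handled according to XTUNDEF
    'incomplete' : second or third byte is missing, so the code is skipped
    """
    second_ok, third_ok = _ranges(seq[0])
    if len(seq) < 3:
        return "incomplete"
    if seq[1] in second_ok and seq[2] in third_ok:
        return "convert"
    return "undefined"

print(classify_three_byte(b"\xe3\x80\x80"))
print(classify_three_byte(b"\xe0\x80\x80"))
print(classify_three_byte(b"\xef\xbc"))
```

The caller is expected to pass at least the lead byte; an empty input is outside the scope of the sketch.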

(4) Converting from SJIS to EUC

(a) Single-byte codes ((1) - (4) in Figure 3-8)

Each code is converted to the corresponding EUC code.
Kana characters are converted to single-byte EUC codes.
Space character (0x20) remains unchanged (0x20).

(b) Double-byte codes (SJIS standard Kanji area; (6) and (7) in Figure 3-8)

Each code is converted to the corresponding code in the EUC standard Kanji area.
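For codes in the standard area, the correspondence between SJIS and EUC is arithmetic, because both are encodings of the same JIS row and column numbers. The following Python sketch shows the widely used formula. The function name `sjis_to_euc` is hypothetical, and the lead-byte ranges 0x81-0x9F and 0xE0-0xEF are an assumption about the standard area; the authoritative boundaries are those shown in Figure 3-8.

```python
def sjis_to_euc(s1: int, s2: int) -> bytes:
    """Convert one standard-area SJIS double-byte code to its EUC code."""
    if not (0x81 <= s1 <= 0x9F or 0xE0 <= s1 <= 0xEF):
        raise ValueError("lead byte outside the range handled by this sketch")
    if not (0x40 <= s2 <= 0xFC) or s2 == 0x7F:
        raise ValueError("trail byte is not valid in SJIS")
    # Each SJIS lead byte covers two consecutive JIS rows.
    index = s1 - (0xC1 if s1 >= 0xE0 else 0x81)
    j1 = index * 2 + 0x21
    if s2 >= 0x9F:
        # Upper half of the trail-byte range maps to the second (even) row.
        j1 += 1
        j2 = s2 - 0x7E
    else:
        # Lower half skips the unused trail byte 0x7F.
        j2 = s2 - (0x20 if s2 >= 0x80 else 0x1F)
    # EUC sets the high bit of both JIS bytes.
    return bytes([j1 | 0x80, j2 | 0x80])

print(sjis_to_euc(0x88, 0x9F).hex())
```

As a cross-check, the result for a given code can be compared with `bytes([s1, s2]).decode("shift_jis").encode("euc_jp")` in Python.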

(c) Double-byte codes (Gaiji area; (8) in Figure 3-8)

Each code is converted according to a mapping table for converting character codes. If a code is not defined in the mapping table, it is regarded as being undefined and is converted to the value specified in the XTUNDEF environment variable. For details about mapping tables for converting character codes, see 4.2.3 Additional data extraction and import functions. For details about the environment variables, see 2.2.3 Specifying environment variables.
Space character (0x8140) is converted to space character (0xa1a1).

(d) Double-byte codes (other than (b) or (c))

All other double-byte codes are regarded as being undefined and are converted to the value specified in the XTUNDEF environment variable. For details about the environment variables, see 2.2.3 Specifying environment variables.
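The handling of Gaiji and undefined codes described in (c) and (d) amounts to a table lookup with a fallback. The following Python sketch illustrates the idea under stated assumptions: the function `convert_gaiji`, the mapping entry, and the replacement value are all hypothetical, and the sketch does not reflect the actual file format of the mapping table or the actual syntax of the XTUNDEF environment variable.

```python
def convert_gaiji(code: bytes, mapping: dict, undefined: bytes) -> bytes:
    """Return the mapped code, or the replacement for undefined codes."""
    return mapping.get(code, undefined)

# Hypothetical mapping-table entry (SJIS Gaiji code -> EUC Gaiji code).
mapping = {b"\xf0\x40": b"\xf5\xa1"}
# Hypothetical replacement value standing in for whatever XTUNDEF specifies.
undefined = b"\xa1\xa1"

print(convert_gaiji(b"\xf0\x40", mapping, undefined).hex())  # defined in the table
print(convert_gaiji(b"\xf0\x41", mapping, undefined).hex())  # not defined
```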

(5) Converting from SJIS to UTF-8

(a) Single-byte codes

Each code is converted to the corresponding UTF-8 code.
Kana characters are converted to triple-byte UTF-8 codes.
Space character (0x20) remains unchanged (0x20).
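The difference in length between single-byte SJIS codes and their UTF-8 counterparts can be observed with Python's standard codecs. This is only an illustration that assumes Python's `shift_jis` codec; it is not the product's conversion routine.

```python
# ASCII letters and the space character keep their single-byte codes.
letter = b"A".decode("shift_jis").encode("utf-8")
space = b"\x20".decode("shift_jis").encode("utf-8")
# A single-byte (half-width) kana code becomes a 3-byte UTF-8 code.
kana = b"\xb1".decode("shift_jis").encode("utf-8")

print(letter.hex(), space.hex(), kana.hex())
```

Because a kana character grows from one byte to three, the converted data can be longer than the source data.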

(b) Double-byte codes (standard character set)

Each code is converted to the corresponding standard character code of UTF-8.

(c) Double-byte codes (Gaiji codes)

Each code is converted according to a mapping table for converting character codes. If a code is not defined in the mapping table, it is regarded as being undefined and is converted to the value specified in the XTUNDEF environment variable. For details about mapping tables for converting character codes, see 4.2.3 Additional data extraction and import functions. For details about the environment variables, see 2.2.3 Specifying environment variables.
Space character (0x8140) is converted to space character (0xE38080).

(d) Double-byte codes (other than (b) or (c))

All other double-byte codes are regarded as being undefined and are converted to the value specified in the XTUNDEF environment variable. For details about the environment variables, see 2.2.3 Specifying environment variables.

(6) Converting from EUC to SJIS

(a) Single-byte codes ((1) - (4) in Figure 3-9)

Each code is converted to the corresponding SJIS code.
Kana characters are converted to single-byte SJIS codes.
Space character (0x20) remains unchanged (0x20).

(b) Double-byte codes (Standard Kanji codes; (6) in Figure 3-9)

Each code is converted to the corresponding code in the SJIS standard Kanji area.
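For codes in the standard area, the EUC code can be mapped to SJIS arithmetically through the underlying JIS row and column numbers. The following Python sketch shows the widely used formula. The function name `euc_to_sjis` is hypothetical, and the sketch assumes that both bytes lie in 0xA1-0xFE; the authoritative boundaries are those shown in Figure 3-9.

```python
def euc_to_sjis(e1: int, e2: int) -> bytes:
    """Convert one standard-area EUC double-byte code to its SJIS code."""
    if not (0xA1 <= e1 <= 0xFE and 0xA1 <= e2 <= 0xFE):
        raise ValueError("byte outside the range handled by this sketch")
    # Clearing the high bit yields the JIS code.
    j1, j2 = e1 & 0x7F, e2 & 0x7F
    # Two JIS rows share one SJIS lead byte; 0xA0-0xDF is skipped.
    s1 = (j1 - 0x21) // 2 + 0x81
    if s1 > 0x9F:
        s1 += 0x40
    if j1 % 2:
        # Odd row: lower half of the trail-byte range, skipping 0x7F.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1
    else:
        # Even row: upper half of the trail-byte range.
        s2 = j2 + 0x7E
    return bytes([s1, s2])

print(euc_to_sjis(0xB0, 0xA1).hex())
```

As a cross-check, the result for a given code can be compared with `bytes([e1, e2]).decode("euc_jp").encode("shift_jis")` in Python.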

(c) Double-byte codes (Gaiji area; (5) in Figure 3-9)

Each code is converted according to a mapping table for converting character codes. If a code is not defined in the mapping table, it is regarded as being undefined and is converted to the value specified in the XTUNDEF environment variable. For details about mapping tables for converting character codes, see 4.2.3 Additional data extraction and import functions. For details about the environment variables, see 2.2.3 Specifying environment variables.
Space character (0xa1a1) is converted to space character (0x8140).
Codes in code set 3 (3-byte codes that begin with 0x8F) may result in an error during data import to HiRDB.

(d) Double-byte codes (other than (b) or (c))

All other double-byte codes are regarded as being undefined and are converted to the value specified in the XTUNDEF environment variable. For details about the environment variables, see 2.2.3 Specifying environment variables.

(7) Converting from EUC to UTF-8

(a) Single-byte codes

Each code is converted to the corresponding UTF-8 code.
Space character (0x20) remains unchanged (0x20).

(b) Double-byte codes (standard character set)

Each code is converted to the corresponding standard character code of UTF-8.
Kana characters are converted to triple-byte UTF-8 codes.
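The growth of double-byte EUC kana codes to 3-byte UTF-8 codes can be observed with Python's standard codecs. This is only an illustration that assumes Python's `euc_jp` codec; it is not the product's conversion routine.

```python
# Half-width kana in EUC: 0x8E followed by the kana byte (2 bytes in total).
halfwidth = b"\x8e\xb1".decode("euc_jp").encode("utf-8")
# Full-width katakana in the standard area (2 bytes in EUC).
fullwidth = b"\xa5\xa2".decode("euc_jp").encode("utf-8")

print(halfwidth.hex(), fullwidth.hex())
```

Both results are three bytes long, so the converted data can be longer than the source data.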

(c) Double-byte codes (Gaiji code)

Each code is converted according to a mapping table for converting character codes. If a code is not defined in the mapping table, it is regarded as being undefined and is converted to the value specified in the XTUNDEF environment variable. For details about mapping tables for converting character codes, see 4.2.3 Additional data extraction and import functions. For details about the environment variables, see 2.2.3 Specifying environment variables.
Space character (0xa1a1) is converted to space character (0xE38080).

(d) Double-byte codes (other than (b) or (c))

All other double-byte codes are regarded as being undefined and are converted to the value specified in the XTUNDEF environment variable. For details about the environment variables, see 2.2.3 Specifying environment variables.

(8) Converting from UTF-8 to SJIS or EUC

(a) Single-byte codes

Each code is converted to the corresponding character code.
Space character (0x20) remains unchanged (0x20).

(b) Double-byte codes (standard character set)

Each code is converted to the corresponding character code.

(c) Double-byte codes (Gaiji codes)

Each code is converted according to a mapping table for converting character codes. If a code is not defined in the mapping table, it is regarded as being undefined and is converted to the value specified in the XTUNDEF environment variable. For details about mapping tables for converting character codes, see 4.2.3 Additional data extraction and import functions. For details about the environment variables, see 2.2.3 Specifying environment variables.
Conversion of space character (0xE38080) differs depending on the corresponding character code type.
- SJIS
  Space character (0xE38080) is converted to space character (0x8140).
- EUC
  Space character (0xE38080) is converted to space character (0xa1a1).
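The two results for the double-byte space character can be reproduced with Python's standard codecs. This is only a cross-check that assumes Python's `shift_jis` and `euc_jp` codecs; the helper name `convert_space` is hypothetical.

```python
def convert_space(target_codec: str) -> bytes:
    """Convert the UTF-8 double-byte space (0xE38080) to the target code set."""
    return b"\xe3\x80\x80".decode("utf-8").encode(target_codec)

print(convert_space("shift_jis").hex())
print(convert_space("euc_jp").hex())
```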

(d) Double-byte codes (other than (b) or (c))

All other double-byte codes are regarded as being undefined and are converted to the value specified in the XTUNDEF environment variable. For details about the environment variables, see 2.2.3 Specifying environment variables.