Processing Strings in Unicode Systems

Overview

1. Statements for processing strings
2. Comparison operators for processing byte and character strings
3. Functions for processing byte and character strings
4. Output in fields and lists

String processing statements, whose arguments were previously all interpreted as fields of type C, are now divided into statements with character-type arguments and statements with byte-type arguments.

1. Statements for processing strings

For the statements that process byte and character strings, the addition IN BYTE MODE or IN CHARACTER MODE specifies whether byte or character string processing is executed. In character string processing, the arguments must be single fields of type C, N, D, T, or STRING; purely character-type structures are also permitted. A syntax or runtime error occurs if arguments of a different type are passed. The processing of byte strings, that is, operands of type X or XSTRING, offers a subset of the functions available for processing character strings.

It is not possible to mix the processing types. For example, the statement CONCATENATE a x b INTO c is not possible if a, b, and c are character-type but x has type X.
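The following is a minimal sketch of the two processing modes, using CONCATENATE as the example. The report name and all variable names are placeholders chosen for illustration.

```abap
REPORT zdemo_concat_modes.

DATA: txt1   TYPE c LENGTH 3 VALUE 'ABC',
      txt2   TYPE c LENGTH 3 VALUE 'DEF',
      result TYPE string,
      hex1   TYPE x LENGTH 2 VALUE 'FF00',
      hex2   TYPE x LENGTH 2 VALUE '0A0B',
      xres   TYPE xstring.

* Character string processing: all operands are character-type.
CONCATENATE txt1 txt2 INTO result IN CHARACTER MODE.   " result = 'ABCDEF'

* Byte string processing: all operands are of type X or XSTRING.
CONCATENATE hex1 hex2 INTO xres IN BYTE MODE.          " xres = FF000A0B

* Mixing the processing types is rejected in a Unicode program:
* CONCATENATE txt1 hex1 txt2 INTO result.

WRITE: / result, / xres.
```

If the mode is not specified, character string processing is assumed, so the first statement behaves the same without the addition.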

CLEAR ... WITH
CONCATENATE
CONDENSE
CONVERT TEXT ... INTO SORTABLE CODE
FIND
OVERLAY
REPLACE
SEARCH
SHIFT
SPLIT
TRANSLATE ... TO UPPER/LOWER CASE
TRANSLATE ... USING

In the TRANSLATE statement, the additions FROM CODE PAGE and FROM NUMBER FORMAT are not allowed in a Unicode program. New conversion classes are provided to replace them.
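The text above does not name the conversion classes. As an assumption, the sketch below uses CL_ABAP_CONV_OUT_CE and CL_ABAP_CONV_IN_CE, which are available in many releases for converting between character data and a byte representation in a given code page. The report name, variable names, and the choice of UTF-8 are illustrative only; check the class documentation in your release before relying on the parameters shown.

```abap
REPORT zdemo_conv_classes.

DATA: text     TYPE string VALUE 'ABAP',
      buffer   TYPE xstring,
      decoded  TYPE string,
      conv_out TYPE REF TO cl_abap_conv_out_ce,
      conv_in  TYPE REF TO cl_abap_conv_in_ce.

* Character data to bytes in an explicitly named code page (assumed: UTF-8).
conv_out = cl_abap_conv_out_ce=>create( encoding = 'UTF-8' ).
conv_out->convert( EXPORTING data = text IMPORTING buffer = buffer ).

* Bytes back to character data.
conv_in = cl_abap_conv_in_ce=>create( encoding = 'UTF-8' input = buffer ).
conv_in->read( IMPORTING data = decoded ).

WRITE: / decoded.
```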

2. Comparison operators for processing byte and character strings

As with the statements for character string processing, these operators require single fields of type C, N, D, T, or STRING as arguments; purely character-type structures are also allowed.

CO
CN
CA
NA
CS
NS
CP
NP

Special comparison operators defined with the BYTE- prefix are provided for byte strings.
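A short sketch of both operator families follows. The report name, variable names, and values are made up for illustration; BYTE-CO and BYTE-CS are used as two examples of the operators with the BYTE- prefix.

```abap
REPORT zdemo_string_compare.

DATA: text  TYPE string VALUE 'ABAP',
      bytes TYPE x LENGTH 3 VALUE 'FF00FF',
      mask  TYPE x LENGTH 2 VALUE '00FF'.

* Character operators: both operands are character-type.
IF text CO 'ABP'.          " text contains only the characters A, B, and P
  WRITE / 'CO: true'.
ENDIF.
IF text CS 'BA'.           " text contains the substring BA
  WRITE / 'CS: true'.
ENDIF.
IF text CP 'A*P'.          " text matches the pattern A*P
  WRITE / 'CP: true'.
ENDIF.

* Byte operators: both operands are of type X or XSTRING.
IF bytes BYTE-CO mask.     " bytes contains only the bytes 00 and FF
  WRITE / 'BYTE-CO: true'.
ENDIF.
IF bytes BYTE-CS mask.     " bytes contains the byte sequence 00FF
  WRITE / 'BYTE-CS: true'.
ENDIF.
```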

3. Functions for processing byte and character strings

The STRLEN function only works with character-type fields and returns the length in characters. The new function, XSTRLEN, finds the length of byte strings.
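As a small illustration of the two length functions, with placeholder names and values:

```abap
REPORT zdemo_strlen.

DATA: text  TYPE c LENGTH 10 VALUE 'Hello',
      bytes TYPE xstring,
      len_c TYPE i,
      len_x TYPE i.

bytes = 'CAFEBABE'.          " four bytes: CA FE BA BE

len_c = strlen( text ).      " 5: trailing blanks of a type C field are not counted
len_x = xstrlen( bytes ).    " 4: length in bytes

* len_c = strlen( bytes ).   " not allowed: STRLEN needs a character-type argument

WRITE: / len_c, / len_x.
```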

In a non-Unicode system, the CHARLEN function returns the value 1 for a character-type field that begins with a single-byte character and the value 2 for a character-type field that begins with a double-byte character. In Unicode systems, the returned value is generally 1. The value 2 is returned only for character-type fields that begin with a Unicode double character from the surrogate area.

The NUMOFCHAR function returns the number of characters in a character-type field. In Unicode systems and in non-Unicode systems with single-byte code pages, this function returns the same result as STRLEN. The results differ only in non-Unicode systems with double-byte code pages: there, NUMOFCHAR counts a character that takes up more than 1 byte as length 1, while STRLEN counts it as 2.
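The sketch below calls CHARLEN, NUMOFCHAR, and STRLEN on the same text. The names are placeholders, and the values in the comments assume a Unicode system and a text without surrogate characters.

```abap
REPORT zdemo_charlen.

DATA: text      TYPE string VALUE 'ABAP',
      first_len TYPE i,
      chars     TYPE i,
      units     TYPE i.

first_len = charlen( text ).    " 1: length of the first character
chars     = numofchar( text ).  " 4: number of characters
units     = strlen( text ).     " 4: same as NUMOFCHAR in a Unicode system

WRITE: / first_len, / chars, / units.
```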

4. Output in fields and lists

Previously, WRITE ... TO accepted any flat data type as the target and handled it like a C field. In a Unicode program, the target field of a WRITE ... TO ... statement must be character-type. For the table variant WRITE ... TO itab INDEX idx, the line type of the table must be character-type. The offset and length are counted in characters.
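A minimal sketch of WRITE ... TO with a character-type target and an offset and length specification. The report name, variable names, and values are illustrative; the formatted date depends on the user settings, so no output is shown.

```abap
REPORT zdemo_write_to.

DATA: today  TYPE d VALUE '20240131',
      target TYPE c LENGTH 30,
      raw    TYPE x LENGTH 30.

target = 'Date:'.

* Offset 6 and length 10 are counted in characters.
WRITE today TO target+6(10).

* WRITE today TO raw.        " not allowed in a Unicode program: raw has type X

WRITE / target.
```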

Previously, any flat structure could be displayed using WRITE. In a Unicode system, a flat structure used as the source field of a WRITE statement must be purely character-type. This affects the following statements:

WRITE f.
WRITE f TO g[+off][(len)].
WRITE (name) TO g.
WRITE f TO itab[+off][(len)] INDEX idx.
WRITE (name) TO itab[+off][(len)] INDEX idx.
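To illustrate the restriction on structures as source fields, the sketch below contrasts a purely character-type structure with one that contains a numeric component. All type, component, and variable names are hypothetical.

```abap
REPORT zdemo_write_struct.

TYPES: BEGIN OF ty_char_only,
         name  TYPE c LENGTH 10,
         birth TYPE d,
       END OF ty_char_only,
       BEGIN OF ty_mixed,
         name TYPE c LENGTH 10,
         cnt  TYPE i,
       END OF ty_mixed.

DATA: char_only TYPE ty_char_only,
      mixed     TYPE ty_mixed.

char_only-name  = 'SMITH'.
char_only-birth = '20240131'.
mixed-name      = 'JONES'.
mixed-cnt       = 3.

* Allowed: every component is character-type (C and D).
WRITE / char_only.

* WRITE / mixed.     " not allowed in a Unicode program: cnt has type I

* The components of the mixed structure can still be written one by one.
WRITE: / mixed-name, mixed-cnt.
```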