The original concept for these files was for ICU to allow, in principle, setting which UTF (UTF-8/16/32) is used internally by defining UTF_SIZE to either 8, 16, or 32. utf.h would then define the UChar type accordingly. UTF-16 was the default.
This concept has been abandoned. A lot of the ICU source code assumes UChar strings are in UTF-16. This is especially true for low-level code like conversion, normalization, and collation. The utf.h header enforces the default of UTF-16. The UTF-8 and UTF-32 macros remain for now for completeness and backward compatibility.
Accordingly, utf.h defines UChar to be an unsigned 16-bit integer. If this matches wchar_t, then UChar is defined to be exactly wchar_t, otherwise uint16_t.
UChar32 is defined to be a signed 32-bit integer (int32_t), large enough for a 21-bit Unicode code point (Unicode scalar value, 0..0x10ffff). Before ICU 2.4, the definition of UChar32 was platform-dependent in a similar way to the definition of UChar. For details see the documentation for UChar32 itself.
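The two type choices above can be sketched as follows. This is only an illustration under stated assumptions: the names DemoUChar, DemoUChar32, and demo_is_code_point are hypothetical and are not ICU definitions, and the wchar_t test via WCHAR_MIN/WCHAR_MAX is one possible way to detect an unsigned 16-bit wchar_t, not necessarily the check ICU uses.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical stand-in for UChar: reuse wchar_t only if it is an
 * unsigned 16-bit type, otherwise fall back to uint16_t. */
#if WCHAR_MIN == 0 && WCHAR_MAX == 0xffff
typedef wchar_t DemoUChar;
#else
typedef uint16_t DemoUChar;
#endif

/* Hypothetical stand-in for UChar32: a signed 32-bit integer. */
typedef int32_t DemoUChar32;

/* A value is a code point if it lies in 0..0x10ffff (21 bits suffice). */
static int demo_is_code_point(DemoUChar32 c) {
    return c >= 0 && c <= 0x10ffff;
}

/* Sanity check on the chosen code unit type. */
static void demo_check_types(void) {
    assert(sizeof(DemoUChar) == 2);
}
```

Either branch of the conditional yields a 16-bit unsigned code unit type, so code written against DemoUChar behaves the same on both kinds of platform.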
utf.h also defines a number of C macros for handling single Unicode code points and for using UTF Unicode strings. It includes utf8.h, utf16.h, and utf32.h for the actual implementations of those macros and then aliases one set of them (for UTF-16) for general use. The UTF-specific macros have the UTF size in the macro name prefixes (UTF16_...), while the general alias macros always begin with UTF_...
Many string operations can be done with or without error checking. Where such a distinction is useful, there are two versions of the macros, "unsafe" and "safe" ones, with ..._UNSAFE and ..._SAFE suffixes. The unsafe macros are fast but may cause program failures if the strings are not well-formed. The safe macros have an additional boolean parameter "strict". If strict is false, then only illegal sequences are detected. Otherwise, irregular sequences and non-characters (like single surrogates) are detected as well. Safe macros return special error code points for illegal/irregular sequences: typically U+ffff, or values that would result in a code unit sequence of the same length as the erroneous input sequence. Note that _UNSAFE macros have fewer parameters: they do not have the strictness parameter, and they do not have start/length parameters for boundary checking.
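The difference between the unsafe and safe variants can be sketched with two plain C functions for UTF-16. This is a minimal illustration, not the ICU macros themselves: demo_next_unsafe, demo_next_safe, DEMO_ERROR_VALUE, and demo_is_noncharacter are hypothetical names, and returning U+ffff for every error is a simplifying assumption based on the "typically U+ffff" remark above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ERROR_VALUE 0xffff  /* assumed error code point */
#define DEMO_SURROGATE_OFFSET ((0xd800 << 10) + 0xdc00 - 0x10000)

static int demo_is_noncharacter(int32_t c) {
    return (c >= 0xfdd0 && c <= 0xfdef) || (c & 0xfffe) == 0xfffe;
}

/* Unsafe: no bounds and no strictness parameter. Assumes well-formed
 * UTF-16, so a lead surrogate is always followed by a trail surrogate. */
static int32_t demo_next_unsafe(const uint16_t *s, int32_t *i) {
    int32_t c = s[(*i)++];
    if ((c & 0xfc00) == 0xd800) {
        c = (c << 10) + s[(*i)++] - DEMO_SURROGATE_OFFSET;
    }
    return c;
}

/* Safe: checks the string boundary before reading a trail surrogate.
 * Non-strict passes an unpaired surrogate through unchanged; strict
 * reports unpaired surrogates and non-characters as errors. */
static int32_t demo_next_safe(const uint16_t *s, int32_t *i,
                              int32_t length, int strict) {
    assert(*i >= 0 && *i < length);
    int32_t c = s[(*i)++];
    if ((c & 0xf800) == 0xd800) {            /* any surrogate */
        if ((c & 0x400) == 0 && *i < length && (s[*i] & 0xfc00) == 0xdc00) {
            c = (c << 10) + s[(*i)++] - DEMO_SURROGATE_OFFSET;
        } else if (strict) {
            return DEMO_ERROR_VALUE;         /* unpaired surrogate */
        }
    }
    if (strict && demo_is_noncharacter(c)) {
        return DEMO_ERROR_VALUE;
    }
    return c;
}
```

On the well-formed pair 0xd83d 0xde00 both functions yield the same code point. On a lead surrogate at the end of the buffer, only the safe version avoids reading past the boundary.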
Here, the macros are aliased in two steps: In the first step, the UTF-specific macros with UTF16_ prefix and _UNSAFE and _SAFE suffixes are aliased according to the UTF_SIZE to macros with UTF_ prefix and the same suffixes and signatures. Then, in a second step, the default, general alias macros are set to use either the unsafe or the safe/not strict (default) or the safe/strict macro; these general macros do not have a strictness parameter.
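The two aliasing steps can be pictured with the following sketch. All DEMO names are hypothetical and only mirror the layering described above; the real macro names, signatures, and bodies in utf.h and utf16.h may differ. To keep the example short, the strict path here only reports unpaired surrogates.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ERROR_VALUE 0xffff  /* assumed error code point */
#define DEMO_SURROGATE_OFFSET ((0xd800 << 10) + 0xdc00 - 0x10000)

/* UTF-16-specific layer (stands in for UTF16_..._UNSAFE / _SAFE). */
#define DEMO16_NEXT_CHAR_UNSAFE(s, i, c) { \
    (c) = (s)[(i)++]; \
    if (((c) & 0xfc00) == 0xd800) { \
        (c) = ((c) << 10) + (s)[(i)++] - DEMO_SURROGATE_OFFSET; \
    } \
}
#define DEMO16_NEXT_CHAR_SAFE(s, i, length, c, strict) { \
    (c) = (s)[(i)++]; \
    if (((c) & 0xf800) == 0xd800) { \
        if (((c) & 0x400) == 0 && (i) < (length) && \
            ((s)[(i)] & 0xfc00) == 0xdc00) { \
            (c) = ((c) << 10) + (s)[(i)++] - DEMO_SURROGATE_OFFSET; \
        } else if (strict) { \
            (c) = DEMO_ERROR_VALUE; \
        } \
    } \
}

/* Step 1: same suffixes and signatures, generic prefix. */
#define DEMO_NEXT_CHAR_UNSAFE(s, i, c) DEMO16_NEXT_CHAR_UNSAFE(s, i, c)
#define DEMO_NEXT_CHAR_SAFE(s, i, length, c, strict) \
    DEMO16_NEXT_CHAR_SAFE(s, i, length, c, strict)

/* Step 2: the general alias has no strictness parameter. */
#if defined(DEMO_UNSAFE)
#   define DEMO_NEXT_CHAR(s, i, length, c) DEMO_NEXT_CHAR_UNSAFE(s, i, c)
#elif defined(DEMO_STRICT)
#   define DEMO_NEXT_CHAR(s, i, length, c) DEMO_NEXT_CHAR_SAFE(s, i, length, c, 1)
#else
#   define DEMO_NEXT_CHAR(s, i, length, c) DEMO_NEXT_CHAR_SAFE(s, i, length, c, 0)
#endif

/* Usage: count code points through the general alias only. */
static int32_t demo_count(const uint16_t *s, int32_t length) {
    int32_t i = 0, n = 0, c;
    assert(length >= 0);
    while (i < length) {
        DEMO_NEXT_CHAR(s, i, length, c);
        (void)c;
        ++n;
    }
    return n;
}
```

Because callers such as demo_count use only the general alias, switching between unsafe, safe/not strict, and safe/strict behavior is a compile-time definition rather than a change at every call site.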
It is possible to change the default choice for the general alias macros to be unsafe, safe/not strict, or safe/strict. The default is safe/not strict. It is not recommended to select the unsafe macros as the basis for Unicode string handling in ICU! To select a variant, define UTF_SAFE, UTF_STRICT, or UTF_UNSAFE.
For general use, one should use the default, general macros with UTF_ prefix and no _SAFE/_UNSAFE suffix. Only in some cases may it be necessary to control the choice of macro directly and use a less generic alias. For example, if it can be assumed that a string is well-formed and the index will stay within the bounds, then the _UNSAFE version may be used. If a UTF-8 string is to be processed, then the macros with UTF8_ prefixes need to be used.
Usage: ICU coding guidelines for if() statements should be followed when using these macros. Compound statements (curly braces {}) must be used for if-else-while... bodies and all macro statements should be terminated with semicolon.
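The reason for this guideline shows up with any macro that expands to a brace block. The sketch below uses a hypothetical macro, DEMO_SKIP_UNIT, written in that style; it is not an ICU macro and assumes well-formed UTF-16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical multi-statement macro that expands to a brace block:
 * advance i past one code point. */
#define DEMO_SKIP_UNIT(s, i) { \
    if (((s)[(i)++] & 0xfc00) == 0xd800) { ++(i); } \
}

static int32_t demo_advance(const uint16_t *s, int32_t i, int32_t length) {
    assert(i >= 0);
    /* Written without braces, as in
     *     if (i < length) DEMO_SKIP_UNIT(s, i); else i = length;
     * the ';' after the macro's closing '}' would end the if statement,
     * and the following 'else' would be a syntax error. */
    if (i < length) {
        DEMO_SKIP_UNIT(s, i);
    } else {
        i = length;
    }
    return i;
}
```

With the braces in place, the terminating semicolon after the macro is just an empty statement inside the compound body, so the if-else compiles and reads like ordinary code.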