Proposal for C2x | |
---|---|
WG14 N2231 | |
Title: | char8_t: A type for UTF-8 characters and strings |
Author: | Tom Honermann <tom@honermann.net> |
Date: | 2018-03-25 |
Proposal category: | New features, change to existing features |
Target audience: | Developers working on combined C and C++ code bases |
Abstract: A proposal [WG21 P0482R1] currently under consideration for C++ adds a new char8_t fundamental type to be used as the code unit type of u8 string and character literals. This paper proposes a corresponding char8_t typedef and related library functions to enable conversions between the execution character encoding and UTF-8. These facilities are intended to improve support for UTF-8 and to retain source code compatibility across the C and C++ languages.
C11 introduced support for UTF-8, 16-bit, and 32-bit encoded string literals. New char16_t and char32_t typedefs were added to hold values of code units for the 16-bit and 32-bit variants, but a new type was not added for the UTF-8 variant. Instead, UTF-8 string literals were defined in terms of the char type used for the code unit type of ordinary string literals. UTF-8 is the only text encoding mandated to be supported by the C standard for which there is no distinctly named code unit type.
Whether char is a signed or unsigned type is implementation-defined. Implementations that use an 8-bit signed char are at a disadvantage when working with UTF-8 encoded text, because code must convert to an unsigned type in order to correctly process the leading and continuation code units of multi-byte encoded code points.
The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t typedef and related library enhancements intended to remove barriers to working with UTF-8 encoded text and to enable working with all five of the standard mandated text encodings in a consistent manner.
As of November 2017, UTF-8 is now used by more than 90% of all websites [W3Techs]. While UTF-8 now dominates websites, it has not attained similar usage success as the execution character encoding of C and C++ compilers. Important compilers, such as Microsoft's Visual Studio, do not support use of UTF-8 as the execution character encoding[*]. Programs that must consume and produce text in the execution character encoding and manipulate UTF-8 text must choose one of two approaches to managing text in these distinct encodings:
The challenge with the first approach is ensuring that text is appropriately transcoded and is in the correct encoding when passed to other functions. Since the same type, char, is used as the code unit type for both encodings, the programmer is unable to rely on the type system to help identify mistakes.
The challenge with the second approach is that UTF-8 string literals have type array of char. Direct comparisons with UTF-8 string literals are subject to sign mismatch (depending on the sign of char), and attempts to assign UTF-8 string literals directly to pointers to the desired code unit type result in assignment from incompatible pointer types (regardless of the sign of char).
The following example demonstrates a potential consequence of failure to manage character encodings correctly. The mb_utf8.c example incorrectly passes UTF-8 string literals to the "ANSI" version of the Windows MessageBox() function. This function requires strings to be provided in the system encoding (Windows-1252 on the Windows 10 system used to produce the output below). As shown, when run, mojibake is produced. The mb_utf16.c example is a correct program intended to demonstrate that Windows supports the example Unicode characters and is able to display them correctly. This example is intended to demonstrate that, though the mb_utf8.c code is incorrect, the compiler is unable to assist in diagnosing what is wrong.
Difficulty in managing multiple encodings with the same code unit type is not the only challenge posed by use of char as the UTF-8 code unit type. The following code exhibits implementation defined behavior.
_Bool is_utf8_multibyte_code_unit(char c) { return c >= 0x80; }
UTF-8 leading and continuation code units have values in the range 128 (0x80) to 255 (0xFF). In the common case where char is implemented as a signed 8-bit type with a two's complement representation and a range of -128 (-0x80) to 127 (0x7F), these values exceed the range of the char type. Such implementations typically store these code units as unsigned values that are then reinterpreted as signed values when read. In the code above, integral promotion rules result in c being promoted to type int for comparison to the 0x80 operand. If c holds a value corresponding to a leading or continuation code unit, then its value will be interpreted as negative and the promoted value of type int will likewise be negative. The result is that the comparison is always false for these implementations.
To correct the code above, explicit conversions are required. For example:
_Bool is_utf8_multibyte_code_unit(char c) { return ((unsigned char)c) >= 0x80; }
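The same conversion is needed wherever code units stored in char are inspected. As a minimal sketch, the hypothetical helper below (utf8_count_code_points is not part of this proposal) counts code points by skipping continuation code units. It assumes well-formed UTF-8 input and performs no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the code points in a null-terminated,
   well-formed UTF-8 string whose code units are stored in plain char. */
static size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Convert before testing bits so the result does not depend on
           whether char is signed. */
        unsigned char u = (unsigned char)*s;
        /* Continuation code units have the bit pattern 10xxxxxx; every
           other code unit starts a new code point. */
        if ((u & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Omitting the conversion would not change the result of this particular bit test on two's complement implementations, but it would for range comparisons such as the one shown above, which is why the conversion is best applied uniformly.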
Finally, no facilities are currently provided for transcoding between the execution character encoding and UTF-8.
The issues described above present significant challenges to working with UTF-8 encoded text. As the use of UTF-8 continues to rise, the ability to work well with UTF-8 text will only grow more important. The changes proposed in this paper are intended to address the above issues while retaining the ability to write source code that is compatible across C and C++.
[*]: Microsoft Visual Studio 2015 added /utf-8, /source-charset:utf-8, and /execution-charset:utf-8 options that enable use of UTF-8 as the execution character encoding, but in practice, these options are of limited use since the Windows platform SDK does not, in general, support UTF-8.
The proposed changes include:
The addition of the char8_t typedef is intended to support source code compatibility between C and C++ assuming the adoption of WG21 P0482R1 [WG21 P0482R1] by the C++ committee. Mutual adoption would enable the following code to be well-formed and portable for both languages while providing additional type safety and protection from implementation defined sign issues.
#include <uchar.h>

void use_utf8(const char8_t *p) {
  if (p && p[0] >= 0x80) {
    /* Handle UTF-8 lead or continuation code unit... */
  }
}

int main() {
  use_utf8(u8"text");
}
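To illustrate the sign-safety argument a little further, the sketch below classifies a lead code unit with plain range comparisons and no casts. The local typedef is an assumption standing in for the proposed <uchar.h> declaration, and both helper functions are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for the char8_t typedef this paper proposes for
   <uchar.h>; it is declared locally so the sketch does not depend on an
   implementation of the proposal. */
typedef unsigned char char8_t;

/* Hypothetical helper: number of code units in the sequence introduced by
   `lead`, or 0 if `lead` cannot start a well-formed UTF-8 sequence. */
static size_t utf8_sequence_length(char8_t lead)
{
    if (lead < 0x80) return 1;                   /* single code unit */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* two code units */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* three code units */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* four code units */
    return 0;  /* continuation units and values never used as a lead */
}

/* Hypothetical helper: length of the first sequence in a UTF-8 string. */
static size_t utf8_first_sequence_length(const char8_t *p)
{
    assert(p != NULL);
    return utf8_sequence_length(p[0]);
}
```

Because char8_t is unsigned, the comparisons behave the same way on every implementation. The same function written with a char parameter would need a conversion to unsigned char first.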
The changes proposed in this paper impact backward compatibility as a result of changing the type of UTF-8 string literals. There are two primary consequences:
These changes are a primary objective of this proposal. Implementations are encouraged to add options to disable char8_t support entirely when necessary to preserve compatibility with prior C language standards.
The proposed changes in the corresponding C++ WG21 P0482R1 [WG21 P0482R1] proposal have been implemented in a fork of gcc and are available on GitHub in the char8_t branch of the following repository:
The proposed changes in this paper are being implemented in forks of gcc and glibc, but are not yet complete. Once completed, they will be available in the char8_t branches of the following repositories:
The new gcc -fchar8_t and -fno-char8_t compiler options support enabling and disabling the new features. No backward compatibility features are currently implemented.
These changes are relative to the ISO/IEC 9899:2017 committee draft as of 2018-03-17.
Additional updates will be necessary if WG14 N2198 [WG14 N2198] is adopted.
Change in 6.4.5 (String Literals) paragraph 6:
[…] For UTF-8 string literals, the array elements have type char8_t [replacing char], and are initialized with the characters of the multibyte character sequence, as encoded in UTF–8. […]
Change in 6.7.9 (Initialization) paragraph 14:
An array of character type may be initialized by a character string literal [deleted: or UTF-8 string literal], optionally enclosed in braces. Successive bytes of the string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.
Drafting note: The changes to 6.7.9p14 affect backward compatibility by removing the ability to initialize an array of character type with a UTF-8 string literal. This is an intentional change made to align with the changes to C++ proposed in WG21 P0482R1 [WG21 P0482R1].
Insert a new paragraph after 6.7.9 (Initialization) paragraph 14:
An array with element type compatible with a qualified or unqualified version of char8_t may be initialized by a UTF-8 string literal, optionally enclosed in braces. Successive bytes of the string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.
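As an informal illustration of the initialization rule above (not part of the proposed wording), the sketch below initializes an array of char8_t from a UTF-8 string literal. The local typedef is an assumption standing in for the proposed <uchar.h> declaration, and the names euro, euro_size, and euro_is_multibyte are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for the char8_t typedef proposed for <uchar.h>. */
typedef unsigned char char8_t;

/* U+20AC EURO SIGN is encoded in UTF-8 as E2 82 AC. The array size is
   taken from the literal, so the terminating null character is included. */
static const char8_t euro[] = u8"\u20AC";

/* Three code units plus the terminating null character. */
static_assert(sizeof euro == 4, "unexpected UTF-8 encoding length");

static size_t euro_size(void) { return sizeof euro; }

/* No cast is needed: the elements are unsigned. */
static int euro_is_multibyte(void) { return euro[0] >= 0x80; }

/* Under the proposed change to 6.7.9p14, `char c[] = u8"...";` would no
   longer be accepted. It is left out so that the sketch compiles both
   with and without the proposal. */
```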
Change in 6.10.8.2 (Environment macros) paragraph 1:
The following macro names are conditionally defined by the implementation:
[…]
__STDC_UTF_8__ The integer constant 1, intended to indicate that values of type char8_t are UTF-8 encoded. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
__STDC_UTF_16__ The integer constant 1, intended to indicate that values of type char16_t are UTF-16 encoded. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
__STDC_UTF_32__ The integer constant 1, intended to indicate that values of type char32_t are UTF-32 encoded. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
[…]
Change in 7.17.1 (Introduction) paragraph 3:
The macros defined are the atomic lock-free macros
ATOMIC_BOOL_LOCK_FREE
ATOMIC_CHAR_LOCK_FREE
ATOMIC_CHAR8_T_LOCK_FREE
ATOMIC_CHAR16_T_LOCK_FREE
ATOMIC_CHAR32_T_LOCK_FREE
ATOMIC_WCHAR_T_LOCK_FREE
ATOMIC_SHORT_LOCK_FREE
ATOMIC_INT_LOCK_FREE
ATOMIC_LONG_LOCK_FREE
ATOMIC_LLONG_LOCK_FREE
ATOMIC_POINTER_LOCK_FREE
[…]
Change in 7.17.6 (Atomic integer types) paragraph 1:
For each line in the following table,261) the atomic type name is declared as a type that has the same representation and alignment requirements as the corresponding direct type.262)
Atomic type name | Direct type |
---|---|
[…] | […] |
atomic_ullong | _Atomic unsigned long long |
atomic_char8_t | _Atomic char8_t |
atomic_char16_t | _Atomic char16_t |
atomic_char32_t | _Atomic char32_t |
atomic_wchar_t | _Atomic wchar_t |
[…] | […] |
Change in 7.28 (Unicode utilities <uchar.h>) paragraph 2:
The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19); char8_t which is an unsigned integer type used for 8-bit characters and is the same type as unsigned char; and char16_t which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and char32_t which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (described in 7.20.1.2).
Insert a new subclause before 7.28.1.1 (The mbrtoc16 function):
7.28.1.1 The mbrtoc8 function
Add a new paragraph 1:
Synopsis
#include <uchar.h>
size_t mbrtoc8(char8_t * restrict pc8,
const char * restrict s, size_t n,
mbstate_t * restrict ps);
Add a new paragraph 2:
Description
If s is a null pointer, the mbrtoc8 function is equivalent to the call:
mbrtoc8(NULL, "", 1, ps)
In this case, the values of the parameters pc8 and n are ignored.
Add a new paragraph 3:
If s is not a null pointer, the mbrtoc8 function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). If the function determines that the next multibyte character is complete and valid, it determines the values of the corresponding characters and then, if pc8 is not a null pointer, stores the value of the first (or only) such character in the object pointed to by pc8. Subsequent calls will store successive characters without consuming any additional input until all the characters have been stored. If the corresponding character is the null character, the resulting state described is the initial conversion state.
Add a new paragraph 4:
Returns
The mbrtoc8 function returns the first of the following that applies (given the current conversion state):
0 if the next n or fewer bytes complete the multibyte character that corresponds to the null character (which is the value stored).
between 1 and n inclusive if the next n or fewer bytes complete a valid multibyte character (which is the value stored); the value returned is the number of bytes that complete the multibyte character.
(size_t) (−3) if the next character resulting from a previous call has been stored (no bytes from the input have been consumed by this call).
(size_t) (−2) if the next n bytes contribute to an incomplete (but potentially valid) multibyte character, and all n bytes have been processed (no value is stored). Footnote)
(size_t) (−1) if an encoding error occurs, in which case the next n or fewer bytes do not contribute to a complete and valid multibyte character (no value is stored); the value of the macro EILSEQ is stored in errno, and the conversion state is unspecified.
Add a new footnote for the reference in paragraph 4 above:
Footnote) When n has at least the value of the MB_CUR_MAX macro, this case can only occur if s points at a sequence of redundant shift sequences (for implementations with state-dependent encodings).
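The following is an informal illustration and not part of the proposed wording. It sketches how a caller might drive a function with the mbrtoc8 interface described above, including the (size_t)(−3) case in which a code unit is stored without consuming input. Since an implementation of mbrtoc8 may not be available, the loop takes the conversion function as a parameter. latin1_mbrtoc8 is a hypothetical stand-in that assumes ISO-8859-1 input, ignores ps, and keeps its pending code unit in static storage. The names to_utf8 and mbrtoc8_fn and the local char8_t typedef are likewise assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>   /* mbstate_t */

/* Assumption: stand-in for the char8_t typedef proposed for <uchar.h>. */
typedef unsigned char char8_t;

/* Same shape as the proposed mbrtoc8, so a stand-in can be passed in. */
typedef size_t (*mbrtoc8_fn)(char8_t *pc8, const char *s, size_t n,
                             mbstate_t *ps);

/* Hypothetical stand-in, NOT the proposed function: it treats the input as
   ISO-8859-1, ignores ps, and keeps one pending code unit in static storage. */
static size_t latin1_mbrtoc8(char8_t *pc8, const char *s, size_t n,
                             mbstate_t *ps)
{
    static char8_t pending;            /* 0 means nothing is pending */
    (void)ps;
    if (pending != 0) {                /* hand out the stored second unit */
        if (pc8) *pc8 = pending;
        pending = 0;
        return (size_t)-3;
    }
    if (s == NULL) return 0;           /* behaves like converting "" */
    if (n == 0) return (size_t)-2;     /* nothing to inspect yet */
    unsigned char b = (unsigned char)*s;
    if (b < 0x80) {                    /* ASCII maps to itself */
        if (pc8) *pc8 = b;
        return b == 0 ? 0 : 1;
    }
    if (pc8) *pc8 = (char8_t)(0xC0 | (b >> 6));   /* lead code unit */
    pending = (char8_t)(0x80 | (b & 0x3F));       /* continuation unit */
    return 1;
}

/* Converts the null-terminated string `in` to UTF-8 in `out`, which has
   room for `cap` code units including the terminator. Returns the number
   of code units written, not counting the terminator, or (size_t)-1 on an
   encoding error or insufficient space. */
static size_t to_utf8(mbrtoc8_fn conv, const char *in, char8_t *out,
                      size_t cap)
{
    assert(conv != NULL && in != NULL && out != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);         /* initial conversion state */
    size_t left = strlen(in) + 1;      /* let the converter see the terminator */
    size_t used = 0;
    for (;;) {
        char8_t c8;
        size_t r = conv(&c8, in, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2 || used == cap)
            return (size_t)-1;
        out[used] = c8;
        if (r == 0)
            return used;               /* terminator stored, conversion done */
        ++used;
        if (r != (size_t)-3) {         /* (size_t)-3 consumes no input */
            in += r;
            left -= r;
        }
    }
}
```

With the stand-in, the ISO-8859-1 byte 0xE9 yields the two code units 0xC3 and 0xA9: the first call returns 1 and the second returns (size_t)(−3).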
Insert another new subclause before 7.28.1.1 (The mbrtoc16 function):
7.28.1.2 The c8rtomb function
Add a new paragraph 1:
Synopsis
#include <uchar.h>
size_t c8rtomb(char * restrict s, char8_t c8,
mbstate_t * restrict ps);
Add a new paragraph 2:
Description
If s is a null pointer, the c8rtomb function is equivalent to the call
c8rtomb(buf, '\0', ps)
where buf is an internal buffer.
Drafting note: If WG14 N2198 [WG14 N2198] is adopted, the character literal in paragraph 2 above should be changed from '\0' to u8'\0'.
Add a new paragraph 3:
If s is not a null pointer, the c8rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the character given or completed by c8 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s, or stores nothing if c8 does not represent a complete character. At most MB_CUR_MAX bytes are stored. If c8 is a null character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state; the resulting state described is the initial conversion state.
Drafting note: The wording in paragraph 3 above includes the proposed wording updates from WG14 DR 488 [WG14 DR 488].
Add a new paragraph 4:
Returns
The c8rtomb function returns the number of bytes stored in the array object (including any shift sequences). When c8 is not a valid character, an encoding error occurs: the function stores the value of the macro EILSEQ in errno and returns (size_t) (−1); the conversion state is unspecified.
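The following is an informal illustration and not part of the proposed wording. It sketches how a caller might feed UTF-8 code units one at a time to a function with the c8rtomb interface described above, where a return value of 0 means that nothing was stored because the character is not yet complete. latin1_c8rtomb is a hypothetical stand-in that produces ISO-8859-1, accepts only U+0000 through U+00FF, ignores ps, and keeps its pending lead code unit in static storage. The names from_utf8 and c8rtomb_fn and the local char8_t typedef are likewise assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>  /* MB_LEN_MAX */
#include <stddef.h>
#include <string.h>
#include <wchar.h>   /* mbstate_t */

/* Assumption: stand-in for the char8_t typedef proposed for <uchar.h>. */
typedef unsigned char char8_t;

/* Same shape as the proposed c8rtomb, so a stand-in can be passed in. */
typedef size_t (*c8rtomb_fn)(char *s, char8_t c8, mbstate_t *ps);

/* Hypothetical stand-in, NOT the proposed function: it produces
   ISO-8859-1, ignores ps, and keeps a pending lead unit in static storage. */
static size_t latin1_c8rtomb(char *s, char8_t c8, mbstate_t *ps)
{
    static char8_t lead;               /* pending lead unit, 0 if none */
    char internal[1];
    (void)ps;
    if (s == NULL) { s = internal; c8 = 0; }
    if (lead != 0) {                   /* expecting a continuation unit */
        char8_t l = lead;
        lead = 0;
        if ((c8 & 0xC0) != 0x80) { errno = EILSEQ; return (size_t)-1; }
        *s = (char)(((l & 0x03) << 6) | (c8 & 0x3F));
        return 1;
    }
    if (c8 < 0x80) { *s = (char)c8; return 1; }             /* ASCII */
    if (c8 == 0xC2 || c8 == 0xC3) { lead = c8; return 0; }  /* incomplete */
    errno = EILSEQ;                    /* outside ISO-8859-1 or ill-formed */
    return (size_t)-1;
}

/* Converts the null-terminated UTF-8 string `in` to `out`, which has room
   for `cap` bytes including the terminator. Returns the number of bytes
   written, not counting the terminator, or (size_t)-1 on error. */
static size_t from_utf8(c8rtomb_fn conv, const char8_t *in, char *out,
                        size_t cap)
{
    assert(conv != NULL && in != NULL && out != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);         /* initial conversion state */
    size_t used = 0;
    for (;; ++in) {
        char tmp[MB_LEN_MAX];          /* at most MB_CUR_MAX bytes are stored */
        size_t r = conv(tmp, *in, &st);
        if (r == (size_t)-1 || r > cap - used)
            return (size_t)-1;
        memcpy(out + used, tmp, r);    /* r is 0 while a character is incomplete */
        used += r;
        if (*in == 0)
            return used - 1;           /* do not count the null byte */
    }
}
```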
Change in B.16 (Atomics <stdatomic.h>)
[…]
ATOMIC_CHAR_LOCK_FREE
ATOMIC_CHAR8_T_LOCK_FREE
ATOMIC_CHAR16_T_LOCK_FREE
ATOMIC_CHAR32_T_LOCK_FREE
ATOMIC_WCHAR_T_LOCK_FREE
[…]
atomic_ullong
atomic_char8_t
atomic_char16_t
atomic_char32_t
atomic_wchar_t
[…]
Change in B.27 (Unicode utilities <uchar.h>)
mbstate_t size_t char8_t char16_t char32_t
size_t mbrtoc8(char8_t * restrict pc8,
const char * restrict s, size_t n,
mbstate_t * restrict ps);
size_t c8rtomb(char * restrict s, char8_t c8,
mbstate_t * restrict ps);
size_t mbrtoc16(char16_t * restrict pc16,
const char * restrict s, size_t n,
mbstate_t * restrict ps);
size_t c16rtomb(char * restrict s, char16_t c16,
mbstate_t * restrict ps);
size_t mbrtoc32(char32_t * restrict pc32,
const char * restrict s, size_t n,
mbstate_t * restrict ps);
size_t c32rtomb(char * restrict s, char32_t c32,
mbstate_t * restrict ps);
Change in J.3.4 (Characters):
[…]
— The encoding of any of wchar_t, char8_t, char16_t, and char32_t where the corresponding standard encoding macro (__STDC_ISO_10646__, __STDC_UTF_8__, __STDC_UTF_16__, or __STDC_UTF_32__) is not defined (6.10.8.2).
Thank you to Aaron Ballman for his kind assistance facilitating interaction with WG14.
[W3Techs] | "Usage of UTF-8 for websites", W3Techs, 2017. https://w3techs.com/technologies/details/en-utf8/all/all |
[WG21 P0482R1] | Tom Honermann, "char8_t: A type for UTF-8 characters and strings (Revision 1)", P0482R1, 2018. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html |
[WG14 N2198] | Aaron Ballman, "Adding the u8 character prefix", N2198, 2017. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf |
[WG14 DR 488] | "c16rtomb() on wide characters encoded as multiple char16_t", DR 488, 2016. http://www.open-std.org/jtc1/sc22/WG14/www/docs/summary.htm#dr_488 |