May 15th, 2021
Document: n2728
Previous Revisions: None
Audience: WG14
Proposal Category: Change Request
Target Audience: General Developers, Library Developers
Latest Revision: https://thephd.dev/_vendor/future_cxx/papers/C%20-%20char16_t%20&%20char32_t%20string%20literals%20shall%20be%20UTF-16%20&%20UTF-32.html
This paper closes an as-yet unused degree of freedom in string literal representations. We observe that wide string literals and narrow string literals already provide the implementation-defined flexibility that vendors need for choosing an arbitrary representation, and that no vendor (that we currently know of) has taken advantage of such flexibility for char16_t and char32_t string literals. Therefore, we hope to settle on UTF-16 and UTF-32 for char16_t and char32_t string literals, similar to how u8 string literals are defined to be UTF-8.
Of all the string literal types, only one of them carries a well-defined encoding according to the standard:
- char/narrow/“multibyte” strings and string literals - narrow locale encoding or execution encoding, implementation-defined;
- wchar_t/wide strings and string literals - wide locale encoding or wide execution encoding, implementation-defined and tied to mbstowcs;
- u8/char strings and string literals - UTF-8 encoding (when not confused for normal char literals by the type system);
- u/char16_t strings and string literals - implementation-defined encoding tied to mbrtoc16;
- U/char32_t strings and string literals - implementation-defined encoding tied to mbrtoc32.

Narrow/multibyte strings and literals have uncountably many applied encodings at translation time and at execution time in practice. Wide strings and literals have encodings such as UCS-2 or UTF-16, UTF-32, EUC-TW, and EUC-JP in practice. These encodings are determined at translation time, not at execution time.
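For concreteness, here is a minimal sketch of the five literal forms and their element types. It assumes a UTF-8 encoded source file; the character is arbitrary, and the comments reflect the encoding situation described above:

    #include <uchar.h>   /* char16_t, char32_t */
    #include <wchar.h>   /* wchar_t */

    static const char     narrow[] =   "é"; /* narrow/execution encoding, implementation-defined */
    static const wchar_t  wide[]   =  L"é"; /* wide execution encoding, implementation-defined */
    static const char     utf8[]   = u8"é"; /* UTF-8: 0xC3 0xA9 */
    static const char16_t utf16[]  =  u"é"; /* UTF-16 in practice, implementation-defined on paper */
    static const char32_t utf32[]  =  U"é"; /* UTF-32 in practice, implementation-defined on paper */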
The side effect of this existing practice is that almost no existing implementation can match, or has matched, a potentially valid reading of §6.4.5 String literals, paragraph 6. According to the standard, the encoding of each such string literal is described in terms of the mbrtoc16/32 or mbstowcs functions; this implies either that a translation-time encoding is used, or that string literals are tied to an execution-time property. It is unclear whether any implementation exists that, after compilation, inserts code that reacts to setlocale and transcodes all existing string literals to the new locale of the program. There do not seem to be any interpreters that behave in this manner, either.
The alternative explanation is that the encoding of the string is only determined at its point of creation, and that strings later in the program translation can be affected if they are created after a call to setlocale. This interpretation, while granting clemency to any vendor that has created strings with a fixed encoding even in an interpreter-like implementation, still presents the incredibly awkward problem that string encodings become a property of the translation of the program, and it raises very interesting consequences if an implementation allows someone to change the encoding of their string literals in the middle of their program. This still runs the risk of not being identical to what mbstowcs or mbrtoc16/32 do at program runtime, even if the program never calls setlocale.
Notably, neither of these (valid) interpretations is useful, worthwhile, or reflective of today’s existing practice.
It’s also important to note that traditional compilers can, have, and do create wide (L) and normal string literals based on a compile-time encoding that does not match the run-time encoding. The specification as it stands underserves implementations by tying them to a property they have not been able to provide since the earliest days of setlocale, and in doing so it has introduced an unnecessary schism between what can be guaranteed at translation time and what can be guaranteed at execution time.
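A minimal sketch of that schism follows. This is a hypothetical example: the values printed depend entirely on the compiler's chosen wide literal encoding, the source file's encoding, and the locale active at run time, and nothing forces them to agree:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* The code units of this literal were fixed by the compiler at translation time. */
        static const wchar_t compiled[] = L"é";

        /* mbstowcs converts according to whatever locale is active at execution time. */
        setlocale(LC_ALL, "");
        wchar_t converted[4] = {0};
        mbstowcs(converted, "é", 3);

        /* In practice these two values need not match, which is the mismatch described above. */
        printf("literal: 0x%lX, mbstowcs: 0x%lX\n",
               (unsigned long)compiled[0], (unsigned long)converted[0]);
        return 0;
    }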
Thanks to the unending and unyielding pain that both of these encodings have provided us, the industry has, over the last decade, settled very strongly on UTF-8, UTF-16, and UTF-32 for the encoding of u8, char16_t, and char32_t strings and string literals. The standard only mandates this behavior for u8; the other two are left implementation-defined (and tied to the locale), with a macro that says whether or not the implementation supports any of the Unicode code points specified in ISO/IEC 10646 (but still without specifying the underlying encoding). We know of no implementation that ships on any system, large or small, that uses a different encoding for u8, char16_t, or char32_t strings and string literals.
Normally, more freedom tends to be a good thing for implementations, but there is an enormous risk to the ecosystem in continuing to leave this freedom open. As we have experienced with "abc" and L"abc" literals and their associated locale-based encodings, the entire ecosystem is plagued with mojibake and similar issues when software is transferred across computers, to different regions and domains, and through different systems with different defaults.
Therefore, this paper proposes to solidify what is existing practice in almost every single compiler known to date: all char16_t strings and literals shall be UTF-16 encoded, and all char32_t strings and literals shall be UTF-32 encoded, unless otherwise explicitly specified.
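Concretely, that guarantee means the code units of such literals can be known by inspection. A minimal sketch, assuming a UTF-8 encoded source file; U+1F431 is just an arbitrary character outside the Basic Multilingual Plane:

    #include <assert.h>
    #include <uchar.h>

    /* Under the proposed guarantee (and on every implementation surveyed below),
       these arrays have exactly these code units. */
    static const char16_t cat16[] = u"🐱"; /* UTF-16: surrogate pair 0xD83D 0xDC31, then 0 */
    static const char32_t cat32[] = U"🐱"; /* UTF-32: single code point 0x0001F431, then 0 */

    int main(void) {
        assert(cat16[0] == 0xD83D && cat16[1] == 0xDC31 && cat16[2] == 0);
        assert(cat32[0] == 0x0001F431 && cat32[1] == 0);
        return 0;
    }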
We settle on this solution because it is existing practice. A survey of implementations overseen by Tom Honermann of Synopsys (through Coverity), together with outreach to certain vendors for confirmation, has revealed that there exists no implementation which interprets char16_t or char32_t literals as anything other than UTF-16 or UTF-32. All of the major compilers also follow this behavior (GCC, ICC, Clang, IBM xlC/C++, TinyCC, SDCC, EDG C and C++ in all of its modes, MSVC, IAR C/C++ Compilers, Embarcadero C, and more).
To achieve our goal, we state this in the front matter for “string literals”. We then provide wording in the other relevant places for the char16_t and char32_t functions that states the desired encoding (UTF-16 and UTF-32, respectively). Finally, we disentangle the compile-time encoding of string literals from the locale-based functions.
Note that the disentanglement does not reduce implementer freedom. The encoding for string literals and wchar_t string literals is still implementation-defined; it is simply defined explicitly as such now, and it no longer makes any forward mention of mbtowc, mbstowcs, mbrtoc16, or mbrtoc32. The conversion from the translation-time execution character set to the literal and wide literal encodings is still completely implementation-defined; that means implementations can continue to:

- use mbtowc/mbstowcs and friends on their host platform (a very literal interpretation) to encode/decode string literals from the execution character set (older compilers relying directly on the packaged C library);
- do what -fexec-charset=owo + -fwide-exec-charset=òwó, or /execution-charset:uwu, are supposed to accomplish (GCC, MSVC, and many others);
- accept -fexec-charset=owo and completely ignore it (Clang).

This is, essentially, an expansion of an implementation’s rights for normal and wide string literals. In exchange, we trade hypothetical implementer freedom for programmer guarantees by making char16_t and char32_t literals UTF-16 and UTF-32, as sketched below. We note, once again, that no implementer has taken advantage of this flexibility since 2006, when these features were first being cooked up by the C Standards Committee.
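A hedged sketch of that split follows. The GCC and MSVC option names are real; the particular charset values passed to them are only examples, and whether an implementation honors them at all remains its own business:

    /* Hypothetical build lines (option names exist; charset values are examples):
         gcc -fexec-charset=IBM1047 -fwide-exec-charset=UTF-32LE demo.c
         cl /execution-charset:utf-8 demo.c                                   */
    #include <uchar.h>
    #include <wchar.h>

    static const char     narrow[] =  "A"; /* literal encoding: still implementation-defined */
    static const wchar_t  wide[]   = L"A"; /* wide literal encoding: still implementation-defined */
    static const char16_t utf16[]  = u"A"; /* pinned by this proposal: UTF-16, code unit 0x0041 */
    static const char32_t utf32[]  = U"A"; /* pinned by this proposal: UTF-32, code unit 0x00000041 */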
No new normative reference is needed: ISO/IEC 10646 is kept as an undated reference in the C Standard, which means it always refers to the latest edition. UTF-8, UTF-16, and UTF-32 are all mentioned and described in ISO/IEC 10646, so the wording can use this terminology directly, by reference, as it already does for UTF-8.
The following wording is relative to N2596. Except where a change is described explicitly, each quoted paragraph below shows the proposed text with the changes already applied.
Add a new sub-clause, §6.2.9 Encodings:

6.2.9 Encodings
1 The literal encoding is an implementation-defined mapping of the characters of the execution character set to the values in a character constant (6.4.4.4) or string literal (6.4.5). It shall support a mapping from all the basic execution character set values into the implementation-defined encoding. It may contain multibyte character sequences (5.2.1.2).
2 The wide literal encoding is an implementation-defined mapping of the characters of the execution character set to the values in a wchar_t character constant (6.4.4.4) or a wchar_t string literal (6.4.5). It shall support a mapping from all the basic execution character set values into the implementation-defined encoding. The mapping shall produce values identical to the literal encoding for all the basic execution character set values if an implementation does not define __STDC_MB_MIGHT_NEQ_WC__. One or more values may map to one or more values of the extended execution character set.
Modify §6.4.4.4 Character constants:

2 An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'. A UTF–8 character constant is the same, except prefixed by u8. A wchar_t character constant is prefixed by the letter L. A UTF-16 character constant is prefixed by the letter u. A UTF-32 character constant is prefixed by the letter U. Collectively, wchar_t, UTF-16, and UTF-32 character constants are called wide character constants. With a few exceptions detailed later, the elements of the sequence are any members of the source character set; they are mapped in an implementation-defined manner to members of the execution character set.
10 A UTF–8, UTF-16, or UTF-32 character constant shall not contain more than one character.85) The value shall be representable with a single UTF–8, UTF-16, or UTF-32 code unit.
Semantics
11 An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single value in the literal encoding is the numerical value of the representation of the mapped character in the literal encoding interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single value in the literal encoding, is implementation-defined.
12 A UTF–8 character constant has type unsigned char. If the UTF-8 character constant is not produced through a hexadecimal or octal escape sequence, the value of a UTF–8 character constant is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF–8 code unit. Otherwise, the value of the UTF-8 character constant is the numeric value specified in the hexadecimal or octal escape sequence.

13 A UTF–16 character constant has type char16_t, which is an unsigned integer type defined in the <uchar.h> header. If the UTF-16 character constant is not produced through a hexadecimal or octal escape sequence, the value of a UTF–16 character constant is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF–16 code unit. Otherwise, the value of the UTF-16 character constant is the numeric value specified in the hexadecimal or octal escape sequence.

14 A UTF–32 character constant has type char32_t, which is an unsigned integer type defined in the <uchar.h> header. If the UTF-32 character constant is not produced through a hexadecimal or octal escape sequence, the value of a UTF–32 character constant is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF–32 code unit. Otherwise, the value of the UTF-32 character constant is the numeric value specified in the hexadecimal or octal escape sequence.
15 A wchar_t character constant prefixed by the letter L has type wchar_t, an integer type defined in the <stddef.h> header. The value of a wchar_t character constant containing a single multibyte character that maps to a single member of the extended execution character set is the wide character corresponding to that multibyte character in the implementation-defined wide literal encoding. The value of a wide character constant containing more than one multibyte character or a single multibyte character that maps to multiple members of the extended execution character set, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined.
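For illustration only (not part of the proposed wording), the value rules above give results such as the following, assuming a UTF-8 encoded source file:

    #include <assert.h>
    #include <uchar.h>

    int main(void) {
        /* Not produced through an escape sequence: the value is the code point. */
        char16_t a = u'é';      /* U+00E9 -> 0x00E9 */
        /* Produced through a hexadecimal escape: the value is the escape's numeric value. */
        char16_t b = u'\x2603'; /* 0x2603 */
        char32_t c = U'\xFF';   /* 0xFF, taken directly from the escape */
        assert(a == 0x00E9 && b == 0x2603 && c == 0xFF);
        return 0;
    }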
Modify §6.4.5 String literals:

3 … A UTF–8 string literal is the same, except prefixed by u8. A wchar_t string literal is the same, except prefixed by L. A UTF–16 string literal is the same, except prefixed by u. A UTF–32 string literal is the same, except prefixed by U. Collectively, wchar_t, UTF-16, and UTF-32 string literals are called wide string literals. …
6 In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals.86) The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence corresponding to the literal encoding. For UTF–8 string literals, the array elements have type char, and are initialized with the characters of the multibyte character sequence, as encoded in UTF–8. For wide string literals prefixed by the letter L, the array elements have type wchar_t and are initialized with the sequence of wide characters corresponding to the wide literal encoding. For wide string literals prefixed by the letter u or U, the array elements have type char16_t or char32_t, respectively, and are initialized with the sequence of wide characters corresponding to UTF-16 and UTF-32 encoded text, respectively. The value of a string literal containing a multibyte character or escape sequence not represented in the execution character set is implementation-defined. Any hexadecimal escape sequence or octal escape sequence specified in a u8, u, or U string literal specifies a single char, char16_t, or char32_t value and may result in the full character sequence not being valid UTF-8, UTF-16, or UTF-32.
Move the __STDC_UTF_16__ and __STDC_UTF_32__ macros to §6.10.8.1 Mandatory macros, so that they are always defined:

1 The following macro names shall be defined by the implementation:

…

__STDC_UTF_16__ The integer constant 1, intended to indicate that values of type char16_t are UTF–16 encoded.

__STDC_UTF_32__ The integer constant 1, intended to indicate that values of type char32_t are UTF–32 encoded.
Remove the __STDC_UTF_16__ and __STDC_UTF_32__ macros from §6.10.8.2 Environment macros:

1 The following macro names are conditionally defined by the implementation:

…

__STDC_UTF_16__ The integer constant 1, intended to indicate that values of type char16_t are UTF–16 encoded. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.

__STDC_UTF_32__ The integer constant 1, intended to indicate that values of type char32_t are UTF–32 encoded. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
The corresponding implementation-defined behavior entry in Annex J changes from

— The encoding of any of wchar_t, char16_t, and char32_t where the corresponding standard encoding macro (__STDC_ISO_10646__, __STDC_UTF_16__, or __STDC_UTF_32__) is not defined (6.10.8.2).

to

— The encoding of any of wchar_t where the corresponding standard encoding macro (__STDC_ISO_10646__) is not defined (6.10.8.2).