N2902: Restartable and Non-Restartable Functions for Efficient Character Conversions

1. Changelog

1.1. Revision 6 - January 1st, 2022

Add design critique for the latest interface suggestion in § 4.1 Which Function Form?.
Remove all non-mbstate_t functions, to reduce the function count, and change behavior of the function.
Make sure stdc_mcerr is meant to be a proper error enumeration and a typedef.
Properly define indivisible unit of work.

1.2. Revision 5 - November 30th, 2021

Design critique and benchmark 3 different styles of function declaration and discuss benefits.
A full, independent implementation of this paper (and more) is now available.

1.3. Revision 4 - December 1st, 2020

Add missing functions for c8/16/32 to the platform-specific variants.
Ensure that mbstate_t is used throughout rather than mcstate_t.
Explain behavior of NULL for mbstate_t to avoid use of global values.

1.4. Revision 3 - October 27th, 2020

Completely Reformulate Paper based on community, musl-libc, and glibc feedback.
Completely rewrite every section past § 6 Proposed Wording, and change many more.

1.5. Revision 0-2 - March 2nd, 2020

Introduce new functions and gather consensus to move forward.
Attempt to implement in other standard libraries and gather feedback.

2. Introduction and Motivation

C adopted conversion routines for the current active locale-derived/LC_CTYPE-controlled/implementation-defined encoding for Multibyte (mb) Strings and Wide (wc) Strings. While the rationale for having such conversion routines to and from Multibyte and Wide strings in the C library is not explicitly stated in the documents, it is easy to derive the many benefits of a full ecosystem of both restarting (r) and non-restarting conversion routines for both single units and string-based bulk conversions for mb and wc strings. From ease of use with string literals to performance optimizations from bulk processing with vectorization and SIMD operations, the mbs(r)towcs functions, and their reverse-direction counterparts, granted a rich and fertile ground upon which C library developers took advantage of platform amenities, encoding specifics, and hardware support to provide useful and fast abstractions upon which encoding-aware applications could build.

Unfortunately, none of these API designs were granted to char16_t (c16) or char32_t (c32) conversion functions. Nor were they given a way to work with a well-defined 8-bit multibyte encoding such as UTF8 without having to first pin it down with platform-specific setlocale(...) calls. This has resulted in a series of extremely vexing problems when trying to write portable, reliable C library code that is not locked to a specific vendor.

This paper looks at the problems, and then proposes a solution that is worth implementing for the C Standard Library.

2.1. Problem 1: Lack of Portability

Already, Windows, z/OS, and POSIX platforms greatly differ in what they offer for char-typed, Multibyte string encodings. EBCDIC is still in play after many decades. Windows’s Active Code Page functionality on its machine prevents portability even within its own ecosystem. Platforms where LANG environment variables control functionality make communication between even processes on the same hardware a silent and often unforeseen gamble for library developers. Using functions which convert to/from mbs makes it impossible to have stability guarantees not only between platforms, but for individual machines. Sometimes even cross-process communication becomes exceedingly problematic without opting into a serious amount of platform-specific or vendor-specific code and functionality to lock encodings in, harming the portability of C code greatly.

wchar_t does not fare better. By definition, a wide character type must be capable of holding the entire character set in a single unit of wchar_t. Reality, however, is different: this has been a fundamental impossibility for decades for implementers that switched to 16-bit UCS-2 early. IBM machines persist with this issue for all 32-bit builds, though some IBM platforms took advantage of the 64-bit change to do an ABI break and use UTF32, as Linux distributions settled on. Even if one knew this about IBM and programmed exclusively on its machines, certain IBM platforms can still end up in a situation where wchar_t is neither 32-bit UTF32 nor 16-bit UCS-2/UTF16: the encoding can change to something else in certain Chinese locales, becoming completely different.

Windows is permanently stuck on having to explicitly detail that its implementation is "16-bit, UCS-2 as per the standard", before explicitly informing developers to use vendor-specific WideCharToMultiByte/MultiByteToWideChar to handle UTF16-encoded characters in wchar_t.

These solutions provide ways to achieve a local maximum for a specific vendor or platform. Unfortunately, this comes at the extreme cost of portability: the code has no guarantee it will work anywhere but your machine, and in a world that is increasingly interconnected by devices that interface with networks it makes sharing both data and code troublesome and hard to work with.

2.2. Problem 2: What is the Encoding?

With setlocale and getlocale only responding to and returning implementation-defined (const )char*, there is no way to portably determine what the locale (and any associated encoding) should or should not be. The typical solution for this has been to code and program only for what is guaranteed by the Standard as what is in the Basic Character Set. While this works fine for source code itself, this produces an extremely hostile environment:

conversion functions in the standard mangle and truncate data in (sometimes troubling, sometimes hilarious) fashion;
programs which are not careful to meticulously track encoding of incoming text often lose the ability to understand that text;
programmers can never trust the platform will support even the Latin characters in any representation of data beyond the 7th bit of a byte;
and, interchange between cultures with different default encodings makes it impossible to communicate with others without entirely forsaking the standard library.

Abandoning the C Standard Library -- to get standard behavior across platforms -- is an exceedingly bitter pill to have to swallow as an enthusiastic C developer.

2.3. Problem 3: Performance

The current version of the C Standard includes functions which attempt to alleviate Problems 1 and 2 by providing conversions from the per-process (and sometimes per-thread), locale-sensitive black box encoding of multibyte char* strings. They do this by providing conversions to char16_t units or char32_t units with mbrtoc(16|32) and c(16|32)rtomb functions. We will for a brief moment ignore the presence of the __STDC_UTF_16__ and __STDC_UTF_32__ macros and assume the two types mean that string literals and library functions convert to and from UTF16 and UTF32 respectively. We will also ignore that wchar_t's encoding -- which is just as locale-sensitive and unknown at compile and runtime as char's encoding is -- has no such conversion functions. These givens make it possible to say that we, as C programmers, have 2 known encodings which we can use to shepherd data into a stable state for manipulation and processing as library developers.

Even with that knowledge, these one-unit-at-a-time conversion functions are slower than they should be.

On many platforms, these one-at-a-time function calls come from the operating system, dynamically loaded libraries, or other places which otherwise inhibit compiler observation and optimizer inspection. Attempts to vectorize code or unroll loops built around these functions are thoroughly thwarted by this. Building static libraries or from source is very often a non-starter for many platforms. Since the encoding used for multibyte strings and wide strings is controlled by the implementation, it becomes increasingly difficult to provide the functionality to convert long segments of data with decent performance characteristics without needing to opt into vendor or platform specific tricks.
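To make the cost concrete, here is a minimal sketch of the loop that the existing mbrtoc32 function forces on a caller: one library call per character, each of which may cross a shared-library boundary. The helper name mbs_to_c32_one_at_a_time is hypothetical, and the handling of a decoded null character (assumed to occupy one byte) is a simplifying assumption rather than something the standard guarantees for every encoding.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: converts src_len bytes of locale-encoded text to
   char32_t, one character per mbrtoc32 call. Returns the number of char32_t
   values written, or (size_t)-1 on invalid input, truncated input, or a
   full output buffer. */
size_t mbs_to_c32_one_at_a_time(const char *src, size_t src_len,
                                char32_t *dst, size_t dst_len) {
    mbstate_t state;
    size_t written = 0;
    memset(&state, 0, sizeof state); /* initial conversion state */
    while (src_len > 0) {
        char32_t c;
        size_t n = mbrtoc32(&c, src, src_len, &state);
        if (n == (size_t)-1 || n == (size_t)-2) {
            return (size_t)-1; /* encoding error or incomplete character */
        }
        if (written == dst_len) {
            return (size_t)-1; /* no room left in the output */
        }
        dst[written++] = c;
        if (n == (size_t)-3) {
            continue; /* output produced without consuming new input */
        }
        if (n == 0) {
            n = 1; /* ASSUMPTION: a decoded null character took one byte */
        }
        src += n;
        src_len -= n;
    }
    return written;
}
```

Every iteration pays for a function call and a state update, and the optimizer generally cannot see through mbrtoc32 to batch the work.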

2.4. Problem 4: `wchar_t` Cannot Roundtrip

With no wctoc32 or wctoc16 functions, the only way to convert a wide character or wide character string to a program-controlled, statically known UTF encoding is to first invoke the wide character to multibyte function, and then invoke the multibyte function to either char16_t or char32_t.

This means that even if we have a well-behaved wchar_t that is not sensitive to the locale (e.g., on Windows machines), we lose data if the locale-controlled char encoding is not set to something that can handle all incoming code unit sequences. The locale-based encoding in a program can thus tank what is simply meant to be a pass-through encoding from wchar_t to char16_t/char32_t, all because the only Standards-compliant conversion channels data through the locale-based multibyte encoding mb(s)(r)toX(s) functions.
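The two-hop route can be sketched with the existing wcrtomb and mbrtoc32 functions. The helper name wc_to_c32_via_mb is hypothetical. The point of the sketch is the first failure branch: whether a given wide character survives depends entirely on the locale's multibyte encoding at the time of the call, not on wchar_t or char32_t.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: converts one wide character to char32_t by way of
   the locale's multibyte encoding, the only route the standard offers.
   Returns 1 on success and 0 if the character is lost along the way. */
int wc_to_c32_via_mb(wchar_t wc, char32_t *out) {
    char bytes[MB_LEN_MAX];
    mbstate_t to_mb;
    mbstate_t from_mb;
    size_t len;
    size_t n;
    memset(&to_mb, 0, sizeof to_mb);
    memset(&from_mb, 0, sizeof from_mb);
    /* Hop 1: wide character -> locale-dependent multibyte sequence. */
    len = wcrtomb(bytes, wc, &to_mb);
    if (len == (size_t)-1) {
        return 0; /* the current locale encoding cannot represent wc */
    }
    /* Hop 2: multibyte sequence -> char32_t. */
    n = mbrtoc32(out, bytes, len, &from_mb);
    if (n == (size_t)-1 || n == (size_t)-2 || n == (size_t)-3) {
        return 0;
    }
    return 1;
}
```

A program that never calls setlocale runs in the "C" locale, where characters outside the basic character set are not guaranteed to make it through the first hop.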

For example, it was fundamentally impossible to engage in a successful conversion from wchar_t strings to char multibyte strings on Windows using the C Standard Library. Until a very recent Windows 10 update, UTF8 could not be set as the active system codepage either programmatically or through an experimental, deeply-buried setting. This has changed with Windows Version 1903 (May 2019 Update), but the problems do not stop there.

No dedicated UTF-8 support (the standard mandates no specific encodings or charsets) leaves developers to write the routines themselves or, sometimes worse, to roundtrip text through the locale after forcing a change to a UTF-8 locale, which may not be supported. While the non-restartable functions can save quite a bit of code size, unfortunately there are many encodings which are not as nice and require state to be processed correctly (e.g., Shift JIS and other ISO-2022 encodings). Not being able to retain that state between potential calls in a mbstate_t is detrimental to the ability to move forward with any encoding endeavor that wishes to bridge the gap between these disparate platform encodings and the current locale.

Because other library functions can be used to change or alter the locale in some manner, it once again becomes impossible to have a portable, compliant program with deterministic behavior if just one library changes the locale of the program, let alone if the encoding or locale is unexpected by the developer because they do not know of that culture or its locale setting. This hidden state is nearly impossible to account for: the result is software systems that cannot properly handle text in a meaningful way without abandoning C’s encoding facilities, relying on vendor-specific extensions/encodings/tools, or confining one’s program to only the 7-bit plane of existence.

2.5. Problem 5: The C Standard Cannot Handle Existing Practice

The C standard does not allow a wide variety of encodings that implementations have already crammed into their backing locale blocks to work, resulting in the abandonment of locale-related text facilities by those with double-byte character sets, primarily from East Asia. For example, there is a serious bug that cannot be fixed without non-conforming, broken behavior:

...

This call writes the second Unicode code point, but does not consume any input. 0 is returned since no input is consumed. According to the C standard, a return of 0 is reserved for when a null character is written, but since the C standard doesn’t acknowledge the existence of characters that can’t be represented in a single wchar_t, we’re already operating outside the scope of the standard.

The standard cannot handle encodings that must return two or more wchar_t for however many -- up to MB_LEN_MAX -- it consumes. This is even for when the target wchar_t "wide execution" encoding is UTF-32; this is a fundamental limitation of the C Standard Library that is absolutely insurmountable by the current specification. This is exacerbated by the standard’s insistence that a single wchar_t must be capable of representing all characters as a single element, a philosophy which has bled into the relevant interfaces such as mbrtowc and other *wc* related types. As the values cannot be properly represented in the standard, this leaves people to either make stuff up or abandon it altogether. This means that the design introduced from C11 and beyond is fundamentally broken when it comes to handling existing practice.

Furthermore, clarification requests have had to be filed for other functions, just to improve their behavior with respect to multiple input and multiple output. Many have been noted as issues for mbrtoc16 and similar functionality, as was originally part of Dr. Philip K. Krause’s fixes to the functions. This paper attempts to solve the same problem in a more fundamental manner.

2.6. In Summary

The problems C developers face today with respect to encoding and dealing with vendor and platform-specific black boxes are a staggering trifecta: non-portability between processes running on the same physical hardware, performance degradation from using standard facilities, and potentially having a locale changed out from under your program to prevent roundtripping.

This serves as the core motivation for this proposal.

3. Prior Art

There are many sources of prior art for the desired feature set. Some functions (with fixes) were implemented directly in implementations, embedded and otherwise. Others rely exclusively on platform-specific code in both Windows and POSIX implementations. Others have cross-platform libraries that work across a myriad of platforms, such as ICU or iconv. We discuss the most diverse and exemplary implementations.

3.1. Standard C

To understand what this paper proposes, an explanation of the current landscape is necessary. The below table is meant to be read as being {row}to{column}. The symbols provide the following information:

✔️: Function exists in both its restartable (function name has the indicative r in it) and its canonical non-restartable form ({row}to{column} and {row}rto{column}).
🇷: Function exists only in its "restartable" form ({row}rto{column}).
❌: Function does not exist at all.

Here is what exists in the C Standard Library so far:

	mb	wc	mbs	wcs	c8	c16	c32	c8s	c16s	c32s
mb	➖	✔️			❌	🇷	🇷
wc	✔️	➖			❌	❌	❌
mbs			➖	✔️				❌	❌	❌
wcs			✔️	➖				❌	❌	❌
c8	❌	❌			➖	❌	❌
c16	🇷	❌			❌	➖	❌
c32	🇷	❌			❌	❌	➖
c8s			❌	❌				➖	❌	❌
c16s			❌	❌				❌	➖	❌
c32s			❌	❌				❌	❌	➖

There is a lot of missing functionality here in this table, and it is important to note that a large amount of this comes from both not being willing to standardize more than the bare minimum and not having a cohesive vision for improving encoding conversions in the C Standard. Notably, string-based {prefix}s functions are missing, leaving performance-oriented multi-unit conversions out of the standard. There are also severe API flaws in the C standard, as discussed above.

3.2. Win32

WideCharToMultiByte and MultiByteToWideChar are the APIs of choice for those in Win32 environments to get to and from the run-time execution encoding and -- if it matches -- the translation-time execution encoding. Unfortunately, these APIs are locked within the Windows ecosystem entirely as they are not available as a standalone library. Furthermore, as an operating system Windows exclusively controls what it can and cannot convert from and to; some of these functions power the underlying portions of the character conversion functions in their Standard Library, but they notably truncate multi-code-unit characters for their UTF-16 wchar_t. This produces a broken, deprecated UCS-2 encoding when e.g. mbrtowc is used instead of directly relying on the operating system functionality, making the C standard’s functions of dubious use.

3.3. `nl_langinfo`

nl_langinfo is a POSIX function that returns various pieces of information based on an enumerated input and some extra parameters. It has been suggested that this be standardized over anything else, to make it easier to determine what to do with a given locale.

The first problem with this is it returns a string-based identifier that can be whatever an implementation decides it should be. This makes nl_langinfo no better than setlocale(LC_CTYPE, NULL) in its design:

Specifies the name of the coded character set for which the charmap file is defined. This value determines the value returned by the nl_langinfo subroutine. The <code_set_name> must be specified using any character from the portable character set, except for control and space characters.

Any name can be chosen that fits this description, and POSIX nails nothing down for portability or identification reasons. There is no canonical list, just whatever implementations happen to supply as their "charmap" definitions.
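A short sketch shows what this leaves a portable program to do. nl_langinfo(CODESET) is the real POSIX call; the helper names and the list of spellings below are illustrative assumptions, not a canonical set, which is precisely the problem.

```c
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Returns the implementation's own name for the current locale's codeset. */
const char *current_codeset_name(void) {
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: guesses whether a codeset name means UTF-8. The
   spellings are ASSUMED examples; POSIX defines no canonical list, so any
   such table is guesswork. */
int codeset_looks_like_utf8(const char *name) {
    static const char *const spellings[] = { "UTF-8", "UTF8", "utf-8", "utf8" };
    size_t i;
    for (i = 0; i < sizeof spellings / sizeof spellings[0]; ++i) {
        if (strcmp(name, spellings[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Any implementation that chooses a spelling outside the table is silently misclassified, and nothing in POSIX prevents that.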

3.4. SDCC

The Small Device C Compiler (SDCC) has already begun some of this work. One of its principal contributors, Dr. Philip K. Krause, wrote papers addressing exactly this problem. Krause’s work focuses entirely on non-restartable conversions from Multibyte Strings to char16_t and char32_t. There is no need for a conversion to a UTF8 char style string for SDCC, since the Multibyte String in SDCC is always UTF8. This means that mbstoc16s and mbstoc32s and the "reverse direction" functions encompass an entire ecosystem of UTF8, UTF16, and UTF32.

While this is good for SDCC, this is not quite enough for other developers who attempt to write code in a cross-platform manner.

Nevertheless, SDCC’s work is still important: it demonstrates that these functions are implementable, even for small devices. With additional work being done to implement them for other platforms, there is strong evidence that this can be implemented in a cross-platform manner and thusly is suitable for the Standard Library.

3.5. iconv/ICU

The C functions presented below are motivated primarily by concepts found in a popular POSIX library, [iconv]. We do not provide the full power of iconv here but we do mimic its interface to allow for a better definition of functions, as explained in Problem 5. The core of the functionality can be embodied in this parameterized function signature:

stdc_mcerr stdc_XstoYs(const charX** input, size_t* input_bytes, charY** output, size_t* output_bytes);

In iconv's case, there is an additional first parameter describing the conversion (of type iconv_t). That is not needed for this proposal, because we are not making a generic conversion API. This proposal is focused on doing 2 things and doing them extremely well:

Getting data from the current execution encoding (char) to a Unicode encoding (unsigned char/UTF-8, char16_t/UTF-16, char32_t/UTF-32), and the reverse.
Getting data from the current wide execution encoding (wchar_t) to a Unicode encoding (unsigned char/UTF-8, char16_t/UTF-16, char32_t/UTF-32), and the reverse.
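For reference, this is roughly what the real iconv interface looks like in use. The helper name is hypothetical, and the encoding names "UTF-8" and "WCHAR_T" are an assumption: they are accepted by common implementations, but POSIX does not standardize which names iconv_open recognizes.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: converts UTF-8 bytes to wchar_t with iconv. Returns
   the number of wchar_t values written, or (size_t)-1 on any failure. */
size_t utf8_to_wide_with_iconv(const char *src, size_t src_bytes,
                               wchar_t *dst, size_t dst_count) {
    char *in = (char *)src; /* iconv reads but does not modify the input */
    char *out = (char *)dst;
    size_t in_left = src_bytes;
    size_t out_left = dst_count * sizeof(wchar_t);
    size_t rc;
    /* ASSUMPTION: the implementation recognizes these two encoding names. */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1) {
        return (size_t)-1;
    }
    /* Both pointers advance and both byte counts shrink as work is done. */
    rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        return (size_t)-1;
    }
    return dst_count - out_left / sizeof(wchar_t);
}
```

The pointer-to-pointer and pointer-to-size parameters are the part of this design that the proposal borrows; the descriptor and the string-named encodings are the parts it leaves behind.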

iconv can do the above conversions, but also supports a complete list of pairwise conversions between about 49 different encodings. It can also be extended at translation time by programming more functionality into its library. This proposal focuses only on conversions between the encodings that the implementation owns and Unicode. This results in the design found below.

4. Design

Given the problems before, the prior art, the implementation experience, and the vendor experience, it is clear that we need something outside of nl_langinfo, lighter weight than all of iconv, and more resilient and encompassing than what the C Standard offers. Therefore, the solution to our problem of having a wide variety of implementation encodings is to expand the contract of wchar_t for an entirely new set of functions which avoid the problems and pitfalls of the old mechanism.

Notably, both the multibyte string’s function design and the wide character string’s definition of a single character are broken in terms of existing practice today. The primary problem lies in the inability of both APIs, in either direction, to handle N:M encodings, rather than N:1 or 1:M. Therefore, these new functions focus on providing an interface to allow multi-code-unit conversions, in both directions.

To facilitate this, a new header -- <stdmchar.h> -- is introduced. The header contains the "multi character" (mc) and "multi wide character" (mwc) conversion routines, respectively. To support getting data losslessly out of wchar_t and char strings controlled firmly by the implementation -- and back into those types if the code units in the characters are supported -- the following functionality is proposed using the new multi (wide) character (m[w]c) prefixes and suffixes:

	mc	mwc	mcs	mwcs	c8	c16	c32	c8s	c16s	c32s
mc	🅿️✔️	✔️			🅿️✔️	🅿️✔️	🅿️✔️
mwc	✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
mcs			🅿️✔️	✔️				🅿️✔️	🅿️✔️	🅿️✔️
mwcs			✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️
c8	🅿️✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
c16	🅿️✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
c32	🅿️✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
c8s			🅿️✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️
c16s			🅿️✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️
c32s			🅿️✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️

In particular, it is imperative to recognize that the implementation is the "sole proprietor" of the wide locale encodings and multibyte locale encodings for its string literals (compiler) and library functions (standard library). Therefore, the mc and mwc functions simply focus on providing a good interface for these encodings. The forms of the individual and string conversion functions are:

stdc_mcerr stdc_XnrtoYn(size_t* output_size, charY** output,
	size_t* input_size, const charX** input, mbstate_t* state);
stdc_mcerr stdc_XsnrtoYsn(size_t* output_size, charY** output,
	size_t* input_size, const charX** input, mbstate_t* state);

The input and output sizes are expressed as a number of charX/charY code units. The functions take the input/output sizes as pointers and decrement each value by the amount of input consumed or output written. Similarly, the input/output data pointers themselves are incremented by the number of code units consumed or written. These updates happen only when a complete, irreversible conversion of input data has been written to the output without error. The s functions work on whole strings rather than a single complete irreversible conversion; the n indicates that the function takes a size value.

Input is consumed and output is written (with sizes updated) in accordance with a single, successful computation of an indivisible unit of work. An indivisible unit of work is the smallest set of input that can be consumed that produces no error and guarantees forward progress through either the input or output buffer (most of the time, both). No output is guaranteed to occur (e.g., during the consumption of a shift state mechanism in an encoding such as SHIFT-JIS), but if output does happen then it only occurs upon the successful completion of an indivisible unit of work.
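The commit-on-success rule can be illustrated with a minimal sketch of a single-unit UTF-32 to UTF-8 conversion. Everything prefixed sketch_ is hypothetical and merely mirrors the proposed naming; this is not the reference implementation. Since this particular conversion is stateless, the sketch accepts the state parameter only for signature fidelity and ignores it.

```c
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical mirror of the proposed stdc_mcerr enumeration. */
typedef enum sketch_mcerr {
    sketch_mcerr_ok = 0,
    sketch_mcerr_invalid = -1,
    sketch_mcerr_incomplete_input = -2,
    sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Hypothetical stand-in for stdc_c32nrtoc8n: converts at most ONE code
   point. Nothing the caller can observe changes unless the whole unit of
   work succeeds. */
sketch_mcerr sketch_c32nrtoc8n(size_t *output_size, unsigned char **output,
                               size_t *input_size, const char32_t **input,
                               mbstate_t *state) {
    char32_t c;
    size_t need;
    unsigned char *out = *output;
    (void)state; /* UTF-32 -> UTF-8 carries no shift state */
    if (*input_size == 0) {
        return sketch_mcerr_ok; /* nothing to convert */
    }
    c = **input;
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        return sketch_mcerr_invalid; /* not a Unicode scalar value */
    }
    need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    if (*output_size < need) {
        return sketch_mcerr_insufficient_output; /* no partial writes */
    }
    if (need == 1) {
        out[0] = (unsigned char)c;
    } else if (need == 2) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
    } else if (need == 3) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
    } else {
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
    }
    /* Commit: advance both pointers and shrink both sizes together. */
    *output += need;
    *output_size -= need;
    *input += 1;
    *input_size -= 1;
    return sketch_mcerr_ok;
}
```

Converting U+20AC into a 4-unit buffer writes E2 82 AC, leaves one unit of output space, and exhausts the input. Retrying the same code point with only one unit of space left returns the insufficient-output code and leaves every pointer and size exactly as it was.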

If an error happens, the conversion is stopped and an error code is returned. The function does not decrement the input or output sizes for the failed operation, nor does it shift the input and output pointers forward for the failed operation. "Failed operation" refers to a single, indivisible unit of work. The error codes are as follows:

stdc_mcerr_insufficient_output = -3 - the input is correct but there is not enough output space
stdc_mcerr_incomplete_input = -2 - an incomplete input was found after exhausting the input
stdc_mcerr_invalid = -1 - an encoding error occurred
stdc_mcerr_ok = 0 - the operation was successful
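As a sketch of how the error codes partition failures, consider a single-unit UTF-8 to UTF-32 decoder. The sketch_ names are hypothetical stand-ins that mirror the proposal's naming, the enumeration is repeated so the example stands alone, and the unused state parameter is kept only for signature fidelity.

```c
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

typedef enum sketch_mcerr {
    sketch_mcerr_ok = 0,
    sketch_mcerr_invalid = -1,
    sketch_mcerr_incomplete_input = -2,
    sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Hypothetical stand-in for stdc_c8nrtoc32n: decodes at most ONE code
   point and reports which of the three failure classes applies. */
sketch_mcerr sketch_c8nrtoc32n(size_t *output_size, char32_t **output,
                               size_t *input_size, const unsigned char **input,
                               mbstate_t *state) {
    const unsigned char *in = *input;
    size_t len;
    size_t i;
    char32_t c;
    char32_t min;
    (void)state; /* UTF-8 -> UTF-32 carries no shift state */
    if (*input_size == 0) {
        return sketch_mcerr_ok;
    }
    if (in[0] < 0x80) { len = 1; c = in[0]; min = 0; }
    else if ((in[0] & 0xE0) == 0xC0) { len = 2; c = in[0] & 0x1F; min = 0x80; }
    else if ((in[0] & 0xF0) == 0xE0) { len = 3; c = in[0] & 0x0F; min = 0x800; }
    else if ((in[0] & 0xF8) == 0xF0) { len = 4; c = in[0] & 0x07; min = 0x10000; }
    else { return sketch_mcerr_invalid; } /* impossible lead byte */
    for (i = 1; i < len; ++i) {
        if (i >= *input_size) {
            return sketch_mcerr_incomplete_input; /* ran out mid-sequence */
        }
        if ((in[i] & 0xC0) != 0x80) {
            return sketch_mcerr_invalid; /* not a continuation byte */
        }
        c = (c << 6) | (in[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        return sketch_mcerr_invalid; /* overlong, out of range, surrogate */
    }
    if (*output_size < 1) {
        return sketch_mcerr_insufficient_output; /* valid input, no room */
    }
    **output = c;
    *output += 1;
    *output_size -= 1;
    *input += len;
    *input_size -= len;
    return sketch_mcerr_ok;
}
```

The distinction matters to callers: incomplete input suggests fetching more data and retrying, insufficient output suggests growing the buffer and retrying, and only the invalid code describes malformed input.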

The behaviors are as follows:

if state is NULL, then:
an automatic storage duration (non-static) mbstate_t object is initialized to the initial conversion sequence;
and, a pointer to this state object plus the original four parameters are passed to the restartable version of the function.
if output is NULL, then no output will be written. If output_size is not NULL, the value it points to will be decremented by the number of characters that would have been written.
if output is non-NULL and output_size is NULL, then enough space is assumed in the output buffer for the entire operation.
if input is NULL, then state is set to the initial conversion sequence and no other actions are performed; otherwise, input must not be NULL.
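The null-pointer behaviors above can be sketched with a string-form UTF-32 to UTF-16 conversion. All sketch_ names are hypothetical. The sketch also assumes that a null output means either the pointer-to-pointer or the pointer it refers to is null; the text above does not settle which, so treat that as our reading.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

typedef enum sketch_mcerr {
    sketch_mcerr_ok = 0,
    sketch_mcerr_invalid = -1,
    sketch_mcerr_incomplete_input = -2,
    sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Hypothetical stand-in for stdc_c32snrtoc16sn with the null rules. */
sketch_mcerr sketch_c32snrtoc16sn(size_t *output_size, char16_t **output,
                                  size_t *input_size, const char32_t **input,
                                  mbstate_t *state) {
    /* ASSUMPTION: either level of null pointer means "do not write". */
    int write_out = (output != NULL && *output != NULL);
    if (input == NULL || *input == NULL) {
        if (state != NULL) {
            memset(state, 0, sizeof *state); /* reset to the initial state */
        }
        return sketch_mcerr_ok;
    }
    while (*input_size > 0) {
        char32_t c = **input;
        size_t need;
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
            return sketch_mcerr_invalid;
        }
        need = c >= 0x10000 ? 2 : 1;
        /* A null output_size means the caller vouches for the space. */
        if (output_size != NULL && *output_size < need) {
            return sketch_mcerr_insufficient_output;
        }
        if (write_out) {
            if (need == 1) {
                (*output)[0] = (char16_t)c;
            } else {
                (*output)[0] = (char16_t)(0xD800 + ((c - 0x10000) >> 10));
                (*output)[1] = (char16_t)(0xDC00 + ((c - 0x10000) & 0x3FF));
            }
            *output += need;
        }
        if (output_size != NULL) {
            *output_size -= need; /* counted even when nothing is written */
        }
        *input += 1;
        *input_size -= 1;
    }
    return sketch_mcerr_ok;
}

/* Counting pass: null output, and the size starts at SIZE_MAX so the
   difference is the number of char16_t units a real pass would write. */
size_t sketch_count_c16(const char32_t *text, size_t length) {
    size_t remaining = SIZE_MAX;
    if (sketch_c32snrtoc16sn(&remaining, NULL, &length, &text, NULL)
            != sketch_mcerr_ok) {
        return SIZE_MAX;
    }
    return SIZE_MAX - remaining;
}
```

A typical use is two passes: count with a null output, allocate exactly that many units, then convert with a null output_size so the second pass skips its per-character space checks.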

Finally, it is useful to prevent the class of stdc_mcerr_insufficient_output/-3 errors from showing up in your code if you know you have enough space. For the non-string functions (those lacking s) that perform a single conversion, a user can pre-allocate a suitably sized, fixed-size buffer with automatic storage duration. This will be facilitated by a group of integral constant expressions contained in macros, which would be:

STDC_MC_MAX, which is the maximum output for a call to one of the X to multi character functions
STDC_MWC_MAX, which is the maximum output for a call to one of the X to multi wide character functions
STDC_C8_MAX, which is the maximum output for a call to one of the X to UTF-8 character functions
STDC_C16_MAX, which is the maximum output for a call to one of the X to UTF-16 character functions
STDC_C32_MAX, which is the maximum output for a call to one of the X to UTF-32 character functions

These values are suitable for use as the size of an array, allowing a properly sized buffer to hold all of the output from the non-string functions. These limits apply only to the non-string functions, which perform a single unit of irreversible input consumption and output (or fail with one of the error codes and output nothing).
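The existing standard library already follows this pattern in one place: MB_LEN_MAX bounds the output of a single c32rtomb call, so a fixed array suffices. The proposed macros would play the same role for the new functions. In this minimal sketch, the helper name encode_one_c32 is hypothetical; MB_LEN_MAX and c32rtomb are the real standard facilities.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: encodes one char32_t into the locale's multibyte
   encoding. The caller's buffer holds MB_LEN_MAX bytes, so a single
   conversion can never run out of space. Returns the number of bytes
   written, or (size_t)-1 if c cannot be represented. */
size_t encode_one_c32(char32_t c, char buf[MB_LEN_MAX]) {
    mbstate_t state;
    memset(&state, 0, sizeof state); /* initial conversion state */
    return c32rtomb(buf, c, &state);
}
```

Because the bound is an integer constant expression, the buffer needs no allocation and no run-time size negotiation.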

Here is the full list of proposed functions:

#include <stdmchar.h>

#define STDC_C8_MAX  4
#define STDC_C16_MAX 2
#define STDC_C32_MAX 1
#define STDC_MC_MAX  1
#define STDC_MWC_MAX 1

typedef enum stdc_mcerr {
  stdc_mcerr_ok                  =  0,
  stdc_mcerr_invalid             = -1,
  stdc_mcerr_incomplete_input    = -2,
  stdc_mcerr_insufficient_output = -3
} stdc_mcerr;

stdc_mcerr stdc_mcnrtomcn(size_t* output_size, char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, char** input, mbstate_t* state);

stdc_mcerr stdc_mwcnrtomcn(size_t* output_size, char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);

stdc_mcerr stdc_c8nrtomcn(size_t* output_size, char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);

stdc_mcerr stdc_c16nrtomcn(size_t* output_size, char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);

stdc_mcerr stdc_c32nrtomcn(size_t* output_size, char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);

stdc_mcerr stdc_mcsnrtomcsn(size_t* output_size, char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, char** input, mbstate_t* state);

stdc_mcerr stdc_mwcsnrtomcsn(size_t* output_size, char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);

stdc_mcerr stdc_c8snrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtomcsn(size_t* output_size, char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);

stdc_mcerr stdc_c16snrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtomcsn(size_t* output_size, char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);

stdc_mcerr stdc_c32snrtomcsn(size_t* output_size, char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);

4.1. Which Function Form?

There are several different ways to write the functions present here, each with their own unique tradeoffs. Since a lot of calling conventions cannot afford struct parameters and returns by-value without elevating them to a level of indirection (filling in a pointer of an object allocated by the caller on the stack), and since much of the functionality of the standard does not follow such a convention, in this paper we simply evaluate the pointer and integer-based forms that will allow all parameters to be passed in registers or similar on most calling conventions we know of (including but not limited to arm7e, arm, arm64, amd64 (VC++ and System V), x86). From those requirements, the most prominent forms are:

// (1)
stdc_mcerr stdc_XnrtoYn(size_t* output_size, charY** output,
	size_t* input_size, const charX** input, mbstate_t* state);
// (2)
stdc_mcerr stdc_XnrtoYn(size_t output_size, charY** output,
	size_t input_size, const charX** input, mbstate_t* state);
// (3)
stdc_mcerr stdc_XnrtoYn(charY** output, const charY* output_last,
	const charX** input, const charX* input_last, mbstate_t* state);
// (4)
stdc_mcerr stdc_XnrtoYn(size_t* output_size, charY* output,
	size_t* input_size, const charX* input, mbstate_t* state);

The form of (1) is what is in this paper and the form that this paper started out with. It is what we are going to move forward with for this proposal. It is similar to iconv, but deviates from that design a bit by using the typical Win32 and similar convention that a null pointer argument changes the behavior to allow for greater flexibility. For example, passing NULL to the first design for the output_size allows an implementation to assume the output buffer is large enough: this can save on size checking on every successful conversion and write out. It also allows passing NULL for output, which allows an end-user to not perform any write outs but simply determine the full count of objects.
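The two NULL conventions described above can be sketched in a few lines. This is a minimal illustration, not the proposed implementation: the demo_ names are hypothetical stand-ins for the stdc_ names, and only the UTF-32 to UTF-8 pair is shown because it needs no conversion state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical stand-in for the proposed stdc_mcerr. */
typedef enum demo_mcerr {
	demo_mcerr_insufficient_output = -3,
	demo_mcerr_incomplete_input = -2,
	demo_mcerr_invalid = -1,
	demo_mcerr_ok = 0
} demo_mcerr;

/* Form (1), single unit: encode one UTF-32 code point as UTF-8. */
demo_mcerr demo_c32nrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, const char32_t** input, mbstate_t* state)
{
	static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
	(void)state; /* this encoding pair carries no state between code points */
	assert(input != NULL && *input != NULL && input_size != NULL);
	if (*input_size == 0)
		return demo_mcerr_ok; /* empty input: nothing to do */
	char32_t cp = **input;
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return demo_mcerr_invalid; /* not a Unicode scalar value */
	size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
	/* A NULL output_size means "assume the buffer is large enough". */
	if (output_size != NULL && *output_size < n)
		return demo_mcerr_insufficient_output;
	/* A NULL output means "count only, write nothing". */
	if (output != NULL) {
		unsigned char* out = *output;
		for (size_t i = n; i-- > 1;) {
			out[i] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation byte */
			cp >>= 6;
		}
		out[0] = (unsigned char)(lead[n] | cp); /* leading byte */
		*output += n;
	}
	if (output_size != NULL)
		*output_size -= n;
	*input += 1;
	*input_size -= 1;
	return demo_mcerr_ok;
}

/* Counting: pass NULL for output and let the size run down from SIZE_MAX. */
size_t demo_count_c8(const char32_t* text, size_t length)
{
	size_t remaining = SIZE_MAX;
	while (length > 0) {
		if (demo_c32nrtoc8n(&remaining, NULL, &length, &text, NULL) != demo_mcerr_ok)
			break;
	}
	return SIZE_MAX - remaining;
}
```

When output_size is NULL instead, the caller can still recover the number of code units written by subtracting the original output pointer from the updated one.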

The drawback is that it requires writes through indirect pointers for both the input and output, as well as for their sizes. This makes multiple updates necessary and duplicates information, at the cost of a moderate decrease in ease of use. Unfortunately, this form turns out to be necessary for all of the functionality proposed here.

4.1.1. Simplification Without Loss of Functionality?

Form (2) attempts to fix this by passing the sizes by value rather than through pointers, while keeping the pointer-to-pointer parameters. Unfortunately for design (2), this makes it impossible to perform a "counting" operation (just calculating the number of code units to write out or the number of input characters that will be consumed) without having a valid buffer to write data into, so that the before/after pointer values for output can be subtracted from one another.

One could then try to smuggle the error code into the return value, but that path is fraught with API design issues. One would need to exclude the values 0, -1, -2, and -3 from being used in return values, or some other set of arbitrary values. These are not good ideas, and C users have struggled with APIs that behaved this way in the past: see the conversion functions currently in the C Standard, which behave in this manner and obfuscate the return value for 0 (which still writes out a character but also indicates other actions performed) or the case of -3 (where multiple write outs may need to happen, so the function needs to be called again). These issues have also caused fundamental limitations in the C standard library, as present in [glibc-25744].

Form (3) is simply a re-visitation of form (2), but using pointers to indicate the size. This is nominally fine, until subtraction between two pointers must be done. If PTRDIFF_MAX is less than SIZE_MAX and the architecture uses, for example, segmented memory, then it is possible to create a region of memory of up to SIZE_MAX that exceeds the size that can be recovered by subtracting the leading pointers from the *_last pointers. This is mostly a theoretical concern on larger systems and hosted systems, but of much graver concern on bare-metal machines with a tiny PTRDIFF_MAX, or machines that make full use of the address space and frequently tap into paged memory.
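The concern with form (3) can be made concrete with a small check. The following is only a sketch with hypothetical helper names: it shows where the pointer subtraction happens and when the result is representable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the length of a buffer with `count` elements can be
   recovered by subtracting its first pointer from its last pointer. */
int demo_length_fits_ptrdiff(size_t count)
{
#if PTRDIFF_MAX >= SIZE_MAX
	(void)count;
	return 1; /* every size_t value is representable as a ptrdiff_t */
#else
	return count <= (size_t)PTRDIFF_MAX;
#endif
}

/* Recover a length from a [first, last) pair, as form (3) would require. */
size_t demo_length_from_pair(const unsigned char* first, const unsigned char* last)
{
	assert(first != NULL && last != NULL && first <= last);
	/* The subtraction below is undefined if the true distance is larger
	   than PTRDIFF_MAX; forms that carry a size_t do not have this limit. */
	ptrdiff_t distance = last - first;
	return (size_t)distance;
}
```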

Finally, form (4) was the most attractive simplification. By keeping the indirect sizes but removing the double indirection from the input and output parameters, it presented a tempting bit of functionality that seemed to keep all of the benefits of form (1) but none of the drawbacks. That, unfortunately, does not apply in this one specific case for "unbounded writing":

stdc_mcerr err = stdc_c8snrtoc32sn(NULL, some_utf32_buffer, &input_sz, some_utf8_input);

The above seems okay, until it becomes clear that you have no idea how many characters were written out into some_utf32_buffer. By passing NULL for the size but having no pointer to update, the information is lost entirely. One could argue that someone should call the version which does the counting first and THEN pass NULL for the size, but this is overly restrictive. For example, a maximally-sized buffer can be prepared beforehand for a UTF-8 to UTF-32 conversion by simply assuming every code unit of input will result in one code point of output (e.g., all of the input is ASCII). One could guarantee the fastest possible writing speed by creating such a maximally-sized buffer and then using NULL for the size, but it would be impossible to know exactly how much output was written in that case. One could compromise the return value to return that information, but that brings up the same API design issues mentioned above. Therefore, we keep the double-pointer form to retain the information properly.

4.1.2. Performance of Double-Pointers?

Benchmarks were inconclusive when it came to determining the cost of each API design. While writing out through (doubly-)indirected pointers incurred a non-negligible cost when serializing all ~2^20.5 available Unicode code points through a UTF-8 to UTF-32 conversion, these costs became noise once bulk functions were written that did not simply invoke the single-conversion functions repeatedly. That is, the bulk functions performed the logical equivalent of the whole bulk operation, and only updated the input/output pointers and sizes once the operation was finished.
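A bulk function of the kind described above might look like the following sketch. It is not the benchmarked code: the demo_ names are hypothetical, the state parameter is omitted, and all four pointers are assumed to be non-NULL to keep the example short. The point is that the loop works on local copies and writes through the indirect pointers exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical stand-in for the proposed stdc_mcerr. */
typedef enum demo_mcerr {
	demo_mcerr_insufficient_output = -3,
	demo_mcerr_incomplete_input = -2,
	demo_mcerr_invalid = -1,
	demo_mcerr_ok = 0
} demo_mcerr;

/* Bulk UTF-32 to UTF-16 conversion using local cursors. */
demo_mcerr demo_c32snrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, const char32_t** input)
{
	assert(output_size != NULL && output != NULL);
	assert(input_size != NULL && input != NULL);
	/* Local copies can live in registers for the whole loop. */
	const char32_t* in = *input;
	char16_t* out = *output;
	size_t in_left = *input_size;
	size_t out_left = *output_size;
	demo_mcerr err = demo_mcerr_ok;
	while (in_left > 0) {
		char32_t cp = *in;
		if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
			err = demo_mcerr_invalid;
			break;
		}
		size_t n = cp < 0x10000 ? 1 : 2;
		if (out_left < n) {
			err = demo_mcerr_insufficient_output;
			break;
		}
		if (n == 1) {
			out[0] = (char16_t)cp;
		} else {
			cp -= 0x10000;
			out[0] = (char16_t)(0xD800 + (cp >> 10));   /* high surrogate */
			out[1] = (char16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
		}
		out += n;
		out_left -= n;
		in += 1;
		in_left -= 1;
	}
	/* Single write-back: reflects only the completed units of work. */
	*input = in;
	*output = out;
	*input_size = in_left;
	*output_size = out_left;
	return err;
}
```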

This could present a problem on at least one implementation, such as musl-libc. musl-libc both reportedly and in its implementation tends to implement its current bulk transcoding routines by simply looping over the single-unit transcoding routines. But its maintainers have stated that they do not care about the performance degradation here and that they are perfectly fine with the cost of writing the bulk functions in terms of the single transcoding functions. As such, we find no reason to change the pointer-based design on the grounds of performance either.

4.1.3. Structure Returns?

We do note that there could be a better interface design in general if the error value and other information were returned in a structure (the current input pointer, output pointer, and sizes-left). Then, we would not have to compromise the error return with a size and properly separate the two so that users do not accidentally misuse it as they have in the past. But, most places in the C Standard avoid using by-value structure returns. Therefore, this idea was, similarly, discarded.

5. Conclusion

The ecosystem deserves ways to get to a statically-known encoding and not rely on implementation and locale-parameterized encodings. This allows developers a way to perform cross-platform text processing without needing to go through fantastic gymnastics to support different languages and platforms. An independent library implementation, cuneicode (presented at Meeting C++ and C++ On Sea), is now publicly available to everyone.

6. Proposed Wording

The following wording is relative to N2731.

6.1. Intent

The intent of the wording is to provide transcoding functions that:

define "code unit" as the smallest piece of information;
define the notion of an "indivisible unit of work";
introduce the notion of multi-unit work that does not use the same 1:N or M:1 design as the previous wchar_t functions;
convert from the execution ("mc") and wide execution ("mwc") encodings to the unicode ("c8", "c16", "c32") encodings and vice-versa;
convert from the execution encoding ("mc") to the wide execution ("mwc") encoding and vice-versa;
provide a way for mbstate_t to be properly initialized as the initial conversion sequence; and,
are entirely thread-safe by default, with no magic internal state aside from what is already required by locales.

6.2. Proposed Specification

Author’s Note: Any � is a stand-in character to be replaced by the editor.

6.2.1. Create a new section 7.S� Text Transcoding Utilities

7.S� Text transcoding utilities <stdmchar.h>

The header <stdmchar.h> declares four status code enumerators, five macros, several types and several functions for transcoding encoded text safely and efficiently. It is meant to supersede conversion utilities from Unicode utilities (7.28) and Extended multibyte and wide character utilities (7.29). It is meant to represent "multi character" functions. These functions can be used to count the number of input code units that form a complete sequence, count the number of output code units required for a conversion with no additional allocation, validate an input sequence, or transcode text from one encoding to another encoding. Particularly, it provides single unit and multi unit transcoding functions for transcoding by working on code units and code points.

A code unit is a single compositional unit of encoded information, usually of type char, unsigned char, char16_t, char32_t, or wchar_t. One or more code units are interpreted as specified by the encoding associated with a given function and code unit, described below.

A code point is a single compositional unit of decoded information. Code points are generally used as the single complete decoded output, or as an intermediary to transcode to other code units, for example, Unicode Code Points as defined in ISO/IEC 10646 for use with UTF-8, UTF-16, and UTF-32.

Inputs to the functions in this section are read until there is enough information taken in to perform an indivisible unit of work. An indivisible unit is the smallest possible input, as defined by the encoding, that can produce either one or more outputs or perform a transformation of any internal state. The conversion of these indivisible units is called an indivisible unit of work, and they are used to complete the below specified transcoding operations.

One or more of the following must hold for any given transcoding operation on an attempt to complete an indivisible unit of work:

— enough input is consumed to perform an output or change the internal state;
— output is written from consuming input, or from the internal state which causes the internal state to change; or,
— an error occurs and both the input and output do not change relative to the current indivisible operation.

The state - managed through the mbstate_t pointer - may or may not change during any of these operations, and may be left in an indeterminate state after an error occurs. For the multi unit functions, the process acts as if it completes one indivisible unit of work repeatedly. When an error occurs, only the input successfully consumed and the output successfully written according to the last indivisible unit of work are reflected in the output values: no other values are written.

The narrow execution encoding is the implementation-defined LC_CTYPE (7.11.1)-influenced locale execution environment encoding. The wide execution encoding is the implementation-defined LC_CTYPE (7.11.1)-influenced locale wide execution environment encoding. Functions in <stdmchar.h> which use char and wchar_t, or their qualified forms, derive their implementation-defined encoding from the locale. The other encodings are UTF-8, associated with unsigned char, UTF-16, associated with char16_t, and UTF-32, associated with char32_t.^FOOTNOTE0)

^FOOTNOTE0)_{Each value is treated as a code unit and not as a container of octets. This means that the choice between, for example, the big-endian and little-endian encoding schemes of UTF-16 is determined by the endianness of the code unit type. Only whole code unit values are used: e.g., a UTF-32 code unit with the value U+0001F377 is represented identically to how U'\U0001F377' is stored.}

For the UTF-8, UTF-16, and UTF-32 encodings, collectively referred to as the Unicode encodings, an indivisible unit of work shall be the sequence of code units that corresponds to one code point. If input is exhausted before a sequence of code units corresponding to one code point can be reached, then stdc_mcerr_incomplete_input shall be returned. If there is an illegal code unit sequence, then stdc_mcerr_invalid shall be returned.^FOOTNOTE1) For the implementation-defined execution and wide execution encodings, they have the same aforementioned requirement if the implementation defines it to be one of the Unicode Encodings.^FOOTNOTE2)

^FOOTNOTE1)_{For example, if an implementation chooses to provide a UTF-8 execution encoding, then it is required to read one full complete code point’s worth of code units. If it cannot, it shall return stdc_mcerr_incomplete_input (if the input sequence is not long enough) or stdc_mcerr_invalid (if the input sequence is not a proper UTF-8 code unit sequence).}

^FOOTNOTE2)_{This may not apply to derivative encodings defined by the implementation. For example, an implementation may define a "partial UTF-8" execution encoding where it stores every read UTF-8 code unit in the state and, rather than returning stdc_mcerr_incomplete_input, returns stdc_mcerr_ok and produces no output. It may accumulate code units and write out a code point when it accumulated enough code units in its internal state. However, this encoding is distinct and separate from the UTF-8 encoding used in the c8 prefixed and suffixed functions.}

The types declared are mbstate_t (described in 7.29.1), wchar_t (described in 7.19), char16_t (described in 7.28), char32_t (described in 7.28), size_t (described in 7.19), and

stdc_mcerr

which is both an enumerated type and a typedef whose enumerators identify the status codes from the function calls described in this section.

The five macros declared are

STDC_C8_MAX STDC_C16_MAX STDC_C32_MAX STDC_MC_MAX STDC_MWC_MAX

which correspond to the maximum output for each single unit conversion function (7.S�.1) and its corresponding output type. Each macro shall expand into an integer constant expression with minimum values, as described in the following table.

There is an association of naming convention, types, encoding, and maximums, used to describe the functions in this clause:

Name	Code Unit Type	Encoding	Maximum Output Macro	Minimum Value
mc	char	The narrow execution encoding, influenced by LC_CTYPE	STDC_MC_MAX	1
mwc	wchar_t	The wide execution encoding, influenced by LC_CTYPE	STDC_MWC_MAX	1
c8	unsigned char	UTF-8	STDC_C8_MAX	4
c16	char16_t	UTF-16	STDC_C16_MAX	2
c32	char32_t	UTF-32	STDC_C32_MAX	1

The maximum output macro values specified in the above table are related to the single unit conversion functions (7.S�.1). These functions perform at most one indivisible unit of work, or return an error. The maximum output macro values shall be integer constant expressions large enough that conversions to the single unit conversion function’s specified encoding shall not overflow a buffer of the proper code unit type with that size. The maximum output macro values do not affect the multi unit conversion functions (7.S�.2), which perform as many indivisible units of work as is possible until an error occurs, until the output space is exhausted, or until the input is exhausted.

The enumerators of the enumerated type stdc_mcerr are defined as follows:

stdc_mcerr_insufficient_output = -3;
stdc_mcerr_incomplete_input = -2;
stdc_mcerr_invalid = -1;
stdc_mcerr_ok = 0;

Each value represents a specific situation when calling the relevant transcoding functions in <stdmchar.h>:

— stdc_mcerr_insufficient_output, when the input is correct and an indivisible unit of work can be performed but there is not enough output space to write to;
— stdc_mcerr_incomplete_input, when the input has been exhausted before an indivisible unit of work could be completed and no encoding error has occurred so far;
— stdc_mcerr_invalid, when an encoding error occurred; and,
— stdc_mcerr_ok, when the operation was successful.
No other value shall be returned from the functions described in this section.
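As a rough illustration of how calling code might consume these status codes, the sketch below mirrors the four enumerators under hypothetical demo_ names. The helper demo_mcerr_state_usable reflects the constraint stated later in this section that a state object may only be reused after a result other than the invalid status.

```c
#include <assert.h>

/* Hypothetical stand-in for the proposed stdc_mcerr enumeration and typedef. */
typedef enum demo_mcerr {
	demo_mcerr_insufficient_output = -3,
	demo_mcerr_incomplete_input = -2,
	demo_mcerr_invalid = -1,
	demo_mcerr_ok = 0
} demo_mcerr;

/* Human-readable description of each status code. */
const char* demo_mcerr_message(demo_mcerr err)
{
	switch (err) {
	case demo_mcerr_ok:
		return "ok";
	case demo_mcerr_invalid:
		return "encoding error in the input";
	case demo_mcerr_incomplete_input:
		return "input ended in the middle of a sequence";
	case demo_mcerr_insufficient_output:
		return "not enough room in the output";
	}
	assert(0); /* no other value is returned by the transcoding functions */
	return "unknown";
}

/* Whether the conversion state may be passed to another call afterwards. */
int demo_mcerr_state_usable(demo_mcerr err)
{
	return err != demo_mcerr_invalid;
}
```

Keeping the status separate from any count means a caller never has to test a size against sentinel values such as (size_t)-1.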

Recommended Practice

The maximum output macro values are intended for use in making automatic storage duration array declarations. Implementations should choose values for the macros that are spacious enough to accommodate a variety of underlying implementation choices for the target encodings supported by the narrow execution encodings and wide execution encodings, which in many cases can output more than one UTF-32 code point. Below is a set of values that may be resilient to future additions and changes in implementations:

#define STDC_C8_MAX 32
#define STDC_C16_MAX 16
#define STDC_C32_MAX 8
#define STDC_MC_MAX 32
#define STDC_MWC_MAX 16
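The intended use of the maximum output macros for automatic storage duration arrays can be sketched as follows. DEMO_C16_MAX is a hypothetical stand-in for STDC_C16_MAX, set here to the proposed minimum value, and demo_encode_c16 is an illustrative helper rather than one of the proposed functions.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical stand-in for STDC_C16_MAX, using the proposed minimum value. */
#define DEMO_C16_MAX 2

/* Encodes one code point as UTF-16 into `out`. Returns the number of code
   units written, or 0 if `cp` is not a Unicode scalar value. */
size_t demo_encode_c16(char32_t cp, char16_t out[DEMO_C16_MAX])
{
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	if (cp < 0x10000) {
		out[0] = (char16_t)cp;
		return 1;
	}
	cp -= 0x10000;
	out[0] = (char16_t)(0xD800 + (cp >> 10));
	out[1] = (char16_t)(0xDC00 + (cp & 0x3FF));
	return 2;
}

/* The caller sizes a stack buffer with the macro, so a single unit of
   work can never overflow it and no dynamic allocation is needed. */
size_t demo_c16_length_of(char32_t cp)
{
	char16_t scratch[DEMO_C16_MAX];
	size_t written = demo_encode_c16(cp, scratch);
	assert(written <= DEMO_C16_MAX);
	return written;
}
```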

Beyond just the Unicode encodings mentioned above, implementations are encouraged to not store partial reads or partial writes in the mbstate_t object with these functions unless strictly necessary. Implementers providing additional encodings for use with these functions should, to the extent possible for a given encoding, always attempt to transcode a complete unit of information. If a sequence of code units cannot form a complete state transition or produce output, then an implementation should return stdc_mcerr_incomplete_input if the input is exhausted, or stdc_mcerr_invalid if the input sequence is incorrect.

7.S�.1 Restartable and Non-Restartable Sized Single Unit Conversion Functions

#include <stdmchar.h>

stdc_mcerr stdc_mcnrtomcn(size_t* output_size, char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcnrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, char** input, mbstate_t* state);

stdc_mcerr stdc_mwcnrtomcn(size_t* output_size, char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcnrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);

stdc_mcerr stdc_c8nrtomcn(size_t* output_size, char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8nrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);

stdc_mcerr stdc_c16nrtomcn(size_t* output_size, char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16nrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);

stdc_mcerr stdc_c32nrtomcn(size_t* output_size, char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtomwcn(size_t* output_size, wchar_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtoc8n(size_t* output_size, unsigned char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtoc16n(size_t* output_size, char16_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32nrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);

Let:

— transcoding function be one of the functions listed above transcribed in the form stdc_mcerr stdc_XnrtoYn(size_t* output_size, charY** output, size_t* input_size, const charX** input, mbstate_t* state);
— X and Y be one of the prefixes from the table from 7.S�;
— charX and charY be the associated code unit types for X and Y from the table from 7.S�; and
— encoding X and encoding Y be the associated encoding types for X and Y from the table from 7.S�.

The transcoding functions and restartable transcoding functions take an input buffer and an output buffer of the associated code unit types, potentially with their sizes. The function consumes any number of code units of type charX to perform a single indivisible unit of work necessary to convert some amount of input from encoding X to encoding Y, which results in zero or more output code units of type charY.

Constraints

On success or failure, the transcoding functions and restartable transcoding functions shall return one of the above error codes (7.S�). state shall be a valid pointer to at least one mbstate_t object. If *state is not initialized to the initial conversion sequence for the function on first use, or is used after being input into a function whose result was not one of stdc_mcerr_ok, stdc_mcerr_insufficient_output, or stdc_mcerr_incomplete_input, then the behavior of the functions is unspecified. For the restartable transcoding functions, if input is NULL, then *state is set to the initial conversion sequence as described below and no other work is performed. Otherwise, for both restartable and non-restartable functions, input shall not be NULL.

Semantics

The restartable transcoding functions take the form:

stdc_mcerr stdc_XnrtoYn(size_t* output_size, charY** output, size_t* input_size, const charX** input, mbstate_t* state);

They convert from code units of type charX interpreted according to encoding X to code units of type charY according to encoding Y given a conversion state of value *state. These functions perform only a single indivisible unit of work. They do nothing and return stdc_mcerr_ok if the input is empty (signified only by first checking whether input_size is NULL, then checking whether *input_size is zero). The behavior of the restartable transcoding functions is as follows.

— If state is NULL, then an automatic storage duration object of type mbstate_t is created. It is initialized to the initial conversion sequence and used wherever state is used in this paragraph.
— If input is NULL, then *state is set to the initial conversion sequence. The function returns stdc_mcerr_ok.
— The function reads code units from *input if *input_size is large enough to produce an indivisible unit of work. If no encoding errors have occurred but the input is exhausted before an indivisible unit of work can be computed, the function returns stdc_mcerr_incomplete_input. If an encoding error occurs, then the function returns stdc_mcerr_invalid.
— If output_size is not NULL, then *output_size will be decremented by the number of code units that would have been written to *output (even if output was NULL). If the output would be exhausted (*output_size would be decremented below zero by a write operation), the function returns stdc_mcerr_insufficient_output and does not decrement *output_size.
— If output_size is NULL and output is not NULL, then enough space is assumed in the buffer pointed to by *output for the entire operation. The behavior is undefined if the output buffer is not large enough for the transcoding operation.
— If output is NULL, then no output will be written. *input is still read and incremented. Otherwise, *output is a valid pointer. If output_size is not NULL, then *output points to at least *output_size charY code units.

If the function returns stdc_mcerr_ok, then all of the following is true:

— *input is incremented by the number of code units read and successfully converted;
— *input_size is decremented by the number of code units read and successfully converted from the input;
— if output is not NULL, *output is incremented by the number of code units written to the output; and,
— if output_size is not NULL, *output_size is decremented by the number of code units written to the output.

Otherwise, if an error is returned then none of the above occurs. If the return value is stdc_mcerr_invalid and state is not NULL, then *state is in an unspecified state.
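The semantics above can be sketched for one concrete pair, UTF-8 to UTF-32. This is an illustrative reading of the wording, not a normative implementation: the demo_ names are hypothetical, a NULL input_size is treated as empty input, and a truncated sequence is reported as incomplete without checking whether its prefix is already ill-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical stand-in for the proposed stdc_mcerr. */
typedef enum demo_mcerr {
	demo_mcerr_insufficient_output = -3,
	demo_mcerr_incomplete_input = -2,
	demo_mcerr_invalid = -1,
	demo_mcerr_ok = 0
} demo_mcerr;

/* One indivisible unit of work: one code point's worth of UTF-8 code units. */
demo_mcerr demo_c8nrtoc32n(size_t* output_size, char32_t** output,
	size_t* input_size, const unsigned char** input, mbstate_t* state)
{
	if (input == NULL) {
		/* Reset request: an all-zero mbstate_t is the initial conversion state. */
		if (state != NULL)
			memset(state, 0, sizeof *state);
		return demo_mcerr_ok;
	}
	if (input_size == NULL || *input_size == 0)
		return demo_mcerr_ok; /* empty input */
	assert(*input != NULL);
	const unsigned char* in = *input;
	size_t length;
	char32_t cp, minimum;
	if (in[0] < 0x80) { length = 1; cp = in[0]; minimum = 0; }
	else if ((in[0] & 0xE0) == 0xC0) { length = 2; cp = in[0] & 0x1F; minimum = 0x80; }
	else if ((in[0] & 0xF0) == 0xE0) { length = 3; cp = in[0] & 0x0F; minimum = 0x800; }
	else if ((in[0] & 0xF8) == 0xF0) { length = 4; cp = in[0] & 0x07; minimum = 0x10000; }
	else return demo_mcerr_invalid; /* stray continuation or invalid lead byte */
	for (size_t i = 1; i < length; ++i) {
		if (i >= *input_size)
			return demo_mcerr_incomplete_input; /* sequence cut short */
		if ((in[i] & 0xC0) != 0x80)
			return demo_mcerr_invalid;
		cp = (cp << 6) | (char32_t)(in[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates, and values past U+10FFFF. */
	if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return demo_mcerr_invalid;
	if (output_size != NULL && *output_size < 1)
		return demo_mcerr_insufficient_output;
	/* Nothing above modified the arguments; commit the unit of work now. */
	if (output != NULL) {
		**output = cp;
		*output += 1;
	}
	if (output_size != NULL)
		*output_size -= 1;
	*input += length;
	*input_size -= length;
	return demo_mcerr_ok;
}
```

Because every error path returns before any argument is updated, the caller's pointers and sizes always describe the last completed unit of work.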

7.S�.2 Restartable and Non-Restartable Sized Multi Unit Conversion Functions

#include <stdmchar.h>

stdc_mcerr stdc_mcsnrtomcsn(size_t* output_size, char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, char** input, mbstate_t* state);
stdc_mcerr stdc_mcsnrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, char** input, mbstate_t* state);

stdc_mcerr stdc_mwcsnrtomcsn(size_t* output_size, char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);
stdc_mcerr stdc_mwcsnrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, wchar_t** input, mbstate_t* state);

stdc_mcerr stdc_c8snrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtomcsn(size_t* output_size, char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);
stdc_mcerr stdc_c8snrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, unsigned char** input, mbstate_t* state);

stdc_mcerr stdc_c16snrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtomcsn(size_t* output_size, char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);
stdc_mcerr stdc_c16snrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, char16_t** input, mbstate_t* state);

stdc_mcerr stdc_c32snrtomcsn(size_t* output_size, char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtomwcsn(size_t* output_size, wchar_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtoc8sn(size_t* output_size, unsigned char** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtoc16sn(size_t* output_size, char16_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);
stdc_mcerr stdc_c32snrtoc32sn(size_t* output_size, char32_t** output,
	size_t* input_size, char32_t** input, mbstate_t* state);

Let:

— multi unit transcoding function be one of the functions listed above transcribed in the form stdc_mcerr stdc_XsnrtoYsn(size_t* output_size, charY** output, size_t* input_size, const charX** input, mbstate_t* state);
— X and Y be one of the prefixes from the table from 7.S�;
— charX and charY be the associated code unit types for X and Y from the table from 7.S�; and
— encoding X and encoding Y be the associated encoding types for X and Y from the table from 7.S�.

The multi unit transcoding functions take an input buffer and an output buffer of the associated code unit types, potentially with their sizes. The functions consume any number of code units to perform a sequence of indivisible units of work, which results in zero or more output code units. The functions will repeatedly perform an indivisible unit of work until either an error occurs or the input is exhausted.

Constraints

On success or failure, the transcoding functions and restartable transcoding functions shall return one of the above error codes (7.S�). If state is not NULL and *state is not initialized to the initial conversion sequence for the function on its first use, or is used after being input into a function whose result was not one of stdc_mcerr_ok, stdc_mcerr_insufficient_output, or stdc_mcerr_incomplete_input, then the behavior of the functions is unspecified. For the restartable transcoding functions, if input is NULL, then *state is set to the initial conversion sequence as described below and no other work is performed. Otherwise, input must not be NULL and input_size must not be NULL.

Semantics

The multi unit transcoding functions take the form:

stdc_mcerr stdc_XsnrtoYsn(size_t* output_size, charY** output, size_t* input_size, const charX** input, mbstate_t* state);

They convert from code units of type charX interpreted according to encoding X to code units of type charY according to encoding Y given a conversion state of value *state. The behavior of these functions is as-if the analogous single unit function stdc_XnrtoYn was repeatedly called, with the same output, output_size, input, input_size, and state parameters, to perform multiple indivisible units of work. The function stops when an error occurs or the input is exhausted (only signified when *input_size is zero).

The restartable multi unit transcoding functions behave as-if:

1. If state is NULL, then an automatic storage duration object of type mbstate_t is created. It is initialized to the initial conversion sequence and used wherever state is used in this paragraph.
2. If input is NULL, then *state is set to the initial conversion sequence. The function returns stdc_mcerr_ok.
3. stdc_XntoYn is called with output_size, output, input_size, input, and state with its result stored in a temporary named err.
4. If err is stdc_mcerr_ok, then:

4.1 if *input_size is 0, return err;
4.2 otherwise, go back to (2).

5. Otherwise, return err.
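The steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the demo_ names, the demo_mcerr enumeration, and the toy ASCII to UTF-32 single unit function are all hypothetical stand-ins chosen so the loop can be compiled and run today. The sketch also assumes that a zero-valued mbstate_t describes the initial conversion state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical stand-in for stdc_mcerr; not a real library type. */
typedef enum demo_mcerr {
    demo_mcerr_ok = 0,
    demo_mcerr_invalid,
    demo_mcerr_incomplete_input,
    demo_mcerr_insufficient_output
} demo_mcerr;

/* Toy single unit function (assumption): one ASCII code unit in,
   one UTF-32 code unit out. One call is one indivisible unit of work. */
static demo_mcerr demo_asciinrtoc32n(size_t* output_size, char32_t** output,
                                     size_t* input_size, const char** input,
                                     mbstate_t* state) {
    (void)state; /* ASCII carries no shift state */
    if (*input_size == 0) return demo_mcerr_ok;
    unsigned char c = (unsigned char)**input;
    if (c > 0x7F) return demo_mcerr_invalid;
    if (output_size != NULL) {
        if (*output_size == 0) return demo_mcerr_insufficient_output;
        *output_size -= 1;
    }
    if (output != NULL) {
        **output = c;
        *output += 1;
    }
    *input += 1;
    *input_size -= 1;
    return demo_mcerr_ok;
}

/* Multi unit function written directly from the as-if steps. */
static demo_mcerr demo_asciisnrtoc32sn(size_t* output_size, char32_t** output,
                                       size_t* input_size, const char** input,
                                       mbstate_t* state) {
    mbstate_t local_state;
    if (state == NULL) { /* step 1: private state, initial sequence */
        memset(&local_state, 0, sizeof local_state);
        state = &local_state;
    }
    if (input == NULL) { /* step 2: reset only, no other work */
        memset(state, 0, sizeof *state);
        return demo_mcerr_ok;
    }
    assert(input_size != NULL); /* constraint: required when input != NULL */
    for (;;) {
        /* step 3: perform one indivisible unit of work */
        demo_mcerr err = demo_asciinrtoc32n(output_size, output,
                                            input_size, input, state);
        if (err != demo_mcerr_ok) return err; /* step 5 */
        if (*input_size == 0) return err;     /* step 4.1 */
        /* step 4.2: input remains, so go around again */
    }
}
```

Because the loop only forwards its arguments, every pointer and size update is made by the single unit function, so the caller's arguments always reflect the last completed unit of work.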

The following is true after the invocation:

— *input will be incremented by the number of code units read and successfully converted. If stdc_mcerr_ok is returned, then this will consume all the input. Otherwise, *input will point to the location just after the last successfully performed single unit of work.
— *input_size is decremented by the number of code units read from *input that were successfully converted. If no error occurred, then *input_size will be 0.
— if output is not NULL, *output will be incremented by the number of code units written from successfully performed single units of work.
— if output_size is not NULL, *output_size is decremented by the number of code units written to the output or that would have been written to the output.
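One consequence of these post-conditions is that a caller can measure the required output before allocating it: passing a NULL output with a large *output_size leaves the difference as the number of code units that would have been written. The following sketch illustrates that two-pass pattern under assumptions. The demo_ names and the toy Latin-1 to UTF-8 converter are hypothetical and exist only to make the example self-contained; they are not part of the proposed interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for stdc_mcerr, reduced to what this toy needs. */
typedef enum demo_mcerr {
    demo_mcerr_ok = 0,
    demo_mcerr_insufficient_output
} demo_mcerr;

/* Toy multi unit converter (assumption): Latin-1 bytes in, UTF-8 out.
   output may be NULL (count only); output_size may be NULL (unbounded). */
static demo_mcerr demo_latin1snrtoc8sn(size_t* output_size,
                                       unsigned char** output,
                                       size_t* input_size,
                                       const unsigned char** input,
                                       mbstate_t* state) {
    (void)state; /* both encodings are stateless */
    while (*input_size > 0) {
        unsigned char c = **input;
        size_t needed = (c < 0x80) ? 1 : 2;
        if (output_size != NULL) {
            if (*output_size < needed) return demo_mcerr_insufficient_output;
            *output_size -= needed; /* decremented even when only counting */
        }
        if (output != NULL) {
            if (needed == 1) {
                (*output)[0] = c;
            } else {
                (*output)[0] = (unsigned char)(0xC0 | (c >> 6));
                (*output)[1] = (unsigned char)(0x80 | (c & 0x3F));
            }
            *output += needed; /* advanced only for completed units */
        }
        *input += 1;
        *input_size -= 1;
    }
    return demo_mcerr_ok;
}

/* Two passes: count with a NULL output, then convert into an exact-size
   buffer. Returns a malloc'd buffer (caller frees) or NULL on failure. */
static unsigned char* demo_latin1_to_utf8(const unsigned char* text,
                                          size_t size, size_t* out_len) {
    const unsigned char* in = text;
    size_t in_size = size;
    size_t remaining = SIZE_MAX;
    if (demo_latin1snrtoc8sn(&remaining, NULL, &in_size, &in, NULL)
        != demo_mcerr_ok) {
        return NULL;
    }
    size_t needed = SIZE_MAX - remaining; /* units that would be written */
    unsigned char* buffer = malloc(needed != 0 ? needed : 1);
    if (buffer == NULL) return NULL;
    in = text;
    in_size = size;
    unsigned char* out = buffer;
    size_t out_size = needed;
    if (demo_latin1snrtoc8sn(&out_size, &out, &in_size, &in, NULL)
        != demo_mcerr_ok) {
        free(buffer);
        return NULL;
    }
    *out_len = needed;
    return buffer;
}
```

The counting pass and the writing pass share one code path, which is the design benefit of letting output be NULL rather than providing a separate length function.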

If the return value is stdc_mcerr_invalid and state is not NULL, then *state is in an unspecified state.

7. Acknowledgements

Thank you to Philipp K. Krause for responding to the e-mails of a newcomer to matters of C and providing me with helpful guidance. Thank you to Rajan Bhakta, Daniel Plakosh, and David Keaton for guidance on how to submit these papers and get started in WG14. Thank you to Tom Honermann for lighting the passionate fire for proper text handling in me for not just C++, but for our sibling language C.

8. Appendix

8.1. (From revisions 0-3) What about UTF{X} ↔ UTF{Y} functions?

Functions interconverting between different Unicode Transformation Formats are not proposed here because, while useful, both sides of the encoding are statically known by the developer. The C Standard only wants to consider functionality strictly in the case where the implementation has more information / private information that the developer cannot access in a well-defined and standard manner. A developer can write their own Unicode Transformation Format conversion routines and get them completely right, whereas a developer cannot write the Wide Character and Multibyte Character functions without incredible heroics and/or error-prone assumptions.

This brings up an interesting point, however: if __STDC_UTF16__ and __STDC_UTF32__ both exist, does that not mean the implementation controls what c16 and c32 mean? This is true. However, within an (admittedly limited) survey of implementations, there has been no suggestion or report of an implementation which does not use UTF-16 and UTF-32 for their char16_t and char32_t literals, respectively.

Thankfully, that does not seem to be the case at this time. It will also no longer be the case in C23, as the paper "char16_t and char32_t literals should be UTF-16 and UTF-32" has been accepted.

May the Tower of Babel’s curse be defeated.

N2902
Restartable and Non-Restartable Functions for Efficient Character Conversions

Published Proposal, 2022-01-01
