Title: Restartable Functions for Efficient Character Conversions
Shortname: 3366
Revision: 13
!Previous Revisions: N3265 (r12), N3095 (r11), N3075 (r10), N3031 (r9), N2999 (r8), N2966 (r7), N2902 (r6), N2730 (r5), N2620 (r4), n2595 (r3), n2500 (r2), n2440 (r1), n2431 (r0)
Status: P
Date: 2024-10-01
Group: WG14
!Proposal Category: Library Feature Request
!Target: C2y
Editor: JeanHeyd Meneide, phdofthehouse@gmail.com
Editor: Shepherd (Shepherd's Oasis LLC), shepherd@soasis.org
URL: https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Efficient%20Character%20Conversions.html
!Latest: https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Efficient%20Character%20Conversions.html
!Paper Source: GitHub ThePhD/future_cxx
Issue Tracking: GitHub https://github.com/ThePhD/future_cxx/issues
Metadata Order: Previous Revisions, Editor, Latest, Paper Source, Implementation, Issue Tracking, Project, Audience, Proposal Category, Target
Markup Shorthands: markdown yes
Toggle Diffs: no
Abstract: Implementations firmly control how both the Wide Character and Multi-Byte Character strings are treated at runtime by the Standard Library. While this control is fine, users of the Standard Library have no portability guarantees about how these library functions may behave, especially in the face of encodings that do not support each other's full codepage. And, despite additions to C11 for maybe-UTF16 and maybe-UTF32 encoded types, these functions only offer conversions of a single unit of information at a time, leaving orders of magnitude of performance on the table. This paper proposes and explores additional library functionality to allow users to convert multibyte and wide characters into a statically known encoding to enhance the ability to work with text.

	`mb`	`wc`	`mbs`	`wcs`	`c8`	`c16`	`c32`	`c8s`	`c16s`	`c32s`
`mb`	➖	✔️			❌	🇷	🇷
`wc`	✔️	➖			❌	❌	❌
`mbs`			➖	✔️				❌	❌	❌
`wcs`			✔️	➖				❌	❌	❌
`c8`	❌	❌			➖	❌	❌
`c16`	🇷	❌			❌	➖	❌
`c32`	🇷	❌			❌	❌	➖
`c8s`			❌	❌				➖	❌	❌
`c16s`			❌	❌				❌	➖	❌
`c32s`			❌	❌				❌	❌	➖

	`mc`	`mwc`	`mcs`	`mwcs`	`c8`	`c16`	`c32`	`c8s`	`c16s`	`c32s`
`mc`	🅿️✔️	✔️			🅿️✔️	🅿️✔️	🅿️✔️
`mwc`	✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
`mcs`			🅿️✔️	✔️				🅿️✔️	🅿️✔️	🅿️✔️
`mwcs`			✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️
`c8`	🅿️✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
`c16`	🅿️✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
`c32`	🅿️✔️	🅿️✔️			🅿️✔️	🅿️✔️	🅿️✔️
`c8s`			🅿️✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️
`c16s`			🅿️✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️
`c32s`			🅿️✔️	🅿️✔️				🅿️✔️	🅿️✔️	🅿️✔️

#define MB_UTF8 (MB_CUR_MAX == 4)
#define MB_UTF16 0
#define MB_UTF32 0
#define WCHAR_UTF8 0
#define WCHAR_UTF16 1
#define WCHAR_UTF32 0
```

A more sophisticated implementation is demonstrated using values computed with functions in [ztd.idk's headers](https://github.com/soasis/idk/blob/613a52df6995c0afd3f2457219a3961859f006b2/include/ztd/idk/encoding_detection.h#L60-L89) and [ztd.idk's source files](https://github.com/soasis/idk/blob/613a52df6995c0afd3f2457219a3961859f006b2/source/ztd/idk/encoding_detection.c.cpp).

## `mbstate_t` and state handling ## {#design-state}

The newly proposed functions have some behavior that, on inspection, may seem unnecessary, duplicated, or superfluous. For example, processing a null pointer for the `input` (or `*input`) results in setting `*state` to the initial conversion state. We note here that this functionality exists in the API because the legacy APIs such as `mbrtowc` and its friends also have that behavior. In order to make portability as easy as humanly possible (and to allow, with minimal tweaking, reusing such functions internally if it's helpful), we leave the equivalent behaviors intact in this API.

Additionally, the processing of code units with a value of `\0` also setting `*state` to the initial conversion state is a legacy holdover from how the previous functions work. Again, this provides multiple ways to clear `*state` that are not just the equivalent of zero-initializing the bits with `= {}`, `= (mbstate_t){}`, or similar assignment or initialization expressions. This can be useful when one wants to preserve certain internal bits in their library, such as not clearing an implementation-defined set of bits for doing non-validating, fast conversions.
See [the `cnc_mcstate_t` assumption setting/getting documentation](https://ztdcuneicode.readthedocs.io/en/latest/api/mcstate_t.html#state-functions) for an example of affecting a state object to perform additional behaviors beyond what is sanctioned by the specification itself.

# Conclusion # {#conclusion}

The ecosystem deserves ways to get to a statically-known encoding and not rely on implementation and locale-parameterized encodings. This allows developers a way to perform cross-platform text processing without needing to go through fantastic gymnastics to support different languages and platforms. An independent library implementation, *cuneicode* (talked about from [[Unicode_greater_detail|Meeting C++]] and [[Unicode_deep_c_diving|C++ On Sea]]), is now [[cuneicode|publicly available to everyone]].

# Proposed Wording # {#wording}

The following wording is [[n3054|relative to n3054]].

## Intent ## {#wording-intent}

The intent of the wording is to provide transcoding functions that:

- define "code unit" as the smallest piece of information;
- define the notion of an "indivisible unit of work";
- remove the requirement that wide characters must represent a full, complete unit for all wide execution encodings that exist on the machine as they do not today;
- introduce the notion of multi-unit work that does not use the same 1:N or M:1 design as the previous `wchar_t` functions;
- introduce new macros that allow for a programmer to tell the difference between
- convert from the execution ("`mc`") and wide execution ("`mwc`") encodings to the Unicode ("`c8`", "`c16`", "`c32`") encodings and vice-versa;
- convert from the execution ("`mc`") encoding to the wide execution ("`mwc`") encoding and vice-versa;
- provide a way for `mbstate_t` to be properly initialized as the initial conversion state; and,
- to be entirely thread-safe by default with no magic internal state aside from what is already required by locales.
## Proposed Specification ## {#wording-specification}

*Author's Note: Any � or ✨ is a stand-in character to be replaced by the editor.*

### Modify §3.12.3 Wide Character to change the definition to remove the one-to-one correspondence ### {#wording-specification-3.12.3}

…

3.12.3

**wide character** value representable by an object of type `wchar_t`~~, capable of representing any character in the current locale~~

…

### Modify §6.10.9.2 Environment Macros to add new wide literal-only predefined macro ### {#wording-specification-6.10.9.2}

6.10.10.3 Environment Macros

...
:: `__STDC_ISO_10646__` An integer constant of the form `yyyymmL` (for example, `199712L`). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type `wchar_t`, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
:: `__STDC_LITERAL_UTF8__` A strictly positive integer constant expression if this symbol is defined. If it is defined, then the literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `char` (with the value of each `char` treated as an `unsigned char`) from a string literal (excluding non-Universal character escape sequences) forms a valid encoding of a UTF-8 sequence.
:: `__STDC_LITERAL_UTF16__` A strictly positive integer constant expression if this symbol is defined. If it is defined, then the literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `char` (with the value of each `char` treated as an `unsigned char`) from a string literal (excluding non-Universal character escape sequences) forms a valid encoding of a UTF-16 sequence.
:: `__STDC_LITERAL_UTF32__` A strictly positive integer constant expression if this symbol is defined. If it is defined, then the literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `char` (with the value of each `char` treated as an `unsigned char`) from a string literal (excluding non-Universal character escape sequences) forms a valid encoding of a UTF-32 sequence.
:: `__STDC_MB_MIGHT_NEQ_WC__` The integer constant `1`, intended to indicate that, in the wide literal encoding for `wchar_t`, a member of the basic character set need not have a code value equal to its value when used as the lone character in an integer character constant.
:: `__STDC_WIDE_LITERAL_UTF8__` A strictly positive integer constant expression if this symbol is defined. If it is defined, then the wide literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `wchar_t` from a `wchar_t` string literal (excluding non-Universal character escape sequences) forms a valid encoding of a UTF-8 sequence.
:: `__STDC_WIDE_LITERAL_UTF16__` A strictly positive integer constant expression if this symbol is defined. If it is defined, then the wide literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `wchar_t` from a `wchar_t` string literal (excluding non-Universal character escape sequences) forms a valid encoding of a UTF-16 sequence.
:: `__STDC_WIDE_LITERAL_UTF32__` A strictly positive integer constant expression if this symbol is defined. If it is defined, then the wide literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `wchar_t` from a `wchar_t` string literal (excluding non-Universal character escape sequences) forms a valid encoding of a UTF-32 sequence.

### Modify §7.21 Common Definitions `<stddef.h>` to Remove Harmful `wchar_t` Text ### {#wording-specification-7.21}

7.21 Common definitions `<stddef.h>`

…

The types are … … which is an object type whose alignment is the greatest fundamental alignment;

```cpp
wchar_t
```

which is an integer type whose range of values can represent ~~distinct codes for all members of the largest extended character set specified among the supported locales;~~codes (or part of a sequence of codes) for all the members of the supported wide execution and wide literal encodings (6.2.9); the null character shall have the code value zero. Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define `__STDC_MB_MIGHT_NEQ_WC__`; and,

### Modify §7.31 Extended Multibyte and Wide Character Utilities `<wchar.h>` to Clarify Role of `wchar_t` ### {#wording-specification-7.31}

7.31 Extended multibyte and wide character utilities `<wchar.h>`

7.31.1 Introduction

The header `<wchar.h>` defines ~~four~~several macros, and declares four data types, one tag, and many functions.

…

The macros defined are `NULL` (described in 7.21); `WCHAR_MIN`, `WCHAR_MAX`, and `WCHAR_WIDTH` (described in 7.22);

```cpp
WCHAR_UTF8
WCHAR_UTF16
WCHAR_UTF32
```

which expand to an expression of signed or unsigned integer type (that is potentially not an integer constant expression) whose value is non-zero if:

- the wide execution encoding (6.2.9) is capable of representing every character in the required Unicode set;
- the width of `wchar_t` is at least 8, 16, or 32 for UTF-8, UTF-16, or UTF-32, respectively;
- and, the values of a sequence of `wchar_t` objects consumed and produced by related character functions have values consistent with a sequence of code units of the UTF-8, UTF-16, or UTF-32 encodings, respectively;

```cpp
MB_UTF8
MB_UTF16
MB_UTF32
```

which expand to an expression of signed or unsigned integer type (that is potentially not an integer constant expression) whose value is non-zero if:

- the execution encoding (6.2.9) is capable of representing every character in the required Unicode set;
- the width of `char` is at least 8, 16, or 32 for UTF-8, UTF-16, or UTF-32, respectively;
- and, the values of a sequence of `char` objects (treated as a sequence of `unsigned char` objects) consumed and produced by related character functions have values consistent with a sequence of code units of the UTF-8, UTF-16, or UTF-32 encodings, respectively; and,

```cpp
WEOF
```

which expands to a …
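As a non-normative illustration: because these macros may expand to something that is not an integer constant expression, user code would most safely test them with an ordinary `if` rather than `#if`. The sketch below assumes no shipping `<wchar.h>` defines `MB_UTF8` yet, so it supplies a stand-in borrowed from the `MB_CUR_MAX` heuristic shown earlier in this paper; `narrow_kind` and `classify_narrow_encoding` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h> /* MB_CUR_MAX */

/* Assumption: the proposed macro is not available, so a stand-in is used.
   The MB_CUR_MAX comparison is only a rough guess, not a reliable test. */
#ifndef MB_UTF8
#define MB_UTF8 (MB_CUR_MAX == 4)
#endif

typedef enum narrow_kind { narrow_other = 0, narrow_utf8 = 1 } narrow_kind;

/* The macro can depend on the current locale, so it is evaluated at run
   time with `if`; `#if MB_UTF8` is not guaranteed to be usable. */
static narrow_kind classify_narrow_encoding(void) {
	if (MB_UTF8) {
		return narrow_utf8;
	}
	return narrow_other;
}

/* Usage: the answer is queried whenever it is needed, since changing
   LC_CTYPE may change it. */
static void classify_usage(void) {
	narrow_kind kind = classify_narrow_encoding();
	assert(kind == narrow_utf8 || kind == narrow_other);
}
```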

### Modify §6.2.9 Encodings to include definitions of Code Point, Code Unit, Wide/Narrow Execution Encodings, and Encoding Error ### {#wording-specification-6.2.9}

6.2.9 Encodings

…

…

A *code unit* is a single compositional unit of encoded information, usually of type `char`, `wchar_t`, `char8_t`, `char16_t`, or `char32_t`.

A *code point* is a single compositional unit of decoded information. Code points are generally used as the single complete decoded output, or as an intermediary to transcode to other code units. A *Unicode code point* is a single compositional unit of decoded information as defined in ISO/IEC 10646, typically used to convert to or from UTF-8, UTF-16, and UTF-32.

The *narrow execution encoding* is the implementation-defined, `LC_CTYPE`, (7.11.1)-influenced, locale-based execution environment encoding. The *wide execution encoding* is the implementation-defined, `LC_CTYPE` (7.11.1)-influenced, locale-based wide execution environment encoding. Both of these encodings are called the *execution encodings*.

An *unrecorded encoding error* occurs when an encoding, decoding, or transcoding function encounters an input sequence of code units or code points that
:: — does not form a valid sequence according to the encoding being associated with the sequence, or
:: — is not representable in the output encoding or coded character set.

An *encoding error* is the same as an unrecorded encoding error, except that the value of the macro `EILSEQ` (7.5) is stored in `errno` when such an error occurs during execution of the functions defined in this document unless otherwise specified.
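The distinction between the two kinds of error can be sketched in a few lines. This is a non-normative illustration using hypothetical helper names (`sketch_is_scalar`, `sketch_is_scalar_recorded`); only `errno` and `EILSEQ` come from the existing standard library.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical check: nonzero when cp is a Unicode scalar value. A failure
   here is "unrecorded": it is visible only through the return value. */
static int sketch_is_scalar(unsigned long cp) {
	return cp <= 0x10FFFFul && !(cp >= 0xD800ul && cp <= 0xDFFFul);
}

/* The same check in the style of the legacy functions: the failure is also
   recorded by storing EILSEQ in errno. */
static int sketch_is_scalar_recorded(unsigned long cp) {
	if (!sketch_is_scalar(cp)) {
		errno = EILSEQ;
		return 0;
	}
	return 1;
}

static void errno_usage(void) {
	errno = 0;
	assert(!sketch_is_scalar(0xD800ul));          /* lone surrogate rejected */
	assert(errno == 0);                           /* errno left untouched */
	assert(!sketch_is_scalar_recorded(0xD800ul));
	assert(errno == EILSEQ);                      /* failure was recorded */
}
```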

### Modify §7.21.3 Files to remove the italics from the term "encoding error", since its initial definition was moved to §6.2.9 Encodings ### {#wording-specification-7.21.3}

7.21.3 Files

…

An ~~*encoding error*~~encoding error occurs if the character sequence presented to the underlying `mbrtowc` function does not form a valid (generalized) multibyte character, or if the code value passed to the underlying `wcrtomb` does not correspond to a valid (generalized) multibyte character. The wide character input/output functions and the byte input/output functions store the value of the macro `EILSEQ` in `errno` if and only if an encoding error occurs.

### Create a new section §7.S✨ and §7.S✨.1 Text Transcoding Utilities ### {#wording-specification-7.S✨}

7.S✨ Text transcoding utilities ``

7.S✨.1 General

The header `` declares four status code enumerators, several macros, several types and several functions for transcoding encoded text safely and efficiently. It is meant to supersede conversion utilities from Unicode utilities (7.28) and Extended multibyte and wide character utilities (7.29). It is meant to represent "multi character" functions. These functions can be used to count the number of input code units that form a complete sequence, count the number of output characters required for a conversion with no additional allocation, validate an input sequence, or transcode text from one encoding to another encoding. Particularly, it provides single unit and multi unit transcoding functions for transcoding by working on *code units* and *code points*.

Inputs to the functions in this clause are read until there is enough information taken in to perform an *indivisible unit of work*. An indivisible unit is the smallest possible input, as defined by the encoding, that can produce one or more outputs, perform a transformation of any state, or both. The conversion of these indivisible units is called an indivisible unit of work, and they are used to complete the transcoding operations specified in this subclause.

One or more of the following must hold for any given transcoding operation on an attempt to complete an indivisible unit of work:
:: — enough input is consumed to perform an output or change the state;
:: — output is written from consuming input, or output is written from the state which causes the state to change; or,
:: — an error occurs and both the input and output do not change relative to the current indivisible unit of work.

For the multi unit functions, the process acts as if it completes one indivisible unit of work repeatedly. When an error occurs, only the input successfully consumed, the state successfully altered, and the output successfully written according to the last indivisible unit of work are reflected in the output values of the functions in this clause: no other values are written.

Functions in `` which use `char` and `wchar_t`, or their qualified forms, derive their implementation-defined encodings from the narrow execution encoding or the wide execution encoding (6.2.9), respectively. The other encodings are UTF-8, associated with `char8_t`, UTF-16, associated with `char16_t`, and UTF-32, associated with `char32_t`.

NOTE Each value is treated as code units with an unsigned value and not as a container of octets. This means that the decision of, for example, UTF-16 in big or little endian encoding scheme is decided by the endianness of the code unit type. Only whole code unit values are used (i.e. a UTF-32 code point value of U+1F377 represents a value identical to how `U'\U0001F377'` is stored by the implementation).

For the UTF-8, UTF-16, and UTF-32 encodings, collectively referred to as the *Unicode encodings*, an indivisible unit of work for a read operation shall be the sequence of code units that corresponds to one Unicode code point. The value of each code unit is treated as a sequence of unsigned values. If input is exhausted before a sequence of code units corresponding to one Unicode code point can be reached, then `stdc_mcerr_incomplete_input` shall be returned. If there is an illegal code unit sequence, then `stdc_mcerr_invalid` shall be returned. For the implementation-defined execution and wide execution encodings, they have the same aforementioned requirement if the implementation defines it to be one of the Unicode encodings.

NOTE If an implementation chooses to provide, for example, an execution encoding as the input encoding for a transcoding function that is defined to be the same as the UTF-8 encoding, then it is required to read one full complete Unicode code point's worth of code units. If it cannot, then it returns `stdc_mcerr_incomplete_input` (if the input sequence is not long enough but does not have any invalid code units in the sequence) or `stdc_mcerr_invalid` (if the input sequence is not a proper code unit sequence).
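A non-normative sketch of how a read for one indivisible unit of work might separate "incomplete" from "invalid" for UTF-8. The enumeration and `sketch_read_utf8` are hypothetical stand-ins, not the proposed API, and the function deliberately simplifies one case that is called out in its comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for three of the proposed status codes. */
typedef enum sketch_read_err {
	sketch_read_ok = 0,
	sketch_read_invalid = -1,
	sketch_read_incomplete = -2
} sketch_read_err;

/* Reads the code units of one Unicode code point from `in`. On success the
   code point goes to *cp and the number of code units read goes to *used.
   Simplification: a prefix that can never be completed (such as E0 80) is
   reported as incomplete rather than invalid when the input runs out. */
static sketch_read_err sketch_read_utf8(const unsigned char *in, size_t size,
                                        uint_least32_t *cp, size_t *used) {
	assert(in != NULL && cp != NULL && used != NULL);
	if (size == 0) {
		*used = 0; /* empty input: no work to do */
		return sketch_read_ok;
	}
	unsigned char b0 = in[0];
	size_t len;
	uint_least32_t value, min;
	if (b0 < 0x80) { len = 1; value = b0; min = 0; }
	else if ((b0 & 0xE0) == 0xC0) { len = 2; value = b0 & 0x1F; min = 0x80; }
	else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; min = 0x800; }
	else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; min = 0x10000; }
	else return sketch_read_invalid; /* stray continuation or 0xF8..0xFF */
	for (size_t i = 1; i < len; ++i) {
		if (i >= size) return sketch_read_incomplete; /* ran out of input */
		if ((in[i] & 0xC0) != 0x80) return sketch_read_invalid;
		value = (value << 6) | (uint_least32_t)(in[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates, and values past U+10FFFF. */
	if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
		return sketch_read_invalid;
	*cp = value;
	*used = len;
	return sketch_read_ok;
}
```

With the four code units `F0 9F 8D B7` the sketch yields U+1F377; given only the first three it reports incomplete input, and given `FF` or the overlong pair `C0 80` it reports an invalid sequence.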

NOTE The requirements for Unicode encodings do not apply to derivative encodings defined by the implementation. For example, an implementation may define a "partial UTF-8" execution encoding where it stores every read UTF-8 code unit in the state and, rather than returning `stdc_mcerr_incomplete_input`, returns `stdc_mcerr_ok` and produces no output. It may accumulate code units and write out a code point when it accumulates enough code units in its internal state. However, such an encoding is distinct and separate from the UTF-8 encoding used in the `c8` prefixed and suffixed functions described in this clause.

NOTE The implementation-defined execution, wide execution, literal, and wide literal encodings can also have different behaviors if they do not define themselves as one of the Unicode encodings. For example, if `__STDC_ENDIAN_NATIVE__` (7.18.2) is equivalent to `__STDC_ENDIAN_LITTLE__`, but the wide execution encoding is defined to be "UTF-16 Big Endian" ("UTF16-BE"), then it may be classified as not one of the three recognized Unicode encodings according to this subclause. As such, a sequence of `wchar_t` elements that is null-terminated produced by transcoding functions in this subclause can behave differently than expected; e.g. `L"\U0001F377"`, if valid and defined to be UTF-16, can potentially not compare equal to a sequence of `wchar_t` objects produced by successful use of the transcoding functions from the code points U+01F377 and U+000000.

For all functions in this clause, when a code unit value of 0 (e.g. `'\0'`) is encountered in the input, the `mbstate_t` object in use for the transcoding operation is set to the initial shift state. The output associated with the indivisible unit of work consists of the appropriate null character preceded by any shift sequence necessary to cause the output to be in the initial shift state.

NOTE As described in 7.30.6, an object of type `mbstate_t` can always be set to the initial conversion state by initializing it with `= {0};` or `= {};`. An existing `mbstate_t` object can always be set to the initial conversion state by assigning to it from the expressions `(mbstate_t){0}` or `(mbstate_t){}`.
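The reset idioms from the note can be checked with the existing `<wchar.h>` function `mbsinit`. The following is a small non-normative sketch; `reset_idioms_work` is a hypothetical name, and the empty-brace forms are left out because they require C23.

```c
#include <assert.h>
#include <wchar.h>

/* Returns nonzero when both reset idioms leave the object in the initial
   conversion state, as reported by mbsinit. */
static int reset_idioms_work(void) {
	mbstate_t from_init = { 0 };        /* initialization with = {0} */
	mbstate_t from_assign;
	from_assign = (mbstate_t){ 0 };     /* assignment from a compound literal */
	return mbsinit(&from_init) != 0 && mbsinit(&from_assign) != 0;
}

static void reset_usage(void) {
	assert(reset_idioms_work());
}
```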

Changing the `LC_CTYPE` category causes any conversion state already in use with the functions in this clause to be indeterminate.

The types declared are `mbstate_t` (described in 7.29.1), `wchar_t` (described in 7.19), `char8_t` (described in 7.28), `char16_t` (described in 7.28), `char32_t` (described in 7.28), `size_t` (described in 7.19), and

```cpp
stdc_mcerr
```

which is both an enumerated type and a typedef whose enumerators identify the status codes from the function calls described in this clause.

The macros declared are `NULL` (described in 7.21); `WCHAR_MIN`, `WCHAR_MAX`, and `WCHAR_WIDTH` (described in 7.22); `WCHAR_UTF8`, `WCHAR_UTF16`, `WCHAR_UTF32`, `MB_UTF8`, `MB_UTF16`, and `MB_UTF32` (described in 7.31); and,

```cpp
STDC_C8_MAX
STDC_C16_MAX
STDC_C32_MAX
STDC_MC_MAX
STDC_MWC_MAX
```

which correspond to the maximum output for each single unit conversion function (7.S✨.2) and its corresponding output type. Each macro shall expand into an integer constant expression with minimum values, as described in Table ✨MEOW✨.

There is an association of naming convention, types, encoding, and maximums, used to describe the functions in this clause:

Table ✨MEOW✨: Transcoding function associations
Name	Code Unit Type	Encoding	Maximum Output Macro	Minimum Value
mc	`char`	The narrow execution encoding, influenced by `LC_CTYPE`	`STDC_MC_MAX`	`1`
mwc	`wchar_t`	The wide execution encoding, influenced by `LC_CTYPE`	`STDC_MWC_MAX`	`1`
c8	`char8_t`	UTF-8	`STDC_C8_MAX`	`4`
c16	`char16_t`	UTF-16	`STDC_C16_MAX`	`2`
c32	`char32_t`	UTF-32	`STDC_C32_MAX`	`1`

The maximum output macro values specified in the Table ✨MEOW✨ are related to the single unit conversion functions (7.S✨.2). These functions perform at most one indivisible unit of work, or return an error. The maximum output macro values shall be integer constant expressions large enough that conversions to the single unit conversion function's specified encoding shall not overflow a buffer of the proper code unit type with that size. The maximum output macro values do not affect the multi unit conversion functions (7.S✨.3), which perform as many indivisible units of work as is possible until an error occurs, until the output space is exhausted, or until the input is exhausted.
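The relationship between the single unit and multi unit functions can be sketched as a loop over one indivisible unit of work at a time. This is a non-normative illustration for UTF-16 to UTF-32 only: every `sketch_` name is a hypothetical stand-in, `restrict` and the `mbstate_t` parameter are omitted, and all pointers are assumed to be non-null.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum sketch_loop_err {
	sketch_loop_ok = 0,
	sketch_loop_invalid = -1,
	sketch_loop_incomplete_input = -2,
	sketch_loop_insufficient_output = -3
} sketch_loop_err;

/* Single unit step: converts exactly one code point, or changes nothing. */
static sketch_loop_err sketch_c16_to_c32_one(size_t *output_size,
	uint_least32_t **output, size_t *input_size, const uint_least16_t **input) {
	if (*input_size == 0) return sketch_loop_ok;
	const uint_least16_t *in = *input;
	uint_least32_t cp = in[0];
	size_t used = 1;
	if (cp >= 0xD800 && cp <= 0xDBFF) { /* leading surrogate */
		if (*input_size < 2) return sketch_loop_incomplete_input;
		if (in[1] < 0xDC00 || in[1] > 0xDFFF) return sketch_loop_invalid;
		cp = 0x10000 + ((cp - 0xD800) << 10) + (uint_least32_t)(in[1] - 0xDC00);
		used = 2;
	} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
		return sketch_loop_invalid; /* lone trailing surrogate */
	}
	if (*output_size < 1) return sketch_loop_insufficient_output;
	**output = cp;
	*output += 1;
	*output_size -= 1;
	*input += used;
	*input_size -= used;
	return sketch_loop_ok;
}

/* Multi unit: repeats the single step until the input is exhausted or a
   step reports something other than success. */
static sketch_loop_err sketch_c16_to_c32_many(size_t *output_size,
	uint_least32_t **output, size_t *input_size, const uint_least16_t **input) {
	while (*input_size > 0) {
		sketch_loop_err err =
			sketch_c16_to_c32_one(output_size, output, input_size, input);
		if (err != sketch_loop_ok) return err;
	}
	return sketch_loop_ok;
}
```

When the loop stops early, the pointers and sizes reflect only the units of work that completed, so a caller can supply more input or more space and resume from the same position.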

Unlike the functions present in `` and ``, the functions present in this clause can write more than one `wchar_t` value for conversions based on the wide execution encoding to accommodate a wider set of implementation-defined encodings, so long as the number of code units does not exceed the maximum output macro value of `STDC_MWC_MAX`.

The enumerators of the enumerated type `stdc_mcerr` are defined as follows:
```cpp
stdc_mcerr_ok = 0,
stdc_mcerr_invalid = -1,
stdc_mcerr_incomplete_input = -2,
stdc_mcerr_insufficient_output = -3,
```

Each value represents a specific situation when calling the relevant transcoding functions in ``:
:: — `stdc_mcerr_insufficient_output`, when the input is correct and an indivisible unit of work can be performed but there is not enough output space to write to;
:: — `stdc_mcerr_incomplete_input`, when input has been exhausted and the sequence is not incorrect but there are no more input values;
:: — `stdc_mcerr_invalid`, when an unrecorded encoding error occurred; and,
:: — `stdc_mcerr_ok`, when the operation was successful (none of the situations described for the other values of this enumerated type apply).

No other value shall be returned from the functions described in this clause.
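One plausible way for a caller to act on the four status codes is sketched below. This is non-normative: the enumeration mirrors the proposed values under hypothetical `sketch_` names, and treating the two "not enough" codes as resumable is a caller policy assumed here, not something the wording requires.

```c
#include <assert.h>

/* Local mirror of the proposed enumerators, with hypothetical names. */
typedef enum sketch_status {
	sketch_status_ok = 0,
	sketch_status_invalid = -1,
	sketch_status_incomplete_input = -2,
	sketch_status_insufficient_output = -3
} sketch_status;

/* Nonzero when the caller can supply more input or more output space and
   call the transcoding function again from the same position. */
static int sketch_status_is_resumable(sketch_status err) {
	switch (err) {
	case sketch_status_incomplete_input:
	case sketch_status_insufficient_output:
		return 1;
	case sketch_status_invalid:
	case sketch_status_ok:
		return 0;
	}
	return 0; /* not reached for the four values above */
}

static void status_usage(void) {
	assert(sketch_status_is_resumable(sketch_status_insufficient_output));
	assert(!sketch_status_is_resumable(sketch_status_invalid));
}
```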

Recommended Practice

The maximum output macro values are intended for use in making automatic storage duration array declarations. Implementations should choose values for the macros that are spacious enough to accommodate a variety of underlying implementation choices for the target encodings supported by the narrow execution encodings and wide execution encodings, which for some encodings can output more than one UTF-32 code point. A set of values which are most resilient to future additions and changes in implementations is as follows:
```cpp
#define STDC_C8_MAX 32
#define STDC_C16_MAX 16
#define STDC_C32_MAX 8
#define STDC_MC_MAX 32
#define STDC_MWC_MAX 16
```
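A non-normative sketch of the intended use: the macro sizes an automatic storage duration array that is large enough for one indivisible unit of work. The stand-in definition of `STDC_C16_MAX` uses the minimum from Table ✨MEOW✨ because no shipping header is assumed to provide it, and `sketch_encode_utf16` is a hypothetical helper rather than a proposed function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for the proposed macro, using the table's minimum. */
#ifndef STDC_C16_MAX
#define STDC_C16_MAX 2
#endif

/* Hypothetical helper: writes the UTF-16 code units for cp into out, which
   must have room for at least two code units. Returns the number of code
   units written, or 0 when cp is not a Unicode scalar value. */
static size_t sketch_encode_utf16(uint_least32_t cp, uint_least16_t *out) {
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
	if (cp < 0x10000) {
		out[0] = (uint_least16_t)cp;
		return 1;
	}
	cp -= 0x10000;
	out[0] = (uint_least16_t)(0xD800 + (cp >> 10));   /* leading surrogate */
	out[1] = (uint_least16_t)(0xDC00 + (cp & 0x3FF)); /* trailing surrogate */
	return 2;
}

static void buffer_usage(void) {
	uint_least16_t units[STDC_C16_MAX]; /* no heap allocation needed */
	size_t written = sketch_encode_utf16(0x1F377, units);
	assert(written <= STDC_C16_MAX);
}
```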

Beyond just the Unicode encodings specified previously, implementations are encouraged to not store partial reads or partial writes in the `mbstate_t` object with these functions unless strictly necessary. Implementations providing additional encodings for use with these functions should, to the extent possible for a given encoding, always define an indivisible unit of work to transcode as complete a unit of information as is possible or produce an error. If a sequence of code units cannot form a complete shift sequence or produce output, then an implementation should return `stdc_mcerr_incomplete_input` if the input is exhausted, or `stdc_mcerr_invalid` if the input sequence is incorrect.

### Create a new section §7.S✨.2 Single Unit Sized Conversion Functions ### {#wording-specification-7.S✨.2}
7.S✨.2 Single Unit Sized Conversion Functions

Synopsis

```cpp
#include
stdc_mcerr stdc_mcnrtomcn(size_t*restrict output_size, char*restrict *restrict output,
	size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output,
	size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output,
	size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output,
	size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output,
	size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_mwcnrtomcn(size_t*restrict output_size, char*restrict *restrict output,
	size_t*restrict input_size, const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output,
	size_t*restrict input_size, const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output,
	size_t*restrict input_size, const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output,
	size_t*restrict input_size, const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output,
	size_t*restrict input_size, const wchar_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c8nrtomcn(size_t*restrict output_size, char*restrict *restrict output,
	size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output,
	size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output,
	size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output,
	size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output,
	size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c16nrtomcn(size_t*restrict output_size, char*restrict *restrict output,
	size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output,
	size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output,
	size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output,
	size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output,
	size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c32nrtomcn(size_t*restrict output_size, char*restrict *restrict output,
	size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output,
	size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output,
	size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output,
	size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output,
	size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state);
```

Description

Let *transcoding function* be one of the functions listed previously, transcribed in the form

```cpp
stdc_mcerr stdc_XnrtoYn(size_t*restrict output_size, charY*restrict *restrict output,
	size_t*restrict input_size, const charX*restrict *restrict input, mbstate_t*restrict state)
```

with the following properties:
:: — *X* and *Y* are each one of the prefixes/suffixes in the table from 7.S✨.1;
:: — `charX` and `charY` are the associated code unit types for *X* and *Y* in the table from 7.S✨.1; and
:: — *encoding X* and *encoding Y* are the associated encodings for *X* and *Y* in the table from 7.S✨.1.

The transcoding functions take an input buffer and possibly an output buffer of the associated code unit types, potentially with their sizes. The function consumes any number of code units of type `charX` to perform a single indivisible unit of work necessary to convert some amount of input from encoding X to encoding Y, which results in zero or more output code units of type `charY`.

An `mbstate_t` object describes the conversion state for the current conversion if it is not in an unspecified state (as described further later in this clause) and:
:: — the conversion is between Unicode encodings; or
:: — the input fragment is the start of an encoded sequence in the input encoding and the `mbstate_t` object was initialized to the initial conversion state; or
:: — the input fragment is a continuation of an encoded sequence and `mbstate_t` is the result of having advanced to the input and possibly output positions through the application of prior calls to the same transcoding function.

The behavior is undefined when a function described by this subclause is invoked and `*state` (or, if `state` is a null pointer, the `mbstate_t` object created that is unique to the current invocation) does not describe the conversion state for the current conversion.

The transcoding functions convert from code units of type `charX` interpreted according to encoding X to code units of type `charY` according to encoding Y given a conversion state of value `*state`. This function only performs a single indivisible unit of work. It returns `stdc_mcerr_ok` if the input is empty. The input is considered empty if `input_size` is a null pointer, or `*input_size` is zero if `input_size` is not a null pointer.

Any time *input code unit reads* in the following description is used:
:: — code units are read from `*input`, sequentially, and interpreted according to encoding X;
:: — if `*input_size` is smaller than the necessary amount of sequential reads that must be performed from `*input` to complete an indivisible unit of work, the function does not modify any of `*input`, `*input_size`, `*output`, or `*output_size`. It returns `stdc_mcerr_incomplete_input`;
:: — if an unrecorded encoding error occurs (e.g. the input read is invalid according to encoding X or the input is valid but cannot be converted to encoding Y), then the function returns `stdc_mcerr_invalid`;
:: — if the function returns `stdc_mcerr_ok`, the function decrements `*input_size` by the number of input code units that were read and increments `*input` by the number of input code units that were read for the complete indivisible unit of work.

Any time *output code unit writes* in the following description is used:
:: — converted code units are potentially written into `*output`, sequentially, according to encoding Y;
:: — if `output_size` is not a null pointer and if `*output_size` is smaller than the necessary amount of sequential writes that must be performed to complete an indivisible unit of work, the function does not modify any of `*input`, `*input_size`, `*output`, or `*output_size`. It returns `stdc_mcerr_insufficient_output`;
:: — if `output_size` is a null pointer, but `output` and `*output` are not null pointers, it is assumed `*output` has enough space to perform the necessary sequential writes and the behavior is undefined if the target output buffer is not large enough for this transcoding operation's indivisible unit of work;
:: — if the function returns `stdc_mcerr_ok`, the function decrements `*output_size` by the number of output code units that are, or could have been (if `output` or `*output` are null pointers), written. If `output` and `*output` are not null pointers, then `*output` is incremented by the number of output code units that are written to complete an indivisible unit of work.

The behavior of the transcoding functions is as follows:

1. If `state` is a null pointer, then an automatic storage duration object of type `mbstate_t` is created which is unique to the current invocation. It is initialized to the initial conversion state and a pointer to this object is used wherever `state` is used in this paragraph.
2. Then, if `input` is a null pointer or `*input` is a null pointer, then `*state` is set to the initial conversion state, the function returns `stdc_mcerr_ok`, and no other actions are taken.
3. Otherwise, if `*state` is in an implementation-defined conversion state that requires it, any necessary output code unit writes are performed to return `*state` to the initial conversion state.
4. The function performs input code unit reads and subsequently performs the output code unit writes as is necessary to complete an indivisible unit of work.
5. The function returns `stdc_mcerr_ok`.
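The read/write rules and the steps above can be sketched with a stand-in function. This is only an illustration under stated assumptions: `demo_mcerr` and `demo_c8nrtoc32n` are hypothetical names (not part of this proposal or any library), `unsigned char` and `uint_least32_t` stand in for `char8_t` and `char32_t`, and the `mbstate_t` parameter is omitted because a UTF-8 to UTF-32 conversion needs no state.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proposed error codes; not a real library. */
typedef enum {
	demo_mcerr_ok = 0,
	demo_mcerr_invalid = -1,
	demo_mcerr_incomplete_input = -2,
	demo_mcerr_insufficient_output = -3
} demo_mcerr;

/* Hypothetical single unit conversion: one UTF-8 sequence in, one UTF-32
   code unit out. The mbstate_t parameter is omitted (assumption). */
demo_mcerr demo_c8nrtoc32n(size_t *output_size, uint_least32_t **output,
	size_t *input_size, const unsigned char **input)
{
	/* Null or empty input: success, nothing is read or written. */
	if (input == NULL || *input == NULL || input_size == NULL || *input_size == 0)
		return demo_mcerr_ok;

	const unsigned char *in = *input;
	size_t len;
	uint_least32_t cp;
	if (in[0] < 0x80) { len = 1; cp = in[0]; }
	else if (in[0] >= 0xC2 && in[0] <= 0xDF) { len = 2; cp = in[0] & 0x1Fu; }
	else if (in[0] >= 0xE0 && in[0] <= 0xEF) { len = 3; cp = in[0] & 0x0Fu; }
	else if (in[0] >= 0xF0 && in[0] <= 0xF4) { len = 4; cp = in[0] & 0x07u; }
	else return demo_mcerr_invalid;

	/* Input code unit reads: on any error, no pointer or size is modified. */
	for (size_t i = 1; i < len; ++i) {
		if (i >= *input_size) return demo_mcerr_incomplete_input;
		if ((in[i] & 0xC0u) != 0x80u) return demo_mcerr_invalid;
		cp = (cp << 6) | (in[i] & 0x3Fu);
	}
	/* Reject overlong forms, surrogates, and out-of-range values. */
	if ((len == 3 && cp < 0x800u) || (len == 4 && cp < 0x10000u)
		|| cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
		return demo_mcerr_invalid;

	/* Output code unit writes: a non-null size is checked before writing. */
	if (output_size != NULL && *output_size < 1)
		return demo_mcerr_insufficient_output;
	if (output != NULL && *output != NULL) {
		**output = cp;
		*output += 1;
	}
	if (output_size != NULL)
		*output_size -= 1;

	/* Only on success are the input pointer and size advanced. */
	*input += len;
	*input_size -= len;
	return demo_mcerr_ok;
}
```

Calling this sketch with the three bytes of U+20AC followed by `A` consumes exactly three input code units and produces one output code unit; calling it with only the first two bytes returns the incomplete-input code and leaves every pointer and size untouched.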

NOTE If `state` is a null pointer, and the function uses e.g. a created automatic storage duration `mbstate_t` object that is discarded by the end of the invocation, then any potential conversion state contained in the created `mbstate_t` object and used during processing could become unrecoverable to the program.

Returns

On success or failure, the transcoding functions shall return one of the above error codes (7.S✨.1). If `input` is a null pointer or `*input` is a null pointer, then `*state` is set to the initial conversion state and no other work is performed.

If the function returns `stdc_mcerr_ok`, then all of the following is true:
:: — if `input` and `*input` are not null pointers, `*input` is incremented by the number of code units read and successfully converted;
:: — if `input_size` is not a null pointer, `*input_size` is decremented by the number of code units read and successfully converted from the input;
:: — if `output` and `*output` are not null pointers, `*output` is incremented by the number of code units written to the output; and,
:: — if `output_size` is not a null pointer, `*output_size` is decremented by the number of code units written to the output.

Otherwise, if an error is returned then none of the above occurs. If the return value is `stdc_mcerr_invalid`, then `*state` is in an unspecified state. If the return value is `stdc_mcerr_incomplete_input` or `stdc_mcerr_insufficient_output`, then `*state` is not changed.

Recommended Practice

Implementations should take advantage of the information of null pointer values for the output size pointer, output data pointer, or both, to drastically improve performance characteristics for assumed unlimited write space, output counting scenarios, or input validation/counting, respectively.

Implementations should prefer returning an error for an incomplete input sequence over storing intermediate data within the state where possible for non-Unicode encodings. This can make it easier for functionality built on top of the functions in this subclause to report errors without skipping over potentially invalid input data, resulting in potentially more accurate reports. Error handling and recovery also greatly benefit from being able to examine invalid input; avoiding skipping over invalid data by consuming it into a state and reporting no errors means that functionality built on top can potentially discard what should be considered unneeded, already-processed data.

### Create a new subsection §7.S✨.3 Multi Unit Sized Conversion Functions ### {#wording-specification-7.S✨.3}

7.S✨.3 Multi Unit Sized Conversion Functions

Synopsis

```cpp
#include <stdmchar.h>

stdc_mcerr stdc_mcsnrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_mwcsnrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c8snrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c16snrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c32snrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
```

Description

Let *multi unit transcoding function* in this subclause be one of the functions listed above transcribed in the form

```cpp
stdc_mcerr stdc_XsnrtoYsn(size_t*restrict output_size,
	charY*restrict *restrict output, size_t*restrict input_size,
	const charX*restrict *restrict input, mbstate_t*restrict state);
```

with the following properties:
:: — *X* and *Y* be one of the prefixes/suffixes in the table from 7.S✨.1;
:: — `charX` and `charY` be the associated code unit types for *X* and *Y* in the table from 7.S✨.1; and
:: — *encoding X* and *encoding Y* be the associated encoding types for *X* and *Y* in the table from 7.S✨.1.

The multi unit transcoding functions take an input buffer and possibly an output buffer of the associated code unit types, potentially with their sizes. The functions consume any number of code units to perform a sequence of indivisible units of work, which results in zero or more output code units. The functions repeatedly perform an indivisible unit of work until either an error occurs or the input is exhausted.

An `mbstate_t` object describes the conversion state for the current conversion if it is not in an unspecified state (as described later in this subclause) and:
:: — the conversion is between Unicode encodings; or
:: — the input fragment is the start of an encoded sequence in the input encoding and the `mbstate_t` object was initialized to the initial conversion state; or
:: — the input fragment is a continuation of an encoded sequence and `mbstate_t` is the result of having advanced to the input and possibly output positions through the application of prior calls to the same transcoding function.

The behavior is undefined when a function described by this subclause is invoked and `*state` (or, if `state` is a null pointer, the `mbstate_t` object created for that case) does not describe the conversion state for the current conversion.

If `input_size` is a null pointer, `input` shall either be a null pointer or point to a null pointer. Otherwise, `input` shall be a pointer to a non-null pointer to an array of at least `*input_size` elements.

The multi unit transcoding functions convert from code units of type `charX` interpreted according to encoding X to code units of type `charY` according to encoding Y given a conversion state of value `*state`. The behavior of these functions is as-if the analogous single unit function `stdc_XnrtoYn` was repeatedly called, with the same `output`, `output_size`, `input`, `input_size`, and `state` parameters, to perform multiple indivisible units of work. The function stops when an error occurs or the input is exhausted (only signified when `*input_size` is zero).

The multi unit transcoding functions behave as-if:

1. If `state` is a null pointer, then an automatic storage duration object of type `mbstate_t` is created which is unique to the current invocation. It is initialized to the initial conversion state and a pointer to this object is used wherever `state` is used in this paragraph.
2. `stdc_XnrtoYn` is called with `output_size`, `output`, `input_size`, `input`, and `state` with its result stored in a temporary named `err`.
3. If `input` is a null pointer or `*input` is a null pointer, return `err`.
4. If `err` is not `stdc_mcerr_ok`, then return `err`.
5. Otherwise, if `*input_size` is greater than zero, go back to (2).
6. Otherwise, if `mbsinit(state)` returns zero, go back to (2).
7. Otherwise, return `err`.

Returns

On success or failure, the transcoding functions shall return one of the above error codes (7.S✨.1). If `state` is not a null pointer and `*state` is not initialized to the initial conversion state for the function on its first use, or is used after being input into a function whose result is not one of `stdc_mcerr_ok`, `stdc_mcerr_incomplete_input`, or `stdc_mcerr_insufficient_output`, the behavior of the functions is unspecified.

The following is true after the invocation:
:: — `*input` is incremented by the number of code units read and successfully converted if `input` and `*input` are not null pointers. If `stdc_mcerr_ok` is returned, then all the input is consumed. Otherwise, `*input` points to the location just after the last successfully completed indivisible unit of work.
:: — `*input_size` is decremented by the number of code units read from `*input` that were successfully converted. If no error occurred, then `*input_size` is 0.
:: — if `output` and `*output` are not null pointers, `*output` is incremented by the number of code units written by successfully completed indivisible units of work.
:: — if `output_size` is not a null pointer, `*output_size` is decremented by the number of code units written to the output or that would have been written to the output.

If the return value is `stdc_mcerr_invalid` and `state` is not a null pointer, then `*state` is in an unspecified state.
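Because an error leaves `*input` just past the last completed unit of work, a caller can layer its own recovery policy on top. The following is a minimal sketch of one such policy (substitute U+FFFD and skip one code unit), not a recommended or normative design. All names are hypothetical: `demo_asciisnrtoc32sn` is a toy ASCII to UTF-32 stand-in that requires non-null arguments and has no `mbstate_t`; with the proposed functions, a caller would presumably also need to reinitialize `*state` after `stdc_mcerr_invalid`, since that state is unspecified.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proposed error codes; not a real library. */
typedef enum {
	demo_mcerr_ok = 0,
	demo_mcerr_invalid = -1,
	demo_mcerr_insufficient_output = -3
} demo_mcerr;

/* Toy multi unit function: ASCII in, UTF-32 out. Bytes >= 0x80 are invalid.
   All pointer arguments must be non-null in this sketch. */
demo_mcerr demo_asciisnrtoc32sn(size_t *output_size, uint_least32_t **output,
	size_t *input_size, const unsigned char **input)
{
	while (*input_size > 0) {
		if (**input >= 0x80)
			return demo_mcerr_invalid;
		if (*output_size == 0)
			return demo_mcerr_insufficient_output;
		**output = **input;
		*output += 1;
		*output_size -= 1;
		*input += 1;
		*input_size -= 1;
	}
	return demo_mcerr_ok;
}

/* Converts as much as possible, replacing each invalid input code unit with
   U+FFFD. Returns the number of code units written, or (size_t)-1 when the
   output buffer is too small. */
size_t demo_lossy_convert(size_t out_n, uint_least32_t *out,
	size_t in_n, const unsigned char *in)
{
	const size_t out_start = out_n;
	for (;;) {
		demo_mcerr err = demo_asciisnrtoc32sn(&out_n, &out, &in_n, &in);
		if (err == demo_mcerr_ok)
			return out_start - out_n;
		if (err != demo_mcerr_invalid || out_n == 0)
			return (size_t)-1;
		/* `in` points at the offending unit: substitute and skip it. */
		*out++ = 0xFFFDu;
		--out_n;
		++in;
		--in_n;
	}
}
```

How many code units to skip after an invalid sequence is a policy choice of the caller; skipping exactly one is only the simplest option for this toy encoding.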

NOTE The object unique to the invocation is reused for every call in the second step of the multi unit sized conversion algorithm, and not recreated. If `state` is a null pointer, and the function uses e.g. a created automatic storage duration `mbstate_t` object that is discarded by the end of the invocation, then any potential conversion state contained in the created `mbstate_t` object, and used or accumulated during multi unit processing, could become unrecoverable to the program.

**EXAMPLE 1** The following is an example of using a single indivisible unit sized conversion function `stdc_mcnrtoc8n` to implement a multi unit sized conversion algorithm:

```cpp
#include <stdmchar.h>

stdc_mcerr sample_mcsnrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t* restrict state) {
	mbstate_t invocation_unique_internal_state;
	if (state == nullptr) {
		invocation_unique_internal_state = (mbstate_t){};
		state = &invocation_unique_internal_state;
	}
	if (input == nullptr || *input == nullptr) {
		return stdc_mcnrtoc8n(output_size, output, input_size, input, state);
	}
	for (;;) {
		stdc_mcerr err = stdc_mcnrtoc8n(output_size, output,
			input_size, input, state);
		if (err != stdc_mcerr_ok) {
			return err;
		}
		if (*input_size > 0) {
			continue;
		}
		// some execution encodings (6.2.9) may contain
		// additional output as input gets processed
		int state_finished = mbsinit(state);
		if (state_finished == 0) {
			continue;
		}
		return err;
	}
}
```

**EXAMPLE 2** The multi unit sized conversion functions can be used to perform other functionality, such as counting, validation, and more by using a null pointer value for specific arguments:

```cpp
#include <stdmchar.h>
#include <stdint.h> // SIZE_MAX

bool is_valid_utf16_from_utf8(size_t str_n,
	const char8_t str[restrict static str_n]) {
	// no output buffer and no output size: validation only
	stdc_mcerr err = stdc_c8snrtoc16sn(nullptr, nullptr,
		&str_n, &str, nullptr);
	return err == stdc_mcerr_ok;
}

size_t count_utf16_from_utf8(size_t str_n,
	const char8_t str[restrict static str_n]) {
	// output size but no output buffer: counting only
	const size_t utf16_before_n = SIZE_MAX;
	size_t utf16_after_n = utf16_before_n;
	stdc_mcerr err = stdc_c8snrtoc16sn(&utf16_after_n, nullptr,
		&str_n, &str, nullptr);
	return err == stdc_mcerr_ok ? utf16_before_n - utf16_after_n : 0;
}

bool unbounded_conversion_utf16_from_utf8(size_t str_n,
	const char8_t str[restrict static str_n], char16_t* restrict dest_str) {
	// output buffer but no output size: unbounded write
	stdc_mcerr err = stdc_c8snrtoc16sn(nullptr, &dest_str,
		&str_n, &str, nullptr);
	return err == stdc_mcerr_ok;
}

int main () {
	const char8_t str[] = u8"\"Saw a \U0001F9DC \u2014"
		u8"didn't catch her\u2026 \U0001F61E\"\n\t- Sniff";
	// include null terminator
	const size_t str_n = (sizeof(str) / sizeof(*str));
	if (!is_valid_utf16_from_utf8(str_n, str)) {
		// input not valid
		return 1;
	}
	size_t utf16_str_n = count_utf16_from_utf8(str_n, str);
	constexpr size_t utf16_str_max_size
		= STDC_C16_MAX * (sizeof(str) / sizeof(*str));
	char16_t utf16_str[utf16_str_max_size] = {};
	if (utf16_str_max_size < utf16_str_n) {
		// buffer too small
		return 2;
	}
	if (!unbounded_conversion_utf16_from_utf8(str_n, str, utf16_str)) {
		// write failed
		return 3;
	}
	// At this point, utf16_str is a veritable UTF-16 string.
	// As noted above, null terminator from str was included:
	// utf16_str is a sequence of UTF-16 code units plus the null
	// terminator, in a suitable form at the end of the UTF-16 string.
	return 0;
}
```

The above program demonstrates validating, counting, and doing an unbounded (size unsafe) write using the provided functions.
Caution should be taken when a program uses unbounded writes, as the size of the buffer is assumed to be large enough during the call to the multi unit sized conversion function when `output_size` is a null pointer. An implementation can detect the above cases where specific arguments or their pointed to values are a null pointer value, and provide improved implementations relying on properties from these assumptions.
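Where an unbounded write is too risky, one bounded alternative is to convert through a small fixed-size chunk and resume whenever the output runs out. The sketch below assumes hypothetical names throughout: `demo_l1snrtoc8sn` is a toy Latin-1 to UTF-8 stand-in (non-null arguments required, no `mbstate_t`), and `demo_sink` is an illustrative destination, not part of this proposal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the proposed error codes; not a real library. */
typedef enum {
	demo_mcerr_ok = 0,
	demo_mcerr_insufficient_output = -3
} demo_mcerr;

/* Toy bounded multi unit function: Latin-1 in, UTF-8 out.
   All pointer arguments must be non-null in this sketch. */
demo_mcerr demo_l1snrtoc8sn(size_t *output_size, unsigned char **output,
	size_t *input_size, const unsigned char **input)
{
	while (*input_size > 0) {
		unsigned char c = **input;
		size_t need = (c < 0x80) ? 1 : 2;
		if (*output_size < need)
			return demo_mcerr_insufficient_output;
		if (need == 1) {
			(*output)[0] = c;
		} else {
			(*output)[0] = (unsigned char)(0xC0u | (c >> 6));
			(*output)[1] = (unsigned char)(0x80u | (c & 0x3Fu));
		}
		*output += need;
		*output_size -= need;
		*input += 1;
		*input_size -= 1;
	}
	return demo_mcerr_ok;
}

/* Illustrative destination that collects converted chunks. */
typedef struct {
	unsigned char data[64];
	size_t size;
} demo_sink;

/* Converts through a 4-unit chunk, flushing and resuming on every
   insufficient-output result. The chunk must hold at least one complete
   unit of work (2 code units here) or the loop could not make progress. */
demo_mcerr demo_convert_chunked(demo_sink *sink, size_t in_n,
	const unsigned char *in)
{
	unsigned char chunk[4];
	for (;;) {
		unsigned char *out = chunk;
		size_t out_n = sizeof chunk;
		demo_mcerr err = demo_l1snrtoc8sn(&out_n, &out, &in_n, &in);
		size_t produced = sizeof chunk - out_n;
		if (produced > sizeof sink->data - sink->size)
			return demo_mcerr_insufficient_output;
		memcpy(sink->data + sink->size, chunk, produced);
		sink->size += produced;
		if (err != demo_mcerr_insufficient_output)
			return err;
	}
}
```

The loop relies on the guarantee that an insufficient-output result leaves the input pointer and size at the last completed unit of work, so the next call picks up exactly where the previous one stopped.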

Recommended Practice

The multi unit transcoding functions are explicitly for the purpose of performing conversions on the largest contiguous section of valid data in the shortest amount of time possible. Implementations should take advantage of the information of null pointer values for the output size pointer, output data pointer, or both, to drastically improve performance characteristics for assumed unlimited write space, output counting scenarios, or input validation/counting, respectively.
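As a rough illustration of how an implementation might branch on these null pointer combinations, the sketch below uses a toy Latin-1 to UTF-8 conversion. The names `demo_mcerr`, `demo_l1snrtoc8sn`, and `demo_count_c8_from_l1` are hypothetical, the `mbstate_t` parameter is omitted, and the fast path shown is only one possible shortcut.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proposed error codes; not a real library. */
typedef enum {
	demo_mcerr_ok = 0,
	demo_mcerr_insufficient_output = -3
} demo_mcerr;

/* Toy multi unit function: Latin-1 in, UTF-8 out, no mbstate_t. */
demo_mcerr demo_l1snrtoc8sn(size_t *output_size, unsigned char **output,
	size_t *input_size, const unsigned char **input)
{
	if (input == NULL || *input == NULL || input_size == NULL)
		return demo_mcerr_ok;
	const int writing = (output != NULL && *output != NULL);
	if (!writing && output_size == NULL) {
		/* Validation only: every Latin-1 byte is valid, so the whole
		   input can be consumed without inspecting it. */
		*input += *input_size;
		*input_size = 0;
		return demo_mcerr_ok;
	}
	while (*input_size > 0) {
		unsigned char c = **input;
		size_t need = (c < 0x80) ? 1 : 2;
		if (output_size != NULL) {
			/* Bounded write or counting: track the remaining space. */
			if (*output_size < need)
				return demo_mcerr_insufficient_output;
			*output_size -= need;
		}
		if (writing) {
			if (need == 1) {
				(*output)[0] = c;
			} else {
				(*output)[0] = (unsigned char)(0xC0u | (c >> 6));
				(*output)[1] = (unsigned char)(0x80u | (c & 0x3Fu));
			}
			*output += need;
		}
		*input += 1;
		*input_size -= 1;
	}
	return demo_mcerr_ok;
}

/* Counting: pass a size but no buffer, then read back how much was used. */
size_t demo_count_c8_from_l1(size_t n, const unsigned char *s)
{
	size_t remaining = SIZE_MAX;
	demo_l1snrtoc8sn(&remaining, NULL, &n, &s);
	return SIZE_MAX - remaining;
}
```

The same entry point serves validation (no buffer, no size), counting (size only), unbounded writes (buffer only), and bounded writes (both), which is the property the recommendation above asks implementations to exploit.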

Implementations should prefer returning an error for an incomplete input sequence over storing intermediate data within the state where possible for non-Unicode encodings. By leaving partial input unconsumed, it can be easier for functionality built on top of the functions in this subclause to report errors without skipping over potentially invalid input data.

### Add unspecified behavior to Annex J.1 Unspecified behavior ### {#wording-specification-j.1}

J.1 Unspecified Behavior

The following are unspecified:
:: …
:: — The conversion state after an encoding error (6.2.9) occurs (7.30.6.3.2, 7.30.6.3.3, 7.30.6.4.1, 7.30.6.4.2).
:: — The conversion state after an unrecorded encoding error (6.2.9) occurs (7.S✨).
:: — The use of an `mbstate_t` object that contains conversion state from an unrelated conversion (7.S✨).
:: …

### Add undefined behavior to Annex J.2 Undefined behavior ### {#wording-specification-j.2}

J.2 Undefined Behavior

The following are undefined:
:: …
:: — Using a buffer that is too small but providing a null pointer to the `output_size` argument of a transcoding function (7.S✨).
:: …
:: — Using a pointed to `state` object that does not describe the conversion state for the current conversion (7.S✨.1, 7.S✨.2).
:: …

# Acknowledgements # {#acknowledgements}

Thank you to Philipp K. Krause for responding to the e-mails of a newcomer to matters of C and providing me with helpful guidance. Thank you to Rajan Bhakta, Daniel Plakosh, and David Keaton for guidance on how to submit these papers and get started in WG14. Thank you to Tom Honermann for lighting the passionate fire for proper text handling in me for not just C++, but for our sibling language C.

# Appendix # {#appendix}

## (From revisions 0-3) What about UTF{X} ↔ UTF{Y} functions? ## {#appendix-proposed-utf}

Functions interconverting between different Unicode Transformation Formats are not proposed here because -- while useful -- both sides of the encoding are statically known by the developer. The C Standard only wants to consider functionality strictly in the case where the implementation has more information / private information that the developer cannot access in a well-defined and standard manner. A developer can write their own Unicode Transformation Format conversion routines and get them completely right, whereas a developer cannot write the Wide Character and Multibyte Character functions without incredible heroics and/or error-prone assumptions.

This brings up an interesting point, however: if `__STDC_UTF16__` and `__STDC_UTF32__` both exist, does that not mean the implementation controls what `c16` and `c32` mean? This is true, **however**: within an (admittedly limited) survey of implementations, there has been no suggestion or report of an implementation which does not use UTF-16 and UTF-32 for its `char16_t` and `char32_t` literals, respectively. Thankfully, no implementation appears to make a different choice at this time. It will also no longer be a possibility in C23, as the paper [[n2728|char16_t and char32_t literals should be UTF-16 and UTF-32]] has been accepted.

{
	"n3054": {
		"authors": [
			"ISO/IEC JTC1 SC22 WG14 - Programming Languages, C",
			"JeanHeyd Meneide",
			"Freek Wiedijk"
		],
		"title": "n3054: ISO/IEC 9899:202x - Programming Languages, C",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3054.pdf",
		"date": "September 3rd, 2022"
	},
	"glibc-25744": {
		"authors": [
			"Tom Honermann",
			"Carlos O'Donnell"
		],
		"title": "`mbrtowc` with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters",
		"href": "https://sourceware.org/bugzilla/show_bug.cgi?id=25744",
		"date": ""
	},
	"N2282": {
		"authors": [
			"Philipp K. Krause"
		],
		"title": "Additional multibyte/wide string conversion functions",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm",
		"date": "June 2018"
	},
	"iconv": {
		"authors": [
			"Bruno Haible",
			"Daiki Ueno"
		],
		"title": "libiconv",
		"href": "https://savannah.gnu.org/git/?group=libiconv",
		"date": "August 2020"
	},
	"N2244": {
		"authors": [
			"WG14"
		],
		"title": "Clarification Request Summary for C11, Version 1.13",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2244.htm",
		"date": "October 2017"
	},
	"cuneicode": {
		"authors": [
			"JeanHeyd Meneide",
			"Shepherd's Oasis, LLC"
		],
		"title": "cuneicode - A spicy text library for C",
		"href": "https://ztdcuneicode.rtfd.io",
		"date": "November 20th, 2021"
	},
	"N1570": {
		"authors": [
			"ISO/IEC JTC1 SC22 WG14 - Programming Languages, C"
		],
		"title": "C11 Committee Draft",
		"href": "https://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf",
		"date": "April 12, 2011"
	},
	"Unicode_greater_detail": {
		"authors": [
			"JeanHeyd Meneide"
		],
		"title": "Catching ⬆️: Unicode for C++ in Greater Detail",
		"href": "https://www.youtube.com/watch?v=FQHofyOgQtM",
		"date": "November 2019"
	},
	"Unicode_deep_c_diving": {
		"authors": [
			"JeanHeyd Meneide"
		],
		"title": "Deep C Diving - Fast and Scalable Text Interfaces at the Bottom",
		"href": "https://youtu.be/X-FLGsa8LVc",
		"date": "July 2020"
	},
	"n2728": {
		"authors": [
			"JeanHeyd Meneide"
		],
		"title": "char16_t and char32_t shall be UTF-16 and UTF-32",
		"href": "https://thephd.dev/_vendor/future_cxx/papers/C%20-%20char16_t%20&%20char32_t%20string%20literals%20shall%20be%20UTF-16%20&%20UTF-32.html",
		"date": "May 15th, 2021"
	},
	"clang-iso10646": {
		"authors": [
			"Corentin Jabot"
		],
		"title": "Define __STDC_ISO_10646__",
		"href": "https://reviews.llvm.org/D106577",
		"date": "June 22nd, 2021"
	},
	"lemire-spire2021": {
		"authors": [
			"Daniel Lemire"
		],
		"title": "Unicode at Gigabytes per Second",
		"href": "https://doi.org/10.48550/arXiv.2111.08692",
		"date": "November 14th, 2021"
	},
	"n2892": {
		"authors": [
			"Jens Gustedt"
		],
		"title": "N2892: Basic lambdas for C",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2892.pdf"
	}
}

_{_{_{May the Tower of Babel's curse be defeated.}}}