Title: Restartable Functions for Efficient Character Conversions
Shortname: 3265
Revision: 12
!Previous Revisions: N3095 (r11), N3075 (r10), N3031 (r9), N2999 (r8), N2966 (r7), N2902 (r6), N2730 (r5), N2620 (r4), n2595 (r3), n2500 (r2), n2440 (r1), n2431 (r0)
Status: P
Date: 2024-05-20
Group: WG14
!Proposal Category: Library Feature Request
!Target: C2y/C3a
Editor: JeanHeyd Meneide, phdofthehouse@gmail.com
Editor: Shepherd (Shepherd's Oasis LLC), shepherd@soasis.org
URL: https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Efficient%20Character%20Conversions.html
!Latest: https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Efficient%20Character%20Conversions.html
!Paper Source: GitHub ThePhD/future_cxx
Issue Tracking: GitHub https://github.com/ThePhD/future_cxx/issues
Metadata Order: Previous Revisions, Editor, Latest, Paper Source, Implementation, Issue Tracking, Project, Audience, Proposal Category, Target
Markup Shorthands: markdown yes
Toggle Diffs: no
Abstract: Implementations firmly control how both Wide Character and Multi-Byte Character strings are treated at runtime by the Standard Library. While this control is fine, users of the Standard Library have no portability guarantees about how these library functions may behave, especially in the face of encodings that do not support each other's full codepage. And, despite additions to C11 for maybe-UTF16 and maybe-UTF32 encoded types, these functions only offer conversions of a single unit of information at a time, leaving orders of magnitude of performance on the table. This paper proposes and explores additional library functionality to allow users to convert multibyte and wide character strings into a statically known encoding to enhance the ability to work with text.
path: resources/css/bikeshed-wording.html
# Changelog # {#changelog}

## Revision 12 - May 20th, 2024 ## {#changelog-r12}

- Rewrite specification to be far more understandable.
- Move most footnotes to proper NOTEs. (This is in line with ISO editorial guidance to reduce footnote usage as much as possible.)
- Remove uses of "may" in NOTEs/EXAMPLEs.
- Define special "write output"/"read input" words and use them to simplify specification for transcoding functions.
- Point out that C++ has made a similar change for its `wchar_t` in [[p2460r2]] and that if we would like to remain in sync, C should consider following suit, as described in [[#design-wchar_t]].
- Improve the set of macros available from `` and predefined in the compiler based on feedback and design discussion in [[#design-wchar_t]].
- Clarify that setting `*state` to its initial sequence when given a null pointer as `input` (or `*input`) is a feature of the legacy API that is being carried over in [[#design-state]].
- Add additional paragraph at the end of [[#design-forms-state]] explaining why it's still good to have the `mbstate_t` parameter for Unicode encodings.

## Revision 11 - January 31st, 2023 ## {#changelog-r11}

- Fix minor typos in example specification.
- Point out that C++ has made a similar change for its `wchar_t` in [[p2460r2]] and that if we would like to remain in sync, C should consider following suit, as described in [[#design-wchar_t]].

## Revision 10 - January 5th, 2023 ## {#changelog-r10}

- Retargeting for C2y/C3a, after failure during the July 2022 Virtual Meeting to advance the state of Unicode.
- Apply all of the changes recommended by Hubert Tong, concerning `\0` write out (turn it into being a part of an indivisible unit of work), the "clear the `mbstate_t`" footnote being applied to both places, and the new way to write the paragraph concerning what a valid `state` is.
- WG14 could not decide between prefix vs. no-prefix, or prefix-but-only-for-macros. All votes came out even.
  In the interest of not trampling the user namespace and since this is an entirely new header, this paper restores the prefixes.
- Fix double-italicized *unrecorded encoding error* and leave only one (thanks, Aaron Ballman!).
- During the July 2022 Virtual Meeting, it was made clear that a DeathStation 9000 implementation could use the wording in the standard to ignore the multi-output capabilities of the new functions and still artificially restrict itself to UCS-2 and similar for "conformance" purposes. Therefore, the following specific wording changes were made to address such concerns:
    - for the initial "wide character" definition ([[#wording-specification-3.7.3]]);
    - for definition of pre-defined macros by the compiler ([[#wording-specification-6.10.9.2]]);
    - for definition of `wchar_t` in `` ([[#wording-specification-7.21]]);
    - and, for usage of `wchar_t` in ``'s specification ([[#wording-specification-7.31]]).

## Revision 9 - July 19th, 2022 ## {#changelog-r9}

- Removed the `stdc_` prefix in all of the prose.
- Wording changes:
    - Removed the `stdc_` prefix for the transcoding functions and `stdc_mcerr_*`. (This particular change is not highlighted because it is a find-and-replace and would turn much of the paper a different color over what is a mechanical/editorial name change.)
    - Remove freestanding changes.
    - Add a new sentence for properly writing out the null character AND any preceding conversion state.
    - Add a new paragraph to directly explain what states are valid to place into the functions (both the single and multi-conversion functions).
    - Change the wording for when `state` is a null pointer, saying instead: "automatic storage duration object of type `mbstate_t` is created unique to the current invocation".
    - Remove the text describing "first use" of `mbstate_t` as it is now covered by the new paragraph.
    - Add 1 new footnote to ensure users understand that they may lose conversion state data for restarting a conversion if the `mbstate_t*` state parameter is not provided or is a null pointer.
    - Add 1 new footnote to make it clear that the state can be reset by using `mbstate_t s = {0};` with `s = (mbstate_t){0};`, as described in §7.30.6 (Extended multibyte/wide character conversion utilities).
- Greatest of thanks to Dr. H. for the continual review and catching most of my most glaring mistakes.

## Revision 8 - June 17th, 2022 ## {#changelog-r8}

- Editorial changes:
    - Use "is a null pointer" versus "is `NULL`" and similar.
    - Fix a grammar mistake for a sentence discussing code points.
    - Use "initial shift state" over "initial conversion sequence".
    - Talk about how `mbstate_t` is not necessary for the u8/u16/u32 functions, but preserves function form for macro-generic programming.
- Wording changes:
    - This header shall exist for freestanding.
    - Define code point, code unit, encoding error, unrecorded encoding error, and more in 6.2.9 for use elsewhere.
    - Properly change how the functions handle the `*state` and `state` object in the wording.
    - Fix the problem with the specification of the output values when processing a null character value or when using `input`/`*input` that are null pointers.
    - Adjust the specification to give proper unspecified behavior.
    - Add Annex J.1 entry.

## Revision 7 - April 12th, 2022 ## {#changelog-r7}

- Update definitions for when the `input` pointer - and its pointed to pointer - are `NULL`, and make sure they have identical behaviors.
- Update definitions for when the `output` pointer - and its pointed to pointer - are `NULL`, and make sure they have identical behaviors.
- Ensure that `mbstate_t` is properly cleared to the initial conversion state on both `NULL` input pointers or U+000000 being processed in the input string.
  Ensure that it outputs any code unit sequences to the output necessary to return the output to its initial conversion state as well (thanks, Hubert Tong!).
- Handle `LC_CTYPE` being changed in the wording (it's unspecified behavior) (thanks, Hubert Tong!).
- Switch to using `char8_t` officially now that it has been accepted to C23 (thanks, Joseph Myers!).
- `restrict` added to the function prototypes ([[#design-forms-choice]]).

## Revision 6 - January 1st, 2022 ## {#changelog-r6}

- Add design critique for the latest interface suggestion in [[#design-forms]].
- Remove all non-`mbstate_t` functions, to reduce the function count, and change behavior of the function.
- Make sure `stdc_mcerr` is meant to be a proper error enumeration and a typedef.
- Properly define *indivisible unit of work*.

## Revision 5 - November 30th, 2021 ## {#changelog-r5}

- Design critique and benchmark 3 different styles of function declaration and discuss benefits.
- A full, independent implementation of [[cuneicode|this paper (and more) is now available]].

## Revision 4 - December 1st, 2020 ## {#changelog-r4}

- Add missing functions for c8/16/32 to the platform-specific variants.
- Ensure that `mbstate_t` is used throughout rather than `mcstate_t`.
- Explain behavior of `NULL` for `mbstate_t` to avoid use of global values.

## Revision 3 - October 27th, 2020 ## {#changelog-r3}

- Completely reformulate paper based on community, musl-libc, and glibc feedback.
- Completely rewrite every section past [[#wording]], and change many more.

## Revision 0-2 - March 2nd, 2020 ## {#changelog-r0}

- Introduce new functions and gather consensus to move forward.
- Attempt to implement in other standard libraries and gather feedback.

# Introduction and Motivation # {#intro}

C adopted conversion routines for the current active locale-derived/`LC_CTYPE`-controlled/implementation-defined encoding for Multibyte (`mb`) Strings and Wide (`wc`) Strings.
While the rationale for having such conversion routines to and from Multibyte and Wide strings in the C library is not explicitly stated in the documents, it is easy to derive the many benefits of a full ecosystem of both restarting (`r`) and non-restarting conversion routines for both single units and string-based bulk conversions for `mb` and `wc` strings. From ease of use with string literals to performance optimizations from bulk processing with vectorization and SIMD operations, the `mbs(r)towcs` functions -- and vice-versa -- granted a rich and fertile ground upon which C library developers took advantage of platform amenities, encoding specifics, and hardware support to provide useful and fast abstractions upon which encoding-aware applications could build.

Unfortunately, none of these API designs were granted to `char16_t` (`c16`) or `char32_t` (`c32`) conversion functions. Nor were they given a way to work with a well-defined 8-bit multibyte encoding such as UTF8 without having to first pin it down with platform-specific `setlocale(...)` calls. This has resulted in a series of extremely vexing problems when trying to write portable, reliable C library code that is not locked to a specific vendor.

This paper looks at the problems, and then proposes a solution with the goal of arriving at something that is worth implementing for the C Standard Library.

## Problem 1: Lack of Portability ## {#intro-problem-portability}

Already, Windows, z/OS, and POSIX platforms greatly differ in what they offer for `char`-typed, Multibyte string encodings. EBCDIC is still in play after many decades. Windows's Active Code Page functionality on its machine prevents portability even within its own ecosystem. Platforms where LANG environment variables control functionality make communication between even processes on the same hardware a silent and often unforeseen gamble for library developers.
Using functions which convert to/from `mbs` makes it impossible to have stability guarantees not only between platforms, but for individual machines. Sometimes even cross-process communication becomes exceedingly problematic without opting into a serious amount of platform-specific or vendor-specific code and functionality to lock encodings in, harming the portability of C code greatly.

`wchar_t` does not fare better. By definition, a wide character type must be capable of holding the entire character set in a single unit of `wchar_t`. Reality, however, is different: this has been a fundamental impossibility for decades for implementers that switched to 16-bit UCS-2 early. IBM machines persist with this issue for all 32-bit builds, though some IBM platforms took advantage of the 64-bit change to do an ABI break and use UTF32 like other Linux distributions settled on. Even if one knew this about IBM and programmed exclusively on their machines, certain IBM platforms can still end up in a situation where `wchar_t` is neither 32-bit UTF32 nor 16-bit UCS-2/UTF16: the encoding can change to something else in certain Chinese locales, becoming completely different. Windows is permanently stuck on having to explicitly detail that its implementation is "16-bit, UCS-2 as per the standard", before explicitly informing developers to use vendor-specific `WideCharToMultiByte`/`MultiByteToWideChar` to handle UTF16-encoded characters in `wchar_t`.

These solutions provide ways to achieve a local maximum for a specific vendor or platform. Unfortunately, this comes at the extreme cost of portability: the code has no guarantee it will work anywhere but your machine, and in a world that is increasingly interconnected by devices that interface with networks it makes sharing both data and code troublesome and hard to work with.

## Problem 2: What is the Encoding? ## {#intro-problem-what}

With `setlocale` and `getlocale` only responding to and returning implementation-defined `(const )char*`, there is no way to portably determine what the locale (and any associated encoding) should or should not be. The typical solution for this has been to code and program only for what is guaranteed by the Standard as what is in the Basic Character Set. While this works fine for source code itself, this produces an extremely hostile environment:

- conversion functions in the standard mangle and truncate data in (sometimes troubling, sometimes hilarious) fashion;
- programs which are not careful to meticulously track encoding of incoming text often lose the ability to understand that text;
- programmers can never trust the platform will support even the Latin characters in any representation of data beyond the 7th bit of a byte;
- and, interchange between cultures with different default encodings makes it impossible to communicate with others without entirely forsaking the standard library.

Abandoning the C **Standard** Library -- to get **standard** behavior across platforms -- is an exceedingly bitter pill to have to swallow as an enthusiastic C developer.

## Problem 3: Performance ## {#intro-problem-performance}

The current version of the C Standard includes functions which attempt to alleviate Problems 1 and 2 by providing conversions from the per-process (and sometimes per-thread), locale-sensitive black box encoding of multibyte `char*` strings. They do this by providing conversions to `char16_t` units or `char32_t` units with `mbrtoc(16|32)` and `c(16|32)rtomb` functions. We will for a brief moment ignore the presence of the `__STDC_UTF_16__` and `__STDC_UTF_32__` macros and assume the two types mean that string literals and library functions convert to and from UTF16 and UTF32 respectively.
We will also ignore that `wchar_t`'s encoding -- which is just as locale-sensitive and unknown at compile and runtime as `char`'s encoding is -- has no such conversion functions. These givens make it possible to say that we, as C programmers, have 2 known encodings which we can use to shepherd data into a stable state for manipulation and processing as library developers.

Even with that knowledge, these one-unit-at-a-time conversion functions are slower than they should be. On many platforms, these one-at-a-time function calls come from the operating system, dynamically loaded libraries, or other places which otherwise inhibit compiler observation and optimizer inspection. Attempts to vectorize code or unroll loops built around these functions are thoroughly thwarted by this. Building static libraries or from source is very often a non-starter for many platforms. Since the encodings used for multibyte strings and wide strings are controlled by the implementation, it becomes increasingly difficult to provide the functionality to convert long segments of data with decent performance characteristics without needing to opt into vendor or platform specific tricks.

## Problem 4: `wchar_t` Cannot Roundtrip ## {#intro-problem-roundtrip}

With no `wctoc32` or `wctoc16` functions, the only way to convert a wide character or wide character string to a program-controlled, statically known UTF encoding is to first invoke the wide character to multibyte function, and then invoke the multibyte function to either `char16_t` or `char32_t`. This means that even if we have a well-behaved `wchar_t` that is not sensitive to the locale (e.g., on Windows machines), we lose data if the locale-controlled `char` encoding is not set to something that can handle all incoming code unit sequences.
The locale-based encoding in a program can thus tank what is simply meant to be a pass-through encoding from `wchar_t` to `char16_t`/`char32_t`, all because the only Standards-compliant conversion channels data through the locale-based multibyte encoding `mb(s)(r)toX(s)` functions. For example, it was fundamentally impossible to engage in a successful conversion from `wchar_t` strings to `char` multibyte strings on Windows using the C Standard Library. Until a very recent Windows 10 update, UTF8 could **not** be set as the active system codepage either programmatically or through an experimental, deeply-buried setting. This has changed with Windows Version 1903 (May 2019 Update), but the problems do not stop there. No dedicated UTF-8 support (the standard mandates no specific encodings or charsets) leaves developers to write the routines themselves. Worse, roundtrip through the locale after forcing a change to a UTF-8 locale may not be supported, leaving the developer to use the combination of functions to hope that the multibyte locale encoding is good enough to transfer data from the Unicode encodings to the wide character encodings (and vice-versa). While the non-restartable functions can save quite a bit of code size, unfortunately there are many encodings which are not as nice and require state to be processed correctly (e.g., Shift JIS and other ISO-2022 encodings). Not being able to retain that state between potential calls in a `mbstate_t` is detrimental to the ability to move forward with any encoding endeavor that wishes to bridge the gap between these disparate platform encodings and the current locale. 
Because other library functions can be used to change or alter the locale in some manner, it once again becomes impossible to have a portable, compliant program with deterministic behavior if just one library changes the locale of the program, let alone if the encoding or locale is unexpected by the developer because they do not know of that culture or its locale setting. This hidden state is nearly impossible to account for: the result is software systems that cannot properly handle text in a meaningful way without abandoning C's encoding facilities, relying on vendor-specific extensions/encodings/tools, or confining one's program to only the 7-bit plane of existence.

## Problem 5: The C Standard Cannot Handle Existing Practice ## {#intro-problem-standard}

The C standard does not allow a wide variety of encodings that implementations have already crammed into their backing locale blocks to work, resulting in the abandonment of locale-related text facilities by those with double-byte character sets, primarily from East Asia. For example, there is a serious bug that cannot be fixed without [[glibc-25744|non-conforming, broken behavior]]:

> ...
>
> This call writes the second Unicode code point, but does not consume
> any input. 0 is returned since no input is consumed. According to
> the C standard, a return of 0 is reserved for when a null character is
> written, but since the C standard doesn't acknowledge the existence of
> characters that can't be represented in a single `wchar_t`, we're already
> operating outside the scope of the standard.

The standard cannot handle encodings that must return two or more `wchar_t` for however many -- up to `MB_LEN_MAX` -- `char`s it consumes. This is even for when the target `wchar_t` "wide execution" encoding is UTF-32; this is a **fundamental limitation of the C Standard Library that is absolutely insurmountable by the current specification**.
This is exacerbated by the standard's insistence that a single `wchar_t` must be capable of representing all characters as a single element, a philosophy which has bled into the relevant interfaces such as `mbrtowc` and other `*wc*` related types. As the values cannot be properly represented in the standard, this leaves people to either make stuff up or abandon it altogether. This [[N1570|means that the design introduced from C11]] and beyond is fundamentally broken when it comes to handling existing practice.

Furthermore, clarification requests have had to be filed for other functions, [[N2244|just to improve their behavior with respect to multiple input and multiple output]]. Many have been noted as issues for `mbrtoc16` and similar functionality, as was originally part of [[N2282|Dr. Philip K. Krause's fixes to the functions]]. This paper attempts to solve the same problem in a more fundamental manner.

## In Summary ## {#intro-summary}

The problems C developers face today with respect to encoding and dealing with vendor and platform-specific black boxes are a staggering trifecta: non-portability between processes running on the same physical hardware, performance degradation from using standard facilities, and potentially having a locale changed out from under your program to prevent roundtripping.

This serves as the core motivation for this proposal.

# Prior Art # {#prior}

There are many sources of prior art for the desired feature set. Some functions (with fixes) were implemented directly in implementations, embedded and otherwise. Others rely exclusively on platform-specific code in both Windows and POSIX implementations. Others have cross-platform libraries that work across a myriad of platforms, such as ICU or iconv.

We discuss the most diverse and exemplary implementations.

## Standard C ## {#prior-standard}

To understand what this paper proposes, an explanation of the current landscape is necessary.
The following table is meant to be read as being `{row}to{column}`. The symbols provide the following information:

- ✔️: Function exists in both its restartable (function name has the indicative `r` in it) and its canonical non-restartable form (`{row}to{column}` and `{row}rto{column}`).
- 🇷: Function exists only in its "restartable" form (`{row}rto{column}`).
- ❌: Function does not exist at all.

Here is what exists in the C Standard Library so far:
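
| | `mb` | `wc` | `mbs` | `wcs` | `c8` | `c16` | `c32` | `c8s` | `c16s` | `c32s` |
|---|---|---|---|---|---|---|---|---|---|---|
| `mb` | | ✔️ | | | | 🇷 | 🇷 | | | |
| `wc` | ✔️ | | | | | | | | | |
| `mbs` | | | | ✔️ | | | | | | |
| `wcs` | | | ✔️ | | | | | | | |
| `c8` | | | | | | | | | | |
| `c16` | 🇷 | | | | | | | | | |
| `c32` | 🇷 | | | | | | | | | |
| `c8s` | | | | | | | | | | |
| `c16s` | | | | | | | | | | |
| `c32s` | | | | | | | | | | |
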
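To make the single-unit entries above concrete, here is a minimal sketch of converting a multibyte string to `char32_t` using only the existing restartable `mbrtoc32` function. The helper name `mbs_to_c32_slow` is hypothetical and not part of any standard; the sketch also assumes that a converted null character occupies exactly one byte of input, which holds for common encodings but is not guaranteed for every locale.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): converts a multibyte
   string in the current locale encoding to char32_t with one call to
   mbrtoc32 per code point. Returns the number of char32_t written, or
   (size_t)-1 on an encoding error or truncated input. Conversion stops
   early, without error, when dst is full. */
size_t mbs_to_c32_slow(const char *src, size_t src_len,
                       char32_t *dst, size_t dst_cap) {
    mbstate_t state;
    memset(&state, 0, sizeof state); /* initial conversion state */
    size_t written = 0;
    while (src_len > 0 && written < dst_cap) {
        char32_t c;
        size_t r = mbrtoc32(&c, src, src_len, &state);
        if (r == (size_t)-1 || r == (size_t)-2) {
            return (size_t)-1; /* invalid or incomplete sequence */
        }
        dst[written++] = c;
        if (r == (size_t)-3) {
            continue; /* extra output from earlier input; nothing consumed */
        }
        if (r == 0) {
            r = 1; /* null character converted; assumed to be one byte */
        }
        src += r;
        src_len -= r;
    }
    return written;
}
```

Every code point costs one library call, which is the per-unit overhead that string-based bulk functions would avoid.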
There is a lot of missing functionality here in this table, and it is important to note that a large amount of this comes from both not being willing to standardize more than the bare minimum and not having a cohesive vision for improving encoding conversions in the C Standard. Notably, string-based `{prefix}s` functions are missing, leaving performance-oriented multi-unit conversions out of the standard.

There are also severe API flaws in the C standard, [as discussed above](#intro-problem-standard).

## Win32 ## {#prior-win32}

`WideCharToMultiByte` and `MultiByteToWideChar` are the APIs of choice for those in Win32 environments to get to and from the run-time execution encoding and -- if it matches -- the translation-time execution encoding. Unfortunately, these APIs are locked within the Windows ecosystem entirely as they are not available as a standalone library. Furthermore, as an operating system Windows exclusively controls what it can and cannot convert from and to; some of these functions power the underlying portions of the character conversion functions in their Standard Library, but they notably truncate multi-code-unit characters for their UTF-16 `wchar_t`. This produces a broken, deprecated UCS-2 encoding when e.g. `mbrtowc` is used instead of directly relying on the operating system functionality, making the C standard's functions of dubious use.

## `nl_langinfo` ## {#prior-nl_langinfo}

`nl_langinfo` is a POSIX function that returns various pieces of information based on an enumerated input and some extra parameters. It has been suggested that this be standardized over anything else, to make it easier to determine what to do with a given locale. The first problem with this is it returns a string-based identifier that can be whatever an implementation decides it should be. This makes `nl_langinfo` no better than `setlocale(LC_CTYPE, NULL)` in its design:

> Specifies the name of the coded character set for which the **charmap** file is defined. This value determines the value returned by the `nl_langinfo` subroutine. The `` must be specified using any character from the portable character set, except for control and space characters.

Any name can be chosen that fits this description, and POSIX nails nothing down for portability or identification reasons. There is no canonical list, just whatever implementations happen to supply as their "charmap" definitions.

## SDCC ## {#prior-sdcc}

The Small Device C Compiler (SDCC) has already begun some of this work. One of its principal contributors, Dr. Philip K. Krause, [[N2282|wrote papers addressing exactly this problem]]. Krause's work focuses entirely on non-restartable conversions from Multibyte Strings to `char16_t` and `char32_t`. There is no need for a conversion to a UTF8 `char` style string for SDCC, since the Multibyte String in SDCC is always UTF8. This means that `mbstoc16s` and `mbstoc32s` and the "reverse direction" functions encompass an entire ecosystem of UTF8, UTF16, and UTF32. While this is good for SDCC, this is not quite enough for other developers who attempt to write code in a cross-platform manner.

Nevertheless, SDCC's work is still important: it demonstrates that these functions are implementable, even for small devices. With additional work being done to implement them for other platforms, there is strong evidence that this can be implemented in a cross-platform manner and thus is suitable for the Standard Library.

## iconv/ICU ## {#prior-iconv}

The C functions presented below are motivated primarily by concepts found in a popular POSIX library, [[iconv]]. We do not provide the full power of iconv here but we do mimic its interface to allow for a better definition of functions, as explained in [Problem 5](#intro-problem-standard). The core of the functionality can be embodied in this parameterized function signature:

```cpp
stdc_mcerr XstoYs(const charX*restrict *restrict input, size_t*restrict input_bytes,
	charY*restrict *restrict output, size_t*restrict output_bytes);
```

In `iconv`'s case, there is an additional first parameter (of type `iconv_t`) describing the conversion. That is not needed for this proposal, because we are not making a generic conversion API. This proposal is focused on doing 2 things and doing them extremely well:

- Getting data from the current execution encoding (`char`) to a Unicode encoding (`char8_t`/UTF-8, `char16_t`/UTF-16, `char32_t`/UTF-32), and the reverse.
- Getting data from the current wide execution encoding (`wchar_t`) to a Unicode encoding (`char8_t`/UTF-8, `char16_t`/UTF-16, `char32_t`/UTF-32), and the reverse.

iconv can do the above conversions, but also supports a complete list of pairwise conversions between about 49 different encodings. It can also be extended at translation time by programming more functionality into its library. This proposal focuses only on conversions between the encodings that the implementation owns and Unicode. This results in the design found in [[#wording|the wording]].
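
For comparison, the following minimal sketch shows the existing POSIX `iconv` interface that the signature above mimics: pointers to the input and output cursors plus pointers to the remaining byte counts, all updated in place. The helper name `utf8_to_utf32le` is hypothetical, and the sketch assumes the platform's iconv recognizes the encoding names `"UTF-8"` and `"UTF-32LE"`, which is common but implementation-defined.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts UTF-8 bytes to UTF-32 (little-endian)
   using POSIX iconv. Returns the number of bytes written to `out`, or
   (size_t)-1 if the converter is unavailable or the conversion fails. */
size_t utf8_to_utf32le(const char *in, size_t in_bytes,
                       char *out, size_t out_bytes) {
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8"); /* (to, from) order */
    if (cd == (iconv_t)-1) {
        return (size_t)-1;
    }
    char *in_ptr = (char *)in; /* iconv takes a non-const char** input */
    char *out_ptr = out;
    size_t in_left = in_bytes;
    size_t out_left = out_bytes;
    /* iconv advances both pointers and decrements both counts. */
    size_t r = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (r == (size_t)-1) {
        return (size_t)-1;
    }
    return out_bytes - out_left;
}
```

The conversion descriptor `cd` is the extra first parameter that this proposal drops, since the source and target encodings are fixed by each function's name.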

# Design # {#design}

Given the problems before, the prior art, the implementation experience, and the vendor experience, it is clear that we need something outside of `nl_langinfo`, lighter weight than all of `iconv`, and more resilient and encompassing than what the C Standard offers. Therefore, the solution to our problem of having a wide variety of implementation encodings is to expand the contract of `wchar_t` for an **entirely new set of functions** which avoid the problems and pitfalls of the old mechanism.

Notably, both the multibyte string function design and the wide character string's definition of a single character are broken with respect to existing practice today. The primary problem lies in the inability of both APIs, in either direction, to handle `N:M` encodings, rather than `N:1` or `1:M`. Therefore, these new functions focus on providing an interface to allow multi-code-unit conversions, in both directions.

To facilitate this, a new header -- `` -- is introduced. The header contains the "multi character" (`mc`) and "multi wide character" (`mwc`) conversion routines, respectively. To support getting lossless data out of `wchar_t` and `char` strings controlled firmly by the implementation -- and back into those types if the code units in the characters are supported -- the following functionality is proposed using the new multi (wide) character (`m[w]c`) prefixes and suffixes:
`mc` `mwc` `mcs` `mwcs` `c8` `c16` `c32` `c8s` `c16s` `c32s`
`mc` 🅿️✔️ ✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`mwc` ✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`mcs` 🅿️✔️ ✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`mwcs` ✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`c8` 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`c16` 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`c32` 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`c8s` 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`c16s` 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
`c32s` 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️ 🅿️✔️
In particular, it is imperative to recognize that the implementation is the "sole proprietor" of the wide locale encodings and multibyte locale encodings for its string literals (compiler) and library functions (standard library). Therefore, the `mc` and `mwc` functions simply focus on providing a good interface for these encodings. The form of both the individual and string conversion functions are: ```cpp stdc_mcerr stdc_XnrtoYn(const size_t*restrict output_size, charY*restrict *restrict output, size_t*restrict input_size, const charX*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_XsnrtoYsn(const size_t*restrict output_size, charY*restrict *restrict output, size_t*restrict input_size, const charX*restrict *restrict input, mbstate_t*restrict state); ``` The input and output sizes are expressed in terms of the # of `charX`/`charY`s. They take the input/output sizes as pointers, and decrement the value by the amount of input/output consumed. Similarly, the input/output data pointers themselves are incremented by the amount of spaces consumed / written to. This only happens when an irreversible and successful conversion of input data can successfully and without error be written to the output. The `s` functions work on whole strings rather than just a single complete irreversible conversion, the `n` stands for taking a size value. Input is consumed and output is written (with sizes updated) in accordance with a single, successful computation of an *indivisible unit of work*. An indivisible unit of work is the smallest set of input that can be consumed that produces no error and guarantees forward progress through either the input or output buffer (most of the time, both). No output is guaranteed to occur (e.g., during the consumption of a shift state mechanism for e.g. SHIFT-JIS), but if output does happen then it only occurs upon the successful completion of an indivisible unit of work. 
If an error happens, the conversion is stopped and an error code is returned. The function does not decrement the input or output sizes for the failed operation, nor does it shift the input and output pointers forward for the failed operation. "Failed operation" refers to a single, indivisible unit of work. The error codes are as follows: - `stdc_mcerr_insufficient_output = -3` - the input is correct but there is not enough output space - `stdc_mcerr_incomplete_input = -2` - an incomplete input was found after exhausting the input - `stdc_mcerr_invalid = -1` - an encoding error occurred - `stdc_mcerr_ok = 0` - the operation was successful The behaviors are as follows: - if `state` is a null pointer, then: - an automatic storage duration (non-`static`) `mbstate_t` object is initialized to the initial conversion state; - and, a pointer to this state object plus the original four parameters are passed to the function. - if `output` is a null pointer or `*output` is a null pointer, then no output will be written. If `*output_size` is not a null pointer, the value will be decremented the amount of characters that would have been written. - if `output` and `*output` are not null pointers, and `output_size` is a null pointer, then enough space is assumed in the output buffer for the entire operation. - if `input` is a null pointer or `*input` is a null pointer, then `state` is set to the initial conversion state and, if `output` is not a null pointer, will write out a sequence - if any - to represent a change to the initial conversion state. No other actions are performed. In all other cases, `input` must not be a null pointer. Finally, it is useful to prevent the class of `stdc_mcerr_insufficient_output`/`-3` errors from showing up in your code if you know you have enough space. For the non-string (the functions lacking `s`) that perform a single conversion, a user can pre-allocate a suitably sized static buffer in automatic storage duration space. 
This will be facilitated by a group of integral constant expressions contained in macros, which are:

- `STDC_MC_MAX`, which is the maximum output for a call to one of the X to multi character (execution encoding) functions
- `STDC_MWC_MAX`, which is the maximum output for a call to one of the X to multi wide character (wide execution encoding) functions
- `STDC_C8_MAX`, which is the maximum output for a call to one of the X to UTF-8 character functions
- `STDC_C16_MAX`, which is the maximum output for a call to one of the X to UTF-16 character functions
- `STDC_C32_MAX`, which is the maximum output for a call to one of the X to UTF-32 character functions

These values are suitable for use as the size of an array, allowing a properly sized buffer to hold all of the output from the non-string functions. These limits apply **only** to the non-string functions, which perform a single unit of irreversible input consumption and output (or fail with one of the error codes and output nothing).
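To make the single-conversion behavior concrete, the following is a minimal sketch of a UTF-32 to UTF-8 function written in the shape described above. It is not the proposed implementation: every `sketch_`-prefixed name and `SKETCH_C8_MAX` are hypothetical stand-ins (for `stdc_mcerr`, `char8_t`, `char32_t`, and `STDC_C8_MAX`), and it assumes a valid, non-null `input`. It shows that sizes and pointers are updated only after the whole indivisible unit of work succeeds, and that a buffer of `SKETCH_C8_MAX` elements always has room for one conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical stand-ins; none of these names come from the proposal. */
typedef unsigned char sketch_c8;   /* stand-in for char8_t  */
typedef uint_least32_t sketch_c32; /* stand-in for char32_t */
#define SKETCH_C8_MAX 4            /* plays the role of STDC_C8_MAX */

typedef enum sketch_mcerr {
	sketch_mcerr_ok = 0,
	sketch_mcerr_invalid = -1,
	sketch_mcerr_incomplete_input = -2,
	sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Converts exactly one UTF-32 code point to UTF-8 (one indivisible unit of work).
   Nothing is updated unless the whole unit succeeds. */
sketch_mcerr sketch_c32nrtoc8n(size_t *restrict output_size,
	sketch_c8 *restrict *restrict output, size_t *restrict input_size,
	const sketch_c32 *restrict *restrict input, mbstate_t *restrict state) {
	(void)state; /* UTF-32 to UTF-8 needs no shift state */
	assert(input_size != NULL && input != NULL && *input != NULL);
	if (*input_size == 0)
		return sketch_mcerr_ok; /* nothing to convert */
	sketch_c32 cp = **input;
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return sketch_mcerr_invalid; /* out of range or a surrogate */
	size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
	if (output_size != NULL && *output_size < need)
		return sketch_mcerr_insufficient_output;
	if (output != NULL && *output != NULL) {
		sketch_c8 *out = *output;
		if (need == 1) {
			out[0] = (sketch_c8)cp;
		} else {
			/* leading byte marks the length, continuation bytes carry 6 bits each */
			static const sketch_c8 lead[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
			out[0] = (sketch_c8)(lead[need] | (cp >> (6 * (need - 1))));
			for (size_t i = 1; i < need; ++i)
				out[i] = (sketch_c8)(0x80 | ((cp >> (6 * (need - 1 - i))) & 0x3F));
		}
		*output += need;
	}
	if (output_size != NULL)
		*output_size -= need;
	*input += 1;
	*input_size -= 1;
	return sketch_mcerr_ok;
}

/* Usage: returns the number of UTF-8 code units written, or 0 on error. */
size_t sketch_encode_one(sketch_c32 cp, sketch_c8 buffer[SKETCH_C8_MAX]) {
	const sketch_c32 *in = &cp;
	size_t in_size = 1, out_size = SKETCH_C8_MAX;
	sketch_c8 *out = buffer;
	sketch_mcerr err = sketch_c32nrtoc8n(&out_size, &out, &in_size, &in, NULL);
	return err == sketch_mcerr_ok ? (size_t)(out - buffer) : 0;
}
```

Because the buffer is sized with the maximum-output constant, `sketch_encode_one` can never observe the insufficient-output error; the only failure it can report is invalid input.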
Here is the full list of proposed functions: ```cpp #include #define STDC_C8_MAX 4 #define STDC_C16_MAX 2 #define STDC_C32_MAX 1 #define STDC_MC_MAX 1 #define STDC_MWC_MAX 1 typedef enum stdc_mcerr { stdc_mcerr_ok = 0, stdc_mcerr_invalid = -1, stdc_mcerr_incomplete_input = -2, stdc_mcerr_insufficient_output = -3 } stdc_mcerr; stdc_mcerr stdc_mcnrtomcn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcnrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcnrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcnrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcnrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcnrtomcn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcnrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcnrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcnrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr 
stdc_mwcnrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8nrtomcn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8nrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8nrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8nrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8nrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16nrtomcn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16nrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16nrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16nrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16nrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict 
*restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32nrtomcn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32nrtomwcn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32nrtoc8n(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32nrtoc16n(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32nrtoc32n(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcsnrtomcsn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcsnrtomwcsn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcsnrtoc8sn(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcsnrtoc16sn(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mcsnrtoc32sn(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcsnrtomcsn(size_t*restrict output_size, char*restrict *restrict output, 
size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcsnrtomwcsn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcsnrtoc8sn(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcsnrtoc16sn(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_mwcsnrtoc32sn(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, wchar_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8snrtomwcsn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8snrtomcsn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8snrtoc8sn(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8snrtoc16sn(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c8snrtoc32sn(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char8_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16snrtomwcsn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16snrtomcsn(size_t*restrict 
output_size, char*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16snrtoc8sn(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16snrtoc16sn(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c16snrtoc32sn(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char16_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32snrtomcsn(size_t*restrict output_size, char*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32snrtomwcsn(size_t*restrict output_size, wchar_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32snrtoc8sn(size_t*restrict output_size, char8_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32snrtoc16sn(size_t*restrict output_size, char16_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); stdc_mcerr stdc_c32snrtoc32sn(size_t*restrict output_size, char32_t*restrict *restrict output, size_t*restrict input_size, const char32_t*restrict *restrict input, mbstate_t*restrict state); ``` ## Which Function Form? ## {#design-forms} There are several different ways to write the functions present here, each with their own unique tradeoffs. 
Since a lot of calling conventions cannot afford struct parameters and returns by-value without elevating them to a level of indirection (filling in a pointer of an object allocated by the caller on the stack), and since much of the functionality of the standard does not follow such a convention, in this paper we simply evaluate the pointer and integer-based forms that will allow all parameters to be passed in registers or similar on most calling conventions we know of (including but not limited to arm7e, arm, arm64, amd64 (VC++ and System V), x86). From those requirements, the most prominent forms are:

```cpp
// (1)
stdc_mcerr stdc_XnrtoYn(size_t*restrict output_size, charY*restrict *restrict output, size_t*restrict input_size, const charX*restrict *restrict input, mbstate_t*restrict state);
// (2)
stdc_mcerr stdc_XnrtoYn(size_t output_size, charY*restrict *restrict output, size_t input_size, const charX*restrict *restrict input, mbstate_t*restrict state);
// (3)
stdc_mcerr stdc_XnrtoYn(charY*restrict *restrict output, const charY* output_last, const charX*restrict *restrict input, const charX* input_last, mbstate_t*restrict state);
// (4)
stdc_mcerr stdc_XnrtoYn(size_t*restrict output_size, charY* output, size_t*restrict input_size, const charX* input, mbstate_t*restrict state);
```

Form (1) is what is in this paper and the form that this paper started out with. It is what we are going to move forward with for this proposal. It is similar to `iconv`, but deviates from that design a bit by using the typical Win32 and similar convention that a null pointer argument changes the behavior to allow for greater flexibility. For example, passing `NULL` to the first design for the `output_size` allows an implementation to assume the output buffer is large enough: this can save on size checking on every successful conversion and write out.
It also allows passing `NULL` for `output`, which allows an end-user to not perform any write outs but simply determine the full count of objects. As a negative, it requires writes through indirect pointers for both the input and output, as well as for the sizes for the input and output. This causes multiple updates to be necessary and duplicates information, at the cost of a moderate decrease in ease-of-use. Unfortunately, it turns out this form is actually necessary for all of the functionality proposed here.

### `mbstate_t` Parameter for Unicode-only Conversions? ### {#design-forms-state}

Because the Unicode conversions, as specified, require no intermediate holdings, there is technically no reason to provide the `mbstate_t` parameter. However, macro-generic programming used to wrap this code may partially rely on having an additional parameter to pass to the function. There is currently no way to have a variable argument parameter in a macro reflected with `__VA_ARGS__` without undefined behavior / implementation extensions, and working around this would take even more advanced macro-generic programming. Therefore, we keep the `mbstate_t` parameter, noting that passing the state object or just simply passing `NULL`/`0` all the time to such a parameter is a viable usage strategy. We realize this is not ideal but it is [how it has been implemented in our code](https://ztdcuneicode.readthedocs.io/en/latest/api/generic%20typed%20conversions.html#c.cnc_cxsnrtocysn).

A much more compelling reason is that additional (un)safe options can be grafted into the `mbstate_t` object for use with Unicode conversions. This has been done for the purposes of optimization and has produced wildly successful results, as shown by how competitive cuneicode can be on the [transcoding benchmarks](https://ztdtext.readthedocs.io/en/latest/benchmarks/transcoding%20-%20UTF.html).

### Simplification Without Loss of Functionality? ### {#design-forms-simplify}

One attempt to address these drawbacks is form (2), which takes the sizes by value rather than by pointer but still has pointer-to-pointer values. Unfortunately for design (2), this means that it is impossible to perform a "counting" operation (just calculate the number of code units to write out or the number of input characters that will be consumed) without having a valid buffer to write data into, so that before/after pointer values for `output` can be subtracted from one another. One could then try to smuggle the error code into a count-carrying return value, though that path is fraught with API design issues. One would need to exclude the values `0`, `-1`, `-2`, and `-3` from being used in return values, or some other set of arbitrary values. These are not good ideas and C users have struggled with APIs that behaved this way in the past: see the conversion functions currently in the C Standard which behave in this manner and obfuscate the return value for `0` (which still writes out a character but also indicates other actions performed) or the case of `-3` (where multiple write outs may need to happen so the function needs to be called again). These issues have also caused fundamental limitations in the C standard library, as seen in [[glibc-25744]].

Form (3) is simply a re-visitation of form (2), but using pointers to indicate the size. This is nominally fine, until subtraction between two pointers must be done. If `PTRDIFF_MAX` is less than `SIZE_MAX` and the architecture uses, for example, segmented memory, then it is possible to create a region of memory, up to `SIZE_MAX` in size, that exceeds the size that can be computed by subtracting the leading pointers from the `*_last` pointers. This is mostly a theoretical concern on larger systems and hosted systems, but of much more grave concern on bare-metal machines with a tiny `PTRDIFF_MAX`, or machines that make full use of the address space and frequently tap into paged memory.
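The "counting" operation that form (1) permits, and that form (2) cannot express without a real buffer, can be sketched as follows. This is an illustration under stated assumptions rather than the proposed implementation: all `sketch_`-prefixed names are hypothetical, the `mbstate_t` parameter is omitted for brevity, and UTF-32 to UTF-16 is used only because it is short to write. The first pass passes a null `output` and reads the count from the decrease in `*output_size`; the second pass passes a null `output_size` to skip bounds checks and recovers the amount written from the advanced output pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from the proposal. */
typedef uint_least16_t sketch_c16; /* stand-in for char16_t */
typedef uint_least32_t sketch_c32; /* stand-in for char32_t */

typedef enum sketch_mcerr {
	sketch_mcerr_ok = 0,
	sketch_mcerr_invalid = -1,
	sketch_mcerr_incomplete_input = -2,
	sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Bulk UTF-32 to UTF-16 in the shape of form (1), state parameter omitted.
   Null output: count only. Null output_size: assume enough space. */
sketch_mcerr sketch_c32sntoc16sn(size_t *restrict output_size,
	sketch_c16 *restrict *restrict output, size_t *restrict input_size,
	const sketch_c32 *restrict *restrict input) {
	assert(input_size != NULL && input != NULL && *input != NULL);
	const int writing = output != NULL && *output != NULL;
	while (*input_size > 0) {
		sketch_c32 cp = **input;
		if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return sketch_mcerr_invalid;
		size_t need = cp < 0x10000 ? 1 : 2;
		if (output_size != NULL && *output_size < need)
			return sketch_mcerr_insufficient_output;
		if (writing) {
			if (need == 1) {
				(*output)[0] = (sketch_c16)cp;
			} else {
				/* split into a leading and trailing surrogate */
				(*output)[0] = (sketch_c16)(0xD800 + ((cp - 0x10000) >> 10));
				(*output)[1] = (sketch_c16)(0xDC00 + ((cp - 0x10000) & 0x3FF));
			}
			*output += need;
		}
		if (output_size != NULL)
			*output_size -= need;
		*input += 1;
		*input_size -= 1;
	}
	return sketch_mcerr_ok;
}

/* Returns a malloc'd, exactly sized buffer (caller frees), or NULL on error. */
sketch_c16 *sketch_convert_exact(const sketch_c32 *text, size_t length, size_t *written) {
	const sketch_c32 *in = text;
	size_t in_size = length;
	size_t budget = SIZE_MAX; /* pass 1: the decrease in budget is the count */
	if (sketch_c32sntoc16sn(&budget, NULL, &in_size, &in) != sketch_mcerr_ok)
		return NULL;
	size_t count = SIZE_MAX - budget;
	sketch_c16 *buffer = malloc((count ? count : 1) * sizeof *buffer);
	if (buffer == NULL)
		return NULL;
	sketch_c16 *out = buffer;
	in = text;
	in_size = length;
	/* pass 2: null output_size, since the buffer is known to be large enough */
	if (sketch_c32sntoc16sn(NULL, &out, &in_size, &in) != sketch_mcerr_ok) {
		free(buffer);
		return NULL;
	}
	*written = (size_t)(out - buffer); /* recovered from the advanced pointer */
	return buffer;
}
```

The second pass only works because `output` is a pointer to a pointer: with a null size there is no other place for the function to report how far it wrote.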
Finally, form (4) was the most attractive simplification. By keeping the indirect sizes but removing the double indirection from the input and output parameters, it presented a tempting bit of functionality that seemed to keep all of the benefits of form (1) but none of the drawbacks. That, unfortunately, does not hold in one specific case, "unbounded writing":

```cpp
stdc_mcerr err = stdc_c8snrtoc32sn(NULL, some_utf32_buffer, &input_sz, some_utf8_input, &state);
```

The above seems okay, until it becomes clear that you have no idea how many characters were written out into `some_utf32_buffer`. By passing `NULL` for the size but having no pointer to update, the information is lost entirely. One could argue that someone should call the version which does the counting first and THEN pass `NULL` for the size, but this is overly restrictive. For example, a maximally-sized buffer can be prepared beforehand when doing a UTF-8 to UTF-32 conversion by simply assuming every code unit of input will result in one code point of output (e.g., the entire input is ASCII). One could guarantee the fastest possible writing speed by creating such a maximally-sized buffer and then using `NULL` for the size, but it would be impossible to know **exactly** how much output was written in that case. One could compromise the return value to return that information, but that brings up the same API design issues mentioned above. Therefore, we keep the double-pointer form to retain the information properly.

### Performance of Double-Pointers? ### {#design-forms-performance}

Benchmarks were inconclusive when it came to determining the cost of each API design. While writing out through (doubly-)indirect pointers provided a non-negligible cost when serializing all 2^~20.5 available Unicode code points through a UTF-8 to UTF-32 conversion, these costs became noise values when bulk functions were written that did not simply invoke the single-conversion functions repeatedly.
That is: it performed the logical equivalent of performing the bulk operation, and only updated the input/output pointers and sizes when it was finished with the operation. This could present a problem on at least one implementation, such as musl-libc. musl-libc both reportedly and in its implementation tends to implement their current bulk transcoding routines by simply looping over the single-unit transcoding routines. But, they have stated that they do not care about the performance degradation here and that they are perfectly fine with the cost of writing the bulk functions in terms of the single transcoding functions. As such, we find no reason to change the pointer-based design on the grounds of performance either. See additional information [at this benchmarking page](https://ztdtext.readthedocs.io/en/latest/benchmarks.html), particularly the [function form](https://ztdtext.readthedocs.io/en/latest/benchmarks/function%20form.html) benchmarks and the [transcoding benchmarks](https://ztdtext.readthedocs.io/en/latest/benchmarks/transcoding%20-%20UTF.html). ### Structure Returns? ### {#design-forms-struct} We do note that there could be a better interface design in general if the error value and other information were returned in a structure (the current input pointer, output pointer, and sizes-left). Then, we would not have to compromise the error return with a size and properly separate the two so that users do not accidentally misuse it as they have in the past. But, most places in the C Standard avoid using by-value structure returns. Therefore, this idea was, similarly, discarded. ### Proposed Choice ### {#design-forms-choice} Given the above considerations about consistency with other functionality (no struct returns), the downsides of function forms (2)-(4), and the benchmark indications that the impact of doubly-indirect pointers is negligible in bulk, we therefore propose the double-pointer form, in order to retain the requisite information properly. 
Finally, we also added the `restrict` keyword to all pointers in the function signatures, in line with other functions currently in and proposed to the standard library.

## Extension Functions & Methods (Building on top of this API) ## {#design-extension}

We can achieve a better API for all of this by utilizing either Statement Expressions (the widely-implemented GCC extension that exists in tcc, Clang, many flavors of IBM Compiler, and more) or Lambdas (as shown in [[n2892]]). For example, here is an API that, without providing new backing functions, is capable of reducing the number of indirections and taking parameters directly ([implemented and deployed in the ztd.cuneicode library](https://github.com/soasis/cuneicode/blob/0d641dc8c5cd7619265b684021b0cafe6fae410b/examples/extensions/source/error_handling.utf32_to_utf8.stmt_exprs.c#L41-L46)):

```cpp
#include
#include

int main() {
	// This only works if we support extension functions!
	ztd_char8_t output_data[ztdc_c_array_size(input_data) * CNC_C8_MAX] = { 0 };
	cnc_mcstate_t state = { 0 };
	cnc_c32c8_error_result_t err_result = cnc_cxsrtocysn_into_with_handler(
		ztdc_c_array_size(output_data), output_data,
		&U"Bark Bark Bark \xFFFFFFFF🐕‍🦺!"[0],
		&state, cnc_skip_input_and_replace_error, NULL);
	// …
}
```

Here, we see that we only need to take the `cnc_mcstate_t` (analogous to this proposal's `mbstate_t`) parameter by pointer. The rest of the arguments (the `size_t` for the sizes, and `char32_t*` and `char8_t*` for the data) are taken without indirection (which also allows for things such as pointer decay from an array). This example even shows using an error handler function, with a `NULL`-provided `void* userdata` for said error handler, to skip over bad input and replace it with the typical "unknown character" diamond seen in many places ("�", U+FFFD REPLACEMENT CHARACTER).
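The same layering idea can be sketched in plain C, without statement expressions or lambdas, as an ordinary wrapper function over a pointer-based routine. This is only an illustration and not how ztd.cuneicode or the proposal implements it: every `sketch_`-prefixed name is hypothetical, the `mbstate_t` parameter is omitted, and UTF-16 to UTF-32 is chosen purely for brevity. The wrapper takes sizes and data directly and returns the error together with the amounts consumed and written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the proposal. */
typedef uint_least16_t sketch_c16; /* stand-in for char16_t */
typedef uint_least32_t sketch_c32; /* stand-in for char32_t */

typedef enum sketch_mcerr {
	sketch_mcerr_ok = 0,
	sketch_mcerr_invalid = -1,
	sketch_mcerr_incomplete_input = -2,
	sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Pointer-based bulk UTF-16 to UTF-32 in the shape of form (1), state omitted.
   For brevity this sketch requires every pointer argument to be non-null. */
sketch_mcerr sketch_c16sntoc32sn(size_t *restrict output_size,
	sketch_c32 *restrict *restrict output, size_t *restrict input_size,
	const sketch_c16 *restrict *restrict input) {
	assert(output_size && output && *output && input_size && input && *input);
	while (*input_size > 0) {
		const sketch_c16 *in = *input;
		sketch_c32 cp = in[0];
		size_t used = 1;
		if (cp >= 0xDC00 && cp <= 0xDFFF)
			return sketch_mcerr_invalid; /* stray trailing surrogate */
		if (cp >= 0xD800 && cp <= 0xDBFF) {
			if (*input_size < 2)
				return sketch_mcerr_incomplete_input;
			if (in[1] < 0xDC00 || in[1] > 0xDFFF)
				return sketch_mcerr_invalid;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (in[1] - 0xDC00);
			used = 2;
		}
		if (*output_size < 1)
			return sketch_mcerr_insufficient_output;
		*(*output)++ = cp;
		*output_size -= 1;
		*input += used;
		*input_size -= used;
	}
	return sketch_mcerr_ok;
}

/* Everything a caller wants to know, returned by value. */
typedef struct sketch_result {
	sketch_mcerr error;
	size_t input_consumed;
	size_t output_written;
} sketch_result;

/* Takes sizes and data directly; arrays decay to pointers at the call site. */
static inline sketch_result sketch_c16_to_c32(size_t output_size, sketch_c32 *output,
	size_t input_size, const sketch_c16 *input) {
	size_t out_left = output_size, in_left = input_size;
	sketch_result result;
	result.error = sketch_c16sntoc32sn(&out_left, &output, &in_left, &input);
	result.input_consumed = input_size - in_left;
	result.output_written = output_size - out_left;
	return result;
}
```

The wrapper adds no new conversion logic; it only repackages what the pointer-based function already reports through its updated sizes, which is why the pointer-based form can serve as the foundation for friendlier interfaces.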
The API presented in this paper is the most basic and fundamental API that enables all of these use cases without loss of generality, and that is why we are focused on providing it to the C Standard. Extension methods utilizing statement expressions or lambdas in combination with macro generic programming will be able to provide additional functionality without forcing implementations to add more symbols to their symbol table. Advisement: Note that, due to C's lack of strong type safety for its core typedefs like `wchar_t` or `char16_t`, these type-based generic programming bits can be brittle. For example, on Windows machines using the Clang compiler, both `wchar_t` and `char16_t` are the same type (`unsigned short`). A `_Generic`-based switch for types will either make a compiler error if both `wchar_t` and `char16_t` are used in the same switch, or fall into the "wrong" function because it is impossible to differentiate between the two types. This is further reason why the Extension Functions should be left out of this proposal and for a later time. ## `wchar_t` and Existing practice ## {#design-wchar_t} `wchar_t` and wide characters in general have a requirement placed on them in the definition section that every character supported in the wide execution encoding (and wide literal encoding) must be fully representable as a **single** `wchar_t` value. This has not matched existing practice for the last ~20 or so years, and has produced specification issues with functions such as `mbrtowc`, which does not have a sequence of return values that can adequately represent e.g. Big5-HKSCS needing to output 2 different UTF-32 code points for some of its input characters (4 specific input sequences of them result in two UTF-32 code point outputs, to be precise). Part of this proposal removes the wording that makes this requirement. 
Because this is an expansion of privileges and not a shrinking, it conflicts with no existing implementation that was already working around this requirement or arbitrarily restricting their C Standard Library functionality to handle this (some *BSD-based platforms, some IBM-vended platforms, all Microsoft platforms). As explained in [[#intro-problem-standard]], this has been a long-standing issue in C and C++, but has particularly struck the C implementations due to the wording and setup of these types. Therefore, it would be extraordinarily expedient to remove the requirement and do as this proposal does, which is provide additional functionality to cover the fundamentally incompatible ABI and API — as well as user expectations — for the existing functions. C++ has adopted identical changes to the one in this proposal here in Corentin Jabot's [[p2460r2]]. As part of coping with these changes, implementations are offered 12 new macros, 6 predefined in the compiler and 6 in the standard library header `` to reflect the new existing situation: - `__STDC_LITERAL_UTF8__`, which is a predefined compiler macro describing whether the literal encoding (the compile-time `""` string literal encoding) is ISO 10646 UTF-8 compliant (each element of the string is a single UTF-8 code unit). - `__STDC_LITERAL_UTF16__`, which is a predefined compiler macro describing whether the literal encoding (the compile-time `""` string literal encoding) is ISO 10646 UTF-16 compliant (each element of the string is a single UTF-16 code unit). - `__STDC_LITERAL_UTF32__`, which is a predefined compiler macro describing whether the literal encoding (the compile-time `""` string literal encoding) is ISO 10646 UTF-32 compliant (each element of the string is a single UTF-32 code unit). 
- `__STDC_WIDE_LITERAL_UTF8__`, which is a predefined compiler macro describing whether the wide literal encoding (the compile-time `L""` string literal encoding) is UTF-8 compliant (each element of the string is a single UTF-8 code unit). - `__STDC_WIDE_LITERAL_UTF16__`, which is a predefined compiler macro describing whether the wide literal encoding (the compile-time `L""` string literal encoding) is UTF-16 compliant (each element of the string is a single UTF-16 code unit). - `__STDC_WIDE_LITERAL_UTF32__`, which is a predefined compiler macro describing whether the wide literal encoding (the compile-time `L""` string literal encoding) is UTF-32 compliant (each element of the string is a single UTF-32 code unit). - `WCHAR_UTF8`, which is a non-constant expression, that has a non-zero value if the **wide execution** encoding (the runtime encoding associated with `mbrtowc`, `wcrtomb`, and similar functions) is ISO 10646 compliant for UTF-8. Because this is a runtime check, it can handle the inherently runtime nature of this without compromising the compiler. - `WCHAR_UTF16`, which is a non-constant expression, that has a non-zero value if the **wide execution** encoding (the runtime encoding associated with `mbrtowc`, `wcrtomb`, and similar functions) is ISO 10646 compliant for UTF-16. Because this is a runtime check, it can handle the inherently runtime nature of this without compromising the compiler. - `WCHAR_UTF32`, which is a non-constant expression, that has a non-zero value if the **wide execution** encoding (the runtime encoding associated with `mbrtowc`, `wcrtomb`, and similar functions) is ISO 10646 compliant for UTF-32. Because this is a runtime check, it can handle the inherently runtime nature of this without compromising the compiler. 
- `MB_UTF8`, which is a non-constant expression, that has a non-zero value if the **execution** encoding (the runtime encoding associated with `mbrtowc`, `wcrtomb`, and similar functions) is ISO 10646 compliant for UTF-8. Because this is a runtime check, it can handle the inherently runtime nature of this without compromising the compiler.
- `MB_UTF16`, which is a non-constant expression, that has a non-zero value if the **execution** encoding (the runtime encoding associated with `mbrtowc`, `wcrtomb`, and similar functions) is ISO 10646 compliant for UTF-16. Because this is a runtime check, it can handle the inherently runtime nature of this without compromising the compiler.
- `MB_UTF32`, which is a non-constant expression, that has a non-zero value if the **execution** encoding (the runtime encoding associated with `mbrtowc`, `wcrtomb`, and similar functions) is ISO 10646 compliant for UTF-32. Because this is a runtime check, it can handle the inherently runtime nature of this without compromising the compiler.

These changes are necessary because, as Clang implementers have pointed out in various issues against their compiler (such as [[clang-iso10646|this one]]), they cannot know beforehand whether or not the compiler can predefine `__STDC_ISO10646__` because the standard has mixed both execution encoding (execution time) and literal encoding (translation time) into the same macro, and asks the compiler to predefine it. An execution time property cannot be provably ascertained by the compiler ahead of time. GCC and other platforms work around this by using a special, internal, implementation-defined mechanism such as `stdc-predef.h`, where they collect a number of environment macros and then use them to provide enhanced compile-time information. Clang does not have the same powers right now and so, in a completely conforming manner, simply shuts off the macro despite the information it provides being very useful on a myriad of platforms.
By separating the literal encoding macros into new `__STDC_(WIDE_)LITERAL_UTF(8/16/32)__` parts, and leaving the `__STDC_ISO10646__` macro to represent a **potential** interpretation of both translation and execution time behavior, we give the user more actionable information and prevent split compiler/library implementations like Clang from needing special knowledge to provide useful information. We also allow implementations to provide the information at runtime, which -- especially for large strings -- can provide sufficiently actionable information that even doing the check at runtime can give significant performance improvements. (UTF-32 based encoding and decoding routines are often very heavily optimized, as compared to generic locale-based conversion routines in not only C and C++ implementations, but routines underlying Haskell, Go, and other programming languages. See the work in Daniel Lemire's [[lemire-spire2021]].)

No changes are needed to the meaning of `__STDC_MB_MIGHT_NEQ_WC__` because the specification for this macro already deliberately refers to using a basic character set value "… as the lone character in an integer character constant" (§6.10.9.2, ¶1, [[n3054]]). Therefore, the only change to this predefined macro is in its description, where it applies directly to the literal **AND** wide literal encodings, rather than only one.

These changes strengthen C's legacy as a language suitable for powerful string processing.
An example implementation of these macros on a machine that uses the Microsoft Windows Universal C Runtime ("ucrt") as its underlying implementation can be done as follows:

```cpp
#include
#define MB_UTF8 (MB_CUR_MAX == 4)
#define MB_UTF16 0
#define MB_UTF32 0
#define WCHAR_UTF8 0
#define WCHAR_UTF16 1
#define WCHAR_UTF32 0
```

A more sophisticated implementation is demonstrated using values computed with functions in [ztd.idk's headers](https://github.com/soasis/idk/blob/613a52df6995c0afd3f2457219a3961859f006b2/include/ztd/idk/encoding_detection.h#L60-L89) and [ztd.idk's source files](https://github.com/soasis/idk/blob/613a52df6995c0afd3f2457219a3961859f006b2/source/ztd/idk/encoding_detection.c.cpp).

## `mbstate_t` and state handling ## {#design-state}

The newly proposed functions have some behavior that, on inspection, may seem unnecessary, duplicated, or superfluous. For example, passing a null pointer for `input` (or `*input`) results in setting `*state` to the initial conversion state. We note here that this functionality exists in the API because the legacy APIs such as `mbrtowc` and its friends also have that behavior. In order to make portability as easy as humanly possible (and to allow, with minimal tweaking, reusing such functions internally if it's helpful), we leave the equivalent behaviors intact in this API.

Additionally, the processing of code units with a value of `\0` also setting `*state` to the initial conversion state is a legacy holdover from how the previous functions work. Again, this provides multiple ways to clear `*state` in a way that is not just doing the equivalent of zero-initializing the bits with `= {}`, `= (mbstate_t){}`, or similar assignment or initialization expressions. This can be useful when one wants to preserve certain internal bits in their library, such as not clearing an implementation-defined set of bits for doing non-validating, fast conversions.
See [the `cnc_mcstate_t` assumption setting/getting documentation](https://ztdcuneicode.readthedocs.io/en/latest/api/mcstate_t.html#state-functions) for an example of affecting a state object to perform additional behaviors beyond what is sanctioned by the specification itself.

# Conclusion # {#conclusion}

The ecosystem deserves ways to get to a statically-known encoding and not rely on implementation and locale-parameterized encodings. This allows developers a way to perform cross-platform text processing without needing to go through fantastic gymnastics to support different languages and platforms. An independent library implementation, *cuneicode* (presented at [[Unicode_greater_detail|Meeting C++]] and [[Unicode_deep_c_diving|C++ On Sea]]), is now [[cuneicode|publicly available to everyone]].

# Proposed Wording # {#wording}

The following wording is [[n3054|relative to n3054]].

## Intent ## {#wording-intent}

The intent of the wording is to provide transcoding functions that:

- define "code unit" as the smallest piece of information;
- define the notion of an "indivisible unit of work";
- remove the requirement that wide characters must represent a full, complete unit for all wide execution encodings that exist on the machine as they do not today;
- introduce the notion of multi-unit work that does not use the same 1:N or M:1 design as the previous `wchar_t` functions;
- introduce new macros that allow for a programmer to tell the difference between
- convert from the execution ("`mc`") and wide execution ("`mwc`") encodings to the Unicode ("`c8`", "`c16`", "`c32`") encodings and vice-versa;
- convert from the execution ("`mc`") encoding to the wide execution ("`mwc`") encoding and vice-versa;
- provide a way for `mbstate_t` to be properly initialized as the initial conversion state; and,
- are entirely thread-safe by default, with no magic internal state aside from what is already required by locales.
## Proposed Specification ## {#wording-specification}

*Author's Note: Any � or ✨ is a stand-in character to be replaced by the editor.*

### Modify §3.7.3 Wide Character to change the definition to remove the one-to-one correspondence ### {#wording-specification-3.7.3}
3.7.3
**wide character** value representable by an object of type `wchar_t`, capable of representing any character in the current locale
### Modify §6.10.9.2 Environment Macros to add new wide literal-only predefined macro ### {#wording-specification-6.10.9.2}
6.10.9.2 Environment Macros
The following are unspecified: :: `__STDC_ISO_10646__` An integer constant of the form `yyyymmL` (for example, `199712L`). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type `wchar_t`, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined. :: `__STDC_LITERAL_UTF8__` An integer constant of the form `yyyymmL` (for example, `202012L`). If this symbol is defined, then the literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `char` from a string literal (excluding non-Universal character escape sequences) forms a valid encoding of UTF-8 sequence. :: `__STDC_LITERAL_UTF16__` An integer constant of the form `yyyymmL` (for example, `202012L`). If this symbol is defined, then the literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `char` from a string literal (excluding non-Universal character escape sequences) forms a valid encoding of UTF-16 sequence. :: `__STDC_LITERAL_UTF32__` An integer constant of the form `yyyymmL` (for example, `202012L`). If this symbol is defined, then the literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `char` from a string literal (excluding non-Universal character escape sequences) forms a valid encoding of UTF-32 sequence. :: `__STDC_MB_MIGHT_NEQ_WC__` The integer constant `1`, intended to indicate that, in the wide literal encoding for `wchar_t`, a member of the basic character set need not have a code value equal to its value when used as the lone character in an integer character constant. 
:: `__STDC_WIDE_LITERAL_UTF8__` An integer constant of the form `yyyymmL` (for example, `202012L`). If this symbol is defined, then the wide literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `wchar_t` from a `wchar_t` string literal (excluding non-Universal character escape sequences) forms a valid encoding of UTF-8 sequence. :: `__STDC_WIDE_LITERAL_UTF16__` An integer constant of the form `yyyymmL` (for example, `202012L`). If this symbol is defined, then the wide literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `wchar_t` from a `wchar_t` string literal (excluding non-Universal character escape sequences) forms a valid encoding of UTF-16 sequence. :: `__STDC_WIDE_LITERAL_UTF32__` An integer constant of the form `yyyymmL` (for example, `202012L`). If this symbol is defined, then the wide literal encoding (6.2.9) is capable of storing every character in the Unicode required set and every sequence of `wchar_t` from a `wchar_t` string literal (excluding non-Universal character escape sequences) forms a valid encoding of a UTF-32 sequence.
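As a non-normative illustration of how the predefined macros above could be consumed, the following sketch picks a description of the wide literal encoding at translation time. The `__STDC_WIDE_LITERAL_UTF*__` macros are only proposed by this paper, so no current translator is assumed to define them, and the helper name `sketch_wide_literal_encoding_name` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: reports which of the proposed wide literal
   encoding macros the translator defines. Where none of the proposed
   macros exist, the fallback branch is taken. */
const char *sketch_wide_literal_encoding_name(void) {
#if defined(__STDC_WIDE_LITERAL_UTF32__)
    return "UTF-32";
#elif defined(__STDC_WIDE_LITERAL_UTF16__)
    return "UTF-16";
#elif defined(__STDC_WIDE_LITERAL_UTF8__)
    return "UTF-8";
#else
    return "implementation-defined";
#endif
}
```

Because the check happens in the preprocessor, code that depends on a specific wide literal encoding can be excluded entirely on translators that do not provide it.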
### Modify §7.21 Common Definitions `<stddef.h>` to Remove Harmful `wchar_t` Text ### {#wording-specification-7.21}
7.21 Common definitions `stddef.h`
The types are …

… which is an object type whose alignment is the greatest fundamental alignment;

```cpp
wchar_t
```

which is an integer type whose range of values can represent <del>distinct codes for all members of the largest extended character set specified among the supported locales;</del><ins>codes (or part of a sequence of codes) for all the members of the supported wide execution and wide literal encodings (6.2.9);</ins> the null character shall have the code value zero. Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define `__STDC_MB_MIGHT_NEQ_WC__`; and,
### Modify §7.31 Extended Multibyte and Wide Character Utilities `<wchar.h>` to Clarify Role of `wchar_t` ### {#wording-specification-7.31}
7.31 Extended multibyte and wide character utilities `<wchar.h>`
7.31.1 Introduction
The header `<wchar.h>` defines <del>four</del><ins>several</ins> macros, and declares four data types, one tag, and many functions.
The macros defined are `NULL` (described in 7.21); `WCHAR_MIN`, `WCHAR_MAX`, and `WCHAR_WIDTH` (described in 7.22);

```cpp
WCHAR_UTF8
WCHAR_UTF16
WCHAR_UTF32
```

which expand to an expression of signed or unsigned integer type (that is potentially not an integer constant expression) whose value is non-zero if:

- the wide execution encoding (6.2.9) is capable of representing every character in the required Unicode set;
- the width of `wchar_t` is at least 8, 16, or 32 for UTF-8, UTF-16, or UTF-32, respectively;
- and, the values of a sequence of `wchar_t` objects consumed and produced by related character functions have values consistent with a sequence of code units of the UTF-8, UTF-16, or UTF-32 encodings, respectively;

```cpp
MB_UTF8
MB_UTF16
MB_UTF32
```

which expand to an expression of signed or unsigned integer type (that is potentially not an integer constant expression) whose value is non-zero if:

- the execution encoding (6.2.9) is capable of representing every character in the required Unicode set;
- the width of `char` is at least 8, 16, or 32 for UTF-8, UTF-16, or UTF-32, respectively;
- and, the values of a sequence of `char` objects consumed and produced by related character functions have values consistent with a sequence of code units of the UTF-8, UTF-16, or UTF-32 encodings, respectively;

and,

```cpp
WEOF
```

which expands to a …
### Modify §6.2.9 Encodings to include definitions of Code Point, Code Unit, Wide/Narrow Execution Encodings, and Encoding Error ### {#wording-specification-6.2.9}
6.2.9 Encodings
A *code unit* is a single compositional unit of encoded information, usually of type `char`, `wchar_t`, `char8_t`, `char16_t`, or `char32_t`.
A *code point* is a single compositional unit of decoded information. Code points are generally used as the single complete decoded output, or as an intermediary to transcode to other code units. A *Unicode code point* is a single compositional unit of decoded information as defined in ISO/IEC 10646, typically used to convert to or from UTF-8, UTF-16, and UTF-32.
The *narrow execution encoding* is the implementation-defined, `LC_CTYPE` (7.11.1)-influenced, locale-based execution environment encoding. The *wide execution encoding* is the implementation-defined, `LC_CTYPE` (7.11.1)-influenced, locale-based wide execution environment encoding. Both of these encodings are called the *execution encodings*.
An *unrecorded encoding error* occurs when an encoding, decoding, or transcoding function encounters an input sequence of code units or code points that

:: — does not form a valid sequence according to the encoding associated with the sequence, or
:: — is not representable in the output encoding or coded character set.
An *encoding error* is the same as an unrecorded encoding error, except that the value of the macro `EILSEQ` (7.5) is stored in `errno` when such an error occurs during execution of the functions defined in this document unless otherwise specified.
### Modify §7.21.3 Files to remove the italics from the term "encoding error", since its initial definition was moved to §6.2.9 Encodings ### {#wording-specification-7.21.3}
7.21.3 Files
An <del>*encoding error*</del><ins>encoding error</ins> occurs if the character sequence presented to the underlying `mbrtowc` function does not form a valid (generalized) multibyte character, or if the code value passed to the underlying `wcrtomb` does not correspond to a valid (generalized) multibyte character. The wide character input/output functions and the byte input/output functions store the value of the macro `EILSEQ` in `errno` if and only if an encoding error occurs.
### Create a new section §7.S✨ and §7.S✨.1 Text Transcoding Utilities ### {#wording-specification-7.S✨}
7.S✨ Text transcoding utilities ``
7.S✨.1 General
The header `` declares four status code enumerators, five macros, several types and several functions for transcoding encoded text safely and efficiently. It is meant to supersede conversion utilities from Unicode utilities (7.28) and Extended multibyte and wide character utilities (7.29). It is meant to represent "multi character" functions. These functions can be used to count the number of input code units that form a complete sequence, count the number of output code units required for a conversion with no additional allocation, validate an input sequence, or transcode text from one encoding to another encoding. Particularly, it provides single unit and multi unit transcoding functions for transcoding by working on *code units* and *code points*.
Inputs to the functions in this clause are read until there is enough information taken in to perform an *indivisible unit of work*. An indivisible unit is the smallest possible input, as defined by the encoding, that can produce one or more outputs, perform a transformation of any state, or both. The conversion of these indivisible units is called an indivisible unit of work, and they are used to complete the transcoding operations specified in this subclause.
One or more of the following must hold for any given transcoding operation on an attempt to complete an indivisible unit of work:

:: — enough input is consumed to perform an output or change the state;
:: — output is written from consuming input, or output is written from the state which causes the state to change; or,
:: — an error occurs and both the input and output do not change relative to the current indivisible unit of work.

For the multi unit functions, the process acts as if it completes one indivisible unit of work repeatedly. When an error occurs, only the input successfully consumed, the state successfully altered, and the output successfully written according to the last indivisible unit of work are reflected in the output values of the functions in this clause: no other values are written.
Functions in `` which use `char` and `wchar_t`, or their qualified forms, derive their implementation-defined encodings from the narrow execution encoding or the wide execution encoding (6.2.9), respectively. The other encodings are UTF-8, associated with `char8_t`, UTF-16, associated with `char16_t`, and UTF-32, associated with `char32_t`.
NOTE  Each value is treated as code units and not as a container of octets. This means that the decision of, for example, UTF-16 in big or little endian encoding scheme is decided by the endianness of the code unit type. Only whole code unit values are used (i.e. a UTF-32 code point value of U+1F377 represents a value identical to how `U'\U0001F377'` is stored by the implementation).
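A small, non-normative sketch of what "whole code unit values" means for UTF-16 follows. It assumes the translator encodes `u""` literals as UTF-16 (guaranteed from C23 onward, and indicated by `__STDC_UTF_16__` before that) and that `<uchar.h>` is available; the names `sketch_wine_glass_utf16` and `sketch_combine_surrogates` are hypothetical.

```c
#include <stddef.h>
#include <uchar.h>

/* U+1F377 as UTF-16 code units; the trailing 0 comes from the literal. */
const char16_t sketch_wine_glass_utf16[] = u"\U0001F377";

/* Hypothetical helper: combines a surrogate pair of whole code unit
   values into a code point. No byte order is involved, because each
   char16_t is read as a value and not as a pair of octets. */
char32_t sketch_combine_surrogates(char16_t high, char16_t low) {
    return (char32_t)0x10000
         + (((char32_t)high - 0xD800u) << 10)
         + ((char32_t)low - 0xDC00u);
}
```

The same code unit values (`0xD83C`, `0xDF77`) are observed on big and little endian machines; only the in-memory octet layout of each `char16_t` differs.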
For the UTF-8, UTF-16, and UTF-32 encodings, collectively referred to as the *Unicode encodings*, an indivisible unit of work for a read operation shall be the sequence of code units that corresponds to one Unicode code point. If input is exhausted before a sequence of code units corresponding to one Unicode code point can be reached, then `stdc_mcerr_incomplete_input` shall be returned. If there is an illegal code unit sequence, then `stdc_mcerr_invalid` shall be returned. For the implementation-defined execution and wide execution encodings, they have the same aforementioned requirement if the implementation defines it to be one of the Unicode encodings.
NOTE  If an implementation chooses to provide e.g. an execution encoding, defined to be the same as the UTF-8 encoding, as the input encoding for a transcoding function, then it is required to read one full complete Unicode code point's worth of code units. If it cannot, then it returns `stdc_mcerr_incomplete_input` (if the input sequence is not long enough but does not have any invalid code units in the sequence) or `stdc_mcerr_invalid` (if the input sequence is not a proper code unit sequence).
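The read requirement for Unicode encodings can be sketched, non-normatively, as a UTF-8 reader that either consumes exactly one code point's worth of code units or reports why it could not. Everything named `sketch_*` below is a hypothetical stand-in written for this illustration, not the proposed API; `unsigned char` and `uint_least32_t` stand in for `char8_t` and `char32_t`.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in status codes; the names mirror the proposed stdc_mcerr values. */
typedef enum sketch_mcerr {
    sketch_mcerr_ok = 0,
    sketch_mcerr_invalid = -1,
    sketch_mcerr_incomplete_input = -2
} sketch_mcerr;

/* Hypothetical helper: reads the UTF-8 code units of exactly one code
   point. On any error, *input and *input_size are left untouched. */
sketch_mcerr sketch_read_utf8(size_t *input_size, const unsigned char **input,
                              uint_least32_t *code_point) {
    const unsigned char *p = *input;
    size_t available = *input_size;
    size_t length, i;
    uint_least32_t cp, minimum;

    if (available == 0)
        return sketch_mcerr_ok; /* empty input: nothing to do */

    if (p[0] < 0x80) { length = 1; cp = p[0]; minimum = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { length = 2; cp = p[0] & 0x1F; minimum = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { length = 3; cp = p[0] & 0x0F; minimum = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { length = 4; cp = p[0] & 0x07; minimum = 0x10000; }
    else return sketch_mcerr_invalid; /* stray continuation or illegal lead unit */

    for (i = 1; i < length; ++i) {
        if (i >= available)
            return sketch_mcerr_incomplete_input; /* valid so far, but cut short */
        if ((p[i] & 0xC0) != 0x80)
            return sketch_mcerr_invalid;
        cp = (cp << 6) | (uint_least32_t)(p[i] & 0x3F);
    }
    /* Simplification: overlong forms, surrogates, and out-of-range values
       are only rejected once every code unit of the sequence is present. */
    if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return sketch_mcerr_invalid;

    *code_point = cp;
    *input = p + length;
    *input_size = available - length;
    return sketch_mcerr_ok;
}
```

A caller that receives the incomplete-input status can append more code units to its buffer and call again from the same position, since nothing was consumed.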
NOTE  The requirements for Unicode encodings do not apply to derivative encodings defined by the implementation. For example, an implementation may define a "partial UTF-8" execution encoding where it stores every read UTF-8 code unit in the state and, rather than returning `stdc_mcerr_incomplete_input`, returns `stdc_mcerr_ok` and produces no output. It may accumulate code units and write out a code point when it accumulates enough code units in its internal state. However, such an encoding is distinct and separate from the UTF-8 encoding used in the `c8` prefixed and suffixed functions described in this clause.
NOTE  The implementation-defined execution, wide execution, literal, and wide literal encodings can also have different behaviors if they do not define themselves as one of the Unicode encodings. For example, if `__STDC_ENDIAN_NATIVE__` (7.18.2) is equivalent to `__STDC_ENDIAN_LITTLE__`, but the wide execution encoding is defined to be "UTF-16 Big Endian" ("UTF16-BE"), then it may be classified as not one of the three recognized Unicode encodings according to this subclause. As such, a sequence of `wchar_t` elements that is null-terminated produced by transcoding functions in this subclause can behave differently than expected; e.g. `L"\U0001F377"`, if valid and defined to be UTF-16, can potentially not compare equal to a sequence of `wchar_t` objects produced by successful use of the transcoding functions from the code points U+01F377 and U+000000.
For all functions in this clause, when a code unit value of 0 (e.g. `'\0'`) is encountered in the input, the `mbstate_t` object in use for the transcoding operation is set to the initial shift state. The output associated with the indivisible unit of work consists of the appropriate null character preceded by any shift sequence necessary to cause the output to be in the initial shift state.
NOTE  As described in 7.30.6, an object of type `mbstate_t` can always be set to the initial conversion state by initializing it with `= {0};` or `= {};`. An existing `mbstate_t` object can always be set to the initial conversion state by assigning to it from the expressions `(mbstate_t){0}` or `(mbstate_t){}`.
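The two reset forms in the NOTE can be checked with the existing `mbsinit` function from `<wchar.h>`. The following is a minimal sketch; the helper name `sketch_state_is_initial_after_reset` is hypothetical, and `{ 0 }` is used instead of `{}` so that the example also compiles before C23.

```c
#include <wchar.h>

/* Hypothetical helper: returns nonzero if both the initialization form and
   the assignment form leave the object in the initial conversion state,
   as reported by the standard mbsinit function. */
int sketch_state_is_initial_after_reset(void) {
    mbstate_t fresh = { 0 };      /* initialization form */
    mbstate_t reused = { 0 };
    reused = (mbstate_t){ 0 };    /* assignment form, for an existing object */
    return mbsinit(&fresh) != 0 && mbsinit(&reused) != 0;
}
```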
Changing the `LC_CTYPE` category causes any conversion state already in use with the functions in this clause to be indeterminate.
The types declared are `mbstate_t` (described in 7.29.1), `wchar_t` (described in 7.19), `char8_t` (described in 7.28), `char16_t` (described in 7.28), `char32_t` (described in 7.28), `size_t` (described in 7.19), and

```cpp
stdc_mcerr
```

which is both an enumerated type and a typedef whose enumerators identify the status codes from the function calls described in this clause.
The macros declared are `NULL` (described in 7.21); `WCHAR_MIN`, `WCHAR_MAX`, and `WCHAR_WIDTH` (described in 7.22); `WCHAR_UTF8`, `WCHAR_UTF16`, `WCHAR_UTF32`, `MB_UTF8`, `MB_UTF16`, and `MB_UTF32` (described in 7.31); and,
```cpp
STDC_C8_MAX
STDC_C16_MAX
STDC_C32_MAX
STDC_MC_MAX
STDC_MWC_MAX
```
which correspond to the maximum output for each single unit conversion function (7.S✨.2) and its corresponding output type. Each macro shall expand into an integer constant expression with minimum values, as described in Table ✨MEOW✨.
There is an association of naming convention, types, encoding, and maximums, used to describe the functions in this clause:
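<table>
<caption>Table ✨MEOW✨: Transcoding function associations</caption>
<thead>
<tr><th>Name</th><th>Code Unit Type</th><th>Encoding</th><th>Maximum Output Macro</th><th>Minimum Value</th></tr>
</thead>
<tbody>
<tr><td>mc</td><td>`char`</td><td>The narrow execution encoding, influenced by `LC_CTYPE`</td><td>`STDC_MC_MAX`</td><td>`1`</td></tr>
<tr><td>mwc</td><td>`wchar_t`</td><td>The wide execution encoding, influenced by `LC_CTYPE`</td><td>`STDC_MWC_MAX`</td><td>`1`</td></tr>
<tr><td>c8</td><td>`char8_t`</td><td>UTF-8</td><td>`STDC_C8_MAX`</td><td>`4`</td></tr>
<tr><td>c16</td><td>`char16_t`</td><td>UTF-16</td><td>`STDC_C16_MAX`</td><td>`2`</td></tr>
<tr><td>c32</td><td>`char32_t`</td><td>UTF-32</td><td>`STDC_C32_MAX`</td><td>`1`</td></tr>
</tbody>
</table>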
The maximum output macro values specified in Table ✨MEOW✨ are related to the single unit conversion functions (7.S✨.2). These functions perform at most one indivisible unit of work, or return an error. The maximum output macro values shall be integer constant expressions large enough that conversions to the single unit conversion function's specified encoding shall not overflow a buffer of the proper code unit type with that size. The maximum output macro values do not affect the multi unit conversion functions (7.S✨.3), which perform as many indivisible units of work as is possible until an error occurs, until the output space is exhausted, or until the input is exhausted.
Unlike the functions present in `` and ``, the functions present in this clause can write more than 1 `wchar_t` value for conversions based on the wide execution encoding to accommodate a wider set of implementation-defined encodings, so long as the number of code units does not exceed the maximum output macro value of `STDC_MWC_MAX`.
The enumerators of the enumerated type `stdc_mcerr` are defined as follows:
```cpp
stdc_mcerr_ok = 0,
stdc_mcerr_invalid = -1,
stdc_mcerr_incomplete_input = -2,
stdc_mcerr_insufficient_output = -3,
```
Each value represents a specific situation when calling the relevant transcoding functions in ``:

:: — `stdc_mcerr_insufficient_output`, when the input is correct and an indivisible unit of work can be performed but there is not enough output space to write to;
:: — `stdc_mcerr_incomplete_input`, when the input has been exhausted before an indivisible unit of work could be completed and the code units read so far are not incorrect;
:: — `stdc_mcerr_invalid`, when an unrecorded encoding error occurred; and,
:: — `stdc_mcerr_ok`, when the operation was successful (none of the situations described for the other values of this enumerated type apply).

No other value shall be returned from the functions described in this clause.
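As a non-normative sketch, the status codes can be mirrored in a local enumeration and classified by how a caller would typically react. The `sketch_*` names are hypothetical stand-ins for the proposed `stdc_mcerr` type; the retry rule encoded below follows the statement in 7.S✨.2 that `*state` is not changed for the incomplete-input and insufficient-output results.

```c
/* Stand-in for the proposed enumeration, with the same values. */
typedef enum sketch_mcerr {
    sketch_mcerr_ok = 0,
    sketch_mcerr_invalid = -1,
    sketch_mcerr_incomplete_input = -2,
    sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Hypothetical helper: a printable name for diagnostics. */
const char *sketch_mcerr_name(sketch_mcerr status) {
    switch (status) {
    case sketch_mcerr_ok: return "ok";
    case sketch_mcerr_invalid: return "invalid";
    case sketch_mcerr_incomplete_input: return "incomplete input";
    case sketch_mcerr_insufficient_output: return "insufficient output";
    }
    return "unknown";
}

/* Hypothetical helper: nonzero when the call can simply be repeated with
   the same state object after supplying more input or more output space. */
int sketch_can_retry_with_same_state(sketch_mcerr status) {
    return status == sketch_mcerr_incomplete_input
        || status == sketch_mcerr_insufficient_output;
}
```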
Recommended Practice
The maximum output macro values are intended for use in making automatic storage duration array declarations. Implementations should choose values for the macros that are spacious enough to accommodate a variety of underlying implementation choices for the target encodings supported by the narrow execution encodings and wide execution encodings, which for some encodings can output more than one UTF-32 code point. A set of values which are most resilient to future additions and changes in implementations is as follows:
```cpp
#define STDC_C8_MAX 32
#define STDC_C16_MAX 16
#define STDC_C32_MAX 8
#define STDC_MC_MAX 32
#define STDC_MWC_MAX 16
```
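The intended use for automatic storage duration arrays can be sketched as follows. Since the proposed header is not assumed to exist, the macros are defined locally with the recommended values above when absent, and the minimum values from Table ✨MEOW✨ are checked at translation time. The function name is hypothetical, and `unsigned char` stands in for `char8_t`.

```c
#include <stddef.h>

/* Stand-in definitions with the recommended values, used only where the
   proposed header does not provide the macros. */
#ifndef STDC_C8_MAX
#define STDC_C8_MAX 32
#define STDC_C16_MAX 16
#define STDC_C32_MAX 8
#define STDC_MC_MAX 32
#define STDC_MWC_MAX 16
#endif

/* The table's minimum values, verified at translation time. */
_Static_assert(STDC_C8_MAX >= 4, "UTF-8 needs up to 4 code units per code point");
_Static_assert(STDC_C16_MAX >= 2, "UTF-16 needs up to 2 code units per code point");
_Static_assert(STDC_C32_MAX >= 1, "UTF-32 needs 1 code unit per code point");
_Static_assert(STDC_MC_MAX >= 1, "narrow execution encoding minimum");
_Static_assert(STDC_MWC_MAX >= 1, "wide execution encoding minimum");

/* Hypothetical helper: declares a scratch buffer sized for one single unit
   conversion to UTF-8 and reports its capacity in code units. */
size_t sketch_c8_scratch_capacity(void) {
    unsigned char scratch[STDC_C8_MAX]; /* automatic storage, no allocation */
    return sizeof scratch / sizeof scratch[0];
}
```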
Beyond just the Unicode encodings specified previously, implementations are encouraged to not store partial reads or partial writes in the `mbstate_t` object with these functions unless strictly necessary. Implementations providing additional encodings for use with these functions should, to the extent possible for a given encoding, always define an indivisible unit of work to transcode as complete a unit of information as is possible or produce an error. If a sequence of code units cannot form a complete shift sequence or produce output, then an implementation should return `stdc_mcerr_incomplete_input` if the input is exhausted, or `stdc_mcerr_invalid` if the input sequence is incorrect.
### Create a new section §7.S✨.2 Single Unit Sized Conversion Functions ### {#wording-specification-7.S✨.2}
7.S✨.2 Single Unit Sized Conversion Functions
Synopsis
```cpp
#include

stdc_mcerr stdc_mcnrtomcn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtomwcn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtoc8n(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtoc16n(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcnrtoc32n(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_mwcnrtomcn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtomwcn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtoc8n(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtoc16n(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcnrtoc32n(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c8nrtomcn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtomwcn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtoc8n(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtoc16n(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8nrtoc32n(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c16nrtomcn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtomwcn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtoc8n(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtoc16n(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16nrtoc32n(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c32nrtomcn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtomwcn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtoc8n(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtoc16n(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32nrtoc32n(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
```
Description
Let *transcoding function* be one of the functions listed previously transcribed in the form

```cpp
stdc_mcerr stdc_XnrtoYn(size_t*restrict output_size,
	charY*restrict *restrict output, size_t*restrict input_size,
	const charX*restrict *restrict input, mbstate_t*restrict state)
```

with the following properties:

:: — *X* and *Y* be one of the prefixes/suffixes in the table from 7.S✨.1;
:: — `charX` and `charY` be the associated code unit types for *X* and *Y* in the table from 7.S✨.1; and
:: — *encoding X* and *encoding Y* be the associated encoding types for *X* and *Y* in the table from 7.S✨.1.

The transcoding functions take an input buffer and possibly an output buffer of the associated code unit types, potentially with their sizes. The function consumes any number of code units of type `charX` to perform a single indivisible unit of work necessary to convert some amount of input from encoding X to encoding Y, which results in zero or more output code units of type `charY`.
An `mbstate_t` object assuredly describes the conversion state for the current conversion if it is not in an unspecified state (as described further later in this clause) and: :: — the conversion is between Unicode encodings; or :: — the input fragment is the start of an encoded sequence in the input encoding and the `mbstate_t` object was initialized to the initial conversion state; or :: — the input fragment is a continuation of an encoded sequence and `mbstate_t` is the result of having advanced to the input and possibly output positions through the application of prior calls to the same transcoding function. The behavior is undefined when a function described by this subclause is invoked and `*state` (or, if `state` is a null pointer, the `mbstate_t` object created that is unique to the current invocation) does not describe the conversion state for the current conversion.
The transcoding functions convert from code units of type `charX` interpreted according to encoding X to code units of type `charY` according to encoding Y given a conversion state of value `*state`. This function only performs a single indivisible unit of work. It returns `stdc_mcerr_ok` if the input is empty. The input is considered empty if `input_size` is a null pointer, or `*input_size` is zero if `input_size` is not a null pointer.
Any time *input code unit reads* in the following description is used:

:: — code units are read from `*input`, sequentially, and interpreted according to encoding X;
:: — if `*input_size` is smaller than the necessary amount of sequential reads that must be performed from `input` to complete an indivisible unit of work, the function does not modify any of `*input`, `*input_size`, `*output`, or `*output_size`. It returns `stdc_mcerr_incomplete_input`;
:: — if an unrecorded encoding error occurs (e.g. the input read is invalid according to encoding X or the input is valid but cannot be converted to encoding Y), then the function returns `stdc_mcerr_invalid`;
:: — if the function returns `stdc_mcerr_ok`, the function decrements `*input_size` by the number of input code units that were read and increments `*input` by the number of input code units that were read for the complete indivisible unit of work.

Any time *output code unit writes* in the following description is used:

:: — converted code units are potentially written into `*output`, sequentially, according to encoding Y;
:: — if `output_size` is not a null pointer and if `*output_size` is smaller than the necessary amount of sequential writes that must be performed to complete an indivisible unit of work, the function does not modify any of `*input`, `*input_size`, `*output`, or `*output_size`. It returns `stdc_mcerr_insufficient_output`;
:: — if `output_size` is a null pointer, but `output` and `*output` are not null pointers, it is assumed `*output` has enough space to perform the necessary sequential writes and the behavior is undefined if the target output buffer is not large enough for this transcoding operation's indivisible unit of work;
:: — if the function returns `stdc_mcerr_ok`, the function decrements `*output_size` by the number of output code units that are, or could have been (if `output` or `*output` are null pointers), written.
If `output` and `*output` are not null pointers, then `*output` is incremented by the number of output code units that are written to complete an indivisible unit of work.
The behavior of the transcoding functions is as follows:

1. If `state` is a null pointer, then an automatic storage duration object of type `mbstate_t` is created which is unique to the current invocation. It is initialized to the initial conversion state and a pointer to this object is used wherever `state` is used in this paragraph.
2. Then, if `input` is a null pointer or `*input` is a null pointer, then `*state` is set to the initial conversion state, the function returns `stdc_mcerr_ok`, and no other actions are taken.
3. Otherwise, if `*state` is in an implementation-defined conversion state that requires it, any necessary output code unit writes are performed to return `*state` to the initial conversion state.
4. The function performs input code unit reads and subsequently performs the output code unit writes as is necessary to complete an indivisible unit of work.
5. The function returns `stdc_mcerr_ok`.
NOTE  If `state` is a null pointer, and the function uses e.g. a created automatic storage duration `mbstate_t` object that is discarded by the end of the invocation, then any potential conversion state contained in the created `mbstate_t` object and used during processing could become unrecoverable to the program.
Returns
On success or failure, the transcoding functions shall return one of the above error codes (7.S✨.1). If `input` is a null pointer or `*input` is a null pointer, then `*state` is set to the initial conversion state and no other work is performed.
If the function returns `stdc_mcerr_ok`, then all of the following are true:

:: — if `input` and `*input` are not null pointers, `*input` is incremented by the number of code units read and successfully converted;
:: — if `input_size` is not a null pointer, `*input_size` is decremented by the number of code units read and successfully converted from the input;
:: — if `output` and `*output` are not null pointers, `*output` is incremented by the number of code units written to the output; and,
:: — if `output_size` is not a null pointer, `*output_size` is decremented by the number of code units written to the output.

Otherwise, if an error is returned, then none of the above occurs. If the return value is `stdc_mcerr_invalid`, then `*state` is in an unspecified state. If the return value is `stdc_mcerr_incomplete_input` or `stdc_mcerr_insufficient_output`, then `*state` is not changed.

Recommended Practice
Implementations should take advantage of the information of null pointer values for the output size pointer, output data pointer, or both, to drastically improve performance characteristics for assumed unlimited write space, output counting scenarios, or input validation/counting, respectively.
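As one illustration of the kind of specialization this paragraph has in mind, a counting-only path (null output data pointer, non-null output size pointer) never needs to compute code points at all. The helper below is a hypothetical sketch, not part of the proposal: it assumes the UTF-8 input has already been validated, and its name is invented for this example.

```c
#include <stddef.h>

/* Hypothetical helper: number of UTF-16 code units needed for `n` bytes of
   UTF-8 that are ALREADY KNOWN to be well-formed (an assumption of this sketch). */
size_t sketch_count_c16_from_valid_c8(const unsigned char *s, size_t n) {
	size_t count = 0;
	for (size_t i = 0; i < n; ++i) {
		unsigned char b = s[i];
		if ((b & 0xC0) == 0x80) {
			continue; /* continuation byte: belongs to an earlier lead byte */
		}
		/* Lead bytes 0xF0 and above start a four-byte sequence, which
		   becomes a surrogate pair; everything else is one code unit. */
		count += (b >= 0xF0) ? 2 : 1;
	}
	return count;
}
```

A real implementation would still have to report invalid input. The point is only that knowing no writes will happen lets the inner loop shrink to a classification of lead bytes.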
Implementations should prefer returning an error for an incomplete input sequence over storing intermediate data within the state where possible for non-Unicode encodings. This can make it easier for functionality built on top of the functions in this subclause to report errors without skipping over potentially invalid input data, resulting in potentially more accurate reports. Error handling and recovery also greatly benefit from being able to examine invalid input; avoiding skipping over invalid data by consuming it into a state and reporting no errors means that functionality built on top can potentially discard what should be considered unneeded, already-processed data.
### Create a new subsection §7.S✨.3 Multi Unit Sized Conversion Functions ### {#wording-specification-7.S✨.3}
7.S✨.3 Multi Unit Sized Conversion Functions
Synopsis
```cpp
#include

stdc_mcerr stdc_mcsnrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mcsnrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_mwcsnrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_mwcsnrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const wchar_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c8snrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c8snrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char8_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c16snrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c16snrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char16_t*restrict *restrict input, mbstate_t*restrict state);

stdc_mcerr stdc_c32snrtomcsn(size_t*restrict output_size,
	char*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtomwcsn(size_t*restrict output_size,
	wchar_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtoc16sn(size_t*restrict output_size,
	char16_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
stdc_mcerr stdc_c32snrtoc32sn(size_t*restrict output_size,
	char32_t*restrict *restrict output, size_t*restrict input_size,
	const char32_t*restrict *restrict input, mbstate_t*restrict state);
```
Description
Let *multi unit transcoding function* in this subclause be one of the functions listed above, transcribed in the form

```cpp
stdc_mcerr stdc_XsnrtoYsn(size_t*restrict output_size,
	charY*restrict *restrict output, size_t*restrict input_size,
	const charX*restrict *restrict input, mbstate_t*restrict state);
```

with the following properties:

:: — *X* and *Y* are each one of the prefixes/suffixes in the table from 7.S✨.1;
:: — `charX` and `charY` are the associated code unit types for *X* and *Y* in the table from 7.S✨.1; and
:: — *encoding X* and *encoding Y* are the associated encoding types for *X* and *Y* in the table from 7.S✨.1.

The multi unit transcoding functions take an input buffer and possibly an output buffer of the associated code unit types, potentially with their sizes. The functions consume any number of code units to perform a sequence of indivisible units of work, which results in zero or more output code units. The functions repeatedly perform an indivisible unit of work until either an error occurs or the input is exhausted.
An `mbstate_t` object assuredly describes the conversion state for the current conversion if it is not in an unspecified state (as described later in this subclause) and:

- the conversion is between Unicode encodings; or
- the input fragment is the start of an encoded sequence in the input encoding and the `mbstate_t` object was initialized to the initial conversion state; or
- the input fragment is a continuation of an encoded sequence and `mbstate_t` is the result of having advanced to the input and possibly output positions through the application of prior calls to the same transcoding function.

The behavior is undefined when a function described by this subclause is invoked and `*state` (or, if `state` is a null pointer, the `mbstate_t` object created for that case) does not describe the conversion state for the current conversion.
If `input_size` is a null pointer, `input` shall either be a null pointer or point to a null pointer. Otherwise, `input` shall be a pointer to a non-null pointer to an array of at least `*input_size` elements.
The multi unit transcoding functions convert from code units of type `charX` interpreted according to encoding X to code units of type `charY` according to encoding Y given a conversion state of value `*state`. The behavior of these functions is as-if the analogous single unit function `stdc_XnrtoYn` was repeatedly called, with the same `output_size`, `output`, `input_size`, `input`, and `state` arguments, to perform multiple indivisible units of work. The function stops when an error occurs or the input is exhausted (only signified when `*input_size` is zero).
The multi unit transcoding functions behave as-if:

1. If `state` is a null pointer, then an automatic storage duration object of type `mbstate_t` is created which is unique to the current invocation. It is initialized to the initial conversion state and a pointer to this object is used wherever `state` is used in this paragraph.
2. `stdc_XnrtoYn` is called with `output_size`, `output`, `input_size`, `input`, and `state`, with its result stored in a temporary named `err`.
3. If `input` is a null pointer or `*input` is a null pointer, return `err`.
4. If `err` is not `stdc_mcerr_ok`, then return `err`.
5. Otherwise, if `*input_size` is greater than zero, go back to (2).
6. Otherwise, if `mbsinit(state)` returns zero, go back to (2).
7. Otherwise, return `err`.
Returns
On success or failure, the transcoding functions shall return one of the above error codes (7.S✨.1). If `state` is not a null pointer and `*state` is not initialized to the initial conversion state for the function on its first use, or is used after being input into a function whose result is not one of `stdc_mcerr_ok`, `stdc_mcerr_incomplete_input`, or `stdc_mcerr_insufficient_output`, the behavior of the functions is unspecified.
The following is true after the invocation:

:: — `*input` will be incremented by the number of code units read and successfully converted if `input` and `*input` are not null pointers. If `stdc_mcerr_ok` is returned, then this will consume all the input. Otherwise, `*input` will point to the location just after the last successfully completed indivisible unit of work.
:: — `*input_size` is decremented by the number of code units read from `*input` that were successfully converted. If no error occurred, then `*input_size` will be 0.
:: — if `output` and `*output` are not null pointers, `*output` will be incremented by the number of code units written from successfully completed indivisible units of work.
:: — if `output_size` is not a null pointer, `*output_size` is decremented by the number of code units written to the output or that would have been written to the output.

If the return value is `stdc_mcerr_invalid` and `state` is not a null pointer, then `*state` is in an unspecified state.
NOTE  The object unique to the invocation is reused for every call in the second step of the multi unit sized conversion algorithm, and not recreated. If `state` is a null pointer, and the function uses e.g. a created automatic storage duration `mbstate_t` object that is discarded by the end of the invocation, then any potential conversion state contained in the created `mbstate_t` object, and used or accumulated during multi unit processing, could become unrecoverable to the program.
**EXAMPLE 1** The following is an example of using a single indivisible unit sized conversion function `stdc_mcnrtoc8n` to implement a multi unit sized conversion algorithm:

```cpp
#include

stdc_mcerr sample_mcsnrtoc8sn(size_t*restrict output_size,
	char8_t*restrict *restrict output, size_t*restrict input_size,
	const char*restrict *restrict input, mbstate_t* restrict state) {
	mbstate_t invocation_unique_internal_state;
	if (state == nullptr) {
		invocation_unique_internal_state = (mbstate_t){};
		state = &invocation_unique_internal_state;
	}
	if (input == nullptr || *input == nullptr) {
		return stdc_mcnrtoc8n(output_size, output, input_size, input, state);
	}
	for (;;) {
		stdc_mcerr err = stdc_mcnrtoc8n(output_size, output,
			input_size, input, state);
		if (err != stdc_mcerr_ok) {
			return err;
		}
		if (*input_size > 0) {
			continue;
		}
		// some execution encodings (6.2.9) may contain
		// additional output as input gets processed
		int state_finished = mbsinit(state);
		if (state_finished == 0) {
			continue;
		}
		return err;
	}
}
```
**EXAMPLE 2** The multi unit sized conversion functions can be used to perform other functionality, such as counting, validation, and more by using a null pointer value for specific arguments:

```cpp
#include

bool is_valid_utf16_from_utf8(size_t str_n,
	const char8_t str[restrict static str_n]) {
	// no output pointer and no output size: validation only
	stdc_mcerr err = stdc_c8snrtoc16sn(nullptr, nullptr, &str_n, &str, nullptr);
	return err == stdc_mcerr_ok;
}

size_t count_utf16_from_utf8(size_t str_n,
	const char8_t str[restrict static str_n]) {
	// output size but no output pointer: counting only
	const size_t utf16_before_n = SIZE_MAX;
	size_t utf16_after_n = utf16_before_n;
	stdc_mcerr err = stdc_c8snrtoc16sn(&utf16_after_n, nullptr,
		&str_n, &str, nullptr);
	return err == stdc_mcerr_ok ? utf16_before_n - utf16_after_n : 0;
}

bool unbounded_conversion_utf16_from_utf8(size_t str_n,
	const char8_t str[restrict static str_n], char16_t* restrict dest_str) {
	// output pointer but no output size: unbounded write
	stdc_mcerr err = stdc_c8snrtoc16sn(nullptr, &dest_str,
		&str_n, &str, nullptr);
	return err == stdc_mcerr_ok;
}

int main () {
	const char8_t str[] = u8"\"Saw a \U0001F9DC \u2014"
		u8"didn't catch her\u2026 \U0001F61E\"\n\t- Sniff";
	// include null terminator
	const size_t str_n = (sizeof(str) / sizeof(*str));

	if (!is_valid_utf16_from_utf8(str_n, str)) {
		// input not valid
		return 1;
	}

	size_t utf16_str_n = count_utf16_from_utf8(str_n, str);
	constexpr size_t utf16_str_max_size = STDC_C16_MAX
		* (sizeof(str) / sizeof(*str));
	char16_t utf16_str[utf16_str_max_size] = {};
	if (utf16_str_max_size < utf16_str_n) {
		// buffer too small
		return 2;
	}

	if (!unbounded_conversion_utf16_from_utf8(str_n, str, utf16_str)) {
		// write failed
		return 3;
	}

	// At this point, utf16_str is a veritable UTF-16 string.
	// As noted above, the null terminator from str was included:
	// utf16_str is a sequence of UTF-16 code units plus the null
	// terminator, in a suitable form at the end of the UTF-16 string.
	return 0;
}
```

The above program demonstrates validating, counting, and doing an unbounded (size unsafe) write using the provided functions.
Caution should be taken when a program uses unbounded writes, as the size of the buffer is assumed to be large enough during the call to the multi unit sized conversion function when `output_size` is a null pointer. An implementation can detect the above cases where specific arguments or their pointed to values are a null pointer value, and provide improved implementations relying on properties from these assumptions.
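For contrast with the unbounded write, the following sketch shows the bounded pattern: the caller passes a non-null output size, and on an insufficient-output result it drains the small buffer and calls again, relying on the input pointer and size having been left just after the last complete unit of work. It is illustrative only. `sketch_latin1snrtoc8sn` is a hypothetical stand-in (Latin-1 to UTF-8, no `mbstate_t` parameter) for a proposed multi unit function, and `sketch_sink` is an invented fixed-size destination.

```c
#include <stddef.h>
#include <string.h>

typedef enum sketch_mcerr {
	sketch_mcerr_ok = 0,
	sketch_mcerr_insufficient_output = -3
} sketch_mcerr;

/* Hypothetical stand-in for a multi unit function: Latin-1 in, UTF-8 out. */
sketch_mcerr sketch_latin1snrtoc8sn(size_t *output_size, unsigned char **output,
	size_t *input_size, const unsigned char **input) {
	while (*input_size > 0) {
		unsigned char c = **input;
		size_t needed = (c < 0x80) ? 1 : 2;
		if (*output_size < needed) {
			/* Stop before the unit of work that does not fit. */
			return sketch_mcerr_insufficient_output;
		}
		if (needed == 1) {
			(*output)[0] = c;
		} else {
			(*output)[0] = (unsigned char)(0xC0 | (c >> 6));
			(*output)[1] = (unsigned char)(0x80 | (c & 0x3F));
		}
		*output += needed;
		*output_size -= needed;
		*input += 1;
		*input_size -= 1;
	}
	return sketch_mcerr_ok;
}

/* Invented destination for this example. */
typedef struct sketch_sink {
	unsigned char data[64];
	size_t size;
} sketch_sink;

/* Converts all input through a 4-byte buffer, draining it whenever it fills. */
sketch_mcerr sketch_convert_chunked(const unsigned char *input,
	size_t input_size, sketch_sink *sink) {
	unsigned char chunk[4];
	for (;;) {
		unsigned char *out = chunk;
		size_t out_size = sizeof(chunk);
		sketch_mcerr err = sketch_latin1snrtoc8sn(&out_size, &out,
			&input_size, &input);
		size_t written = sizeof(chunk) - out_size;
		if (written > sizeof(sink->data) - sink->size) {
			return sketch_mcerr_insufficient_output; /* sink is full */
		}
		memcpy(sink->data + sink->size, chunk, written);
		sink->size += written;
		/* Done, failed, or no progress possible: report the result. */
		if (err != sketch_mcerr_insufficient_output || written == 0) {
			return err;
		}
		/* Otherwise input already points at the next unit of work. */
	}
}
```

The buffer never has to be sized for the worst case. It only has to be large enough to hold the output of a single indivisible unit of work.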
Recommended Practice
The multi unit transcoding functions are explicitly for the purpose of performing conversions on the largest contiguous section of valid data in the shortest amount of time possible. Implementations should take advantage of the information of null pointer values for the output size pointer, output data pointer, or both, to drastically improve performance characteristics for assumed unlimited write space, output counting scenarios, or input validation/counting, respectively.
Implementations should prefer returning an error for an incomplete input sequence over storing intermediate data within the state where possible for non-Unicode encodings. By leaving partial input unconsumed, it can be easier for functionality built on top of the functions in this subclause to report errors without skipping over potentially invalid input data.
### Add unspecified behavior to Annex J.1 Unspecified behavior ### {#wording-specification-j.1}
J.1 Unspecified Behavior
The following are unspecified:

:: …
:: — The conversion state after an encoding error (6.2.9) occurs (7.30.6.3.2, 7.30.6.3.3, 7.30.6.4.1, 7.30.6.4.2).
:: — The conversion state after an unrecorded encoding error (6.2.9) occurs (7.S✨).
:: — The use of an `mbstate_t` object that contains conversion state from an unrelated conversion (7.S✨).
:: …
### Add undefined behavior to Annex J.2 Undefined behavior ### {#wording-specification-j.2}
J.2 Undefined Behavior
The following are undefined: :: … :: — Using a buffer that is too small but providing a null pointer to the `output_size` argument of a transcoding function (7.S✨). :: …
# Acknowledgements # {#acknowledgements}

Thank you to Philipp K. Krause for responding to the e-mails of a newcomer to matters of C and providing me with helpful guidance.

Thank you to Rajan Bhakta, Daniel Plakosh, and David Keaton for guidance on how to submit these papers and get started in WG14.

Thank you to Tom Honermann for lighting the passionate fire for proper text handling in me for not just C++, but for our sibling language C.

# Appendix # {#appendix}

## (From revisions 0-3) What about UTF{X} ↔ UTF{Y} functions? ## {#appendix-proposed-utf}

Functions interconverting between different Unicode Transformation Formats are not proposed here because -- while useful -- both sides of the encoding are statically known by the developer. The C Standard only wants to consider functionality strictly in the case where the implementation has more information / private information that the developer cannot access in a well-defined and standard manner. A developer can write their own Unicode Transformation Format conversion routines and get them completely right, whereas a developer cannot write the Wide Character and Multibyte Character functions without incredible heroics and/or error-prone assumptions.

This brings up an interesting point, however: if `__STDC_UTF16__` and `__STDC_UTF32__` both exist, does that not mean the implementation controls what `c16` and `c32` mean? This is true, **however**: within an (admittedly limited) survey of implementations, there has been no suggestion or report of an implementation which does not use UTF16 and UTF32 for their `char16_t` and `char32_t` literals, respectively. Thankfully, that does not seem to be the case at this time. It will also no longer be the case in C23, as the paper [[n2728|char16_t and char32_t literals should be UTF-16 and UTF-32]] has been accepted.
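To make concrete the claim that a developer can write a Unicode Transformation Format conversion entirely by hand, here is a small sketch of decoding one code point from UTF-16. Everything it needs is fixed by the Unicode Standard, with no implementation-private knowledge involved. The function name and the convention of returning 0 on failure are inventions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from `n` UTF-16 code units.
   Returns the number of code units consumed (1 or 2), or 0 if the input is
   empty, truncated in the middle of a pair, or an unpaired surrogate. */
size_t sketch_utf16_decode_one(const uint16_t *s, size_t n, uint32_t *cp) {
	if (n == 0) {
		return 0;
	}
	uint16_t hi = s[0];
	if (hi < 0xD800 || hi > 0xDFFF) {
		*cp = hi; /* not a surrogate: the code unit is the code point */
		return 1;
	}
	if (hi >= 0xDC00) {
		return 0; /* low surrogate with no preceding high surrogate */
	}
	if (n < 2) {
		return 0; /* high surrogate at the end of the input */
	}
	uint16_t lo = s[1];
	if (lo < 0xDC00 || lo > 0xDFFF) {
		return 0; /* high surrogate not followed by a low surrogate */
	}
	*cp = 0x10000
		+ (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
	return 2;
}
```

No comparable routine can be written portably for the execution or wide execution encodings, since which encodings those are is known only to the implementation.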
{
	"n3054": {
		"authors": [
			"ISO/IEC JTC1 SC22 WG14 - Programming Languages, C",
			"JeanHeyd Meneide",
			"Freek Wiedijk"
		],
		"title": "n3054: ISO/IEC 9899:202x - Programming Languages, C",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3054.pdf",
		"date": "September 3rd, 2022"
	},
	"glibc-25744": {
		"authors": [
			"Tom Honermann",
			"Carlos O'Donnell"
		],
		"title": "`mbrtowc` with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters",
		"href": "https://sourceware.org/bugzilla/show_bug.cgi?id=25744",
		"date": ""
	},
	"N2282": {
		"authors": [
			"Philipp K. Krause"
		],
		"title": "Additional multibyte/wide string conversion functions",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm",
		"date": "June 2018"
	},
	"iconv": {
		"authors": [
			"Bruno Haible",
			"Daiki Ueno"
		],
		"title": "libiconv",
		"href": "https://savannah.gnu.org/git/?group=libiconv",
		"date": "August 2020"
	},
	"N2244": {
		"authors": [
			"WG14"
		],
		"title": "Clarification Request Summary for C11, Version 1.13",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2244.htm",
		"date": "October 2017"
	},
	"cuneicode": {
		"authors": [
			"JeanHeyd Meneide",
			"Shepherd's Oasis, LLC"
		],
		"title": "cuneicode - A spicy text library for C",
		"href": "https://ztdcuneicode.rtfd.io",
		"date": "November 20th, 2021"
	},
	"N1570": {
		"authors": [
			"ISO/IEC JTC1 SC22 WG14 - Programming Languages, C"
		],
		"title": "C11 Committee Draft",
		"href": "https://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf",
		"date": "April 12, 2011"
	},
	"Unicode_greater_detail": {
		"authors": [
			"JeanHeyd Meneide"
		],
		"title": "Catching ⬆️: Unicode for C++ in Greater Detail",
		"href": "https://www.youtube.com/watch?v=FQHofyOgQtM",
		"date": "November 2019"
	},
	"Unicode_deep_c_diving": {
		"authors": [
			"JeanHeyd Meneide"
		],
		"title": "Deep C Diving - Fast and Scalable Text Interfaces at the Bottom",
		"href": "https://youtu.be/X-FLGsa8LVc",
		"date": "July 2020"
	},
	"n2728": {
		"authors": [
			"JeanHeyd Meneide"
		],
		"title": "char16_t and char32_t shall be UTF-16 and UTF-32",
		"href": "https://thephd.dev/_vendor/future_cxx/papers/C%20-%20char16_t%20&%20char32_t%20string%20literals%20shall%20be%20UTF-16%20&%20UTF-32.html",
		"date": "May 15th, 2021"
	},
	"clang-iso10646": {
		"authors": [
			"Corentin Jabot"
		],
		"title": "Define __STDC_ISO_10646__",
		"href": "https://reviews.llvm.org/D106577",
		"date": "June 22nd, 2021"
	},
	"lemire-spire2021": {
		"authors": [
			"Daniel Lemire"
		],
		"title": "Unicode at Gigabytes per Second",
		"href": "https://doi.org/10.48550/arXiv.2111.08692",
		"date": "November 14th, 2021"
	},
	"n2892": {
		"authors": [
			"Jens Gustedt"
		],
		"title": "N2892: Basic lambdas for C",
		"href": "https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2892.pdf"
	}
}
May the Tower of Babel's curse be defeated.