"In these meetings, these conferences, we only see a little. C++ is not done in the light. The majority of C++ is not done publicly. Most C++ is done privately, in the dark, and that is where it matters most."
-- Daniela K. Engert, November 14th, 2019
1. Revision History
1.1. Revision 2 - September 1st, 2022
-
Fix unicode_scalar_value missing its definition.
-
Adjust the names of the functions, as has been bikeshed between users and others, to match the names in the implementation (for validation and similar functions).
-
Encoding types provided by the standard should end with _t when they are not templated, and common encodings should be object types (e.g., std::text::utf8 is an object of type std::text::utf8_t - thanks Hana Dusíková).
-
The documentation and rationale for much of the implementation that is up-to-date can be found online.
1.2. Revision 1 - March 2nd, 2020
-
Thoroughly improve § 2 Motivation.
-
Explicitly state goals and non-goals in the § 2.3 Statement of Objectives.
-
Rewrite most of paper to more thoroughly explain the API, especially the § 3.3 High Level section with validate, count_code_points, count_code_units, and more APIs.
-
Drastically improve the explanation for the free functions in § 3.3.1 Eager Free Functions.
-
Emphasize the need for ranges in § 3.3.3 Improving Usability for Low-Memory Environments: Ranges.
-
Add new descriptions in the low-level API regarding error handling in § 3.2.2.2 Error Handling: Allow All The Options.
-
Describe customization points in full in § 3.4.1 Speed and Flexibility for Everyone: Customization Points.
-
The Implementation is now hidden, after doing a magic trick. Contact the author for access.
-
Add § 5 FAQ.
-
Going no-where, targeted at no-one.
1.3. Revision 0 - June 17th, 2019
-
Initial release of exploratory paper.
2. Motivation
It's 2020 and Unicode is still barely supported in both the C and C++ standards.
From the POSIX standard requiring a single-byte encoding by default, to heavy limitations placed in codecvt facets in C and C++, to the utter lack of UTF8/16/32 multi-unit conversion functions in the standard, the programming languages that have shaped the face of development in operating systems, embedded devices and mobile applications have pushed forward a world that is incredibly unfriendly to text beyond ASCII English. Developers frequently roll their own solutions, and almost every major codebase -- from Chrome to Firefox, Qt to Copperspice, and more -- has its own variation of hand-crafted text processing. With no standard implementation in C++ and libraries split between various third party implementations plus ICU, it is increasingly difficult and error-prone to handle text. This means the basic method of communication between people on the planet and for historical record is difficult when using C++.
This paper aims to explore the design space for both extremely high performing transcoding (encoding and decoding) and a flexible one-by-one interface for more careful and meticulous text processing. This proposal arises from industry experience in large codebases and best-practice open source explorations with [libogonek], [icu], [boost.text] and [text_view] while also building on the concepts and design choices found in both [range-v3] and pre-existing text encoding solutions such as Windows's interfaces, *nix utility [iconv], a Fortune 500 company codebase, and more.
The ultimate goal is to allow an interface that is correct by default but capable of being fast, through both Standard Library implementer efforts and program-overridable customization points. It will produce interfaces for encoding, decoding, and transcoding in eager and lazy forms.
2.1. The Basic Ideas
While some of these types aren't contained in this paper, the end goal is to enable the following to be possible:
    #include <text_encoding> // this proposal
    #include <text>          // future proposal

    #include <cstdint>
    #include <iostream>
    #include <string_view>

    int main(int, char*[]) {
        using namespace std::literals;
        // future proposal: container type
        std::text::u8text my_text
            // this proposal: transcoding
            = std::text::transcode("안녕하세요 👋"sv, std::text::utf8);
        std::cout << my_text << std::endl; // prints 안녕하세요 👋 to a capable console
        std::cout << std::hex;
        for (const auto& cp : my_text) {
            std::cout << static_cast<std::uint32_t>(cp) << " ";
        }
        // 0000c548 0000b155 0000d558 0000c138 0000c694 00000020 0001f44b
        return 0;
    }
This paper is in support of reaching this goal. The following examples are more concretely tied to this proposal in particular.
2.1.1. Reading "Execution Encoding" Data
The following is an example of opening a file handle on Windows after converting from the execution encoding of the system to the wide arguments for CreateFileW.
    #define WIN32_LEAN_AND_MEAN 1
    #include <windows.h>

    #include <text_encoding> // this proposal

    #include <iostream>
    #include <memory>
    #include <string>
    #include <string_view>

    int main(int argc, char* argv[]) {
        if (argc < 2) {
            std::cerr << "Path unspecified: exiting." << std::endl;
            return -1;
        }
        std::wstring path_as_wstr = std::text::transcode(
            std::string_view(argv[1]), std::text::wide_execution{});
        // Interop with Windows.
        // FileHandleDeleter is a user-provided deleter (not shown) that is
        // assumed to use HANDLE as its pointer type and to close the handle.
        std::unique_ptr<HANDLE, FileHandleDeleter> target_file(
            CreateFileW(path_as_wstr.data(), GENERIC_WRITE, 0, NULL,
                CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL));
        if (!target_file) {
            // GetLastError(), etc...
            return -2;
        }
        /* Use File... */
        return 0;
    }
This paper directly enables such a use case.
2.1.2. Networking with Boost.Beast
The following is an example using this proposal to do a byte-based read off the network of a UTF-16 Big Endian payload on any machine.
    #include <boost/beast.hpp>
    #include <boost/beast/http.hpp>
    #include <boost/asio/ip/tcp.hpp>

    #include <text_encoding> // this proposal

    #include <bit>
    #include <cstddef>
    #include <iostream>
    #include <iterator>
    #include <memory>
    #include <span>
    #include <string>
    #include <vector>

    namespace beast = boost::beast;
    namespace http = beast::http;
    using tcp = boost::asio::ip::tcp;
    using results_type = tcp::resolver::results_type;

    class session : public std::enable_shared_from_this<session> {
        /* ... */
        http::request<http::empty_body> req_;
        std::vector<std::byte> res_body_;
        http::response<http::vector_body<std::byte>> res_;
        std::u8string converted_body_;
        /* ... */

        void on_connect(beast::error_code ec, results_type::endpoint_type);
        void on_resolve(beast::error_code ec, results_type results);
        /* ... */

        void on_read(beast::error_code ec, std::size_t bytes_transferred) {
            if (ec) {
                log_fail(ec, u8"read failed");
                return;
            }
            std::span<std::byte> bytes(res_body_.data(), bytes_transferred);
            std::ranges::unbounded_view output(std::back_inserter(converted_body_));
            // utf16, but big endian
            std::text::encoding_scheme<std::text::utf16_t, std::endian::big> from_encoding{};
            // alternatively: std::text::utf16_be

            // transcode from bytes that are UTF16, Big Endian,
            // into unbounded output
            std::text::transcode(bytes, output, from_encoding, std::text::utf8);
            std::clog << converted_body_ << std::endl;
            /* Commit / clean up, etc. */
        }
    };
This paper directly enables such a use case.
2.2. Current Problems
I don't write any software which runs only in English. I'm tired of writing the same code different ways all the time just to display a handful of strings. Lately, I just skip C++ for anything that displays UI -- it's so much easier in every other modern language.
This is REQUIRED for using C++ with any software which needs to run in multiple languages, without rolling your own code. I'm tired of writing this from scratch for every separate project (cannot share code for most of them), using different underlying libraries for each (as licensing and processing requirements vary, I can't just pick one library and use it everywhere). Unfortunately, I have no confidence the ISO committee understands the problem well enough, given how it patted itself on the back so much for adding u8"", u"", and U"" a while back. Real-world software which runs in multiple languages never hard-codes strings...
Norway has its own character set which is a variant of ISO-8859-10 with modifications to a couple of characters. This proposal would ease the transition for existing software when C++ gets (better/more coherent) support for Unicode.
The standard: "Oh yeah hey dudes codecvt is deprecated but we didn't feel like writing an alternative so good luck yolo".
-- Herb Sutter's "Top 5 C++ Proposals" Survey, Survey Respondent
Text in the Standard is a desert wasteland.
After pulling codecvt from the language (for a very good reason, yes), users were left with no proper utilities to convert Unicode to Unicode, or convert execution / wide execution text to Unicode and back. People reach out for ICU, but the API -- while extremely fast -- is opaque and not the friendliest to use. [iconv] is not easy to build everywhere and does not have extensibility. Applications have long shipped all manner of ad-hoc solutions (or not) to the text problem without working together or sharing their libraries with the whole ecosystem. As text -- and particularly, the encoding of text -- stands as one of the greatest barriers to Systems Programming languages being more diverse and friendly, there is a strong obligation to provide a standard solution that is capable of lasting the next 40 years unmodified.
The use cases for text encoding are vast. From:
-
basic processing of user-entered data;
-
sanitization of scripts;
-
domain name protection in browsers;
-
text conversions when working with legacy systems or differing new/Unicode systems;
-
supplying the components that can be successfully used with industry-standard FreeType/Harfbuzz and DirectWrite;
-
talking properly to legacy GDI applications;
-
communicating string data in JSON;
-
receiving market data from the Chinese Exchange in GB18030;
To:
-
converting and preserving government data in digital records;
-
handling data generated by logs in a multitude of languages;
-
handling user names without mangling;
and hundreds of other use cases, the motivating text for text practically writes itself.
2.3. Statement of Objectives
Part of this proposal is identifying exactly how those needs should be served. The primary objectives of this proposal, therefore, are as follows:
-
Users should be able to define their own encodings for their own purposes. Jonathan Wakely's time may not be well spent on EBCDIC, but IBM will certainly be very invested in making sure EBCDIC and its code pages are well-implemented and optimized. Put another way: company-specific and user-specific problems should be specific to them and not exported to the whole ecosystem, and they should be able to handle their problems effectively and efficiently without throwing the C++ Standard in the trash.
-
Locale-based char and wchar_t encodings belong to the C and C++ implementation. If users need/are tempted to guess about the locale's encoding and pick (probably extremely wrongly) something, then this API has failed them.
-
The standard library should be able to cannibalize all existing legacy encodings and -- by way of leading design -- encourage and promote the use of Unicode in the user's code. Embrace. Extend. Extinguish.
-
The standard library (and its implementers) do not have time to implement every new, old, and existing encoding. Put bluntly: CJ Johnson's brilliance and Stephan T. Lavavej's passion are better spent improving their respective libraries and fixing bugs, not implementing EBCDIC or ISO/IEC 2022 CN, extended variant 2.
-
Unicode is the one and only language the standard speaks in its higher level text algorithms and functionality: legacy encodings must convert to Unicode to work with functionality built beyond this proposal. Future proposals will never need to concern themselves with encodings after this proposal is done.
-
Users may choose not to convert to Unicode, but they will need to spend the time and effort working out that trade off with their environment. The standard library will never have to care about text that willingly and deliberately exits the Unicode system.
-
Safety is not optional. Code that performs unsafe operations should require explicit opt-in and easily searchable patterns and names that make it clear the user has made a deliberate choice to open themselves up to vulnerabilities such as Undefined Behavior.
-
Performance is not optional, and correctness isn't a tender suggestion achievable with insane workarounds.
-
Simple function calls should be simple, but if the user wants to pry open the details they should be able to do so incrementally with ease.
-
Nobody has time to reimplement all of [iconv], especially the library developers. The interface should allow implementers to substitute a backend for certain encodings that takes advantage of pre-existing Operating System functions, widely-available libraries, or similar functionality.
-
Users should be able to do everything implementers can without undue clash between user functionality and implementer internal handling and extensions.
-
Octets -- delivered over the network, from IPC, or similar -- are an important input case that must be handled.
-
The design must be viable for low-memory environments, and prioritize zero allocation if a user cares enough to invest the time into the API with that goal.
-
At no point should we be introducing new container types for this functionality. Container wrappers / adaptors and range wrapper / adaptors are enough.
3. Design
The current design has been the culmination of a few years of collaborative and independent research, starting with the earliest papers from Mark Boyall's [n3574], Tom Honermann's [p0244r2], study of ICU's interface, and finally the musings, experience and work of R. Martinho Fernandes in [libogonek]. Current and future optimizations are considered to ensure that fast paths are not blocked in the interface proposed for standardization. With [boost.text] showing an interface with a nailed down internally used UTF-8 encoding, Markus Scherer's participation in SG16 meetings, Henri Sivonen's feedback on blog posts and mailing lists, and Bob Steagall's work in writing a fast UTF8 decoder, this paper absorbs a wealth of knowledge to reach a flexible interface that enables high throughput.
In reading, implementing, working with and consuming all of these designs, the author of this paper, independent implementers, and several SG16 members have come to the following core tenets:
-
strong types for code units allow selecting proper default encodings for these interfaces;
-
iterators and ranges are a huge interface win for working with text but cannot provide the fastest possible way to encode/decode/transcode text;
-
and, avoid creating new vocabulary: improve working with original containers and impose well-formedness constraints upon them rather than designing new containers from the ground up.
Given these tenets, the following interface choices have arisen for this paper. Each section will describe a piece of the interface, its goals, and how it works. A low-level encoding interface and its plumbing and core types will be described first, followed by a high-level interface that makes the low level easy to use. Both are needed to cover the full design space and the use cases that exist today.
3.1. Definitions
Here are some handy definitions which will be used liberally to shorten the prose and specification in this paper.
-
Unicode Code Point: the 21-bit value (often represented as a 32-bit number for implementation-related reasons) that represents a code point from the Unicode Standard. Specifically, it is the range of integers 0 to 0x10FFFF inclusive.
-
Unicode Scalar Value: the 21-bit value that represents a code point from the Unicode Standard, but without Surrogate Unicode Code Point values. Specifically, it is the ranges of integers 0 to 0xD7FF and 0xE000 to 0x10FFFF inclusive.
-
unicode_code_point: a type in C++ that represents a Unicode Code Point. Alias of char32_t.
-
unicode_scalar_value: a type for C++ that represents a Unicode Code Point, barring all surrogate characters (the values used to encode double-code unit UTF-16 characters). Strong typedef that supports all the same operations as char32_t. Its constructor may assert / trap on values outside of the allowed 21 Unicode bits and may assert / trap on values that are surrogate characters.
-
template <typename T> constexpr bool is_code_point_v: a boolean value that indicates whether or not a given type is a code point value. Program-specializations are allowed for all types. The primary template returns true if T is char32_t, unicode_code_point, or unicode_scalar_value.
-
template <typename T> constexpr bool is_scalar_value_v: a boolean value that indicates whether or not a given type is a unicode scalar value. Program-specializations are allowed for all types. The primary template returns true if T is unicode_scalar_value.
-
template <typename T> using encode_state_t = typename std::remove_cvref_t<T>::state;
-
template <typename T> using decode_state_t = typename std::remove_cvref_t<T>::state;
-
template <typename T> using code_unit_t = typename std::remove_cvref_t<T>::code_unit;
-
template <typename T> using code_point_t = typename std::remove_cvref_t<T>::code_point; this is the code_point type definition for a given type T, ignoring cv-qualifiers.
-
template <typename State, typename Encoding> constexpr bool is_state_independent_v: a boolean that tells whether or not a given State type requires the Encoding to be constructed. Equivalent to std::is_constructible_v<State, const Encoding&>;
-
template <typename Encoding> constexpr bool is_encode_state_independent_v: Equivalent to is_state_independent_v<encode_state_t<std::remove_cvref_t<Encoding>>, Encoding>;
-
template <typename Encoding> constexpr bool is_decode_state_independent_v: Equivalent to is_state_independent_v<decode_state_t<std::remove_cvref_t<Encoding>>, Encoding>;
-
using UEncoding = std::remove_cvref_t<Encoding> given the existence of a template parameter Encoding.
-
using UToEncoding = std::remove_cvref_t<ToEncoding> given the existence of a template parameter ToEncoding.
-
using UFromEncoding = std::remove_cvref_t<FromEncoding> given the existence of a template parameter FromEncoding.
-
range_of<T>: is a concept defining that there is a range whose iterator produces a value_type of T. For example, std::vector<int> and int[1] model a concept-constrained parameter or return type of const range_of<int> auto&.

    template <typename R, typename T>
    concept range_of = std::ranges::range<std::remove_cvref_t<R>>
        && std::is_same_v<std::ranges::range_value_t<std::remove_cvref_t<R>>, T>;

-
contiguous_range_of<T>: is a concept defining that there is a contiguous range whose iterator produces a value_type of T. For example, std::span<double> and double[1] model a concept-constrained parameter or return type of const contiguous_range_of<double> auto&.

    template <typename R, typename T>
    concept contiguous_range_of = std::ranges::contiguous_range<std::remove_cvref_t<R>>
        && std::is_same_v<std::ranges::range_value_t<std::remove_cvref_t<R>>, T>;
3.2. Low-Level
The high-level interfaces must be built on something: they cannot be magically willed into existence. There is quite a bit of plumbing that goes into the low-level interfaces, most of which will be boilerplate to users but will be of keen use and importance to library developers and standard library implementers.
3.2.1. Error Codes
There is some boilerplate that needs to be taken care of before building our encoding, decoding, transcoding and similar functionality. First and foremost are the error codes and result types that will go in and out of our encoding functions. The error code enumeration is encoding_error. It lists all the reasons an encoding or decoding operation can fail:
    namespace std::text {

        enum class encoding_error {
            // just fine
            ok = 0x00,
            // input contains ill-formed sequences
            invalid_sequence = 0x01,
            // input contains incomplete sequences
            incomplete_sequence = 0x02,
            // output cannot receive all the completed
            // code units
            insufficient_output_space = 0x03
        };

    }
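For logging or diagnostics, a caller will often want a readable name for each error code. The sketch below repeats the enumeration outside of namespace std so that it compiles on its own; the to_name function is a hypothetical helper and not part of the proposal.

```cpp
#include <cassert>
#include <string_view>

// Local copy of the proposed enumeration, so the example is self-contained.
enum class encoding_error {
    ok = 0x00,
    invalid_sequence = 0x01,
    incomplete_sequence = 0x02,
    insufficient_output_space = 0x03
};

// Hypothetical helper: maps each error code to a printable name.
constexpr std::string_view to_name(encoding_error e) noexcept {
    switch (e) {
    case encoding_error::ok:
        return "ok";
    case encoding_error::invalid_sequence:
        return "invalid_sequence";
    case encoding_error::incomplete_sequence:
        return "incomplete_sequence";
    case encoding_error::insufficient_output_space:
        return "insufficient_output_space";
    }
    return "unknown"; // values outside the enumerators
}

static_assert(to_name(encoding_error::ok) == "ok");
```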
3.2.2. Result Types
The result types are the glue that helps users of the low-level interface loop through their text properly. They return updated ranges of both the input and output to indicate how far things have been moved along, on top of an encoding_error and whether or not the result came from an error being handled (see below about error handling):
    namespace std::text {

    template <typename Input, typename Output>
    class encode_result {
    public:
        Output output;
        Input input;
        encoding_error error_code;
        size_t handled_errors;

        template <typename InRange, typename OutRange>
        constexpr encode_result(InRange&& input, OutRange&& output,
            encoding_error error_code = encoding_error::ok)
            noexcept(/* member-construction-conditional */);

        template <typename InRange, typename OutRange>
        constexpr encode_result(InRange&& input, OutRange&& output,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input, typename Output, typename State>
    class stateful_encode_result : public encode_result<Input, Output> {
    public:
        State& state;

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr stateful_encode_result(InRange&& input, OutRange&& output,
            EncodingState&& state,
            encoding_error error_code = encoding_error::ok);

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr stateful_encode_result(InRange&& input, OutRange&& output,
            EncodingState&& state,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input, typename Output>
    class decode_result {
    public:
        Output output;
        Input input;
        encoding_error error_code;
        size_t handled_errors;

        template <typename InRange, typename OutRange>
        constexpr decode_result(InRange&& input, OutRange&& output,
            encoding_error error_code = encoding_error::ok)
            noexcept(/* member-construction-conditional */);

        template <typename InRange, typename OutRange>
        constexpr decode_result(InRange&& input, OutRange&& output,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input, typename Output, typename State>
    class stateful_decode_result : public decode_result<Input, Output> {
    public:
        State& state;

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr stateful_decode_result(InRange&& input, OutRange&& output,
            EncodingState&& state,
            encoding_error error_code = encoding_error::ok);

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr stateful_decode_result(InRange&& input, OutRange&& output,
            EncodingState&& state,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input, typename Output>
    class transcode_result {
    public:
        Output output;
        Input input;
        encoding_error error_code;
        size_t handled_errors;

        template <typename InRange, typename OutRange>
        constexpr transcode_result(InRange&& input, OutRange&& output,
            encoding_error error_code = encoding_error::ok)
            noexcept(/* member-construction-conditional */);

        template <typename InRange, typename OutRange>
        constexpr transcode_result(InRange&& input, OutRange&& output,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input, typename Output, typename FromState, typename ToState>
    class stateful_transcode_result : public transcode_result<Input, Output> {
    public:
        FromState& from_state;
        ToState& to_state;

        template <typename InRange, typename OutRange,
            typename FromEncodingState, typename ToEncodingState>
        constexpr stateful_transcode_result(InRange&& input, OutRange&& output,
            FromEncodingState&& from_state, ToEncodingState&& to_state,
            encoding_error error_code = encoding_error::ok)
            noexcept(/* member-construction-conditional */);

        template <typename InRange, typename OutRange,
            typename FromEncodingState, typename ToEncodingState>
        constexpr stateful_transcode_result(InRange&& input, OutRange&& output,
            FromEncodingState&& from_state, ToEncodingState&& to_state,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    // takes only Input: stateful_validate_result derives from validate_result<Input>
    template <typename Input>
    class validate_result {
    public:
        Input input;
        bool valid;

        template <typename ArgInput>
        constexpr validate_result(ArgInput&& input, bool is_valid)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input, typename State>
    class stateful_validate_result : public validate_result<Input> {
    public:
        State& state;

        template <typename ArgInput, typename ArgState>
        constexpr stateful_validate_result(ArgInput&& input, bool is_valid,
            ArgState&& state)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input>
    class count_result {
    public:
        Input input;
        encoding_error error_code;
        size_t count;
        size_t handled_errors;

        template <typename ArgInput>
        constexpr count_result(ArgInput&& input, size_t count,
            encoding_error error_code = encoding_error::ok)
            noexcept(/* member-construction-conditional */);

        template <typename ArgInput>
        constexpr count_result(ArgInput&& input, size_t count,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    template <typename Input, typename State>
    class stateful_count_result : public count_result<Input> {
    public:
        State& state;

        template <typename ArgInput, typename ArgState>
        constexpr stateful_count_result(ArgInput&& input, size_t count,
            ArgState&& state,
            encoding_error error_code = encoding_error::ok)
            noexcept(/* member-construction-conditional */);

        template <typename ArgInput, typename ArgState>
        constexpr stateful_count_result(ArgInput&& input, size_t count,
            ArgState&& state,
            encoding_error error_code, size_t handled_errors)
            noexcept(/* member-construction-conditional */);
    };

    }
There is a lot to unpack here. There are two essentially identical structures: encode_result and decode_result. These contain the input range, the output range, the error code, and a count of the errors handled (that is, whether or not the error handler was invoked). The handled_errors member is important because some error handlers may change the error_code member to encoding_error::ok, indicating that things are fine (e.g., a replacement character was successfully inserted into the output stream to replace some bad input).
Each of these also has a stateful_ version, which contains all of the above plus a reference to the encoding's current state.
Note: Having 2 differently-named types with much the same interface is paramount to allow an error handler callable to know how to interpret some errors and whether to try to insert code units or code points into the output stream (encoding means code units into the output, decoding means code points into the output). If the structures were merged, this information would be lost at compile-time and the handler would have to attempt to recover it by examining the element types of the output or input range. Unfortunately, even that is not foolproof because neither the input range nor the output range needs to dereference to exactly the code unit or code point types, just to things convertible to / from them.
transcode_result is a joint type for operations which go from code units → code points and then code points → code units, assuming the code point types are compatible between the two encodings deployed for the transformation.
count_result is for counting operations that check the number of code units / code points that would result from an encoding or decoding operation.
validate_result is for testing whether or not a given operation can be done successfully with no errors. They also have their stateful variants, which transport the state as a reference type.
3.2.2.1. Input and Output Ranges
These are essentially the ranges moved forward as much or as little as the encoding needed for reading from the input, converting, and writing to the output. This also solves the problem of obtaining maximal speed when checking whether the destination is filled or the input is exhausted: std::ranges::unbounded_view works well since its comparison sentinel always returns the literal "false" bool on comparison, meaning that any compiler with optimizations enabled will cull the corresponding comparison branches out of the code.
The decoding result and encoding result types both return the input and output range passed to encoding and decoding functions in the structure itself. This represents the changed ranges. In the event that a range cannot be successfully reconstructed from its iterator and sentinel, a std::ranges::subrange will be returned instead.
3.2.2.2. Error Handling: Allow All The Options
This is a low-level interface. As such, accommodating different error handling strategies is necessary. There are several ways to report errors used in both the C and C++ standard libraries, from throwing errors, to error_code out parameters, to integral return values and even complex return structures. Choosing a scheme here is difficult given the large breadth and depth of error handling history in C++. While the standard library shows a clear bias towards throwing exceptions, it would not be prudent to throw all the time. Requiring exceptions may exclude hard and soft real-time programming environments wherein these encoding structures will be needed. Exceptions also have an intrinsic problem in this domain, as described a little bit below in this section.
To accommodate the wide breadth of C++ programming environments and ecosystems, error reporting will be done through an error handler, which can be any type of callable that matches the desired interface. The standard will provide the following error handlers:
    namespace std::text {

        class replacement_handler;
        class throw_handler;
        class assume_valid_handler;
        class default_handler;

        template <typename ErrorHandler>
        class incomplete_handler;

    }
The interface for an error handler looks like the below error handler that does no modification and just passes the result through:
    namespace std::text {

        class pass_through_handler {
        public:
            template <typename Encoding, typename InputRange, typename OutputRange,
                typename State,
                contiguous_range_of<code_point_t<Encoding>> Progress>
            constexpr auto operator()(const Encoding& encoding,
                encode_result<InputRange, OutputRange, State> result,
                const Progress& progress) const {
                /* morph result, log, throw error, etc. ... */
                return result;
            }

            template <typename Encoding, typename InputRange, typename OutputRange,
                typename State,
                contiguous_range_of<code_unit_t<Encoding>> Progress>
            constexpr auto operator()(const Encoding& encoding,
                decode_result<InputRange, OutputRange, State> result,
                const Progress& progress) const {
                /* morph result, log, throw error, etc. ... */
                return result;
            }
        };

    }
The specification here is a value-based one.
is a reference to the encoding which threw the error.
is passed to the error handler and it represents an
or
functionâs current progress. The
types provide the current input range, the current output range, a reference to the current state, and the type of error encountered according to the
. Finally, the
object is a
passed from the codec with the code points or code units already read from the input range. (This is important for e.g. reading from one-way iterators like
, where it is impossible to go back and recover information consumed by the algorithm.) The error handler is then responsible for performing any modifications it wants to the result type, before returning the modified result to be propagated back by the encoding interface.
There are a few things that can be done in the commented code shown above. First and foremost is that someone could look at result.error_code and simply throw a hand-tailored exception. This would bubble out of the function and let the caller decide what to do.
Note: Throwing is explicitly not recommended by default by prominent vendors and implementers (the Unicode Consortium, etc.). Ill-formed text is common. Text from misbehaving programs -- 40 years of them -- is a frequent kind of user and machine input. It is extremely easy to provoke a Denial of Service Attack (DoS Attack) if a library throws an error on malformed input that the application author did not consider.
The default error handler will be the default_handler, as hinted by the name. The default_handler is a "strong typedef" over the replacement_handler, done for the purposes of safety in the higher-level API.
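The replacement_handler will look inside the encoding to see if the expression encoding.replacement_code_points(), encoding.replacement_code_units(), encoding.maybe_replacement_code_points(), or encoding.maybe_replacement_code_units() is well-formed. If so, it will take the range returned from that function and will attempt to insert it into the output range. Specifically: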
- On a failure in decode_one:
  - If the output is at its end, return the result as-is.
  - Let replacement_points be one of:
    - encoding.replacement_code_points(), if this expression is well-formed.
    - Otherwise, let maybe_replacement_points be encoding.maybe_replacement_code_points() if the expression is well-formed. maybe_replacement_points is checked to see if it produces a true value when contextually converted to bool. If it does, then replacement_points is equal to the expression *maybe_replacement_points. Otherwise, it is an empty range whose value type is code_point_t<Encoding>.
    - Otherwise, if is_code_point_v<code_point_t<Encoding>> is true, replacement_points is an array of one element of code_point_t<Encoding> type, whose contents are equivalent to { static_cast<code_point_t<Encoding>>(U'\uFFFD') }.
    - Otherwise, it is not defined.
  - If replacement_points is defined, then replacement_points is iterated over and code points are inserted into the output range in linear ascending order, if there is space. If there is not enough space, return the result as-is. Note that this may write partial data to the range if replacement_points contains more than one code point and the output is an output range.
  - Otherwise, let replacement_units be one of:
    - encoding.replacement_code_units(), if the expression is well-formed.
    - Otherwise, let maybe_replacement_units be encoding.maybe_replacement_code_units() if the expression is well-formed. If maybe_replacement_units contextually converted to bool evaluates to true, then replacement_units is equal to the expression *maybe_replacement_units. Otherwise, it is an empty range whose value type is code_unit_t<Encoding>.
    - Otherwise, it is not defined.
  - If replacement_units is defined, then replacement code points are inserted into the output as-if as follows: auto intermediate_result = encoding.decode_one(replacement_units, result.output, pass-through-handler, result.state);. If intermediate_result.error_code is not equal to std::text::encoding_error::ok, then return the original result. Note that this may write partial data to the range if the decode operation needs to write more than one code point to the output.
  - Otherwise, the program is ill-formed.
- On a failure in encode_one:
  - If the output is at its end, return the result as-is.
  - Let replacement_units be one of:
    - encoding.replacement_code_units(), if this expression is well-formed.
    - Otherwise, let maybe_replacement_units be encoding.maybe_replacement_code_units() if the expression is well-formed. maybe_replacement_units is checked to see if it produces a true value when contextually converted to bool. If it does, then replacement_units is equal to the expression *maybe_replacement_units. Otherwise, it is an empty range whose value type is code_unit_t<Encoding>.
    - Otherwise, if is_code_unit_v<code_unit_t<Encoding>> is true, replacement_units is an array of one element of code_unit_t<Encoding> type, whose contents are equivalent to { static_cast<code_unit_t<Encoding>>(U'\uFFFD') }.
    - Otherwise, it is not defined.
  - If replacement_units is defined, then replacement_units is iterated over and code units are inserted into the output range in linear ascending order, if there is space. If there is not enough space, return the result as-is. Note that this may write partial data to the range if replacement_units contains more than one code unit and the output is an output range.
  - Otherwise, let replacement_points be one of:
    - encoding.replacement_code_points(), if the expression is well-formed.
    - Otherwise, let maybe_replacement_points be encoding.maybe_replacement_code_points() if the expression is well-formed. If maybe_replacement_points contextually converted to bool evaluates to true, then replacement_points is equal to the expression *maybe_replacement_points. Otherwise, it is an empty range whose value type is code_point_t<Encoding>.
    - Otherwise, it is not defined.
  - If replacement_points is defined, then replacement code units are inserted into the output as-if as follows: auto intermediate_result = encoding.encode_one(replacement_points, result.output, pass-through-handler, result.state);. If intermediate_result.error_code is not equal to std::text::encoding_error::ok, then return the original result. Note that this may write partial data to the range if the encode operation needs to write more than one code unit to the output.
  - Otherwise, the program is ill-formed.
If successful, the error code on the result will be corrected to say "everything is fine" (std::text::encoding_error::ok) and then returned from the function. This allows algorithms to continue looping over input with the replacement characters inserted. If there is no room in the output, then the error is returned untouched.
For performance reasons and flexibility, the error callable must have a way to ensure that the user and implementation can agree on whether or not Undefined Behavior is invoked by assuming that the text is valid. [libogonek] made a dedicated object for this purpose. This paper provides the same here: an error handler of assume_valid_handler means that the implementation will eliminate all of its checks and subsequent calls to the error handling interface. A user must explicitly provide the assume_valid_handler to achieve this behavior: it will never be the default because it is error-prone and dangerous, and should only be performed with explicit user consent.
Note: Lesson Learned! Rust attempted to force that every string constructed ever was valid UTF-8 and rigorously checked this pre- and post-condition. Doing this check was so expensive that they needed to introduce a new unsafe function that accepts UTF-8 text without checking it, for cases where the user knows the text is already in the proper encoding.
3.2.3. The Encoding Object
It is no great surprise that there are not enough library implementers prepared to standardize the entirety of what the WHATWG specifies in its encoding specification, let alone enough to handle every rogue request for a new encoding object type in the C++ Standard. A system must be developed that provides flexibility for end users without requiring them to write a paper and enter a 1-4 year long process of herding a proposal through the notoriously slow Committee, just to have support for encoding X or feature Y. There is also less and less (read: almost no) tolerance for adding wacky extensions to libraries like libstdc++ or libc++, and MSVC's standard library has only recently been open-sourced (with no appetite for shoveling more semi-abandonware legacy library extensions into its codebase at the time of writing).
Encoding objects provide flexibility that enables us to consume the entire encoding space without needing to tax the Standard Library. They let other people plug into the system with the flexibility they need, so that standardization is only necessary when interoperability and redundant implementation become a burden to the greater C++ ecosystem. This frees up Billy O'Neal, Jonathan Wakely, Louis Dionne, their successors, and the dozens of other standard library contributors and implementers to focus on producing high quality code, rather than scrambling to implement four or five dozen encodings because one company, somewhere, made an at-the-time-it-seemed-okay choice in 2005 about how to store their text.
Given our result types and error handlers, the interface for the encoding object itself can be defined. Here is the example encoding illustrating the interface:
class example_locale_encoding {
public:
    class example_state {
        std::mbstate_t multibyte_state;
    };

    // REQUIRED: code point and code unit types
    using code_point = char32_t;
    using code_unit = char;

    // REQUIRED: one of either this single type,
    using state = example_state;
    // OR, both of the following
    using decode_state = example_state;
    using encode_state = example_state;

    // REQUIRED: maximum input and output sizes
    static constexpr size_t max_code_units = MB_LEN_MAX;
    static constexpr size_t max_code_points = 4;

    // OPTIONAL: injective indicators
    using is_encoding_injective = std::false_type;
    using is_decoding_injective = std::true_type;

    // REQUIRED: functions
    template <typename Input, typename Output, typename Handler>
    auto decode_one(Input&& in_range, Output&& out_range, Handler&& handler, state& current_state);
    template <typename Input, typename Output, typename Handler>
    auto encode_one(Input&& in_range, Output&& out_range, Handler&& handler, state& current_state);

    // OPTIONAL: functions
    constexpr std::span<code_point> replacement_code_points() const noexcept;
    constexpr std::span<code_unit> replacement_code_units() const noexcept;
    constexpr std::optional<std::span<code_point>> maybe_replacement_code_points() const noexcept;
    constexpr std::optional<std::span<code_unit>> maybe_replacement_code_units() const noexcept;
};
There are many pieces of this encoding object. Some of them fit the purposes explained above. As an overview, given an encoding type such as example_locale_encoding, the following type definitions, static member variables, and functions can be present:
- The code_unit and code_point type definitions let us know what an Encoding's inputs and outputs will be from its functions. They also help us tell if 2 encodings can be transcoded from one another by having at least the code_point in common.
- state allows a user to instantiate the type and control any parameters for manipulating stateful or shift-state encodings. It can be split up into two type definitions to be encode_one and decode_one specific: encode_state and decode_state.
  - If is_{encode|decode}_state_independent_v<Encoding> is true (the encoding's state does not need an encoding object to construct itself), {encode|decode}_state_t<Encoding> must be default-constructible, and default construction results in the "initial (shift) sequence" of processing a string of encoded text. This is essentially the state of a conversion before any text is processed.
  - If is_{encode|decode}_state_independent_v<Encoding> is false, then:
    - {encode|decode}_state_t<Encoding> may not be default-constructible; and,
    - the state must be constructible when given a parameter of type const Encoding&.
  - Encodings that do not need state are utf8_t, utf16_t, utf32_t, ascii_t, utf_ebcdic_t, and similar. They fall under the case where is_{encode|decode}_state_independent_v<Encoding> would be true (and the state structure itself would be empty).
  - Many IBM/Windows code pages, ISO 2022 encodings, and Extended Unix Code (EUC) encodings will need state to process their characters properly. This does not require is_{encode|decode}_state_independent_v<Encoding> to be false.
  - Encodings whose state must refer back to the encoding for additional information include run-time, type-erased encodings like any_encoding and one_encoding_of<utf8_t, utf16_t, shift_jis, tscii, ...>. For these run-time erased encodings, is_{encode|decode}_state_independent_v<Encoding> would be false because the encoding has special knowledge that is needed to manage the state properly.
- max_code_units and max_code_points represent integral values which inform users of the encoding the necessary size of a buffer to handle at least one full, encoded sequence of code units and one full, decoded sequence of code points. In most cases, max_code_points will be 1, but there are cases where it is larger (e.g., the Tamil Standard Code for Information Interchange (TSCII)).
- decode_one and encode_one are fundamental functions which convert one full unit of complete, indivisible information from one representation to the other. Specifically, decode_one converts from code_units to code_points, and encode_one converts from code_points to code_units. Input is an input range, Output is an output range, and Handler is an error handler as defined in § 3.2.2.2 Error Handling: Allow All The Options.
Optionally, some additional type definitions and functions help with safety, error handling (for replacement), and more:
-
andis_encoding_injective
indicate whether or not the encode or decode operations provide a lossless map from the code_point to code_unit or vice-versa, respectively. This is important when using high-level conversion facilities: compile-time diagnostics can be issued for conversions that are lossy. This ensures that users who do lossy conversions must specify anis_decoding_injective
from the standard, or one of their own making, and know what they are getting into data loss territory.error_handler -
is a function that returns a range to be entered into the output if an error occurs during areplacement_code_points
call and the error handler used is thedecode_one
orstd :: text :: default_handler
. This provides encodings a simple way to plug in replacement code points that are not the same as the default replacement character used is, which isstd :: text :: replacement_handler
(ďż˝). This can be defined to be an empty range (not recommended but possible).\uFFFD -
is a function that returns a range to be entered into the output if an error occurs during amaybe_replacement_code_points
call and the error handler used is thedecode_one
orstd :: text :: default_handler
. This provides encodings a simple way to plug in replacement code points that are not the same as the default replacement character used is, which isstd :: text :: replacement_handler
(ďż˝). This allows an encoding to determine if a replacement range should be produced based on runtime effects. It also differentiates between not-present replacement values (the optional-like value returned is empty), or an empty replacement value (wherein the range is literally empty).\uFFFD
is preferred if the implementation knows statically that there is always a returned replacement range.replacement_code_points -
is a function that returns a range to be entered into the output if an error occurs during anreplacement_code_units
call and the error handler used is theencode
orstd :: text :: default_handler
. Note that not all encodings can handle the entirety of the Unicode Code Point space, let alonestd :: text :: replacement_handler
(ďż˝). This can be defined to return an empty range (not recommended, but possible).\uFFFD -
is a function that returns a range to be entered into the output if an error occurs during anmaybe_replacement_code_units
call and the error handler used is theencode
orstd :: text :: default_handler
. Note that not all encodings can handle the entirety of the Unicode Code Point space, let alonestd :: text :: replacement_handler
(ďż˝). It also differentiates between not-present replacement values (the optional-like value returned is empty), or an empty replacement value (wherein the range is literally empty).\uFFFD
is preferred if the implementation knows statically that there is always a returned replacement range.replacement_code_units
3.2.3.1. Encodings Provided by the Standard
The primary reason for the standard to provide an encoding is to ensure that applications have a way to communicate with one another. As a baseline, the standard should support all the encodings it ships with its string literal types. On top of that, there is an important base-level optimization when working with strictly ASCII text that can be implemented with UTF8, which most library implementers would be interested in shipping. This means that the following encodings will be shipped by the standard library:
// header: <text_encoding>
namespace std::text {
    using unicode_code_point = char32_t;
    class unicode_scalar_value;

    template <typename CharT, typename Codepoint = unicode_code_point>
    class basic_utf8;
    template <typename CharT, typename Codepoint = unicode_code_point>
    class basic_utf16;
    template <typename CharT, typename Codepoint = unicode_code_point>
    class basic_utf32;
    template <typename Encoding, std::endian endianness = std::endian::native, typename Byte = std::byte>
    class encoding_scheme;
    template <typename CharT, typename Codepoint>
    class no_encoding;

    class ascii_t;
    using utf8_t = basic_utf8<char8_t, unicode_code_point>;
    using utf16_t = basic_utf16<char16_t, unicode_code_point>;
    using utf32_t = basic_utf32<char32_t, unicode_code_point>;
    class literal_t;
    class wide_literal_t;
    class execution_t;
    class wide_execution_t;

    inline constexpr auto ascii = ascii_t{};
    inline constexpr auto utf8 = utf8_t{};
    inline constexpr auto utf16 = utf16_t{};
    inline constexpr auto utf32 = utf32_t{};
    inline constexpr auto literal = literal_t{};
    inline constexpr auto wide_literal = wide_literal_t{};
    inline constexpr auto execution = execution_t{};
    inline constexpr auto wide_execution = wide_execution_t{};

    // Byte-based encoding schemes: little, big, and native endian
    template <typename Byte, typename CharT = char16_t, typename Codepoint = unicode_code_point>
    using basic_utf16_le = encoding_scheme<basic_utf16<CharT, Codepoint>, std::endian::little, Byte>;
    template <typename Byte, typename CharT = char16_t, typename Codepoint = unicode_code_point>
    using basic_utf16_be = encoding_scheme<basic_utf16<CharT, Codepoint>, std::endian::big, Byte>;
    template <typename Byte, typename CharT = char16_t, typename Codepoint = unicode_code_point>
    using basic_utf16_ne = encoding_scheme<basic_utf16<CharT, Codepoint>, std::endian::native, Byte>;
    template <typename Byte, typename CharT = char32_t, typename Codepoint = unicode_code_point>
    using basic_utf32_le = encoding_scheme<basic_utf32<CharT, Codepoint>, std::endian::little, Byte>;
    template <typename Byte, typename CharT = char32_t, typename Codepoint = unicode_code_point>
    using basic_utf32_be = encoding_scheme<basic_utf32<CharT, Codepoint>, std::endian::big, Byte>;
    template <typename Byte, typename CharT = char32_t, typename Codepoint = unicode_code_point>
    using basic_utf32_ne = encoding_scheme<basic_utf32<CharT, Codepoint>, std::endian::native, Byte>;

    using utf16_le_t = basic_utf16_le<std::byte, char16_t, unicode_code_point>;
    using utf16_be_t = basic_utf16_be<std::byte, char16_t, unicode_code_point>;
    using utf16_ne_t = basic_utf16_ne<std::byte, char16_t, unicode_code_point>;
    using utf32_le_t = basic_utf32_le<std::byte, char32_t, unicode_code_point>;
    using utf32_be_t = basic_utf32_be<std::byte, char32_t, unicode_code_point>;
    using utf32_ne_t = basic_utf32_ne<std::byte, char32_t, unicode_code_point>;

    inline constexpr auto utf16_le = utf16_le_t{};
    inline constexpr auto utf16_be = utf16_be_t{};
    inline constexpr auto utf16_ne = utf16_ne_t{};
    inline constexpr auto utf32_le = utf32_le_t{};
    inline constexpr auto utf32_be = utf32_be_t{};
    inline constexpr auto utf32_ne = utf32_ne_t{};
}
All of ascii_t, utf8_t, utf16_t, and utf32_t correspond directly and obviously to what they name. These encodings are also constexpr-capable, in that they can be called at compile-time and used inside of contexts with other constexpr functions, such as within static_asserts.
literal_t and wide_literal_t are encodings which represent the encoding of string literals controlled by the implementation. The functionality to achieve this has already been merged into Clang and GCC, while MSVC, at the time of writing (April 17th, 2021), is still lagging behind.
Both execution_t and wide_execution_t represent the dynamic locale-based encoding that is used as the default encoding for C library functions. They are key encodings for interoperating with locale-dependent narrow execution encoding data as well as locale-dependent wide execution encoding data. It is imperative the standard ships these because only the implementation knows the runtime narrow or wide execution encoding.
encoding_scheme's supremely helpful utility is described below. It takes byte-based (on the Byte type) input streams and turns them into sequences of code points or code units with a given endianness. This is especially useful for interoperation with not only UTF-16 and UTF-32, but other non-byte encodings which need to be transferred over the wire (including "special versions" of these).
These (and their aliases) represent the core 9 encodings that must be shipped with the standard, no matter what. ascii_t holds a special place here because it is a direct subset of utf8_t. If an individual knows their text is purely ASCII ahead of time and they work in UTF8, this information can be used to bit-blast (memcpy) the data from UTF8 to ASCII. It is best that the standard is given this ability and does not require hundreds of users to remake this very basic functionality in customization points.
3.2.3.2. UTF Encodings: variants?
There are many variants of encodings like UTF8 and UTF16. These include [wtf8] and [cesu8], which are useful for internal processing and interoperability with certain systems, like direct interfacing with Java or communication with an Oracle database. However, almost none of these are publicly recommended as interchange formats: both CESU-8 and WTF-8 are documented and used internally for legacy reasons. In some cases, they also represent security vulnerabilities if they are used for interchange on the internet. This makes them less and less desirable to provide via the standard. However, it is worth acknowledging that supporting WTF-8 and CESU-8 as encodings would ease the burden on individuals who need to roll such encodings for their applications.
More pressingly, there is a wide body of code that operates with char as the code unit for its UTF8 encodings. This is also subtly wrong, because on many systems char is not unsigned, but signed. Math and bit characteristics for signed types are wrong for the typical operations performed in UTF8 encoders and decoders (and many people -- including Markus Scherer, who spends a lot of time with ICU -- just wish char was unsigned, since it would have saved a lot of time lost to bugs). On one hand, providing variants that allow someone to pick the code unit type for UTF16 or UTF8 would make it easier to have text types which play nice with the Windows APIs or existing code bases. The interface would look something like this...
namespace std::text {
    template <typename CharT, bool encode_null, bool encode_lone_surrogates>
    class basic_utf8;
    using utf8 = basic_utf8<char8_t, false, false>;

    template <typename CharT, bool allow_lone_surrogates>
    class basic_utf16;
    using utf16 = basic_utf16<char16_t, false>;
}
And externally, libraries and applications could add their own using statements and type definitions for the purposes of internal interoperation:
namespace my_app {
    using compat_utf8 = std::text::basic_utf8<char, false, false>;
    using mutf8 = std::text::basic_utf8<char8_t, true, false>;
    using filesystem16 = std::text::basic_utf16<wchar_t, true>;
}
There is clear utility that can be had here. But, this is not going to be looked into too deeply for the first iterations of this proposal. If there is a need, users are strongly encouraged to chime in (speak up) quickly so that this feature can be added to the proposal before later progression stages.
Finally, there is a plan that for early C++26, the full gamut of WHATWG encodings will be added to the standard, since this covers the minimal viable set of encodings that is required for communicating across the internet and through messaging mediums such as e-mail successfully.
3.2.3.3. Encoding Schemes: Byte-Based
Unicode specifies what are called Encoding Schemes for the encodings whose code unit size exceeds a single byte. This essentially means UTF16 and UTF32, for which there are UTF16 Little Endian (UTF16-LE), UTF16 Big Endian (UTF16-BE), UTF32 Little Endian (UTF32-LE), and UTF32 Big Endian (UTF32-BE). Encoding schemes can be handled generically, without creating extremely specific encodings, by creating an encoding_scheme template. It will look much like so:
// header: <text_encoding>
namespace std::text {
    template <typename Encoding, std::endian endianness = std::endian::native, typename Byte = std::byte>
    class encoding_scheme;
}
This is a transformative encoding type that takes the source endianness and translates it to the native endianness. It has an identical interface to the Encoding type passed in, with the caveat that the code_unit member type is the same as Byte. The Byte type being configurable is important because there are many interfaces in the ecosystem which interoperate using different byte-like types. Furthermore, others have realized they can get better performance from their code by avoiding aliasing types altogether and using a non-aliasing byte type with the necessary definitions to make it usable.
All encoding_scheme does is call the same encode_one or decode_one function with small wrappers around the passed-in ranges that take bytes and compose them into the wrapped encoding's code_unit type, or, when writing out, take a value of that code_unit type and write it out in its byte-based form.
A few SG16 members have frequently advocated that the base inputs and outputs for all types matching the Encoding concept should be byte-based. This paper disagrees with that supposition and instead goes the route of providing this wrapping encoding scheme. The benefit here is flexibility and independence from byte ordering at the encoding level: the encoding_scheme becomes the layer at which such a concern is both concentrated and isolated. Now, no encoding needs to duplicate its interface at all, while still retaining strong and separately named types that one can perform additional optimization on.
Writing mostly-duplicate encoding object types for each byte order of each encoding, and other such shenanigans, is a thorough and fundamental waste of everyone's time.
Note: Lesson Learned! This direction is far less boilerplate, and has also already seen implementation experience in [libogonek]'s [libogonek-encoding_scheme] type. Users have not complained. It has also proved to be implementable by simply decomposing the original input/output ranges into their iterators, and wrapping said iterators with a small adaptor type. It has worked well in the implementation experience linked below.
3.2.3.4. Default Encodings
For interactions with encodings, there are times when a default encoding may be inferred from input and output types in § 3.3 High Level's functions. Thus, the following traits provide defaults that can be overridden by the program:
// header: <text_encoding>
namespace std::text {
    template <typename T>
    using default_code_unit_encoding_t = /* ... */;
    template <typename T>
    using default_consteval_code_unit_encoding_t = /* ... */;
    template <typename T>
    using default_code_point_encoding_t = /* ... */;
    template <typename T>
    using default_consteval_code_point_encoding_t = /* ... */;
}
The implementation for the standard will attempt to select one of the following, or fail, for these types when used in conjunctions where the encoding is "guessed" based on the incoming code unit type of the input range.
For default_code_unit_encoding_t:
- std::text::execution_t if T is (possibly cv-qualified) char.
- std::text::wide_execution_t if T is (possibly cv-qualified) wchar_t.
- std::text::utf8_t if T is (possibly cv-qualified) char8_t.
- std::text::utf16_t if T is (possibly cv-qualified) char16_t.
- std::text::utf32_t if T is (possibly cv-qualified) char32_t, std::text::unicode_code_point, or std::text::unicode_scalar_value.
- std::text::basic_utf8<std::byte> if T is (possibly cv-qualified) std::byte.
- Otherwise, the program is ill-formed.
For default_consteval_code_unit_encoding_t:
- std::text::literal_t if T is (possibly cv-qualified) char.
- std::text::wide_literal_t if T is (possibly cv-qualified) wchar_t.
- std::text::utf8_t if T is (possibly cv-qualified) char8_t.
- std::text::utf16_t if T is (possibly cv-qualified) char16_t.
- std::text::utf32_t if T is (possibly cv-qualified) char32_t, std::text::unicode_code_point, or std::text::unicode_scalar_value.
- std::text::basic_utf8<std::byte> if T is (possibly cv-qualified) std::byte.
- Otherwise, the program is ill-formed.
For both default_code_point_encoding_t and default_consteval_code_point_encoding_t:
- std::text::basic_utf8<char8_t, std::remove_cvref_t<T>> if is_unicode_code_point_v<std::remove_cvref_t<T>> is true.
- Otherwise, the program is ill-formed.
The use of default_code_unit_encoding_t versus default_consteval_code_unit_encoding_t is based on whether any of the higher level functions are executed at compile-time (checked with std::is_constant_evaluated()). The _t aliases are backed by non-_t class templates that a user can specialize. Users can override the associations here for any program-defined type T.
3.2.4. Stateful Objects, or Stateful Parameters?
Stateful objects are good for encapsulation, reuse and transportation. They have been proven in many APIs both C and C++ to provide a good, reentrant API with all relevant details captured on the (sometimes opaque) object itself. After careful evaluation, stateful parameters rather than a wholly stateful object for the function calls in encoding and decoding types are a better choice for this low-level interface. The main and important benefits for having the state be passed to the encoding / decoding function calls as a parameter are that it:
-
maintains that encoding objects can be cheap to construct, copy and move;
-
improves the general reusability of encoding objects by allowing state to be massaged into certain configurations by users; and,
-
allows users to set the state in a public way without having to prescribe a specific API for all encoders to do that.
The reason for keeping encoding types cheap is that they will be constructed, copied, and moved a lot, especially in the face of the ranges that SG16 is going to be putting a lot of work into (some in future papers, some in this paper). Ranges require that they can be constructed in (amortized) constant time; this change allows shifting the construction of what may be potentially expensive state to other places by un-bundling it from encoding object construction.
Consider the case of execution encoding character sets today, which often defer to the current locale. Locale is inherently expensive to construct and use: if the standard has to have an encoding that grabs or creates a locale object as a member, there will be an immediate loss of a large portion of users over the performance drag during construction of higher-level abstractions that rely on the encoding. It is also notable that this is the same mistake std::wstring_convert shipped with and is one of the largest contributing reasons to its lack of use and subsequent deprecation (on top of its poor implementation in almost every standard library, from the VC++ standard library to libc++).
In contrast, consider having an explicit parameter. At the cost of making a low-level interface take one more argument, the state can be paid for once and reused in many separate places, allowing a user to pay the state's expenses up front. It also allows users to set or get the locale ahead of time and reuse it consistently. Encoding or decoding operations may be resumed or restarted in the cases of interruptible or incomplete streams, such as network reading or I/O buffering. These are potent use cases wherein such a design decision becomes very helpful.
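The incomplete-stream case can be sketched with a toy UTF-8 decoder whose state lives outside the decoding function. utf8_decode_state and decode_chunk are hypothetical names for this example only; the decoder assumes well-formed input and performs no validation, so it illustrates the state-passing shape rather than a conforming decoder.

```cpp
#include <string_view>
#include <vector>

// Hypothetical explicit state: the partially assembled code point and how many
// continuation units are still expected.
struct utf8_decode_state {
    char32_t partial = 0;
    int remaining = 0;
};

// Decodes one chunk of well-formed UTF-8, carrying any unfinished sequence in `state`
// so the next call can pick up where this one stopped.
inline void decode_chunk(std::string_view chunk, utf8_decode_state& state,
                         std::vector<char32_t>& out) {
    for (const char c : chunk) {
        const auto unit = static_cast<unsigned char>(c);
        if (state.remaining == 0) {
            if (unit < 0x80) {
                out.push_back(unit);
            } else if ((unit >> 5) == 0b110) {
                state.partial = unit & 0x1F;
                state.remaining = 1;
            } else if ((unit >> 4) == 0b1110) {
                state.partial = unit & 0x0F;
                state.remaining = 2;
            } else {
                state.partial = unit & 0x07;
                state.remaining = 3;
            }
        } else {
            state.partial = (state.partial << 6) | (unit & 0x3F);
            if (--state.remaining == 0) {
                out.push_back(state.partial);
            }
        }
    }
}
```

A network reader can call decode_chunk once per received buffer with the same state object, and a sequence split across two buffers is completed by the second call.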
By having an explicitly presented and controllable state parameter, there is room for a quality of implementation that does not rely on global variables or other shenanigans that may get in the way of multithreaded code. It is all done in the
or
/
type's constructor, and reusable therein; this makes it easy to prevent (default) construction, copy, or move of the encoding object in low-level contexts from triggering unsafe and dangerous operations. Maintaining that trust with end users who have been repeatedly let down by locale's behavior is very important for a current-generation encoding standard library.
Finally, this paradigm makes it far more obvious to the end user when the state is inseparable from the encoding object itself. This is the case with the theoretical
and
types. The necessary state cannot be separated from the encoding object itself: that information is secret in the encoding (because it is erased behind a run-time barrier). There must be a way to ensure that a user can create an encoding that has state that is erased within the current compile-time framework. This is where "state independence" becomes important.
3.2.5. State Independence
Both the
/
(or singular
) types may be connected to an encoding whose type is not known at compile-time. This means that the
is not only tied to the current encoding object's value, but might need to reach into private/internal details to properly create and regenerate its own state type.
To facilitate this design, we provide a simple paradigm for determining whether a state is "independent": is_state_independent_v and its helpers, is_decode_state_independent_v and is_encode_state_independent_v.
// header: <text_encoding>
namespace std::text {

    template <typename State, typename Encoding>
    inline constexpr bool is_state_independent_v
        = !std::is_constructible_v<State, const std::remove_cvref_t<Encoding>&>;

    template <typename Encoding>
    inline constexpr bool is_decode_state_independent_v
        = is_state_independent_v<decode_state_t<Encoding>, const std::remove_cvref_t<Encoding>&>;

    template <typename Encoding>
    inline constexpr bool is_encode_state_independent_v
        = is_state_independent_v<encode_state_t<Encoding>, const std::remove_cvref_t<Encoding>&>;

}
This is how we afford those encodings the ability to work without imposing undue burden on the entire system. The helper functions to create an encode state and a decode state take an encoding parameter and create a state based on whether it is an independent state or not:
// header: <text_encoding>
namespace std::text {

    template <typename Encoding>
    constexpr auto make_encode_state(const Encoding& encoding) {
        if constexpr (is_encode_state_independent_v<Encoding>) {
            return encode_state_t<Encoding>{};
        } else {
            return encode_state_t<Encoding>(encoding);
        }
    }

    template <typename Encoding>
    constexpr auto make_decode_state(const Encoding& encoding) {
        if constexpr (is_decode_state_independent_v<Encoding>) {
            return decode_state_t<Encoding>{};
        } else {
            return decode_state_t<Encoding>(encoding);
        }
    }

}
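As a rough illustration of how the trait separates the two kinds of state, the sketch below re-creates the check in standalone C++17 and applies it to two made-up encodings. The types simple_encoding and erased_encoding, their nested decode_state types, and the codec_id member are assumptions invented for this example; they are not names from the proposal.

```cpp
#include <cassert>
#include <type_traits>

// Mirrors the trait above, written against C++17 (no std::remove_cvref_t).
template <typename State, typename Encoding>
inline constexpr bool is_state_independent_v = !std::is_constructible_v<
    State, const std::remove_cv_t<std::remove_reference_t<Encoding>>&>;

// Hypothetical encoding whose state stands on its own.
struct simple_encoding {
    struct decode_state {};
};

// Hypothetical run-time-selected encoding: its state has to copy
// information out of the encoding object that created it.
struct erased_encoding {
    int codec_id;
    struct decode_state {
        int codec_id;
        constexpr explicit decode_state(const erased_encoding& encoding)
            : codec_id(encoding.codec_id) {}
    };
};

// Same shape as make_decode_state above, using a nested typedef instead of
// the proposal's decode_state_t.
template <typename Encoding>
constexpr auto make_decode_state([[maybe_unused]] const Encoding& encoding) {
    using state = typename Encoding::decode_state;
    if constexpr (is_state_independent_v<state, Encoding>) {
        return state{};         // independent: default construction is enough
    } else {
        return state(encoding); // dependent: must be seeded by the encoding
    }
}

static_assert(is_state_independent_v<simple_encoding::decode_state, simple_encoding>);
static_assert(!is_state_independent_v<erased_encoding::decode_state, erased_encoding>);
```

The independent state is default-constructed, while the dependent one is constructed from the encoding object and so can carry whatever was erased behind the run-time barrier.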
During the usage of high level functionality that needs default-created arguments, it will invoke make_encode_state or make_decode_state to create the state objects that will be passed down to the encode_one or decode_one (or customization points) of that particular encoding.
3.3. High Level
Working with the lower level facilities for text processing is not a pretty sight. Consider the usage of the low-level facilities described above:
#include <text_encoding>
#include <cassert>
#include <iterator>
#include <span>
#include <string_view>
#include <utility>

int main() {
    std::text::unicode_code_point array_output[41]{};
    std::u8string_view input = u8"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.";
    std::text::utf8_t encoding{};
    std::u8string_view working_input = input;
    std::span<std::text::unicode_code_point> working_output(array_output);
    std::text::default_handler handler{};
    std::text::utf8_t::state encoding_state{};
    for (;;) {
        // decode a single indivisible unit of work per iteration
        auto result = encoding.decode_one(working_input, working_output, handler, encoding_state);
        if (result.error_code != std::text::encoding_error::ok) {
            // not what we wanted.
            return -1;
        }
        if (std::empty(result.input)) {
            break;
        }
        // continue from where the last call stopped
        working_input = std::move(result.input);
        working_output = std::move(result.output);
    }
    assert(std::u32string_view(array_output) == U"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.");
    return 0;
}
These low-level facilities -- while powerful and customizable -- do not represent what the average user will -- or should -- be wrangling with. Therefore, the higher-level facilities become incredibly pressing to make these interfaces palatable and sustainable for developers in both the short and long term. Consider the same encoding functionality, boiled down to something far easier to use:
std::u32string output = std::text::decode(u8"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.");
assert(output == U"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.");
This is much simpler and does exactly the same as the above, without all the setup and boilerplate. Of course, taking only the input and giving the output is too much of a simplification, so there are a few overloads and variants that will be offered. Particularly, there need to be 3 sets of free functions: decode / decode_into, encode / encode_into, and transcode / transcode_into. These are high-level functions that perform essentially what is shown above, but with numerous overloads that default a few parameters in the case where they can be figured out.
Note that, at the core of all these functions, the loop as shown above captures the core of the work. All of these abstractions are built on the 7 basis operations specified in § 3.2.3 The Encoding Object. Actually getting additional optimizations is, of course, left to the readers and implementers through the use of § 3.4 The Need for Speed.
3.3.1. Eager Free Functions
The free functions are written in a way to eagerly consume input and output space, unless given an explicit output container which limits its behavior or an error occurs. This is beneficial because many text processing algorithms receive the bulk of their gains by being able to work on multiple code units / code points. Therefore, this layer of the high level API is provided to satisfy the need where input and output space are of little concern.
3.3.1.1. Free Function decode
The decode free function provides a High Level API for decoding text. It allows performance with some degree of flexibility and customization through its parameters, as well as additional improvements with the use of some ADL customization points. The core loop behaves as follows:
-
Performing an auto result = encoding.decode_one(...) call using the current target input and output views.
-
Checking if the return value's error code is std::text::encoding_error::ok, and returning the result early if it is not.
-
Checking std::ranges::empty(result.input), and returning with a result that has error_code set to std::text::encoding_error::ok if it is empty.
-
Otherwise, go back to the first step and use the result.input and result.output views.
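The steps above can be sketched in standalone code. Everything in this block is a stand-in invented for illustration: the encoding_error enumeration only mirrors the shape of the proposed one, toy_ascii is a made-up single-byte encoding, and decode_all is a hypothetical name for the loop rather than the proposed decode overload set. Output is appended to a std::u32string to keep the sketch short, where the proposal works with output views.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Stand-in for the proposed error enumeration (assumption for this sketch).
enum class encoding_error { ok, invalid_sequence };

// Result of one step: the unread input plus an error code.
struct decode_result {
    std::string_view input;
    encoding_error error_code;
};

// Toy encoding: only bytes 0x00-0x7F are valid, one byte per code point.
struct toy_ascii {
    // Precondition: `input` is not empty.
    decode_result decode_one(std::string_view input, std::u32string& output) const {
        const auto unit = static_cast<unsigned char>(input.front());
        if (unit > 0x7F) {
            return {input, encoding_error::invalid_sequence};
        }
        output.push_back(unit);
        return {input.substr(1), encoding_error::ok};
    }
};

// The core loop: decode one unit of work, stop early on error,
// stop successfully when the input is exhausted, otherwise continue.
template <typename Encoding>
decode_result decode_all(std::string_view input, const Encoding& encoding,
                         std::u32string& output) {
    if (input.empty()) {
        return {input, encoding_error::ok};
    }
    for (;;) {
        decode_result result = encoding.decode_one(input, output);
        if (result.error_code != encoding_error::ok) {
            return result; // early return, unread input is preserved
        }
        if (result.input.empty()) {
            return result; // everything consumed
        }
        input = result.input;
    }
}
```

On failure the returned input still begins at the offending code unit, so a caller can report the position or resume with a different strategy.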
The surface of the decode API is as follows:
// header: <text_encoding>
namespace std::text {

    template <typename Input, typename Output, typename Encoding, typename State, typename ErrorHandler>
    constexpr auto decode_into(Input&& input, Encoding&& encoding, Output&& output,
        ErrorHandler&& error_handler, State& state);

    template <typename Input, typename Encoding, typename Output, typename ErrorHandler>
    constexpr auto decode_into(Input&& input, Encoding&& encoding, Output&& output,
        ErrorHandler&& error_handler);

    template <typename Input, typename Encoding, typename Output>
    constexpr auto decode_into(Input&& input, Encoding&& encoding, Output&& output);

    template <typename Input, typename Output>
    constexpr auto decode_into(Input&& input, Output&& output);

    template <typename OutputContainer = void, typename Input, typename Encoding,
        typename ErrorHandler, typename State>
    constexpr auto decode_to(Input&& input, Encoding&& encoding,
        ErrorHandler&& error_handler, State& state);

    template <typename OutputContainer = void, typename Input, typename Encoding, typename ErrorHandler>
    constexpr auto decode_to(Input&& input, Encoding&& encoding, ErrorHandler&& error_handler);

    template <typename OutputContainer = void, typename Input, typename Encoding>
    constexpr auto decode_to(Input&& input, Encoding&& encoding);

    template <typename OutputContainer = void, typename Input>
    constexpr auto decode_to(Input&& input);

    template <typename OutputContainer = void, typename Input, typename Encoding,
        typename ErrorHandler, typename State>
    constexpr auto decode(Input&& input, Encoding&& encoding,
        ErrorHandler&& error_handler, State& state);

    template <typename OutputContainer = void, typename Input, typename Encoding, typename ErrorHandler>
    constexpr auto decode(Input&& input, Encoding&& encoding, ErrorHandler&& error_handler);

    template <typename OutputContainer = void, typename Input, typename Encoding>
    constexpr auto decode(Input&& input, Encoding&& encoding);

    template <typename OutputContainer = void, typename Input>
    constexpr auto decode(Input&& input);

}
The order of arguments is chosen based on what users are likely to specify first. In many cases, all that is needed is the input: the encoding can be chosen automatically for the user based on such. For certain
overloads, the user must specify the encoding object to use themselves. The third parameter is the error handler, which is defaulted to a parameter of type default_handler. The fourth parameter is the state that is used to do the conversion. Given a type
, the following is passed to subsequent overloads by default:
-
Any encoding parameter is created by doing one of the following and passing it to the next overload:
    -
    Constructing an object of default_consteval_code_unit_encoding_t<std::ranges::range_value_t<Input>> type if std::is_constant_evaluated() is true.
    -
    Otherwise, constructing an object of default_code_unit_encoding_t<std::ranges::range_value_t<Input>> type if std::is_constant_evaluated() is false.
-
Any state is generated by calling std::text::make_decode_state(encoding).
-
Any error_handler is generated by constructing a default_handler object.
The
family of functions returns some
after calling
with a
that fills in the
. The
is:
-
OutputContainer if std::is_void_v<OutputContainer> is false.
-
Otherwise, std::basic_string<code_unit_t<Encoding>> if code_unit_t<Encoding> is one of the character types.
-
Otherwise, std::vector<code_unit_t<Encoding>> if code_unit_t<Encoding> is any other type.
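The selection rule in the list above can be expressed as a small type computation. This is a sketch only: is_character_v and output_container_t are hypothetical helper names, and exactly which types count as "character types" is an assumption made here (char, wchar_t, char8_t where available, char16_t, char32_t).

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Assumption for this sketch: what counts as a "character type".
template <typename T>
inline constexpr bool is_character_v =
    std::is_same_v<T, char> || std::is_same_v<T, wchar_t> ||
#if defined(__cpp_char8_t)
    std::is_same_v<T, char8_t> ||
#endif
    std::is_same_v<T, char16_t> || std::is_same_v<T, char32_t>;

// Picks the container type following the three rules in order:
// an explicit OutputContainer wins, then basic_string, then vector.
template <typename OutputContainer, typename CodeUnit>
using output_container_t = std::conditional_t<
    !std::is_void_v<OutputContainer>,
    OutputContainer,
    std::conditional_t<is_character_v<CodeUnit>,
        std::basic_string<CodeUnit>,
        std::vector<CodeUnit>>>;
```

Leaving OutputContainer as void therefore gives string-like results for textual element types and falls back to std::vector for everything else, such as std::byte.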
The
family of functions returns a
. If the overload taking a
is used, then it returns a
.
The
family of functions returns a
.
Note: in the current running implementation, there are also separate overloads for
that take an extra template parameter at the beginning called
, which allows the user to write e.g.
and similar. It is not included in this proposal right now but will be added later, for the purposes of allowing different output types with the simpler calls.
3.3.1.2. Free Function encode
The encode free function provides a High Level API for encoding text. It allows performance with some degree of flexibility and customization through its parameters, as well as additional improvements with the use of some ADL customization points. The core loop behaves as follows:
-
Performing an auto result = encoding.encode_one(...) call using the current target input and output views.
-
Checking if the return value's error code is std::text::encoding_error::ok, and returning the result early if it is not.
-
Checking std::ranges::empty(result.input), and returning with a result that has error_code set to std::text::encoding_error::ok if it is empty.
-
Otherwise, go back to the first step and use the result.input and result.output views.
The surface of the encode API is as follows:
// header: <text_encoding>
namespace std::text {

    template <typename Input, typename Output, typename Encoding, typename State, typename ErrorHandler>
    constexpr auto encode_into(Input&& input, Encoding&& encoding, Output&& output,
        ErrorHandler&& error_handler, State& state);

    template <typename Input, typename Encoding, typename Output, typename ErrorHandler>
    constexpr auto encode_into(Input&& input, Encoding&& encoding, Output&& output,
        ErrorHandler&& error_handler);

    template <typename Input, typename Encoding, typename Output>
    constexpr auto encode_into(Input&& input, Encoding&& encoding, Output&& output);

    template <typename Input, typename Output>
    constexpr auto encode_into(Input&& input, Output&& output);

    template <typename Input, typename Encoding, typename ErrorHandler, typename State>
    constexpr auto encode(Input&& input, Encoding&& encoding,
        ErrorHandler&& error_handler, State& state);

    template <typename Input, typename Encoding, typename ErrorHandler>
    constexpr auto encode(Input&& input, Encoding&& encoding, ErrorHandler&& error_handler);

    template <typename Input, typename Encoding>
    constexpr auto encode(Input&& input, Encoding&& encoding);

    template <typename Input>
    constexpr auto encode(Input&& input);

}
For
, a default encoding of
(§ 3.2.3.4 Default Encodings) is picked when no encoding object is provided. For
-- which takes an output range to write code units into -- the following is done:
-
If std::is_same_v<typename std::iterator_traits<std::ranges::range_iterator_t<Output>>::iterator_category, std::output_iterator_tag> is false, default_code_unit_encoding_t<std::ranges::range_value_t<Output>>{} is used.
-
Otherwise, if the iterator category of the iterators of the output range is std::output_iterator_tag, default_code_point_encoding_t<std::ranges::range_value_t<Input>>{} is used.
Otherwise, the user must specify the encoding object to use themselves. The third parameter is the error handler, which is defaulted to a parameter of type default_handler. The fourth parameter is the state to be used. If it is not provided, then the following is used:
-
If is_encoding_self_state_v<Encoding> is true, then encoding.reset_state(); is called and encoding is passed as the State& parameter to the appropriate overload.
-
Otherwise, encoding_state_t<Encoding>{} is used as the parameter to the appropriate overload.
The
family of functions returns a
after calling
with a
that fills in the
.
returns a
.
Note: in the current running implementation, there are also separate overloads for
that take an extra template parameter at the beginning called
, which allows the user to write e.g.
and similar. It is not included in this proposal right now but will be added later, for the purposes of allowing different output types with the simpler calls.
3.3.1.3. Free Function transcode
The transcode free function provides a High Level API for transforming text from one encoding to another. It allows performance with some degree of flexibility and customization through its parameters, as well as additional improvements with the use of some ADL customization points. The core loop behaves as follows:
-
Performing an auto d_result = from_encoding.decode_one(...) call using the current input view and an intermediate temporary output of code_point_t<FromEncoding> intermediate[FromEncoding::max_code_points];.
-
Checking if the return value's error code is std::text::encoding_error::ok, and returning the result early if it is not.
-
Performing an auto e_result = to_encoding.encode_one(...) call using the previous temporary intermediate output wrapped in a view as the input and the target output view.
-
Checking if the return value's error code is std::text::encoding_error::ok, and returning the result early if it is not.
-
Checking std::ranges::empty(d_result.input), and returning with a result that has error_code set to std::text::encoding_error::ok if it is empty.
-
Otherwise, go back to the first step and use the d_result.input and e_result.output views.
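A minimal sketch of this pivot-through-code-points loop follows, assuming two toy encodings written just for the example. The types latin1 and utf8, the function name transcode_all, and the simplified decode_one / encode_one signatures are all hypothetical; the proposed interfaces take views, error handlers, and state objects that are left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Stand-in for the proposed error enumeration (assumption for this sketch).
enum class encoding_error { ok, invalid_sequence };

// Hypothetical source encoding: Latin-1, where each byte is one code point.
struct latin1 {
    static constexpr std::size_t max_code_points = 1;

    // Consumes one byte from the front of `input`; always fills exactly
    // max_code_points entries of `output`.
    encoding_error decode_one(std::string_view& input, char32_t* output) const {
        output[0] = static_cast<unsigned char>(input.front());
        input.remove_prefix(1);
        return encoding_error::ok;
    }
};

// Hypothetical target encoding: UTF-8, shortest form only.
struct utf8 {
    encoding_error encode_one(char32_t cp, std::string& output) const {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return encoding_error::invalid_sequence;
        }
        if (cp < 0x80) {
            output.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            output.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            output.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            output.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            output.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        return encoding_error::ok;
    }
};

// Decode one unit into a small intermediate buffer, then encode from it.
template <typename FromEncoding, typename ToEncoding>
encoding_error transcode_all(std::string_view input, const FromEncoding& from,
                             std::string& output, const ToEncoding& to) {
    while (!input.empty()) {
        char32_t intermediate[FromEncoding::max_code_points];
        encoding_error d_result = from.decode_one(input, intermediate);
        if (d_result != encoding_error::ok) {
            return d_result;
        }
        for (char32_t cp : intermediate) {
            encoding_error e_result = to.encode_one(cp, output);
            if (e_result != encoding_error::ok) {
                return e_result;
            }
        }
    }
    return encoding_error::ok;
}
```

The intermediate buffer is sized by the source encoding's max_code_points, so the loop needs no dynamic allocation regardless of input length.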
The surface of the transcode API is as follows:
// header: <text_encoding>
namespace std::text {

    template <typename Input, typename FromEncoding, typename Output, typename ToEncoding,
        typename FromErrorHandler, typename ToErrorHandler, typename FromState, typename ToState>
    constexpr auto transcode_into(Input&& input, FromEncoding&& from_encoding,
        Output&& output, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler,
        FromState& from_state, ToState& to_state);

    template <typename Input, typename FromEncoding, typename Output, typename ToEncoding,
        typename FromErrorHandler, typename ToErrorHandler, typename FromState>
    constexpr auto transcode_into(Input&& input, FromEncoding&& from_encoding,
        Output&& output, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler,
        FromState& from_state);

    template <typename Input, typename FromEncoding, typename Output, typename ToEncoding,
        typename FromErrorHandler, typename ToErrorHandler>
    constexpr auto transcode_into(Input&& input, FromEncoding&& from_encoding,
        Output&& output, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler);

    template <typename Input, typename FromEncoding, typename Output, typename ToEncoding,
        typename FromErrorHandler>
    constexpr auto transcode_into(Input&& input, FromEncoding&& from_encoding,
        Output&& output, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler);

    // parameter names made distinct: the original listing named both "encoding"
    template <typename Input, typename Output, typename ToEncoding, typename FromEncoding>
    constexpr auto transcode_into(Input&& input, Output&& output,
        FromEncoding&& from_encoding, ToEncoding&& to_encoding);

    template <typename Input, typename Output, typename ToEncoding>
    constexpr auto transcode_into(Input&& input, Output&& output, ToEncoding&& to_encoding);

    template <typename Input, typename FromEncoding, typename ToEncoding,
        typename FromErrorHandler, typename ToErrorHandler, typename FromState, typename ToState>
    constexpr auto transcode(Input&& input, FromEncoding&& from_encoding, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler,
        FromState& from_state, ToState& to_state);

    template <typename Input, typename FromEncoding, typename ToEncoding,
        typename FromErrorHandler, typename ToErrorHandler, typename FromState>
    constexpr auto transcode(Input&& input, FromEncoding&& from_encoding, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler,
        FromState& from_state);

    template <typename Input, typename FromEncoding, typename ToEncoding,
        typename FromErrorHandler, typename ToErrorHandler>
    constexpr auto transcode(Input&& input, FromEncoding&& from_encoding, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler);

    template <typename Input, typename FromEncoding, typename ToEncoding, typename FromErrorHandler>
    constexpr auto transcode(Input&& input, FromEncoding&& from_encoding, ToEncoding&& to_encoding,
        FromErrorHandler&& from_error_handler);

    // parameter names made distinct: the original listing named both "encoding"
    template <typename Input, typename ToEncoding, typename FromEncoding>
    constexpr auto transcode(Input&& input, FromEncoding&& from_encoding, ToEncoding&& to_encoding);

    template <typename Input, typename ToEncoding>
    constexpr auto transcode(Input&& input, ToEncoding&& to_encoding);

}
For
, a default encoding of
(§ 3.2.3.4 Default Encodings) is picked when no encoding object is provided. For
-- which takes an output range to write code units into -- the following is done:
-
If std::is_same_v<typename std::iterator_traits<std::ranges::range_iterator_t<Output>>::iterator_category, std::output_iterator_tag> is false, default_code_point_encoding_t<std::ranges::range_value_t<Output>>{} is used.
-
Otherwise, if the iterator category of the iterators of the output range is std::output_iterator_tag, default_code_point_encoding_t<std::ranges::range_value_t<Input>>{} is used.
Otherwise, the user must specify the encoding object to use themselves. The third parameter is the error handler, which is defaulted to a parameter of type default_handler. The fourth parameter is the state to be used. If it is not provided, given a type
then the following is used:
-
If is_encoding_self_state_v<Encoding> is true, then encoding.reset_state(); is called and encoding is passed as the State& parameter to the appropriate overload.
-
Otherwise, encoding_state_t<Encoding>{} is used as the parameter to the appropriate overload.
The
family of functions returns a
after calling
with a
that fills in the
.
Note: in the current running implementation, there are also separate overloads for
that take an extra template parameter at the beginning called
, which allows the user to write e.g.
and similar. It is not included in this proposal right now but will be added later, for the purposes of allowing different output types with the simpler calls.
3.3.1.4. Free Function validate_decodable_as
The validate_decodable_as free function provides a High Level API for checking that a range of text is properly in the encoding provided by the user. Its default core implementation works by:
-
Performing an auto result = encoding.decode_one(...) call on the input into an intermediate buffer.
-
Checking if an error occurred, and returning failure if so.
-
Performing an auto intermediate_result = encoding.encode_one(...) call on a view wrapping the intermediate buffer to the output.
-
Checking if an error occurred, and returning failure if so.
-
Performing a std::equal call on the final result, comparing it to the original input consumed.
-
If it is not equal, return failure.
-
If std::ranges::empty(result.input), return true.
-
Go back to the first step.
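To show why the re-encode-and-compare step matters, the following sketch pairs a deliberately lenient UTF-8 decoder with a strict encoder. The free functions decode_one, encode_one, and validate_decodable_as_utf8 are hypothetical simplifications written for this example (no state objects, no error handlers); the lenient decoder is an assumption chosen so that the round trip, not the decoder, is what catches overlong sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Structure-only UTF-8 decode of one sequence: overlong forms and surrogates
// slip through on purpose. Returns the code units consumed, or 0 on error.
// Precondition: `input` is not empty.
inline std::size_t decode_one(std::string_view input, char32_t& code_point) {
    const auto lead = static_cast<unsigned char>(input[0]);
    const std::size_t length = lead < 0x80 ? 1
        : (lead & 0xE0) == 0xC0 ? 2
        : (lead & 0xF0) == 0xE0 ? 3
        : (lead & 0xF8) == 0xF0 ? 4 : 0;
    if (length == 0 || input.size() < length) {
        return 0; // invalid lead byte or truncated sequence
    }
    code_point = length == 1 ? lead : lead & (0xFF >> (length + 1));
    for (std::size_t i = 1; i < length; ++i) {
        const auto unit = static_cast<unsigned char>(input[i]);
        if ((unit & 0xC0) != 0x80) {
            return 0; // not a continuation byte
        }
        code_point = (code_point << 6) | (unit & 0x3F);
    }
    return length;
}

// Strict, shortest-form UTF-8 encode. Returns false for non-scalar values.
inline bool encode_one(char32_t cp, std::string& output) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    if (cp < 0x80) {
        output.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        output.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        output.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        output.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        output.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return true;
}

// Decode one unit, re-encode it, and require the bytes to match the input.
inline bool validate_decodable_as_utf8(std::string_view input) {
    while (!input.empty()) {
        char32_t cp = 0;
        const std::size_t consumed = decode_one(input, cp);
        if (consumed == 0) {
            return false;
        }
        std::string reencoded;
        if (!encode_one(cp, reencoded)) {
            return false;
        }
        if (std::string_view(reencoded) != input.substr(0, consumed)) {
            return false; // round trip changed the bytes (e.g. overlong form)
        }
        input.remove_prefix(consumed);
    }
    return true;
}
```

The comparison step is what rejects input such as an overlong two-byte encoding of '/', which decodes without complaint here but re-encodes to a single, different byte.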
The function signature for validate_decodable_as is a little different than the above functions that actually do the transcoding. Specifically, this function needs 2 states, one for the decode_one call and one for the encode_one call. This is problematic for potential stateful encodings, but for most other encodings this is fine.
// header: <text_encoding>
namespace std::text {

    template <typename Input, typename Encoding, typename DecodeState, typename EncodeState>
    constexpr auto validate_decodable_as(Input&& input, Encoding&& encoding,
        DecodeState& decode_state, EncodeState& encode_state);

    template <typename Input, typename Encoding, typename DecodeState>
    constexpr auto validate_decodable_as(Input&& input, Encoding&& encoding,
        DecodeState& decode_state);

    template <typename Input, typename Encoding>
    constexpr bool validate_decodable_as(Input&& input, Encoding&& encoding);

    template <typename Input>
    constexpr bool validate_decodable_as(Input&& input);

}
The order of arguments is chosen based on what users are likely to specify first. In many cases, all that is needed is the input: the encoding can be chosen automatically for the user based on such. For
, the
encoding type is picked (see § 3.2.3.4 Default Encodings). Otherwise, the user must specify the encoding object to use themselves. The third parameter is the state. There's an optional overload with a 4th parameter: this is due to the nature of the algorithm and the way it checks.
3.3.1.5. Free Function validate_encodable_as
The validate_encodable_as free function provides a High Level API for checking that a range of text is properly in the encoding provided by the user. Its default core implementation works by:
-
Performing an auto result = encoding.encode_one(...) call on the input into an intermediate buffer.
-
Checking if an error occurred, and returning failure if so.
-
Performing an auto intermediate_result = encoding.decode_one(...) call on a view wrapping the intermediate buffer to the output.
-
Checking if an error occurred, and returning failure if so.
-
Performing a std::equal call on the final result, comparing it to the original input consumed.
-
If it is not equal, return failure.
-
If std::ranges::empty(result.input), return true.
-
Go back to the first step.
These are the function signatures for validation:
// header: <text_encoding>
namespace std::text {

    template <typename Input, typename Encoding, typename EncodeState, typename DecodeState>
    constexpr auto validate_encodable_as(Input&& input, Encoding&& encoding,
        EncodeState& encode_state, DecodeState& decode_state);

    template <typename Input, typename Encoding, typename EncodeState>
    constexpr auto validate_encodable_as(Input&& input, Encoding&& encoding,
        EncodeState& encode_state);

    template <typename Input, typename Encoding>
    constexpr bool validate_encodable_as(Input&& input, Encoding&& encoding);

    template <typename Input>
    constexpr bool validate_encodable_as(Input&& input);

}
The order of arguments is chosen based on what users are likely to specify first. In many cases, all that is needed is the input: the encoding can be chosen automatically for the user based on such. For
, the
encoding type is picked (see § 3.2.3.4 Default Encodings).
3.3.1.6. Free Functions count_as_decoded, count_as_encoded, count_as_transcoded
It's exactly the same as above, except that instead of validating, these functions count the code points (count_as_decoded) / code units (count_as_encoded) that come out of the function. count_as_transcoded counts the data that comes out after doing the typical decode-to-re-encode dance, giving the final code unit count.
Each of these functions returns a whole
, which shows the number of characters that were consumed.
3.3.2. Safety with the Free Functions
The second problem is the ability to _lose_ data due to not using lossless encodings. For example, most legacy encodings are lossy when it comes to code points and graphemes outside of their traditional reservoir (e.g., trying to handle Chinese scripts with a latin-1 encoding). Trying to properly encode between this myriad of encodings leaves room for losing information. Even for Wide Character Locale-based (wchar_t) data, the only standard transformation to get to UTF32 text requires translating through the normal Character Locale-based (char) functions first, leading to loss of information and mojibake (see A C paper for additional transcoding utilities).
Therefore, an error at compile-time is wanted if a user uses the above high-level free functions, but does not explicitly specify an error handler in the case where a conversion is lossy. Taking an example from this presentation, this puppy emoji cannot fit in ASCII. In general, most Unicode Code Points cannot fit in an ASCII string: this is a dangerous conversion! So, unless you use a non-default error handler, the library will static_assert or perform other shenanigans to loudly complain at compile-time:
int main(int, char*[]) {
    // Compiler Error: lossy encoding, specify non-default error handler
    std::string ascii_emoji0 = std::text::encode(U"🐶", std::text::ascii{});

    // Compiler Error: lossy encoding, specify non-default error handler
    std::string ascii_emoji1 = std::text::encode(U"🐶", std::text::ascii{},
        std::text::default_handler{});

    // Okay: you asked for it!
    std::string ascii_emoji2 = std::text::encode(U"🐶", std::text::ascii{},
        std::text::replacement_handler{});
    // ascii_emoji2 contains '?'

    // Okay: undefined behavior, but you asked for it.
    std::string ascii_emoji3 = std::text::encode(U"🐶", std::text::ascii{},
        std::text::assume_valid_handler{});
    // ascii_emoji3 has no guarantees
    // at this point: undefined behavior was invoked!
}
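One possible way to get this kind of compile-time complaint is a static_assert that fires when a lossy encoding meets the default handler. The sketch below is only an assumption about how such a check could look: the is_lossy member, the local ascii / default_handler / replacement_handler types, and this encode function are invented for the example and are not the proposal's actual detection mechanism.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical handler tags for this sketch.
struct default_handler {};
struct replacement_handler {};

// Hypothetical encoding that advertises that it cannot represent all of Unicode.
struct ascii {
    static constexpr bool is_lossy = true;
};

// Refuses to compile when a lossy encoding is combined with the default
// handler, whether the handler was defaulted or passed explicitly.
template <typename Encoding, typename Handler = default_handler>
std::string encode(std::u32string_view input, Encoding, Handler = {}) {
    static_assert(!(Encoding::is_lossy && std::is_same_v<Handler, default_handler>),
        "lossy encoding: specify a non-default error handler");
    std::string output;
    for (char32_t cp : input) {
        // Replacement behavior: anything outside ASCII becomes '?'.
        output.push_back(cp < 0x80 ? static_cast<char>(cp) : '?');
    }
    return output;
}
```

With these definitions, encode(U"\U0001F436", ascii{}) and encode(U"\U0001F436", ascii{}, default_handler{}) are both rejected by the compiler, while passing replacement_handler{} compiles and substitutes '?'.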
3.3.3. Improving Usability for Low-Memory Environments: Ranges
One of the biggest problems with decode, encode, and transcode is exactly their eager consumption. By default, these APIs create owning containers of std::basic_string / std::vector and fill them up as much as they possibly can. This makes these High Level free functions untenable for users in memory-constrained environments. The C++ standard is meant to serve everyone, both high-performance _and_ memory-constrained environments. Therefore, lazy ranges are required to provide low-footprint encode, decode, and transcode operations to everyone.
Most importantly, wrappers around other ranges are employed here. This is important: nobody has time to rewrite all of this functionality just because the API strongly mixed
concerns with encoding concerns. There are spans, string views, and other things outside of the standard that are perfectly suitable for iterating over code units: excluding them by not having this be a wrapper type is a non-starter for getting these abstractions wide adoption in the ecosystem.
3.3.3.1. decode_view and decode_iterator
decode_view is a templated type that takes the for loop found in § 3.3 High Level and turns it into a one-by-one, iterative process that produces iterators as powerful as the iterator category/concept of the
type it is supplied with. It is also meant to work with
s of
,
,
and
types (to allow views to be instantiated over pre-existing Encodings and Ranges and used to make algorithms work).
decode_iterator is also specified:
// header: <text_encoding>
namespace std::text {

    template <typename Encoding,
        typename Range = basic_string_view<code_unit_t<Encoding>>,
        typename ErrorHandler = default_handler,
        typename State = encoding_state_t<Encoding>>
    class decode_iterator;

    template <typename Encoding,
        typename Range = basic_string_view<code_unit_t<Encoding>>,
        typename ErrorHandler = default_handler,
        typename State = encoding_state_t<Encoding>>
    class decode_view {
    public:
        using iterator = decode_iterator<Encoding, Range, ErrorHandler, State>;
        using sentinel = decode_sentinel;
        using range_type = Range;
        using encoding_type = Encoding;
        using error_handler_type = ErrorHandler;
        using encoding_state_type = encoding_state_t<encoding_type>;

        constexpr decode_view(range_type range) noexcept;
        constexpr decode_view(range_type range, encoding_type encoding) noexcept;
        constexpr decode_view(range_type range, encoding_type encoding,
            error_handler_type error_handler) noexcept;
        constexpr decode_view(range_type range, encoding_type encoding,
            error_handler_type error_handler, encoding_state_type state) noexcept;
        constexpr decode_view(iterator it) noexcept;

        constexpr iterator begin() const& noexcept;
        constexpr iterator begin() && noexcept;
        constexpr sentinel end() const noexcept;

        friend constexpr decode_view reconstruct(
            ::std::in_place_type_t<decode_view>, iterator it, sentinel) noexcept;
    };

}
The
produces a
of
. It keeps track of how many code points are generated by a call to
, and iterates through however many are present, before calling
again to obtain the next values.
In the case of errors, the standard has a number of well-defined behaviors that prevent the need to add a
check to the view type, or to provide a
-like wrapper for the
:
-
default_handler/replacement_handler: provides replacement characters, which will be inserted into the iteration stream. Errors do not escape and are shown as replacement characters. This works fine.
-
throw_handler: throws on an error, exceptions escape the ++it and *it calls. This works fine.
-
assume_valid_handler: user was already invoking UB if errors were hit. This works "fine" (the user asked for it).
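The replacement-character case from the list above is sketched below as a small lazy view over UTF-8 input. The class name lazy_utf8_decode_view and its members are hypothetical and far simpler than the proposed decode_view: the encoding and the replacement behavior are hard-coded, and only byte structure is checked (overlong forms and surrogates are not rejected). It is meant to show one code point being decoded per iterator step with no owning container involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical lazy view: decodes UTF-8 one code point per iterator step.
// Malformed input yields U+FFFD and skips a single code unit.
class lazy_utf8_decode_view {
public:
    struct sentinel {};

    class iterator {
    public:
        explicit iterator(std::string_view remaining) : remaining_(remaining) {
            advance(); // decode the first code point eagerly
        }
        char32_t operator*() const { return current_; }
        iterator& operator++() {
            advance();
            return *this;
        }
        bool operator!=(sentinel) const { return !done_; }

    private:
        void advance() {
            if (remaining_.empty()) {
                done_ = true;
                return;
            }
            const auto lead = static_cast<unsigned char>(remaining_[0]);
            const std::size_t length = lead < 0x80 ? 1
                : (lead & 0xE0) == 0xC0 ? 2
                : (lead & 0xF0) == 0xE0 ? 3
                : (lead & 0xF8) == 0xF0 ? 4 : 0;
            char32_t code_point = 0xFFFD; // replacement unless proven well-formed
            std::size_t consumed = 1;
            if (length != 0 && remaining_.size() >= length) {
                char32_t value = length == 1 ? lead : lead & (0xFF >> (length + 1));
                bool well_formed = true;
                for (std::size_t i = 1; i < length; ++i) {
                    const auto unit = static_cast<unsigned char>(remaining_[i]);
                    if ((unit & 0xC0) != 0x80) {
                        well_formed = false;
                        break;
                    }
                    value = (value << 6) | (unit & 0x3F);
                }
                if (well_formed) {
                    code_point = value;
                    consumed = length;
                }
            }
            current_ = code_point;
            remaining_.remove_prefix(consumed);
        }

        std::string_view remaining_;
        char32_t current_ = 0;
        bool done_ = false;
    };

    lazy_utf8_decode_view(std::string_view input) : input_(input) {}
    iterator begin() const { return iterator(input_); }
    sentinel end() const { return {}; }

private:
    std::string_view input_;
};
```

Since the view only wraps a std::string_view, it can be layered over any contiguous buffer of code units, and errors never escape the iteration: they appear in the stream as U+FFFD.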
Therefore, the only error case wherein
and
perform badly is when the error handler is one which passes through the error without doing anything with the error information, with the expectation that the user handles it. The user would be unable to handle it in this case with the custom error handler. There are a few ways to deal with this situation: the first would be to restrict the allowed error handlers into the range and iterator types to Standard Sanctioned™ types. The other would be to just throw hands up when the user passes in an error handler that does not properly throw, massage, or handle errors in an appropriate fashion. This proposal currently advocates the latter: passing an error handler to the 4th template parameter is an extreme amount of buy-in. If users have gone this far, they must want a very specific custom behavior. Implementations will be encouraged to add asserts to trap users who have poor behavior, but otherwise leave it undefined behavior if errors are not handled for iterator and range types.
Note: This differs from how Tom Honermann's text_view
and similar behaved. That library returned Boost.Outcome/
/
-like result types that one had to further dereference to get to the code points. This represented an ergonomics and a composability problem, because a further transformation step to dereference was always required.
A third option is returning a special type which holds the
result and has an implicit conversion to the
type. It could throw on a conversion where there is an error. This design choice has some serious limitations because it makes
dangerous to use for casual users due to the nature of "magical proxy types". It also forces a thrown error on end users, a choice that ignores the needs of environments where exceptions do not exist or are prohibitively expensive.
Note: It is recognized that the Standard does not bless such implementations. This proposal does not care: the needs of C++'s users greatly outweigh the theoretical purity of the C++ abstract machine, where the cost of all things is equal and does not matter. The standard's preferred error handling method has a non-zero cost (particularly in binary size) simply to exist, and that cost has not been fully optimized into a "do not pay for what you do not use" state. Furthermore, it is still extremely dubious to throw-by-default on any ill-formed text for reasons mentioned above. Therefore, directions wherein the default is equivalent to throwing are not preferred at this time.
3.3.3.2. encode_view and encode_iterator
This is identical to § 3.3.3.1 decode_view and decode_iterator, except the names of the view and iterator are encode_view and encode_iterator, respectively, as well as a few other minor changes:
-
The Range template parameter is defaulted to basic_string_view<code_point_t<Encoding>>.
-
The encode_view itself produces code units (e.g., value_type is code_unit_t<Encoding> rather than code points), one at a time, of the Encoding by using encoding.encode_one.
Everything else is identical in nature to decode_view.
3.3.3.3. transcode_view and transcode_iterator
This is mostly identical to § 3.3.3.1 decode_view and decode_iterator, though there are more apparent changes here:
-
The names of the view and iterator types are transcode_view and transcode_iterator, respectively.
-
The template parameters are modified to take a ToEncoding and a FromEncoding, a ToErrorHandler and a FromErrorHandler, and finally a ToState and FromState.
-
The Range template parameter is defaulted to basic_string_view<code_unit_t<ToEncoding>>.
-
The value_type is code_unit_t<ToEncoding> and produces code units, one at a time, of the ToEncoding.
Additionally, another important change here is an optimization opportunity. The default implementation of performing a single
operation is to:
-
Take the input range stored in the class, call from_encoding.decode_one with it.
-
Take the intermediate output range for the previous decode_one call, and feed it into to_encoding.encode_one.
-
Present the output to the user in a suitable manner.
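The three steps above can be sketched for a UTF-8 to UTF-16 conversion. This is a minimal sketch, not the proposal's implementation: toy_decode_one_utf8, toy_encode_one_utf16 and toy_transcode are hypothetical helpers invented for illustration, and they assume well-formed input and perform no error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Step 1 (hypothetical helper): decode one code point from the front of
// `input` and advance past it. Assumes well-formed UTF-8.
inline char32_t toy_decode_one_utf8(std::string_view& input) {
    const unsigned char lead = static_cast<unsigned char>(input[0]);
    const std::size_t length = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte.
    char32_t code_point =
        static_cast<char32_t>(length == 1 ? lead : (lead & (0x7F >> length)));
    for (std::size_t i = 1; i < length; ++i) {
        code_point = (code_point << 6) | (static_cast<unsigned char>(input[i]) & 0x3F);
    }
    input.remove_prefix(length);
    return code_point;
}

// Step 2 (hypothetical helper): encode one code point as UTF-16.
inline void toy_encode_one_utf16(char32_t code_point, std::u16string& output) {
    if (code_point < 0x10000) {
        output.push_back(static_cast<char16_t>(code_point));
    } else {
        code_point -= 0x10000;  // split into a surrogate pair
        output.push_back(static_cast<char16_t>(0xD800 + (code_point >> 10)));
        output.push_back(static_cast<char16_t>(0xDC00 + (code_point & 0x3FF)));
    }
}

// Step 3: hand the output back. Every code point makes the full round trip.
inline std::u16string toy_transcode(std::string_view input) {
    std::u16string output;
    while (!input.empty()) {
        toy_encode_one_utf16(toy_decode_one_utf8(input), output);
    }
    return output;
}
```

Each iteration materializes a full char32_t even when the source and destination code units are closely related, which is the cost the shortcuts below aim to remove.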
This is fine, as long as the code_point types agree when going from the code units of the FromEncoding to the code units of the ToEncoding. The problem here is that for many conversions, going from code_unit_t<FromEncoding> → shared code_point → code_unit_t<ToEncoding> is an unnecessarily long step. The same way ADL customization points are provided for the free functions, there must be provisions for turning that through-code-points roundtrip into something a little bit faster.
For example, ASCII is a bitwise subset of UTF-8. It is extremely foolish to roundtrip that -- for each and every code point/code unit -- through an intermediary code point as is done in the generic core implementation. Similarly, GB18030 prioritizes encoding compatibility with GBK encodings. Extensibility for this case is provided as described in § 3.4.1.1 One-by-one Transcoding Shortcuts.
3.4. The Need for Speed
Performance is correctness. If these methods and the resulting interface are not fast enough to meet the needs of the programmers, there will be little to no adoption over current solutions. Thanks to work by Bob Steagall and Zach Laine, it is fact that it is incredibly hard to make a range-based or iterator-based interface which will achieve the text processing speeds that will satisfy users of trivial (
-based, pointer-based) need. There are shortcuts when transcoding between certain encoding pairs that should be taken, even if one-by-one transcoding works in the general case.
An explicit goal of this library is that there shall be no room for a lower level abstraction or language here, and the first steps to doing that are recognizing the benefits of eager encoding, decoding and transcoding interfaces, as well as pluggable and overridable behavior for the variety of functionality as it relates to higher-level abstractions.
Research and implementation experience with [boost.text], [text_view] and others has made it plainly clear that while iterators and ranges can produce an extremely efficient binary, it is still not the fastest code that can be written to compete with hand-written/vectorized bulk text processing routines made specifically for each encoding. Therefore, it is imperative that lazy ranges cannot be the only solution. The C++ Standard must steadily and nicely supplant the codebase-specific or ad-hoc solutions individuals keep rolling for encoding and decoding operations.
3.4.1. Speed and Flexibility for Everyone: Customization Points
An important part of that is the ability to provide performance for both lazy, range-based iteration as described in § 3.3.3 Improving Usability for Low-Memory Environments: Ranges and fast free functions as described in § 3.3.1 Eager Free Functions. To this end, an ADL free function scheme similar to the Range Access Customization Points (e.g. ranges::begin and friends) has been developed to facilitate the customization for speed that users will require for their code.
Considering this is going to be one of the most fundamental text layers that sits between typical text and a lot of the new I/O routines, it is imperative that these conversions are not only as fast as possible, but customizable. The user can already customize the encoding by creating their own conforming encoding object, but encodings still do their transformations on a code point-by-code point basis. Therefore, a means of extensibility needs to be chosen for the decode, encode and transcode (§ 3.3.1 Eager Free Functions) functions. As this paper is targeting C++23, there exists hope that Matt Calabrese's [p1292] receives favor in the Evolution Design Groups so that the extension mechanisms are simple functions that call simple extension points as laid out below. There is also a chance that tag_invoke ([p1895]) may emerge as the dominant customization point handling mechanism. Failing that, a design similar to std::ranges's customization points -- as laid out in [n4381] -- would be preferred.
What is not negotiable is that it must be extensible. Users should be able to write fast transcoding functions that the standard picks up for their own encoding types. From GB18030 to other ISO and WHATWG encodings, there will always be a need to extend the fast bulk processing of the standard. Current standard library implementers do not have the time to support every single legacy encoding on the planet, and companies do not have the time to petition each and every standard library to add support for their internal encoding. Similarly, government records kept in legacy encodings for political or organizational reasons cannot be locked out of this world either.
Thusly, the following extension points are provided.
3.4.1.1. One-by-one Transcoding Shortcuts
Using the example of ASCII and UTF-8 previously made in this paper, there is room for performing faster one-by-one transcoding. Normally, given a FromEncoding and ToEncoding such as ascii and utf8, the round-trip process is as follows:
-
Convert input code_unit_t<FromEncoding> → intermediary shared code_point_t<FromEncoding>.
-
Convert shared code_point_t<FromEncoding> → code_unit_t<ToEncoding>.
This is accomplished by first calling decode_one on the incoming input with an intermediary output, typically an array of code_point_t<FromEncoding> wrapped up in a view. This intermediary is then put into an encode_one call and the resulting output used for whatever purpose is necessary.
To speed this process up, the free function text_transcode_one can be defined by the user to skip the round trip:
// in any related namespace in which ADL can find it
template <typename Input, typename FromEncoding, typename Output, typename ToEncoding,
          typename FromErrorHandler, typename ToErrorHandler,
          typename FromState, typename ToState>
std::text::transcode_result<Input, Output, FromState, ToState>
text_transcode_one(Input input, FromEncoding&& from, Output output, ToEncoding&& to,
                   FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler,
                   FromState& from_state, ToState& to_state);
The following is a complete example of this customization point.
using ascii_to_utf8_result = std::text::transcode_result<
  std::span<char>, std::span<char8_t>,
  std::text::ascii_t::state, std::text::utf8_t::state>;

template <typename FromErrorHandler, typename ToErrorHandler>
ascii_to_utf8_result text_transcode_one(
  std::span<char> input, std::text::ascii_t& from,
  std::span<char8_t> output, std::text::utf8_t& to,
  FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler,
  std::text::ascii_t::state& from_state, std::text::utf8_t::state& to_state) {
  if (input.empty()) {
    // no input: that's fine
    return ascii_to_utf8_result(input, output, from_state, to_state);
  }
  if (output.empty()) {
    // error: no room!
    return std::text::propagate_transcode_error(
      from, input, to, output, from_error_handler, to_error_handler,
      from_state, to_state,
      std::text::encoding_error::insufficient_output_space,
      std::span<char, 0>{});
  }
  if ((static_cast<unsigned char>(input[0]) & 0x80) != 0) {
    // error: high bit set, so this code unit is not ASCII
    return std::text::propagate_transcode_error(
      from, input.subspan<1>(), to, output, from_error_handler, to_error_handler,
      from_state, to_state,
      std::text::encoding_error::invalid_sequence,
      input.first<1>()); // the offending code unit
  }
  // bitwise compatible
  output[0] = static_cast<char8_t>(input[0]);
  // return result
  return ascii_to_utf8_result(input.subspan<1>(), output.subspan<1>(),
    from_state, to_state, std::text::encoding_error::ok);
}
This is faster than the round trip through
and requires much less checking and work. When
is, internally, doing the conversion from one code point to another, it will check if an unqualified call to text_transcode_one is valid, and if so call it with its input, output, to/from encoding, and current states.
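The "check if an unqualified call is valid, and if so call it" step can be sketched with a detection trait. This is only one possible way to write such a check, shown here in C++17; the toy and user namespaces, the encoding types, and the three-argument hook signature are hypothetical simplifications and are much smaller than the proposed text_transcode_one signature.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

namespace toy {
    // True when an unqualified call text_transcode_one(from, to, unit) is valid.
    template <typename From, typename To, typename = void>
    struct has_transcode_one_hook : std::false_type {};

    template <typename From, typename To>
    struct has_transcode_one_hook<From, To,
        std::void_t<decltype(text_transcode_one(
            std::declval<From&>(), std::declval<To&>(), char{}))>>
        : std::true_type {};

    template <typename From, typename To>
    std::string transcode_one(From& from, To& to, char unit) {
        if constexpr (has_transcode_one_hook<From, To>::value) {
            // A user-provided shortcut exists: ADL finds it, so use it.
            return text_transcode_one(from, to, unit);
        } else {
            // Stand-in for the generic decode_one + encode_one round trip.
            return "generic:" + std::string(1, unit);
        }
    }
}

namespace user {
    struct legacy {};
    struct modern {};
    struct other {};

    // The hook lives next to the encoding types so ADL can find it.
    inline std::string text_transcode_one(legacy&, modern&, char unit) {
        return "shortcut:" + std::string(1, unit);
    }
}
```

Because the hook is selected by the encoding types themselves, a pair without a shortcut (here legacy to other) silently falls back to the generic path.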
Note: The function propagate_transcode_error takes care of calling the from_error_handler and, if appropriate, the to_error_handler as well. It does this by constructing a temporary decode_result with the current results and a temporary output buffer, milling it through the from_error_handler, checking if the temporary output buffer was written into by the from_error_handler, and passing that intermediary to the to_error_handler to properly simulate the scheme by which an error would normally be handled in the transcode cycle. This is primarily to facilitate the case where a replacement_handler or similar would write a replacement character to the intermediate storage buffer in the default "code_unit_t<FromEncoding> → shared code_point → code_unit_t<ToEncoding>" chain; that change needs to be placed in the final output rather than in an intermediate buffer which is going to disappear.
Note: This may be an indication that there should be a third kind of error handler for transcode operations, but that threatens to leak the detail that a transcode is an optimization of decode + encode and make the user sensitive to such an internal optimization.
It is important to note that the above example customization point only works for std::spans; or, anything that can be consumed by the respective std::span arguments. This means that a
templated on a
would not qualify here, as it is not a contiguous range. This is intentional: there are cases where the kind of range being captured matters for the purposes of optimization. For example, a contiguous range might have its functionality replaced by calls to C standard library functions. Only a contiguous range works in that case, because the C standard library deals exclusively in pointers.
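A minimal sketch of why the range kind matters: a pointer-based C library call such as std::memcpy can only be used when the range exposes contiguous storage. The names toy_is_contiguous and toy_copy_units are hypothetical, and the data()/size() detection is a C++17 stand-in for a real contiguous-range check; it is not how the proposal selects overloads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <list>
#include <type_traits>
#include <utility>
#include <vector>

// Detects ranges exposing data() and size(): a rough stand-in for "contiguous".
template <typename Range, typename = void>
struct toy_is_contiguous : std::false_type {};

template <typename Range>
struct toy_is_contiguous<Range,
    std::void_t<decltype(std::declval<const Range&>().data()),
                decltype(std::declval<const Range&>().size())>>
    : std::true_type {};

// Copies char code units from `input` into `output`.
// Returns true when the pointer-based fast path was taken.
template <typename Range>
bool toy_copy_units(const Range& input, std::vector<char>& output) {
    if constexpr (toy_is_contiguous<Range>::value) {
        // Contiguous storage: hand raw pointers to the C library.
        output.resize(input.size());
        if (!output.empty()) {
            std::memcpy(output.data(), input.data(), input.size());
        }
        return true;
    } else {
        // Node-based or otherwise scattered storage: iterate element by element.
        output.assign(input.begin(), input.end());
        return false;
    }
}
```

A std::vector<char> takes the fast path, while a std::list<char> holding the same code units falls back to iteration and still produces the same output.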
3.4.1.2. Customizability: Transcoding Free Functions
The free functions are the chance for the user to optimize bulk encoding. This is an area that becomes very important to users all over the world. Many people have already written optimized routines to convert from one encoding to another: it would be a shame if all of this work could not interoperate with the standard as it is. That is why there are 3 ADL-found free functions that are checked for well-formedness and, if well-formed, are called by the implementation in decode, encode, and transcode. They are as follows:
// in any related namespace in which ADL can find it
template <typename Input, typename Encoding, typename Output,
          typename State, typename ErrorHandler>
decode_result<Input, Output, State>
text_decode(Input input, const Encoding& encoding, Output output,
            State& state, ErrorHandler&& error_handler);

template <typename Input, typename Encoding, typename Output,
          typename State, typename ErrorHandler>
encode_result<Input, Output, State>
text_encode(Input input, const Encoding& encoding, Output output,
            State& state, ErrorHandler&& error_handler);

template <typename Input, typename FromEncoding, typename Output, typename ToEncoding,
          typename FromState, typename ToState,
          typename FromErrorHandler, typename ToErrorHandler>
transcode_result<Input, Output, FromState, ToState>
text_transcode(Input input, const FromEncoding& from_encoding,
               Output output, const ToEncoding& to_encoding,
               FromState& from_state, ToState& to_state,
               FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler);
Each of these is the customization hook that a user can write in a namespace to enable a proper conversion from one encoding to another. Nominally, users would use concrete types in place of templated types like
,
, and
. Because each encoding object is essentially its own "strong object", tags are not required here: the encoding itself acts as an overload-separating, anchoring, strongly-identifying tag that keeps overloads separate and non-clashing. This is different from Boost.Text, where the library must employ encoding tags on its ranges to gain additional framework-internal optimizations based on smart tag and type-based dispatching. With strong encoding objects, it is not necessary to craft such things internally and, externally, users can rely on it for their ADL extension points:
// requires <windows.h> for MultiByteToWideChar
template <typename FromErrorHandler, typename ToErrorHandler>
std::text::transcode_result<std::span<char>, std::span<char16_t>,
  decode_state_t<win_wrap::windows_1252>, encode_state_t<std::text::utf16_t>>
text_transcode(std::span<char> input, const win_wrap::windows_1252& from_encoding,
  std::span<char16_t> output, const std::text::utf16_t& to_encoding,
  decode_state_t<win_wrap::windows_1252>& from_state,
  encode_state_t<std::text::utf16_t>& to_state,
  FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler) {
  using result_t = std::text::transcode_result<std::span<char>, std::span<char16_t>,
    decode_state_t<win_wrap::windows_1252>, encode_state_t<std::text::utf16_t>>;

  if (input.empty()) {
    // do nothing
    return result_t(input, output, from_state, to_state, std::text::encoding_error::ok);
  }
  // first call: ask how many UTF-16 code units the conversion needs
  int Needed = MultiByteToWideChar(1252, 0, input.data(),
    static_cast<int>(input.size()), nullptr, 0);
  if (Needed == 0 || (Needed > static_cast<int>(output.size()))) {
    // handle error ...
    return std::text::propagate_transcode_error(input, from_encoding, output, to_encoding,
      from_state, to_state, from_error_handler,
      std::text::encoding_error::insufficient_output_space, std::span<char, 0>{});
  }
  // second call: perform the conversion into the output span
  int Succ = MultiByteToWideChar(1252, 0, input.data(),
    static_cast<int>(input.size()),
    reinterpret_cast<wchar_t*>(output.data()), static_cast<int>(output.size()));
  if (Succ == 0) {
    // handle error ...
    return std::text::propagate_transcode_error(input, from_encoding, output, to_encoding,
      from_state, to_state, from_error_handler,
      std::text::encoding_error::invalid_sequence, std::span<char, 0>{});
  }
  // update output size
  size_t output_consumed = static_cast<size_t>(Succ);
  output = std::span<char16_t>(output.data() + output_consumed,
    output.size() - output_consumed);
  // update input size
  input = std::span<char>(input.data() + input.size(), 0);
  return result_t(input, output, from_state, to_state, std::text::encoding_error::ok);
}
This does not show all the error handling, but it is a full demonstration of a custom windows_1252 encoding defined by a user going through the customization point to get to UTF-16
encoded text. Note that this is a slight simplification, since there are additional checks for what kind of error handler is present and whether or not valid substitution can be performed (e.g., since
does not accept "unique replacement" characters, but
does).
Note: Like in § 3.4.1.1 One-by-one Transcoding Shortcuts, the function propagate_transcode_error takes care of calling the from_error_handler. The to_error_handler is not used because the to_error_handler would require intermediate code points to be passed as the "unread characters" range for the last parameter to the error handler. The whole point of customization points is to skip the intermediaries and perform direct conversions in the first place. Still, if one needs to call the to_error_handler for some reason, they can simply call it.
There does exist some concern for individuals who may want to do specializations for the standard's encodings. The specification will permit someone to write their own
→
optimization, which will take precedence. This does not let the implementation off the hook for performance: this is only expected to be done for cases where the end-user knows their target architecture better than the standard could (small embedded devices with obscure chipsets and ISAs, platforms with custom compilers, and similar). Common environments can and absolutely should be optimized by the implementation because there is a bounded set of only 9 possible encodings that the C++ Standard will include at first if this proposal progresses all the way.
Even if this is possible, it is absolutely expected for implementations to optimize common Unicode encoding pairs with OS or library-internal specific algorithms. If a vendor fails to do this, please file a bug against their implementation.
Loudly.
3.4.1.3. Customizability: Validating and Counting Free Functions
The validate function also needs a customization point, as well as count_code_points and count_code_units. To start, there are efficient ways to count code units (e.g., in UTF-8) that do not require synthesizing the full code point value. This can be used to gain speed when counting the size of a very large buffer of text. Similarly, validation can be done cheaply and efficiently when compared to the common loop outlined in § 3.3.1.4 Free Function validate_decodable_as. Therefore, the ADL customization points are as follows:
// in any related namespace in which ADL can find it
template <typename Input, typename Encoding, typename State, typename ErrorHandler>
count_result<Input, State>
text_count_code_points(Input input, const Encoding& encoding,
                       State& state, ErrorHandler&& error_handler);

template <typename Input, typename Encoding, typename State, typename ErrorHandler>
count_result<Input, State>
text_count_code_units(Input input, const Encoding& encoding,
                      State& state, ErrorHandler&& error_handler);

template <typename Input, typename Encoding, typename DecodeState, typename EncodeState>
validate_result<Input, DecodeState>
text_validate_code_units(Input input, const Encoding& encoding,
                         DecodeState& decode_state, EncodeState& encode_state);

template <typename Input, typename Encoding, typename DecodeState>
validate_result<Input, DecodeState>
text_validate_code_units(Input input, const Encoding& encoding, DecodeState& state);

template <typename Input, typename Encoding, typename EncodeState>
validate_result<Input, EncodeState>
text_validate_code_points(Input input, const Encoding& encoding, EncodeState& state);

template <typename Input, typename Encoding, typename EncodeState, typename DecodeState>
validate_result<Input, EncodeState>
text_validate_code_points(Input input, const Encoding& encoding,
                          EncodeState& encode_state, DecodeState& decode_state);
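As an illustration of the kind of shortcut such a hook could use, code points in UTF-8 can be counted without synthesizing any code point value: every byte that is not a continuation byte starts a new code point. The function name toy_count_code_points_utf8 is hypothetical, and the sketch assumes the input is already well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts code points in UTF-8 assumed to be well-formed.
inline std::size_t toy_count_code_points_utf8(std::string_view input) {
    std::size_t count = 0;
    for (char c : input) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

The loop touches each byte once with a single mask and compare, which is also a shape that vectorizes well compared to a full decode loop.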
Notably, there are two
functions that can be opted into that take 3 or 4 arguments, respectively. This is for the rare case of an encoding that both cannot create a default state, like ones where
is true (e.g. the
/
described in this proposal).
In this case, we need a customization point wherein such an encoding, using internal/secret knowledge, can do its validation without needing to rely on the 4-argument
overload and the core default loop's specification. This satisfies the ability of self-state encodings to escape the need to pass themselves twice to the
function.
4. Implementation Experience
There are implementations of this work and its predecessors. This paper is a culmination of reading and understanding all of those and building/improving on them. Links to the publicly available implementations are in the below section.
4.1. Previous Work
While the ideas presented in this paper have been explored in various different forms, the ideas have never been succinctly composed into a single distributable library. Therefore, the author of this paper is working on an implementation that synthesizes all of the learning from [icu], [boost.text], [text_view] and [libogonek]. Reportedly, an implementation using a similar system exists in a few Fortune 500 company codebases. [copperspice] also has a somewhat similar implementation, but differs in a few places.
The thought of connecting encodings together through shared code points is a prominent feature of [iconv] and [libogonek], and is how both libraries enjoy strong support for transcoding between so many different encodings. In the context of [iconv], it is integral to how they have achieved full support for transcoding between every single one of their 48+ encodings, which would normally take a quadratic number (over 2300) of handwritten, bidirectionally capable functions.
4.2. Current Work
A publicly-available form of this implementation can be found here, in the Shepherd's Oasis repository. The deep documentation can be found here, available online. The docs can also be built from the source repository: see the instructions and build for details.
5. FAQ
Some commonly asked questions.
5.1. Question: Why is there a max_code_points value? Won't you only ever output a single unicode code point?
This is incorrect. There are cases for encodings such as TSCII that output multiple unicode code points at once. The minimum required space must be dictated by the encoding: C++ made the mistake for
with the infamous "N:1" rule, and that rule is one of the primary reasons file-based streams (which can be any
in an inheritance-based design, as well as nearly anything with the wide use of what file descriptors represent in many operating systems) cannot handle Unicode properly in many implementations (chief among them, Microsoft Windows).
5.2. Question: What about Old Unicode Encodings / Private Use Area Encodings?
These are treated like legacy encodings. Someone must convert to "normal" (Unicode vRight-Now) Unicode in order to have higher level algorithms work. If this includes Private Use Area characters, then a person will need the ability to customize the normalization algorithms for use in getting e.g. Medieval Text and Biblical Text to normalize properly. This will be covered in a future paper on a
free function, a
type, and
/
normalization objects provided by the standard. SG16 at the moment is against trying to create customization points and changes for the Unicode Character Database and give PUA code points different properties. Individuals who use e.g. Unicode v6 w/ Softbank Private Use Area or TACE 16 Encodings will need to convert any Private Use Area characters to Unicode and normalize, or provide their own normalization form for upcoming papers.
5.3. Question: It can be faster to bulk-decode, then bulk-encode instead of one-by-one transcoding. Why not that design?
While this is true, as asserted in the § 3.3.1.3 Free Function transcode section, bulk decoding requires intermediary storage to bulk-decode into. This imposes an invisible intermediate in the API, or requires explicitly allowing the user to pass one in. Furthermore, a user may only want to partially decode, partially encode, and then repeat because there is some internal memory limit, rather than do a single "complete" bulk conversion.
A significant amount of thought and experimental implementation went into potentially providing both a
function that behaves as is currently specified, PLUS a
function that does a bulk decode and then a bulk encode. The design space was deemed a little too fraught with knobs and potential for violating user expectations in unexpected ways. This does not mean a regular user cannot enjoy the benefits of building a similar abstraction. Both the
and
functions are available for a user to apply the right amount of each to achieve a goal similar to the one behind the
abstraction previously envisioned.
Still, a future revision of this paper may include it, because basic testing and user feedback has indicated this can be much faster than a transcode method that is not specialized.
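A user-built version of the "partially decode, partially encode, repeat" pattern might look like the following sketch, which converts Latin-1 to UTF-8 through a small fixed-size intermediate buffer. The function name toy_bulk_latin1_to_utf8 and the buffer size of 4 are assumptions made for illustration; a real routine would call the proposed decode and encode functions rather than hand-rolled loops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: Latin-1 to UTF-8 via a bounded intermediate code point buffer.
inline std::string toy_bulk_latin1_to_utf8(std::string_view input) {
    std::array<char32_t, 4> intermediate{};  // deliberately tiny to force several rounds
    std::string output;
    while (!input.empty()) {
        // Bulk decode: each Latin-1 code unit is its own code point.
        const std::size_t chunk =
            input.size() < intermediate.size() ? input.size() : intermediate.size();
        for (std::size_t i = 0; i < chunk; ++i) {
            intermediate[i] = static_cast<unsigned char>(input[i]);
        }
        input.remove_prefix(chunk);
        // Bulk encode: Latin-1 code points need at most two UTF-8 code units.
        for (std::size_t i = 0; i < chunk; ++i) {
            const char32_t code_point = intermediate[i];
            if (code_point < 0x80) {
                output.push_back(static_cast<char>(code_point));
            } else {
                output.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
                output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
            }
        }
    }
    return output;
}
```

The caller, not the library, owns the intermediate buffer and chooses its size, which is exactly the knob the paragraph above describes as difficult to expose cleanly in a standard interface.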
5.4. Question: Where is the specification for normalization_view<nfkc> and normalize(...)?
Normalization is separable from the low-level transcoding, and even though APIs like
and similar have additional parameters for doing automatic decomposition or composition upon transcoding, more recently the API has switched to doing these things in 2 separate phases. It is unclear whether there is a performance gain from combining the two as in Windows's APIs, but without such performance data we prefer correctness and existing practice. Furthermore, normalization overloads can always be added to the transcoding interfaces later, if a combined interface proves to have benefits. There is also an open question about the existence of normalization within the highest level abstraction types like
and whether or not those invariants should be enforced. Currently, Zach Laine's Boost.Text enforces normalization on creation and insertion of data into its
type. This has proven to be unsuitable to many people because they chose different normalization forms, as seen during the Boost Review.
5.5. Question: Where is the specification for std::text::basic_text and std::text::basic_text_view?
Those types as currently imagined require additional functionality, like normalization and potentially segmentation algorithms (e.g., for making Grapheme Cluster Iterators the default iterators). It will be split off into a separate paper, even if we allude to its existence and use in this proposal. This is primarily because the ultimate goal is to make
a reality.
6. Proposed Changes
The following wording is relative to the latest C++ Draft paper.
6.1. Feature Test Macro
The desired feature test macro is
.
6.2. Intent
The intent of this wording is to provide greater generic coding guarantees and optimizations by allowing for a class of ranges and views that model the new exposition-only definitions of a reconstructible range:
-
add a new feature test macro for reconstructible ranges to cover constructor changes;
-
add a new customization point object for ranges::reconstruct;
-
and, add two new concepts to [range.req].
If borrowed_range is changed to the
concept name, then this entire proposal will rename all its uses of borrowed_range.
For ease of reading, the necessary portions of other proposals' wording are duplicated here, with the changes necessary for the application of reconstructible range concepts. Such sections are clearly marked.
6.3. Proposed Library Wording
Add a feature test macro
.
Insert into §24.2 Header <ranges> Synopsis [ranges.syn] a new customization point object in the inline namespace:
namespace std::ranges {
  inline namespace unspecified {
    …
    inline constexpr unspecified reconstruct = unspecified;
    …
  }
}
7. Acknowledgements
Thanks to R. Martinho Fernandes, whose insightful Unicode quips got me hooked on the problem space many, many years ago and helped me develop my first in-house solution for an encoding container adaptor several years ago. Thanks to Mark Boyall, Xeo, and Eric Tremblay for bouncing off ideas, fixes, and other thoughts many years ago when struggling to compile libogonek on a disastrous Microsoft Visual Studio November 2012 CTP compiler.
Thanks to Tom Honermann, who had me present my second SG16 meeting before it was SG16 and help represent and carry his papers which gave me the drive to help fix the C++ standard for text. Many thanks to Zach Laine, whose tireless implementation efforts have given me much insight and understanding into the complexities of Unicode and whose implementation in Boost.Text made clear the tradeoffs and performance issues. Thanks to Mark Zeren who helped keep me in SG16 and working on these problems.
And thank you to those of you who grew tired of an ASCII-only world and supported this effort.