There does not exist a good, flexible, backwards-compatible solution for Text in C++. Choosing a library either requires taking the entire thing (Qt), getting involved in a very complex interface (ICU), dealing with sometimes limiting API choices (CopperSpice), or having a well-done but very opinionated design set (the proposed-for-Boost Boost.Text library).
Where is the standard library-friendly, maximum performance solution for handling text encoding and decoding in C++ and C?
This project is the push to reach that goal.
Publicly Available Implementation
The publicly available implementation is here: https://ztdtext.rtfd.io. You can track progress on this page, through the documentation’s “Progress & Future Work” section, or at the GitHub Repository.
The C Library implementation — Cuneicode — will be made publicly available as funding, scholarship, and sponsorship goals are reached.
Current Funding
Funding goes toward:
- Funding development;
- Targeting specific features;
- Covering general library support;
- Covering specific company or vendor support;
- and attending WG14 (C Committee) and WG21 (C++ Committee) meetings.
Specialized solutions for C++11 (or C++03) can be made. If you, your company, or your organization are interested in helping, or need special features or early access to the features listed below, please get in touch with these folk through their website or by e-mail.
Funding Goals and Progress
Below are the published funding goals. Sponsors may pay into specific goals or, if given a large enough donation, create a new goal entirely; otherwise, funding falls into the categories in a top-to-bottom, linear fashion. Goals marked (Stretch) are not quite bare-minimum necessary, but would be absolutely wonderful to accomplish!
- [🎊 Accomplished!] Bootstrap Initial Development, to get library tested and released;
- Normalization Forms and C-based Span Implementation (Cuneicode C Library)
- WHATWG and CJK Encoding Tests
- Cover C Standard Library development to reach the maximum number of users with basic functionality;
- Reach Full-Time Text Development to reach 2022 Goal;
Current Goal: Normalization Forms and C-based Span Implementation
Current Goal Total: $1,275.55 USD / $20,000.00 USD
[ ⣿⣿⣤⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
Technical Details
The work is ongoing. The latest public documentation for the released library can be found here: https://ztdtext.rtfd.io.
The C++ library includes the C one as a submodule and builds on top of it for fast-path functions. Internally, the C library is implemented with C++ and, hopefully soon, will be vectorized by hand or with SIMD/std::experimental::simd.
Document trails:
The principles and inner workings of the implementation are detailed in a series of talks, slides and posts:
- !!Con 2021
  😱 Oh, No! 😱 The Lowest-level‡ Programming Language is Unicode-aware and I have no excuses?!
  May 20th, 2021
  Virtual Conference
- C++ on Sea 2020
  🤿 Deep C Diving - Fast and Scalable Text Interfaces at the Bottom 🤿
  July 16th, 2019
  Virtual Conference
- C++ Russia Moscow 2020
  🏎 Burning Silicon - Speed for Transcoding in C++23
  June 30th, 2020
  Virtual Conference
- Pure Virtual C++ 2020
  Lucky 7 - Designing Text Encodings for C++
  April 30th, 2020
  Virtual Conference
  - Abstract: Text handling in the C and C++ Standards is a tale of legacy encodings and a demonstration of decisions that worked in the moment but don’t scale up to the needs of tomorrow. With Unicode on the horizon, C++20 prepared fundamental changes such as char8_t and polished things to make it easier to catch bad conversions and logical program errors when working with encoded text. Still, the landscape has poor support for transcoding from one encoding to another, let alone higher-level algorithms such as comparing two text forms which render identically to the user but have different bit patterns. This talk explores the fundamental design space behind Encoding, Decoding and Transcoding text. It describes the benefits of the API under active consideration for text, potential speed gains from such an API, and how it enables better handling of complex tasks such as normalization.
  - Video
  - Slides
- Meeting C++ 2019
  Catching ⬆️: Unicode for C++ in Greater Detail - 2 of 5
  Saturday, November 16th, 2019
  Berlin, Germany
- CppCon 2019
  Catching ⬆️: The (Baseline) Unicode Plan for C++23
  Friday, September 20th, 2019
  Aurora, Colorado
- Study Group 16 - Text and Unicode
  A Rudimentary Unicode Abstraction
  Wednesday, March 7th, 2018
  Boston, Massachusetts
The current spread of goals is as follows.
Ⅰ: Core Text Utilities [ 🎉 COMPLETE 🎉 ]
Finished and documented here:
- View Types - https://ztdtext.readthedocs.io/en/latest/api.html#views
- Encoding Objects - https://ztdtext.readthedocs.io/en/latest/encodings.html
[ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Ⅱ: User Extensibility Hooks for (User) Encodings [ 🎉 COMPLETE 🎉 ]
[ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Ⅲ: Byte Buffers and Streaming [ 🎉 COMPLETE 🎉 ]
[ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
Ⅳ: Normalization Forms [ 5% ]
- All four Unicode Normalization Forms, as specified in UAX #15.
  - Canonical Form nfc
  - Canonical Form nfd
  - Compatibility Form nfkc
  - Compatibility Form nfkd
- text_view<Encoding, NormalizationForm, Container>
- text<Encoding, NormalizationForm, Container>
[ ⣿⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
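To make the motivation for this goal concrete: two code point sequences can render identically and still compare unequal, which is exactly what the normalization forms are meant to fix. The block below is only a toy sketch under loud assumptions. The name toy_compose_acute is hypothetical and is not part of ztd.text or Cuneicode; it composes a base vowel followed by U+0301 from a five-entry hard-coded table, while real NFC needs the full Unicode data tables and canonical reordering from UAX #15.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not a ztd.text API).
// Composes "base vowel + U+0301 COMBINING ACUTE ACCENT" into the precomposed
// code point for a handful of hard-coded pairs. Real NFC (UAX #15) requires
// the complete composition data and canonical reordering; none of that is here.
inline std::u32string toy_compose_acute(const std::u32string& input) {
    struct pair_entry {
        char32_t base;
        char32_t composed;
    };
    static constexpr pair_entry table[] = {
        { U'a', 0x00E1 }, { U'e', 0x00E9 }, { U'i', 0x00ED },
        { U'o', 0x00F3 }, { U'u', 0x00FA },
    };
    constexpr char32_t combining_acute = 0x0301;

    std::u32string output;
    output.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        char32_t current = input[i];
        if (i + 1 < input.size() && input[i + 1] == combining_acute) {
            for (const pair_entry& entry : table) {
                if (entry.base == current) {
                    current = entry.composed;
                    ++i; // the combining mark was consumed along with its base
                    break;
                }
            }
        }
        output.push_back(current);
    }
    return output;
}
```

With this sketch, the decomposed spelling of "café" (ending in e, U+0301) compares unequal to the precomposed spelling (ending in U+00E9) until both are passed through the same composition step.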
Ⅴ: CJK Encoding Tests [ 0% ]
- Implement gb18030, the official government Unicode Transformation Format encoding of the PRC.
- Implement the shift_jis/euc_jp/iso2022_jp legacy encodings.
  - Priority goes to shift_jis/euc_jp, as they encode more traffic.
[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
Ⅵ: (Stretch) Enhanced Execution Encoding [ 0% ]
- Reach into platform-specific functions to rip out guts of platform’s current encoding to ensure preservation of Unicode in:
  - narrow_execution
  - wide_execution
[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
Ⅶ: (Stretch) Hyper-Scrutinized Vectorization Implementation [ 0% ]
- Apply vectorization techniques to conversions between pairs of encodings among ascii, utf8, utf16, and utf32.
[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
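As a rough idea of what a fast path for one of these pairs can look like, here is a minimal sketch of a word-at-a-time ASCII check for the ascii/utf8 to utf32 direction. It is plain scalar code (sometimes called SWAR), not hand-written SIMD and not std::experimental::simd, and the function name widen_ascii_prefix is a hypothetical one chosen for this example rather than anything from the library. The assumption is that a caller would hand the remaining, non-ASCII input to a full decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fast path (illustration only): widens the leading run of ASCII
// bytes from `input` into `output` and returns how many units were converted.
// It stops at the first byte >= 0x80 or when either buffer is exhausted.
inline std::size_t widen_ascii_prefix(const unsigned char* input, std::size_t input_size,
                                      char32_t* output, std::size_t output_size) {
    const std::size_t limit = input_size < output_size ? input_size : output_size;
    std::size_t i = 0;
    // Word-at-a-time: 8 bytes are all ASCII if and only if no byte has its
    // top bit set. The mask is the same in every byte, so endianness is moot.
    while (limit - i >= 8) {
        std::uint64_t word;
        std::memcpy(&word, input + i, sizeof(word)); // safe unaligned load
        if ((word & 0x8080808080808080ULL) != 0) {
            break; // some byte in this chunk is not ASCII
        }
        for (std::size_t j = 0; j < 8; ++j) {
            output[i + j] = static_cast<char32_t>(input[i + j]);
        }
        i += 8;
    }
    // Byte-at-a-time tail: finishes the chunk that failed the check, or the
    // last few bytes that did not fill a whole word.
    while (i < limit && input[i] < 0x80) {
        output[i] = static_cast<char32_t>(input[i]);
        ++i;
    }
    return i;
}
```

The return value tells the caller where the slow path has to take over; whether this beats a simple byte loop on a given platform is something only measurement can answer.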
Ⅷ: (Stretch) C Library for Span-Based Conversions [ 0% ]
- As detailed in proposal N2440: C functions for fast conversions.
- Cover INCITS/ANSI fees.
- Take functionality through all of WG14, put into C Libraries such as:
  - musl.
  - glibc.
  - Potentially: new LLVM libc.
[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
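For readers unfamiliar with the "span-based" shape, the sketch below shows the general idea: the caller passes a pointer and a size for both input and output, and the function advances all four as it works, so a conversion can be resumed after the output fills up. This is an assumption-laden illustration written in C++ for consistency with the rest of this page. The names sketch_status and sketch_utf32_to_utf8, the parameter order, and the error codes are all hypothetical and are not the functions or signatures proposed in N2440.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result codes for this sketch only.
enum class sketch_status { ok, invalid_input, insufficient_output };

// Hypothetical span-style converter (NOT the N2440 interface): encodes UTF-32
// code points as UTF-8, advancing both pointer/size pairs past whatever was
// successfully converted so the caller can inspect or resume.
inline sketch_status sketch_utf32_to_utf8(const char32_t** input, std::size_t* input_size,
                                          unsigned char** output, std::size_t* output_size) {
    while (*input_size > 0) {
        const char32_t c = **input;
        // Reject values outside Unicode and lone surrogates.
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
            return sketch_status::invalid_input;
        }
        const std::size_t needed = c < 0x80 ? 1 : (c < 0x800 ? 2 : (c < 0x10000 ? 3 : 4));
        if (*output_size < needed) {
            return sketch_status::insufficient_output;
        }
        unsigned char* out = *output;
        if (needed == 1) {
            out[0] = static_cast<unsigned char>(c);
        } else if (needed == 2) {
            out[0] = static_cast<unsigned char>(0xC0 | (c >> 6));
            out[1] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        } else if (needed == 3) {
            out[0] = static_cast<unsigned char>(0xE0 | (c >> 12));
            out[1] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            out[2] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        } else {
            out[0] = static_cast<unsigned char>(0xF0 | (c >> 18));
            out[1] = static_cast<unsigned char>(0x80 | ((c >> 12) & 0x3F));
            out[2] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            out[3] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
        // Advance both spans only after a whole code point was written.
        *input += 1;
        *input_size -= 1;
        *output += needed;
        *output_size -= needed;
    }
    return sketch_status::ok;
}
```

Because nothing is allocated and progress is reported through the arguments, a caller can drive this kind of function with a fixed-size buffer in a loop, which is the property that makes the shape attractive for a C library.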
Ⅸ: (Stretch) WHATWG Encoding Functionality [ 0% ]
- The WHATWG (https://whatwg.org/) specifies many encodings which are required to support the web.
- Covers implementing the encodings specified by the WHATWG Living Standard.
[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
Ⅹ: (Stretch) Strong Exception Guarantee [ 0% ]
- std::text<Encoding, NormalizationForm, Container>: strong exception guarantee on all applicable operations.
- noexcept container support for std::text and std::text_view.
- noexcept allocator support.
  - Container operations are made conditionally noexcept if possible, based on the allocator and movability of the inserted types.
[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
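The "conditionally noexcept" idea in this goal can be sketched with standard type traits. The block below is a minimal illustration, assuming a hypothetical wrapper named sketch_text and a hypothetical storage type named throwing_move_storage; neither is the proposed std::text, and the real design may condition on more than movability (for example, allocator properties).

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Illustration-only storage type whose move constructor is allowed to throw.
struct throwing_move_storage {
    throwing_move_storage() = default;
    throwing_move_storage(const throwing_move_storage&) = default;
    throwing_move_storage(throwing_move_storage&&) noexcept(false) {}
};

// Hypothetical stand-in for a text type; NOT the proposed std::text.
template <typename Container>
class sketch_text {
public:
    sketch_text() = default;

    // Taking ownership of storage is noexcept only when moving the
    // underlying container cannot throw.
    explicit sketch_text(Container&& storage)
        noexcept(std::is_nothrow_move_constructible<Container>::value)
    : storage_(std::move(storage)) {
    }

    // Defaulted moves: the compiler derives the exception specification
    // from the Container member.
    sketch_text(sketch_text&&) = default;
    sketch_text& operator=(sketch_text&&) = default;

    const Container& storage() const noexcept {
        return storage_;
    }

private:
    Container storage_;
};
```

Generic code such as std::vector uses this kind of information (through std::move_if_noexcept) to decide whether it can move elements during reallocation while keeping the strong exception guarantee, which is why propagating noexcept accurately matters for a text type layered on arbitrary containers.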