It is finally time.
After 3 months and quite a bit of hard work, I have finally finished the Google Summer of Code program. In it, I have created an enhanced and improved implementation of P0237 and N2050 from scratch. In my original proposal to get the Summer of Code work-study, I stated that I would implement a range-based bit_view, a container adaptor version of dynamic_bitset, all of the bit_iterator/bit_reference/bit_value types, and more.
Let’s go through the checklist of each phase and what we managed to implement and work on:
Phase 1 - bit_iterator, bit_reference, bit_value and bit_view
✔️ 🏆 🎉 - Absolutely covered, with all stretch goals and even MORE met!
This was spoken about much, much earlier. We had both dynamic_bitset<C> and bit_view<R> covered, which pulled in an implementation of bit_iterator, bit_reference, and bit_value. Unfortunately, during this time we also had bit_span, which was a bad name for a type. The original idea was to have a mutable versus immutable split for each container in separate types. That was a mistake, and it was far better to just let the const-ness of the range or container flow naturally into the class's interface and prevent misuse at compile time. There's still a bit_span in the final product, but it's just an alias: template <typename T> using bit_span = bit_view<span<T>>;.
The only other change here is that our bit_reference type is a bit more involved, taking a ReferenceType and a MaskType rather than just the default bit_reference<T> of the original proposal. This allows for a greater degree of control over how the reference shakes out, including potential uses for std::reference_wrapper<> types being used in bit_reference successfully. (Currently, this scenario is not tested, but we do allow for a top-level bit_view<std::reference_wrapper<std::vector>> behavior!)
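To make the shape of that concrete, here is a rough sketch of a reference-plus-mask proxy. The names and template defaults are my own illustration, not the actual itsy.bitsy declarations:

#include <cstdint>
#include <type_traits>

// Illustrative sketch only; the real itsy.bitsy type differs in its details.
// Ref reaches the word (e.g. std::uint32_t&), and Mask is the type of the
// single-bit mask that selects the referenced bit.
template <typename Ref, typename Mask = std::remove_reference_t<Ref>>
class sketch_bit_reference {
public:
     sketch_bit_reference(Ref ref, Mask mask) : ref_(ref), mask_(mask) {}

     // Read: true if the masked bit is set.
     operator bool() const noexcept {
          return (ref_ & mask_) != 0;
     }

     // Write: set or clear exactly the masked bit, leaving the rest alone.
     sketch_bit_reference& operator=(bool value) noexcept {
          if (value) { ref_ |= mask_; }
          else { ref_ &= static_cast<Mask>(~mask_); }
          return *this;
     }

private:
     Ref ref_;
     Mask mask_;
};

With a plain word reference for Ref, this behaves like the classic proxy; the extra parameters are what leave room for reference-wrapper-style types.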
Phase 2 - Iterator Improvements and Algorithms
✔️ 🏆 - Everything, plus most of the stretch goals.
This was covered by 2 previous articles, here and here. I won't recap absolutely everything, but it was a bit of a struggle getting most of the iterator constraints exactly right.
We also added some serious utilities here that cover some of the use cases first brought up after completion of the first phase. Jonathan Müller mentioned that he had some use cases for directly working with a potentially disjoint set of bits; the tiny/bit_view class in his library does that. I added support for creating a view of a set of bits that could target precisely bits [x, y) by using the Bounds class portion of the new bit_view<Range, Bounds>:
// 0xFBFF = (MSB) 0b1111101111111111 (LSB)
std::vector<std::uint32_t> storage{ 0x0000FBFF, 0xFFFFFFFF };
using Rng = std::span<std::uint32_t>;
bitsy::bit_view<Rng, bitsy::bit_bounds<10, 22>> truncated_view_bits(
storage.data(), storage.size()
);
assert(truncated_view_bits.size() == 12);
// 0th bit of the view is the 10th bit of storage;
// the 10th bit of 0xFBFF is false
assert(truncated_view_bits[0] == bitsy::bit0);
Most people would wonder why I didn't take a number directly, like std::span<T, N> does. I believe taking a class template parameter is far more flexible than span's single numeric bound value. It also allows for taking a bitsy::dynamic_bit_bounds type as well, whose size can be calculated entirely at runtime. The default is a bitsy::word_bit_bounds, which simply spans from 0 to the size of the container (that is, the bounds encompassing all the words of the range). This was a severe improvement to the API that resulted in covering a much broader selection of use cases while still retaining the same abstraction powers.
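For instance, the runtime flavor might be used something like this, reusing the storage and Rng from the snippet above. The exact constructor shape here is my assumption for illustration, not necessarily the shipped itsy.bitsy signature:

// Hypothetical usage of the runtime-sized bounds; the constructor
// signature is assumed for illustration.
bitsy::bit_view<Rng, bitsy::dynamic_bit_bounds> runtime_view_bits(
     storage.data(), storage.size(),
     bitsy::dynamic_bit_bounds{ 10, 22 }
);
assert(runtime_view_bits.size() == 12);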
It is also important to pass in 2 numbers: span does not need 2 numbers to describe its bounds/extents, because you can just increment the contiguous iterator being put into the span. You cannot increment "just 10 bits" from the iterator that goes into a bit_view: either the bit_iterator itself needs to be incremented, or an external abstraction applied to bit_view (like, e.g., std::ranges::view::drop(10)).
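A quick contrast with the real std::span makes the point:

#include <cstdint>
#include <span>

std::uint32_t words[4]{};

// A span needs only one extent, because the starting point can be
// encoded by pre-advancing the pointer itself:
std::span<std::uint32_t, 3> three_words(words + 1, 3);

// There is no "words + 10 bits" pointer arithmetic, so a bit-level
// bound has to carry both a first-bit and a last-bit position: hence
// bit_bounds<10, 22> above, or a runtime dynamic_bit_bounds.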
Writing the algorithms was simple, really. It turns out most of the algorithms in C++ aren't too hard when you sit down with them, read the requirements, listen to conference talks, and read textbooks about them.
But the next part? Getting things fast is where things got… interesting.
Phase 3 - Optimizations
✔️ 🏆 - Everything, plus most of the stretch goals.
Optimizing everything was the hardest part of this Summer of Code, but also the most satisfying. I plan to write an article later with much more detailed information, but the success here is clear as day. There were severe improvements made on top of the bit intrinsics GCC ships in libstdc++ (the ones detailed by Jens Maurer's p0553), since those only work with unsigned numeric types. This excluded enumerations (including std::byte) and most character types as well. We also identified several places where GCC trips over itself as far as code generation goes, something that actually made me publish this article today rather than a full week ago (feel free to poke the bear here and see all the zany codegen and constexpr mishaps between MSVC, GCC, and Clang: Godbolt). I would like to actually congratulate Clang, which gets the code generation and work done 100% right when you fix the platform. It even recognizes that the two uses of the built-ins, one with the addition, are actually equivalent code, and it dumps out the optimal assembly. It is pretty great!
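To give a flavor of the kind of improvement involved: a bit-level find does not have to test one bit at a time; it can skip whole words and spend a single bit scan on the first interesting one. Here is a minimal, standalone sketch of the technique using the C++20 <bit> functions that came out of p0553 (this is not the actual library code):

#include <bit>     // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>

// Sketch: find the position of the first set bit in a buffer of words,
// examining a whole 32-bit word per iteration instead of a bit at a time.
std::size_t find_first_one(const std::uint32_t* words, std::size_t word_count) {
     for (std::size_t i = 0; i < word_count; ++i) {
          if (words[i] != 0) {
               // One intrinsic-backed bit scan on the first interesting word.
               return i * 32 + static_cast<std::size_t>(std::countr_zero(words[i]));
          }
     }
     return word_count * 32; // no set bit: one-past-the-end position
}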
In all seriousness, I could spend all day writing about this kind of stuff. But why should I prattle on about algorithmic improvements, intrinsic usage, codegen, and the like, when I can simply show some numbers? I re-implemented Howard Hinnant's performance benchmarks for almost all of the algorithms he talked about (sans std::rotate). The results are in:
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
noop 0.000 ns 0.000 ns 1000000000
is_sorted_by_hand 2842 ns 2888 ns 248889
is_sorted_base 56290 ns 57199 ns 11200
is_sorted_vector_bool 273079 ns 278308 ns 2358
is_sorted_bitset 192486 ns 194972 ns 3446
is_sorted_itsy_bitsy 817 ns 820 ns 896000
is_sorted_until_by_hand 2812 ns 2825 ns 248889
is_sorted_until_base 48404 ns 48652 ns 14452
is_sorted_until_vector_bool 251334 ns 251116 ns 2800
is_sorted_until_bitset 187284 ns 185904 ns 3446
is_sorted_until_itsy_bitsy 679 ns 680 ns 896000
find_by_hand 2443 ns 2455 ns 280000
find_base 22696 ns 22949 ns 32000
find_vector_bool 84364 ns 85449 ns 8960
find_bitset 90512 ns 89979 ns 7467
find_itsy_bitsy 2422 ns 2400 ns 280000
fill_by_hand 256 ns 255 ns 2635294
fill_base 2533 ns 2511 ns 280000
fill_vector_bool 254 ns 257 ns 2800000
fill_bitset 126316 ns 125552 ns 4978
fill_bitset_smart 255 ns 257 ns 2800000
fill_itsy_bitsy 259 ns 262 ns 2800000
fill_itsy_bitsy_smart 254 ns 255 ns 2635294
sized_fill_by_hand 252 ns 251 ns 2800000
sized_fill_base 2567 ns 2567 ns 280000
sized_fill_vector_bool 113136 ns 112305 ns 6400
sized_fill_bitset 124528 ns 125558 ns 5600
sized_fill_bitset_smart 254 ns 251 ns 2800000
sized_fill_itsy_bitsy 257 ns 257 ns 2800000
sized_fill_itsy_bitsy_smart 256 ns 257 ns 2800000
equal_by_hand 1020 ns 1025 ns 640000
equal_memcmp 0.321 ns 0.312 ns 1000000000
equal_base 0.323 ns 0.328 ns 1000000000
equal_vector_bool 242796 ns 239955 ns 2800
equal_vector_bool_operator 256364 ns 254981 ns 2635
equal_bitset 0.000 ns 0.000 ns 1000000000
equal_bitset_operator 0.000 ns 0.000 ns 1000000000
equal_itsy_bitsy 518 ns 516 ns 1000000
equal_itsy_bitsy_operator 513 ns 516 ns 1120000
count_by_hand 5606 ns 5625 ns 100000
count_base 46712 ns 46490 ns 14452
count_vector_bool 98263 ns 97656 ns 6400
count_bitset 111041 ns 109863 ns 6400
count_bitset_smart 11085 ns 10986 ns 64000
count_itsy_bitsy 5509 ns 5580 ns 112000
count_itsy_bitsy_smart 5500 ns 5625 ns 100000
copy_by_hand 259 ns 255 ns 2635294
copy_base 3202 ns 3209 ns 224000
copy_vector_bool 192254 ns 192540 ns 3733
copy_bitset 146175 ns 146484 ns 4480
copy_bitset_operator 262 ns 261 ns 2635294
copy_itsy_bitsy 258 ns 262 ns 2800000
copy_itsy_bitsy_operator 259 ns 255 ns 2635294
sized_copy_by_hand 259 ns 255 ns 2635294
sized_copy_base 3258 ns 3278 ns 224000
sized_copy_vector_bool 220069 ns 219727 ns 3200
sized_copy_bitset 151193 ns 149972 ns 4480
sized_copy_itsy_bitsy 269 ns 267 ns 2635294
There is still performance left on the table here in some cases, but the results are clear: our bitsy::bit_iterators and data structures severely outperform std::vector<bool> from current-generation libstdc++ (GCC 9) and either meet or exceed the current algorithmic benchmarks in all categories.
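The count numbers above come from the same word-at-a-time idea. As a sketch of the technique (again, not the library's actual code):

#include <bit>     // std::popcount (C++20)
#include <cstddef>
#include <cstdint>

// Sketch: one popcount per 32-bit word instead of one branch per bit.
std::size_t count_ones(const std::uint32_t* words, std::size_t word_count) {
     std::size_t total = 0;
     for (std::size_t i = 0; i < word_count; ++i) {
          total += static_cast<std::size_t>(std::popcount(words[i]));
     }
     return total;
}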
Here are the results for VC++ from Visual Studio 16.2.3:
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
sized_copy_by_hand 554 ns 558 ns 1120000
sized_copy_base 2670 ns 2668 ns 263529
sized_copy_vector_bool 201823 ns 204041 ns 3446
sized_copy_bitset 189604 ns 188354 ns 3733
sized_copy_itsy_bitsy 190 ns 188 ns 3733333
copy_by_hand 606 ns 614 ns 1120000
copy_base 2431 ns 2431 ns 263529
copy_vector_bool 244844 ns 245857 ns 2987
copy_bitset 171791 ns 172631 ns 4073
copy_bitset_operator 147 ns 146 ns 4480000
copy_itsy_bitsy 170 ns 167 ns 3733333
copy_itsy_bitsy_operator 145 ns 148 ns 4977778
count_by_hand 2728 ns 2727 ns 263529
count_base 67899 ns 65569 ns 11200
count_vector_bool 132797 ns 131830 ns 4978
count_bitset 98924 ns 97656 ns 6400
count_bitset_smart 5299 ns 5301 ns 112000
count_itsy_bitsy 2563 ns 2550 ns 263529
count_itsy_bitsy_smart 2601 ns 2609 ns 263529
equal_by_hand 1023 ns 1001 ns 640000
equal_memcmp 518 ns 516 ns 1120000
equal_base 65633 ns 65569 ns 11200
equal_vector_bool 230969 ns 230164 ns 2987
equal_vector_bool_operator 525 ns 531 ns 1000000
equal_bitset 146907 ns 146484 ns 4480
equal_bitset_operator 528 ns 531 ns 1000000
equal_itsy_bitsy 532 ns 531 ns 1000000
equal_itsy_bitsy_operator 527 ns 516 ns 1120000
sized_fill_by_hand 515 ns 516 ns 1120000
sized_fill_base 1805 ns 1800 ns 373333
sized_fill_vector_bool 113288 ns 111607 ns 5600
sized_fill_bitset 145251 ns 144385 ns 4978
sized_fill_bitset_smart 138 ns 138 ns 4977778
sized_fill_itsy_bitsy 162 ns 164 ns 4480000
sized_fill_itsy_bitsy_smart 167 ns 167 ns 3733333
fill_by_hand 504 ns 502 ns 1120000
fill_base 1824 ns 1842 ns 407273
fill_vector_bool 149444 ns 149613 ns 4073
fill_bitset 145098 ns 147524 ns 4978
fill_bitset_smart 147 ns 148 ns 4977778
fill_itsy_bitsy 566 ns 572 ns 1120000
fill_itsy_bitsy_smart 163 ns 161 ns 4072727
find_by_hand 64005 ns 64174 ns 11200
find_base 45115 ns 45516 ns 15448
find_vector_bool 114227 ns 114397 ns 5600
find_bitset 78854 ns 77424 ns 7467
find_itsy_bitsy 64992 ns 64523 ns 8960
is_sorted_until_by_hand 49151 ns 48828 ns 11200
is_sorted_until_base 65651 ns 65569 ns 11200
is_sorted_until_vector_bool 231133 ns 234375 ns 3200
is_sorted_until_bitset 161046 ns 163923 ns 4480
is_sorted_until_itsy_bitsy 1459 ns 1475 ns 497778
is_sorted_by_hand 52399 ns 51618 ns 11200
is_sorted_base 68395 ns 68011 ns 8960
is_sorted_vector_bool 236765 ns 235395 ns 2987
is_sorted_bitset 162437 ns 163923 ns 4480
is_sorted_itsy_bitsy 1420 ns 1444 ns 497778
noop 0.325 ns 0.328 ns 1000000000
Note that the performance here might not be as great. In order to preserve constexpr, we could not use intrinsics in VC++, because none of their intrinsics are constexpr. Their compiler does not have std::is_constant_evaluated() in any form, which makes it impossible to use low-level, fast functions in MSVC for the algorithms and other places without giving constexpr the guillotine. Hopefully, the situation on that front will improve and we will get better constexpr-related goodies in the releases to come.
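For the curious, the dispatch MSVC currently cannot express looks roughly like this on GCC and Clang. This is a sketch of the pattern, not itsy.bitsy's actual code:

#include <cstdint>
#include <type_traits>

// Portable loop, legal during constant evaluation.
constexpr int first_set_bit_loop(std::uint32_t value) noexcept {
     for (int i = 0; i < 32; ++i) {
          if ((value >> i) & 1u) { return i; }
     }
     return 32;
}

constexpr int first_set_bit(std::uint32_t value) noexcept {
     if (std::is_constant_evaluated()) {
          return first_set_bit_loop(value); // compile time: no intrinsics
     }
#if defined(__GNUC__)
     // Run time: free to use the fast intrinsic (undefined for 0, so guard).
     return value != 0 ? __builtin_ctz(value) : 32;
#else
     return first_set_bit_loop(value);
#endif
}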
I'm almost certain Alisdair Meredith and Vincent Reverdy would be proud to know that the papers they thought up 13 and 3 years ago, respectively, can enable such great performance. And the best part about this is…
You Can Have It Too
While I implemented this for libstdc++ and for itsy.bitsy, the performance is not limited to the standard or to itsy.bitsy. The point is that the iterators are not an implementation-defined, private mess. Nor are they like std::bitset, which has no iterators or begin()/end() at all. These are well-defined, public interfaces that are going to go into the standard. That means if you write a data structure that works with bits and opt to use bitsy::bit_iterator or __gnu_cxx::bit_iterator, you get the performance benefits when using std::equal and std::find and std::copy too. So rather than sinking the time of the tiny handful of standard library maintainers into optimizing a single container, they optimize the algorithm and everyone gets to benefit.
Note that there are even more optimizations to perform here: there are plenty of optimization opportunities when someone copies a bit_iterator<std::byte*> to, say, a bit_iterator<std::uint64_t*>. The current overloads don't handle such a case (because it is a difficult case to handle generically and with all due speed). Being able to tackle that in a far more generic fashion will be important to the continued longevity of making the basic abstractions we use faster and better than before.
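For contrast, the homogeneous, word-aligned case is the easy one: when source and destination share a word type and both start on a word boundary, the copy degenerates into a plain word copy plus a masked tail. A simplified sketch under exactly those assumptions:

#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch: copy bit_count bits between buffers of the same word type,
// assuming both start on a word boundary.
void copy_bits_aligned(const std::uint64_t* first, std::size_t bit_count,
                       std::uint64_t* out) {
     std::size_t whole_words = bit_count / 64;
     std::copy(first, first + whole_words, out); // bulk of the work: word copy
     std::size_t tail = bit_count % 64;
     if (tail != 0) {
          // Merge the trailing partial word without touching bits past the end.
          std::uint64_t mask = (std::uint64_t{1} << tail) - 1;
          out[whole_words] = (out[whole_words] & ~mask) | (first[whole_words] & mask);
     }
}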
This also has further downstream consequences: as pretty as Conor Hoekstra's code is, nobody who works in performance-critical areas is going to use that beautiful solution if it exhibits std::vector<bool> levels of performance. Performance is correctness, and if I can write a disgusting hand-optimized loop and get better performance, you had better believe I'm going to write that disgusting code that would probably make Conor very sad.
But!
The liberating point about this exploration is that we can have our beautiful algorithmic cake and develop a powerful algorithmic intuition, and – as long as we commit to improving the algorithm – we can have our speedy cake too and share it with everyone. That means that you, the application developer, can focus on algorithmic intuition and reach Sean Parent levels of "that's a rotate". And the library developers and interested parties can write the optimization into their algorithms and other places. Once. For everyone. For all time. And finally move on to doing more interesting things than twiddling bits, something that should've been a solved problem for 20 years! Algorithms no longer need to be for the Computer Science Gods and the Geniuses: we, the proletariat, can seize the means of performance from the bourgeoisie, and the regular plebeian developer can write std::copy and std::equal and have it be exactly as fast as intended…!
But I digress. There is a lot more to do, but this is it for now. The GCC patch can be reviewed, albeit it's likely going to be rejected, because there's some work I need to do beforehand to make sure the tests can work in an isolated GCC. Most notably, I need to implement a std::span for libstdc++, and I need a std::ranges::subrange type. Then all the tests will work directly.
Remember to optimize your algorithms, not your data structures, and hug your local library developer (with permission).
Catch you later. 💚
P.S.: double underscores are ugly and I hate them with every fiber of my being.