NXX39
printf string size specifiers

Draft Proposal,

Previous Revisions:
None
Authors:
Paper Source:
GitHub
Issue Tracking:
GitHub
Project:
ISO/IEC 9899 Programming Languages — C, ISO/IEC JTC1/SC22/WG14
Proposal Category:
Change Request, Feature Request
Target:
C2y

Abstract

1. Revision History

1.1. Revision 1 - May 27th, 2025

1.2. Revision 0 - May 27th, 2025

2. Introduction & Motivation

It is impossible to use anything other than an int for the precision (size) of a string specifier, whether it’s used with %.*s or %.*ls. Normally, this should not be a problem because fprintf and many other <stdio.h> and other I/O functions in C only ever return int. The problem is, most:

and so much more are not int-typed. This results in a lot of excessive (and, in some ways, dangerous) casting for working with the I/O output functions. The simple, easy-integration fix is to simply allow precision with .* to include a size modifier, such that while %.*s is a string sized by an int, %.z*s represents a string sized by (the signed version of) a size_t.

It is also important for strings that are not null terminated, such as substring functionality and parsing/searching. Needing to make sure things are null terminated is a huge burden, and while the int precision modifier helps, the constant casting hides potential overflow errors from high quality of implementation libraries and makes its use dubious.
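As a point of comparison, the status quo can be sketched as follows. This is a minimal illustration, not code from any existing library: format_prefix is a hypothetical helper name, and the explicit range check is what a careful caller has to write by hand today before the (int) cast.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats the first n bytes of s into buf using the
   existing int-typed precision. Returns the snprintf result, or -1 if n
   cannot be represented as an int. */
int format_prefix(char *buf, size_t bufsize, const char *s, size_t n) {
  assert(s != NULL);
  if (n > (size_t)INT_MAX) {
    return -1; /* refuse instead of silently truncating the length */
  }
  /* The cast is safe only because of the check above. */
  return snprintf(buf, bufsize, "%.*s", (int)n, s);
}
```

Calling format_prefix(buf, sizeof buf, "hello, world", 5) stores "hello" in buf. Every call site that skips the range check and writes only the cast loses this protection, which is the hazard the rest of this paper is concerned with.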

This proposal is to allow the typical integer length modifiers (hh, h, l, ll, j, z, t, wN, and wfN) to be applied to the precision modifier when the precision modifier uses an asterisk (i.e., .*). This proposal also adds a new precision argument character, ^, which can be used in place of * to indicate an unsigned type for a precision argument. We choose this character because it is not burdened by existing implementation extensions and can scale to other locations (e.g., as the field width).

3. Design

Given the following grammar (using the notation from POSIX, where things enclosed in [ ] are optional):

% [argument$] [flags] [width] [ . precision] [length modifier] conversion-specifier

([argument$] is a POSIX extension), then the logical place in the grammar to place the length modifier that applies specifically to the precision argument is:

% [argument$] [flags] [width] [ . [length modifier] precision] [length modifier] conversion-specifier

This is the easiest place for this to be where it won’t be ambiguous. In particular, placing it in other locations could have it confused for a conversion-specifier, and putting it up ahead of the [flags]/[argument$] but having it apply to the .precision itself means that we would preclude having such a modifier on [width] itself. (This paper does not propose this for [width], just for asterisk-based .* and the newly-proposed .^ precisions).

Therefore, this design slots it into the one place it can have no negative impact and would be unambiguous: after the ., but before the * (or the newly-proposed ^) of the precision:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

extern size_t big_honkin_number;

int main () {
  char* str = malloc(big_honkin_number);
  // ...
  int result = printf("%.z^s", big_honkin_number, str); // no cast needed
  // ...
  free(str);
  return 0;
}

3.1. "But fprintf and friends only return int, isn’t this a problem?"

Thankfully, this is actually less of a problem than was previously surmised. In fact, this proposal actively makes it less of a problem than the cast-based solution. Consider the existence of a "/dev/null" file that can be written to and this program:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <assert.h>

int main() {
  enum { COUNT = 10, BYTESIZE = INT_MAX / COUNT };
  char* str = (char*)malloc(BYTESIZE + 1);
  for (size_t i = 0; i < BYTESIZE; ++i) {
    str[i] = 'a';
  }
  str[BYTESIZE]                          = '\0';
  FILE* f                                = fopen("/dev/null", "w+");
  [[maybe_unused]] int write_value       = fprintf(f,
    "%.*s", BYTESIZE, str);
  [[maybe_unused]] int large_write_value = fprintf(f,
    "%.*s %.*s %.*s %.*s %.*s %.*s %.*s %.*s %.*s %.*s %.*s",
    BYTESIZE, str, BYTESIZE, str, BYTESIZE, str, BYTESIZE, str,
    BYTESIZE, str, BYTESIZE, str, BYTESIZE, str, BYTESIZE, str,
    BYTESIZE, str, BYTESIZE, str, BYTESIZE, str);
  free(str);
  assert(write_value == BYTESIZE); // Well.
  assert(large_write_value < 0); // ... Okay.
  return 0;
}

For both write_value and large_write_value, the individual sizes of the strings are not what is ultimately the problem here. In fact, each of these sizes is an int-typed value (as per the rules for enum constants and their values in both old and new C) and is fully within bounds. But large_write_value effectively creates a situation where, over the course of the 11 strings written, the total number of bytes written exceeds INT_MAX and triggers overflow.

No standard mandates rigorous checking here: there is no constraint or recommended practice to check for overflow. Even so, most implementations, including glibc, musl-libc, and many more, do check whether the write will eventually overflow the int, and either return -1 with an appropriate errno value or return some other negative value. We see here that even with purely int-typed writes, the same error happens on these platforms: all of them return a negative integer value.

What this means, ultimately, is that it is not the type of the length that matters more, but the actual value!

This proposal cannot change the return value’s type for printf or fprintf or any of its family of functions (as that is an ABI break), but allowing a size_t precision argument via the length modifier is actually an improvement to security. Since most implementations are doing value/overflow checking here, a (too-large) size_t can be passed in directly, letting the overflow checks inherent in most implementations catch it and return a negative number. For example, observe the following (too large) string being written, but written in the "typical" way that string sizes get passed to formatted I/O functions like printf:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <assert.h>

int main() {
  const size_t BYTESIZE = ((size_t)INT_MAX) + 1;
  char* str = (char*)malloc(BYTESIZE + 1);
  for (size_t i = 0; i < BYTESIZE; ++i) {
    str[i] = 'a';
  }
  str[BYTESIZE]                    = '\0';
  FILE* f                          = fopen("/dev/null", "w+");
  [[maybe_unused]] int write_value = fprintf(f, "%.*s", (int)BYTESIZE, str);
  free(str);
  assert(write_value < 0); // might not trigger, actually!
  return 0;
}

This is an error. But we will never see it as an error: the explicit cast inserted into the code for the express purpose of matching the type means that the error is now hidden from us. Compilers cannot warn on it (except using tracing analysis which flags BYTESIZE as being truncated by the cast) without generating excessive false positives, as casting is seen as the way to get around this problem and as intentional on the part of the user. Thus, by allowing the size_t value directly, we can avoid hard-to-detect truncation errors that happen from potential (int)BYTESIZE code. Rather than (erroneously) casting and truncating the value of a size_t into an int type or similar, the value will instead actually be checked by fprintf, fwprintf, and similar.

This is a notable improvement because (int)some_too_big_size is seen as an explicit choice on the part of the developer, made to silence warnings or other diagnostics. Casting is too big of a hammer and too large of a club for this feature set; supplying the size without truncation directly to the function allows for existing quality of implementation to catch this error:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <assert.h>

int main() {
  const size_t BYTESIZE = ((size_t)INT_MAX) + 1;
  char* str = (char*)malloc(BYTESIZE + 1);
  for (size_t i = 0; i < BYTESIZE; ++i) {
    str[i] = 'a';
  }
  str[BYTESIZE]                    = '\0';
  FILE* f                          = fopen("/dev/null", "w+");
  [[maybe_unused]] int write_value = fprintf(f, "%.z^s", BYTESIZE, str);
  free(str);
  assert(write_value < 0); // triggers on high quality-of-implementation again!!
  return 0;
}

The paper aims to allow high quality library implementations to catch this class of errors and let them error on it, while making things less cumbersome for the user. This forms the basis of this proposal.
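One concrete way the cast can hide the error: converting ((size_t)INT_MAX) + 1 to int is implementation-defined, and on common two's-complement implementations it typically yields a negative value. A negative precision argument is taken as if the precision were omitted, so the whole string is written up to its null terminator and no error is reported. The sketch below shows only that last, standard-specified step, using a small string and a literal negative precision; format_with_precision is a hypothetical helper name used for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats s into buf with an int-typed precision and
   returns the snprintf result. */
int format_with_precision(char *buf, size_t bufsize, int precision,
                          const char *s) {
  assert(s != NULL);
  return snprintf(buf, bufsize, "%.*s", precision, s);
}
```

With a precision of 3, the string "abcdef" is formatted as "abc". With a precision of -1, the precision is ignored and all of "abcdef" is written, which is exactly the silent behavior a truncating cast can produce for a string without a dependable terminator.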

3.2. Other Positions?

There were a couple of other choices for this insofar as where to put the "length modifier" type. Unfortunately, for all of these:

  1. "%z.^s"

  2. "%.^zs"

There can be minor conflicts in the grammar or ambiguity of application. For (1), it’s unclear whether that is meant to apply to a potential [width] argument or the desired [.precision] argument (which determines whether it should be a formatting error or not). This could block future improvements or modifications to the printf syntax that would allow for different types for the [width] argument. It is not being proposed in this paper, however; this paper is concerned mostly with enabling the use case of typical string and substring data.

For (2), the problem is that it’s unclear when parsing certain things, such as "%.*zu", whether it’s a modifier on the size for the .* or it’s the traditional, current meaning as a precision modifier of int type for a zu type (e.g., int-specified padding on a size_t argument.) Given the grammar, having it appear before the * is both the most grammatically safe and implementable choice (without disambiguation and backwards-compatibility break rules). It also appears before what it modifies -- the * -- which allows a future where some other position can be chosen to modify a potential [width] modifier or other printf extensions.
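The following is a minimal sketch of how a format-string parser might recognize the precision in the position this paper proposes. It is an illustration under stated assumptions, not taken from any implementation: the names prec_kind, prec, modifier_length, and parse_precision are hypothetical, and the wN and wfN modifiers are left out for brevity. A length modifier is treated as part of the precision only when * or ^ follows it directly; otherwise the existing rules apply unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum prec_kind { PREC_NONE, PREC_DIGITS, PREC_STAR, PREC_CARET };

struct prec {
  enum prec_kind kind;
  char modifier[3];     /* "", "hh", "h", "l", "ll", "j", "z", or "t" */
  unsigned long digits; /* value when kind == PREC_DIGITS */
};

/* Number of characters forming a length modifier at p, or 0 if none. */
static size_t modifier_length(const char *p) {
  if (p[0] == 'h') return p[1] == 'h' ? 2 : 1;
  if (p[0] == 'l') return p[1] == 'l' ? 2 : 1;
  if (p[0] == 'j' || p[0] == 'z' || p[0] == 't') return 1;
  return 0;
}

/* Parses an optional precision starting at p and returns a pointer to the
   first character after it. */
const char *parse_precision(const char *p, struct prec *out) {
  assert(p != NULL && out != NULL);
  out->kind = PREC_NONE;
  out->modifier[0] = '\0';
  out->digits = 0;
  if (*p != '.') return p;
  ++p;
  size_t n = modifier_length(p);
  if (p[n] == '*' || p[n] == '^') {
    /* Proposed form: [length modifier] followed by * or ^. */
    memcpy(out->modifier, p, n);
    out->modifier[n] = '\0';
    out->kind = (p[n] == '*') ? PREC_STAR : PREC_CARET;
    return p + n + 1;
  }
  /* Existing form: optional digits; a bare period means precision zero.
     Any length modifier seen above belongs to the conversion specifier. */
  out->kind = PREC_DIGITS;
  while (isdigit((unsigned char)*p)) {
    out->digits = out->digits * 10 + (unsigned long)(*p - '0');
    ++p;
  }
  return p;
}
```

Under this sketch, ".z^s" yields a caret precision with modifier z, while ".*zu" and ".zu" keep their current meanings: the z is left for the conversion specifier that follows.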

3.3. Why .u*s and .zu*s Fails

This proposal previously used "%.zu*s" to specify a const char* string with a size_t length, or "%.u*s" for a const char* string with an unsigned int length. The problem is that, while .u* was mnemonically really good for "unsigned numeric precision argument", it has an immediate problem: the u. "%.u*s" can be parsed as BOTH:

This is a problem due to u not being a modifier but a conversion specifier in C. Design-wise, if u simply modified x or d to make it go from being a "signed" value to an "unsigned" value, this could never happen. But C did not go down that path: instead, unsigned anything is treated as a wholly separate conversion specifier and thus as a terminating sequence for a simple grammar. So, this proposal chooses ^ because it is fewer characters than ** and more aesthetically pleasing. No known extensions seem to be using ^ as a character either, so this clears any ambiguity while leaving the character ^ in a deliberate location that, for implementations which are thorough in their checking, can produce formatting violations for it.
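To make the conflict concrete: under the current rules, "%.u*s" is already a complete and valid format string. The "%.u" part is a u conversion with a precision of zero, and "*s" is ordinary text that is copied through. The sketch below demonstrates this with snprintf; format_existing_meaning is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats value with the already-valid format "%.u*s".
   Today this consumes one unsigned int and then emits the literal "*s". */
int format_existing_meaning(char *buf, size_t bufsize, unsigned int value) {
  assert(buf != NULL || bufsize == 0);
  return snprintf(buf, bufsize, "%.u*s", value);
}
```

Passing the value 5 produces the three characters "5*s". Giving "%.u*s" a second meaning would therefore silently change the behavior of format strings that are valid today.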

3.4. Syntax Alternatives

The alternatives here are (potentially) viable choices that can be chosen instead. Briefly, the pros and cons of such characters are discussed. If they are popular, we plan to poll WG14 about them.

3.4.1. ^

The caret is what this proposal uses. It’s a single character so it doesn’t require any changes to a potential lookahead buffer for handling this sort of thing like ** might do by itself. It’s also aesthetically pleasing, at least in the author’s opinion (or, rather, it’s the least ugly). It can be used in the field width position as well without creating ambiguities, which leaves this as something that can be used in the future if more creative uses of the field width modifiers are considered.

The negative is that it’s a new character. While there’s currently no reported collision with existing practice and extensions, new characters mean there’s less room to maneuver in the future.

3.4.2. **

The double asterisk is a potential choice. It does not use a new character but instead doubles up the current one in the current position. It can be used as a field width as well. It’s novel and does not provide for ambiguities. As an already-used character, it does not provide any chance to clash with existing practice or extensions.

The negative is that it’s a double-character and (in the author’s opinion) a bit ugly to look at.

3.4.3. +

The plus is a potential choice. It’s a single character so it doesn’t require any changes to a potential lookahead buffer for handling this sort of thing like ** might do by itself. It’s aesthetically pleasing and somewhat resembles the idea of an "unsigned" (positive) value. As an already-used character, it does not provide any chance to clash with existing practice or extensions.

The negative is that it is very questionable whether or not this can be a field width, as %+ is already a valid introductory sequence and is ambiguous on whether that is the old + for adding a sign to an integer or a new + for allowing an integer of a different (unsigned) type to be the field width. This means that the concept of a "length modifier" for the precision cannot be extended to the field width in the future without contorting the concept or once again creating new rules for that specific situation.

4. Wording

The following wording is against the latest draft of the C standard.

4.1. Modify §7.23.6.2 "The fprintf function"

7.24.6.2 The fprintf function
Synopsis
#include <stdio.h>
int fprintf(FILE * restrict stream, const char * restrict format, ...);
Description
...

...

Each conversion specification is introduced by the character %. After the %, the following appear in sequence:

  • ...

  • An optional precision that gives the minimum number of digits to appear for the b, B, d, i, o, u, x, and X conversions, the number of digits to appear after the decimal-point character for a, A, e, E, f, and F conversions, the maximum number of significant digits for the g and G conversions, or the maximum number of bytes to be written for s conversions. The precision takes the form of a period (.) optionally followed by one of:

    • an optional length modifier followed by an asterisk * (described later);
    • an optional length modifier followed by a caret ^ (described later);
    • or, a nonnegative decimal integer.

If only the period is specified, the precision is taken as zero. If a precision appears with any other conversion specifier, the behavior is undefined.

  • ...

As noted previously, a field width may be indicated with an asterisk. A precision may be indicated with an asterisk or a caret. An asterisk means an int argument supplies the field width. If the precision is an asterisk, an int argument or an argument of signed integer type (indicated by an optional length modifier) supplies the precision. If the precision is a caret '^', an unsigned int argument or an argument of unsigned integer type (indicated by an optional length modifier) supplies the precision. The arguments specifying field width, or precision, or both, shall appear (in that order) before the argument (if any) to be converted. A negative field width argument is taken as a - flag followed by a positive field width. A negative precision argument is taken as if the precision were omitted.

...

The length modifiers and their meanings are:

hh Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to a signed char or unsigned char argument (the argument will have been promoted according to the integer promotions, but its value shall be converted to signed char or unsigned char before printing); or that a following n conversion specifier applies to a pointer to a signed char argument. If it is followed by an asterisk, then it specifies that the corresponding argument is of type signed char. If it is followed by a caret, it specifies that the corresponding argument is of type unsigned char.
h Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to a short int or unsigned short int argument (the argument will have been promoted according to the integer promotions, but its value shall be converted to short int or unsigned short int before printing); or that a following n conversion specifier applies to a pointer to a short int argument. If it is followed by an asterisk then it specifies that the corresponding argument is of type short int. If it is followed by a caret '^', it specifies that the corresponding argument is of type unsigned short int.
l (ell) Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to a long int or unsigned long int argument; that a following n conversion specifier applies to a pointer to a long int argument; that a following c conversion specifier applies to a wint_t argument; that a following s conversion specifier applies to a pointer to a wchar_t argument; or has no effect on a following a, A, e, E, f, F, g, or G conversion specifier. If it is followed by an asterisk then it specifies that the corresponding argument is of type long int. If it is followed by a caret '^', it specifies that the corresponding argument is of type unsigned long int.
ll (ell-ell) Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to a long long int or unsigned long long int argument; or that a following n conversion specifier applies to a pointer to a long long int argument. If it is followed by an asterisk then it specifies that the corresponding argument is of type long long int. If it is followed by a caret '^', it specifies that the corresponding argument is of type unsigned long long int.
j Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to an intmax_t or uintmax_t argument; or that a following n conversion specifier applies to a pointer to an intmax_t argument. If it is followed by an asterisk then it specifies that the corresponding argument is of type intmax_t. If it is followed by a caret '^', it specifies that the corresponding argument is of type uintmax_t.
z Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to a size_t or the corresponding signed integer type argument; or that a following n conversion specifier applies to a pointer to a signed integer type corresponding to size_t argument. If it is followed by an asterisk then it specifies that the corresponding argument is of the corresponding signed type of size_t. If it is followed by a caret '^', then it specifies that the corresponding argument is of type size_t.
t Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to a ptrdiff_t or the corresponding unsigned integer type argument; or that a following n conversion specifier applies to a pointer to a ptrdiff_t argument. If it is followed by an asterisk then it specifies that the corresponding argument is of type ptrdiff_t. If it is followed by a caret '^', then it specifies that the corresponding argument is of the corresponding unsigned type of ptrdiff_t.
wN Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to an integer argument with a specific width where N is a positive decimal integer with no leading zeros (the argument will have been promoted according to the integer promotions, but its value shall be converted to the unpromoted type); or that a following n conversion specifier applies to a pointer to an integer type argument with a width of N bits. If it is followed by an asterisk then it specifies that the corresponding argument is of N-bit integer type. If it is followed by a caret '^', it specifies that the corresponding argument is of N-bit unsigned integer type. All minimum-width integer types (7.22.2.3) and exact-width integer types (7.22.2.2) defined in the header <stdint.h> shall be supported. Other supported values of N are implementation-defined.
wfN Specifies that a following b, B, d, i, o, u, x, or X conversion specifier applies to a fastest minimum-width integer argument with a specific width where N is a positive decimal integer with no leading zeros (the argument will have been promoted according to the integer promotions, but its value shall be converted to the unpromoted type); or that a following n conversion specifier applies to a pointer to a fastest minimum-width integer type argument with a width of N bits. If it is followed by an asterisk then it specifies that the corresponding argument is of N-bit fastest minimum-width integer type. If it is followed by a caret '^', it specifies that the corresponding argument is of N-bit fastest minimum-width unsigned integer type. All fastest minimum-width integer types (7.22.2.4) defined in the header <stdint.h> shall be supported. Other supported values of N are implementation-defined.

If a length modifier appears with any conversion specifier other than as specified previously, the behavior is undefined.
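The following sketch is not part of the proposed wording. It illustrates, under stated assumptions, how an implementation might fetch the precision argument once the length modifier and the * or ^ have been parsed. The names len_mod, precision, and read_precision are hypothetical, only a subset of the modifiers is covered, and the signed counterpart of size_t is left out because it has no portable name.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

enum len_mod { MOD_NONE, MOD_L, MOD_LL, MOD_J, MOD_Z };

struct precision {
  int present;     /* 0 if the precision is treated as omitted */
  uintmax_t value; /* meaningful only when present != 0 */
};

/* A negative asterisk-supplied precision is taken as if omitted. */
static struct precision from_signed(intmax_t v) {
  struct precision p = { v >= 0, v >= 0 ? (uintmax_t)v : 0 };
  return p;
}

/* A caret-supplied precision is unsigned and therefore always present. */
static struct precision from_unsigned(uintmax_t v) {
  struct precision p = { 1, v };
  return p;
}

/* Reads one precision argument from the variable arguments, using the type
   selected by the length modifier and by * (is_caret == 0) or ^. */
struct precision read_precision(enum len_mod mod, int is_caret, ...) {
  struct precision p = { 0, 0 };
  va_list ap;
  assert(is_caret == 0 || is_caret == 1);
  va_start(ap, is_caret);
  switch (mod) {
  case MOD_NONE:
    p = is_caret ? from_unsigned(va_arg(ap, unsigned int))
                 : from_signed(va_arg(ap, int));
    break;
  case MOD_L:
    p = is_caret ? from_unsigned(va_arg(ap, unsigned long))
                 : from_signed(va_arg(ap, long));
    break;
  case MOD_LL:
    p = is_caret ? from_unsigned(va_arg(ap, unsigned long long))
                 : from_signed(va_arg(ap, long long));
    break;
  case MOD_J:
    p = is_caret ? from_unsigned(va_arg(ap, uintmax_t))
                 : from_signed(va_arg(ap, intmax_t));
    break;
  case MOD_Z:
    /* Only the caret form is handled in this sketch. */
    if (is_caret) p = from_unsigned(va_arg(ap, size_t));
    break;
  }
  va_end(ap);
  return p;
}
```

For example, read_precision(MOD_Z, 1, (size_t)SIZE_MAX) reports a present precision of SIZE_MAX without any narrowing, leaving the implementation free to detect that the resulting write cannot fit in the int return value.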

4.2. NOTE: IDENTICAL CHANGES TO fwprintf!