“We demand rigidly defined areas of doubt and uncertainty!” 
 ― Douglas Adams 
1. Introduction
A new text formatting facility ([P0645]) was adopted into the draft standard for C++20 in Cologne. Unfortunately, it left the units of width and precision unspecified, which created an ambiguity for string arguments in variable-width encodings ([LWG3290]). This paper proposes fixing this shortcoming by specifying width and precision in a way that satisfies the following goals:
- addressing the main use case
- locale-independence by default
- Unicode support
- ordinary and wide execution encoding support
- consistency with SG16’s long-term direction
- following existing practice
- ease of implementation
2. Motivating example
To the best of our knowledge, the main use case for the string width and precision format specifiers is to align text when displayed in a terminal with a monospaced font. The motivating example is a columnar view in a typical command-line interface:
We would like to be able to produce similar or better output with the C++20 formatting facility using the most natural API, namely dynamic width:
// Prints names in num_cols columns of col_width width each.
void print_columns(const std::vector<std::string>& names,
                   int num_cols, int col_width) {
  for (size_t i = 0, size = names.size(); i < size; ++i) {
    std::cout << std::format("{0:{1}}", names[i], col_width);
    if (i % num_cols == num_cols - 1 || i == size - 1) std::cout << '\n';
  }
}

std::vector<std::string> names = {
  "Die Allgemeine Erklärung der Menschenrechte",
  "『世界人権宣言』",
  "Universal Declaration of Human Rights",
  "Всеобщая декларация прав человека",
  "世界人权宣言",
  "ΟΙΚΟΥΜΕΝΙΚΗ ΔΙΑΚΗΡΥΞΗ ΓΙΑ ΤΑ ΑΝΘΡΩΠΙΝΑ ΔΙΚΑΙΩΜΑΤΑ"
};

print_columns(names, 2, 60);
Desired output:
(Note that spacing in front of 
3. Prior art
Display width is a well-established concept. In particular, POSIX defines the wcswidth function:
The wcswidth() function shall determine the number of column positions required for n wide-character codes (or fewer than n wide-character codes if a null wide-character code is encountered before n wide-character codes are exhausted) in the string pointed to by pwcs.
Many languages have implementations of 
- C: wcwidth
- Go: go-runewidth
- JavaScript: wcwidth.js
- Julia: Base.UTF8proc.charwidth, Base.strwidth
- Perl: Text::CharWidth
- Python: wcwidth
- Ruby: unicode-display_width
GitHub code search returns over 500,000 results for "wcwidth" and 180,000 results for "wcswidth".
The number of implementations of this facility, together with the large usage numbers, indicates that it is an important use case. All of the above implementations work exclusively with Unicode.
4. Locale and execution encodings
One of the major design features of the C++20 formatting facility ([P0645]) is
locale independence by default with locale-aware formatting available as an
opt-in via separate format specifiers. This has an important safety property
that the result of 
Another observation is that the terminal’s encoding is independent from the
execution encoding. For example, on Windows it’s possible to change the
console’s code page with 
$ ls
Die Allgemeine Erklц╓rung der Menschenrechte
Universal Declaration of Human Rights
Д╦√Г•▄Д╨╨Ф²┐Е╝ёХ╗─
н÷н≥н н÷н╔н°н•н²н≥н н≈ н■н≥н▒н н≈н║н╔н·н≈ н⌠н≥н▒ н╓н▒ н▒н²н≤н║н╘н═н≥н²н▒ н■н≥н н▒н≥н╘н°н▒н╓н▒
п▓я│п╣п╬п╠я┴п╟я▐ п╢п╣п╨п╩п╟я─п╟я├п╦я▐ п©я─п╟п╡ я┤п╣п╩п╬п╡п╣п╨п╟
ь╖ы└ь╔ь╧ы└ь╖ы├ ь╖ы└ь╧ь╖ы└ы┘ы┼ ы└ь╜ы┌ы┬ы┌ ь╖ы└ь╔ы├ьЁь╖ы├
$ LC_ALL=ru.RU.KOI8-R ls
Die Allgemeine Erkl??rung der Menschenrechte
Universal Declaration of Human Rights
?????????????????????? ?????????????????? ?????? ???? ?????????????????? ????????????????????
???????????????? ???????????????????? ???????? ????????????????
?????????????? ?????????????? ?????????? ??????????????
??????????????????
Therefore, for the purposes of specifying width, the output of 
5. Windows
According to the Windows documentation ([WINI18N]):
Most applications written today handle character data primarily as Unicode, using the UTF-16 encoding.
and
New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization.
Code pages are used primarily by legacy applications or those communicating with legacy applications such as older mail servers.
Since 
6. Precision
Precision, when applied to a string argument, specifies how many characters will
be used from the string. It can be used to truncate long strings in the columnar
output as in the motivating example shown earlier. Because it works with a
single argument and only for some argument types, it is not particularly useful
for truncating output to satisfy storage requirements.
Since precision and width address the same use case, we think that they should be measured in the same units.
7. Proposal
To address the main use case, we propose using the display width of a string, i.e. the number of column positions needed to display the string in a terminal, for both width and precision.
There is a spectrum of solutions to the problem of estimating display width,
from always wrong (return 42 times the number of code units) and almost always
wrong (code units and 
To satisfy the locale-independence property we propose that for the purposes
of display width computation the default should be Unicode on systems that
support display of Unicode text in a terminal or fixed implementation-defined
encodings otherwise. In particular this allows using EBCDIC on z/OS and ASCII on
resource-constrained embedded systems that may not want to provide even minimal
Unicode handling capabilities.
On Unicode-capable systems both 
Using a fixed system encoding is completely safe because formatting functions
don’t do any transcoding. So the worst thing that can happen is that the display
width will be estimated incorrectly leading to misaligned text which is what
already happens when you pass a variable-width string to 
The native encoding of an ordinary character string is the operating system dependent current encoding for pathnames.
For Unicode, the first step in computing width is to break the string into
grapheme clusters because the latter correspond to user-perceived characters
([UAX29]). Then the width should be adjusted to account for graphemes that
take two column positions as it is done, for example, in the Unicode
implementation of 
Width estimation can be done efficiently with a single pass over the input and optimized for the case of no variable-width characters. It has zero overhead when no width is specified or when formatting non-string arguments.
We also propose adding a new format specifier in C++23 for computing display width of a string argument based on the locale’s encoding, for example:
std::locale::global(std::locale("ru_RU.KOI8-R"));
std::string message = std::format("{:6ls}", "\xd4\xc5\xd3\xd4"); // "тест" in KOI8-R
// message == "\xd4\xc5\xd3\xd4  " ("тест  " in KOI8-R)
This will support display width estimation for ordinary and wide execution
encodings.
We think that the current proposal is in line with SG16: Unicode Direction
([P1238]) goal of "Designing for where we want to be and how to get there"
because it creates a clear path for the future 
8. Why not code units?
It might seem tempting at first to measure width in code units because
it is simple and avoids the encoding question. However, it is not very useful in
addressing practical use cases. It is also an evolutionary dead end because
standardizing code units for 
Code units are even less adequate for precision, because they can result in invalid output. For example
std::string s = std::format("{:.2}", "\x41\xCC\x81");
would result in 
printf("%10s - %s\n", "bistro", "a small or unpretentious restaurant");
printf("%10s - %s\n", "café", "a usually small and informal establishment serving various refreshments");
prints
    bistro - a small or unpretentious restaurant
     café - a usually small and informal establishment serving various refreshments
or
    bistro - a small or unpretentious restaurant
    café - a usually small and informal establishment serving various refreshments
depending on how é is represented.
If we want to truncate the output
printf("%.4s...\n", "bistro");
printf("%.4s...\n", "café");
the result is even worse:
bist...
caf<C3>...
9. Limitations
Unlike terminals, GUI editors often use proportional fonts, or fonts that claim to be monospaced but render some characters with a width that is not an integer multiple of the others. Therefore width, regardless of how it is defined, is inherently limited there. However, it can still be useful if the input domain is restricted. Possible use cases are aligning numbers, aligning text in ASCII or another subset of Unicode, or adding code indentation:
// Prints text prefixed with indent spaces.
void print_indented(int indent, std::string_view text) {
  std::cout << fmt::format("{0:>{1}}{2}\n", "", indent, text);
}
Our definition of width fully supports these use cases and gives better results
than 
10. Examples
#include <format>
#include <iostream>
#include <stdio.h>

struct input {
  const char* text;
  const char* info;
};

int main() {
  input inputs[] = {
    {"Text", "Description"},
    {"-----",
     "------------------------------------------------------------------------"
     "--------------"},
    {"\x41", "U+0041 { LATIN CAPITAL LETTER A }"},
    {"\xC3\x81", "U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }"},
    {"\x41\xCC\x81",
     "U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }"},
    {"\xc4\xb2", "U+0132 { LATIN CAPITAL LIGATURE IJ }"}, // Ĳ
    {"\xce\x94", "U+0394 { GREEK CAPITAL LETTER DELTA }"}, // Δ
    {"\xd0\xa9", "U+0429 { CYRILLIC CAPITAL LETTER SHCHA }"}, // Щ
    {"\xd7\x90", "U+05D0 { HEBREW LETTER ALEF }"}, // א
    {"\xd8\xb4", "U+0634 { ARABIC LETTER SHEEN }"}, // ش
    {"\xe3\x80\x89", "U+3009 { RIGHT-POINTING ANGLE BRACKET }"}, // 〉
    {"\xe7\x95\x8c", "U+754C { CJK Unified Ideograph-754C }"}, // 界
    {"\xf0\x9f\xa6\x84", "U+1F984 { UNICORN FACE }"}, // 🦄
    {"\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9\xe2\x80\x8d"
     "\xf0\x9f\x91\xa7\xe2\x80\x8d\xf0\x9f\x91\xa6",
     "U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466 "
     "{ Family: Man, Woman, Girl, Boy } "} // 👨👩👧👦
  };

  std::cout << "\nstd::format with the current proposal:\n";
  for (auto input : inputs) {
    std::cout << std::format("{:>5} | {}\n", input.text, input.info);
  }

  std::cout << "\nprintf:\n";
  for (auto input : inputs) {
    printf("%5s | %s\n", input.text, input.info);
  }
}
Output on macOS Terminal:
Notice that the 
Output on Windows with console codepage set to 65001 (UTF-8) and the active code page unchanged:
The Windows console doesn’t handle combining accents and emoji correctly,
which is unrelated to the question of width. Although it is possible to
implement a workaround for this platform, we advise against it. If the
output is incorrect, it is reasonable to expect the alignment to be incorrect as
well. The new Windows Terminal reportedly handles emoji correctly. Console bugs
aside, 
Notice that although the Windows console is unable to display CJK Unified Ideograph-754C, the width is still computed correctly and a placeholder character is displayed instead. This is a very nice example of a fallback behavior, in this case done by the terminal itself.
Output on GNOME Terminal 3.32.1 in Linux:
Output on Konsole 18.12.3 in Linux:
GNOME Terminal and Konsole, two major terminal emulators on Linux, also cannot handle complex emoji yet but otherwise produce similar results.
11. Implementation
The proposal is implemented in the fmt library, successfully tested on a
variety of platforms, and will become the default for both 
We tested our implementation on macOS Terminal version 2.9.5 (421.2), Windows console on Windows version 10.0.17763.737 and GNOME Terminal 3.32.1 on Linux verifying that the display width is consistent for at least the following Unicode blocks according to our definition of width and produces visually aligned results:
Block range     Block name
==============  ===========================
U+0000..U+007F  Basic Latin
U+0080..U+00FF  Latin-1 Supplement
U+0100..U+017F  Latin Extended-A
U+0180..U+024F  Latin Extended-B
U+0250..U+02AF  IPA Extensions
U+02B0..U+02FF  Spacing Modifier Letters
U+0300..U+036F  Combining Diacritical Marks
U+0370..U+03FF  Greek and Coptic
U+0400..U+04FF  Cyrillic
U+0500..U+052F  Cyrillic Supplement
U+0530..U+058F  Armenian
U+0590..U+05FF  Hebrew
U+0600..U+06FF  Arabic
U+0700..U+074F  Syriac
U+0750..U+077F  Arabic Supplement
U+0780..U+07BF  Thaana
U+07C0..U+07FF  NKo
U+0800..U+083F  Samaritan
U+0840..U+085F  Mandaic
U+0860..U+086F  Syriac Supplement
U+08A0..U+08FF  Arabic Extended-A
U+0900..U+097F  Devanagari
U+0980..U+09FF  Bengali
U+0A00..U+0A7F  Gurmukhi
Even this small subset of Unicode includes 5 of the top 10 writing scripts ranked
by active usage ([SCRIPTS]), with billions of active users. For comparison,
width in 
Additionally, we looked at the Unicode block U+1F300..U+1F5FF Miscellaneous Symbols and Pictographs. Support for code points in this block varies. For example, in the Windows console none of them were displayed correctly, with or without width. In macOS Terminal most of the symbols had width 2, with a few exceptions. Therefore we recommend not declaring any support for this block and using the default of 2. This gives reasonable behavior on systems that support emoji and makes it clear that these symbols may need escaping or another fallback mechanism.
12. Wording
Modify [format.string.std]/p7 as follows:
The positive-integer in width is a decimal integer defining the minimum field width. If width is not specified, there is no minimum field width, and the field width is determined based on the content of the field.
Width of a string is defined as the estimated number of column positions required to display it in a terminal.
[ Note: This is similar to the semantics of the POSIX wcswidth 
   For the purposes of width computation the string is assumed to be in a fixed operating system dependent encoding. If the operating system is capable of displaying Unicode text in a terminal both ordinary and wide encodings are Unicode encodings such as UTF-8 and UTF-16, respectively. [ Note: this is the case for Windows-based and many POSIX-based operating systems. — end note ] Otherwise, the encodings are implementation-defined.
Width of a string in a Unicode encoding is the sum of estimated widths of the first code points in its extended grapheme clusters as defined by Unicode® Standard Annex #29 Unicode Text Segmentation. Estimated width of the following code points is 2:
- U+1100..U+115F
- U+2329
- U+232A
- U+2E80..U+303E
- U+3040..U+A4CF
- U+AC00..U+D7A3
- U+F900..U+FAFF
- U+FE10..U+FE19
- U+FE30..U+FE6F
- U+FF00..U+FF60
- U+FFE0..U+FFE6
- U+20000..U+2FFFD
- U+30000..U+3FFFD
- U+1F300..U+1F64F
- U+1F900..U+1F9FF
Estimated width of other code points is 1. [ Note: The method of estimated width computation is subject to change. — end note ]
Width of a string in a non-Unicode encoding is implementation-defined.
Modify [format.string.std]/p9 as follows:
The nonnegative-integer in precision is a decimal integer defining the precision or maximum field size. It can only be used with floating-point and string types. For floating-point types this field specifies the formatting precision.
Removed: For string types it specifies how many characters will be used from the string.
Added: For string types it specifies the maximum width. Trailing grapheme clusters or implementation-defined units of width that exceed the given precision are ignored.
Optional (possibly in C++23):
    The l 
13. Acknowledgements
We would like to thank Tom Honermann for bringing the issue of ambiguous width to our attention and Henri Sivonen for a very detailed and insightful post on various definitions of Unicode string length ([LENGTH]) which helped us in researching the topic.