SG16: Unicode meeting summaries 2021-04-14 through 2021-05-26
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings.  This paper contains a
snapshot of select meeting summaries from that repository.
Previously published SG16 meeting summary papers:
April 14th, 2021
Draft agenda:
Attendees:
  - Corentin Jabot
- Hubert Tong
- JeanHeyd Meneide
- Jens Maurer
- Mark Zeren
- Peter Bindels
- Peter Brett
- Steve Downey
- Tom Honermann
- Zach Laine
Meeting summary:
  - PBrett introduced the agenda.
- P2295R2: Correct UTF-8 handling during phase 1 of translation
    
      - Corentin introduced:
        
          - This is a proposal to require that UTF-8 be one of the set of
              otherwise implementation-defined source file encodings.
- With regard to ill-formed code unit sequences, there is no such
              thing; the source code is either valid UTF-8 or it is not
              UTF-8.
- Gcc does not validate its presumed UTF-8 input.
- With regard to BOMs, the proposal does not impose any
              requirements other than that a BOM present in a UTF-8 source
              file be ignored for the purposes of lexing.
- An implementation may use the presence or non-presence of a BOM
              as part of its source file encoding determination.
- The proposed wording will require updates for changes that will
              presumably be adopted from Jens'
              P2314: Character sets and encodings.
- This proposal follows Beman Dawes' earlier proposal,
              N3463: Portable Program Source Files.
- At present, the C++ standard has no requirement for a portable
              source file.
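
As an illustration of the "valid UTF-8 or not UTF-8" distinction above, the following is a minimal sketch of a well-formedness check. The function name `is_well_formed_utf8` is hypothetical; it is not taken from P2295 or from any compiler, and it only mirrors the byte ranges listed in the Unicode Standard's table of well-formed UTF-8 byte sequences.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not from P2295 or any compiler): returns true if
// `bytes` is a well-formed UTF-8 code unit sequence.
constexpr bool is_well_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t trailing = 0;            // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the next byte
        if (lead <= 0x7F) {
            trailing = 0;                    // ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            trailing = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trailing = 2;
            if (lead == 0xE0) lo = 0xA0;     // excludes overlong encodings
            if (lead == 0xED) hi = 0x9F;     // excludes surrogate code points
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trailing = 3;
            if (lead == 0xF0) lo = 0x90;     // excludes overlong encodings
            if (lead == 0xF4) hi = 0x8F;     // excludes values above U+10FFFF
        } else {
            return false;                    // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        }
        if (n - i <= trailing) return false; // sequence is truncated
        for (std::size_t k = 1; k <= trailing; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if (c < lo || c > hi) return false;
            lo = 0x80;                       // only the second byte has a narrowed range
            hi = 0xBF;
        }
        i += trailing + 1;
    }
    return true;
}
```

Under the direction discussed here, a source file for which such a check fails would be diagnosed rather than silently accepted; how an implementation decides that a file is meant to be UTF-8 in the first place is a separate, implementation-defined step.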
 
- Tom stated that gcc will perform UTF-8 validation if both
          -finput-charset=utf-8 and -fexec-charset=utf-8
          are specified.
- [ Editor's note: Tom was wrong (and since Tom is also the editor,
          he can be blunt like that); gcc only validates UTF-8 for string
          literals, and then only if -fexec-charset=<encoding>
          is specified. ]
- Jens noted a capitalization issue in the wording; the sentence
          following the added note in [lex.phases]p1 has a capitalized "The"
          following a ";".
- Jens asked why the note added to [lex.phases]p1 is just a note; the
          preceding prose provides a definition, but does not impose any
          requirements.
- PBrett responded that, if an invalid sequence is present, then there
          is no sequence of Unicode scalar values.
- PBrett asked if moving the note after the following sentence would
          resolve the concern.
- Jens replied that it would not; that would define a UTF-8 source file
          and state that a well-formed UTF-8 source file must be accepted, but
          would impose no requirements on an ill-formed UTF-8 source file.
- PBrett acknowledged that further wording work is needed.
- Jens observed, and noted that the paper discusses, that
          implementations can accept source files that approximate UTF-8.
- Hubert noted that a normative statement is needed to state that it is
          implementation-defined how a requirement for UTF-8 source files is
          specified.
- PBindels suggested placing a requirement for well-formed input with
          the character set definitions.
- Jens indicated no objection to clarification, but that he would like
          to see the ISO 10646 definition of "well-formed".
- Steve observed that the note is stating that invalid UTF-8 sequences
          cannot happen in a well-formed UTF-8 source file.
- Jens responded that there is a normative difference between
          something that cannot happen and something that is ill-formed; the
          latter requires a diagnostic.
- Hubert asserted that the wording needs to establish intent; a
          sequence of bytes may happen to be well-formed UTF-8, but the
          wording needs to ensure that the bytes were intended to be
          interpreted as UTF-8.
- PBindels summarized; we need to state there is an
          implementation-defined way to specify that a source file is to be
          interpreted as UTF-8.
- Jens agreed.
- JeanHeyd agreed from chat, "Yes, Hubert's definition is correct. You
          have to make it so the implementation has a way to mark/identify a
          source file as UTF-8, and then you can impose these requirements."
- Corentin stated the intent; that the compiler determine the source
          encoding in an implementation-defined way, but that a source file
          that does not decode successfully is diagnosed as ill-formed.
- Tom suggested specifying that the file must decode successfully as
          opposed to being well-formed.
- PBrett stated that a branch is needed in translation phase 1 to
          distinguish the cases where the source file is encoded as UTF-8 vs
          some other encoding.
- Zach suggested that a definition for a UTF-8 source file is
          unnecessary.
- PBindels expressed concern that there may be a conflict between use
          of a BOM and a truly portable source file.
- PBrett responded that the goal is that, if a source file is UTF-8
          encoded, that there is a way to direct an implementation to process
          it as such.
- Jens acknowledged and added that an implementation could require use
          of a command line option to opt-in to UTF-8 encoded source files;
          that implies that the source file is not automatically portable,
          but is the best we can do.
- Tom agreed and stated that the only way we could do better is to
          require a BOM everywhere and nobody wants that.
- Zach noted that the only statement made regarding a BOM is that it
          can be ignored; presumably after encoding determination is complete
          so that the BOM doesn't interfere with translation phase 2.
- Hubert noted that, once the encoding is determined to be UTF-8, a
          BOM is portably ignored.
- PBrett encouraged assumption of non-hostile implementations; no
          implementation is going to require a BOM in order for a UTF-8
          encoded source file to be processed as such.
- Several relevant comments were made from chat:
        
          - Steve: "We want portable source code. If anyone requires a BOM,
              then portable source code needs one."
- JeanHeyd: "If you put in a BOM and use -fexec-charset=SHIFT-JIS,
              the implementation can ignore the BOM and still read everything
              as SHIFT-JIS."
- Hubert: "If you did that, the BOM is not a BOM..."
 
- Jens suggested that the wording needs to establish when encoding
          determination happens; that should be the first step of translation
          phase 1.
- Jens added that the wording should be consistent with regard to
          encoding vs encoding form vs encoding scheme.
- Tom stated that, for UTF-8, encoding form vs encoding scheme doesn't
          matter, but that encoding scheme should be used if the intent is for
          the wording to be compatible with UTF-16 or UTF-32.
- Hubert asserted that, since the context is byte oriented files,
          encoding scheme should be used.
- Jens reiterated the necessary wording updates; the encoding scheme to
          use must first be established, then the source file can be validated
          and diagnostics issued if it fails to conform to the encoding
          scheme.
- Jens added that the wording needs to prevent the current
          implementation-defined mapping to the internal encoding from being
          applied to UTF-8 source files.
- PBindels asked if the added sentence in translation phase 2 regarding
          the "first codepoint" applies to each source file or just to the
          primary source file.
- Tom and Corentin replied that translation phases 1 through 3 are
          performed separately for each source file.
- Hubert suggested that translation phase 2 should discard a lead
          U+FEFF character regardless of the source file encoding.
- Jens noted that the added translation phase 2 sentence doesn't make
          sense without the wording changes proposed in
          P2314: Character sets and encodings
          due to character translation to universal-character-name in
          translation phase 1.
- Tom noted that the wording changes in P2314 allow distinguishing a
          source file with a BOM and a source file that starts with a
          \uFEFF universal-character-name.
- Jens clarified that, after P2314, a universal-character-name
          isn't translated to a UCS scalar value until translation phase 3.
- Hubert stated that it is a design question whether we want to treat a
          leading \uFEFF universal-character-name as a BOM.
- PBrett asked PBindels if he is satisfied with the BOM design
          following prior discussion.
- PBindels responded that he is, so long as we don't intentionally or
          unintentionally create the situation where UTF-8 source files end up
          requiring a BOM in practice.
- PBrett asked if we should add normative encouragement not to require
          a BOM.
- Hubert noted that, as wording updates are done, care must be taken to
          ensure we don't lose the wording that requires an implementation to
          accept a UTF-8 encoded source file whether it does, or does not,
          contain a BOM.
- Tom asked about handling of differently encoded source files.
- JeanHeyd replied in chat, "I think it's better to leave Encoding
          Identication to Tom's Paper on the subject."
- Tom replied in chat, "Assuming I actually deliver on that
          threat..."
- Hubert responded that the implementation must provide some means for
          standard headers (as opposed to header files), to remain usable when
          the implementation is running in UTF-8 mode.
- Steve added in chat, "Which might be 7 bit ascii for those headers.
          Which is largely the case today."
- We wish to require implementations to support UTF-8 source files.
        
          - Attendance: 10
- No objections to unanimous consent.
 
- We wish to require implementations to be capable of accepting UTF-8
          source files whether or not they begin with a U+FEFF byte order mark.
        
          - Attendance: 10
- No objections to unanimous consent.
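
The BOM behavior polled above can be sketched as follows, assuming the encoding has already been determined to be UTF-8. The helper name `skip_utf8_bom` is hypothetical and does not come from the paper; it drops one leading U+FEFF (encoded as the bytes EF BB BF) and leaves everything else untouched.

```cpp
#include <string_view>

// Hypothetical helper: returns `source` without a leading UTF-8 encoded
// U+FEFF byte order mark. A file without a BOM is returned unchanged.
constexpr std::string_view skip_utf8_bom(std::string_view source) {
    constexpr std::string_view bom = "\xEF\xBB\xBF";
    if (source.substr(0, bom.size()) == bom) {
        source.remove_prefix(bom.size());
    }
    return source;
}
```

Both `skip_utf8_bom("\xEF\xBB\xBF" "int x;")` and `skip_utf8_bom("int x;")` yield the same text for lexing, which is the sense in which a source file is accepted whether or not it begins with a BOM.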
 
- Hubert reported that Clang allows non-UTF-8 encoded header names in
          #include directives in otherwise UTF-8 encoded source
          files.
- Steve stated that, since file names are not required to be
          representable in UTF-8, requiring strictly well-formed UTF-8 could
          have unanticipated consequences.
- JeanHeyd asked in chat, "Does `\xFF` work in header-names as an
          escape?"
- Corentin replied in chat, "unspecified".
- Corentin explained his intent in requiring diagnosis of ill-formed
          UTF-8 input.
- PBindels asked why it is useful to allow invalid UTF-8 in
          comments.
- Corentin replied that Clang source code has comments explaining why
          invalid UTF-8 in comments is explicitly allowed and provided a link
          to the source code.
        
      
- PBrett shared cases of copyright symbols appearing in otherwise ASCII
          files.
- Tom noted that non-ASCII characters tend to appear in author, product,
          and company names in comments.
- Hubert stated that source files that iconv will reject are
          undesirable.
- We wish to require implementations to have a mode in which they diagnose ill-formed UTF-8 source files (regardless of whether the ill-formedness is located in comments, header names or string literals).
        
      
- Consensus is strongly in favor.
- SF: As it stands right now, people are already basically rolling
          the dice with their source files. This is strictly an improvement
          over the status quo, because now there is, at least, one entirely
          portable way to write source code.
- Corentin asked about necessary wording to support both source files
          and non-files.
- Hubert responded that (standard library) headers are not source
          files; source files are those things that are included by
          #include directives that do not name standard headers.
- PBrett asked if the wording should be modified to discuss "input"
          as opposed to "files".
- Hubert responded that such a change is not necessary.
- Corentin pledged to bring back a revised paper.
 
- Tom stated the next telecon will be April 28th.
April 28th, 2021
Draft agenda:
Attendees:
  - Charlie Barto
- Corentin Jabot
- Hubert Tong
- Jens Maurer
- Mark Zeren
- Peter Bindels
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
  - Charlie Barto was welcomed with a round of introductions.
- PBrett introduced the agenda.
- LWG3547: Time formatters should not be locale sensitive by default
    
      - PBrett presented:
        
          - Peter's presentation slides are available
              here.
- As currently specified, whether a format specifier is locale
              dependent is not obvious.
- Floating point values are locale independent by default, but
              chrono values are not.
- There is no systematic way to format locale-independent and
              locale-dependent chrono values.
 
- Victor expressed a preference for chrono values being locale
          independent by default.
- Victor explained that the current specification derived from existing
          specifiers used elsewhere.
- Victor noted that, in some cases, specifiers are not available for
          locale independent formatting.
- Victor reported success with a prototype implementation of the
          proposed resolution that performs locale independent formatting of
          chrono values unless an L specifier is present.
- Charlie stated that changes to the format specifier syntax may have
          more implementation impact than just requiring changes to the
          implementation behavior.
- [ Editor's note: Discussion regarding the amount of time
          available to make changes before implementations of
          std::format() are shipped to users ensued.  That discussion
          is not recorded as it involved discussion of internal company time
          lines that have not yet been stated in public. ]
- PBrett noted that there are two related issues:
        
          - 1: The format specification syntax.
- 2: The behavior of the format specifiers.
 
- PBrett explained that the proposed resolution addresses both concerns
          by making the format syntax consistent in requiring an L
          specifier to opt-in to locale dependent behavior.
- Charlie noted that std::format() does not currently perform
          any transcoding operations; not for format arguments, and not
          for text provided by a locale that uses a different character encoding
          than the literal encoding.
- Charlie added that std::format() does need to be encoding
          aware for the purposes of field width estimation.
- Corentin stated that the intent of the proposed resolution is to
          ensure that std::format() use consistent syntax to opt-in to
          locale dependent formatting and encouraged trying to address at least
          this concern.
- Corentin added that LWG might agree on a resolution in a short time
          frame, but that there will not be a plenary poll until June.
- PBrett stated that the resolution may be considered evolutionary.
- Victor agreed and noted that the L specifier could be added
          for a future standard.
- Victor asserted that we do need to decide what the default behavior
          is now.
- Victor added that we could consider transcoding locale provided text
          and potentially detecting mojibake if it would be produced.
- Victor noted that the format string is always a literal.
- [ Editor's note: In C++20, the format string may not be a literal,
          but
          P2216,
          if adopted, will require a literal or other compile-time evaluated
          expression. ]
- Zach asked for clarification regarding what is meant by
          "default behavior" and noted that the %Ou specifier is
          locale dependent, but that %u is not.
- Victor responded that there are cases like %T that do not
          have locale independent forms.
- [ Editor's note: %T is locale dependent because the
          decimal point character potentially used for sub-second precision is
          provided by the locale. ]
- Hubert stated that these concerns will be difficult to resolve
          quickly, are clearly evolutionary, and may require balloting.
- Hubert added that there may also be issues with requiring the locale
          independent behavior to use English translations.
- Tom noted that the basic source character set already has a bias
          toward English.
- Hubert responded that this goes further; we may potentially have to
          specify behavior in terms of asctime().
- Charlie commented that the text provided by the locale facet is
          currently produced by the operating system; changing that behavior
          may not be problematic.
- Charlie added that adding new format specifiers will result in
          incompatibilities if code that uses those specifiers is run with an
          older library implementation that doesn't support them.
- Charlie noted that, if support for compile-time format string
          checking is adopted via
          P2216,
          then the format string will become part of the function template
          specialization; this may help to avoid library compatibility
          issues.
- Charlie stated that there are multiple sources of locale information
          and that formatting of the chrono types is governed by the Windows
          region settings.
- Charlie noted that changes to the Windows region settings require a
          reboot.
- Tom asked for confirmation that calls to std::setlocale()
          don't affect how chrono values are formatted.
- Charlie confirmed that is correct.
- PBrett asked if std::format() behavior is affected by
          changes to the global locale via std::locale::global().
- Charlie responded that the global locale does affect the behavior of
          format specifiers that include the L specifier.
- Charlie clarified that the global locale will not affect parsing of
          the format string itself.
- Corentin requested review of the proposed resolution.
- Hubert noted that the wording requires that the "C" locale be used
          for field formats that do not include the L specifier
          regardless of whether a std::locale argument is passed.
- Hubert noted that under the C++20 wording, implementations trying to
          accommodate this tentative future direction may be more able to ignore
          the global locale than an explicit locale argument.  So, a change
          that maintains respecting the locale parameter is more compatible
          with C++20.
- Tom responded that doing so would not be consistent with the other
          standard format specifiers.
- Victor agreed and added that he would be strongly opposed to implicit
          use of a std::locale parameter.
- Jens stated that a migration path to better behavior needs to be
          established and noted that the current situation is an interesting
          mess.
- Jens suggested investigating how to increase consistency with the
          existing locale dependent format specifiers; e.g., for decimal
          point and digit group separator characters.
- Jens added that there may be cases where it would be useful to be
          able to specify use of the "C" locale even when a locale is provided
          as an argument.
- Jens observed that use of the "C" locale for the chrono %p
          specifier would be consistent with use of the "C" locale for floating
          point values.
- Jens noted that the example in the proposed resolution does not match
          the proposed grammar; the L specifier should precede the
          chrono-specs specifier, not follow it.
- Jens stated that adding support for the L specifier is
          backward compatible from a standard evolution perspective.
- Tom stated that a change to use the "C" locale in place of the global
          locale or a locale passed as an argument can be done as a non-abi
          breaking change.
- Charlie agreed, but noted that some implementation tricks may be
          required to avoid potential conflicts with older libraries.
- Zach stated that mixing different library versions is non-conforming
          anyway.
- Corentin stated that the "C" locale is used as a proxy for the
          absence of a locale and suggested that a constexpr locale might be
          desired in the future.
- Corentin asked Charlie if formatters can be modified without
          breaking ABI.
- Charlie replied that they are templates, so modifications can result
          in ODR violations.  Charlie added that inline namespaces can be
          helpful in some cases.
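
A minimal sketch of the inline namespace technique mentioned above follows. The names `mylib`, `abi_v2`, and `chrono_formatter` are hypothetical and do not describe any shipping standard library. The point is that the version tag becomes part of the entity's fully qualified (and therefore mangled) name, so two incompatible definitions are distinct entities rather than one entity with conflicting definitions.

```cpp
#include <type_traits>

namespace mylib {          // hypothetical library name
inline namespace abi_v2 {  // hypothetical version tag, changed on incompatible revisions
struct chrono_formatter {
    static constexpr int abi_version = 2;
};
} // namespace abi_v2
} // namespace mylib

// User code keeps writing mylib::chrono_formatter; the name resolves to the
// versioned entity, whose mangled name includes abi_v2.
static_assert(std::is_same_v<mylib::chrono_formatter,
                             mylib::abi_v2::chrono_formatter>);
```

This does not make mixing library versions conforming; it only turns some silent mismatches into distinct symbols.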
- PBrett asked for confirmation that use of an L specifier
          where one is not expected will result in a format exception being
          thrown.
- Victor confirmed that is the case.
- PBrett asked if the L specifier could be reserved now such
          that a format exception will be thrown if used, and then different
          behavior specified later.
- Charlie responded that changing behavior to not throw in cases where
          an exception was previously thrown is fine so long as mixed library
          version problems are avoided.
- Victor expressed agreement with Jens' prior comments.
- Victor stated that behavior must remain consistent between
          std::format() overloads that do and do not accept
          std::locale arguments; the presence of the
          std::locale argument must not, by itself, affect
          behavior.
- PBrett suggested that a paper that explores the alternatives may be
          required.
- Corentin asserted that it must be possible to evolve the
          std::format format string so as to add new behaviors.
- Corentin expressed distaste for the idea of a "no locale" specifier;
          that approach would still result in inconsistencies with number
          formatting.
- Charlie agreed.
- Jens conceded that challenging standardization work will be required
          if behavior changes from C++20 to C++23.
- Jens asserted that the right to add format specifiers when a new
          standard is issued must be reserved, even if doing so causes
          implementation challenges.
- Poll 1: LWG3547 raises a valid design defect in [time.format] in C++20.
        
          - Attendance: 11
- Consensus: Strong consensus that this issue represents a
              design defect.
 
- Hubert noted that, with regard to issues of consistency, the proposed
          resolution is a departure from existing interfaces such as
          strftime().
- Poll 2: The proposed LWG3547 resolution as written should be applied to C++23.
        
          - Attendance: 11
- No consensus.
- SA: Mitigation of behavior changes sensitive to string literal
              contents is very difficult and there are options available to
              deal with this problem in an additive way; this direction
              represents an unnecessary backward compatibility break.
 
- Mark stated that the proposed resolution would have been great 18
          months ago.
- PBrett responded that we need to recognize when we make mistakes
          and own correcting them.
- Corentin lamented the current state being another case of a bad
          default.
- Tom suggested that the current behavior can be presented as
          intentional with the goal to maintain consistency with existing
          interfaces; new format specifiers can then be added in C++23.
- PBrett suggested that an SG16 issue be filed and a volunteer found
          to work on it.
- Victor responded that the behavior isn't sufficiently broken to
          make him want to spend time on it.
- [ Editor's note: Despite that lack of desire, Victor and
          Corentin quickly authored an initial draft paper that will become
          P2372R0
          once published. ]
- PBrett volunteered to work on a paper.
 
- Tom and PBrett thanked Charlie for joining the telecon and encouraged
      him to continue attending.
- Tom stated that Victor had expressed interest in working on a potential
      std::locale replacement and asked if there were other
      volunteers interested in such work.
    
  
- Tom stated that the next SG16 telecon will be held May 12th.
    
May 12th, 2021
Draft agenda:
Attendees:
  - Charlie Barto
- Hubert Tong
- Jens Maurer
- Mark Zeren
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
  - P2295R3: Support for UTF-8 as a portable source file encoding
    
      - No discussion as the author was not present.
 
- P2372R1: Fixing locale handling in chrono formatters
    
      - [ Editor's note: D2372R1 was the active paper under discussion at
          the telecon.  That paper was later published as P2372R1 without
          further modification.  The agenda and links used here reference
          P2372R1 since the links to the draft paper were ephemeral.
          ]
- PBrett introduced the topic:
        
          - LEWG reached consensus for the direction proposed by
              P2372R0
              at its
              2021-05-03 telecon
              with additional refinement to preserve locale dependent
              formatting for iostreams.
- Since SG16 polls conducted at its
              2021-04-28 telecon
              did not agree with this direction, LEWG requested that SG16
              review and confirm or rebut the LEWG consensus.
 
- Victor presented slides lightly updated from his prior LEWG
          presentation.
        
          - Victor's presentation slides are available
              here.
 
- Poll 1: Forward D2372R1 to LEWG for inclusion in C++23 and with
          the intent that it be applied retroactively to C++20.
        
          - Attendance: 8
- Consensus: Strong consensus in favor.
 
- [Editor's note: D2372R1 contains the LEWG requested update to
          preserve locale dependent formatting for ostreams. ]
- [Editor's note: The chair's perception is that SG16's change in
          consensus is attributable to two factors:
        
          - New information that arrived after the initial poll.
- SG16's original poll targeted C++23 while LEWG's poll targets
              C++23 and C++20 as a DR; some concerns had been expressed
              regarding backward compatibility and migration.
 ]
 
- P2093R6: Formatted output
    
      - Victor presented:
        
          - std::print() integrates std::format() with
              I/O.
- R6 addresses recent LEWG feedback:
            
              - The proposed std::print() header was changed from
                  <io> to <print>.
- Additional rationale and clarifications were added regarding:
                
                  - Substitution of replacement characters.
- The choice to base behavior on the compile-time literal
                      encoding.
- ANSI escape sequences do not constitute a native device
                      API.
- Existing practice in Rust.
 
 
 
- PBrett asked how substitutions would be performed for different kinds
          of ill-formed scenarios.
- Zach stated that the Unicode standard documents recommended practice
          for substitution of replacement characters.
- [ Editor's note:
          Unicode 13
          discusses substitution of replacement characters in section
          "U+FFFD Substitution of Maximal Subparts" of
          chapter 3.9, "Unicode Encoding Forms" and in
          chapter 5.22, "U+FFFD Substitution in Conversion". ]
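
The recommended practice referenced above can be sketched as follows. This is an illustration only, not the behavior specified by P2093; the function name `replace_ill_formed_utf8` is hypothetical. Each maximal ill-formed subpart (the longest prefix of a well-formed sequence that is present, or a single byte that cannot start a sequence) is replaced by one U+FFFD.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: copies well-formed UTF-8 sequences unchanged and
// replaces each maximal ill-formed subpart with a single U+FFFD.
inline std::string replace_ill_formed_utf8(std::string_view in) {
    constexpr std::string_view replacement = "\xEF\xBF\xBD"; // U+FFFD in UTF-8
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t need = 0;                // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the next byte
        if (lead <= 0x7F) {
            need = 0;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            need = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            need = 2;
            if (lead == 0xE0) lo = 0xA0;
            if (lead == 0xED) hi = 0x9F;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            need = 3;
            if (lead == 0xF0) lo = 0x90;
            if (lead == 0xF4) hi = 0x8F;
        } else {
            out += replacement;              // byte can never start a sequence
            ++i;
            continue;
        }
        std::size_t len = 1;                 // bytes of this sequence matched so far
        while (len <= need && i + len < in.size()) {
            const unsigned char c = static_cast<unsigned char>(in[i + len]);
            if (c < lo || c > hi) break;
            lo = 0x80;
            hi = 0xBF;
            ++len;
        }
        if (len == need + 1) {
            out.append(in.substr(i, len));   // complete, well-formed sequence
        } else {
            out += replacement;              // one U+FFFD per maximal subpart
        }
        i += len;
    }
    return out;
}
```

Because the rule is fully determined by the input bytes, two implementations that follow it produce the same number of replacement characters for the same input.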
- Zach expressed a preference for implementations to be consistent in
          how replacement characters are substituted.
- Hubert stated that an example should be added to the paper.
- Hubert expressed a preference for vprint_unicode() to
          substitute replacement characters even when the output device is not
          Unicode.
- Victor asked if that could be done as implementation-defined
          behavior.
- Hubert responded, no; the goal is for the substitution behavior to be
          deterministic for vprint_unicode() regardless of the output
          device.
- Victor replied that he would prefer that behavior to be optional.
- Hubert replied that he would like to ensure that ill-formed inputs are
          not presented with no indication that something went wrong.
- PBrett stated that, when writing to a Unicode device, a
          U+FFFD replacement character should be substituted and the
          device should then handle it as its designers intended.
- Victor agreed with the substitution rationale for the device case
          since transcoding may be necessary, but disagreed for files due to a
          desire to avoid the validation overhead.
- Hubert expressed a preference for the behavior of
          vprint_unicode() to be consistent across files and
          devices.
- PBrett suggested that what Hubert desires is some kind of noisy
          failure, like a trap.
- Hubert agreed and restated the goal as some kind of signal that
          encoding issues were encountered.
- Steve stated that C++ programs do not typically interact directly
          with a device and that it is difficult to diagnose problems where the
          data can't be inspected en route.
- PBrett asked if Steve had a suggestion.
- Steve responded with a preference for a programmatic error handling
          facility.
- Zach stated that, in the case where UTF-8 source is copied to a UTF-8
          sink, introduction of replacement characters could be surprising, but
          when transcoding is required, e.g., when the sink is UTF-16, then
          replacement characters are expected.
- Zach suggested decomposing the problem; validate and handle errors
          first, then convert.
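
One way to read this decomposition is that the conversion step can assume well-formed input once validation or repair has already happened. The sketch below shows only that second step, for a UTF-16 sink; the function name `utf8_to_utf16_unchecked` is hypothetical, and passing input that is not well-formed UTF-8 violates its precondition.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper for the "convert" step only.
// Precondition: `in` is well-formed UTF-8 (validated or repaired beforehand).
inline std::u16string utf8_to_utf16_unchecked(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // Sequence length follows from the lead byte alone for well-formed input.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep the payload bits of the lead byte: 7, 5, 4, or 3 bits.
        char32_t cp = len == 1 ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;                   // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Keeping validation out of this function means the error policy (replace, throw, or trap) is chosen once, before conversion, rather than being baked into the transcoder.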
- Charlie explained that, on Windows, the only ways to write Unicode to
          the console are to change the console encoding and write using the
          ANSI APIs, or to convert to UTF-16 and write using the wide APIs.
- Charlie noted that, since the console encoding is a global property
          of the process, changing it within std::print() would require
          synchronization.
- Zach suggested that it is reasonable to get mojibake in the ANSI case
          if the console encoding hasn't been correctly set.
- Hubert responded that the global console encoding condition seems to
          be particular to Windows and worth addressing.
- Charlie pondered the ramifications of writing to a stream opened in
          text mode.
- Victor reiterated his stance on not wanting to pay validation costs
          except in cases where transcoding is necessitated.
- Poll 2: When <print> facilities must transcode formatting
          results for display on a device and, during that process,
          invalidly-encoded text is encountered, std::print() should
          replace the erroneously-encoded code units with
          U+FFFD REPLACEMENT CHARACTER.
        
          - Attendance: 9
- Consensus is in favor.
- A: Not convinced that silently substituting replacement
              characters is always the right policy; an exception could be
              appropriate.  There are parallels with integer overflow.
- A: Testing is difficult if substitution is device
              sensitive.
 
- Charlie expressed support for a direction that would allow explicitly
          inhibiting use of the native device API but noted that, on Windows,
          that would mean the console encoding would have to be correctly set
          and the application would have to take care of buffering
          concerns.
- Poll 3: When <print> facilities need not transcode their
          formatting results for display on a device and invalidly-encoded text
          is encountered, std::print() should nevertheless replace the
          erroneously-encoded code units with U+FFFD REPLACEMENT CHARACTER.
        
          - Attendance: 9
- N: Undecided due to uncertainty; more consideration is
              needed.
- A: Would prefer a UB approach that would enable sanitizers to
              diagnose these cases and remain conforming.
- SA: There is a lack of implementation experience for this
              direction, it imposes overhead, and there are terminals that
              accept bytes.
- SA: A wide contract with validation does not make sense for
              high-performance I/O.
 
- PBrett stated that there appear to be different audiences for
          std::print() and these audiences have different ideas of
          what is "obviously" correct:
        
          - For some, std::print() is a simple tool that enables a
              better Hello World.
- For others, it is a high-performance I/O facility.
- For yet others, it is a way to format bytes.
 
- Tom suggested that an error handling facility might move us
          towards more consensus.
- PBrett noted that something like JeanHeyd's transcoding facilities
          could provide that.
- Charlie agreed that integration of a familiar transcoding facility
          could work.
 
- Tom stated that the next telecon will be May 26th and that the agenda
      will again include P2295R3 and P2093R6.
May 26th, 2021
Draft agenda:
Attendees:
  - Corentin Jabot
- Hubert Tong
- Jens Maurer
- Mark Zeren
- Peter Brett
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
  - P2295R4: Support for UTF-8 as a portable source file encoding
    
      - [ Editor's note: D2295R4 was the active paper under discussion at
          the telecon.  The agenda and links used here reference
          P2295R4 since the links to the draft paper were ephemeral.  The
          published document may differ from the reviewed draft revision.
          ]
- PBrett provided an introduction.
- Corentin presented and described the changes from R3 to the
          draft R4.
- PBrett observed that the wording updates removed the prior
          definition for a UTF-8 file and added a new definition for
          a UTF-8 source file.
- Tom recalled prior discussion that suggested there was no need to
          provide such a definition at all.
- Jens confirmed and explained that the prior suggestion was to
          specify translation phase 1 in terms of a sequence of
          characters instead.
- Jens noted that there will be merge conflicts with
          P2314.
- Corentin asked if the merge conflicts can be dealt with after CWG
          reviews P2314.
- Jens confirmed that they can be.
- PBrett asked if progress can be made before P2314 is adopted into
          the working paper.
- Jens confirmed that progress can be made.
- PBrett asked Jens if he would like to see additional wording changes
          reviewed in SG16.
- Jens replied that he would and noted that he had not received a
          response to all of the suggestions previously provided in his message
          to the mailing list available at
          https://lists.isocpp.org/sg16/2021/04/2353.php.
- Jens observed that the proposed wording results in existing wording
          no longer applying to all source files.  For example, "Any source
          file character not in the basic source character set is replaced by
          the universal-character-name that designates that character"
          now appears in a paragraph that doesn't apply to UTF-8 source
          files.
- Corentin responded that this paper doesn't make sense without the
          changes from P2314.
- Tom asked if the wording could be rebased on P2314 with a noted
          dependency on P2314.
- Jens replied that it could be.
- Hubert noted that the definition of a UTF-8 source file is problematic
          since the definition could apply to a file that just so happens to
          decode as UTF-8, but is not intended as a UTF-8 file.
- PBrett responded that the following sentence specifies that encoding
          determination is implementation-defined.
- Hubert acknowledged and suggested it might be helpful to reorder the
          sentences.
- Hubert added that wording is still required to reflect intent that a
          file be interpreted as UTF-8.
- PBrett agreed by way of an example; an implementation invoked without
          such intent may analyze a file, determine that it does not decode
          successfully as UTF-8, and then interpret it as, for example,
          Windows-1252, and do so without issuing a diagnostic.
- Jens observed that the wording states that, "An implementation shall
          support UTF-8 source files", but there is no wording to require
          diagnosis of ill-formed UTF-8 source files.
- Corentin responded that there is no such thing as an invalid UTF-8
          file; either a file is valid UTF-8 or it is not UTF-8.
- Mark responded that there is a desire to have implementations produce
          a diagnostic if source files that are purported to be encoded as
          UTF-8 are not, in fact, valid UTF-8.
- PBrett stated that there are three distinct requirements:
        
          - A requirement to support UTF-8 encoded source files.
- A requirement for means to inform the implementation that all
              source files are to be assumed to be UTF-8 encoded.
- A requirement that the implementation diagnose files that were
              assumed to be UTF-8 encoded but that contain (some) non-UTF-8
              content.
 
- Hubert offered some suggested wording in chat:
        
          - "An implementation shall provide for processing physical source
              files as having a UTF-8 encoding scheme without restriction,
              other than resource limits ([implimits]), upon the content of
              the physical source file."
 
- Jens pasted previously suggested wording from the mailing list in
          chat:
        
          - "The encoding scheme of a physical source file is determined in
              an implementation-defined manner.  An implementation shall
              support (possibly among others) the UTF-8 encoding scheme."
- "If the encoding scheme of a physical source file is determined
              to be UTF-8, the physical source file shall consist of a
              well-formed sequence of UTF-8 code units as specified by ISO/IEC
              10646."
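
A minimal sketch of the kind of check an implementation could use to decide whether a source file meets the "well-formed sequence of UTF-8 code units" requirement, and where to point a diagnostic if it does not. The function name `first_ill_formed_utf8` is hypothetical, and the byte ranges follow the usual well-formedness table for UTF-8; nothing here is taken from the proposed wording itself.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the offset of the first code unit sequence
// that is not well-formed UTF-8, or std::nullopt if the buffer is well-formed.
std::optional<std::size_t> first_ill_formed_utf8(std::string_view src) {
    std::size_t i = 0;
    while (i < src.size()) {
        const auto b0 = static_cast<unsigned char>(src[i]);
        std::size_t trail = 0;               // number of continuation bytes
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (b0 <= 0x7F) {
            trail = 0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            trail = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            trail = 2;
            if (b0 == 0xE0) lo = 0xA0;       // rejects overlong forms
            if (b0 == 0xED) hi = 0x9F;       // rejects surrogate code points
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            trail = 3;
            if (b0 == 0xF0) lo = 0x90;       // rejects overlong forms
            if (b0 == 0xF4) hi = 0x8F;       // rejects values above U+10FFFF
        } else {
            return i;                        // byte can never start a sequence
        }
        for (std::size_t k = 1; k <= trail; ++k) {
            if (i + k >= src.size()) return i;   // truncated sequence
            const auto b = static_cast<unsigned char>(src[i + k]);
            const unsigned char lower = (k == 1) ? lo : 0x80;
            const unsigned char upper = (k == 1) ? hi : 0xBF;
            if (b < lower || b > upper) return i;
        }
        i += trail + 1;
    }
    return std::nullopt;
}
```

A file whose encoding was determined to be UTF-8 would be accepted only when this returns `std::nullopt`; otherwise the returned offset gives a location for the diagnostic that Mark and Hubert asked for. A Windows-1252 file containing a lone `0xE9` byte, for example, is reported at that byte.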
 
- Hubert expressed support for that wording but thought some additional
          updates would still be required to ensure diagnostics.
- Corentin disagreed with removal of wording that requires that the
          scalar value of source file characters be preserved.
- Jens responded that the scalar value preservation wording isn't
          required because the mapping to the translation character set already
          preserves characters.
- Steve noted the existence of wording that uses the phrase "known to
          the implementation" and asked if that could be used to specify how
          source file encoding is determined.
- Tom suggested that implementation-defined is preferred since that
          reflects a documentation requirement.
- Hubert added that the "known to the implementation" wording is not
          intended to reflect that implementations can be wrong.
- PBrett observed that Jens and Hubert would presumably like to see
          updated wording.
- Hubert expressed a belief that the required wording has been
          identified and that he is on board with the goal of preserving scalar
          value sequences from UTF-8 source files.
- Corentin responded that he will bring back a revised paper with the
          suggested wording.
- Steve informed the group that the EWG chair is considering dedicating
          a telecon to SG16 papers in the next month or so.
 
- P2093R6: Formatted output
    
      - PBrett reported a previous conversation with Victor in which Victor
          expressed that he felt he has the guidance he needs regarding
          handling of substitution characters and locale.
- Victor presented slides:
        
          - The next question to be answered is whether it is ok to base
              behavior on the literal encoding.
- Use of the literal encoding avoids race conditions with locale
              settings.
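
One way to base behavior on the literal encoding is to inspect, at compile time, how the compiler encoded a known non-ASCII character. The sketch below is an illustration only: the name `literal_encoding_is_utf8` is hypothetical, the choice of U+00B5 MICRO SIGN as the probe character is an assumption, and the check is a heuristic rather than a guarantee, since it only confirms that one character was encoded as UTF-8 would encode it.

```cpp
// Hypothetical helper: true when the ordinary literal encoding appears to be
// UTF-8, judged by the bytes the compiler produced for U+00B5 MICRO SIGN.
constexpr bool literal_encoding_is_utf8() {
    // In UTF-8, U+00B5 is the two bytes C2 B5, so the array holds those two
    // bytes plus the terminating null.
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}
```

Since the result is a constant expression, it does not depend on the global locale and cannot change while the program runs, which is the property Victor cited for avoiding races with locale settings.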
 
- Discussion ensued regarding current dependencies on the choice of
          literal encoding and it was observed that, though the wording
          provided by P1868 to specify estimated format field widths is not
          based on the literal encoding, at least one implementation is
          planning to only use the specified estimated widths when the
          literal encoding is UTF-8.
- Hubert observed that field width estimation can apply to content
          from other than string literals.
- PBrett provided an example; when gettext() is used, a
          literal is used for the message catalog lookup, but the result is
          not a string literal.
- Hubert acknowledged the provided rationale, but noted that it does
          not address concerns raised and that he has seen many cases where
          use of locales works fine on UNIX systems.
- Hubert added that this has the potential to bite existing users since
          code may appear to work correctly until it suddenly doesn't.
- Victor replied that his goal is to make UTF-8 cases work as expected
          and that he is willing to accept some surprises in other
          scenarios.
- Victor stressed that the intention is that, on UNIX systems, bytes
          are simply passed through.
- Tom directed discussion towards the example code from the
          telecon announcement.
- Victor stated that he will request an LWG issue or author a paper to
          address handling of locale provided text.
- [ Editor's note: Victor requested an LWG issue that is now
          tracked as
          LWG issue 3565. ]
- Corentin stated that he is content with undefined behavior for cases
          where UTF-8 input is expected, but the input is not actually
          UTF-8 encoded.
- Hubert responded that the format locale situation is rather urgent
          for EBCDIC environments.
- PBrett stated that he is ok with the proposal because it won't leave
          anything more broken than it already is.
 
- Tom stated that the next telecon will be held on June 9th.