SG16: Unicode meeting summaries 2018/05/16 - 2018/06/20
Summaries of SG16 meetings are maintained at
https://github.com/sg16-unicode/sg16-meetings.  This paper contains a
snapshot of select meeting summaries from that repository.
May 16th, 2018
Draft agenda:
  - Review and discuss papers in the Rapperswil pre-meeting mailing.
- Discuss plans and goals for those attending Rapperswil.
Attendees:
  - Bob Steagall
- Corentin Jabot
- Dalton Woodward
- Florin Trofin
- JeanHeyd Meneide
- Mark Zeren
- Martinho Fernandes
- Steve Downey
- Tom Honermann
- Zach Laine
Meeting summary:
  - It was reported on Slack that Martinho's properly formatted UTF-8
      P1041R0 paper was served up by open-std.org either without a CharSet
      header or with a Latin1 setting.  Tom contacted Hal and Keld.  Further
      discussion yielded a plan to update SD-7 to require UTF-8 for .md files
      and to configure the open-std.org web server to serve them with a
      CharSet=UTF-8 header.
- Zach, Bob, and JeanHeyd shared some of their experience at C++Now.
- We then went on to review papers from the pre-Rapperswil mailing.
- P1041R0 - Make char16_t/char32_t string literals be UTF-16/32
    
      - Tom noted a typo in the proposed wording changes for
          lex.ccon/4; a use of UTF-8 where UTF-16 was
          intended.
- Given the encoding issues and lack of Markdown rendering support built
          into browsers, it was suggested that future papers, at least for now,
          be submitted in a pre-rendered format.
- Martinho asked about getting the paper scheduled for discussion in
          Rapperswil.  Tom said he would forward SG16 polls on papers we
          discussed to WG chairs to communicate our position and request time
          in Rapperswil.  Tom will copy paper authors and expected presenters on
          this communication.
- It was asked if there was any library impact.  Martinho responded no.
          Tom noted having previously audited occurrences of char16_t,
          char32_t, UTF-16, and UTF-32 and could not
          think of a case.
- Zach suggested that, when presenting, it be emphasized that no
          implementation will need to make changes; that this is just standardizing
          existing practice.  Emphasize that there are no known implementations
          where the encoding used is not already UTF-16/UTF-32 and that a member
          of the C committee was consulted.
- Poll: Those in favor of P1041R0?
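The existing practice that P1041R0 proposes to standardize can be observed with a few compile-time checks.  This is only a sketch: the variable names are illustrative, and the assertions hold on implementations that already encode these literals as UTF-16 and UTF-32 (no others were known at the meeting).

```cpp
#include <cassert>

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
// surrogate pair for it while UTF-32 needs a single code unit.
constexpr char16_t utf16_text[] = u"\U0001F600";
constexpr char32_t utf32_text[] = U"\U0001F600";

// Two UTF-16 code units plus the terminating null.
static_assert(sizeof(utf16_text) / sizeof(utf16_text[0]) == 3,
              "expected a surrogate pair followed by a null terminator");
static_assert(utf16_text[0] == 0xD83D && utf16_text[1] == 0xDE00,
              "expected the UTF-16 surrogate pair for U+1F600");

// One UTF-32 code unit plus the terminating null.
static_assert(sizeof(utf32_text) / sizeof(utf32_text[0]) == 2,
              "expected one code unit followed by a null terminator");
static_assert(utf32_text[0] == 0x1F600,
              "expected the code point value itself");
```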
        
      
 
- P1072R0 - Default Initialization for basic_string
    
      - Mark noted that P1072 is dependent
          on P1010 which is dependent on
          Richard Smith's P0593.  This
          raised the question of prioritization and a request for SG16 to request
          that P0593 and
          P1010 get time in Rapperswil so
          that progress can be made on P1072
          in San Diego.  Tom agreed to make such a request; specifically to
          request that EWG entertain P0593
          and that LEWG look at P1010 (and
          P1072 time permitting).
- Mark went on to discuss applicability of
          P1072 to SG16.  Of particular
          concern are the issues caused by requiring null termination.  This is
          not a problem for vector, and hence not a concern expressed in
          P1010.
- Mark pointed out that the design is used in real world code today.
- Zach asked why reserve() doesn't suffice.  Mark explained the
          examples in the paper; that we currently either have to repeatedly
          update the size of the container with each addition, or eagerly resize
          the container and pay for an unused initialization.  The goal of the
          paper is to avoid both costs by enabling writing to excess capacity
          independently of updates to the container size.
- Tom asked if option A is viable.  The concern is that const member
          functions must be thread safe.  A call to resize_uninitialized()
          makes uninitialized data available to const member functions.  Further,
          there is no event to indicate when the uninitialized data has been
          read and therefore no memory barrier to function as a synchronization
          point.
- Mark acknowledged that a two-phase commit approach is necessary to
          avoid UB.
- Martinho observed that two-phase commit is not sufficient by itself
          because basic_string uses excess capacity to store a null
          terminator for the string; this is what allows the data() and
          c_str() member functions to be const qualified.
          Overwriting the null terminator will cause UB for concurrently
          executing threads.
- Mark advised SG16 to consider the consequences of providing implicit
          null-termination for string-like containers in the future.  An
          alternative approach would use string builders that append a
          null-terminator when they are collapsed.
- Mark noted that the two-phase commit approach does at least allow the
          implementation to re-establish invariants (such as ensuring a
          null-terminator is present at the start of excess capacity following
          insert_from_capacity()).
- Tom suggested an emplace-like solution might be preferred to enable
          preserving invariants.
- Mark acknowledged a call-back/functor based solution would work (though
          it still doesn't address the over-written null-terminator issue).
- Dalton asked about making vector/string node-based containers such
          that data could be written to a new buffer and then swapped in.  This
          has the disadvantage of requiring that the current buffer be copied
          prior to performing the append.
- Tom asked if any performance numbers were available.  What is the
          expected gain?
- Mark responded that numbers are not available, but that Google has
          measured and claims the improvements make this feature worthwhile.
          Estimate is a few percent improvement.
- Corentin asked why not to use vector instead of string.
- Mark responded that string is a vocabulary type.
- Poll: Do we agree P1072R0 addresses a problem worth solving?
        
      
- Poll: Do we prefer option A, option B, or some option C?
        
      
- Mark clarified that option C, as discussed today, would be one of:
        
          - An emplace-like call with a call-back/functor.
- A node-based swap.
 
- Discussion moved into allocator interaction with node types.
- Zach stated that swap is broken for PMR allocators.
- Steve agreed and provided an elaboration; that the swap of the
          allocators doesn't swap the actual buffers.
- Mark noted that moving a buffer between vector and
          string encounters complexities due to null-termination
          requirements.
- Martinho asked how a small buffer optimized string is moved into a
          node type.
- Dalton responded that you allocate.
- Tom added, or the node type implements the SBO itself.
- Mark expressed concern that an emplace-like call-back/functor approach
          may not work for the network use case of wanting to read data off the
          network directly into the buffer.
- Zach suggested that, in a string builder approach, vector is
          the string builder.
- Corentin expressed a preference for a specific string builder type
          rather than vector.  Essentially a vocabulary type suited to
          the purpose.
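The two costs Mark described when answering Zach's question about reserve() can be sketched with today's std::string interface.  This is an illustration under assumptions, not code from the paper: fill_buffer is a hypothetical stand-in for an external producer such as a network read, the function names are invented, and the proposed resize_uninitialized() is deliberately not used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical producer: writes up to cap bytes into out and returns the
// number of bytes written.
std::size_t fill_buffer(char* out, std::size_t cap) {
    static const char payload[] = "hello";
    const std::size_t n = std::min(cap, sizeof(payload) - 1);
    std::memcpy(out, payload, n);
    return n;
}

// Cost 1: eagerly resize.  resize() value-initializes cap characters that
// are immediately overwritten by the producer.
std::string read_with_resize(std::size_t cap) {
    std::string s;
    s.resize(cap);
    const std::size_t n = fill_buffer(&s[0], s.size());
    s.resize(n);  // shrink to the number of bytes actually produced
    return s;
}

// Cost 2: reserve once, then grow one element at a time.  No
// reallocation occurs, but every push_back updates the size and the null
// terminator.  Writing into the reserved capacity beyond size() directly
// is not permitted, which is why reserve() alone does not suffice.
std::string read_with_reserve(const char* src, std::size_t n) {
    std::string s;
    s.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back(src[i]);
    }
    return s;
}
```

Both functions produce the same string; the difference lies only in the redundant initialization or the repeated size bookkeeping that the paper aims to avoid.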
 
- P1025R0 - Update The Reference To The Unicode Standard
    
      - Steve briefly introduced the change as similar to what had been
          proposed, but not completed, for C++17.
- Tom asked, why update the normative reference to specify each of
          Unicode 10, Unicode without a version indicator, and ISO 10646?
- Steve answered, we need ISO 10646 for existing references; for
          example, the __STDC_ISO_10646__ macro.  We want to reference
          the Unicode standard (in addition to ISO 10646) for stability
          guarantees and additional features.  We want to reference Unicode 10
          to establish a minimum requirement, and the unversioned Unicode
          standard to enable implementors to adopt a newer version.
- Tom suggested adding a non-normative note that implementors are
          allowed to use Unicode 10 or newer; though they must use a
          corresponding version of ISO 10646.
- Martinho stated that we need to make it clear that implementors must
          choose a specific Unicode release.
- Tom asked if we should require a predefined macro that indicates the
          Unicode version.
- Steve and Martinho both answered, maybe, but not yet as we don't
          actually depend on anything Unicode version dependent yet.
- Poll: Those in favor of P1025R0:
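For context on the existing ISO 10646 reference Steve mentioned, the following sketch shows how a program can query __STDC_ISO_10646__ today.  The function name is illustrative.  The macro is conditionally defined; when present it expands to an integer of the form yyyymmL identifying the ISO 10646 edition whose code points wchar_t can represent.  No corresponding Unicode version macro exists, which is what Tom's question was about.

```cpp
#include <cassert>

// Returns the value of __STDC_ISO_10646__ if the implementation defines
// it, or -1 if it does not.
constexpr long iso10646_version() {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return -1;
#endif
}
```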
        
      
 
- Our next meeting will be May 30th; the week before Rapperswil.
- There is a WG21 administrative teleconference May 25th.
    
      - Tom will dial-in to give an update on SG16.  Martinho and JeanHeyd
          are encouraged to attend as well since they have papers to present.
 
- Those planning to attend Rapperswil: Martinho, Corentin, Peter, JeanHeyd.
- Following the meeting, Martinho volunteered to present
      P1025R0 at Rapperswil since Steve
      will not be present.  Steve agreed.
May 30th, 2018
Draft agenda:
  - Discuss plans and goals for those attending Rapperswil.
- Review and discuss the following papers from the Rapperswil pre-meeting mailing:
    
      - P1030R0: std::filesystem::path_view
- P0540R1: A Proposal to Add split/join of string/string_view to the Standard Library
- P0645R2: Text Formatting
 
Attendees:
  - JeanHeyd Meneide
- Mark Zeren
- Martinho Fernandes
- Peter Bindels
- Sergey Zubkov
- Steve Downey
- Tom Honermann
- Zach Laine
Meeting summary:
  - Administrative updates:
    
      - Tom reported that WG chairs were contacted regarding SG16 requests
          for paper reviews in Rapperswil.  WG chairs are predictably swamped
          and prioritizing as best they know how, but we may not get to present
          any of our papers.
- Zach observed that Titus is concerned about the amount of time that
          LEWG will need for ranges, but that LWG should be more concerned.
- Tom relayed that JF Bastien volunteered to arrange introductions with
          Swift and WebKit developers working on Unicode.  Tom reached out to
          arrange meetings, but hasn't heard back.  Apple developers are busy
          preparing for WWDC; Tom will reach out again soon.
- Tom brought up the recent news that Microsoft has added beta support
          for UTF-8 as a system code page as of the Windows 10 April update.
          Tom made some new contacts within Microsoft, but has not yet gotten
          any further information about Microsoft's goals or plans with this
          change.
 
- Rapperswil planning:
    
      - Tom asked for volunteers to stand up for SG16 at the Saturday plenary
          in Rapperswil and give a brief update.  Martinho and JeanHeyd agreed
          to do so.
- Tom asked for those who have attended meetings before to offer any
          advice they have for first time attendees.
- Zach recommended spending some time in each of the WGs.  Each WG has
          its own personality.
- It was noted that hanging around in WGs where one has a short paper
          in the queue creates opportunities to present earlier than the paper
          might otherwise be scheduled.  The
          P1025 (normative Unicode reference)
          and P1041 (char16_t/char32_t are
          UTF-16/UTF-32) papers are good candidates.
- Zach also mentioned not to be afraid to ask questions and to try to
          read papers ahead of time.
- Tom noted that anyone present in the room is allowed to vote in straw
          polls, but that polls in plenary are generally restricted to ISO
          members.  It was noted that Herb will make it clear when ISO membership
          is required to vote.
 
- P1030R0 - std::filesystem::path_view
    
      - Martinho liked it, especially section 4.1 (Assume UTF-8 for char based
          interfaces).
- Tom liked it with the exception of section 4.1.
- Tom expressed a belief that the discussion in section 4.1 of how
          existing char based interfaces on Windows handle conversion to
          wchar_t for invocation of native filesystem interfaces is
          incorrect.  Tom's understanding is that char based strings are transcoded
          to wchar_t strings using the system code page.
- Zach asked what is meant by ANSI encoding.
- Tom explained that Microsoft has long referred to char based encodings
          collectively as ANSI encodings despite these encodings not reflecting
          an ANSI standard.
- [Editor's note: Microsoft's glossary of terms on MSDN describes
          the origin of the ANSI reference here.  It comes from a draft ANSI
          specification that was eventually standardized as the ISO-8859 family
          of encodings.  See the definition of "ANSI" at
          
          https://msdn.microsoft.com/en-us/goglobal/bb964658.aspx#a.
          Microsoft now officially refers to these encodings as "Windows code
          pages".]
- Zach initiated a discussion on compile-time vs run-time encodings.
          Section 4.1 describes a scenario in which file paths are pasted into
          source code as string literals, but the existing interpretation of
          such strings, when used as paths at run-time, depends on run-time
          locale settings.
- Peter mentioned that the Microsoft compiler now supports a
          /utf-8 option that purports to define the source and execution
          character encodings.  However, that option really only affects how
          literals from the source code are translated to the execution character
          encoding (UTF-8 at compile time, but never UTF-8 at run-time (at least,
          not until the newly introduced beta support in Windows 10 that requires
          the user to opt in)).
- Tom stated that we can't fix the compile-time vs run-time aspects of
          the execution character encoding.
- Martinho countered that char8_t offers a solution for this -
          we know the compile-time and run-time encoding of char8_t
          characters and strings.
- Tom suggested a response to the author: maintain consistency with
          existing code; char means "ANSI" encoding.  Use
          char8_t for UTF-8 (follow the changes to path
          proposed in the char8_t proposal).
- Tom, Zach, and JeanHeyd all noted the presence of #ifdefs
          surrounding the wchar_t based interfaces in the proposed
          design.  We don't use #ifdef as specification for implementation
          defined features.
- JeanHeyd noted that path_view should not fight with the
          platform; don't propagate implementation defined behavior through
          interfaces to the programmer.
- Martinho observed that there is no rationale for providing
          wchar_t based interfaces only for Windows; they are perfectly
          applicable to other platforms as well.
- Zach stated that path_view should work the same as
          path; just as string_view does for string.
          path_view should support the same set of constructors that
          path has and they should behave the same.  If there is a need
          for new constructors, they should be added to both path and
          path_view.
- Zach noted that path_view should be explicitly constructible
          from path, not the other way around.  [Editor's note: as
          currently specified, path_view is constructible from
          path, though the constructor isn't explicit.  Note that
          string_view's corresponding constructor is also not
          explicit.]
- Further discussion regarding memory allocation and the behavior of the
          proposed c_str class ensued.  [Editor's note: few details
          of this discussion were recorded.  From what I recall, consensus was
          that the memory allocation behavior should be implementation
          defined.]
- JeanHeyd asked how we should communicate our feedback to the author.
- Zach replied with a preference for a direct person-to-person
          response.
- JeanHeyd volunteered to deliver feedback.
- Poll: Use execution character encoding for char interfaces,
          char8_t for UTF-8?
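Martinho's point that the encoding of UTF-8 literals is known at both compile time and run time can be illustrated as follows.  This is a sketch with an invented helper name; it is written as a template so that it compiles whether u8 literals have type char (C++17) or char8_t (as proposed in the char8_t paper).

```cpp
#include <cassert>
#include <cstddef>

// True if lit holds exactly the UTF-8 encoding of U+00E9 (0xC3 0xA9)
// followed by a null terminator.
template <class CharT, std::size_t N>
constexpr bool is_utf8_e_acute(const CharT (&lit)[N]) {
    return N == 3 &&
           static_cast<unsigned char>(lit[0]) == 0xC3 &&
           static_cast<unsigned char>(lit[1]) == 0xA9 &&
           lit[2] == 0;
}

// A u8 literal is always UTF-8, independent of compiler options and of
// the locale in effect when the program runs.
static_assert(is_utf8_e_acute(u8"\u00E9"),
              "u8 literals are encoded as UTF-8");
```

By contrast, the bytes of the ordinary literal "\u00E9" depend on the execution character encoding selected at compile time, and their interpretation as a path depends on run-time settings, so no equivalent portable assertion can be written for it.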
        
      
 
- P0882R0 - User-defined Literals for std::filesystem::path
    
      - Tom stated that SG16 concerns are limited to encoding issues; LEWG
          should address any other concerns; e.g., naming.
- Peter noted that the paper punts on UTF-8 support pending a solution
          from the committee for differentiating ordinary and UTF-8 string
          literals.  Fortunately, we have a solution for that in the works!
- It was asked why the UDLs are not constexpr; the answer is
          because they produce path objects and the path
          constructor allocates.
- Mark asked if the UDLs should produce path_view objects ala
          P1030 above and was rewarded
          with a round of yeses.
- Peter observed that the UDL names are very generic (ha ha) and that
          the literal namespace proposed for them differs unnecessarily from
          existing precedent (e.g., std::filesystem::literals vs
          std::literals::filesystem).  [Editor's note: This design
          also results in the UDL declarations being visible following
          using namespace std::filesystem; this may be
          intentional.]
- Poll: Contingent upon adoption of char8_t, add char8_t based overloads?
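The constexpr question above can be illustrated with a view-returning literal.  This is a sketch under assumptions: path_view is not available, so std::string_view is used as a stand-in, and the namespace and suffix names are invented rather than taken from P0882.  Because the operator only records a pointer and a length and never allocates, it can be evaluated at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

namespace demo_literals {
// Hypothetical literal operator returning a non-owning view.
constexpr std::string_view operator""_pv(const char* s, std::size_t n) {
    return std::string_view(s, n);
}
}  // namespace demo_literals

using namespace demo_literals;

// Evaluated entirely at compile time.
static_assert("/usr/lib"_pv.size() == 8, "view covers the whole literal");
```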
         
       
 
- P0540R1 - A Proposal to Add split/join of string/string_view to the Standard Library 
    
  
- P0645R2 - Text Formatting
    
      - Zach requested char8_t overloads.  [Editor's note:
          Peter has been planning to work on adding char16_t and
          char32_t support.  There is an existing issue tracking
          support for char16_t at
          
          https://github.com/fmtlib/fmt/issues/698.  That issue notes
          that support for std::numpunct<char16_t> is
          missing; that would presumably be an issue for char8_t
          support as well.]
- Zach observed that formatting only works for trivial encodings
          in which one code unit equals one code point; otherwise, field
          alignments won't match up in displayed text.
- Martinho responded that, if a font is missing a glyph for a
          combining character, then the combining character will likely be
          displayed as a separate glyph.  Text layout is required to display
          aligned text (e.g., depends on console, curses, etc...).
- Tom asked how such display concerns can be addressed; format
          is not a text display tool.
- Zach asked how field size is specified.  Code units?  Code points?
          "Characters"?
- Peter provided a link to an existing github issue concerning field
          size and UTF-8: 
          https://github.com/fmtlib/fmt/issues/628.
- Tom noted that we were out of time; we'll continue discussion next
          time and will invite Victor to join us.
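Zach's observation about field alignment can be made concrete with a small sketch.  The helper names are invented and the code does not use the proposed format interface; it only contrasts padding computed from code units with padding computed from code points for UTF-8 text, assuming well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Pads with spaces until the string holds at least width code units.
std::string pad_by_code_units(const std::string& s, std::size_t width) {
    std::string out = s;
    if (out.size() < width) out.append(width - out.size(), ' ');
    return out;
}

// Pads with spaces until the string holds at least width code points.
std::string pad_by_code_points(const std::string& s, std::size_t width) {
    std::string out = s;
    const std::size_t cps = count_code_points(s);
    if (cps < width) out.append(width - cps, ' ');
    return out;
}
```

For the UTF-8 string "caf\xC3\xA9" (five code units, four code points) and a width of 6, the code unit version appends one space and the code point version appends two, so only the latter lines up with a six-character ASCII field.  Neither accounts for combining characters or display width, which is the text layout concern Martinho raised.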
 
- Tom stated our next meeting will be scheduled for three weeks from now
      on June 20th.  The extra week is to give everyone a break following
      Rapperswil.
June 20th, 2018
Draft agenda:
  - Rapperswil recap.  Progress!
- Continue review of P0645R2 (Text Formatting), hopefully with Victor
      present if he can attend.
- Review the draft D1097R0 proposal:
    
      - https://github.com/rmartinho/sg16/blob/master/papers/d1097r0.md
 
- Discuss what we want to learn from the Swift and WebKit developers.
Attendees:
  - Corentin Jabot
- JeanHeyd Meneide
- Keld Simonsen
- Mark Zeren
- Martinho Fernandes
- Peter Bindels
- Steve Downey
- Tom Honermann
- Victor Zverovich
- Zach Laine
Meeting summary:
  - First order of business was to ensure that papers requiring updates
      following the Rapperswil meeting are submitted in time for the
      post-Rapperswil mailing.  Tom confirmed that P0482R4 had been submitted
      and correspondence with Hal confirmed that P1025R1 (adopted at Rapperswil)
      will be included in the mailing.  Though not discussed in Rapperswil,
      Martinho plans to submit a revision of P1041 for the mailing.
- P0645R2 - Text Formatting
    
      - Victor started us off with a brief introduction of recent changes and
          review in Rapperswil.
- Victor reported having read the summary of our previous meeting and
          discussion of P0645.
- Discussion resumed regarding what field widths mean for multibyte
          encodings and combining characters.
- Victor asked if basing field widths on grapheme clusters would be
          appropriate.
- Zach provided an example of family emojis.  Consider 4 person code
          points separated by zero width joiners.  Each person code point
          combined with a ZWJ is a distinct grapheme cluster, but a single
          glyph may be used to display all four clusters.  So, grapheme clusters
          are not the right abstraction for field width.
- Tom claimed that format should be used to format code
          units.
- Peter suggested assuming one column per code point.
- Keld asked about other libraries; are there any that use abstractions
          above code points for field formatting?
- Tom stated that the competition is printf and iostreams.
- Keld asked what ICU does.
- Zach responded that he wasn't sure, but that Python uses code points
          for field formatting.
- Discussion then moved on to other topics briefly.
- Zach expressed enthusiasm for format_to_n.
- Tom asked if mixed character encodings are supported.  For example:
        - 
          format("{}", u"text"); // execution character encoding for format string with UTF-16 argument.
 
- Victor stated that mixed encodings are not supported and result in
          compilation failure.
- Zach observed that, if char8_t overloads were added,
          format must internally consume code points.
- Tom responded that this is true for any multibyte encoding, and
          therefore true in general for the execution and wide character
          encodings.
- Victor agreed, but noted that operations other than fill and field
          formatting could be optimized to avoid looking at code points.
- Peter asked if any multibyte encodings allow a NUL byte in trailing
          code unit sequences.  No such encodings were named.
- Peter observed that, if an encoding library is used, format
          can always just read code points.
- Zach offered to provide Victor code using code point iterators from
          Boost.Text that could be used to prototype code point based
          approaches.
- Discussion briefly turned to portability of wchar_t and
          Keld's work to increase the number of C interfaces that do not rely
          on global program state; e.g., locale data.  Keld wants to improve
          support for working with multiple encodings in a single process.
- Tom noted that such improvements are useful for our ideas around use
          of compile-time known internal encodings with transcoding to run-time
          determined encodings at program borders.
- Tom asked how format handles signed and unsigned char; are
          they treated as integral/arithmetic or character types?
- Victor replied that he didn't recall and would have to check.
- Keld asked about reentrancy.
- Victor responded that the only global state references are for locale
          data.
- Keld recommended allowing strings to be tagged with encoding data.
- Tom tried to bring discussion back to fill operations and field widths;
          are we agreed on use of code points for field fill/alignment?
- Martinho asked how a code point approach works when writing to a fixed
          width buffer (of code units).
- Victor mentioned that format_to_n takes a code unit count
          constraint.
- Peter observed that a code unit count constraint can result in
          truncated code unit sequences.
- Victor suggested that format_to could produce code points
          instead.
- Steve asked how to avoid writing broken code; code points produced are
          likely going to be written to a code unit buffer anyway.
- Keld stated that programmers like to write both code unit and code point
          code; perhaps both should be supported.
- Martinho claimed that truncated code unit sequences are probably not a
          large concern; buffers are generally larger than required anyway.
- Discussion again drifted towards encodings that are known at
          compile-time vs run-time.
- Keld asked what types are generally used for double byte character
          sets; Japanese, Chinese, ...
- Martinho responded that those tend to be variable length encodings
          that switch between single byte and multibyte.
- Tom agreed and mentioned ISO-2022 and escape sequences.
- Discussion drifted back to code units vs code points.
- Zach suggested that programmers will expect the output encoding to
          match the format string, but that code points are more consistent and
          natural.  If the n in format_to_n means something
          different than for field widths, that will be a problem.
- Victor agreed that programmers will expect to be filling a code unit
          based buffer.
- Tom observed that more discussion would be useful, but that we need
          to move on.
- Zach recommended trying to support both code unit and code point
          based approaches and observe feedback and usage.
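Two points from this discussion, the family emoji example and Peter's observation about truncated code unit sequences, can be sketched as follows.  The names are invented, the input is assumed to be well-formed UTF-8, and format_to_n itself is not called; the truncation helpers only model what a code unit count limit does to the output.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY encoded as UTF-8.
const std::string family =
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA7" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA6";

// Counts bytes that are not UTF-8 continuation bytes (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Truncates to at most n code units; may cut a code point in half.
std::string truncate_code_units(const std::string& s, std::size_t n) {
    return s.substr(0, n);
}

// Truncates to at most n code units, backing up so that the cut falls
// on a code point boundary.
std::string truncate_at_code_point(const std::string& s, std::size_t n) {
    if (n >= s.size()) return s;
    while (n > 0 && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80) {
        --n;
    }
    return s.substr(0, n);
}
```

The sequence occupies 25 UTF-8 code units and 7 code points yet is typically displayed as a single glyph, so neither count predicts the displayed width.  Truncating it to 6 code units leaves a partial ZWJ sequence at the end, while the boundary-aware version keeps only the first 4 code units.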
 
- D1097R0 - Named character escapes
    
      - Martinho started by requesting feedback on:
        
          - name matching (currently more limited than described by
              UAX44-LM2)
- lack of support for named character sequences.
 
- Tom recommended adding a small section that summarizes what is
          actually proposed.  At present, the paper presents a number of options,
          but one must read the proposed wording to determine which options are
          actually proposed.
- Tom expressed a preference for following the UAX44-LM2 rule
          for name matching.
- Martinho responded with a dislike for the
          U+1180 HANGUL JUNGSEONG O-E exception and noted that none of
          the other languages he surveyed use UAX44-LM2 for matching.
- Keld noted existing APIs that allow specifying precision for matching.
- Martinho clarified that general collation APIs don't apply here
          (because of the U+1180 HANGUL JUNGSEONG O-E exception).
- Tom asked if we should propose this for C and everyone responded yes.
- Tom mentioned the paper should address the potential for code breakage.
          "\N" has a meaning now (it means "N").
- Tom asked if it is permissible to construct these escapes using macro
          concatenation.
- Tom observed that '_' seemed to be missing in the definition of
          c-char.
- Martinho stated that is intentional; '_' would be needed for
          UAX44-LM2 matching, but that actual character names never use
          '_'.
- Zach suggested adding a Tony Table to compare use of \U and
          \N{} escapes.
- Tom suggested clarifying that \N{} escapes would not be
          permitted in identifiers.
- Tom asked about interaction with raw string literals;
          r-char-sequence doesn't seem to include
          universal-character-name.
- Martinho responded that universal-character-name escapes are
          not recognized in raw string literals; following existing precedent.
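The name matching discussion can be illustrated with a simplified approximation of UAX44-LM2 loose matching.  This is a sketch under assumptions, not the wording proposed in D1097R0: the function name is invented, only ASCII space is treated as whitespace, a hyphen counts as medial when it sits directly between two alphanumeric characters, and the U+1180 exception is handled by a single hard-coded comparison.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Produces a comparison key: uppercase, with spaces, underscores, and
// medial hyphens removed, except that the hyphen in
// HANGUL JUNGSEONG O-E is preserved.
std::string loose_name_key(const std::string& name) {
    std::string kept;  // spaces and underscores removed, all hyphens kept
    std::string key;   // as kept, with medial hyphens removed as well
    for (std::size_t i = 0; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (c == ' ' || c == '_') continue;
        if (c == '-') {
            kept.push_back('-');
            const bool medial =
                i > 0 && i + 1 < name.size() &&
                std::isalnum(static_cast<unsigned char>(name[i - 1])) &&
                std::isalnum(static_cast<unsigned char>(name[i + 1]));
            if (!medial) key.push_back('-');
            continue;
        }
        const char up = static_cast<char>(std::toupper(c));
        kept.push_back(up);
        key.push_back(up);
    }
    // The one exception: dropping this hyphen would make U+1180 collide
    // with U+116C HANGUL JUNGSEONG OE.
    if (kept == "HANGULJUNGSEONGO-E") return kept;
    return key;
}
```

Two spellings match when their keys compare equal.  The special case at the end is the exception Martinho disliked, and it is also why a general-purpose collation API cannot be substituted for this rule.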
 
- Rapperswil recap:
    
      - Tom asked if Rapperswil attendees were able to connect with authors
          of previously discussed papers in order to deliver our feedback.
- JeanHeyd reported that connections did not happen.  However:
        
          - P1030 was not discussed in Rapperswil.
- P0882 was discussed in LEWG but not well received.  No need for
              follow up.
- P0540 was discussed but LEWG feedback matched ours.  No need to
              follow up.
 
 
- We ran out of time to discuss what we want to learn from the Swift and
      WebKit developers.
- Tom asked about renaming the SG16 mailing list from unicode to
      sg16-unicode.  Both Tom and Martinho had been annoyed by the
      similarity to the unicode.org mailing list by the same name.  No
      objections were raised; Tom will follow up with Keld.
- Tom noted that our next regularly scheduled meeting would fall on July
      4th, a US holiday.  The next meeting will be scheduled for July 11th.