SC22/WG20 N975

From: Kenneth Whistler [kenw@sybase.com]
Sent: Tuesday, August 27, 2002 7:31 PM

Subject: 	Characters for identifiers in IS 14882 and TR 10176
		RE: Agenda for Character set ad-hoc - 26th August

Tom Plum wrote:

> I took an action item to compare the extended-id character list
> of C++ (ISO/IEC 14882:1998) versus the latest PDTR 10176.
> 
> Here is that comparison:
> 
> 10176 added 42011 additional codes
> 
> 10176 deleted these codes:
> 
>      1 01BB
>      2 0384..0385
>      2 05F3..05F4
>      1 0640
>      8 064B..0652
>      1 0670
>      1 0E46
>     12 0E4F..0E5B
>      4 309B..309E
>      4 30FB..30FE
>    250 1100..11F9
> ~1750 F900..FFDC
> 
> I.e. 10176 deleted 12 ranges which contain about 2000 code points,
> compared to the old C++ list (from the original 10176 list)
> 
> If anybody finds errors, please post them to this list.

Unfortunately, there are *many* errors in this accounting.

I don't have a copy of ISO/IEC 14882:1998 to hand, so I cannot
compare the listing of extended-id characters there to the
10176 listing, but I *do* have copies of all the relevant
10176 documents.

Judging from what Tom observes as major deletions in 10176,
what he calls "the original 10176 list" can only be the
*D*TR 10176, WG20 N477, dated 1996-12-31. That document
underwent a major overhaul as a result of its balloting,
based on the proposed disposition of comments (WG20 N531,
dated 1997-09-26), revised as the final disposition of
comments (WG20 N532R, dated 1997-12-14). The final result,
which was published as the *TR* 10176, 2nd edition, can
be seen in WG20 N533, dated 1998-02-15, which was the
last WG20 document before the 2nd edition was published.

The deletions between the DTR (an unpublished document)
and the published TR 10176 2nd edition were as follows:

    1 0384
    2 05F3..05F4
    2 309D..309E
    2 30FD..30FE
  240 1100..1159, 1161..11A2, 11A8..11F9 (note: 240, not 250)
 1240 F900..FFDC (lots of gaps; note: 1240, not ~1750)

(There were numerous additions as well as the deletions.)

Incidentally, the U.S. national body (and the UTC) requested
only the deletion of 0384 (which was an error for the intended
0386, which was added), and 05F3..05F4 (which are Hebrew
punctuation, not letters). The other deletions were blanket
removals at the behest of the then-Japanese editor of 10176
and because of the decision by WG20 to omit all combining marks
in 10646 Annex B.2 ("List of combining and other characters
not allowed in implementation level 2") -- which accounts for
the omission of the conjoining jamos for Korean.

The *Unicode* recommendations for extended identifiers
contain all 1484 of those omissions, as documented at
the end of Annex A in TR 10176, 4th edition.

Next in the accounting trail, consider the differences
between the published TR 10176, 2nd edition and the published
TR 10176, 3rd edition. The differences can be found in
the Amendment 1 text, WG20 N699, dated 1999-10-22, which
was the last committee document before the publication
of the 3rd edition. The deletions between the TR 10176 2nd
edition and the TR 10176 3rd edition were as follows:

    1 00B7
    1 06D4
    1 0E4F
    2 0E5A..0E5B
   10 0F2A..0F33
    2 0F3E..0F3F
    2 309B..309C

The rationales for these deletions are:

   00B7, 06D4, 0E4F, 0E5A..0E5B are all punctuation.
       (00B7 is the notorious MIDDLE DOT, and has to
        be special-cased, like LOW LINE)
   0F2A..0F33 are Tibetan half-digit symbols -- not the regular
       Tibetan digits
   309B..309C are *spacing* diacritics, comparable to
       the other spacing accents which have always been
       omitted from the recommended list for 10176.

   0F3E..0F3F are combining marks -- their omission was
       the result of a clerical error in carrying out the committee
       mandate to separate the list of non-combining marks
       and combining marks in the listing for Annex A.
       The *Unicode* recommendation is to include them,
       as also documented at the end of Annex A in TR 10176,
       4th edition.

TR 10176, 4th edition has not recommended the deletion of
any more characters from the list. It *did* make large
extensions to account for all the additions to 10646-1:2000
2nd edition (= Unicode 3.0). However, the additions are
nowhere near "42011 additional codes", since the entire
repertoire consists of 49,194 graphic characters, most of which were
already accounted for in the recommendations for identifiers
in TR 10176, 2nd edition. The major additions were 6582
new CJK characters, 1165 Yi characters, and somewhere around
2000 for other new scripts (Ethiopic, Canadian syllabics,
Khmer, Myanmar, Mongolian, etc.).

The other characters on Tom's list of deletions are errors
in his accounting.

  0385 was *never* in the 10176 list.

  01BB, 0640, 064B..0652, 0670, 0E46, 0E50..0E59, and 30FB..30FC were
    *all* in the original DTR 10176 list, and have *stayed*
    in the published 2nd, 3rd, and 4th editions.
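
For anyone who wants to check that reconciliation mechanically, here
is a minimal Python sketch. It covers only the small ranges: the two
bulk entries on Tom's list (1100..11F9 and F900..FFDC) are left out,
because their exact membership is not spelled out in this message.
The names expand, toms_list, and documented are illustrative only.

```python
def expand(spec):
    """Expand a string like '0384..0385, 0640' into a set of code points."""
    points = set()
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("..")
        # A single value such as '0640' has no '..', so hi is empty.
        points.update(range(int(lo, 16), int(hi or lo, 16) + 1))
    return points

# Tom's reported deletions, minus the two bulk ranges.
toms_list = expand(
    "01BB, 0384..0385, 05F3..05F4, 0640, 064B..0652, 0670, "
    "0E46, 0E4F..0E5B, 309B..309E, 30FB..30FE"
)

# Deletions documented above (DTR to 2nd edition, 2nd to 3rd edition),
# again minus the bulk ranges.
documented = expand(
    "0384, 05F3..05F4, 309D..309E, 30FD..30FE, 00B7, 06D4, "
    "0E4F, 0E5A..0E5B, 0F2A..0F33, 0F3E..0F3F, 309B..309C"
)

# Code points Tom reported as deleted that were not deleted.
never_deleted = sorted(toms_list - documented)
# Code points that were deleted but do not appear on Tom's list.
missed_by_tom = sorted(documented - toms_list)

print("reported but not deleted:", " ".join(f"{cp:04X}" for cp in never_deleted))
print("deleted but not reported:", " ".join(f"{cp:04X}" for cp in missed_by_tom))
```

The first set comes out as exactly the code points listed above
(0385, 01BB, 0640, 064B..0652, 0670, 0E46, 0E50..0E59, 30FB..30FC).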

O.k., so with all of that out of the way, let me summarize the
picture. Unlike the somewhat scary picture of instability suggested 
by Tom's conclusion, to wit:

"10176 deleted 12 ranges which contain about 2000 code points"

the actual state of affairs is that since the *publication* of
the 2nd edition of 10176, with its Annex A, 10176 has deleted
a total of 19 code points -- and two of those were the result
of a clerical error. The others all have good reasons for being
omitted, and only one of them, U+00B7 MIDDLE DOT, can be
considered a common-use character. Since the publication of
the 3rd edition, *none* have been deleted.

Now, if the extended-id character list in ISO/IEC 14882:1998 was
based on the DTR for 10176 2nd edition, rather than the published
TR, as appears to have been the case, then the situation does,
indeed, involve a rather more serious mismatch. There must have
been a serious breakdown in committee liaison involved, since it seems
rather questionable to base a language standard on a DTR list
still undergoing substantial revision in another committee.

However, it seems to me that the road forward would not consist of
attempting to *remove* large numbers of characters from C++ identifiers,
based on what happened in the DTR balloting of TR 10176 2nd edition,
but rather of considering the more benign consequences of harmonizing
with the current Unicode recommendations.

Assuming that ISO/IEC 14882:1998 was, indeed, based on the
DTR 10176 2nd edition text, here are the consequences of the
two approaches:

Deletions required to synch with TR 10176, 4th edition
Annex A recommendations:

    1 0384
    2 05F3..05F4
    2 309D..309E
    2 30FD..30FE
  240 1100..1159, 1161..11A2, 11A8..11F9
 1240 F900..FFDC (lots of gaps; note: 1240, not ~1750)
    1 00B7
    1 06D4
    1 0E4F
    2 0E5A..0E5B
   10 0F2A..0F33
    2 0F3E..0F3F
    2 309B..309C

 Total: 1506 deletions

Deletions required to synch with the Unicode recommendations
(see the recommendations appended at the end of Annex A
in the 4th edition of TR 10176):

    1 0384           [spacing diacritic]
    2 05F3..05F4     [punctuation]
    1 00B7           [punctuation: special-case for Catalan]
    1 06D4           [punctuation]
    1 0E4F           [punctuation]
    2 0E5A..0E5B     [punctuation]
   10 0F2A..0F33     [Tibetan half-digit symbols]
    2 309B..309C     [spacing diacritics]

 Total: 20 deletions (19 - 2 + 3)
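
The two totals above can be re-derived with a short Python sketch.
This is only a tally under stated assumptions: the counts for the
conjoining jamos (240) and for F900..FFDC (1240) are taken as given
in this message rather than computed, and the names size, dtr_to_2nd,
and second_to_3rd are illustrative.

```python
def size(first, last=None):
    """Number of code points in the inclusive range first..last."""
    return (first if last is None else last) - first + 1

# Deletions between the DTR and the published 2nd edition.
dtr_to_2nd = {
    "0384": size(0x0384),
    "05F3..05F4": size(0x05F3, 0x05F4),
    "309D..309E": size(0x309D, 0x309E),
    "30FD..30FE": size(0x30FD, 0x30FE),
    "conjoining jamos": 240,  # count as stated in the message
    "F900..FFDC": 1240,       # count as stated; the range has gaps
}

# Deletions between the published 2nd and 3rd editions.
second_to_3rd = {
    "00B7": size(0x00B7),
    "06D4": size(0x06D4),
    "0E4F": size(0x0E4F),
    "0E5A..0E5B": size(0x0E5A, 0x0E5B),
    "0F2A..0F33": size(0x0F2A, 0x0F33),
    "0F3E..0F3F": size(0x0F3E, 0x0F3F),
    "309B..309C": size(0x309B, 0x309C),
}

# Synching with TR 10176 Annex A means taking every deletion.
tr_total = sum(dtr_to_2nd.values()) + sum(second_to_3rd.values())

# Synching with the Unicode recommendations: start from the 19
# post-publication deletions, keep 0F3E..0F3F (-2), and add the
# three deletions that the U.S. national body requested (+3).
unicode_total = (
    sum(second_to_3rd.values())
    - second_to_3rd["0F3E..0F3F"]
    + dtr_to_2nd["0384"]
    + dtr_to_2nd["05F3..05F4"]
)

print(tr_total)       # 1506
print(unicode_total)  # 20
```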

I think the second list would be a whole lot easier to document
and justify to your standard's constituents.

Regards,

--Ken Whistler

P.S. If anyone else wants to churn through the statistics here,
I have soft copies of the lists from DTR 10176 2nd edition, the
list resulting from the application of the proposed disposition
of comments to DTR 10176, the list resulting from the application
of the *final* disposition of comments to DTR 10176, the actual
list from the published TR 10176 2nd edition, the list from
Amd 1 (the basis for TR 10176 3rd edition), and the list from
TR 10176 4th edition. So you are welcome to check my work, if
you'd like.

A nagging notion made me go back and double-check, and I was
wrong. There were two more deletions for the 4th edition:

    1 2118 SCRIPT CAPITAL P [a misidentified character]
    1 212E ESTIMATED SYMBOL

So the revised summary would be:

Deletions required to synch with TR 10176, 4th edition
Annex A recommendations:

    1 0384
    2 05F3..05F4
    2 309D..309E
    2 30FD..30FE
  240 1100..1159, 1161..11A2, 11A8..11F9
 1240 F900..FFDC (lots of gaps; note: 1240, not ~1750)
    1 00B7
    1 06D4
    1 0E4F
    2 0E5A..0E5B
   10 0F2A..0F33
    2 0F3E..0F3F
    2 309B..309C
    1 2118
    1 212E

 Total: 1508 deletions

Deletions required to synch with the Unicode recommendations
(see the recommendations appended at the end of Annex A
in the 4th edition of TR 10176):

    1 0384           [spacing diacritic]
    2 05F3..05F4     [punctuation]
    1 00B7           [punctuation: special-case for Catalan]
    1 06D4           [punctuation]
    1 0E4F           [punctuation]
    2 0E5A..0E5B     [punctuation]
   10 0F2A..0F33     [Tibetan half-digit symbols]
    2 309B..309C     [spacing diacritics]
    1 2118           [misidentified as letterlike symbol]
    1 212E           [symbol, not letterlike]

 Total: 22 deletions (21 - 2 + 3)

And note that there is one other fly in the ointment. Java
*allows* 2118, 212E, and 309B..309C in identifiers. So for
identifier stability and interoperability, those four characters
might also need to be special-cased.
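
As a rough illustration of what that special-casing might look like,
here is a Python sketch. It assumes that "special-cased" means
leaving the four Java-allowed characters permitted in identifiers,
which is only one possible reading. The names unicode_deletions,
java_allowed, and would_delete are hypothetical.

```python
# Revised deletions needed to synch with the Unicode recommendations,
# as listed above (22 code points).
unicode_deletions = (
    {0x0384, 0x00B7, 0x06D4, 0x0E4F, 0x2118, 0x212E}
    | set(range(0x05F3, 0x05F4 + 1))
    | set(range(0x0E5A, 0x0E5B + 1))
    | set(range(0x0F2A, 0x0F33 + 1))
    | set(range(0x309B, 0x309C + 1))
)

# Characters that Java allows in identifiers, per the note above.
java_allowed = {0x2118, 0x212E} | set(range(0x309B, 0x309C + 1))

def would_delete(cp, java_compatible=True):
    """Return True if code point cp would be dropped from the list.

    With java_compatible=True, the four Java-allowed characters are
    exempted (an assumed policy, not one stated in TR 10176).
    """
    if java_compatible and cp in java_allowed:
        return False
    return cp in unicode_deletions

strict = [cp for cp in sorted(unicode_deletions) if would_delete(cp, False)]
relaxed = [cp for cp in sorted(unicode_deletions) if would_delete(cp, True)]
print(len(strict), len(relaxed))
```

Under that assumption the deletion list shrinks from 22 code points
to 18.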

Regards,

--Ken Whistler
