JTC1/SC22/WG14
N628
DOC NUMBER: WG14/N628
X3J11/96-092
Date: 07-Nov-96
The attached document WG21/N0886 was adopted by the WG21/X3J16
C++ committee with the following two amendments. These amendments
were taken from the WG21 C++ Stockholm minutes.
==== C Compatibility (Nelson) ====
17) Motion (to change the syntax for universal-character-name) by
Nelson/Plum:
Move we amend the WP by changing all occurrences of "??u" to "\u"
and all occurrences of "??U" to "\U". (The affected clauses are 2.2
[lex.charset], footnote 18 in 2.10 [lex.name], and A.2 [gram.lex].)
Motion passed X3J16: 31 yes, 0 no.
Motion passed WG21: 7 yes, 0 no, 0 abstain.
18) Motion (to clarify phases of translation) by Nelson/Benito:
Move we amend clause 2.1 [lex.phases] of the WP as follows:
-- under phase 1, add to the end of sentence 1: "in an
implementation-defined manner."
-- under phase 2, after sentence 1, add the following new sentence:
"If a character sequence results which matches the syntax of a
universal-character-name, the behavior is undefined."
-- under phase 4, after sentence 1, add the following new sentence:
"If a character sequence is produced by token concatenation
(_cpp.concat_) which matches the syntax of a universal-char-
acter-name, the behavior is undefined."
Motion passed X3J16: 31 yes, 0 no.
Motion passed WG21: 7 yes, 0 no, 0 abstain.
WG21/N0886
X3J16/96-0068
1996-03-13
Extended Identifiers and Extended Literals
Thomas Plum, John Benito, Clark Nelson
Move that we revise the Working Paper as follows:
Item 1) In 2.1, Phases of Translation, add a new paragraph 1 to precede
the existing first paragraph. Then add onto "phase 1" a new sentence,
"Any source file character not in the basic source character set is replaced
by the _universal-character-name_ that designates that character." so that
the first two paragraphs would read as follows:
2.1 Phases of translation [lex.phases]
1 The _basic source character set_ consists of 96 characters: the space
character, the control characters representing horizontal tab,
vertical tab, form feed, and new-line, plus the following 91
graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z (26)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (26)
0 1 2 3 4 5 6 7 8 9 (10)
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ' (29)
The _universal-character-name_ construct provides a way to name other
characters. The character designated by the _universal-character-name_
??UNNNNNNNN is that character whose encoding in ISO/IEC 10646 is the
hexadecimal value NNNNNNNN; the character designated by the
_universal-character-name_ ??uNNNN is that character whose encoding in
ISO/IEC 10646 is the hexadecimal value 0000NNNN.
hex-quad:
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
??u hex-quad
??U hex-quad hex-quad
2 The precedence among the syntax rules of translation is specified by the
following phases.
1 Physical source file characters are mapped to the source character set
(introducing new-line characters for end-of-line indicators) if necessary.
Trigraph sequences (2.2) are replaced by corresponding single-character
internal representations. Any source file character not in the basic
source character set is replaced by the _universal-character-name_ that
designates that character. [Footnote -- The process of handling extended
characters is specified in terms of mapping to an encoding that uses only the
basic source character set, and, in the case of character literals and
strings, further mapping to the execution character set. In practical terms,
however, any internal encoding may be used, so long as an actual extended
character encountered in the input, and the same extended character
expressed in the input as an _universal-character-name_ (i.e. using the
??uXXXX notation), are handled equivalently.]
[end of quote from revised WP]
Item 2) In 2.1, Phases of Translation, revise "phase 5" as follows:
5 Each source character set member, escape sequence, or
_universal-character-name_ in character literals and string literals is
converted to a member of the execution character set.
Item 3) In 2.8, Identifiers, add a new line into the definition of _nondigit_,
and modify paragraph 1, so that the revised text of 2.8 reads as follows:
2.8 Identifiers [lex.name]
identifier:
nondigit
identifier nondigit
identifier digit
nondigit: one of
_universal-character-name_
_ a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
digit: one of
0 1 2 3 4 5 6 7 8 9
1 An identifier is an arbitrarily long sequence of nondigits and digits.
Each _universal-character-name_ in an identifier shall designate a character
whose encoding in ISO 10646 falls into one of the ranges specified in Annex E.
Upper- and lower-case letters are different. All characters are significant. *
[*Footnote: On systems in which linkers cannot accept extended characters, an
encoding of the universal-character-name may be used in forming valid external
identifiers. For example, some otherwise unused character or sequence of characters
may be used to encode the "??u" in a universal-character-name. Extended characters
may produce a long external identifier, but C++ does not place a
translation limit on significant characters for external identifiers. In C++,
upper and lower case letters are considered different for all identifiers,
including external identifiers.]
Item 4) Augment the definition of _c-char_ in 2.10.2, Character Literals, as
follows:
c-char:
any member of the source character set except
the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
Then add a new paragraph 5, as follows:
5 A _universal-character-name_ is translated to the encoding, in the
execution character set, of the character named. If there is no such
encoding, the _universal-character-name_ is translated to an
implementation-defined encoding. [Note: In translation phase 1 a
_universal-character-name_ is introduced whenever an actual extended
character is encountered in the source text. Therefore, all extended
characters are described in terms of _universal-character-names_.
However, the actual compiler implementation may
use its own native character set, so long as the same results are obtained.]
Item 5) Augment the definition of _s-char_ in 2.10.4, String Literals, as
follows:
s-char:
any member of the source character set except
the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name
Then, in paragraph 5, change "Escape sequences" to "Escape sequences and
_universal-character-names". Change the last sentence to read as follows:
In a non-wide string literal, a _universal-character-name_ may map to more
than one char element. The size of a wide string literal is the total number
of escape sequence, _universal-character-names_, and other characters, plus
one for the terminating L'\0'. The size of a non-wide string literal is
the total number of escape sequences and other characters, plus at least one
for the multibyte encoding of each _universal-character-name_, plus one for
the terminating '\0'.
Item 6) Add an annex to list the universal-character-names for identifiers.
________________________________________________________________________________
Annex E (normative) Universal-character-names for Identifiers [extended-id]
________________________________________________________________________________
1 This Clause lists the hexadecimal code values that are valid in
_universal-character-names_ in C++ identifiers.
2 This table is reproduced unchanged from ISO/IEC PDTR 10176, produced by
ISO/IEC JTC1/SC22/WG20, except that the ranges 0041-005a and 0061-007a designate
the upper and lower case English alphabets, which are part of the basic source
character set, and are not repeated in the table below.
[Editorial Note: If PDTR 10176 is changed during its balloting and adoption as
a TR, then this table should be changed to match its changes.]
Latin: 00c0-00d6,00d8-00f6,00f8-01f5,01fa-0217,
0250-02a8,1e00-1e9a,1ea0-1ef9
Greek: 0384,0388-038a,038c,038e-03a1,03a3-03ce,03d0-03d6,03da,03dc,03de,
03e0,03e2-03f3,
1f00-1f15,1f18-1f1d,1f20-1f45,1f48-1f4d,1f50-1f57,1f59,1f5b,1f5d,
1f5f-1f7d,1f80-1fb4,1fb6-1fbc,1fc2-1fc4,1fc6-1fcc,1fd0-1fd3,
1fd6-1fdb,1fe0-1fec,1ff2-1ff4,1ff6-1ffc,
Cyrilic: 0401-040d,040f-044f,0451-045c,045e-0481,0490-04c4,04c7-04c8,
04cb-04cc,04d0-04eb,04ee-04f5,04f8-04f9
Armenian: 0531-0556,0561-0587
Hebrew: 05d0-05ea,05f0-05f4
Arabic: 0621-063a,0640-0652,0670-06b7,06ba-06be,06c0-06ce,06e5-06e7,
Devanagari: 0905-0939,0958-0962
Bengali: 0985-098c,098f-0990,0993-09a8,09aa-09b0,09b2,09b6-09b9,
09dc-09dd,09df-09e1,09f0-09f1
Gurmukhi: 0a05-0a0a,0a0f-0a10,0a13-0a28,0a2a-0a30,0a32-0a33,
0a35-0a36,0a38-0a39,0a59-0a5c,0a5e
Gujarati: 0a85-0a8b,0a8d,0a8f-0a91,0a93-0aa8,0aaa-0ab0,0ab2-0ab3,
0ab5-0ab9,0ae0,
Oriya: 0b05-0b0c,0b0f-0b10,0b13-0b28,0b2a-0b30,0b32-0b33,0b36-0b39,
0b5c-0b5d,0b5f-0b61,
Tamil: 0b85-0b8a,0b8e-0b90,0b92-0b95,0b99-0b9a,0b9c,0b9e-0b9f,0ba3-0ba4,
0ba8-0baa,0bae-0bb5,0bb7-0bb9,
Telugu: 0c05-0c0c,0c0e-0c10,0c12-0c28,0c2a-0c33,0c35-0c39,0c60-0c61,
Kannada: 0c85-0c8c,0c8e-0c90,0c92-0ca8,0caa-0cb3,0cb5-0cb9,0ce0-0ce1,
Malayalam: 0d05-0d0c,0d0e-0d10,0d12-0d28,0d2a-0d39,0d60-0d61,
Thai: 0e01-0e30,0e32-0e33,0e40-0e46,0e4f-0e5b,
Lao: 0e81-0e82,0e84,0e87,0e88,0e8a,0e0d,0e94-0e97,0e99-0e9f,0ea1-0ea3,
0ea5,0ea7,0eaa,0eab,0ead-0eb0,0eb2,0eb3,0ebd,0ec0-0ec4,0ec6,
Georgian: 10a0-10c5,10d0-10f6,
Hiragana: 3041-3094,309b-309e
Katakana: 30a1-30fe,
Bopmofo: 3105-312c,
Hangul: 1100-1159,1161-11a2,11a8-11f9
CJK Unified Ideographs: f900-fa2d,
fb1f-fb36,fb38-fb3c,fb3e,fb40-fb41,fb42-fb44,fb46-fbb1,fbd3-fd3f,
fd50-fd8f,fd92-fdc7,fdf0-fdfb,fe70-fe72,fe74,5e76-fefc,
ff21-ff3a,ff41-ff5a,ff66-ffbe,ffc2-ffc7,ffca-ffcf,ffd2-ffd7,
ffda-ffdc,4e00-9fa5