JTC1/SC22/WG14
N951
String literals and concatenation
Clive Feather
<clive@demon.net>
Last changed 2001-08-14
Introduction
============
There is an inconsistency in the rules for string literal concatenation
and the relationship between source and execution character sets. This
paper discusses this inconsistency and suggests a new model and associated
changes to the Standard.
This paper was written following discussions on the WG14 reflector, with
particular input from Tanaka Keishiro and Antoine Leca.
Standard text
=============
The following text from the Standard is relevant.
Translation phase 1:
Physical source file multibyte characters are mapped,
in an implementation-defined manner, to the source
character set (introducing new-line characters for
end-of-line indicators) if necessary.
Translation phase 3:
The source file is decomposed into preprocessing
tokens and sequences of white-space characters
(including comments).
Translation phase 5:
Each source character set member and escape sequence
in character constants and string literals is
converted to the corresponding member of the execution
character set; if there is no corresponding member, it
is converted to an implementation-defined member other
than the null (wide) character.
Translation phase 6:
Adjacent string literal tokens are concatenated.
6.4.5#1:
[#1]
    string-literal:
        " s-char-sequence-opt "
        L" s-char-sequence-opt "
    s-char-sequence:
        s-char
        s-char-sequence s-char
    s-char:
        any member of the source character set except the double-quote ",
            backslash \, or new-line character
        escape-sequence
6.4.5#4:
[#4] In translation phase 6, the multibyte character
sequences specified by any sequence of adjacent character
and wide string literal tokens are concatenated into a
single multibyte character sequence. If any of the tokens
are wide string literal tokens, the resulting multibyte
character sequence is treated as a wide string literal;
otherwise, it is treated as a character string literal.
6.4.5#5:
[#5] In translation phase 7, a byte or code of value zero is
appended to each multibyte character sequence that results
from a string literal or literals. The multibyte
character sequence is then used to initialize an array of
static storage duration and length just sufficient to
contain the sequence. For character string literals, the
array elements have type char, and are initialized with the
individual bytes of the multibyte character sequence; for
wide string literals, the array elements have type wchar_t,
and are initialized with the sequence of wide characters
corresponding to the multibyte character sequence, as
defined by the mbstowcs function with an implementation-
defined current locale. The value of a string literal
containing a multibyte character or escape sequence not
represented in the execution character set is
implementation-defined.
Problems
========
Consider code like:
L"abc" "def"
The 6.4.5#4 text says that the multibyte sequences in the literals are
concatenated into a single sequence in translation phase 6. But, on the
other hand, multibyte characters were mapped to source characters in TP1,
and the source characters were then mapped to execution character set
characters in TP5. So there are no multibyte sequences available in TP6 to
be concatenated.
There are then further problems. Consider a string literal containing a
UCN:
L"\u8868"
At TP5 this is converted to a member of the execution character set, but
at TP7 (6.4.5#5) this literal is supposed to generate a multibyte
character that can be fed to mbstowcs. Nowhere is it explained where this
multibyte character comes from.
Finally, consider an implementation where the two byte sequence 0x95 0x5C
is the source encoding of U+8868. Look at the following literals:
L"@\" (@ represents the byte with value 0x95)
L"\x95\x5C"
L"\x95" "\\"
At TP5 the second of these is effectively converted to the first, and
after concatenation in TP6 so is the third. This means that all of these
literals generate an array of one element, holding the wide character
with value 0x955C. This is somewhat counter-intuitive, and Tanaka-san
states that it is not what users will expect or implementers will produce.
The alternative is to assume that TP1 will convert the first literal to
some internal character. But in this case TP7 lacks anything obvious to
pass to mbstowcs, and the other two cases still generate the "wrong"
answer.
Some examples of desired output
===============================
Our next step was to consider a range of examples and note what we thought
they should produce.
    Example           Array type     Array contents
 1: "ABC"             (char [4])     { 0x41, 0x42, 0x43, 0x00 }
 2: "\x12" "34"       (char [4])     { 0x12, 0x33, 0x34, 0x00 }
 3: "\x95" "\\"       (char [3])     { 0x95, 0x5C, 0x00 }
 4: "@\"              (char [3])     { 0x95, 0x5C, 0x00 }
 5: "@" "\\"          (char [3])     { 0x95, 0x5C, 0x00 } OR UNDEFINED
 6: L"ABC"            (wchar_t [4])  { 0x0041, 0x0042, 0x0043, 0x0000 }
 7: L"\u8868"         (wchar_t [2])  { 0x955C, 0x0000 }
 8: L"\x95\\"         (wchar_t [3])  { 0x0095, 0x005C, 0x0000 }
 9: L"\x95" L"\\"     (wchar_t [3])  { 0x0095, 0x005C, 0x0000 }
10: "\x95" L"\\"      (wchar_t [3])  { 0x0095, 0x005C, 0x0000 }
11: L"@\"             (wchar_t [2])  { 0x955C, 0x0000 }
12: L"\x955C"         (wchar_t [2])  { 0x955C, 0x0000 }
13: L"\x95"           (wchar_t [2])  { 0x0095, 0x0000 }
14: "@\\"             UNDEFINED
15: "@" "\"           UNDEFINED
Example 14 is undefined because @\ forms a single multibyte character, so
the following \" is an escape sequence and the literal is unterminated.
Example 15 depends on whether @" is a valid multibyte sequence or not.
If it is, then the third " terminates the literal and the backslash causes
a syntax error. If it is not, the second literal is unterminated.
Example 5 is defined or undefined in the same way.
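The examples that use only escape sequences (3, 8, 9, 10 and 13) can be
checked on an existing compiler, since they do not depend on the source
encoding. The following is a minimal sketch of such a check, not part of the
proposal; the names ex3, ex8 and so on and the helper functions are
hypothetical, and it assumes a C99 implementation with 8-bit char and an
ASCII-compatible execution character set (so that backslash is 0x5C).

```c
#include <stddef.h>
#include <wchar.h>

/* Examples 3, 8, 9, 10 and 13 from the table: escape sequences only. */
static const char    ex3[]  = "\x95" "\\";
static const wchar_t ex8[]  = L"\x95\\";
static const wchar_t ex9[]  = L"\x95" L"\\";
static const wchar_t ex10[] = "\x95" L"\\";
static const wchar_t ex13[] = L"\x95";

/* True if w holds exactly { 0x0095, 0x005C, 0x0000 }. */
static int wide_is_95_5c(const wchar_t *w)
{
    return w[0] == 0x95 && w[1] == 0x5C && w[2] == 0;
}

/* Returns 1 if every array has the type and contents the table lists
   (assumption: ASCII-compatible execution character set). */
static int escape_examples_match_table(void)
{
    return sizeof ex3 == 3
        && (unsigned char)ex3[0] == 0x95 && ex3[1] == 0x5C && ex3[2] == 0
        && sizeof ex8  == 3 * sizeof(wchar_t) && wide_is_95_5c(ex8)
        && sizeof ex9  == 3 * sizeof(wchar_t) && wide_is_95_5c(ex9)
        && sizeof ex10 == 3 * sizeof(wchar_t) && wide_is_95_5c(ex10)
        && sizeof ex13 == 2 * sizeof(wchar_t)
        && ex13[0] == 0x95 && ex13[1] == 0;
}
```

The examples containing @ are deliberately left out, because their meaning
depends on the source encoding of the implementation.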
Principles
==========
From consideration of various examples we can derive a set of basic
principles for string literals.
[P1] The sequences:
L"a" L"b"
L"a" "b"
"a" L"b"
are completely equivalent. The final type of a concatenated string literal
depends only on whether any of the components have an L prefix, and not on
which ones they are.
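As an illustration only, P1 can be expressed as a check on the resulting
arrays. This sketch assumes a C99 implementation (mixed concatenation of
character and wide string literals is not defined in C90); the array names
and the function p1_equivalent are hypothetical.

```c
#include <wchar.h>

/* The three spellings from P1. */
static const wchar_t both_wide[]  = L"a" L"b";
static const wchar_t first_wide[] = L"a" "b";
static const wchar_t last_wide[]  = "a" L"b";

/* Returns 1 if all three spellings yield the same wchar_t array as L"ab". */
static int p1_equivalent(void)
{
    return sizeof both_wide == sizeof first_wide
        && sizeof both_wide == sizeof last_wide
        && wcscmp(both_wide, first_wide) == 0
        && wcscmp(both_wide, last_wide) == 0
        && wcscmp(both_wide, L"ab") == 0;
}
```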
[P2] The sequences:
"abc"
"ab" "c"
"a" "bc"
"a" "b" "c"
are completely equivalent. The division of the string into literals does
not alter the final array. However, this applies only when the literals
consist of the same s-chars; the sequences:
"\x1234"
"\x12" "34"
are not equivalent because they involve different s-chars.
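A small sketch of P2 follows, assuming 8-bit char and an ASCII-compatible
execution character set (so that '3' is 0x33 and '4' is 0x34); the names are
hypothetical. The unsplit literal "\x1234" is omitted because its value does
not fit in a char on such an implementation.

```c
#include <string.h>

/* One string, two ways of dividing it into literals. */
static const char whole[]   = "abc";
static const char divided[] = "ab" "c";

/* The hexadecimal escape ends at the closing quote of its own literal. */
static const char hex_split[] = "\x12" "34";

/* Returns 1 if the division into literals does not alter the array. */
static int p2_division_is_invisible(void)
{
    return sizeof whole == sizeof divided
        && memcmp(whole, divided, sizeof whole) == 0;
}

/* Returns 1 if "\x12" "34" is the four bytes 0x12, 0x33, 0x34, 0x00
   (assumption: ASCII-compatible execution character set). */
static int p2_hex_split_bytes_ok(void)
{
    static const unsigned char expected[] = { 0x12, 0x33, 0x34, 0x00 };
    return sizeof hex_split == sizeof expected
        && memcmp(hex_split, expected, sizeof expected) == 0;
}
```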
[P3] Multibyte sequences are converted to single source characters during
TP1, and so each multibyte sequence is a single s-char.
[P4] The literal "@\" contains one s-char but the literal "\x95\\"
contains two. These are not equivalent, and the latter is not merged to
form a multibyte character later on.
[P5] The two string literals:
"abc"
L"abc"
should be related. More precisely, applying mbstowcs to the former should
produce the latter.
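The relationship in P5 can be sketched as a run-time check. Note the
assumption: this uses the locale in effect when the program runs (the "C"
locale at startup), whereas the Standard speaks of an implementation-defined
locale at translation time, so it is only an approximation for strings
outside the basic character set. The function name and the buffer size of 16
are hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Returns 1 if mbstowcs applied to narrow yields exactly wide.
   Only handles strings shorter than 16 wide characters. */
static int narrow_matches_wide(const char *narrow, const wchar_t *wide)
{
    wchar_t buf[16];
    size_t n = mbstowcs(buf, narrow, 16);

    /* (size_t)-1 signals an invalid multibyte sequence; n == 16 would
       mean the result was truncated and is not null-terminated. */
    if (n == (size_t)-1 || n >= 16)
        return 0;
    return wcscmp(buf, wide) == 0;
}
```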
[P6] When the final result will be a wchar_t array, each s-char in the
source generates exactly one element of the array.
[P7] When the final result will be a char array:
- a single byte source character generates exactly one byte
- an escape sequence generates exactly one byte
- a non-single byte multibyte source character generates one or more
bytes, and:
* mbstowcs applied to the sequence produces a single wide character;
* where it makes sense, the byte sequence in the array is the direct
analogue of the source multibyte character.
[P8] When the final result will be a wchar_t array, source shift sequences
are not separate s-chars and do not map to separate elements of the array.
[P9] When the final result will be a char array, source shift sequences
should appear in the array to the extent it makes sense (by analogy with
the last sub-bullet of P7).
New model
=========
Applying these principles to the processes in the Standard, we can
construct a new model.
The source character set contains the 95 required characters and the "new
line" indicator. It also contains one additional character for every
character that can be expressed by a valid multibyte character (making
allowance for shift states).
For example, suppose that a given encoding consists of:
- codes 1 to 96 are the required characters;
- codes 101 to 120 are always followed by a code from 1 to 100, and each
pair represents a character;
- codes 121 to 127 each represent one of four characters depending on the
choice of shift state;
- codes 97 to 100 select a shift state; this only affects codes 121 to 127.
Therefore the entire encoding contains 96 + 20*100 + 7*4 = 2124 characters,
and that is the size of the source character set.
Translation phase 1 converts all input to characters from this set. Thus
the sequence:
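   1  81  81  78  46 100 122 122 101  54
   A   ?   ?   /   t       $   $   `
is converted to the 6 source character sequence A\t$$` (the trigraph ??/
becomes a backslash, code 100 selects a shift state without producing a
character, and the pair 101 54 forms a single character).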
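The counting in this example can be sketched as a toy phase 1 scanner for
the hypothetical encoding described above. This is only an illustration
under stated assumptions: codes 81 and 78 are taken to be '?' and '/' as in
the annotated sequence, only the trigraph ??/ is handled, and the names
count_source_chars and example_codes are invented here.

```c
#include <stddef.h>

/* Assumption taken from the annotated example: 81 is '?', 78 is '/'. */
enum { CODE_QUESTION = 81, CODE_SLASH = 78 };

/* The ten codes of the example sequence. */
static const int example_codes[10] =
    { 1, 81, 81, 78, 46, 100, 122, 122, 101, 54 };

/* Counts the source characters that phase 1 would produce from n codes. */
static size_t count_source_chars(const int *codes, size_t n)
{
    size_t i = 0, count = 0;

    while (i < n) {
        int c = codes[i];

        if (c >= 97 && c <= 100) {
            /* Shift selector: changes state, produces no character. */
            i += 1;
        } else if (c >= 101 && c <= 120) {
            /* Lead code plus the following code: one character. */
            i += 2;
            count += 1;
        } else if (c == CODE_QUESTION && i + 2 < n
                   && codes[i + 1] == CODE_QUESTION
                   && codes[i + 2] == CODE_SLASH) {
            /* Trigraph ??/ is replaced by a single backslash. */
            i += 3;
            count += 1;
        } else {
            /* Codes 1 to 96, and shift-dependent codes 121 to 127. */
            i += 1;
            count += 1;
        }
    }
    return count;
}
```

Applied to example_codes, the scanner reports 6 source characters, matching
the text. A fuller version would also record the active shift state with
each character, which is the kind of annotation discussed next.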
If a source character can be generated in more than one way (e.g. through
the use of alternative shift sequences), an implementation is free to
annotate the character with this information. This annotation is used
later.
Within string literals, these sequences are parsed into s-chars during
TP3; in this case there are 5 such s-chars. Other source code also works
in terms of these source characters.
TP4 stringisation and token pasting work in terms of these source
characters.
The execution character set needs essentially the same set of characters
as the source had. At TP5 each s-char in a string literal is converted to
the corresponding execution character set character. At this point the
distinction between multibyte characters, UCNs, and other escape sequences
is lost (so \t, \x9 (or whatever), and an actual source tab all produce
the same character). At TP6 the sequences of characters are simply
concatenated without further change.
At TP7 each character in the execution character set generates either:
- a single wide character (in a wide string literal), or
- a sequence of one or more characters (in a character string literal).
In the latter case, if the corresponding s-char came from a multibyte
character the sequence should match it if possible. The annotation
mechanism described above is one way to do this.
Proposed changes
================
The following changes to the Standard are required to put this model into
effect.
Firstly we specify this model in some detail:
5.2.1.3 Character encoding model
[#1] Translation phase 1 establishes the boundaries between multibyte
characters in the source. These are converted into /source character
encoding units/ that encode a single member of the source character
set (any shift sequences are merged with an adjacent unit). Source
character encoding units are never split or merged in subsequent
translation phases.
[#2] In translation phase 3, each source character encoding unit that
is not a member of the basic character set will become:
- an identifier-nondigit within an identifier or pp-number
- an h-char or q-char in a header-name
- a c-char within a character-constant
- an s-char within a string-literal, or
- a preprocessing-token on its own.
[#3] In translation phase 5, each c-char or s-char is converted to a
single /execution character encoding unit/ (ECEU). Each character
constant and string literal therefore becomes a sequence of ECEUs.
Note that there may be several representations of the same ECEU:
- a source character encoding unit, possibly derived from a multibyte
sequence,
- a universal character name,
- a special escape sequence such as \t, or
- an octal or hexadecimal escape sequence.
[#4] In translation phase 6, string literals are concatenated by
concatenating the ECEU sequences into a single sequence; the
total number of ECEUs involved is unchanged.
[#5] In translation phase 7, a string literal is converted to an array
of values by first appending an ECEU, representing the null
character, to the ECEU sequence. If it is a character string literal,
each ECEU then generates one or more elements of the char array; the
precise elements generated may depend on the source character encoding
unit that the ECEU derives from. If it is a wide string literal,
each ECEU generates one element of the wchar_t array.
[#6] Two character string literals or two wide string literals derived
from the same sequence of source character encoding units shall
generate identical arrays. A character string literal and a wide
string literal derived from the same sequence shall generate arrays
that correspond, as defined by the mbstowcs function with an
implementation-defined current locale.
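The following is a minimal sketch of how paragraphs #4 and #5 of the proposed
model might be realised for wide string literals. It is an illustration, not
proposed Standard text: the type eceu and both functions are hypothetical,
and an ECEU is represented simply by its wide-character value.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical representation of an execution character encoding unit. */
typedef struct {
    unsigned long value;   /* wide-character value of the unit */
} eceu;

/* Phase 6: concatenate two ECEU sequences into out, which must have room
   for na + nb units. The total number of ECEUs is unchanged. */
static size_t eceu_concat(eceu *out, const eceu *a, size_t na,
                          const eceu *b, size_t nb)
{
    size_t i;

    for (i = 0; i < na; i++)
        out[i] = a[i];
    for (i = 0; i < nb; i++)
        out[na + i] = b[i];
    return na + nb;
}

/* Phase 7, wide string literal: one wchar_t element per ECEU, followed by
   the null wide character. out must have room for n + 1 elements.
   Returns the array length. */
static size_t eceu_to_wide(wchar_t *out, const eceu *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = (wchar_t)s[i].value;
    out[n] = L'\0';
    return n + 1;
}
```

With L"\x95" and L"\\" as inputs (one ECEU each), concatenation yields two
ECEUs and phase 7 yields a three-element array, so the two units are never
merged into the single wide character 0x955C.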
Next we need to make the explanation of string concatenation in 6.4.5#4
use this new model. This completely replaces the old text:
[#4] In translation phase 6, the contents of adjacent
character and wide string literal tokens are concatenated into
a single token as described in 5.2.1.3. If any of the tokens
are wide string literal tokens, the resulting token is
a wide string literal; otherwise, it is a character string
literal.
Finally we need to make the explanation of string literals in 6.4.5#5
also use this new model. Again, this completely replaces the old text.
[#5] In translation phase 7, a code of value zero (representing the
null character) is appended to each string literal. The contents of
the literal are then used to initialize an array of static storage
duration and length just sufficient to contain the sequence. For
character string literals, the array elements have type char; for
wide string literals, the array elements have type wchar_t. The
array is initialized as described in 5.2.1.3.
[No text replaces the last sentence of the current #5, as it duplicates
a requirement in TP5.]
If it is preferred that 6.4.5#5 not contain the reference to 5.2.1.3,
an alternative wording for that paragraph would be:
[#5] In translation phase 7, a code of value zero (representing the
null character) is appended to each string literal. The contents of
the literal are then used to initialize an array of static storage
duration and length just sufficient to contain the sequence. For
character string literals, the array elements have type char; each
ECEU in the string literal (taken in order) determines the value of
one or more elements (the precise values may depend on the source
character encoding unit(s) that the ECEU derives from). For wide string
literals, the array elements have type wchar_t; each ECEU in the
string literal determines the initial value for the corresponding
element.
If so, 5.2.1.3#5 should be deleted and "in translation phase 7" should be
added to the end of the first sentence of 5.2.1.3#6.