Defect Report #037
Submission Date: 10 Dec 92
Submittor: WG14
Source: X3J11/91-043 (Isai Scheinberg)
Question 1
Subclause 5.2.1.2 Multibyte characters states:
The source character set may contain multibyte characters, used to represent
members of the extended character set. The execution character set may
also contain multibyte characters, which need not have the same encoding
as for the source character set. For both character sets, the following
shall hold:
- The single-byte characters defined in 5.2.1 shall be present.
and, a bit later on:
- A byte with all bits zero shall not occur in the second or subsequent
bytes of a multibyte character.
My interpretation of the first rule (shared by all of the experts I
consulted) is that the basic character set (A-z, 0-9, etc.) shall be coded
in one-byte code. All multibyte locales that I know of (EUC variants, SJIS)
follow this rule. But I may still be wrong.
If the above is true, then both 10646 (other than CM 5) and UNICODE fail
this rule and cannot be used as multibyte characters. UNICODE also fails
the second rule.
Response
The following answers apply (almost) equally to ISO 10646-1 and UNICODE.
They are expressed in terms of ISO 10646-1.
Clause 3, page 2, lines 18-24 and 40-42 define ``byte,'' ``character,''
and ``multibyte character'' as follows:
byte: The unit of data storage large enough to hold any member of
the basic character set of the execution environment.
character: A bit representation that fits in a byte. The representation
of each member of the basic character set in both the source and execution
environments shall fit in a byte.
multibyte character: A sequence of one or more bytes representing
a member of the extended character set of either the source or the execution
environment. The extended character set is a superset of the basic character
set.
Therefore, if ISO 10646-1 were used as a basic character set, then by definition
a byte would have to be large enough to hold each member of the ISO 10646-1
character set. Also by definition this would make ISO 10646-1 a valid multibyte
character set.
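As a minimal sketch of this reasoning (an illustration only, not text from the Standard), the check below compares the width of a coded character against CHAR_BIT from <limits.h>. The helper name fits_in_one_byte is hypothetical.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT: number of bits in a byte */

/* Hypothetical helper: returns 1 if a coded character that is
   code_bits wide fits in a single byte on this implementation,
   and 0 otherwise. */
int fits_in_one_byte(int code_bits)
{
    assert(code_bits > 0);
    return code_bits <= CHAR_BIT;
}
```

On an implementation where CHAR_BIT is 8, fits_in_one_byte(16) and fits_in_one_byte(32) both return 0. They would return 1 only where a byte is at least that wide, which is the case the paragraph above describes.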
If a byte were only eight bits long, the following answer would hold. ISO
10646-1 represents, in a particular byte order, the character 'a'
for example as follows.
  0   0   0  97
          -----  16-bit version
  -------------  32-bit version
This fails subclause 5.2.1.2, page 11, lines 30-32:
- A byte with all bits zero shall be interpreted as a null character independent
of shift state.
- A byte with all bits zero shall not occur in the second or subsequent
bytes of a multibyte character.
Therefore, 8-bit bytes preclude the use of ISO 10646-1 as a multibyte character
set.
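The conflict can be sketched in C, assuming 8-bit bytes held in unsigned char. The helper name conflicts_with_zero_byte_rules is hypothetical. It treats any all-zero byte inside a sequence of two or more bytes as a conflict: in the first position the byte would be read as a null character, and in a later position it is forbidden outright.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n-byte sequence mb, taken as
   one multibyte character, conflicts with the two zero-byte rules of
   subclause 5.2.1.2 quoted above; returns 0 otherwise. */
int conflicts_with_zero_byte_rules(const unsigned char *mb, size_t n)
{
    size_t i;

    assert(mb != NULL);
    if (n < 2)
        return 0;   /* a lone byte, even zero, is an ordinary character */
    for (i = 0; i < n; i++) {
        if (mb[i] == 0)
            return 1;   /* zero byte inside a longer sequence */
    }
    return 0;
}
```

For the representations of 'a' given in this response, the sequences {0, 97} and {0, 0, 0, 97} both yield 1, while the single byte {97} yields 0.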