JTC1/SC22/WG14 Document Nxxx
Title: Wide character code values for members of the
basic character set.
Author: Raymond Mak
Author Affiliation: IBM Corp.
E-mail Address: rmak@ca.ibm.com
Abstract: When using the C language to process
Unicode, it is natural to bind wchar_t to UCS-2 or UCS-4. However, this
causes a problem for EBCDIC-based systems, as Standard C imposes a restriction
on the wide character code values. Specifically, the standard requires
('x' == L'x') to hold true if x is a member of the basic character set.
This document explains the problem and suggests an amendment to the standard
to provide leeway for EBCDIC systems.
Introduction:
C99 7.17 paragraph 2 specifies in part:
"...
wchar_t
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales; the null character shall have the code value zero and each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant."
At issue here is the last part of the above sentence.
Since the code values of the basic characters in UCS-2 and UCS-4 are based on ASCII, EBCDIC systems cannot conform to the above sub clause if the encoding of wchar_t is UCS-2 or UCS-4. This makes it difficult for EBCDIC systems to use Unicode with the C language.
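The conflict can be made concrete with a small sketch. The enumerators and helper functions below are hypothetical names introduced for illustration only. The value 0x81 is the code of the letter 'a' in common EBCDIC code pages (such as 037 and 1047), and 0x61 is its code in UCS-2 and UCS-4.

```c
#include <assert.h>
#include <stddef.h>   /* wchar_t */

/* Illustrative constants, not part of any standard header:
   the letter 'a' in common EBCDIC code pages and in UCS-2/UCS-4. */
enum { EBCDIC_SMALL_A = 0x81, UCS_SMALL_A = 0x61 };

/* Hypothetical helper: returns 1 when a sample of basic characters has
   the same code value as a narrow and as a wide character constant,
   which is what C99 7.17 paragraph 2 requires. */
int sample_basic_chars_match(void)
{
    return 'a' == L'a' && 'Z' == L'Z' && '0' == L'0' && ' ' == L' ';
}

/* Hypothetical helper: aborts if the sample violates the requirement. */
void check_requirement(void)
{
    assert(sample_basic_chars_match());
}
```

On an ASCII-based implementation sample_basic_chars_match() returns 1. On an EBCDIC implementation whose wchar_t holds UCS values it would return 0, because 'a' would be 0x81 while L'a' would be 0x61.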
There is no programming situation that really requires this restriction. In fact, it can be argued that a program would naturally know the type of characters (wide or normal) it is processing; the appropriate character literal can always be used.
Note that a program can use the functions btowc and
wctob (7.24.6.1 and .2) to handle mixed processing of wide and normal characters.
The restriction in sub clause 7.17 therefore offers little additional value.
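The mixed processing mentioned above can be sketched as follows, assuming only the C99 functions btowc and wctob from <wchar.h>. The helper names same_character and narrow_value are hypothetical and introduced here for illustration. Neither helper relies on ('x' == L'x').

```c
#include <assert.h>
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */
#include <wchar.h>   /* btowc, wctob, wint_t, WEOF */

/* Hypothetical helper: does the narrow character c denote the same
   character as the wide character wc? The narrow character is widened
   with btowc, so no relationship between code values is assumed. */
int same_character(char c, wchar_t wc)
{
    wint_t widened = btowc((unsigned char)c);
    return widened != WEOF && widened == (wint_t)wc;
}

/* Hypothetical helper: single-byte value of wc, or EOF when wc has no
   single-byte representation in the current locale. */
int narrow_value(wchar_t wc)
{
    int c = wctob((wint_t)wc);
    /* wctob yields either EOF or an unsigned char value. */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    return c;
}
```

A call such as same_character('a', L'a') is expected to hold regardless of whether the narrow encoding is ASCII or EBCDIC, since the conversion is delegated to the library.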
Suggested Change to the Standard:
Change the last part of 7.17 paragraph 2 as follows:
"...
wchar_t
which is an integer type whose range of values can
represent distinct codes for all members of the largest extended character
set specified among the supported locales; the null character shall have
the code value zero; each member of the basic character set shall have
a code value equal to its value when used as the lone character in an integer
character constant if an implementation does not define __STDC_NARROW_WCHAR__."
The proposed change would allow an implementation
to deviate from the last part of 7.17 paragraph 2 if the macro __STDC_NARROW_WCHAR__
is defined. This would not affect ASCII-based systems, but would provide
leeway for EBCDIC systems to process Unicode using C.
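As a sketch of how a program might use the proposed macro: __STDC_NARROW_WCHAR__ exists only in this proposal, and widen_basic is a hypothetical helper introduced for illustration. When the macro is defined, the program converts through btowc; otherwise it relies on the existing guarantee in 7.17 paragraph 2.

```c
#include <assert.h>
#include <wchar.h>   /* btowc, wint_t, WEOF */

/* Hypothetical helper: widen a member of the basic character set. */
wint_t widen_basic(char c)
{
#ifdef __STDC_NARROW_WCHAR__
    /* Proposed macro is defined: narrow and wide code values may
       differ, so the conversion goes through the library. */
    wint_t wc = btowc((unsigned char)c);
#else
    /* Macro is not defined: the 7.17 guarantee holds, so a basic
       character has the same value as a wide character. */
    wint_t wc = (wint_t)(unsigned char)c;
#endif
    assert(wc != WEOF);
    return wc;
}
```

Programs that always use the appropriate literal (for example L'x' when handling wide characters) would not need such a helper at all.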
=====================================================================END