Last update: 1997-05-20
9945-2-40
_____________________________________________________________________________
Topic: I18N issues
Relevant Sections: 2.5.2.2
Classification: ambiguous
Defect Report:
-----------------------
(from Andrew Hume and Doug McIlroy)
I18N Issues
Issue A
POSIX has defined a mechanism for treating multi-character
sequences as a single unit, namely collating elements
(CEs). Although CEs are motivated by sorting issues, they
also appear in REs. This leads to the question of how text
is to be parsed into CEs. There are many possible answers,
and furthermore, the parsing might be affected by context.
For example, given the usual alphabet augmented by the
collating element <ij> defined as <i><j>, can the string ij
ever be parsed as two collating elements?
________________________________________
[1] 2.5.2.2 says in the context of sorting, ``strings are
first broken up into a series of collating elements''
(line 1668). Does this apply to pattern matching? And
if so, how exactly is this done (for sorting or pattern
matching)?
Proposed Solution:
Add the following text somewhere; this text should be
referred to by line 1668 and by the general RE introduction
(2.8.2).
``When a string is interpreted as a sequence of CEs, the
sequence shall be as found by the following process:
starting at the first character of the string, determine
the longest prefix of the string that matches a CE, add
that CE to the sequence and continue this process with the
character after that prefix until the string is
exhausted.''
Note that this applies even if a sort key indicates that a
piece of the text is processed in backwards (right-to-left)
order; that is, the right-to-left processing applies to the
CEs found by a left-to-right lexical scan.
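The proposed process can be sketched in a few lines of Python. This is an illustration only, not text from the standard: the function name split_collating_elements and the parameter multi_char_ces (the set of multi-character CEs defined by the locale) are hypothetical, and the sketch assumes that every single character is itself a CE.

```python
def split_collating_elements(text, multi_char_ces):
    """Split text into collating elements by a greedy left-to-right scan.

    multi_char_ces is a hypothetical set of the multi-character CEs
    defined by the locale, e.g. {"ij"}. Every single character is
    assumed to be a CE in its own right.
    """
    longest = max(1, max((len(ce) for ce in multi_char_ces), default=1))
    result = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate prefix first, then shorter ones.
        for width in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + width]
            if width == 1 or candidate in multi_char_ces:
                result.append(candidate)
                pos += width
                break
    return result


# Left-to-right scan: "ij" is never split once <ij> is a CE.
forward = split_collating_elements("aij", {"ij"})

# Backwards processing reorders the CEs already found; it does not
# rescan the characters from the right.
backward = list(reversed(forward))
```

Under these assumptions, forward is the sequence a, ij and backward is ij, a; the characters i and j never appear as separate elements in either direction.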
Rationale:
This is the greedy algorithm normally used in lexical
analysis. Any other choice would require backtracking with
potentially exponential runtime. It implies that, when
<i><j> is a collating element, under no circumstances can a
bracket expression match the i alone in the string ij. In
particular, neither [[.i.][.ij.]]j nor [[.i.]]j matches
ij. By contrast, i[[.j.]] does match ij, because in this
regular expression i denotes a character and is unaffected
by concerns about collating elements.
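The three cases in the rationale can be checked against a toy model, sketched here in Python. The model is one possible reading of the proposal, not an implementation of POSIX REs: a pattern is a list whose items are either a one-character string (an ordinary character) or a set of CEs (a bracket expression), and a bracket expression is assumed to consume the longest CE that starts at the current position. The names longest_ce_at and full_match are hypothetical.

```python
def longest_ce_at(text, pos, multi_char_ces):
    """Return the longest CE starting at text[pos] (a single character
    is assumed to be a CE whenever no multi-character CE applies)."""
    best = text[pos]
    for ce in multi_char_ces:
        if len(ce) > len(best) and text.startswith(ce, pos):
            best = ce
    return best


def full_match(pattern, text, multi_char_ces):
    """Toy matcher: str items match one character, set items match
    one greedily determined collating element."""
    pos = 0
    for item in pattern:
        if pos >= len(text):
            return False
        if isinstance(item, str):
            # Ordinary character: compared as a character, not as a CE.
            if text[pos] != item:
                return False
            pos += 1
        else:
            # Bracket expression: must contain the longest CE found here.
            ce = longest_ce_at(text, pos, multi_char_ces)
            if ce not in item:
                return False
            pos += len(ce)
    return pos == len(text)


ces = {"ij"}
case_1 = full_match([{"i", "ij"}, "j"], "ij", ces)  # [[.i.][.ij.]]j
case_2 = full_match([{"i"}, "j"], "ij", ces)        # [[.i.]]j
case_3 = full_match(["i", {"j"}], "ij", ces)        # i[[.j.]]
```

In this model case_1 and case_2 are False and case_3 is True, which agrees with the rationale: the bracket expression at the start of the string always sees the CE ij, while the ordinary character i consumes only the character i.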
WG15 response for 9945-2:1993
-----------------------------------
The standard is unclear on this issue, and no conformance
distinction can be made between alternative implementations
based on this. This is being referred to the sponsor.
Rationale for Interpretation:
-----------------------------
None.
_____________________________________________________________________________