.
Last update: 1997-05-20
9945-2-42
_____________________________________________________________________________
Topic: Regular Expression Issues
Relevant Sections: 2.8.3
Defect Report:
-----------------------
(from Andrew Hume Doug McIlroy)
RE Issues
Issue C
The scheme used for defining REs and their semantics is
based on the historical notion that primitive REs match a
single character. This was broken with the introduction of
CEs, as they may be multiple characters in length. These
were added through bracket expressions so as to cause as
little impact to the rest of the specification.
This had two side-effects. The first is interactions
with the rest of the semantics. The second is that the his-
torical and careful difference in semantics between BREs and
EREs, which generally allowed BREs to be implemented with
polynomial-time algorithms, was erased. Both BREs and EREs
now in principle require algorithms with runtimes that are
exponential in the size of the pattern. For example, on
line 2860 (2.8.3.2), a bracket expression is defined to
match a character or CE. Unfortunately, if a duplicated
bracket expression contains multibyte CEs, then (exponential
runtime) backtracking could be necessary to find the longest
match.
The main problem with 2.8.3.1 is that it is unclear how
to interpret matching a character or CE against a string.
How is the string interpreted? As a sequence of characters,
a sequence of CEs, or as a mixture?
________________________________________
[9] On line 2860 (2.8.3.2), a bracket expression is defined
to match a single CE. Yet, line 2890 allows matching
any ``character or CE''.
Proposed Solution:
Change the phrase ``character or CE'' to ``CE''.
Rationale:
The wording suggests a nonexistent distinction.
________________________________________
[10] On line 2860 (2.8.3.2), a bracket expression is
defined to match a single CE. Yet, on lines 2949-2952
and lines 2957-2969, frequent reference is made to
``characters''.
Proposed Solution:
Change lines 2949-2952 and lines 2957-2969 to refer
only to CEs, and not to characters.
Rationale:
Same as previous rationale.
________________________________________
[11] Within 2.8.3.2, the term ``expression'' is used often
without a qualifier, and in these cases its meaning is
unclear. Examples include lines 2886 and 2891.
Proposed Solution:
The apparent intended meaning, namely CEs, should be
used.
WG15 response for 9945-2:1993
-----------------------------------
Part 9
Since the standard is unclear as to how strings are broken up into
a series of collating elements (see interpretation #40) it is also
unclear how to interpret matching a character or collating element
against a string, and as such no conformance distinction can be made
between alternative implementations based on this. This is being
referred to the sponsor.
Part10
The standard is unclear on this issue, and no conformance
distinction can be made between alternative implementations
based on this. This is being referred to the sponsor.
Concerns about the wording of this part of the standard have
been forwarded to the sponsor.
Part 11
The reference to "expressions" on page 80, line 2886 refers to the definition
on page 79, lines 2864-2866, and conforming implementations must conform to
this.
Rationale for Interpretation:
-----------------------------
None.
Proposed resolution forwarded: Aug 11 1995
Finalized: Sept 12 1995