.
Last update: 1997-05-20
9945-2-43.1
_____________________________________________________________________________
Topic: Regular Expressions - imprecise specification
Relevant Sections: 2.8
Classification: Q12: Ambiguous.
Q13: Defect.
Defect Report:
-----------------------
(from Andrew Hume Doug McIlroy)
Issue D
These questions address areas of imprecision in the
specifications of REs. With regard to question [13], 9945-2
uses various terms roughly synonymously (leftmost, first,
earliest) (subpattern, subexpression) for describing a
left-to-right order across a line of text. It would be well
if the text (and rationale) used one term consistently.
________________________________________
[12] Can a duplicated subexpression match the null string?
If so, will the duplication be repeated until the
expression does match the null string?
Proposed Solution:
A subexpression that can match a null string shall not
be duplicated.
Rationale:
Although adjacent duplication symbols are illegal (for
no apparent reason, particularly for EREs), \(x*\)* is a
legal expression, in which the *s are not adjacent, that
raises the question: how many times is the null string
matched?
Suppose that \(x*\)* were allowed. Does matching it
to xxx yield \1 = xxx or \1 = null? The latter alterna-
tive is consistent with the rule that a null string is
longer than no match. By extending xxx with a null string
instead of nothing, one gets a longer match. More null
strings make even longer matches.
To avert metaphysical questions about the last element
of an infinite sequence, or an element following an infinite
sequence, one could forbid null iterations. But this has an
unsatisfying corollary that \(x*\)* wouldn't match the null
string. Compromise positions might forbid null iterations
after the first iteration or after the first null iteration.
Such special pleading and the resultant implementation com-
plexity is not worth the return.
Lord and McIlroy have implemented a no-null-unless-
first policy; Spencer has implemented repeat-until-null.
Spencer's interpretation seems perhaps less strained, until
you realize that it makes references ( \1, \2, etc) to con-
tents of duplications useless because they will always be
null or undefined.
The proposed solution also outlaws EREs like (a?|b)*,
(a?)\{2,2\}, and . ([a-z]*(|^))* The set of regular lan-
guages is unchanged, however. There would still be legal
equivalents to these and all other outlawed expressions.
An alternative, to leave the meaning undefined, is
unacceptable. Users will be unaware of the exact bounds of
portability, and their biggest intellectual investments -
the hardest expressions and sed scripts - are liable to be
unportable.
________________________________________
[13] What is the meaning of the BRE \(\)? [2.8.5.2]
Proposed Solution:
Forbid this expression.
It would be also acceptable, but rather less so, to
define that the pattern \(\) matches the null string.
Rationale:
The pattern has no analog in EREs, no utility, and to
our knowledge no basis in history. It should thus be
banned. Otherwise, at least define what it means.
WG15 response for 9945-2:1993
-----------------------------------
Q12
The standard does not require the successful
matching of a null string after a successful match on a non null
string for duplication symbols. However, the
standard does not clearly say that you are not allowed to match
null strings after a successful match. Therefore the standard
in ambiguous.
The definition of a back reference expression in section 2.8.3.3
does not specify which match of a subexpression followed by a
duplication symbol is to be returned. We note however that the
definition regcomp() defined in section B5.1 pg 728 344-346
indicate that the back reference expression will match the last
string matched by the sub expression.
The standard is unclear on this issue, and no conformance
distinction can be made between alternative implementations
based on this. This is being referred to the sponsor.
Q13
Page 89 line 3263 requires this form to be accepted. But, the
text 2.8.3 does not describe its meaning.
Concerns about the wording of this part of the standard have
been forwarded to the sponsor.
Rationale for Interpretation:
-----------------------------
None
_____________________________________________________________________________