.
Last update: 1997-05-20
9945-2-43
_____________________________________________________________________________
Topic: Regular Expressions - imprecise specification
Relevant Sections: 2.8
Defect Report:
-----------------------
(from Andrew Hume Doug McIlroy)
Issue D
These questions address areas of imprecision in the
specifications of REs. With regard to question [13], 9945-2
uses various terms roughly synonymously (leftmost, first,
earliest) (subpattern, subexpression) for describing a
left-to-right order across a line of text. It would be well
if the text (and rationale) used one term consistently.
________________________________________
[12] Can a duplicated subexpression match the null string?
If so, will the duplication be repeated until the
expression does match the null string?
Proposed Solution:
A subexpression that can match a null string shall not
be duplicated.
Rationale:
Although adjacent duplication symbols are illegal (for
no apparent reason, particularly for EREs), \(x*\)* is a
legal expression, in which the *s are not adjacent, that
raises the question: how many times is the null string
matched?
Suppose that \(x*\)* were allowed. Does matching it
to xxx yield \1 = xxx or \1 = null? The latter alterna-
tive is consistent with the rule that a null string is
longer than no match. By extending xxx with a null string
instead of nothing, one gets a longer match. More null
strings make even longer matches.
To avert metaphysical questions about the last element
of an infinite sequence, or an element following an infinite
sequence, one could forbid null iterations. But this has an
unsatisfying corollary that \(x*\)* wouldn't match the null
string. Compromise positions might forbid null iterations
after the first iteration or after the first null iteration.
Such special pleading and the resultant implementation com-
plexity is not worth the return.
Lord and McIlroy have implemented a no-null-unless-
first policy; Spencer has implemented repeat-until-null.
Spencer's interpretation seems perhaps less strained, until
you realize that it makes references ( \1, \2, etc) to con-
tents of duplications useless because they will always be
null or undefined.
The proposed solution also outlaws EREs like (a?|b)*,
(a?)\{2,2\}, and . ([a-z]*(|^))* The set of regular lan-
guages is unchanged, however. There would still be legal
equivalents to these and all other outlawed expressions.
An alternative, to leave the meaning undefined, is
unacceptable. Users will be unaware of the exact bounds of
portability, and their biggest intellectual investments -
the hardest expressions and sed scripts - are liable to be
unportable.
________________________________________
[13] What is the meaning of the BRE \(\)? [2.8.5.2]
Proposed Solution:
Forbid this expression.
It would be also acceptable, but rather less so, to
define that the pattern \(\) matches the null string.
Rationale:
The pattern has no analog in EREs, no utility, and to
our knowledge no basis in history. It should thus be
banned. Otherwise, at least define what it means.
________________________________________
[14] On line 2792, what is left-to-right order in a match?
Proposed Solution:
Insert the following text at the end of line 2792:
A nested subpattern is to the left of the subpat-
tern that contains it. An earlier iteration of a
duplicated subpattern is to the left of a later
iteration. Neither side of an alternation is to
the left of the other. Length ties in an alterna-
tion are broken in favor of the left side.
Rationale:
Consider the pattern \(\(...\)*\(.....\)*.\).* matched
against the string xxxxxx. Is the leftmost subpattern the
first complete subpattern, namely ..., or the pattern that
lexically starts first, namely \1? Call these two cases
``first-finished'' and ``first-begun''. The proposed solu-
tion adopts first-finished.
A first-finished match yields \1 = xxxx, \2 = xxx,
and \3 unmatched. A first-begun match yields \1 =
xxxxxx, \2 unmatched, and \3 = xxxxx. Lord and Spencer
both do first-finished matching, although Spencer argues for
first-begun. In many cases, including the present example,
first-finished matching can be done with less backtracking.
McIlroy implemented both and found first-begun much more
awkward than first-finished. Historical practice favors
first-finished.
The left-associativity of concatenation (line 3256) and
the transparency of parentheses (line 2977) together imply
that \(...\)*\(.....\)*.* and \(\(...\)*\(.....\)*\).*
should match the same way. Under first-begun matching to
xxxxx, in the former case subpattern \(...\)* matches xxx;
in the latter it matches the null string. Thus first-begun
matching entails a contradiction.
________________________________________
[15] To which match is a backreference to a duplicated
subexpression bound? [2.8.3.3(4)]
Proposed Solution:
A backreference to a subexpression contained in a
duplication is bound to the (possibly nonexistent) match to
the subexpression in the most recent iteration of the dupli-
cation. Thus \(\(.\)\2\)* matches xxyyzz, with \2 refer-
ring to x, y, and z respectively in the three iterations
of the outer subexpression. However, \(\(.\)*\2#\)*
matches only the first six characters of xx#yy##, because
in a third iteration of the outer subexpression, . would
match nothing (as distinct from matching a null string) and
hence \2 would match nothing.
Rationale:
Current implementations agree on this interpretation,
which is a natural generalization of binding in regmatch
structures by regexec(). [B.5.2]
WG15 response for 9945-2:1993
-----------------------------------
Q12
The standard does not require the successful
matching of a null string after a successful match on a non null
string for duplication symbols. However, the
standard does not clearly say that you are not allowed to match
null strings after a successful match. Therefore the standard
in ambiguous.
The definition of a back reference expression in section 2.8.3.3
does not specify which match of a subexpression followed by a
duplication symbol is to be returned. We note however that the
definition regcomp() defined in section B5.1 pg 728 344-346
indicate that the back reference expression will match the last
string matched by the sub expression.
The standard is unclear on this issue, and no conformance
distinction can be made between alternative implementations
based on this. This is being referred to the sponsor.
Q13
Pg 89 line 3263 requires this form to be accepted. But, the
text 2.8.3 does not describe its meaning.
Concerns about the wording of this part of the standard have
been forwarded to the sponsor.
This section New to this revision
----------------------------------
Part 14:
The text on page 82, lines 2975-2979, and page 77, lines 2791-2796 are in
conflict and as such the standard is unclear on this issue, and no
conformance distinction can be made between alternative implementations
based on this. This is being referred to the sponsor.
Part 15:
The standard does not speak to this issue, and as such no conformance
distinction can be made between alternative implementations based on this.
This is being referred to the sponsor.
Rationale:
None.
Proposed resolution forwarded: Aug 11 1995
Finalized: Sept 12 1995