WG 14 Document: N1602
Date: 2012/02/10
Constant Problems
Version: 1.0.1
❝ This International Standard specifies the form and establishes the interpretation of programs expressed in the programming language C. Its purpose is to promote portability, reliability, maintainability, and efficient execution of C language programs on a variety of computing systems. ❞
abstract of ISO/IEC 9899:201x
❝ Below is a recapitulation of the grammar that was given throughout the earlier part of this appendix. It has exactly the same content, but is in different order.
The grammar has undefined terminal symbols integer-constant, character-constant, floating-constant, identifier, string, and enumeration-constant; the typewriter style words and symbols are terminals given literally. This grammar can be transformed mechanically into input acceptable for an automatic parser-generator. Besides adding whatever syntactic marking is used to indicate alternatives in productions, it is necessary to expand the one of constructions, and (depending on the rules of the parser-generator) to duplicate each production with an opt symbol, once with the symbol and once without. With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator. It has only one conflict, generated by the if-else ambiguity. ❞
The C Programming Language (second edition), Appendix A.13 Grammar
Abstract
This report identifies syntax defects in Section 6.4.4 Constants of the ISO/IEC 9899 International Standard. What it does not claim is that there are errors in the descriptions when interpreted for standalone lexers (or the regular language portions of some parsers) or that it is impossible to parse the section as specified.
Contents
- Foreword
- Interpreting Section 6.4.4 as BNF
- Redundancy Elimination
- Integer Constant Tested
- Floating Constant Tested
- Solution ‘A’: Floating Constant Adjusted
- Adjusted Floating Constant Tested
- Solution ‘B’: Integer Constant Adjusted
- Adjusted Integer Constant Tested
- Shift-Reduce Conflicts in character-constant
- Conflict-Free Constants
- Concluding Remarks
Foreword
The discovery of the syntax problem in Section 6.4.4 comes about as a result of the methodology used to interpret the standard. A significant part of which is to assign the portions normally done by a lexer to the parser function—doing so in the belief that it is possible by this method (though not convenient) to attain the portion of Dennis Ritchie’s statement quoted above:
- “… this grammar is acceptable to the YACC parser-generator. It has only one conflict …”.
Also,
- “The grammar has undefined terminal symbols integer-constant, character-constant, floating-constant, …”.
is taken to be a reasonable justification of the parser-only approach because it does not mention specifically that a lexer utility must provide the terminal symbols (although it is highly likely one was used and that the grammar inherited from ANSI X3.159-1989 is the transcription of regular expressions to a formal definition).
For the purposes of this report, enumeration-constant is elided by a parser token and character-constant, because it is not a factor in the egregious syntax conflicts, is dealt with after they are addressed.
Grammar files and tests referred to later in this report are located in subfolders of this document’s enclosing folder. The tests have been run with Bison 2.5 and in cases where conflicts are reported, the facility for LR(1) parser generation has been used so as to avoid the possibility that a less powerful LALR(1) parser would report unwarranted conflicts.
In all cases, to run the tests, you should cd to the appropriate subfolder and run the ".sh" script from there.
Interpreting Section 6.4.4 as BNF
The grammar for section 6.4.4 in the standard has been transliterated to the BNF style suitable for utilities like yacc or bison. The folder 1_Constant_RR120 contains the "constant.y" file. The shell script will attempt to produce the "constant.c" file with bison and compile it with gcc. Note that bison reports 120 reduce-reduce conflicts.
The grammar file has a code implementation. Run the binary with the command:
./constant test.txt;
As you can see, the executable chokes on both octals and integers with more than one digit. Floating constants appear to be parsed as expected. At this point, enumeration-constant and character-constant are removed from the grammar so as to show the reduce-reduce problems emanate from integer-constant and floating-constant.
Redundancy Elimination
One of the conveniences a lexer or an EBNF-style parser (i.e. a parser with an embedded regular language facility) offers is to take care of rules with overlapping content without complaining. But a yacc-style BNF implementation doesn’t necessarily permit the same amount of overlap. In the folder 2_Constant_RR18 an overlap between hexadecimal-constant and hexadecimal-digit-sequence has been taken care of in the "constant.y" file:
// hexadecimal-constant // 6.4.4.1 // : hexadecimal-prefix hexadecimal-digit // | hexadecimal-constant hexadecimal-digit // ; hexadecimal-constant // 6.4.4.1 : hexadecimal-prefix hexadecimal-digit-sequence ;
This single change will lower the number of conflicts from 120 to 98. A change the standard could implement to lower the number conflicts without affecting anything else.
In addition, another overlap between digit and nonzero-digit is dealt with:
//digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ; //digit: octal-digit | '8' | '9' ; digit : '0' | nonzero-digit ;
The choice of subsuming nonzero-digit v.s. octal-digit is arbitrary, however, you may verify that using nonzero-digit will leave 18 reduce-reduce conflicts whereas octal-digit will not show any improvement over the 98 conflicts left by changing hexadecimal-constant.
The main thing about subsuming nonzero-digit is that it lowers the number of conflicts from tracing through a maze to a point where it’s possible to discern the rules generating them — namely that decimal-constant, digit, and octal-digit are all in the integer-constant rule chain and that only digit is included through digit-sequence in floating-constant. This leads to the next two tests.
Integer Constant Tested
The test in 3_IntegerConstantSolo temporarily eliminates floating-constant and its sub-rules from the scenario. integer-constant is used exactly as specified in the standard with no overlaps eliminated. No conflicts result.
Floating Constant Tested
The test in 4_FloatingConstantSolo temporarily eliminates integer-constant and its sub-rules from the scenario. floating-constant is used exactly as specified in the standard with no overlaps eliminated. No conflicts result.
Solution ‘A’: Floating Constant Adjusted
The previous tests lead to the conclusion that all the conflicts arise between the factors of integer-constant and floating-constant. In our first attempt to eliminate them entirely, we will assume that the statement in the Floating constants (6.4.4.2) section:
“The components of the significand part may include a digit sequence representing the whole-number part …”
is overly generous. Instead, we will specify:
“The components of the significand part may include a decimal-constant or a single '0' representing the whole-number part …”
This is overwhelmingly the common practice in system headers and we’ll just say “tough luck” to those with input significand data prefixed with multiple zeros.
The changes are as follows:
decimal-floating-constant // 6.4.4.2 : fractional-constant | fractional-constant floating-suffix | fractional-constant exponent-part | fractional-constant exponent-part floating-suffix | decimal-constant exponent-part | decimal-constant exponent-part floating-suffix | '0' exponent-part | '0' exponent-part floating-suffix ; fractional-constant // 6.4.4.2 : '.' digit-sequence | decimal-constant '.' digit-sequence | decimal-constant '.' | '0' '.' | '0' '.' digit-sequence ;
When the test in 5_DecConstFloat is run, no conflicts result. Can we be satisfied? No. There is another consideration. It is that we should take extensions into account to see how they fit with this solution.
Adjusted Floating Constant Tested
In 6_DecConstFixed we insert the grammar details from ISO/IEC TR 18037 and adjust decimal-fixed-constant to match the changes in floating-constant thusly:
decimal-fixed-constant : fractional-constant fixed-suffix | fractional-constant exponent-part fixed-suffix // | digit-sequence fixed-suffix // | digit-sequence exponent-part fixed-suffix | decimal-constant fixed-suffix | decimal-constant exponent-part fixed-suffix | '0' fixed-suffix | '0' exponent-part fixed-suffix ;
The test is run but the result is not pleasing. There are 4 shift-reduce conflicts with long-suffix and unsigned-suffix. Although it might be possible to fix the problem, we shall not pursue this solution any further because of the potential of alienating some users, backwards compatiblity problems, and gnarly grammar requirements for language extensions.
Solution ‘B’: Integer Constant Adjusted
Now the factors of integer-constant are adjusted in hopes of finding a better solution. It is here we detail the major defect this report remarks:
The current syntax specifies that octal-constant, decimal-constant, and digit-sequence manifest at the same level. octal-constant and decimal-constant specify a mutually exclusive situation for the leading digit but digit-sequence will accept any leading digit. To pit either of octal-constant or decimal-constant against digit-sequence results in conflicts. This is not acceptable.
Our solution is to eliminate octal-constant, nonzero-digit, and decimal-constant and use digit-sequence in integer-constant.
integer-constant // 6.4.4.1 : digit-sequence | digit-sequence integer-suffix | hexadecimal-constant | hexadecimal-constant integer-suffix ;
This change eliminates all conflicts reported by the parser generator. At first glance, it might seem to be a radical solution, but in fact it is not:
- Firstly, consider that there is no integral type octal or nibble. These types, if they existed, would constrain their digit representation to precisely that of the octal-digit subset. So octal-constant is just a notational convenience—not a first class type in the integral types.
- As illustrated in the code example of Adjusted Integer Constant Tested, the ability to recognize octal-constant and decimal-constant is not lost. They can be differentiated as sub-classes of digit-sequence in integer-constant at an early and convenient time.
- The number of operations to validate digit-sequence and assign it to an integer-constant value can decrease from a separately specified octal-constant and decimal-constant. A single validate/assign from digit-sequence is more efficient.
- There are fewer specification rules and hence any DFA construction will be more efficient, it’s resulting size smaller, and runtime faster.
- Dennis Ritichie states that integer-constant and floating-constant are “undefined terminal symbols”. So the subset of the grammar rules inherited from ANSI X3.159-1989 which are the constituent parts of integer-constant don’t appear to be obligatory parts the standard should be constrained by—especially considering that they produce syntax conflicts. This does not imply that the current descriptions are invalid. Only that their application should be limited to defining semantic interpretation or formal lexical definitions in the places where they have an adverse affect on the syntactic sub-structure.
The folder 7_DigSeq contains a grammar file with the digit-sequence solution. The fixed-constant extension grammar has been merged into this grammar and passes through the parser generator without complaint. Also, it does not have to be adjusted as it was in the previous floating-constant attempt.
Adjusted Integer Constant Tested
The folder 8_DigSeqCode contains a grammar file with the digit-sequence solution including both fixed-constant and decimal-floating-suffix (from ISO/IEC DTR 24732 (decimal floating-point arithmetic)) extensions added. Note that the grammar suggested in the latter is not optimal and has been re-specified as decimal-floating-suffix so as to leave floating-suffix untouched for use in hexadecimal-floating-constant.
The "const.y" grammar file in 8_DigSeqCode also includes code and there is a test file which can be passed to the compiled executable. The test file includes some example constants from ISO/IEC TR 18037.
Shift-Reduce conflicts in character-constant
The “const.y” file the folder 9_Constant_SR38 has had enumeration-constant and character-constant rules re-inserted. The running of its test shows 38 shift-reduce conflicts.
Discussion
These shift-reduce conflicts arise because octal digits arriving after the first \ octal-digit in octal-escape-sequence and hexadecimal digits arriving after the first \x hexadecimal-digit in hexadecimal-escape-sequence can’t be disambiguated from a c-char digit. The conflicts are broken down into 8 for the second octal-digit, 8 for the third octal-digit in octal-escape-sequence, and 22 for any hexadecimal-digit appended to the first rule of hexadecimal-escape-sequence.
The default parser behavior in the case of shift-reduce conflicts is to shift, and therefore, digits arriving after \ octal-digit or \x hexadecimal-digit will be assigned to the appropriate escape sequence until a non-digit character, the octal-escape-sequence limit of 3 digits, or the terminating single-quote character of the character-constant rule arrives. But this behavior does not permit the early assignment of these escape constants in escape-sequence. The pertinent paragraph from section 6.4.4.4 Character constants of the standard states:
The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type:
Prefix Corresponding Type none unsigned char L unsigned type corresponding to wchar_t u char16_t U char32_t
so it’s not possible to assign a value to these items until the character-constant rule is invoked.
character-constant : '\'' c-char-sequence '\' | 'L' '\'' c-char-sequence '\'' | 'u' '\'' c-char-sequence '\'' | 'U' '\'' c-char-sequence '\'' ;
Also noted is this paragraph from the same section of the standard:
- Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
This seems to imply that the limitation of 3 octal-digits in octal-escape-sequence is an unnecessary restriction.
Pseudo-code for octal-escape-sequence in this section’s test is as follows:
octal-escape-sequence : '\\' octal-digit | '\\' octal-digit octal-digit | '\\' octal-digit octal-digit octal-digit { // code here inserts "### $1$2$3$4 ### -> " } ;
and for an input of \7654 the result shows: '### \765 ### -> 4'. So currently, octal-escape-sequence doesn't match the text of paragraph 7 of the standard.
Conflict-Free Constants
The “const.y” file the folder 10_Constant contains grammar alterations which eliminate the shift-reduce problems in character-constant. The alterations are where:
octal-escape-sequence : '\\' octal-digit | '\\' octal-digit octal-digit | '\\' octal-digit octal-digit octal-digit ;
is changed to:
octal-escape-start : '\\' octal-digit ;
and
hexadecimal-escape-sequence : '\\' 'x' hexadecimal-digit | hexadecimal-escape-sequence hexadecimal-digit ;
is changed to:
hexadecimal-escape-start : '\\' 'x' hexadecimal-digit ;
and escape-sequence is adjusted accordingly:
escape-sequence : simple-escape-sequence | octal-escape-start | hexadecimal-escape-start | universal-character-name ;
The resulting behavior is that a single escaped octal-digit or hexadecimal-digit is followed by one or more c-char characters and the escapes may be validated and assigned in character-constant. The rule names have been changed to reflect their role.
One thing to remark about this solution is that currently it’s possible to insert both a start and end sequence delimiter for all rules but octal-escape-sequence in escape-sequence. The changes apply this limitation to hexadecimal-escape-sequence as well. Although this is a relatively minuscule point—because one delimiter is enough to parse the escape sequence—we feel it should be remarked on in the interests of completeness.
Concluding Remarks
Syntax defects in the Section 6.4.4 Constants section of ISO/IEC 9899 have been identified and described.
While Solution ‘B’ eliminates all conflicts between integer-constant and floating-constant and works efficiently, it has the side-effect of demoting both octal-constant and decimal-constant to being sub-classes of digit-sequence.
A solution which eliminates the shift-reduce problems in character-constant has been found. It has a minor side effect for hexadecimal-escape-sequence.
Our solutions result in a conflict-free grammar for the constant section. The beneficial side effect is a totally stable take off point for future features.
Our request to WG14 is that it follow the straight-forward honesty of Dennis Ritchie so as to make it clear where the grammar presented in ISO/IEC 9899 deviates from being conflict-free LR(1) and note the changes necessary to keep it within that constraint. In this respect, we feel it would be more in accord with the first paragraph of the abstract quoted at the beginning of this report if the standard kept its grammar “portable, reliable, and efficient”.
Copyright © 2012, Philip Aker. All Rights Reserved.
Permission granted to ISO JTC1 / SC22 / WG14 for use and distribution of this document.
§