==========================================================
Submitter: Kayvan Memarian and Peter Sewell
Submission Date: 2016-09-22
Document: WG14 N2089
Related: Section 2 of N2012, Sections 3.1 and 3.2 (Q48-59) of N2013, DR338, DR451, N1793, N1818.
This document is based on N2012 (Section 2), adding a concrete Technical Corrigendum proposal for discussion, revising the text, and adding concrete examples (from N2013).
Reading uninitialised values has many conceivable semantics in C, including:
undefined behaviour (meaning that the compiler is free to compile the program to arbitrary code, with or without a warning)
making the result of any expression involving that value unpredictable
giving an arbitrary and unstable value (i.e, maybe with a different value on another read)
giving an arbitrary but stable value (with the same value if you read again)
Our survey (Question 2/15) shows that reading of uninitialised values is used in practice, e.g. when copying or serialising a struct which has some uninitialised members or uninitialised padding, (more rarely) when comparing against such a struct, and in debugging. Making such programs have undefined behaviour (option (a)) seems undesirable, permitting future compiler optimisations to break reasonable code. On the other hand, current mainstream compilers do optimise in ways that make (c) and (d) unsound.
The C11 standard allows reading of uninitialised values in certain circumstances. For types that (in the particular implementation in question) do not have any trap representations, this is undefined iff "the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken)" (see 6.3.2.1p2 and DR338). This clause seems to have been added to cope with the Itanium NaT, but it's not clear that it correctly models that (see below), and for other implementations it seems to complicate and weaken the language for no purpose.
In cases where reading uninitialised values is allowed (for the moment again considering only types without trap representations for the implementation in question), the standard gives an unspecified value, but for several questions, detailed below, the current standard text is unclear how these behave. This makes it unclear exactly what analysis and optimisation is allowed.
The best solution we can see, given the existing code and compilers, is to adopt a semantics that gives (b), and that clarifies each of the questions below.
We do so by a modest change to the C abstract machine: for any scalar type, we extend the set of values of that type with a symbolic "unspecified value" token, then we can give rules defining how that is propagated, e.g. if one adds an unspecified value and a concrete integer. The unspecified value token does not have a bit-level representation. We detail a possible technical corrigendum for all this below, after the examples.
As always, the "as if" rule applies: the fact that the abstract machine manipulates an explicit "unspecified value" token doesn't mean that implementations have to. Normal implementations may at compile-time but will not at runtime: at runtime they will typically have some arbitrary bit pattern where the abstract machine has the unspecified value token, and the looseness of the rules for operations on unspecified values licenses compiler optimisations. Our unspecified value token is (roughly) a language-level analogue of the LLVM undef
.
We believe that the same machinery can be used to handle structure padding and the padding after the current member of a union.
As an aside, it's possible that a "safe C" might also want to provide (d) as an option, especially given that a third of our survey respondents believe they can rely on that.
Example trap_representation_1.c
int main() {
int i;
int *p = &i;
int j=i; // should this have undefined behaviour?
// note that i is read but the value is not used
}
Example trap_representation_2.c
int main() {
int i;
int j=i; // should this have undefined behaviour?
// note that i is read but the value is not used
}
In C11 the first has defined behaviour and the second has undefined behaviour. In our proposal, both have defined behaviour, with the read of i
giving the unspecified value token, and that being written to j
. Note that these are minimal test cases: we're not saying that these are in themselves desirable code - in the cases that arise in practice, the creation of a (perhaps partially) uninitialised value and its use are separated. Note also that on many implementations int
has no unused representation bit patterns, so on those there can be no trap representations here. In the Itanium case, the NaT bit is per-register data, not a memory-storable bit pattern, and the "address taken" clause of 6.3.2.1p2 seems to have been intended to ensure that i
is allocated in memory - but an optimising compiler might easily keep a value in registers down some call hierarchy.
We start with this so that printf
can be used in later examples.
Example unspecified_value_library_call_argument.c
#include <stdio.h>
int main()
{
unsigned char c;
unsigned char *p = &c;
printf("char 0x%x\n",(unsigned int)c);
// should this have defined behaviour?
}
ISO C11 is unclear. The DR451 CR says "library functions will exhibit undefined behavior when used on indeterminate values" but here we are more specifically looking at unspecified values. We see no benefit from making this undefined behaviour, and we are not aware that compilers assume so. It prevents (e.g.) serialising or debug printing of partially uninitialised structs, or (if padding bytes are treated the same as other uninitialised values) byte-by-byte serialising of structs containing padding. Accordingly, we suggest that library functions such as printf
, when called with an unspecified value, are executed by first making an unspecified (nondeterministic) choice at call-time of a concrete value. This permits the instability of uninitialised values that we see in practice.
The DR451 CR also says "The committee agrees that this area would benefit from a new definition of something akin to a 'wobbly' value and that this should be considered in any subsequent revision of this standard. The committee also notes that padding bytes within structures are possibly a distinct form of 'wobbly' representation.". As far as we can see, our proposal subsumes the need for a distinct notion of 'wobbly' representation.
Example unspecified_value_control_flow_choice.c
#include <stdio.h>
int main()
{
unsigned char c;
unsigned char *p = &c;
if (c == 'a')
printf("equal\n");
else
printf("nonequal\n");
// should this have defined behaviour?
}
ISO C11 is unclear (it does not discuss this). We suggest "yes": permitting a runtime unspecified (nondeterministic) choice at any control-flow choice between specified alternatives based on an unspecified values. The only potential reason otherwise that we are aware of is (as noted by Joseph Myers) jump tables indexed by such a value, if implementations don't do a range check, but that seems likely to lead to security weaknesses. More conservatively, one could conceivably treat any switch
whose controlling expression has an unspecified value as having undefined behaviour. Computed goto
s (if they were allowed in the standard) on unspecified values should give undefined behaviour.
Example unspecified_value_stability.c
#include <stdio.h>
int main() {
// assume here that int has no trap representations and
// that printing an unspecified value is not itself
// undefined behaviour
int i;
int *p = &i;
// can the following print different values?
printf("i=0x%x\n",i);
printf("i=0x%x\n",i);
printf("i=0x%x\n",i);
printf("i=0x%x\n",i);
}
Clang sometimes prints distinct values here (this is consistent with the Clang internal documentation: ). Accordingly, we think the answer has to be "yes".
Example unspecified_value_strictness_and_1.c
#include <stdio.h>
int main() {
unsigned char c;
unsigned char *p=&c;
unsigned char c2 = (c | 1);
unsigned char c3 = (c2 & 1);
// does c3 hold an unspecified value (not 1)?
printf("c=%i c2=%i c3=%i\n",(int)c,(int)c2,(int)c3);
}
An LLVM developer remarks that different parts of LLVM assume that undef
is propagated aggressively or that it represents an unknown particular number.
In our proposal, the read of c
will give the unspecified value token, the result of the binary operation |
, if given at least one unspecified-value arguments, will also be the unspecified value token, which will be written to c2
. Likewise, the binary &
will be strict in unspecified-value-ness, and c3
will end up as the unspecified value. The printf
will then make nondeterministic choices for each of these, allowing arbitrary character-valued integers to be printed by implementations.
Our unspecified value token is a per-scalar-type-object entity, not a per-bit entity (except for single-bit bitfields, where the two coincide).
(Note this would make the N1793 Fig.4 printhexdigit not useful when applied to an uninitialised structure member.)
Example unspecified_value_daemonic_1.c
int main() {
int i;
int *p = &i;
int j = i;
int k = 1/j; // should this have undefined behaviour?
}
The division operation has undefined behaviour for certain concrete argument values, i.e. 0, to accommodate implementation behaviour. If there is an abstract-machine execution in which the second argument is an unspecified value, then a corresponding execution of an actual implementation might divide by zero, so in the abstract machine division should be daemonic: division by an unspecified value should be just as "bad" as division by zero. The same holds for other partial operations and library calls.
This seems to be relied on in practice, and consistent with the "unspecified value token" semantics we have so far, so we suggest "yes" (except in the (Itanium) implementation-defined case above where all reads of uninitialised values give UB). The copy will have an unspecified value for the same member.
Consistent with this, forming a structure value should not be strict in unspecified-value-ness: in the following example, the read of the structure value from s1
and write to s2
should both be permitted, and should copy the value of i1=1
. The read of the uninitialised member should not give rise to undefined behaviour (is this contrary to the last sentence of 6.3.2.1p2, or could the structure not `have been declared with the register storage class'' in any case?) . What
s2.i2` holds after the structure copy depends on the rest of the unspecified-value semantics; in our proposal, it holds the unspecified value token.
Example unspecified_value_struct_copy.c
#include <stdio.h>
typedef struct { int i1; int i2; } st;
int main() {
st s1;
s1.i1 = 1;
st s2;
s2 = s1; // should this have defined behaviour?
printf("s2.i1=%i\n",s2.i1);
}
This and the following questions investigate whether the property of being an unspecified value is associated with arbitrary (possibly aggregate) C values, or with "leaf" (scalar-type) values, or with individual bitfields, or with individual representation bytes of values, or with individual representation bits of values.
In principle there is a similar question for unions: can a union value as a whole be an unspecified value? There might be a real semantic difference, between an unspecified value as whole and a union that contains a specific member which itself is an unspecified value. But it's unclear whether there is a test in ISO C that distinguishes the two.
Example besson_blazy_wilke_bitfields_1u.c
#include <stdio.h>
struct f {
unsigned int a0 : 1; unsigned int a1 : 1;
} bf ;
int main() {
unsigned int a;
bf.a1 = 1;
a = bf.a1;
printf("a=%u\n",a);
}
This example is from Besson, Blazy, and Wilke 2015.
For consistency with the rest of our per-leaf-value proposal, we suggest "yes".
Example unspecified_value_representation_bytes_1.c
#include <stdio.h>
int main() {
// assume here that the implementation-defined
// representation of int has no trap representations
int i;
unsigned char c = * ((unsigned char*)(&i));
// does c now hold an unspecified value?
printf("i=0x%x c=0x%x\n",i,(int)c);
printf("i=0x%x c=0x%x\n",i,(int)c);
}
The best answer to this is unclear from all points of view: ISO C11 doesn't address the question; we don't know whether existing compilers assume these are unspecified values, and we don't know whether existing code relies on them not being unspecified values.
For stylistic consistency one might take the answer to be "yes", but then (given the suggested answers above) a bytewise hash or checksum computation involving them would produce an unspecified value. In a more concrete semantics, it could produce different results in different invocations, even if the value is not mutated in the meantime.
We don't have sufficient grounds for a strong conclusion at present. We tentatively suggest "yes".
Example unspecified_value_representation_bytes_4.c
#include <stdio.h>
int main() {
// assume here that the implementation-defined
// representation of int has no trap representations
int i;
printf("i=0x%x\n",i);
printf("i=0x%x\n",i);
unsigned char *cp = (unsigned char*)(&i);
*(cp+1) = 0x22;
// does *cp now hold an unspecified value?
printf("*cp=0x%x\n",*cp);
printf("*cp=0x%x\n",*cp);
}
This too is unclear. One could take the first such access as "freezing" the unspecified value and its representation bytes, but we don't know whether that would be sound with respect to current compiler behaviour. The simplest choice is "yes".
Example unspecified_value_representation_bytes_2.c
#include <stdio.h>
int main() {
// assume here that the implementation-defined
// representation of int has no trap representations
int i;
printf("i=0x%x\n",i);
printf("i=0x%x\n",i);
* (((unsigned char*)(&i))+1) = 0x22;
// does i now hold an unspecified value?
printf("i=0x%x\n",i);
printf("i=0x%x\n",i);
}
Again "yes" is the simplest choice, but one could argue instead that a read of the whole should give any nondeterministically chosen value consistent with the concretely written bytes.
Values, unspecified values, indeterminate values, and trap representations are currently defined as follows (3.19):
3.19
1 value
precise meaning of the contents of an object when interpreted as having a specific type3.19.2
1 indeterminate value
either an unspecified value or a trap representation3.19.3
1 unspecified value
valid value of the relevant type where this International Standard imposes no requirements on which value is chosen in any instance
2 NOTE
An unspecified value cannot be a trap representation.3.19.4
1 trap representation
an object representation that need not represent a value of the object type
There "unspecified value" is used to speak of a particular but unknown concrete value (one might be misled by the language of 3.19.3 into thinking that an uninitialised variable that cannot be a trap representation must hold some particular concrete value at runtime, and hence that it will be stable if read multiple times).
Instead, in our proposal there is a "symbolic" unspecified value token at each type, and the abstract machine operates over the disjoint union of the normal concrete semantic values and this token. The notions of value, indeterminate value, and trap representation change accordingly, but in a way that basically preserves the way these terms are currently used:
3.19
1 concrete value
a concrete semantic value (the meaning of the non-trap-representation contents of an object when interpreted as having a specific type)1 value
either a concrete value or the abstract unspecified value token for a specific type3.19.2
1 indeterminate value
either a concrete value, or the unspecified value token for a specific type, or a trap representation for a specific type3.19.3
1 unspecified value
an abstract token, distinct from all concrete values.
2 NOTE unspecified values typically will not have any runtime representation; they are used in the C abstract machine to define what optimisations are allowed, and implementations may use them for compile-time analysis. In the C abstract machine there is a single unspecified value for each type.3.19.4
1 trap representation
a concrete object representation that does not represent a value of the object type
In C11 unspecified values arise in two main ways:
These are unchanged in our proposal, but we have to remove the following, to make reading uninitialised values for types without trap representations be defined behaviour:
Following the general principle that an unspecified value in the C abstract machine should permit the implementation behaviour when given an arbitrary concrete value, control-flow choices between bounded alternatives should be an unspecified choice between them, while control-flow choices to an unbounded choice of locations should give unspecified behaviour. Accordingly, we suggest the following changes to the standard text:
modify (§6.5.15#4, for the Conditional Operator) from:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). The second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
to the following:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). If the first operand evaluates to an unspecified value, then is unspecified then it is unspecified which of the second or third operand is evaluated. Otherwise, the second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
add to (6.8.4.1#2, for the if statement)
If the controlling expression evaluates to an unspecified value, it is unspecified which substatement is executed.
For switch statements, we could either follow the above or make it unspecified behaviour.
add to (§6.8.5#4, for iteration statements)
If the controlling expression evaluates to an unspecified value, it unspecified whether the repetition occurs.
The semantics of expression operators should be "daemonic" when given an operand that evaluates to an unspecified value. That is, for any operand of an expression operator, if there is a given concrete value for which the behaviour of that operator is undefined, then if that operand is an unspecified value, then the behaviour of the operator is undefined (because an exceptional condition may occur).
For example if the value of the second operand of the /
operator is zero, it gives undefined behaviour, and hence if the second operand is an unspecified value, it should likewise give undefined behaviour. Similarly if the value of any operand of a signed +
operator is unspecified, the operation is undefined.
Accordingly, we suggest the following change to the text of the standard:
adding as a new clause to §6.5:
During the evaluation of an expression operator, if an unspecified value is used as an argument such that there exists a specified value for which the behaviour is undefined, then the behaviour is undefined.
To give an implementation intuition: daemonicity means that implementations do not need to take care not to introduce undefined behaviours when choosing a concrete value for something that in the C abstract machine is an unspecified value.
We suggest that the semantics of expression operators should be made strict on unspecified values. That is, except in the cases of daemonic undefined behaviour above, if the value of any operand of an expression operator is an unspecified value, then the result is an unspecified value.
For example if the value of the first operand of the /
operator is unspecified, then the result has unspecified value.
Accordingly, we suggest the following changes to the text of the standard:
adding as a new clause to §6.5:
During the evaluation of an expression operator (other then the relations operators, equality operators, logical OR and AND operators and conditional operators), if an operand has an unspecified value and this does not cause undefined behavior, then the value of the operator is unspecified.
Fleshing out the details of strict and daemonic treatment of unspecified values, we propose the following changes (one could just rely on the diffs from the two sections above, but that seems likely to lead to confusion - it seems worth adding the more verbose but detailed diff below too).
Function call
add the following clause to (§6.5.2.2 Semantics):
If the value of the expression that denotes the called function is unspecified, the behavior is undefined.
Structure and union members
append the following to the end of (§6.5.2.3#3 and #4):
If the value of the first expression is unspecified, the behavior is undefined.
Postfix increment and decrement operators
add the following clause at the beginning of (§6.5.2.4 Semantics):
If the value of the operand of the postfix increment or decrement operand is unspecified, the behavior is undefined.
Address operator add the following sentence to(§6.5.3.2#3):
If the value of its operand is unspecified, the behavior is undefined.
Indirection operator add to (§6.5.3.2#4) before its current last sentence:
If the value of the operand is unspecified, the behavior is undefined.
NOTE: this also takes care of the array subscripting operator given (§6.5.2.1#2)
unary ~ operator add the following sentence to (§6.5.3.3#4):
If the value of the operand is unspecified, the result is the unspecified value of the promoted type.
sizeof and _Alignof operators add the following clause to (§6.5.3.4 Semantics):
If the value of the operand of the sizeof or _Alignof operator is unspecified, then the result is the unspecified value of type
size_t
.
Multiplicative operators in (§6.5.5#5), replace:
In both operations, if the value of the second operand is zero, the behavior is undefined.
by
In both operations, if the value of the second operand is zero or is unspecified, the behavior is undefined.
and add the following clause to (§6.5.5 Semantics):
If the common real type is signed and the value of either operand is unspecified, the behavior is undefined.
Additive operators add the following clause to (§6.5.6):
If both operands have arithmetic type and their common real type is signed, if the value of either operand is unspecified, the behavior is undefined.
and the following clause to (§6.5.6):
If one of the operand has a pointer type and the value of either operand is unspecified, the behavior is undefined.
Bitwise shift operator replace the current last sentence of (§6.5.7#3):
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
with
If the value of the right operand is negative, is greater than or equal to the width of the promoted left operand or is unspecified, the behavior is undefined.
and append to the end of (§6.5.7#4):
In particular if
E1
has signed type and its value is unspecified, the behavior is undefined.
Relational operator append to the end of (§6.5.8#5):
In particular if any the value of either operand is unspecified, the behaviour is undefined.
and add the following clause to the end of (§6.5.8):
When two objects of real types are compared, and the value of either operand is unspecified, it is unspecified whether operator shall yield 1 or 0.
Equality operator in clause (§6.5.9#3) replace current final sentence:
For any pair of operands, exactly one of the relations is true.
by
For any pair of operands, exactly one of the relations is true (except when the value of either operand is unspecified).
and append to the end of (§6.5.9#4):
If the value of either operand is unspecified, it is unspecified whether they are equal.
and add the following clause to (§6.5.9): > If at least one operand is a pointer and the value of either operand is unspecified, > the behavior is undefined.
Bitwise AND operator append to the end of (§6.5.10#4):
If the value of either operand is unspecified, the result is the unspecified value of the common real type of the operands.
Bitwise exclusive OR operator append to the end of (§6.5.11#4):
If the value of either operand is an unspecified, the result is the unspecified value of the common real type of the operands.
Bitwise inclusive OR operator append to the end of (§6.5.12#4):
If the value of either operand is unspecified, the result is the unspecified value of the common real type of the operands.
Logical AND operator add the following sentence to (§6.5.13#3):
If the value of either operand is unspecified, it is unspecified whether the operator yield 1 or 0.
and append to the end of (§6.5.13#4):
If the value of the first operand is unspecified, it is unspecified whether the second operand is evaluated.
Logical OR operator add the following sentence to (§6.5.14#3):
If the value of either operand is unspecified, it is unspecified whether the operator yield 1 or 0.
and append to the end of (§6.5.14#4):
If the value of the first operand is unspecified, it is unspecified whether the second operand is evaluated.
Conditional operator modify (§6.5.15#4) from the current:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). The second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
to the following:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). If the first operand evaluates to an unspecified value, then is unspecified then it is unspecified which of the second or third operand is evaluated. Otherwise, the second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
Assignment operators add the following clause to (§6.5.16):
If the value of the left operand is unspecified, the behavior is undefined.
To permit (for example) bytewise marshalling of structs that may contain unspecified value members or unspecified value padding, we have to allow invocation of standard library functions that make system calls with unspecified value arguments. The following sentence should therefore be added to clause (§7.1.4#1):
If an argument to a standard library function has an unspecified value, it is replaced by a nondeterministic concrete value in an unspecified fashion.
NOTE: This also makes standard library functions that have undefined behaviour for specific arguments daemonic, in the same way as operators.
NOTE: It might be preferable to restrict these conversions to library functions performing I/O (or more generally having to do with system calls).
NOTE: A return from main()
with an unspecified value should be similar, making a nondeterministic choice of a concrete value in an unspecified fashion.
§6.2.6 Representations of types has to be adapted to account for the fact that values can be either concrete values (which have representations) or unspecified value tokens (which do not), and to specify that when one reads a byte from an unspecified value, one gets an unspecified value at type unsigned char
.
6.2.6 Representations of types
modify (§6.2.6.1#3) from the current
Values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation.49)
to:
Concrete values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation.49)
modify (§6.2.6.1#4) from the current
Values stored in non-bit-field objects of any other object type consist of
n × CHAR_BIT
bits, wheren
is the size of an object of that type, in bytes. The value may be copied into an object of typeunsigned char [n]
(e.g., bymemcpy
); the resulting set of bytes is called the object representation of the value. Values stored in bit-fields consist of m bits, where m is the size specified for the bit-field. The object representation is the set of m bits the bit-field comprises in the addressable storage unit holding it. Two values (other than NaNs) with the same object representation compare equal, but values that compare equal may have different object representations.
to
Concrete values stored in non-bit-field objects of any other object type shall be represented with
n × CHAR_BIT
bits, wheren
is the size of an object of that type, in bytes. A concrete value may be copied into an object of typeunsigned char [n]
(e.g., bymemcpy
); the resulting set of bytes is called the object representation of the value. An unspecified value stored in a non-bit-field object may also be copied into such an array, the elements of which then hold the unspecified value for typeunsigned char
. Concrete values stored in bit-fields consist ofm
bits, wherem
is the size specified for the bit-field. The object representation is the set ofm
bits the bit-field comprises in the addressable storage unit holding it. Two concrete values (other than NaNs) with the same object representation compare equal, but concrete values that compare equal may have different object representations.
Consider either (1) a object which has not been initialised, and therefore has an unspecified value, which then has one or more (but not all) of its representation byte written to with a concrete value, or (2) an object that has been initialised but which has one or more (but not all) of its representation bytes overwritten from some other uninitialised object. In either case we have a mix of unspecified-value and concrete representation bytes. When reading from the whole object, the read value should be the unspecified value of the appropriate type. In particular if the read value is then stored back to the object, the representation bytes that were made concrete disappear.
§6.3.2.1 Lvalues, arrays, and function designators. Replace
Except when it is the operand of the
sizeof
operator, the unary&
operator, the++
operator, the--
operator, or the left operand of the.
operator or an assignment operator, an lvalue that does not have array type is converted to the value stored in the designated object (and is no longer an lvalue); this is called lvalue conversion.
by
Except when it is the operand of the
sizeof
operator, the unary&
operator, the++
operator, the--
operator, or the left operand of the.
operator or an assignment operator, an lvalue that does not have array type is converted to the value stored in the designated object (and is no longer an lvalue); this is called lvalue conversion. If any byte of any scalar type subobject of the designated object is an unspecified value, this conversion for that subobject gives the unspecified value for that type.