==========================================================
Document: N2223
Related:
WG14 Pittsburgh meeting, 2016-10
WG14 London meeting, 2016-04
We have been working for some time to clarify the memory object model of C: the behaviour of pointer operations, uninitialised values, padding, effective types, and so on. These have been vexed questions for many years. C values cannot be treated as simple abstract or concrete entities: the language exposes their representations, but compiler optimisations rely on analyses that reason about provenance and initialisation status, not just runtime representations. The current ISO standard some aspects of this unclear, and in some aspects conflicts with de facto standard usage - which itself is difficult to investigate.
We have subdivided the problem into six areas, covering subsets of the questions from our London 2016-04 WG14 N2013. For the first four we provide notes for the Brno 2018-04 WG14 meeting, the first three of which are revisions of our N2089, N2090, N2091 from the Pittsburgh 2016-10 meeting.
N2219: Clarifying Pointer Provenance (Q1-Q20) v3
Here we have a proposed semantics worked out in detail, including a proposed change to the standard text that captures it, and we have reasonable confidence that it is doing the right thing.
N2221: Clarifying Unspecified Values (Q48-Q59) v3
For these two we also have proposed semantics worked out in detail, including proposed changes to the standard text, but there are substantial choices to be made which should be discussed.
N2222: Further Pointer Issues (Q21-Q46)
This is a collection of additional more specific issues about C pointer semantics which either need clarification or where there seems to be a difference between ISO C and de facto standard usage and implementations.
Structure and Union Padding (Q60-Q72)
For these we refer to Section 3.3 of our revised N2013
Effective Types (Q73-Q81)
For these we refer to Section 4 of our revised N2013
In several areas there are multiple desirable semantic options, e.g. for C with and without the mainstream implementation -fno-strict-aliasing
flag, which has a profound effect on the language. The ISO standard has historically not defined such options (beyond the current implementation-defined aspects), but we believe that in the next major version it should define the major options. We identify some of them (some already existing, some new) in these notes. For some, one might want it to be implementation-defined which options are supported. This also provides a way of defining common "safer C" dialects, by a suitable combination of options.
Our questions and proposals are informed by our previous surveys of C experts, by extensive discussion with many people, and by our Cerberus project work, in which we are developing a rigorous semantic model for a substantial fragment of C.
Cerberus is available via a web interface that allows one to explore the behaviour of small test programs. It shows the elaboration into Core and the execution thereof, and it also highlights the relevant clauses of the C11 standard. Feedback from WG14 would be welcome (though note that it is work in progress and subject to various limitations, as below).
Cerberus supports many of the examples in these notes, and the web interface includes them.
Where the ISO C11 standard is clear and corresponds with practice, Cerberus aims to follow that.
Where there are differences, chiefly in the memory layout model (the behaviour of pointers, pointer arithmetic, uninitialised values, etc.), we aim to clarify the de facto standards, understand how the ISO standard could be reconciled with them, and provide options that capture both.
Cerberus precisely defines a range of allowed behaviour, not just that of some specific implementation, with a combination of loose (but precise) specification, e.g. for evaluation order, and parameterisation (e.g. for integer type sizes).
Cerberus is executable as a test oracle, to explore all behaviours or single paths of small test programs. It should be able to identify all undefined behaviours. It can thus be used as a reference for discussion and for understanding the standard. It should also be usable as an oracle for compiler testing. It is not an analysis tool for production C code.
The semantics defines a source-language semantics for C source. The design of compiler internal-language semantics is subject to different trade-offs, e.g. as in recent work for LLVM by Lee et al. (J. Lee, Y. Kim, Y. Song, C. Hur, S. Das, D. Majnemer, J. Regehr, N. P. Lopes. Taming Undefined Behavior in LLVM. In Proc. PLDI 2017). There should be a refinement relation between the two, and preliminary discussion suggests that should be achievable.
The semantics is factored via an elaboration into a simpler Core language, to make it readable and conceptually and mathematically tractable; the dynamic semantics of Core can be linked with various sequential memory object models. We currently have two such models: a symbolic model, in which allocation addresses are maintained symbolically and execution accumulates constraints on them (with Z3), and a concrete model, using a specific allocator.
The front-end is written from scratch to closely follow the C11 standard, including a parser that follows the C11 standard grammar, and a typechecker.
We intend to make Cerberus open-source in due course.
Our model covers many features of C, both syntactic and semantic, but to keep the problem manageable we exclude some important aspects. We do not currently attempt to cover preprocessor features, C11 character-set features, floating-point and complex types (beyond simple float constants), user-defined variadic functions (we do cover printf
), bitfields, volatile
, restrict
, generic selection, register
, flexible array members, some exotic initialisation forms, signals, longjmp
, multiple translation units, most of the standard library, or concurrency. We focus on the C commonly used for mainstream systems programming without effective types, with -fno-strict-aliasing
. We are not trying to cover every conceivable C implementation, or those on exotic architectures. Our semantics is intended as a source semantics for C. For an intermediate language, e.g. LLVM, there are related but distinct goals. Our implementation and web tool are intended to explore the behaviour of small test cases, which is already challenging, not as a bug-finding tool for production C code, which would need considerably higher performance and coverage. Finally, while we have done significant testing and validation, more would always be desirable; Cerberus is work in progress, not a completely finished system.