ISO/IEC JTC1 SC22 WG21 N2556 = 08-0066 - 2008-02-29
Paul E. McKenney, paulmck@linux.vnet.ibm.com
Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Introduction
    Problem
    Prior Work
    Alternatives Considered
Solution
    Dependency Root
    Dependency Tree
    Informal Specification
Examples
    Indirection
    Code Removal
    Control-Senstive Indirection
    Constant Propogation
    Control Elimination
    Control Dependence
    Conditional Subexpression Elimination
    Constant Results
    Selective Dependency Ordering
Implementation
    Promote to Acquire
    Avoid Some Optimizations
    Track Optimizations
    Truncate Data-Dependency Trees
    Annotate Functions
Proposed Wording
    1.10 Multi-threaded executions and data races [intro.multithread]
    29 Atomic operations library [atomics]
    29.1 Order and Consistency [atomics.order]
    29.4.4 (tentative) Operations [atomics.types.operations (tentative)]
The efficiency of data structures that are read frequently and written rarely can substantially affect the scalability of some applications. Based on experience in making the Linux operating system scalable, we propose addenda to the memory model (N2429) and atomics library (N2427) for inter-thread data-dependency ordering.
This proposal admits a trivial implementation, limiting significant implementation investment to those compilers and platforms where that investment will be recovered.
There are two significant use cases where the current working draft (N2461) does not support scalability near that possible on some existing hardware.
In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.
A simplified example use of inter-thread data-dependency ordering, found within the Linux kernel, looks something like the following:
struct foo {
    int a;
    struct foo *next;
};
struct foo *head;
void insert(int a) {
   struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL); /* cannot fail */
   spin_lock(&mylock);
   p->a = 1;
   p->next = head;
   smp_wmb();  /* Can be thought of as a store-release fence. */
   head = p;
   spin_unlock(&mylock);
}
int getfirstval(void) { /* requires head not NULL */
   struct foo *q = rcu_dereference(head);  /* see discussion below. */
   int retval = q->a;
   return retval;
}
The effect of getfirstval
is to return the value at the head of the list
with little more (or even no more) overhead
than would be required if the list were immutable,
but while still allowing updates.
The rcu_dereference() API used in getfirstval()
can be fully implemented in different ways,
optimized for different classes of machines:
rcu_dereference()
simply prevents the compiler from performing optimizations
that would order operations
with data dependencies on q
before the load from head.
In this case,
the code relies on the strong ordering
to prevent the assignment to retval
from seeing the pre-initialized version of the a field
because the store to a
must precede the store to head->next.
rcu_dereference()
again prevents the compiler from performing optimizations
that would order operations with data dependencies on q
before the load from head.
However, in this case,
the code
relies on the the machine's enforcement of data-dependency ordering
to prevent the assignment to retval
from seeing the pre-initialized version of the a field,
because q->a depends on q.
rcu_dereference()
is promoted to a load-acquire operation.
Because the acquire prevents all subsequent memory references
from being reordered with the load from head,
it must prevent any subsequent operations depending on h
from being reordered with the load from head.
The current working draft
(N2461)
would require that these machines
implement rcu_dereference() using either an acquire fence or
a load-acquire.
In both cases, this prohibits useful classes of compiler optimizations
that involve code motion that does not break dependencies on the
load from head.
Worse yet, this requires emitting expensive memory fences for
the second class of machines, which can result in unacceptable performance
degradation.
More elaborate examples are described in a presentation at the Oxford 2007 meeting, describing use cases from the Linux kernel. These uses cases begin on slide 37 and include traversal of multiple levels of pointers, indexing arrays, and casts.
N2171 and N2176 are the basis for the current memory model. These proposals support a wide range of memory-ordering use cases, but do not support dependency ordering.
N2153 by Silvera, Wong, McKenney, and Blainey was the first proposal to explicitly address weakly ordered architectures and the issues surrounding dependency ordering. It was succeded by N2237. These papers also presented a number of use cases motivating non-SC memory ordering, including dependency ordering.
N2195
by Peter Dimov
proposes an atomic_load_address()
template function that protects a single level of indirection.
Although this suffices for the very simple example above, it does
not handle other examples given in a
presentation at the Oxford 2007 meeting
describing use cases from the Linux kernel (beginning on slide 37).
In particular,
N2195,
does not support data dependencies that traverse
multiple levels of indirection nor that traverse array accesses.
An alternative proposal in N2195, introduces the notion of dynamic dependencies. Use of dynamic dependencies would permit the data-dependency trees to be scanned after performing those optimizations that do not break dynamic data-dependency trees. However, this proposal was rejected due to software-engineering concerns, which loom especially large in cases where the compiler is able to perform optimizations that the programmer cannot anticipate. For example, the programmer might be forgiven for assuming that an argument to a given function was variable, but a compiler doing inter-procedural analysis might discover that it was in fact constant, or, worse yet, zero. The compiler is therefore required to propagate dependency trees regardless of optimization.
N2492, by Paul E. McKenney, Hans-J. Boehm, and Lawrence Crowl, and N2493, and by Paul E. McKenney and Lawrence Crowl, present an approach combining elaborations to the memory model, atomics API, and annotations. This proposal adapts that work in light of discussions at the 2008 Bellevue meeting.
Although control dependencies are extremely intuitive, there are comparatively few known control-dependency use cases, and ARM CPUs only partially support control dependencies. Furthermore, some of the more troublesome optimization issues with switch statements involve control rather than data dependencies. Therefore, there is no support for control dependencies. If experience indicates that control dependencies also need to be enforced, a separate proposal will be put forward for them.
Prohibiting dependency-breaking optimizations would remove the need for annotations. This faced severe resistance, as a number of people felt that this would prohibit valuable optimizations. Therefore, this proposal requires annotations for function arguments and return values through which data dependencies are required to flow. As inter-compilation-unit analysis becomes more common, it is hoped that tools will appear that check annotations or perhaps even produce them automatically. However, individual implementations are free to avoid the dependency issue entirely by simply refraining from breaking data dependencies, or by emitting compensating memory fences when breaking data dependencies. (Full disclosure: this was in fact the original proposal.)
Simply relying on acquire fences would remove the need for dependency ordering. Although this is a reasonable strategy for many machines, it is inappropriate for weakly ordered machines that support data-dependency ordering.
We propose explicit program support for inter-thread data-dependency ordering. Programmers will explicitly mark the root of tree of data-dependent operations, and implementations will respect that ordering.
To mark the root of a inter-thread data-dependency tree,
programmers will use new variant of the atomic load defined in
N2427.
Specifically,
this proposal augments 
N2427
by adding a memory_order_consume option that may be supplied
to operations for which data-dependency semantics are permitted.
The memory_order enumeration in
N2427
would then read as follows, keeping the enumeration in rough order
of increasing memory-ordering strength:
typedef enum memory_order {
    memory_order_relaxed, memory_order_consume, memory_order_acquire,
    memory_order_release, memory_order_acq_rel, memory_order_seq_cst
} memory_order;
Given the load value as the root of a data-dependency tree, the tree is loosely defined as any operation or function (within the same thread) that has a data-dependent argument within the tree or reads a variable stored by a data-dependent assignment within the tree.
Note that it is possible for a given value to be part of multiple dependency trees. One way that this might happen would be to add a value in one dependency tree to another value in a different dependency tree. The sum would then be in both dependency trees.
The compiler must preserve the dependency tree through all optimizations. In particular, if the compiler is able to optimize a member of a dependency tree to a constant, then the compiler must either produce code that preserves the dependency tree or emit a memory barrier appropriate to the target architecture.
A data-dependency tree ends by the death of values within the tree. Since trees extend into called functions and out through return values, these trees may extend until the end of program execution. The section on implementation describes strategies for dealing this unbounded extent in the normal compilation process.
When normal compilation of an unbounded extent
proves too inefficient,
the programmer may explicitly prune a data-dependency tree
by passing a value through the identity function
std::kill_dependency.
The result is, by definition, not inter-thread data-dependent on the argument,
even though the values are identical.
Informally, we define inter-thread data-dependency ordering in terms of
The full details are within the formal wording.
In N2176, Hans Boehm lists a number of example optimizations that can break dependency trees, which are discussed in the following subsections.
N2176 example code:
r1 = x.load(memory_order_relaxed);
r2 = *r1;
Recoding to this proposal's API:
r1 = x.load(memory_order_consume);
r2 = *r1;
Assuming that x is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1,
so the dependency is ordered.
N2176 example code:
r1 = x.load(memory_order_relaxed);
r3 = &a + r1 - r1;
r2 = *r3;
This could legitimately be optimized to the following, breaking the dependency tree:
r1 = x.load(memory_order_relaxed);
r3 = &a;
r2 = *r3;
However, recoding to this proposal's API:
r1 = x.load(memory_order_consume);
r3 = &a + r1 - r1;
r2 = *r3;
Again assuming that x is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1,
so the dependency is ordered.
Because the dependency trees must be traced prior to optimization,
if the optimization is performed,
a countervailing memory fence or artificial data dependency must be inserted.
N2176 example code, recoding to this proposal's API:
r1 = x.load(memory_order_consume);
if (r1 == 0)
        r2 = *r1;
else
	r2 = *(r1 + 1);
Assuming that x is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1,
so the dependency is ordered.
N2176
example code, as modified during email discussions,
where x is known to be either 0 or 1:
if (x.load(memory_order_consume))
	...
else
	...
y = 42 * x / 13;
This might be optimized to the following:
if (x.load(memory_order_consume)) {
	...
	y = 3;
} else {
	...
	y = 0;
}
assuming that x is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the assignment to y,
so the dependency is ordered.
If the underlying machine
preserves control-dependency ordering for writes,
this optimization is perfectly legal.
If the underlying machine does not preserve control-dependency ordering,
then either this optimization must be avoided,
a memory fence must be emitted after the load of x,
or an artificial data dependency must be manufactured.
An example artificial data dependency might be as follows:
if (r1 = x.load(memory_order_consume)) {
	...
	y = 3;
} else {
	...
	y = 0;
}
y = y + r1 - r1;
The compiler would need to decide whether the add and subtract was better than the multiply and divide.
N2176 example code:
r1 = x.load(memory_order_consume);
if (r1)
        r2 = y.a;
else
	r2 = y.a;
This might be optimized to the following in order to break dependency trees:
r1 = x.load(memory_order_relaxed);
r2 = y.a;
This is a control dependency, so falls outside the scope of this proposal.
N2176 example code:
r1 = x.load(memory_order_consume);
if (r1)
	f(&y);
else
	g(&y);
Assuming that x is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
However, there is no data dependency
between the load and either of the function calls.
There is instead a control dependency,
which does not force ordering in this proposal.
If this example were to be modified
so that the variable r1
were passed to f() and g()
(rather than y as shown above),
then the functions would have a data dependency on the load.
N2176 example code:
r2 = x.load(memory_order_consume);
r3 = r2->a;
There might be at temptation to optimize the code as follows:
r2 = x.load(memory_order_consume);
r3 = r1->a;
if (r1 != r2) r3 = r2->a;
However, assuming that x is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r2,
so the dependency is ordered and the optimization prohibited,
at least in absence of a compensating fence
or artificially generated data dependency.
N2176 example code:
r1 = x.load(memory_order_consume);
r2 = a[r1->index % a_size];
If the variable a_size
is known to the compiler to have the value one,
then there might be a temptation to optimize as follows:
r1 = x.load(memory_order_consume);
r2 = a[0];
However, again assuming that x is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1,
so the dependency is ordered.
Therefore, this optimization is prohibited
unless accompanied by a compensating memory barrier
or artificial data dependency.
In some cases, dependency ordering is important only for some fields of a structure. For example:
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[r2]);
Indexing a[] with an uninitialized field could be fatal,
but once the corresponding array element has been fetched,
we might not care about subsequent dependencies.
The std::kill_dependency primitive
enables the programmer to tell the compiler
that specific dependencies may be broken,
for example, as follows:
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);
This allows the compiler to reorder the call to do_something_with,
for example, by performing speculative optimizations
that predict the value of a[r2].
There are several implementation strategies. The first strategy is acceptable on all machines and compilers. The subsequent strategies are appropriate to subsets thereof.
This proposal is expected to have minimal effect on strongly ordered machines (e.g., x86) and on weakly ordered machines that do not support data-dependency ordering (e.g., Alpha). The major burden of this proposal would fall on weakly ordered machines and their compilers that reorder data-dependent operations, such as ARM and PowerPC (and possibly also Itanium). Even for these architectures, a fully conforming compiler could use the same approach as weakly ordered machines that do not support data-dependency ordering, albeit at a performance penalty.
Simply promoting all memory_order_consume operations
to memory_order_acquire
will meet the requirements of this proposal.
For weakly ordered machines without data-dependency ordering, this implementation is also necessary. For other machines, it also serves as trivial first implementation.
Compilers can implement memory_order_consume loads
as regular loads,
so long as the compiler attempts no optimizations
that break data dependencies.
This strategy will be particularly useful for non-optimizing compilers.
This strategy does not apply to weakly ordered machines without data-dependency ordering, but only to strongly ordered machines or weakly ordered machines with data-dependency ordering.
For implementations on strongly ordered machines
or weakly ordered machines with data-dependency ordering,
compilers can implement memory_order_consume loads
as regular loads,
so long as the compiler tracks operations within a data-dependency tree
and avoids optimizations that break data dependencies of those operations.
Note, however, the caveat in the next subsection.
In terms of the implementation burden on compilers,
some of the compiler work to implement this strategy
is also required
to respect the existing memory_order_acquire loads.
This strategy applies primarily to weakly ordered machines with data-dependency ordering, secondarily to strongly ordered machines, and does not apply to weakly ordered machines without data-dependency ordering.
The above strategy implies that the compiler is avoiding optimizations in all functions dynamically called on a data-dependency tree. This implication is unacceptable for compilers that see only a portion of those functions.
However, the compiler does not need to see all functions; it can simply emit an acquire fence on the tree root (which is atomic) before a tree extends into a function call or out of a function return. Given such a convention, the compiler can assume that there are no optimization restrictions at the start of a function. This strategy enables fully-optimized per-function compilation, with run-time performance no worse than, and often much better than, the first strategy.
This strategy becomes more effective when performed after inlining or when considered in inter-procedural optimization.
Many uses of data-dependency operations will be in the implementation of data structures. If their (presumably non-inline) access functions must truncate the data-dependency tree on return, much of the potential performance of data-dependency ordering may be lost.
To address this performance opportunity, we propose to annotate function parameters and results to indicate that the compiler should assume that code on the other side of the function will handle depencencies correctly.
As these annotations are not essential to data-dependency ordering, they are covered in a separate proposal, N2493.
This section proposes wording changes to working draft N2461.
Edit paragraph 4 as follows:
The library defines a number of atomic operations (clause 29) and operations on locks (clause 30) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation is either a consume operation, an acquire operation,
ora release operation, or both an acquire and release operation, on one or more memory locations; the semantics of these are described below. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics, also described below. [Note: For example, a call that acquires a lock will perform an acquire operation on the locations comprising the lock. Correspondingly, a call that releases the same lock will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. We do not include "relaxed" atomic operations as synchronization operations although, like synchronization operations, they cannot contribute to data races. —end note]
Paragraph 6 (defining release sequence) is unchanged:
A release sequence on an atomic object M is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is a release, and every subsequent operation
- is performed by the same thread that performed the release, or
- is a non-relaxed atomic read-modify-write operation.
Paragraph 7 (defining synchronizes with) is unchanged:
An evaluation A that performs a release operation on an object M synchronizes with an evaluation B that performs an acquire operation on M and reads a value written by any side effect in the release sequence headed by A. [Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note] [Note: The specifications of the synchronization operations define when one reads the value written by another. For atomic variables, the definition is clear. All operations on a given lock occur in a single total order. Each lock acquisition "reads the value written" by the last lock release. —end note]
After paragraph 7, add the following paragraphs.
An evaluation A carries a dependency to an evaluation B if
- the value of A is used as an operand of B, and:
or
- B is not an invocation of any specialization of
std::kill_dependency, and- A is not the left operand to the comma (',') operator,
- A writes a scalar object or bit-field M, B reads the value written by A from M, and A is sequenced before B, or
- for some evaluation X, A carries a dependency to X, and X carries a dependency to B.
[Note: 'Carries a dependency to' is a subset of 'is sequenced before', and is similarly strictly intra-thread. —end note]
An evaluation A is dependency-ordered before an evaluation B if,
- A performs a release operation on an atomic object M, B performs a consume operation on M, and B reads a value written by any side effect in the release sequence headed by A, or
- for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.
[Note: The relation 'is dependency-ordered before' is analogous to 'synchronizes with', but uses release/consume in place of release/acquire. —end note]
An evaluation A inter-thread happens before an evaluation B if,
- A synchronizes with B, or
- A is dependency-ordered before B, or
- for some evaluation X,
- A synchronizes with X and X is sequenced before B, or
- A is sequenced before X and X inter-thread happens before B, or
- A inter-thread happens before X and X inter-thread happens before B.
[Note: The 'inter-thread happens before' relation describes arbitrary concatenations of 'sequenced before', 'synchronizes with' and 'dependency-ordered before' relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with 'dependency-ordered before' followed by 'sequenced before'. The reason for this limitation is that a consume operation participating in a 'dependency-ordered before' relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of 'sequenced before'. The reasons for this limitation are (1) to permit 'inter-thread happens before' to be transitively closed and (2) the 'happens before' relation, defined below, provides for relationships consisting entirely of 'sequenced before'. —end note]
Edit paragraph 8 as follows:
An evaluation A happens before an evaluation B if:
- A is sequenced before B, or
A synchronizes with B, or
for some evaluation X,A inter-thread happens beforeX and X happens beforeB.
Edit the atomic library synopsis, as follows:
// 29.1, order and consistency
enum memory_order; template< typename T > T kill_dependency( T y );
Edit the enum memory_order synopsis as follows:
typedef enum memory_order { memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst } memory_order;
Edit the memory_order effects as follows:
Element Meaning memory_order_relaxedthe operation does not order memory memory_order_releasethe operation performs a release operation on the affected memory location, thus making regular memory writes visible to other threads through the atomic variable to which it is applied memory_order_acquirethe operation performs an acquire operation on the affected memory location, thus making regular memory writes in other threads released through the atomic variable to which isit is applied visible to the current threadmemory_order_consumethe operation performs a consume operation on the affected memory location, thus making regular memory writes in other threads released through the atomic variable to which it is applied visible to the regular memory reads that are dependencies of this consume operation. memory_order_acq_relthe operation has both acquire and release semantics memory_order_seq_cstthe operation has both acquire and release semantics w, and, in addition, has sequentially-consistent operation ordering
After paragraph 6, append as follows:
template< typename T >
T kill_dependency( T y );Effects: The argument does not carry a dependency to the return value ([intro.multithread]).
Returns:
y
Edit paragraph 5 (referring to test and set operations) as follows:
Effects: Atomically sets the value pointed to by
objector bythistotrue. Memory is affected according to the value oforder.These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.These operations are atomic read-modify-write operations (1.10 [intro.multithread]).
Edit paragraph 10 (referring to fence operations) as follows:
Effects: Memory is affected according to the value of
order.These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.These operations are atomic read-modify-write operations (1.10 [intro.multithread]).
Edit paragraph 5 (referring to store operations) as follows:
Requires: The
orderargument shall not bememory_order_consume,memory_order_acquire, normemory_order_acq_rel.
Edit paragraph 12 (referring to swap operations) as follows:
Effects: Atomically replaces the value pointed to by
objector bythiswithdesired. Memory is affected according to the value oforder.These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.These operations are atomic read-modify-write operations (1.10 [intro.multithread]).
Edit paragraph 15 (referring to compare and swap operations) as follows:
Effects: Atomically, compares the value pointed to by
objector bythisfor equality with that inexpected, and iftrue, replaces the value pointed to byobjector bythiswithdesired, and iffalse, updates the value inexpectedwith the value pointed to byobjector bythis. Further, if the comparison istrue, memory is affected according to the value ofsuccess, and if the comparison isfalse, memory is affected according to the value offailure. When only onememory_orderargument is supplied, the value ofsuccessisorder, and the value offailureisorderexcept that a value ofmemory_order_acq_relshall be replaced by the valuememory_order_requireand a value ofmemory_order_releaseshall be replaced by the valuememory_order_relaxed.These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.These operations are atomic read-modify-write operations (1.10 [intro.multithread]).
Edit paragraph 19 (referring to fence operations) as follows:
Requires: The
orderargument shallnotbe neithermemory_order_relaxednormemory_order_consume.
Edit paragraph 20 (referring to fence operations) as follows:
Effects: Memory is affected according to the value of
order.These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.These operations are atomic read-modify-write operations (1.10 [intro.multithread]).
Edit paragraph 22 (referring to fetch and op operations) as follows:
Effects: Atomically replaces the value pointed to by
objector bythiswith the result of the computation applied to the value pointed to byobjector bythisand the givenoperand. Memory is affected according to the value oforder.These operations are read-modify-write operations in the sense of the "synchronizes with" definition (1.10), so both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.These operations are atomic read-modify-write operations (1.10 [intro.multithread]).