Paul E. McKenney, paulmck@linux.vnet.ibm.com
Clark Nelson, clark.nelson@intel.com
Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Introduction
Problem
Prior Work
Alternatives Considered
Solution
Dependency Root
Dependency Tree
Informal Specification
Examples
Indirection
Code Removal
Control-Sensitive Indirection
Constant Propagation
Control Elimination
Control Dependence
Conditional Subexpression Elimination
Constant Results
Selective Dependency Ordering
Implementation
Promote to Acquire
Avoid Some Optimizations
Track Optimizations
Truncate Data-Dependency Trees
Annotate Functions
Proposed Wording
5.1.2.4 Multi-threaded executions and data races
7.16.1 Introduction
7.16.2 Order and Consistency
7.16.2.1 The kill_dependency
macro
7.16.3.1 The atomic_thread_fence
function
7.16.6.1 The atomic_store
generic functions
The efficiency of data structures that are read frequently and written rarely can substantially affect the scalability of some applications. Based on experience in making the Linux operating system scalable, and on more recent user-application work, we propose addenda to the C memory model and atomics library for inter-thread data-dependency ordering. This proposal is based on a similar proposal for C++ (WG21).
This proposal admits a trivial implementation, limiting significant implementation investment to those compilers and platforms where that investment will be recovered.
There are two significant use cases where the current working draft (WG21 N2461) does not support scalability near that possible on some existing hardware.
In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering, which includes ARM, PowerPC, and embedded MIPS. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.
On other hardware architectures, including x86, SPARC TSO, and the IBM mainframe, use of inter-thread dependency ordering can permit the compiler to undertake code-movement and value-speculation optimizations that would prohibited by use of alternatives such as load-acquire.
A simplified example use of inter-thread data-dependency ordering, found within the Linux kernel, looks something like the following:
struct foo {
int a;
struct foo *next;
};
struct foo *head;
int b; /* equivalent to atomic_int with relaxed access on Linux platforms. */
void insert(int a) {
struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL); /* cannot fail */
spin_lock(&mylock);
p->a = 1;
p->next = head;
rcu_assign_pointer(head, p); /* Can be thought of as a store-release. */
spin_unlock(&mylock);
}
int getfirstval(void) { /* requires head non-NULL */
struct foo *q = rcu_dereference(head); /* see discussion below. */
int retval = q->a + b;
return retval;
}
The effect of getfirstval
is to return the value at the head of the list
with little more (or even no more) overhead
than would be required if the list were immutable,
but while still allowing updates.
The rcu_dereference()
API used in getfirstval()
can be fully implemented in different ways,
optimized for different classes of machines:
rcu_dereference()
simply prevents the compiler from performing optimizations
that would order operations
with data dependencies on q
before the load from head
.
In this case,
the code relies on the strong ordering
to prevent the assignment to retval
from seeing the pre-initialized version of the a
field
because the store to a
must precede the store to head->next
.
rcu_dereference()
again prevents the compiler from performing optimizations
that would order operations with data dependencies on q
before the load from head
.
However, in this case,
the code
relies on the the machine's enforcement of data-dependency ordering
to prevent the assignment to retval
from seeing the pre-initialized version of the a
field,
because q->a
depends on q
.
rcu_dereference()
is promoted to a load-acquire operation.
Because the acquire prevents all subsequent memory references
from being reordered with the load from head
,
it must prevent any subsequent operations depending on h
from being reordered with the load from head
.
The current working document
(WG14 N1401)
would require that these machines
implement rcu_dereference()
using either an acquire fence or
a load-acquire.
In both cases, this prohibits useful classes of compiler optimizations
that involve code motion that does not break dependencies on the
load from head
.
In the above example, the compiler would be needlessly prohibited from
reordering the fetch of b to precede the
rcu_dereference()
.
Worse yet, this requires emitting expensive memory fences for
the second class of machines, which can result in unacceptable performance
degradation.
Current practice in the Linux kernel uses a combination of gcc extensions
and volatile casts for rcu_dereference()
as shown below:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define rcu_dereference(p) \
({ \
typeof(p) _________p1 = ACCESS_ONCE(p); \
smp_read_barrier_depends(); \
(_________p1); \
})
The “({})” notation allows C-language statements to be used in
contexts expecting an expression.
ACCESS_ONCE()
undertakes the volatile cast, using
typeof()
to preserve type checking.
rcu_dereference()
uses ACCESS_ONCE()
for the volatile cast, which prevents merging or redoing loads.
The smp_read_barrier_depends()
function is a no-op on
all CPUs other than DEC Alpha, where it supplies the required memory
fence.
The final “(_________p1);
” returns the value of
the pointer.
Optimizations that might break dependency chains must be suppressed.
A user-level implementation
of RCU is now available under the LGPL licence.
Although this approach has worked for some decades, it would be preferable to have formal support for this usage pattern.
More elaborate examples are described in a presentation at the Oxford 2007 WG21 meeting, describing use cases from the Linux kernel. These uses cases begin on slide 37 and include traversal of multiple levels of pointers, indexing arrays, and casts.
WG21 N2171 and WG21 N2176 are the basis for the current memory model. These proposals support a wide range of memory-ordering use cases, but do not support dependency ordering.
WG21 N2153 by Silvera, Wong, McKenney, and Blainey was the first proposal to explicitly address weakly ordered architectures and the issues surrounding dependency ordering. It was succeded by WG21 N2237. These papers also presented a number of use cases motivating non-SC memory ordering, including dependency ordering.
WG21 N2195
by Peter Dimov
proposes an atomic_load_address()
template function that protects a single level of indirection.
Although this suffices for the very simple example above, it does
not handle other examples given in a
presentation at the WG21 Oxford 2007 meeting
describing use cases from the Linux kernel (beginning on slide 37).
In particular,
WG21 N2195,
does not support data dependencies that traverse
multiple levels of indirection nor that traverse array accesses.
An alternative proposal in WG21 N2195, introduces the notion of dynamic dependencies. Use of dynamic dependencies would permit the data-dependency trees to be scanned after performing those optimizations that do not break dynamic data-dependency trees. However, this proposal was rejected due to software-engineering concerns, which loom especially large in cases where the compiler is able to perform optimizations that the programmer cannot anticipate. For example, the programmer might be forgiven for assuming that an argument to a given function was variable, but a compiler doing inter-procedural analysis might discover that it was in fact constant, or, worse yet, zero. The compiler is therefore required to propagate dependency trees regardless of optimization.
WG21 N2492, by Paul E. McKenney, Hans-J. Boehm, and Lawrence Crowl, and WG21 N2493, and by Paul E. McKenney and Lawrence Crowl, present an approach combining elaborations to the memory model, atomics API, and annotations. Finally, WG21 N2556 by Paul E. McKenney, Hans-J. Boehm, and Lawrence Crowl incorporated feedback from the Bellevue meeting. This document further includes feedback from the Core Working Group at the Sophia Antipolis meeting.
Although control dependencies are extremely intuitive, there are comparatively few known control-dependency use cases, and ARM CPUs only partially support control dependencies. Furthermore, some of the more troublesome optimization issues with switch statements involve control rather than data dependencies. Therefore, there is no support for control dependencies. If experience indicates that control dependencies also need to be enforced, a separate proposal will be put forward for them.
Prohibiting dependency-breaking optimizations would remove the need for annotations. This faced severe resistance, as a number of people felt that this would prohibit valuable optimizations. Therefore, this proposal requires annotations for function arguments and return values through which data dependencies are required to flow. As inter-compilation-unit analysis becomes more common, it is hoped that tools will appear that check annotations or perhaps even produce them automatically. However, individual implementations are free to avoid the dependency issue entirely by simply refraining from breaking data dependencies, or by emitting compensating memory fences when breaking data dependencies. (Full disclosure: this was in fact the original proposal.)
Simply relying on acquire fences would remove the need for dependency ordering. Although this is a reasonable strategy for many machines, it is inappropriate for weakly ordered machines that support data-dependency ordering.
We propose explicit program support for inter-thread data-dependency ordering. Programmers will explicitly mark the root of tree of data-dependent operations, and implementations will respect that ordering.
To mark the root of a inter-thread data-dependency tree,
programmers will use new variant of the atomic load defined in
WG21 N2427.
Specifically,
this proposal augments
WG21 N2427
by adding a memory_order_consume
option that may be supplied
to operations for which data-dependency semantics are permitted.
The memory_order
enumeration in
WG21 N2427
would then read as follows, keeping the enumeration in rough order
of increasing memory-ordering strength:
typedef enum memory_order {
memory_order_relaxed, memory_order_consume, memory_order_acquire,
memory_order_release, memory_order_acq_rel, memory_order_seq_cst
} memory_order;
Given the load value as the root of a data-dependency tree, the tree is loosely defined as any operation or function (within the same thread) that has a data-dependent argument within the tree or reads a variable stored by a data-dependent assignment within the tree.
Note that it is possible for a given value to be part of multiple dependency trees. One way that this might happen would be to add a value in one dependency tree to another value in a different dependency tree. The sum would then be in both dependency trees.
The compiler must preserve the dependency tree through all optimizations. In particular, if the compiler is able to optimize a member of a dependency tree to a constant, then the compiler must either produce code that preserves the dependency tree or emit a memory fence appropriate to the target architecture.
A data-dependency tree ends by the death of values within the tree. Since trees extend into called functions and out through return values, these trees may extend until the end of program execution. The section on implementation describes strategies for dealing this unbounded extent in the normal compilation process.
When normal compilation of an unbounded extent
proves too inefficient,
the programmer may explicitly prune a data-dependency tree
by passing a value through the identity function
std::kill_dependency
.
The result is, by definition, not inter-thread data-dependent on the argument,
even though the values are identical.
Informally, we define inter-thread data-dependency ordering in terms of
The full details are within the formal wording.
In WG21 N2176, Hans Boehm lists a number of example optimizations that can break dependency trees, which are discussed in the following subsections.
WG21 N2176 example code:
r1 = x.load(memory_order_relaxed);
r2 = *r1;
Recoding to this proposal's API:
r1 = x.load(memory_order_consume);
r2 = *r1;
Assuming that x
is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1,
so the dependency is ordered.
WG21 N2176 example code:
r1 = x.load(memory_order_relaxed);
r3 = &a + r1 - r1;
r2 = *r3;
This could legitimately be optimized to the following, breaking the dependency tree:
r1 = x.load(memory_order_relaxed);
r3 = &a;
r2 = *r3;
However, recoding to this proposal's API:
r1 = x.load(memory_order_consume);
r3 = &a + r1 - r1;
r2 = *r3;
Again assuming that x
is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1,
so the dependency is ordered.
Because the dependency trees must be traced prior to optimization,
if the optimization is performed,
a countervailing memory fence or artificial data dependency must be inserted.
WG21 N2176 example code, recoding to this proposal's API:
r1 = x.load(memory_order_consume);
if (r1 == 0)
r2 = *r1;
else
r2 = *(r1 + 1);
Assuming that x
is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1,
so the dependency is ordered.
WG21 N2176
example code, as modified during email discussions,
where x
is known to be either 0 or 1:
if (x.load(memory_order_consume))
...
else
...
y = 42 * x / 13;
This might be optimized to the following:
if (x.load(memory_order_consume)) {
...
y = 3;
} else {
...
y = 0;
}
assuming that x
is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the assignment to y
,
so the dependency is ordered.
If the underlying machine
preserves control-dependency ordering for writes,
this optimization is perfectly legal.
If the underlying machine does not preserve control-dependency ordering,
then either this optimization must be avoided,
a memory fence must be emitted after the load of x
,
or an artificial data dependency must be manufactured.
An example artificial data dependency might be as follows:
if (r1 = x.load(memory_order_consume)) {
...
y = 3;
} else {
...
y = 0;
}
y = y + r1 - r1;
The compiler would need to decide whether the add and subtract was better than the multiply and divide.
WG21 N2176 example code:
r1 = x.load(memory_order_consume);
if (r1)
r2 = y.a;
else
r2 = y.a;
This might be optimized to the following in order to break dependency trees:
r1 = x.load(memory_order_relaxed);
r2 = y.a;
This is a control dependency, so falls outside the scope of this proposal.
WG21 N2176 example code:
r1 = x.load(memory_order_consume);
if (r1)
f(&y);
else
g(&y);
Assuming that x
is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
However, there is no data dependency
between the load and either of the function calls.
There is instead a control dependency,
which does not force ordering in this proposal.
If this example were to be modified
so that the variable r1
were passed to f()
and g()
(rather than y
as shown above),
then the functions would have a data dependency on the load.
WG21 N2176 example code:
r2 = x.load(memory_order_consume);
r3 = r2->a;
There might be at temptation to optimize the code as follows:
r2 = x.load(memory_order_consume);
r3 = r1->a;
if (r1 != r2) r3 = r2->a;
However, assuming that x
is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r2
,
so the dependency is ordered and the optimization prohibited,
at least in absence of a compensating fence
or artificially generated data dependency.
WG21 N2176 example code:
r1 = x.load(memory_order_consume);
r2 = a[r1->index % a_size];
If the variable a_size
is known to the compiler to have the value one,
then there might be a temptation to optimize as follows:
r1 = x.load(memory_order_consume);
r2 = a[0];
However, again assuming that x
is an atomic,
the x.load(memory_order_consume)
will form the root of a dependency tree.
The dependency tree extends to the indirection through r1
,
so the dependency is ordered.
Therefore, this optimization is prohibited
unless accompanied by a compensating memory fence
or artificial data dependency.
In some cases, dependency ordering is important only for some fields of a structure. For example:
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[r2]);
Indexing a[]
with an uninitialized field could be fatal,
but once the corresponding array element has been fetched,
we might not care about subsequent dependencies.
The std::kill_dependency
primitive
enables the programmer to tell the compiler
that specific dependencies may be broken,
for example, as follows:
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);
This allows the compiler to reorder the call to do_something_with
,
for example, by performing speculative optimizations
that predict the value of a[r2]
.
There are several implementation strategies. The first strategy is acceptable on all machines and compilers. The subsequent strategies are appropriate to subsets thereof.
This proposal is expected to have minimal effect on strongly ordered machines (e.g., x86) and on weakly ordered machines that do not support data-dependency ordering (e.g., Alpha). The major burden of this proposal would fall on weakly ordered machines and their compilers that reorder data-dependent operations, such as MIPS, ARM, and PowerPC. Even for these architectures, a fully conforming compiler could use the same approach as weakly ordered machines that do not support data-dependency ordering, albeit at a performance penalty.
Simply promoting all memory_order_consume
operations
to memory_order_acquire
will meet the requirements of this proposal.
For weakly ordered machines without data-dependency ordering, this implementation is also necessary. For other machines, it also serves as trivial first implementation.
Compilers can implement memory_order_consume
loads
as regular loads,
so long as the compiler attempts no optimizations
that break data dependencies.
This strategy will be particularly useful for non-optimizing compilers.
This strategy does not apply to weakly ordered machines without data-dependency ordering, but only to strongly ordered machines or weakly ordered machines with data-dependency ordering.
For implementations on strongly ordered machines
or weakly ordered machines with data-dependency ordering,
compilers can implement memory_order_consume
loads
as regular loads,
so long as the compiler tracks operations within a data-dependency tree
and avoids optimizations that break data dependencies of those operations.
Note, however, the caveat in the next subsection.
In terms of the implementation burden on compilers,
some of the compiler work to implement this strategy
is also required
to respect the existing memory_order_acquire
loads.
This strategy applies primarily to weakly ordered machines with data-dependency ordering, secondarily to strongly ordered machines, and does not apply to weakly ordered machines without data-dependency ordering.
The above strategy implies that the compiler is avoiding optimizations in all functions dynamically called on a data-dependency tree. This implication is unacceptable for compilers that see only a portion of those functions.
However, the compiler does not need to see all functions; it can simply emit an acquire fence on the tree root (which is atomic) before a tree extends into a function call or out of a function return. Given such a convention, the compiler can assume that there are no optimization restrictions at the start of a function. This strategy enables fully-optimized per-function compilation, with run-time performance no worse than, and often much better than, the first strategy.
This strategy becomes more effective when performed after inlining or when considered in inter-procedural optimization.
Many uses of data-dependency operations will be in the implementation of data structures. If their (presumably non-inline) access functions must truncate the data-dependency tree on return, much of the potential performance of data-dependency ordering may be lost.
To address this performance opportunity, we propose to annotate function parameters and results to indicate that the compiler should assume that code on the other side of the function will handle depencencies correctly.
As these annotations are not essential to data-dependency ordering, they are covered in a separate proposal, WG21 N2493.
This section proposes wording changes to working draft N1401.
Edit paragraph 5 as follows:
The library defines a number of atomic operations (7.16) and operations on
locksmutexes (lib - ???7.24) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either an acquire operation or a release operation,orboth an acquire and release operation, or a consume operation. A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics.
Editing note: Changing "lock" to "mutex" is an editorial change of terminology which should happen globally through 5.1.2.4, for consistency with the terminology used in 7.24. It is indicated herein, only in paragraphs that need to be changed for this technical proposal, solely as an aid to the reader.
Edit paragraph 6 as follows:
NOTE 2 For example, a call that acquires a
lockmutex will perform an acquire operation on the locations composing thelockmutex. Correspondingly, a call that releases the samelockmutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform an acquire or consume operation on A. We do not include relaxed atomic operations as synchronization operations although, like synchronization operations, they cannot contribute to data races.
Before paragraph 14, add the following paragraphs:
14 An evaluation A carries a dependency to an evaluation B if
- the value of A is used as an operand of B, unless:
- B is an invocation of the
kill_dependency
macro, or- A is the left operand of a logical AND (
&&
, see 6.5.13) or logical OR (||
, see 6.5.14) operator, or- A is the left operand of a conditional (
?:
, see 6.5.15) operator, or- A is the left operand of a comma (
,
) operator (6.5.17);or
- A writes a scalar object or bit-field M, B reads the value written by A from M, and A is sequenced before B, or
- for some evaluation X, A carries a dependency to X, and X carries a dependency to B.
15 NOTE The "carries a dependency" relation is a subset of the "sequenced before" relation, and is similarly strictly intra-thread.
16 An evaluation A is dependency-ordered before an evaluation B if
- A performs a release operation on an atomic object M, and B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or
- for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.
17 NOTE The "dependency-ordered before" relation is analogous to the "synchronizes with" relation, but uses release/consume in place of release/acquire.
Edit the existing paragraph 14 (which should be renumbered to paragraph 18) as follows:
1418 An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, or A is dependency-ordered before B, or for some evaluation X:
- A synchronizes with X and X is sequenced before B, or
- A is sequenced before X and X inter-thread happens before B, or
- A inter-thread happens before X and X inter-thread happens before B.
Editing note: In the C++ WD, this paragraph has two levels of bullets. The bullets that appear in the C WD are at the second level, and the conditions expressed in the lead-in are at the first level.
Immediately after the previous paragraph (formerly numbered 14), add a new paragraph (which should be numbered 19):
19 NOTE The "inter-thread happens before" relation describes arbitrary concatenations of "sequenced before", "synchronizes with" and "dependency-ordered before" relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with "dependency-ordered before" followed by "sequenced before". The reason for this limitation is that a consume operation participating in a "dependency-ordered before" relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of "sequenced before". The reasons for this limitation are (1) to permit "inter-thread happens before" to be transitively closed and (2) the "happens before" relation, defined below, provides for relationships consisting entirely of "sequenced before".
Edit paragraph 2 as follows:
The macros defined are
ATOMIC_INTEGRAL_LOCK_FREE ATOMIC_ADDRESS_LOCK_FREEwhich indicate the general lock-free property of integer and address atomic types;
andATOMIC_FLAG_INITwhich expands to an initializer for an object of type atomic_flag
.; andkill_dependencywhich terminates a dependency chain.
Edit paragraph 1 as follows:
The enumerated type
memory_order
specifies the detailed regular (non-atomic) memory synchronization operations as defined in 5.1.2.4 and may provide for operation ordering. Its enumeration constants are as follows:memory_order_relaxed memory_order_consume memory_order_acquire memory_order_release memory_order_acq_rel memory_order_seq_cst
After paragraph 4, insert a new paragraph:
For
memory_order_consume
, a load operation performs a consume operation on the affected memory location.
kill_dependency
macroInsert a new section with the preceding title and the following content:
Synopsis
1 #include <stdatomic.h> type kill_dependency(type y);Description
2 The argument does not carry a dependency to the return value (5.1.2.4).
Returns
3 The value of
y
.
atomic_thread_fence
functionEdit the second bullet of paragraph 2 as follows:
- is an acquire fence, if
order == memory_order_acquire || order == memory_order_consume
;
atomic_store
generic functionsEdit paragraph 2 as follows:
The order argument shall not be
memory_order_consume
,memory_order_acquire
, normemory_order_acq_rel
. Atomically replace the value pointed to byobject
with the value ofdesired
. Memory is affected according to the value oforder
.