Doc. No.: | WG14/N1411 |
---|---|
Date: | 2009-09-30 |
Reply to: | Hans-J. Boehm |
Phone: | +1-650-857-3406 |
Email: | Hans.Boehm@hp.com |
This clearly requires some further work, especially since some details of the memory model are still under discussion. This preliminary version will hopefully aid such discussion.
There is clearly much more that could be said about this topic. I've attempted to limit the discussion to a brief overview of the points that are likely to be considered both important and controversial. It's unclear to me that I hit exactly the right set. Comments are appreciated.
This relies heavily on prior joint work with many others, including Sarita Adve, Lawrence Crowl, Paul McKenney, Clark Nelson, Herb Sutter, and others.
Unfortunately, sequential consistency is expensive to guarantee, especially in the presence of imperfect compiler analysis. Both compilers and hardware obtain improved performance by reordering operations in ways that become visible to multithreaded programs.
Perhaps even more troubling is that under pure sequential consistency the meaning of a program becomes dependent on the granularity of the individual steps. A program does not have the same meaning if memory operations are performed a byte at a time as it does if they are performed a word at a time. In the first case, the interleaving is performed at a finer granularity; if a thread changes the value of x from 0 to 100000, another thread may observe a "half assigned" value of 34464 (100000 is 0x186A0, and 34464 is 0x86A0, its low-order 16 bits, as seen after only the two low-order bytes have been written).
In order to avoid both implementation cost and granularity issues, many prior approaches (e.g. Ada 83, Posix Threads, Java, and foundational work by Adve and Hill) approximate a model in which sequential consistency is promised only if there are no "data races", i.e. no memory location is ever concurrently accessed by two or more threads where at least one of the accesses modifies it. Data races are defined in terms of sequentially consistent executions.
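For illustration (the fragment and names are ours, not from the committee papers), the following contains a data race under this definition:

```c
int data;
int flag;   /* ordinary ints: neither is declared atomic */

/* Thread 1 */
void producer(void) {
    data = 42;
    flag = 1;        /* races with thread 2's reads of flag below */
}

/* Thread 2 */
void consumer(void) {
    while (!flag)    /* this read and the write of flag can be adjacent
                        in some interleaving: a data race */
        ;
    (void)data;      /* given the race, no outcome is guaranteed */
}
```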
Data-race-free programs can observe neither critical reordering optimizations nor differences in access granularity. Any thread that tried to observe such differences would introduce a data race.
C, like C++, follows a variant of this approach, in which, in addition, atomic operations with explicit memory ordering specifications are provided; these allow expert programmers to relax the sequential consistency guarantee where performance requires it.
Unfortunately, the need for explicitly ordered atomics also significantly complicates the description of shared object semantics (the "concurrency memory model"). The specification has to cover the issues arising from such atomics, and the simple "sequential consistency for data-race-free programs" rule becomes a non-obvious theorem that holds in the absence of explicitly ordered atomics, and is only alluded to in a note. (See Boehm and Adve, "Foundations of the C++ Concurrency Memory Model", PLDI 2008, for more details.) Nonetheless, we expect programmers to program to this data-race-free (DRF) model nearly all the time, and only rarely rely on the more intricate specification in the standard.
Since the specification must handle explicitly ordered atomics, it defines executions and data races differently from the informal description we gave here. In the presence of such non-sequentially-consistent behavior, we can no longer describe program executions as interleavings of thread executions. Instead we describe when a set of thread executions, taken together, describes a valid program execution. Based on which modifications are observed by which accesses, we get a set of ordering constraints that give rise to a "happens before" relation, relating actions that must appear to happen in a specific order. A set of thread executions is valid only if the values produced by accesses are actually consistent with this relation.
A happens-before relationship is introduced because either two actions are performed by the same thread in a well-defined order ("sequenced before"), or a thread T2 observes a synchronization action performed by another thread T1 ("synchronizes with"), and hence should also see all prior memory operations performed by T1. For example, suppose x is declared as int, y is declared as atomic_int, and r1 and r2 are local int variables, all initially zero:
Thread 1 | Thread 2 |
---|---|
x = 1; atomic_store(&y, 1); | r1 = atomic_load(&y); if (r1) r2 = x; |

Here, if r1 is nonzero, thread 2 has observed thread 1's store to y; that store synchronizes with the load, so the assignment x = 1 happens before the read of x, r2 must be 1, and the accesses to x do not constitute a data race.
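For concreteness, a complete rendering of this example, as a sketch assuming the C1x <stdatomic.h> and <threads.h> interfaces (the thread-creation scaffolding is incidental):

```c
#include <stdatomic.h>
#include <threads.h>

int x;              /* plain int */
atomic_int y;       /* atomic, zero-initialized */
int r1, r2;         /* each written by only one thread */

int thread1(void *arg) {
    (void)arg;
    x = 1;
    atomic_store(&y, 1);      /* defaults to memory_order_seq_cst */
    return 0;
}

int thread2(void *arg) {
    (void)arg;
    r1 = atomic_load(&y);     /* defaults to memory_order_seq_cst */
    if (r1)
        r2 = x;               /* race-free: x = 1 happens before this read */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, thread1, NULL);
    thrd_create(&t2, thread2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;                 /* here r1 != 0 implies r2 == 1 */
}
```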
The same argument continues to apply if the weaker memory_order_acquire and memory_order_release orderings are specified:
Thread 1 | Thread 2 |
---|---|
x = 1; atomic_store_explicit(&y, 1, memory_order_release); | r1 = atomic_load_explicit(&y, memory_order_acquire); if (r1) r2 = x; |
There is a subtle issue related to when a thread observes a synchronization operation from another thread. What if, instead of the value of y written by thread 1, thread 2 read a later value? Requiring the "synchronizes with" relationship to be maintained in all such cases introduces implementation issues on some platforms. But allowing a third thread to invalidate the relationship by atomically adding zero to y in the middle, even with memory_order_relaxed, seemed surprising, unintuitive, and unnecessary. The notion of a release sequence represents a compromise between these two positions: it does not create implementation difficulties, but it disallows the most unintuitive behaviors.
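As a sketch of the effect of release sequences (the scenario is ours, reusing the variables from the earlier example), a relaxed read-modify-write of y by a third thread remains part of thread 1's release sequence, so it does not invalidate the synchronization between threads 1 and 2:

```c
#include <stdatomic.h>

int data;
atomic_int y;   /* initially 0 */

/* Thread 1 */
void thread1(void) {
    data = 42;
    atomic_store_explicit(&y, 1, memory_order_release);  /* heads a release
                                                            sequence on y */
}

/* Thread 3 */
void thread3(void) {
    /* A relaxed read-modify-write of y by a third thread is part of
       thread 1's release sequence. */
    atomic_fetch_add_explicit(&y, 0, memory_order_relaxed);
}

/* Thread 2 */
void thread2(void) {
    if (atomic_load_explicit(&y, memory_order_acquire) != 0)
        /* Whether this load observed thread 1's store or thread 3's
           read-modify-write, it synchronizes with thread 1's store,
           so reading data here is race-free and yields 42. */
        (void)data;
}
```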
Explicitly ordered atomics still often allow surprising results when multiple objects are involved. For example, if x and y are atomic_ints which are initially zero, and r1 and r2 are unshared int variables,
Thread 1 | Thread 2 |
---|---|
atomic_store_explicit(&x, 1, memory_order_release); r1 = atomic_load_explicit(&y, memory_order_acquire); | atomic_store_explicit(&y, 1, memory_order_release); r2 = atomic_load_explicit(&x, memory_order_acquire); |

then both r1 and r2 may be zero after both threads complete, an outcome impossible under any sequentially consistent interleaving: nothing prevents each acquire load from effectively being reordered before the release store to the other variable in the same thread.
However, it was concluded that this kind of counterintuitive result should not be possible if only a single variable is involved, even if memory_order_relaxed is specified. For example, if an atomic counter x is periodically incremented with atomic_fetch_add_explicit(&x, 1, memory_order_relaxed) and repeatedly read in a different thread with atomic_load_explicit(&x, memory_order_relaxed), we still want to guarantee that the loads see non-decreasing values of x. Most hardware platforms provide such guarantees with respect to individual variables, even in the absence of memory fences. Thus the additional cost is usually limited to inhibited compiler optimizations. It was felt that programmers generally expect this kind of single-variable ordering (often referred to as cache coherence), and enforcing it unconditionally is well worth the small implementation cost.
As a result, modification orders were introduced to require that even memory_order_relaxed operations on a single location appear to be executed in a single global order, thus enforcing the preceding requirement.
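A sketch of the resulting guarantee (the loop structure and names are illustrative):

```c
#include <stdatomic.h>
#include <assert.h>

atomic_int x;   /* initially 0 */

/* Executed periodically by one thread. */
void bump(void) {
    atomic_fetch_add_explicit(&x, 1, memory_order_relaxed);
}

/* Executed by another thread.  Successive loads may skip values, but
   the single modification order of x guarantees they never decrease. */
void watch(void) {
    int prev = 0;
    for (int i = 0; i < 1000; i++) {
        int cur = atomic_load_explicit(&x, memory_order_relaxed);
        assert(cur >= prev);   /* guaranteed by cache coherence */
        prev = cur;
    }
}
```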
Again, it is important to remember that all of these issues can be ignored by the programmer in the absence of explicit memory_order specifications. The real specification for those programs remains "sequential consistency for data-race-free programs".
The desire to provide this simple model for most programmers does impose some burdens on implementations. The more interesting ones are discussed below.

An implementation may not store to an object by reading and rewriting a larger unit of memory that also contains other objects potentially visible to other threads. For example, a store to the bit-field b in struct {char a; int b:11; int c:5; char d; } may not load the enclosing word containing a and then write back a's original value, since that might hide a concurrent update to a by another thread, even in a data-race-free program.
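To make this concrete (the thread bodies are our own illustration), the following two threads are data-race-free, so the store to s.b must not read and rewrite the bytes holding s.a:

```c
struct S { char a; int b:11; int c:5; char d; };
struct S s;

/* Thread 1: touches only s.a. */
void t1(void) { s.a = 1; }

/* Thread 2: touches only the contiguous b/c bit-field unit.  The
   implementation may read and rewrite c's bits (b and c share a memory
   location), but it may not load the enclosing word including s.a,
   merge in the new bits of b, and store the whole word back: that
   could silently undo thread 1's concurrent update of s.a. */
void t2(void) { s.b = 7; }
```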
Implementations of older versions of the C standard often overwrote adjacent data, particularly in cases like the preceding one, even in the presence of threads. Unfortunately, such implementations made it unclear when a program really contained a data race. On modern hardware, violating these restrictions rarely yields a substantial performance benefit, short of relaxing the rules so far that multi-threaded programs become nearly impossible to write.
These restrictions do not apply to single-threaded implementations. They can also be avoided by making the read and rewrite of, for example, an adjacent field indivisible, using an atomic read-modify-write instruction. This is usually impractically expensive on multiprocessors, but it may be a useful option on uniprocessors, which can use restartable atomic sequences to implement atomic read-modify-write operations. (See Bershad, Redell, and Ellis, "Fast Mutual Exclusion for Uniprocessors", ASPLOS 1992.)
Implementations also may not introduce "speculative" stores, i.e. stores to potentially shared objects that the source program would not have performed. Although speculative stores can be a profitable optimization, and rarely cause problems in practice, we know of no reasonable methodology for ensuring the correctness of multi-threaded programs without restricting them. The optimization can invalidate programs in surprising ways, since it may apply after other transformations, such as function inlining.
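A sketch of the kind of rewrite this disallows (the code is our illustration; assume another thread may write x exactly when cond is false):

```c
int x;   /* potentially shared with other threads */

/* Source program: x is stored to only when cond is true. */
void f(int cond) {
    if (cond)
        x = 1;
}

/* Disallowed rewrite: x is written even when cond is false.  This is
   equivalent in a single-threaded program, but the unconditional
   accesses to x can race with (and clobber) another thread's write to
   x, even though the original program never touched x in that case. */
void f_speculative(int cond) {
    int tmp = x;
    x = 1;
    if (!cond)
        x = tmp;
}
```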
Finally, atomic operations without explicit memory_order specifications must appear sequentially consistent even when several variables are involved. For example, if x and y are atomic_ints initially zero, and r1 through r4 are local variables:

Thread 1 | Thread 2 | Thread 3 | Thread 4 |
---|---|---|---|
atomic_store(&x, 1); | atomic_store(&y, 1); | r1 = atomic_load(&x); r2 = atomic_load(&y); | r3 = atomic_load(&y); r4 = atomic_load(&x); |

Here the outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 must be impossible: it would require threads 3 and 4 to observe the two independent stores in opposite orders, which cannot happen in any interleaving of the four threads.
Although much energy was invested in designing a simple and usable memory model that did not impose this requirement, those attempts were ultimately judged unsuccessful: they were too complex, and they either allowed unintuitive results for more practically important examples or failed to avoid the implementation difficulties. Thus we continue to require sequential consistency for examples such as this one. Some major architectural specifications have since been clarified to make this less expensive to implement.
The Java programming language had already imposed similar requirements, though these were apparently not fully appreciated at the time. Thus this does not impose a fundamentally new requirement on hardware.
C atomic types provide objects that can safely be concurrently accessed and updated by multiple threads. Operations appear to happen indivisibly. In the absence of explicit memory_order specifications, programs using such atomic variables continue to provide an interleaving-based memory semantics.
This functionality is often confused with, but is significantly different from, that provided by the volatile type qualifier in C. (Other languages, notably Java, use volatile to mean something much closer to C's atomic; the committee had no control over this unfortunate conflict in terminology.) In particular, C volatile variables can be structures of arbitrary size, and there is no promise that accesses to them will appear indivisible. Historically, implementations have varied in the extent to which sufficiently small volatile variables ensure ordering of memory accesses, and hence interleaving-based semantics. We are not aware of any that provide the guarantees we expect for atomic variables on modern multiprocessors.
Although there was initial sentiment in favor of strengthening volatile to include atomic thread synchronization semantics (see Boehm & Maclaren, "Should volatile acquire atomicity and thread visibility semantics"), this was found to be impractical, both due to performance implications on existing code and, even more importantly, due to the inherent conflict between existing large volatile structures that were never meant to be accessed indivisibly and the need to provide indivisible accesses for variables used for thread communication.
Previous implementations of threads with C have generally either not provided features analogous to the atomic types, or have provided them only incompletely. As a result, many systems attempted to provide their own, either to allow more efficient concurrent access to shared counters, flags, and the like, or sometimes to provide safe access from signal handlers. Aside from the signal handler case, accesses to these shared objects could instead have been protected by a mutex. However, the difference in cost between, for example, testing an atomic initialization flag and acquiring a mutex is often large enough to make the mutex option very unattractive.
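As a sketch of that cost difference (the names and the use of the C1x mutex interface are illustrative; assume init_mtx was initialized with mtx_init at startup), a double-checked initialization pays only one atomic load per call once setup is complete:

```c
#include <stdatomic.h>
#include <threads.h>

static atomic_int initialized;   /* 0 until setup completes */
static mtx_t init_mtx;           /* assume mtx_init'ed at startup */
static int resource;

void use_resource(void) {
    /* Fast path: once initialization is done, each call costs only
       a single atomic load rather than a mtx_lock/mtx_unlock pair. */
    if (!atomic_load(&initialized)) {
        mtx_lock(&init_mtx);     /* slow path: early callers only */
        if (!atomic_load(&initialized)) {
            resource = 42;       /* expensive setup elided */
            atomic_store(&initialized, 1);
        }
        mtx_unlock(&init_mtx);
    }
    /* Reading resource is race-free: the load of initialized
       synchronizes with the store above (both default to seq_cst). */
    (void)resource;
}
```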
C's (and C++'s) atomic types deviate from prior library designs primarily in that ordering constraints are attached to individual operations, rather than expressed as separate fences, and in that operations are sequentially consistent unless a weaker memory_order is explicitly specified.
Explicit memory fences can accurately express intent in some cases, and can facilitate porting of legacy code written with platform-dependent fences; they are provided for those cases. However, they tend to needlessly overconstrain ordering in many other cases. For example, consider the program segment:

    z = ...;
    atomic_store_explicit(&z_done, 1, memory_order_release);
    atomic_store_explicit(&y, ..., memory_order_relaxed);

This prevents reordering of the stores to z and z_done, but does not prevent reordering of the stores to z and y, and hence allows the store to y to proceed without waiting for the store to z to complete. This is commonly the intent for flags used for thread communication, but it is not expressible with fences. Although this does not affect most current architectures, it is important for at least one, and is likely to enable future hardware optimizations.
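For contrast, a fence-based rendering of the same fragment (a sketch using atomic_thread_fence; zv and yv stand for the elided values) necessarily orders the store to z before the store to y as well:

```c
#include <stdatomic.h>

atomic_int z_done, y;
int z;

void writer(int zv, int yv) {   /* zv, yv stand for the elided values */
    z = zv;
    /* The fence orders the store to z before BOTH later stores; the
       release store in the fragment above constrains only z_done,
       leaving the store to y free to proceed first. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&z_done, 1, memory_order_relaxed);
    atomic_store_explicit(&y, yv, memory_order_relaxed);
}
```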
Similarly, ordering constraints associated with an operation on atomic variable a can be removed if the compiler can determine that a is accessible from only one thread, e.g. due to thread inlining. Ordering constraints associated with fences are almost impossible to remove.
Although atomic types are useful primarily when the underlying hardware allows them to be implemented without locks, they may (with the exception of atomic_flag) be implemented by hashing each location to one of a collection of locks, and acquiring the corresponding lock before operating on the location. The atomic_is_lock_free() function allows client code to query whether such an implementation is in use, in which case the corresponding operations are slower and are unusable from asynchronous signal handlers. Note that this property is often not known until the program is actually loaded, since atomic operation support commonly varies between different "minor" variants of the same architecture.
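A sketch of the intended use (the diagnostic messages are ours):

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_int counter;

int main(void) {
    if (atomic_is_lock_free(&counter))
        puts("atomic_int is lock-free here: usable from signal handlers");
    else
        puts("atomic_int is emulated with locks on this platform");
    return 0;
}
```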
Implementations will have to resort to such a lock-based implementation if any of the atomic operations corresponding to a type are not supported by the hardware. (But note that compare-and-exchange on smaller objects can usually be safely emulated by compare-and-exchange on larger objects, and compare-and-exchange always suffices for a lock-free emulation of the other operations.) Hence the set of provided operations is conservatively restricted to those available across nearly all modern platforms.
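For example (a sketch; C1x spells the operation atomic_compare_exchange_weak), a fetch-and-add can be emulated with a compare-exchange loop:

```c
#include <stdatomic.h>

/* fetch-and-add emulated with a compare-exchange loop */
int emulated_fetch_add(atomic_int *obj, int delta) {
    int expected = atomic_load(obj);
    /* Retry until no other thread modified *obj between our read of
       expected and the attempted replacement; on failure, expected is
       updated to the value actually found. */
    while (!atomic_compare_exchange_weak(obj, &expected, expected + delta))
        ;
    return expected;   /* the value just before our increment took effect */
}
```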