Doc. No.: WG14/N1411
Date: 2009-09-30
Reply to: Hans-J. Boehm
Phone: +1-650-857-3406
Email: Hans.Boehm@hp.com

N1411: Memory Model Rationale

Contents

This paper contains an initial draft of a rationale for the C concurrency memory model and associated atomics library (N1349), which are very similar to the analogous C++ features. It is based in part on some C++ committee papers, notably N2176: Memory Model Rationales, N2138: A Less Formal Explanation of the Proposed C++ Concurrency Memory Model, and N2427: Atomic Types and Operations.

This clearly requires some further work, especially since some details of the memory model are still under discussion. This preliminary version will hopefully aid such discussion.

There is clearly much more that could be said about this topic. I've attempted to limit the discussion to a brief overview of the points that are likely to be considered both important and controversial. It's unclear to me that I hit exactly the right set. Comments are appreciated.

This relies heavily on prior joint work with many others, including Sarita Adve, Lawrence Crowl, Paul McKenney, Clark Nelson, Herb Sutter, and others.

5.1.2.4: Multi-threaded executions and data races

Variables shared between threads raise a number of complex issues that have often been only incompletely addressed. The simplest interpretation of a multi-threaded program is as an interleaving of the steps performed by each of the threads, in which each use of a shared memory location sees the last prior assignment to that location in the interleaving. Following Lamport, this is normally referred to as sequential consistency.

Unfortunately, sequential consistency is expensive to guarantee, especially in the presence of imperfect compiler analysis. Both compilers and hardware obtain improved performance by reordering operations in ways that become visible to multithreaded programs.

Perhaps even more troubling is that under pure sequential consistency the meaning of a program becomes dependent on the granularity of the individual steps. A program doesn't have the same meaning if memory operations are performed a byte at a time as it does if they are performed a word at a time. In the first case, the interleaving is performed at a finer granularity; if a thread changes the value of x from 0 to 100000, another thread may observe a "half assigned" value of 34464 (100000 is 0x186A0, and its low-order 16 bits are 0x86A0, i.e. 34464).

In order to avoid both implementation cost and granularity issues, many prior approaches (e.g. Ada 83, Posix Threads, Java, and foundational work by Adve and Hill) approximate a model in which sequential consistency is promised only if there are no "data races", i.e. only if no memory location is ever concurrently accessed by two or more threads where at least one of the accesses modifies it. Data races are defined in terms of sequentially consistent executions.
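For example, the following fragment contains a data race if the two threads run concurrently (a minimal sketch; x is an ordinary int, initially zero):

Thread 1:
  x = 1;

Thread 2:
  r1 = x;   /* unsynchronized read concurrent with the store: a data race */

Protecting both accesses with a mutex, or declaring x atomic, eliminates the race.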

Data-race-free programs can observe neither critical reordering optimizations nor differences in access granularity. Any thread that tried to observe such differences would introduce a data race.

C, like C++, follows a variant of this approach, in which, additionally:

  1. Programs with data races have completely undefined semantics, and are thus effectively considered erroneous. This preserves a number of existing optimization opportunities, in that compilers may continue to assume that ordinary variables are not asynchronously updated by other threads. It also avoids the largely unsolved research problem of specifying reasonable semantics for data races, and it allows and encourages implementations to detect data races.
  2. Data races are most commonly prevented using mutexes or similar synchronization mechanisms.
  3. However, it is also possible to introduce atomic objects that may be used for synchronization, i.e. that can be concurrently accessed and updated by multiple threads; they are effectively exempt from data races. Unless an explicit memory_order argument is specified for operations on these objects, it remains the implementation's responsibility to ensure that programs without data races (on non-atomic objects) exhibit sequentially consistent behavior. On many platforms, such atomic operations will be implemented using memory fence instructions that prevent hardware reordering of memory operations.
  4. Since the memory fences required to implement such atomic operations on current hardware tend to be expensive, the C library additionally provides operations on atomic objects that explicitly specify memory visibility constraints. These operations violate sequential consistency, and provide improved performance at the expense of significantly increased complexity.

The presence of atomic objects makes it easy to replace any existing uses of intentional data races. (Such uses typically don't conform to older standards either, but are nonetheless fairly common.) The presence of explicit memory ordering constraints makes it possible to precisely express the requirements on visibility to other threads.
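For example, a "done" flag that might previously have been read and written racily can simply be declared atomic. A minimal sketch using the default (sequentially consistent) operations, where compute() and use() are hypothetical helpers:

atomic_int done = ATOMIC_VAR_INIT(0);
int data;

Thread 1:
  data = compute();
  atomic_store(&done, 1);

Thread 2:
  if (atomic_load(&done))
    use(data);   /* if the flag is seen, the write to data is seen as well */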

Unfortunately, the need for explicitly ordered atomics also significantly complicates the description of shared object semantics (the "concurrency memory model"). The specification has to cover the issues arising from such atomics, and the simple "sequential consistency for data-race-free programs" becomes a non-obvious theorem that holds in the absence of explicitly ordered atomics, and is only alluded to in a note. (See Boehm, Adve, Foundations of the C++ Concurrency Memory Model, PLDI 08 for more details.) Nonetheless, we expect programmers to program to this DRF model nearly all the time, and only rarely rely on the more intricate specification in the standard.

Since the specification must handle explicitly ordered atomics, it defines executions and data races differently from the informal description we gave here. In the presence of such non-sequentially-consistent behavior, we can no longer describe program executions as interleavings of thread executions. Instead we describe when a set of thread executions, taken together, describes a valid program execution. Based on which modifications are observed by which accesses, we get a set of ordering constraints that give rise to a "happens before" relation, relating actions that must appear to happen in a specific order. A set of thread executions is valid only if the values produced by accesses are actually consistent with this relation.

A happens-before relationship is introduced either because two actions are performed by the same thread in a well-defined order (sequenced before), or because a thread T2 observes a synchronization action performed by another thread T1 (synchronizes with), and hence should also see all memory operations previously performed by T1. For example, if x is declared as int, y is declared as atomic_int, and r1 and r2 are local int variables, all initially zero, then

Thread 1:
  x = 1;
  atomic_store(&y, 1);

Thread 2:
  r1 = atomic_load(&y);
  if (r1) r2 = x;
ensures that r2 must be 1 whenever r1 is. This is true because, assuming r1 = 1, the atomic_store operation synchronizes with the atomic_load operation. Together with the sequencing relationships, this implies the assignment to x in thread 1 happens before the access to x in thread 2.
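For concreteness, the example can be written as a complete program. This is only a sketch: it assumes the <threads.h> thread interface (thrd_create, thrd_join) and the ATOMIC_VAR_INIT initializer macro from the current working draft.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int x = 0;
static atomic_int y = ATOMIC_VAR_INIT(0);

static int thread1(void *arg)
{
    (void)arg;
    x = 1;                    /* ordinary store */
    atomic_store(&y, 1);      /* synchronizes with the load in thread2 */
    return 0;
}

static int thread2(void *arg)
{
    (void)arg;
    int r1 = atomic_load(&y);
    int r2 = 0;
    if (r1)
        r2 = x;               /* must read 1 whenever r1 is 1 */
    printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}

int main(void)
{
    thrd_t t1, t2;
    thrd_create(&t1, thread1, NULL);
    thrd_create(&t2, thread2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}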

The same argument continues to apply if the weaker memory_order_acquire and memory_order_release orderings are specified:

Thread 1:
  x = 1;
  atomic_store_explicit(&y, 1, memory_order_release);

Thread 2:
  r1 = atomic_load_explicit(&y, memory_order_acquire);
  if (r1) r2 = x;
But the observation would no longer hold if the weakest memory order specification, memory_order_relaxed, were used in either position.

There is a subtle issue related to when a thread observes a synchronization operation from another thread. What if, instead of the value of y written by thread 1, thread 2 read a later value? Requiring the "synchronizes with" relationship to be maintained in all such cases introduces implementation issues on some platforms. But allowing a third thread to invalidate the relationship by atomically adding zero to y in the middle, even with memory_order_relaxed, seemed surprising, unintuitive, and unnecessary. The notion of a release sequence represents a compromise between these two positions that does not create implementation difficulties, but disallows the most unintuitive behaviors.
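As an illustration (a sketch reusing the earlier example): suppose a third thread performs a relaxed read-modify-write on y. Read-modify-write operations continue the release sequence headed by thread 1's store, so thread 1 still synchronizes with thread 2 even if thread 2's load observes the value written by thread 3:

Thread 1:
  x = 1;
  atomic_store_explicit(&y, 1, memory_order_release);

Thread 3:
  atomic_fetch_add_explicit(&y, 0, memory_order_relaxed);   /* continues the release sequence */

Thread 2:
  r1 = atomic_load_explicit(&y, memory_order_acquire);
  if (r1) r2 = x;   /* r2 must be 1 whenever r1 is nonzero */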

Explicitly ordered atomics still often allow surprising results when multiple objects are involved. For example, if x and y are atomic_ints which are initially zero, and r1 and r2 are unshared int variables,

Thread 1:
  atomic_store_explicit(&x, 1, memory_order_release);
  r1 = atomic_load_explicit(&y, memory_order_acquire);

Thread 2:
  atomic_store_explicit(&y, 1, memory_order_release);
  r2 = atomic_load_explicit(&x, memory_order_acquire);
it is perfectly possible for r1 and r2 to both be zero after execution. This reflects the fact that even in the absence of compiler optimization, this can easily happen if hardware stores are buffered and do not immediately become visible to other processor cores, as is almost always the case.

However, it was concluded that this kind of counterintuitive result should not be possible if only a single variable is involved, even if memory_order_relaxed is specified. For example, if an atomic counter x is periodically incremented with atomic_fetch_add_explicit(&x, 1, memory_order_relaxed) and repeatedly read in a different thread with atomic_load_explicit(&x, memory_order_relaxed), we still want to guarantee that the loads see non-decreasing values of x. Most hardware platforms provide such guarantees with respect to individual variables, even in the absence of memory fences. Thus the additional cost is usually limited to inhibited compiler optimizations. It was felt that programmers generally expect this kind of single-variable ordering (often referred to as cache coherence), and enforcing it unconditionally is well worth the small implementation cost.

As a result, modification orders were introduced to require that even memory_order_relaxed operations on a single location appear to be executed in a single global order, thus enforcing the preceding requirement.
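A sketch of the counter example above: even with memory_order_relaxed everywhere, the modification order of x guarantees that the reading thread never observes the counter going backwards (assert is from <assert.h>, and the infinite loops are placeholders):

atomic_int x = ATOMIC_VAR_INIT(0);

Incrementing thread:
  for (;;)
      atomic_fetch_add_explicit(&x, 1, memory_order_relaxed);

Reading thread:
  int prev = 0;
  for (;;) {
      int cur = atomic_load_explicit(&x, memory_order_relaxed);
      assert(cur >= prev);   /* guaranteed by the modification order of x */
      prev = cur;
  }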

Again, it is important to remember that all of these issues can be ignored by the programmer in the absence of explicit memory_order specifications. The real specification for those programs remains "sequential consistency for data-race-free programs".

The desire to provide this simple model for most programmers does impose some burdens on implementations. The more interesting ones are:

No adjacent field overwrites

The semantics, and particularly the definition of "memory location", severely limit the freedom of compilers supporting threads to implement a store to a small object by loading a larger piece of memory, modifying it, and then writing back the larger piece.

For example, a store to the bit-field b in struct {char a; int b:11; int c:5; char d; } may not be implemented by loading the bytes containing a and then writing back a's original value, since that might hide a concurrent update to a by another thread, even in a data-race-free program.
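To make this concrete, the following sketch is data-race-free, because a and the pair of adjacent bit-fields b and c occupy distinct memory locations; the store to s.b may therefore read and rewrite the bits of b and c together, but must not read and write back the byte holding a (or d):

struct S { char a; int b:11; int c:5; char d; } s;

Thread 1:
  s.b = 5;    /* may update the bits of b and c together */

Thread 2:
  s.a = 17;   /* distinct memory location: no data race with thread 1 */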

Implementations of older versions of the C standard often overwrote adjacent data, particularly in cases like the preceding one, even in the presence of threads. Unfortunately, such implementations made it unclear when a program really contained a data race. On modern hardware, obeying these restrictions rarely costs substantial performance; rules relaxed enough to yield a real benefit would make multi-threaded programs nearly impossible to write.

These restrictions do not apply to single-threaded implementations. They can also be avoided by making the read and rewrite of, for example, an adjacent field indivisible, using an atomic read-modify-write instruction. This is usually impractically expensive on multiprocessors, but it may be a useful option on uniprocessors, which can use restartable atomic sequences as the implementation of atomic read-modify-write operations. (See Bershad, Redell, and Ellis, "Fast Mutual Exclusion for Uniprocessors", ASPLOS 1992.)

No other spurious stores

It was also quite common for compilers to introduce stores by temporarily keeping a potentially shared variable in a register for the duration of a loop, and then unconditionally storing it back at the end of the loop, even if the loop executed for zero iterations. In the zero-iteration case, the original value would be stored back, but concurrent updates of the variable by another thread could again be hidden. This can again happen even for data-race-free programs. (See Boehm, "Threads Cannot be Implemented as a Library", PLDI 05 for more details.)
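The following sketch shows the kind of transformation involved; count is potentially shared, and process() stands in for an arbitrary predicate:

int count;           /* potentially shared */
int process(int i);  /* hypothetical predicate */

/* Source code: */
void f(int n) {
    for (int i = 0; i < n; ++i)
        if (process(i))
            ++count;
}

/* A now-disallowed compilation strategy, written as source: */
void f_transformed(int n) {
    int reg = count;   /* count is read even if the loop never runs */
    for (int i = 0; i < n; ++i)
        if (process(i))
            ++reg;
    count = reg;       /* spurious store: executes even when n == 0 */
}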

Although this can be a profitable optimization, and it rarely causes problems in practice, we know of no reasonable methodology to ensure correctness of multi-threaded programs without restricting it. This optimization can invalidate programs in surprising ways, since it may apply after other transformations, such as function inlining.

Total store order

The implementation has to ensure that atomic assignments appear to execute as though the thread actions were interleaved. This means in particular that in the following example (x and y atomic, initially zero)
Thread 1:
  atomic_store(&x, 1);

Thread 2:
  atomic_store(&y, 1);

Thread 3:
  r1 = atomic_load(&x);
  r2 = atomic_load(&y);

Thread 4:
  r3 = atomic_load(&y);
  r4 = atomic_load(&x);
threads 3 and 4 cannot see the assignments to x and y in opposite order, i.e. r1 = r3 = 1 and r2 = r4 = 0 must be impossible. This is unpleasantly expensive to enforce on a few architectures. In particular, it is not immediately clear that inserting fence instructions between every pair of adjacent instructions is sufficient. (It can always be implemented if atomics are always implemented using locks. But that is clearly undesirable.)

Although much energy was invested in designing a simple and usable memory model that did not impose this requirement, these attempts were ultimately judged unsuccessful, both in that they were too complex, and also in that they either allowed unintuitive results on more practically important examples, or failed to avoid the implementation difficulties. Thus we continue to require sequential consistency for examples such as this. Some major architectural specifications have since been clarified to make this less expensive to implement.

The Java programming language already imposed similar requirements, though they were not always fully appreciated. Thus this does not impose a new requirement on hardware.

Atomics <stdatomic.h>

C atomic types provide objects that can safely be concurrently accessed and updated by multiple threads. Operations appear to happen indivisibly. In the absence of explicit memory_order specifications, programs using such atomic variables continue to provide an interleaving-based memory semantics.

This functionality is often confused with, but significantly different from, that provided by the volatile type qualifier in C. (Other languages, notably Java, use volatile to mean something much closer to C's atomic. The committee had no control over this unfortunate conflict in terminology.) In particular C volatile variables can be structures of arbitrary size, and there is no promise that accesses will appear indivisible. Historically, implementations of sufficiently small volatile variables have varied in the extent to which they ensure ordering of memory accesses, and hence interleaving-based semantics. We are not aware of any that provide the guarantees we expect for atomic variables on modern multiprocessors.

Although there was initial sentiment in favor of strengthening volatile to include atomic thread synchronization semantics (see Boehm & Maclaren, "Should volatile acquire atomicity and thread visibility semantics"), this was found to be impractical, both due to performance implications on existing code and, even more importantly, due to the inherent conflict between existing large volatile structures that were never meant to be accessed indivisibly and the need to provide indivisible accesses for variables used for thread communication.

Previous implementations of threads with C have generally either not provided features analogous to the atomic types, or have provided them only incompletely. As a result, many systems attempted to provide their own, either to provide more efficient concurrent access to shared counters, flags, or the like, or sometimes to provide for safe access from signal handlers. Aside from the signal handler case, accesses to these shared objects could instead have been protected by a mutex. However, the difference in cost between, for example, testing an atomic initialization flag and acquiring a mutex is often large enough to make the mutex option very unattractive, as the sketch below illustrates.
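For example, a lazily initialized resource can test an atomic flag on its fast path and fall back to a mutex only during initialization. A sketch, assuming the mtx_t interface from <threads.h>; init_resource() is a hypothetical one-time setup function, and init_lock is assumed to have been set up with mtx_init:

#include <stdatomic.h>
#include <threads.h>

void init_resource(void);                          /* hypothetical one-time setup */

static atomic_int initialized = ATOMIC_VAR_INIT(0);
static mtx_t init_lock;                            /* assumed initialized with mtx_init */

void ensure_initialized(void)
{
    if (atomic_load(&initialized))      /* cheap test; no lock on the fast path */
        return;
    mtx_lock(&init_lock);
    if (!atomic_load(&initialized)) {   /* recheck under the lock */
        init_resource();
        atomic_store(&initialized, 1);
    }
    mtx_unlock(&init_lock);
}

With the default sequentially consistent orderings, this is a correct form of double-checked initialization; the same pattern with plain int in place of atomic_int contains a data race.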

C's (and C++'s) atomic types deviate from prior library designs primarily in that ordering constraints are normally attached to individual operations on atomic objects, rather than being expressed as free-standing memory fences.

Minimizing reliance on explicit memory fences enables our default "sequential consistency for data-race-free programs" approach. Many people also feel that attached ordering constraints are easier to use, at least for newly written code.

Explicit memory fences can accurately express intent in some cases, and can facilitate porting of legacy code written with platform-dependent fences; they are provided for those cases. However, they also tend to needlessly overconstrain ordering in many other cases. For example, consider the program segment:

  z = ...;
  atomic_store_explicit(&z_done, 1, memory_order_release);
  atomic_store_explicit(&y, ..., memory_order_relaxed);
This prevents reordering of the stores to z and z_done, but does not prevent reordering of the stores to z and y, and hence allows the store to y to proceed without waiting for the store to z to complete. This is commonly the intent for flags used for thread communication, but it is not expressible with fences. Although this does not affect most current architectures, it is important for at least one, and is likely to enable future hardware optimizations.

Similarly, ordering constraints associated with an operation on atomic variable a can be removed if the compiler can determine that a is accessible from only one thread, e.g. due to thread inlining. Ordering constraints associated with fences are almost impossible to remove.

Although atomic types are useful primarily when the underlying hardware allows their implementation without the use of locks, they may (with the exception of atomic_flag) be implemented by hashing locations to one of a collection of locks, and acquiring the corresponding lock before operating on a location. The atomic_is_lock_free() function allows client code to query whether such an implementation is in use, in which case the corresponding operations are slower and are unusable from asynchronous signal handlers. Note that this property is often not known until the program is actually loaded, since atomic operation support commonly varies between different "minor" variants of the same architecture.
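A sketch of such a query, using an atomic counter intended to be updated from an asynchronous signal handler (sig_count and setup() are illustrative names):

#include <stdatomic.h>
#include <stdio.h>

static atomic_int sig_count = ATOMIC_VAR_INIT(0);

void setup(void)
{
    if (!atomic_is_lock_free(&sig_count)) {
        /* A lock-based fallback is in use: operations on sig_count are
           slower and must not be used from asynchronous signal handlers. */
        fprintf(stderr, "warning: atomic_int is not lock-free here\n");
    }
}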

Implementations will have to resort to a lock-based implementation if any of the atomic operations corresponding to a type are not supported by the hardware. (But note that compare-and-exchange on smaller objects can usually be safely emulated by compare-and-exchange on larger objects, and compare-and-exchange suffices for a lock-free emulation of all the other operations.) Hence the set of provided operations is conservatively restricted to those available across nearly all modern platforms.