ISO/IEC JTC1 SC22 WG14 N1461 – 2010-05-09
Paul E. McKenney, paulmck@linux.vnet.ibm.com
This document lays out a rationale for the memory model, including dependency ordering. There are many parallel programming models, and different applications will be best served by different programming models. As always, use the right tool for the job at hand. This document's focus is on the case where the right tool for the job is the multithreaded shared-memory programming model, as this is the programming model that stresses the memory model, which is the topic at hand.
Even within the confines of the multithreaded shared-memory programming model, there are a number of synchronization mechanisms and a number of ways of using them. Locking is perhaps the best known and has perhaps the longest history of successful use. A very natural question is therefore “why not restrict the memory model to the minimum required to support locking?”
The first thing to keep in mind is that a great many projects that ostensibly use only simple locking in fact use very sophisticated techniques on critical code paths. In most cases, these techniques rely on non-standard extensions, and in too many cases, they are vulnerable to any number of compiler and hardware optimizations. A major goal of the memory model is therefore to provide a standard basis for such techniques, in order to avoid accidental but subtle breakage of critical code paths.
Of course, it is possible to achieve quite a bit within the confines of simple locking. Unfortunately, so-called “simple” locking can be surprisingly complex because it is not possible to protect a dynamically allocated data element using a lock located inside of that element. The reason for this restriction can be seen by considering situations involving concurrent access to and deletion of that data element. For more detail on this issue, see Section 5 of Gamsa's 1999 paper, where it is called “existence guarantees.” Some form of existence guarantee is required to attain decent scalability and performance, even on multicore systems of moderate size.
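To make the race concrete, consider the following minimal sketch, in which all names are hypothetical: one thread loads a pointer to the element and attempts to acquire the lock embedded in it, while a second thread locks, unlinks, and frees that same element.

    #include <pthread.h>
    #include <stdlib.h>

    struct elem {
        pthread_mutex_t lock;          /* the embedded lock */
        int data;
    };

    struct elem *global_ptr;           /* assumed published elsewhere */

    void reader(void)                  /* Thread A */
    {
        struct elem *p = global_ptr;
        /* Thread B can run to completion right here, ... */
        pthread_mutex_lock(&p->lock);  /* ... making this a use-after-free */
        /* ... use p->data ... */
        pthread_mutex_unlock(&p->lock);
    }

    void deleter(struct elem *p)       /* Thread B */
    {
        pthread_mutex_lock(&p->lock);
        /* remove p from the enclosing data structure */
        pthread_mutex_unlock(&p->lock);
        free(p);                       /* the lock is freed along with p */
    }

No discipline involving only the embedded lock can close this window; some entity outside the element must guarantee its existence until all readers have finished with it.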
Production software has used a number of ways of providing existence guarantees without resorting to dependency ordering, but all of them are problematic in one way or another:

- Avoiding deallocation entirely.
- Using a single global lock.
- Using a global array of locks.
- Using per-thread locks on partitioned data.
- Using a single global reference counter.
- Using a global array of reference counters.
- Using per-thread reference counters on partitioned data.
- Using a pair of global reference counters.
- Using a garbage collector.

Each of these approaches is described in one of the following sections.
One way of sidestepping the need to provide existence guarantees is to avoid deallocation. The traditional method of accomplishing this without leaking memory is to provide fixed-length arrays for each type of data element, and this in fact was the approach taken by Sequent Computer Systems throughout the 1980s and the early 1990s. The fatal flaw with this approach is that it requires that each system be painstakingly configured, which leaves such systems woefully vulnerable to changes in workload.
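A minimal sketch of such a preconfigured pool follows; the conn type and the MAX_CONNS tuning constant are hypothetical. Because slots are claimed and released but never deallocated, a pointer to a slot remains valid for the life of the program.

    #include <pthread.h>

    #define MAX_CONNS 1024             /* must be hand-tuned per installation */

    struct conn {
        int in_use;                    /* protected by pool_lock */
        /* ... per-connection state ... */
    };

    static struct conn conn_pool[MAX_CONNS];
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    /* "Allocate" by claiming a free slot; slots are never freed. */
    static struct conn *conn_get(void)
    {
        pthread_mutex_lock(&pool_lock);
        for (int i = 0; i < MAX_CONNS; i++) {
            if (!conn_pool[i].in_use) {
                conn_pool[i].in_use = 1;
                pthread_mutex_unlock(&pool_lock);
                return &conn_pool[i];
            }
        }
        pthread_mutex_unlock(&pool_lock);
        return NULL;                   /* pool exhausted: the fatal flaw */
    }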
There are nevertheless some situations in which preconfigured fixed-length arrays are the right tool for the job, for example, in small fixed-function devices with predictable workloads. However, this approach is simply no longer practical for any sort of general-purpose computing system.
This approach uses a global lock to guard acquiring a reference to the data element in question. This same global lock must then be held when deleting this same data element.
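A minimal sketch, assuming a hypothetical lookup() function, a hypothetical unlink_elem() removal function, and a reference count in each element:

    #include <pthread.h>
    #include <stdlib.h>

    struct elem { int refcount; /* ... payload ... */ };

    extern struct elem *lookup(int key);      /* hypothetical lookup */
    extern void unlink_elem(struct elem *);   /* hypothetical removal */

    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

    struct elem *elem_get(int key)
    {
        struct elem *p;

        pthread_mutex_lock(&global_lock);     /* every thread funnels through here */
        p = lookup(key);
        if (p)
            p->refcount++;                    /* safe: deletion also holds the lock */
        pthread_mutex_unlock(&global_lock);
        return p;
    }

    void elem_put(struct elem *p)
    {
        int last;

        pthread_mutex_lock(&global_lock);
        last = (--p->refcount == 0);
        if (last)
            unlink_elem(p);                   /* no new references are possible */
        pthread_mutex_unlock(&global_lock);
        if (last)
            free(p);
    }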
In many cases, use of a global lock will result in problematic levels of lock contention, even on modest systems by 2010 standards. Use of a global lock also introduces other problems that are discussed in the next section.
When faced with a lock-contention bottleneck such as the one mentioned in the preceding section, the standard parallel-programming technique is to break up the lock in the hope of reducing contention. One way to accomplish this is to replace the single global lock with a fixed array of locks. The address of the data element in question is hashed to obtain the index of the lock to be used to guard acquisition of a reference to that data element. This approach permits lock contention to be reduced to an arbitrarily low level by using an arbitrarily large global array of locks, but there are a number of problems with this approach (a sketch of the hashing step follows this list):

- Poor locality of reference, because the lock will rarely share a cache line with the data element it guards.
- Confusion as to which lock should be acquired for a given data element.
- Increased difficulty of avoiding deadlock when more than one element must be held at a time.
- The need to repeatedly acquire and release locks as references are acquired and released.
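Here is a minimal sketch of the hashing step; NLOCKS and the shift amount are arbitrary choices:

    #include <pthread.h>
    #include <stdint.h>

    #define NLOCKS 256   /* arbitrary; a larger array means less contention */

    static pthread_mutex_t lock_array[NLOCKS];  /* assumed initialized at startup */

    /* Hash the element's address to select the lock that guards it. */
    static pthread_mutex_t *elem_lock(void *p)
    {
        return &lock_array[((uintptr_t)p >> 4) % NLOCKS];
    }

A thread then acquires elem_lock(p) before taking a reference to the element that p references. Note that the selected lock will almost never share a cache line with the element itself, hence the poor locality of reference called out above.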
Despite its severe disadvantages, this approach is heavily used in practice. In my opinion, the pain induced by heavy use of this approach is an important reason for the perceived difficulty of shared-memory parallel programming.
Another solution to lock contention is to partition the data across the threads. Per-thread locks are used to guard access to the corresponding thread's data elements.
This partitioning across threads usually avoids the poor locality of reference, and usually also avoids the confusion as to which lock should be acquired. However, the increased difficulty of avoiding deadlock still remains, and if data elements can migrate among threads, the need to repeatedly acquire locks also remains. In addition, it is not always feasible to partition data across threads.
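A minimal sketch of per-thread partitioning, assuming a fixed number of threads and a hypothetical partition_lookup() function:

    #include <pthread.h>

    #define NTHREADS 16              /* hypothetical fixed thread count */

    struct elem { int refcount; /* ... payload ... */ };

    struct partition {
        pthread_mutex_t lock;        /* guards this thread's elements only */
        /* ... this thread's data elements ... */
    };

    static struct partition parts[NTHREADS];

    extern struct elem *partition_lookup(struct partition *, int key);

    struct elem *partitioned_get(int owner_tid, int key)
    {
        struct elem *p;

        pthread_mutex_lock(&parts[owner_tid].lock);  /* usually uncontended */
        p = partition_lookup(&parts[owner_tid], key);
        if (p)
            p->refcount++;
        pthread_mutex_unlock(&parts[owner_tid].lock);
        return p;
    }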
Yet another solution to lock contention is to avoid locks, for example, using a global reference counter to guard access to the data elements. Deleting a given element requires removing it from the enclosing data structure (so that no new references to it can be obtained), waiting for the reference count to drop to zero, and then freeing the element. Reference counting is heavily used in production code.
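A minimal sketch, assuming the proposed <stdatomic.h> facilities and a hypothetical unlink_elem() removal function:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct elem;                            /* payload omitted */
    extern void unlink_elem(struct elem *); /* hypothetical removal */

    static atomic_long global_refs;         /* one counter for ALL elements */

    void use_elem(struct elem *p)
    {
        atomic_fetch_add(&global_refs, 1);  /* every reader, every element */
        /* ... access *p ... */
        atomic_fetch_sub(&global_refs, 1);
    }

    void delete_elem(struct elem *p)
    {
        unlink_elem(p);                     /* no new references to p */
        while (atomic_load(&global_refs) != 0)
            ;                               /* wait out pre-existing readers */
        free(p);
    }

Note that the single counter covers all elements, so delete_elem() must wait for readers of every element, not just readers of p.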
Unfortunately, all the atomic operations on the single global reference counter can result in intolerable levels of memory contention, degrading both scalability and performance.
Furthermore, this approach is vulnerable to compiler and hardware optimizations when attempting to acquire a reference to a data element that is being concurrently added to the enclosing data structure. Such optimizations can result in the thread acquiring the reference seeing pre-initialization contents of the data element. One way to solve this problem is to require use of atomic operations on the pointer itself or the fields of the data structure pointed to. Of course, another way to solve this problem is to use dependency ordering, as discussed below.
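For concreteness, here is a minimal sketch of publication and dependency-ordered access using the proposed <stdatomic.h> facilities; struct foo, gp, and the two functions are hypothetical:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct foo { int a; int b; };

    static _Atomic(struct foo *) gp;       /* shared pointer, initially NULL */

    /* Publisher: fully initialize the element, then publish it. */
    void publish(void)
    {
        struct foo *p = malloc(sizeof(*p));
        if (!p)
            return;
        p->a = 1;
        p->b = 2;
        atomic_store_explicit(&gp, p, memory_order_release);
    }

    /* Reader: the consume load orders the field accesses after the
     * pointer load via the data dependency, so the reader cannot see
     * pre-initialization values of p->a or p->b. */
    int reader(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
        if (p)
            return p->a + p->b;            /* carries a dependency from the load */
        return -1;
    }

On most architectures the consume load costs no more than a plain load, because the hardware itself respects the data dependency; promoting it to memory_order_acquire would instead require a memory-fence instruction on many systems.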
Regardless of the solution used, this violation of causality is profoundly counter-intuitive to most developers, even more so than the possibility that different threads might see unrelated operations occur in different orders. I am a case in point. It took the DEC Alpha architects a long time to convince me that DEC Alpha could violate causality in this manner — and even longer for me to convince them that their documentation of this point was insufficient.
In addition, this approach suffers from many of the problems listed in the next section.
The standard approach to solving memory-contention problems is similar to that for lock-contention problems, leading to a hashed array of reference counters. This approach suffers from many of the same problems that plague hashed arrays of locks:

- Poor locality of reference, because the counter will rarely share a cache line with the data element it guards.
- Confusion as to which reference counter applies to a given data element.
- The need to repeatedly manipulate reference counters as references are acquired and released.
This approach sees some use in practice, and, as with the hashed arrays of locks, is partially responsible for parallel programming's reputation of complexity.
Similar to locking, another solution to the memory contention problem is to partition the data elements across threads, at least in cases where such partitioning is feasible. Again similar to locking, such partitioning usually avoids both the poor locality of reference and the confusion as to which reference should be acquired. However, if data elements can migrate among threads, the need to repeatedly acquire reference counters still remains. And regardless of whether or not data elements can migrate, updaters can potentially be indefinitely postponed while waiting for a reference counter to go to zero.
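A minimal sketch, again assuming a fixed thread count; the spin-wait is for brevity only:

    #include <stdatomic.h>

    #define NTHREADS 16                 /* hypothetical fixed thread count */

    static atomic_long refs[NTHREADS];  /* one counter per thread */

    void ref_get(int tid) { atomic_fetch_add(&refs[tid], 1); }
    void ref_put(int tid) { atomic_fetch_sub(&refs[tid], 1); }

    /* Updater: after unlinking an element, wait until every thread's
     * counter has reached zero.  New readers can arrive at any time,
     * so this wait can be postponed indefinitely. */
    void wait_for_readers(void)
    {
        for (int i = 0; i < NTHREADS; i++)
            while (atomic_load(&refs[i]) != 0)
                ;  /* spin */
    }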
The indefinite-postponement issue called out in the previous section can be avoided by using a pair of reference counters, in a manner similar to “generation counters.” The idea is to keep a notion of the “current” member of the pair, so that threads attempting to acquire new references to data elements increment the current member. However, these threads keep track of which counter they incremented, and decrement the same counter that they incremented, regardless of which happens to be current at the time of the decrement.
Threads deleting elements can then remove the data element from the enclosing data structure, switch which counter of the pair is current, and then wait for the other counter to decrease to zero. As long as each thread that increments a reference counter decrements it in finite time, indefinite postponement cannot happen.
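The following minimal sketch conveys the flavor of this approach. It deliberately glosses over the complications noted below, for example, a reader that fetches the value of current but is delayed before incrementing the corresponding counter:

    #include <stdatomic.h>

    static atomic_long ctr[2];        /* the pair of reference counters */
    static atomic_int current;        /* index of the "current" counter */

    /* Reader: increment the current counter, remembering which one. */
    int ref_get(void)
    {
        int idx = atomic_load(&current);
        atomic_fetch_add(&ctr[idx], 1);
        return idx;                   /* caller must pass this to ref_put() */
    }

    /* Reader: decrement the SAME counter, even if "current" has changed. */
    void ref_put(int idx)
    {
        atomic_fetch_sub(&ctr[idx], 1);
    }

    /* Updater: after unlinking the element, flip "current" and wait only
     * for the old counter to drain.  New readers increment the new
     * counter, so they cannot postpone this wait indefinitely. */
    void wait_for_prior_readers(void)
    {
        int old = atomic_fetch_xor(&current, 1);  /* flip 0 <-> 1, get old index */
        while (atomic_load(&ctr[old]) != 0)
            ;  /* spin until pre-existing readers finish */
    }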
Note that there are complications that must be handled correctly, so please consult a real implementation (LGPL-licensed) rather than attempting to implement something on the basis of this skeletal description. However, these complications can be hidden behind a simple API.
The preceding discussion of reference counting might well have suggested that garbage collectors could be helpful, and in fact they are. However, garbage collectors help only with removing data elements from the enclosing data structure. In particular, garbage collectors do nothing to prevent compiler and hardware optimizations from engaging in the profoundly counter-intuitive practice of breaking dependency chains, thus permitting a thread accessing a newly inserted data element to see the pre-initialization values of that element's fields.
Furthermore, not all workloads can tolerate a garbage collector. However, other methods, such as RCU, can often be used for these workloads. That said, these other methods still must handle the pre-initialization-values issue.
The main shortcoming remaining in the best-of-breed approaches (garbage collectors and RCU) is the lack of dependency ordering.
Although many projects have lived for quite some time without use of dependency ordering, I hope that the problems listed in this section demonstrate why such a life is not necessarily a happy one. Worse yet, many of these projects are likely to be relying on dependency ordering without realizing that hardware and software optimizations can break dependency chains, and perhaps without even realizing that they are relying on dependency ordering to begin with.
This situation motivated the proposals to add dependency ordering to the C standard, most recently in the form of n1444.
Of course, as noted earlier, practitioners should use the right tool for the job. For a great many jobs, the right tool will continue to be single-threaded programs.
There have been a number of further objections to dependency ordering. This section discusses these objections.
One such objection is the difficulty of implementing dependency ordering. However, implementations that do not wish to track dependencies can simply promote memory_order_consume to memory_order_acquire. Such implementations will suffer some performance degradation and will unnecessarily prohibit some types of code-motion optimizations, but this is nevertheless a valid design point. Implementations supporting the gcc-style "memory" clobber to the __asm directive can gain similar simplicity with somewhat smaller performance degradation.
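For reference, the gcc-style construct in question is the classic compiler-only barrier. It emits no instructions, but it forbids the compiler from reordering or caching memory accesses across it, which constrains the compiler without constraining the hardware:

    /* Compiler barrier: the "memory" clobber tells the compiler that
     * memory may change arbitrarily across this statement. */
    #define barrier() __asm__ __volatile__("" : : : "memory")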
In summary, it has been about three decades since the notion of relying on dependency ordering was introduced. It is time to get it into the C standard.