Relaxed (non-sequentially consistent) memory models are typically specified in terms of either acquire/release semantics that introduce ordering between two synchronizing operations from separate threads, or bidirectional fences that introduce ordering between two operations from the same thread. The C++ memory model chooses the first approach, for good reasons. Nevertheless, many existing hardware platforms are actually based on the second and rely on fences, also called memory barriers. The canonical fence-based architecture is Sun SPARC in RMO mode, and the IBM POWER architecture is widely available. Even the poster child for the acquire/release model, the Intel IA-64 architecture, includes a bidirectional full fence instruction, mf, and so does the x86/x64 architecture (mfence).
A fence is a primitive that enforces ordering between preceding loads or stores and subsequent loads or stores. In SPARC RMO terms the various combinations are called #LoadLoad, #LoadStore, #StoreLoad and #StoreStore and can be combined, leading to a total of 16 combinations (one of which is a no-op, and another is a full fence). In practice, the #StoreLoad barrier is the most expensive and its cost is essentially the same as a full fence; the others are relatively cheap even in combination. These costs have allowed the designers of the POWER architecture to simplify the full set into two instructions, LWSYNC (#LoadLoad | #LoadStore | #StoreStore) and SYNC (a full fence). We'll use lwsync from now on as a generic way to refer to the relatively cheap fence.
(Another possibility is to provide an acquire fence, #LoadLoad | #LoadStore, a release fence, #LoadStore | #StoreStore, and an acquire+release fence, #LoadLoad | #LoadStore | #StoreStore. This corresponds better to the existing terminology in the C++ memory model and we'll return to it later on.)
While it's trivial to map the acquire/release model to a bidirectional fence model, albeit with a slight theoretical loss in performance, the reverse is not true. This has led to a pressure to include fence primitives into the C++ memory model; see for example N2153 that makes a fine case. This pressure has led to the addition of a fence member function to all atomic types and to the definition of a global atomic object called atomic_global_fence_compatibility.
The motivation for having fence primitives in C++ is fourfold:
The rest of this paper explains why the existing mechanism in the working draft does not meet these challenges, and proposes an alternative that does.
The current working draft, N2588, has adopted a mechanism put forward in N2324 that consists of atomic types supporting a dedicated "fence" operation. On the memory model level, this is a read-modify-write operation that leaves the value unchanged. This is an elegant attempt to provide a solution to the conditional synchronization problem described in N2153/N2237 without introducing a separate primitive (which corresponds to our point 4 above). It also aims to address points 1-3, but as a secondary goal.
At first glance, a programmer and an implementer would reasonably expect these fence operations to be implemented as follows:
Table 1:
| POWER | x86, x64 | |
| x.fence(memory_order_acquire) | LWSYNC | no-op | 
| x.fence(memory_order_release) | LWSYNC | no-op | 
| x.fence(memory_order_acq_rel) | LWSYNC | no-op | 
| x.fence(memory_order_seq_cst) | SYNC | mfence, or a locked instruction | 
This implementation turns out to not match the specification. Consider the following
Example 1:
| Thread 1: | Thread 2: | |
| x.store( 1, memory_order_relaxed ); z.fence( memory_order_acq_rel ); r1 = y.load( memory_order_relaxed ); | y.store( 1, memory_order_relaxed ); z.fence( memory_order_acq_rel ); r2 = x.load( memory_order_relaxed ); | 
        Since read-modify-write operations on a single location z form a total
        order (1.10/5) and since z.fence is a RMW operation, we must conclude
        that one of the two z.fence calls precedes the other in this order.
        As the example is symmetrical, we can arbitrarily pick an order without sacrificing
        generality; let for example the fence operation in thread 1 precede the fence operation
        in thread 2. Since the fences in our example have both acquire and release semantics
        due to the memory_order_acq_rel constraint, they synchronize-with
        each other, and hence, the fence in thread 1 happens before the fence in
        thread 2. This, in turn, implies that the x.store in thread 1 happens
        before x.load in thread 2, and that r2 must be 1.
In other words, non-sequentially consistent outcomes are prohibited by the fences.
        If we now look at our table above, we see that the intended LWSYNC or no-op implementation
        of z.fence(memory_order_acq_rel) is not enough to prohibit the non-sequentially
        consistent outcome of r1 == r2 == 0. We must conclude that z.fence(memory_order_acq_rel)
        must be implemented as SYNC or mfence:
Table 2:
| POWER | x86, x64 | |
| x.fence(memory_order_acquire) | LWSYNC | no-op | 
| x.fence(memory_order_release) | LWSYNC | no-op | 
| x.fence(memory_order_acq_rel) | SYNC | mfence, or a locked instruction | 
| x.fence(memory_order_seq_cst) | SYNC | mfence, or a locked instruction | 
No problem, we'll just use acquire or release when we need lwsync, right?
Example 2:
| Thread 1: | Thread 2: | |
| x.store( 1, memory_order_relaxed ); z.fence( memory_order_release ); z.fence( memory_order_acquire ); r1 = y.load( memory_order_relaxed ); | y.store( 1, memory_order_relaxed ); z.fence( memory_order_release ); z.fence( memory_order_acquire ); r2 = x.load( memory_order_relaxed ); | 
        Again, we have a total order over the operations on z. No matter how
        we reorder the four fences, one of the release fences will precede the acquire fence
        in the other thread, the two will synchronize-with each other, and order
        one of the stores with its corresponding load, as happened in example 1.
        Therefore, the acquire and release fences can't both be LWSYNC or a no-op, as this
        would again allow the r1 == r2 == 0 outcome that the memory model tells
        us cannot occur. We are forced to update our implementation table once again:
Table 3:
| POWER | x86, x64 | |
| x.fence(memory_order_acquire) | SYNC or LWSYNC | mfence or no-op | 
| x.fence(memory_order_release) | LWSYNC or SYNC | no-op or mfence | 
| x.fence(memory_order_acq_rel) | SYNC | mfence | 
| x.fence(memory_order_seq_cst) | SYNC | mfence | 
This looks disturbing enough, but we aren't done yet. Since fences are read-modify-write operation, they can synchronize with acquire loads and release stores:
Example 3:
| Thread 1: | Thread 2: | |
| x.store( 1, memory_order_relaxed ); z.fetch_add( 0, memory_order_release ); z.fence( memory_order_acquire ); r1 = y.load( memory_order_relaxed ); | y.store( 1, memory_order_relaxed ); z.fetch_add( 0, memory_order_release ); z.fence( memory_order_acquire ); r2 = x.load( memory_order_relaxed ); | 
Example 4:
| Thread 1: | Thread 2: | |
| x.store( 1, memory_order_relaxed ); z.fence( memory_order_release ); z.load( memory_order_acquire ); r1 = y.load( memory_order_relaxed ); | y.store( 1, memory_order_relaxed ); z.fence( memory_order_release ); z.load( memory_order_acquire ); r2 = x.load( memory_order_relaxed ); | 
Using a similar reasoning as with example 2, and observing the release sequence rules in 1.10/6 and 1.10/7, we can reasonably conclude that both acquire and release fences need to map to SYNC. To be fair, an implementer on the POWER platform still does have the option to use LWSYNC for one of the fences, at the cost of moving the SYNC to the load or to the fetch_add, respectively. This is unlikely, as it will decrease the performance of a primary primitive in favor of a secondary one. Our final implementation table now looks like:
Table 4:
| POWER | x86, x64 | |
| x.fence(memory_order_acquire) | SYNC | mfence | 
| x.fence(memory_order_release) | SYNC | mfence | 
| x.fence(memory_order_acq_rel) | SYNC | mfence | 
| x.fence(memory_order_seq_cst) | SYNC | mfence | 
In short, on the currently popular platforms the fence primitive is an overengineered way to insert an expensive full fence, and its argument plays no part. This makes it extremely unattractive, especially on the x86/x64 platform. It is quite likely that implementations will deliberately ignore the specification and just provide the obvious implementation in Table 1, which is what programmers expect and can make use of.
Again, to be fair, we must acknowledge the possibility of a new hardware being able to take advantage of the current specification and somehow implement the primitive in a more performant way. This however doesn't help. Any program that uses fences will suffer severe performance penalties when ported to, for example, x86/x64; this in practice renders the fence primitive non-portable and platform-dependent.
        Other surprising properties of the specification concern the atomic_global_fence_compatibility
        atomic object. It is provided as a migration path for programs using traditional
        address-free fences, such as in the typical
Example 5:
| Thread 1: | Thread 2: | |
| x = 1; // non-atomic
atomic_global_fence_compatibility
    .fence( memory_order_acq_rel );
y.store( 1, memory_order_relaxed ); | if( y.load( memory_order_relaxed ) == 1 )
{
    atomic_global_fence_compatibility
        .fence( memory_order_acq_rel );
    assert( x == 1 );
} | 
(We can assume that the original version of the code contained a #StoreStore or stronger fence in thread 1, and a #LoadLoad or stronger fence in thread 2.)
        This port works, but it's important to understand why. As before, there is a total
        order on the operations on atomic_global_fence_compatibility, and we
        must examine both possibilities.
If the fence in thread 1 precedes the fence in thread 2, we conclude from the release semantics of the former and the acquire semantics of the latter that the store to x in thread 1 happens before the load from x in thread 2, the assert passes, and all is well.
If, on the other hand, the fence in thread 2 precedes the fence in thread 1, we conclude in a similar manner that the load from y happens before the store to y, which means that the branch isn't taken and the assert is not executed. Note, however, that now we used the acquire semantics of the fence in thread 1 and the release semantics of the fence in thread 2.
        In short, we needed both fences to have both acquire and release semantics, otherwise
        we wouldn't be able to conclude our analysis and prove that the example is correct
        and the assertion never fails. Even though the original code contained only #StoreStore
        and #LoadLoad, two of the most cheap fences, we were forced to employ memory_order_acq_rel
        to achieve equivalent semantics.
        It would be reasonable to assume that programmers may be tempted to try memory_order_acquire
        or memory_order_release instead; they are provided as options, so they
        must be useful for something! But this is a trap and would lead to incorrect
        code, at least according to the specification. (The code may well work in practice,
        as suggested by table 4, but it isn't guaranteed to.)
The obvious alternative fomulation:
        void atomic_global_fence();
    provides equivalent functionality to the current
        extern const atomic_flag atomic_global_fence_compatibility;
    
        while at the same time being better named and much less error prone. The aim of
        the explicit global variable atomic_global_fence_compatibility is to
        render the dependency on a single location explicit, highlighting the potential
        performance problem and encouraging users to migrate their algorithms away from
        a global fence formulation and into a form that is more friendly to the C++ memory
        model. It is questionable whether the the benefits of this approach outweigh its
        obvious downsides, and the long and inconvenient name that has been chosen is certain
        to lead to its shortening or wrapping, hiding it from view and subverting the intent.
Having concluded our analysis, we can now revisit our criteria 1-4:
atomic_global_fence_compatibility
            alternative is error prone, suboptimal and is, in fact, designed to discourage people
            from using it. Some of these programmers will be slow to migrate to the C++ memory
            model. (Others will of course bite the bullet and learn the new acquire/release
            style; but the point is that they could've been writing code using their old skills
            while learning.)do_work(A,B); release_fence(); lock_A.store_raw(LOCK_UNLOCKED); lock_B.store_raw(LOCK_UNLOCKED);
and its formulation in terms of the current working draft primitives:
do_work(A,B); lock_A.fence( memory_order_release ); lock_B.fence( memory_order_release ); lock_A.store( LOCK_UNLOCKED, memory_order_relaxed ); lock_B.store( LOCK_UNLOCKED, memory_order_relaxed );
The ideal outcome on a x86 platform, where all stores already have release semantics, would be for the fences to disappear completely. Alas, this is not what would happen; as a look at table 4 demonstrates, a typical implementation would emit two mfence instructions (one of which may be eliminated by the peephole optimizer), with a predictably undesirable performance consequences.
The alternative proposed in this paper isn't new. It has been suggested, with a slightly different syntax, in N2153 (N2237), N2195 and N2262. It consists of a single function, atomic_memory_fence:
        void atomic_memory_fence( memory_order mo );
and its intended implementation on some platforms of interest is shown in the following table:
| mo | SPARC RMO | POWER | IA-64 | x86, x64 | 
| memory_order_relaxed | no-op | no-op | no-op | no-op | 
| memory_order_acquire | #LoadLoad | #LoadStore | LWSYNC | mf | no-op | 
| memory_order_release | #LoadStore | #StoreStore | LWSYNC | mf | no-op | 
| memory_order_acq_rel | #LoadLoad | #LoadStore | #StoreStore | LWSYNC | mf | no-op | 
| memory_order_seq_cst | #LoadLoad | #LoadStore | #StoreLoad | #StoreStore | SYNC | mf | mfence, or a locked instruction | 
What is new is that this paper proposes a specification of atomic_memory_fence that describes the intended semantics within the framework of the C++ memory model.
This primitive meets our criteria 1-4 above, for the following reasons:
atomic_memory_fence(memory_order_acq_rel).In addition,
        void atomic_compiler_fence( memory_order mo );
        is proposed as a companion primitive that only restricts the compiler optimizations
        and reorderings in the exact same way an atomic_memory_fence does,
        without actually emitting a hardware fence instruction. This primitive supports
        our use case 3 above; it is intended to allow a programmer that already knows what
        instructions are to be emitted to communicate this knowledge to the compiler.
        The theoretical justification of atomic_compiler_fence is that it allows
        the programmer to specify the order in which relaxed writes to atomic variables
        become visible to a signal handler executed in the same thread.
Remove the
        void fence(memory_order) const volatile;
    members from all types in [atomics].
Remove the
        void atomic_flag_fence(const volatile atomic_flag *object, memory_order order);
    function.
Remove the
        void atomic_memory_fence(const volatile atomic_type*, memory_order);
    functions.
Remove the definition
        extern const atomic_flag atomic_global_fence_compatibility;
    Add
// 29.5, fences
void atomic_memory_fence(memory_order);
void atomic_compiler_fence(memory_order);
        to the synopsis of <cstdatomic>.
Add a new section, [atomic.fences], with the following contents:
29.5 Fences
This section introduces synchronization primitives called fences. Fences can have acquire semantics, release semantics, or both. A fence with acquire semantics is called an acquire fence. A fence with release semantics is called a release fence.
A release fence R synchronizes with an acquire fence A if a load that is sequenced before A observes the value written by a store sequenced after R, or a value written by any side effect in the release sequence (1.10/6) that store would produce had it been a release operation.
A release fence R synchronizes with an acquire operation A if A observes the value written by a store sequenced after R, or a value written by any side effect in the release sequence that store would produce had it been a release operation.
A release operation R synchronizes with an acquire fence A if a load, sequenced before A, observes the value written by R, or a value written by any side effect in the release sequence headed by R.
        Fences can be sequentially consistent. A sequentially consistent fence has both
        acquire and release semantics. Sequentially consistent fences participate in the
        single total order S over all memory_order_seq_cst operations
        (29.1/2). Each sequentially consistent fence synchronizes with the sequentially
        consistent fence that follows it in S.
    
        void atomic_memory_fence(memory_order mo);
    Effects: Depending on the value ofmo, this operation:
- has no effects, if
mo == memory_order_relaxed;- is an acquire fence, if
mo == memory_order_acquire;- is a release fence, if
mo == memory_order_release;- is both an acquire fence and a release fence, if
mo == memory_order_acq_rel;- is a sequentially consistent fence, if
mo == memory_order_seq_cst;
        void atomic_compiler_fence(memory_order mo);
    
        Effects: equivalent to atomic_memory_fence(mo), except that
        synchronizes with relationships are established only between a thread and a signal
        handler executed in the same thread.
    
        [Note: atomic_compiler_fence can be used to specify the order
        in which actions performed by the thread become visible to the signal
        handler. — end note]
    [Note: Compiler optimizations or reorderings of loads and stores are inhibited in the same way as withatomic_memory_fence, but the hardware fence instructions thatatomic_memory_fencewould have inserted are not emitted. — end note]
Thanks to Hans Boehm, Lawrence Crowl, Paul McKenney and Raul Silvera for reviewing this paper.
--end