ISO/IEC JTC1 SC22 WG21 N2513 = 08-0023 - 2008-02-01
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
This document is a revision of
N2444
= 07-0314 - 2007-10-16.
The change consists of a relaxation on termination
to enable thread managers in static duration variables.
See the new paragraph in
3.6.3 Termination [basic.start.term],
the removed bullet in
18.4 Start and termination [lib.support.start.term],
and the removed paragraph in
30.2.1.2 thread constructors [thread.thread.constr].
The base working draft for this document is N2521 Working Draft, Standard for Programming Language C++.
thread constructors [thread.thread.constr]Concurrency introduces potential deadlock or data races in the dynamic initialization and destruction of static-duration objects. Because the opportunity for programmers to manually synchronize such objects is limited by a lack of syntactic handles, the language must introduce new syntax, define synchronization, or limit programs.
This proposal breaks the problem into three reasonably separable parts: initialization of function-local static-duration objects, initialization of non-local static-duration objects, and destruction of all static-duration objects.
This proposal exploits properties of a new algorithm for fast synchronized initialization by Mike Burrows of Google. (See the appendix for an implementation and an argument for its correctness.) The algorithm has three attributes signficant to its use here.
The core problem with function-local static-duration object initialization is that the containing function may be invoked concurrently, and thus the definition may execute concurrently. Failing to synchronize may introduce a race condition. The obvious solution is to synchronize. The problem is that such synchronization may introduce deadlock if the synchronization involves holding a lock while the initializer executes.
The proposal relies on Burrows' algorithm to avoid holding a lock while the initializer executes. Any deadlock that occurs as a result of initialization, must necessarily be a race condition in the absence of synchronization.
The GNU compiler currently implements synchronized function-local variable initialization, though it holds the lock during initialization.
In paragraph 4, edit
Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the object is being initialized, the concurrent execution waits for completion of the initialization. The implementation shall not introduce any locking around execution of the initializer. If control re-enters the declaration
(recursively)while the object is being initialized, the behavior is undefined.
There are two problems with non-local static-duration object initialization. The first problem is that such initialization may occur concurrently. There are two sources to the concurrency, dynamic opening of a shared library in a multi-threaded context and creating of multiple threads in initializers, and hence the potentially implied concurrent execution of other initializers. The second problem is that static-duration object initialization may constitute a large fraction of process time, particularly when users are waiting for program start, and hence may benefit from parallel execution.
The standard currently admits unordered initialization. When initializing one object, reference to a relatively unordered object may result in access before dynamic initialization but after zero initialization. Since the concept of zero initialization has limited utility for many objects, particularly those that require dynamic construction, we consider this feature of the current C++ standard to be of limited utility. Therefore, we propose to make such references undefined.
 Making such references undefined
would normally imply that
std::cout and std::operator new
could not be used by any dynamic initializer.
To resolve this problem,
the standard should provide additional guarantees,
and it currently appears to do so.
In the sequential 2003 standard, unspecified order of initialization essentially required some serial order of initialization. In the parallel proposed standard, the unspecified order of initialization admits concurrent initialization. The necessary additional restriction is that a single object may be initialized at most once. This restriction may require locking. With concurrent initialization, any write (or read in the presence of writes) to statically-initialized objects must be properly locked. An alternate approach, not proposed here, is to require a single global order for all initialization.
 The implementation must initialize
static-duration objects before any of their use
within main or the functions it calls.
To the best of our knowledge, no compiler synchronizes the initialization of non-local static-duration objects. However, the technical issues are similar to existing function-local initializations, and so present no new challenges.
In paragraph 1, edit
Dynamic initialization of
ana non-function-local static-duration object is either ordered or unordered. Definitions of explicitly specialized class template static data members have ordered initialization. Other class template static data members (i.e., implicitly or explicitly instantiated specializations) have unordered initialization. Other objects defined in namespace scope have ordered initialization. Such objectsObjectswithin a single translation unit and with ordered initialization are relatively ordered, otherwise they are relatively unordered. Relatively ordered objects shall be initialized in the order of their definitions within the translation unit. The order of initialization is unspecified for relatively unordered objectswith unordered initialization and for objects defined in different translation units. If the initialization of an object uses a relatively unordered object, the behavior is undefined; no diagnostic is required. [Note: This definition permits concurrent initialization.]
The primary problem with destruction of static-duration objects is access to static-duration objects after their destructors have executed, thus resulting in undefined behavior. To prevent this problem, we require that all user threads finish before destruction begins. For threads that do not naturally finish, mechanisms to terminate threads are proposed in N2447 Multi-threading Library for Standard C++ and its initial incorporation in N2521 Working Draft, Standard for Programming Language C++.
Destruction also introduces the potential to consume a large fraction of process time. Therefore, similar to initialization, we enable concurrent destruction.
Objects defined at namespace scope with relatively ordered initializations must be destructed in reverse order of their initialization.
 The complication to this approach
is destruction of function-local static-duration objects
and calls to functions registered with std::atexit.
Since the order
of initialization of function-local static-duration objects
and of calls to std::atexit
is defined by execution,
rather than by declaration,
we call these execution-ordered objects.
(For expository purposes, calls to std::atexit
correspond to initialization of a virtual object.)
Non-local static-duration objects are called declaration-ordered objects.
For execution-ordered objects
initialized from within the dynamic context
of a declaration-ordered initialization,
their destruction
shall occur in reverse order of completion of their initialization
immediately before the destruction of the context object.
To make this approach viable, initialization and destruction of execution-ordered objects must obey the same restrictions as those on declaration-ordered objects. Specifically, the initialization and destruction of an execution-ordered object must not use an object that is relatively unordered with respect to the context object and must synchronize use of objects with trivial destructors.
Finally, execution-ordered objects may be initialized outside the context of initialization of a declaration-ordered object, that is, they may be initialized from ordinary code. These objects are destructed in reverse order of the completion of their initialization before destruction of any declaration-ordered object.
One potential concern is the run-time cost of managing destruction order. To address this issue, we provide an implementation outline below.
In paragraph 1, edit
Destructors (12.4) for initialized objects of static storage duration
(declared at block scope or at namespace scope)are called as a result of returning frommainand as a result of callingstd::exit(18.4).These objects are destroyed in the reverse order of the completion of their constructor or of the completion of their dynamic initialization.Dynamic initializations of local static objects executing inside the dynamic scope of the initialization a non-local static object are relatively ordered to the non-local static object. Dynamic initializations of local static objects executing outside the dynamic scope of the initialization a non-local static object are relatively ordered tomain. Objects with relatively ordered initialization shall be destroyed in reverse order of completion of their initialization. All objects relatively ordered tomainshall be destroyed before any non-local static-duration objects; otherwise, if the destruction of an object uses a relatively unordered object, the behavior is undefined; no diagnostic is required. [Note: This definition permits concurrent destruction.] If an object is initialized statically, the object is destroyed in the same order as if the object was dynamically initialized. For an object of array or class type, all subobjects of that object are destroyed before any local object with static storage duration initialized during the construction of the subobjects is destroyed.
In paragraph 2, edit
If a function contains a local object of static storage duration that has been destroyed and the function is called during the destruction of an object with static storage duration, the program has undefined behavior if the flow of control passes through the definition of the previously destroyed local object. Likewise, the behavior is undefined if the function-local object is used indirectly (i.e. through a pointer) after its destruction.
In paragraph 3, edit
A call to
std::atexitfrom inside the dynamic scope of the initialization of a non-local static object is relatively ordered to the non-local static object. A call tostd::atexitfrom outside the dynamic scope of the initialization of a non-local static object is relatively ordered tomain. If a function is registered withstd::atexit(see<cstdlib>, 18.4) then following the call tostd::exit, any relatively ordered objects with static storage duration initialized prior to the registration of that function shall not be destroyed until the registered function is called from the termination process and has completed. Forana relatively ordered object with static storage duration constructed after a function is registered withstd::atexit, then following the call tostd::exit, the registered function is not called until the execution of the object's destructor has completed. Ifstd::atexitis called during the construction of an object, the complete object to which it belongs shall be destroyed before the registered function is called.
After paragraph 3, add the following new paragraph. This paragraph replaces a more restrictive specification in prior versions of the paper.
Every thread shall ensure that all its uses of static-duration variables happen before ([intro.multithread]) their destruction and that all calls to the standard library happen before ([intro.multithread]) completion of destruction of static-duration variables and execution of
std::atexitregistered functions ([support.start.term]). [Note: Terminating every thread before a call tostd::exitor the exit frommainis sufficient, but not necessary, to satisfy this requirement. This requirement permits thread managers as static-duration objects. —end note]
Note that in prior versions of this paper, there was a specification to add the following new first bullet to paragraph 8. However, that bullet has been removed in favor of the new paragraph above, which is both more general and less restrictive.
The program shall ensure that all threads, except the main thread, have terminated before calling
exitor returning frommain.
In paragraph 8, existing bullet 1, edit
First, objects with static storage duration are destroyed and functions registered by calling
atexitare called.219) See 3.6.3 for the order of destructions and calls.Non-local objects with static storage duration are destroyed in the reverse order of the completion of their constructor.(Automatic objects are not destroyed as a result of callingexit().)218)Functions registered withatexitare called in the reverse order of their registration, except that a function is called after any previously registered functions that had already been called at the time it was registered.219) A function registered withatexitbefore a non-local objectobj1of static storage duration is initialized will not be called untilobj1's destruction has completed. A function registered withatexitafter a non-local objectobj2of static storage duration is initialized will be called beforeobj2's destruction starts. A local static objectobj3is destroyed at the same time it would be if a function calling theobj3destructor were registered withatexitat the completion of theobj3constructor.
thread constructors [thread.thread.constr]
Every thread shall ensure that each of its uses of static-duration variables happens before their destruction and that all calls to the standard library happen before completion of destruction of static-duration variables and execution of functions registered bystd::atexit(18.4). [Note: Terminating the thread before a call tostd::exitor the exit frommainis sufficient, but not necessary, to satisfy this requirement. This requirement permits the implementation of thread managers as static-duration objects. —end note]
Mike Burrows, m3b at Google.
I show a fast form of concurrent synchronized initialization in the form of an alternate implementation of pthread_once.
It is our intent to make the following technique freely available. To that end, some licensing appears to be required.
Copyright (c) 2007, Google Inc.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The header defines the synchronization type, defines a synchronization sentinal values, declares the synchronization function, and defines an inline fast synchronization function.
#ifndef FAST_PTHREAD_ONCE_H
#define FAST_PTHREAD_ONCE_H
#include <signal.h>
#include <stdint.h>
typedef sig_atomic_t fast_pthread_once_t;
#define FAST_PTHREAD_ONCE_INIT SIG_ATOMIC_MAX
extern __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;
#ifdef __cplusplus
extern "C" {
#endif
extern void fast_pthread_once( pthread_once_t *once, void (*func)(void) );
inline static void fast_pthread_once_inline(
    fast_pthread_once_t *once, void (*func)(void) )
{
    fast_pthread_once_t x = *once;        /* unprotected access */
    if ( x > _fast_pthread_once_per_thread_epoch ) {
        fast_pthread_once( once, func );
    }
}
#ifdef __cplusplus
}
#endif
#endif FAST_PTHREAD_ONCE_H
The source is written in C. The lines of the primary function are numbered for reference in the subsequent correctness argument.
#include "fast_pthread_once.h"
#include <pthread.h>
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
  /* protects global_epoch and all fast_pthread_once_t writes */
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
  /* signalled whenever a fast_pthread_once_t is finalized */
#define BEING_INITIALIZED (FAST_PTHREAD_ONCE_INIT - 1)
static fast_pthread_once_t global_epoch = 0; /* under mu */
__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;
static void check( int x )
{
    if ( x == 0 )
        abort();
}
void fast_pthread_once( fast_pthread_once_t *once, void (*func)(void) )
{
/*01*/    fast_pthread_once_t x = *once;        /* unprotected access */
/*02*/    if ( x > _fast_pthread_once_per_thread_epoch ) {
/*03*/        check( pthread_mutex_lock(&mu) == 0 );
/*04*/        if ( *once == FAST_PTHREAD_ONCE_INIT ) {
/*05*/            *once = BEING_INITIALIZED;
/*06*/            check( pthread_mutex_unlock(&mu) == 0 );
/*07*/            (*func)();
/*08*/            check( pthread_mutex_lock(&mu) == 0 );
/*09*/            global_epoch++;
/*10*/            *once = global_epoch;
/*11*/            check( pthread_cond_broadcast(&cv) == 0 );
/*12*/        } else {
/*13*/            while ( *once == BEING_INITIALIZED ) {
/*14*/                check( pthread_cond_wait(&cv, &mu) == 0 );
/*15*/            }
/*16*/        }
/*17*/        _fast_pthread_once_per_thread_epoch = global_epoch;
/*18*/        check (pthread_mutex_unlock(&mu) == 0);
          }
}
The above code implements pthread_once() in a way that is both fast (a few simple instructions, no memory barriers) and portable. Careful use of memory atomicity and monotonicity are used in place of a memory barrier.
The fast-path for the inline function in the header has the following code as generated by gcc 3.2.2 on an x86:
mov %gs:(%esi),%eax # access to thread-local storage cmp %eax, pthread_once_t_location # accesses the pthread_once_t variable jg slow_path # decides whether to call the function 
This code is one more instruction than a test that is not thread safe. This code touches two cache lines, while an unsafe version would touch only one. However, the additional cache line is thread-specific, and not especially likely to cause misses. As one might expect, repeated calls to the inline version achieve around 1 billion fast-path calls per second on a 2.8GHz processor with a hot cache.
In the sketch proof that follows, assumptions and deductions are given names in square brackets. This sketch will no doubt appear to be overly pedantic to some and sloppy to others. It probably contains errors and misuse of terminology; I hope these faults are not fatal. Its purpose is to convince the reader that with sufficient effort a full proof could be constructed, and perhaps to form the basis for such a proof for anyone eager for such mental exercise.
The portability assumptions beyond those in a straightforward implementation are as follows:
Let us say that every fast_pthread_once variable *once is "finalized" when the associated initialization function "func" returns (line 7 in the code above). We want to argue [correctness] (the combination of [safety] and [liveness]) of the code above, using the assumptions above, plus a few others given below. For [safety], we require [safety_1] that each variable *once is finalized at most once. We also require [safety_2] that any thread that calls fast_pthread_once on *once does not return until *once has been finalized at least once, and all the modifications made by func are visible to that thread. For [liveness], we require that the code does not deadlock or loop forever.
For the code, all variable accesses are consistent, except the load of *once on line 1 because:
[release_consistency] We assume that we have mutexes that provide at least release consistency. We use pthread_mutex_t in the example.
[fairness] We assume that runnable threads eventually run.
[atomicity2] We assume that fast_pthread_once variables are of the type T, from [atomicity].
[initialization] We assume that fast_pthread_once_t variables are originally set to FAST_PTHREAD_ONCE_INIT.
[global_epoch_value_range] global_epoch counts the number of *once variables that have been finalized; it is incremented (line 9) each time func is called (line 7) and never modified otherwise. From [*once_bound], we know the value of global_epoch is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}, because global_epoch is initially 0, and FAST_PTHREAD_ONCE_INIT is the maximum value for the type of *once.
[*once_value_sequence] Every *once variable is initialized to FAST_PTHREAD_ONCE_INIT [initialization]. There are only two places where *once is changed:
The sets {FAST_PTHREAD_ONCE_INIT}, {BEING_INITIALIZED}, and {0, ..., FAST_PTHREAD_ONCE_INIT-2} have no intersection because BEING_INITIALIZED==FAST_PTHREAD_ONCE_INIT-1, and FAST_PTHREAD_ONCE_INIT is the maximum value for the type of *once. The first assignment (line 5) to BEING_INITIALIZED is predicated on *once being FAST_PTHREAD_ONCE_INIT, and the update is atomic due to the use of a mutex. Therefore, a *once may make the transition from FAST_PTHREAD_ONCE_INIT to BEING_INITIALIZED at most once. The second assignment (line 10) occurs only if the executing thread performed the first assignment; therefore the second assignment occurs at most once. So any given *once location takes on a sequence of values that is a prefix of the sequence:
FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E
where E < FAST_PTHREAD_ONCE_INIT, E < BEING_INITIALIZED, and E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2}.
[safety_1] The function func is called at most once for each *once variable, because it is called only if the executing thread performed the assignment at line 5, which is performed at most once, from [*once_value_sequence]. This shows [safety_1].
[slow_path_safety_2] Any thread W that reaches line 10 has called func; therefore *once is finalized and the modifications from func are visible to W. If W reaches line 18, W either passed through line 10, or passed through line 13 and found *once != BEING_INITIALIZED. We know from [monotonicity] and [*once_value_sequence] that *once cannot be FAST_PTHREAD_ONCE_INIT (line 4) or BEING_INITIALIZED (line 13), so it must be some value E, a value in {0, ..., FAST_PTHREAD_ONCE_INIT-2}. The location *once takes on such a value only after being finalized. Moreover, accesses by a thread after line 13 seeing E are ordered after the assignment of E to *once and the modifications made by the call to func. This is from [release_consistency] and the facts that accesses at line 13 occur with the mutex held that the thread that finalized *once released the same mutex after doing so. Thus, if W reaches line 18, it does so after *once has been finalized, and with all modifications made by func visible to it. This shows [safety_2] for the slow path, which we call [slow_path_safety_2].
[fast_path_safety_2] We now need to show that any thread X that takes the fast path (just lines 1 and 2, with a false predicate at line 2), does so only if *once is finalized and func's modifications are visible to X. The value read from *once at line 1 is not read under mu, so the value read may be inconsistent. However, it is known to be one of the values
FAST_PTHREAD_ONCE_INIT, BEING_INITIALIZED, E
(where E is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} and E < FAST_PTHREAD_ONCE_INIT and E < BEING_INITIALIZED) because of [*once_bound], [atomicity2], and [*once_value_sequence]. Further, we know that X's _fast_pthread_once_per_thread_epoch is in {0, ..., FAST_PTHREAD_ONCE_INIT-2} because it is assigned only from global_epoch (line 17) and we have [global_epoch_value_range]. Therefore, X sees the line 2 predicate as false only if both:
From the first of these plus assumptions [*once_bound], [monotonicity], and [*once_value_sequence], we know that for X to take the fast path, some thread Y executed line 10, and therefore Y finalized *once. From the second of these, we know that X must have acquired mu (mu is held at line 17) after Y finalized *once and subsequently released mu. So from [release_consistency], all the values modified by func are visible to X. This gives us [fast_path_safety_2].
[safety] Combining [safety_1], [slow_path_safety_2], and [fast_path_safety_2], we have [safety].
[mu_deadlock_freedom] From the code, while mu is held, no other mutexes are acquired nor any other blocking call made, so mu cannot participate in a deadlock.
[liveness] Liveness comes from these observations
[correctness] We have [safety] and [liveness].
#include <stdio.h>
#include <assert.h>
#include "fast_pthread_once.h"
fast_pthread_once_t once = FAST_PTHREAD_ONCE_INIT;
void run_once(void)
{
    static int run_count = 0;
    assert( run_count == 0 );
    run_count++;
}
int main( int argc, char *argv[] )
{
    int use_inline = 0;
    int n = 1000 * 1000 * 1000;
    int i;
    for ( i = 1; i < argc; i++ ) {
        const char *arg = argv[i];
        if ( strcmp(arg, "-i") == 0 ) {
            use_inline = 1;
        } else if ( strspn(arg, "0123456789") == strlen(arg) ) {
            n = atoi(arg);
        } else {
            fprintf( stderr, "usage: %s [-i] [number]\n", argv[0] );
            return 2;
        }
    }
    if ( use_inline ) {
        for ( i = 0; i != n; i++ ) {
            fast_pthread_once_inline(&once, run_once);
        }
    } else {
        for ( i = 0; i != n; i++ ) {
            fast_pthread_once(&once, run_once);
        }
    }
    return 0;
}
There are two primary sources of inefficiency in code. First, there is a single mutex for maintaining the global epoch. Second, all waiting threads are woken for each completed initialization. We do not believe these inefficiencies will matter in practice, because they become significant only for many threads over many long-running initializations. (Recall that the mutex is not held during the actual initialization.) However, for pathological programs the following improvements should suffice.
*once location into a mutex
and update global_epoch atomically.
pthread_once_t address.
Unlike having multiple mutexes, this would not affect the fast path,
and could reduce the effect of premature waking by an arbitrary factor.