Document: WG14 N1196
Author: Lawrence Crowl
Date: 2006/10/23

C++ Threads

Lawrence Crowl

Google

Introduction

Standard support for multi-threading is a pressing need.

Effective internet programming often requires concurrency.
Exploiting multi-core processors requires parallelism.

The C++ approach is to standardize the current environment.

Language threads correspond to operating-system threads.
- heavyweight, preemptive, and independent
Threads communicate through shared memory.
Loosely based on POSIX Threads and Windows Threads.
Not a replacement for existing standards,
- e.g. OpenMP, MPI, automatic parallelism.

Memory Model

Presuming that all writes are instantly available to all threads is not viable.

It implies instant communication.
It does not match physics.
It does not match current hardware.
It inhibits most serial optimizations.

The standard adopts a message memory model.

Writes to memory are explicitly communicated between threads.
One thread does a release of its writes.
Another thread does a corresponding acquire of those writes.
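
The release/acquire pairing above can be sketched as follows. This is a minimal illustration that assumes the std::atomic interface as later standardized in C++11, not the function names proposed on these slides; payload, ready, and message_pass are hypothetical names.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready(false);   // flag that carries the release/acquire pair

int message_pass()
{
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // release publishes prior writes
    });
    std::thread consumer([&seen] {
        // The acquire load that observes 'true' receives the producer's writes.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = payload;                                // ordered after the write of 42
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Without the release store and acquire load, the read of payload would race with its write.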

Sequencing has been redefined.

Sequence points are gone.
Relations are sequenced-before and indeterminately-sequenced.
A write to and a read from a location that
- are neither sequenced-before nor indeterminately-sequenced
- result in undefined behavior.

Sequencing has been extended to concurrency.

Relations are synchronizes-with and happens-before,
- based on acquire and release.
A race condition is
- a non-atomic write to a location in one thread and
- a non-atomic read from that location in another thread that
- have no happens-before relationship.
The existence of a race condition makes the program undefined.

But what is a location?

A location is a non-bitfield primitive data object.
Adjacent bitfields together constitute a single location.
- Enables unsynchronized read-modify-write.
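
A small sketch of the location rule, assuming C++11 std::thread for the threads. The struct and function names are hypothetical. The two plain members are distinct locations, so concurrent unsynchronized writes to them do not race; the adjacent bitfields share one location and would need synchronization.

```cpp
#include <thread>

// Hypothetical type for illustration only.
struct flags
{
    unsigned a : 4;   // a and b are adjacent bitfields:
    unsigned b : 4;   //   together they form a single location
    char c;           // a separate location
    char d;           // another separate location
};

int write_distinct_locations()
{
    flags f = {};
    // c and d are different locations, so these writes need no synchronization.
    std::thread t1([&f] { f.c = 1; });
    std::thread t2([&f] { f.d = 2; });
    // Writing f.a in one thread and f.b in another would be a race,
    // because the implementation may read-modify-write the shared word.
    t1.join();
    t2.join();
    return f.c * 10 + f.d;
}
```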

Optimizers are affected.

Some speculative writes are no longer legal.
Some speculative reads are no longer legal.

Loops without synchronization may be assumed to terminate.

It is nearly always true.
It enables significant compiler optimizations.

Atomic Types and Operations

All threads observe the same sequence of values for an atomic type.

Atomic operations provide acquire, release, both, or neither.

Atomic without acquire or release has limited, but important, use cases.

Atomic types are structs, but could be primitive types.

They are statically initializable.
The operations are type-generic macros in C.
The operations are overloaded functions in C++.
The operations have operators defined in C++.
The default assignment operator is wrong, but
- it cannot be disabled in C and
- disabling it in C++ 98 breaks compatibility.
C++ has a template for making any type atomic.

The types are comprehensive over the important primitive types.

atomic_flag: test_set, clear
atomic_bool: load, store, swap,
atomic_integers: load, store, swap, compare_swap, fetch_{add,sub,and,ior,xor}
atomic_void_pointer: load, store, swap, compare_swap, fetch_{add,sub}

Atomics may be compiled by both languages.

atomic_flag v1 = ATOMIC_FLAG_INIT;
atomic_long v2 = { 1 };
atomic_void_pointer v3 = { 0 };
void func()
{
    if ( atomic_flag_test_set( & v1 ) )
        atomic_flag_clear( & v1 );
    long t = atomic_load_acquire( & v2 );
    atomic_compare_swap( & v2, t, t|1 );
    atomic_fetch_ior_ordered( & v2, 1 );
    atomic_fetch_add_ordered( & v3, 1 );
#ifdef __cplusplus
    long l1 = v2; v2 = 3; ++v2; v2 &= 7; v3 += 4;
#endif
}
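
For comparison, a rough C++-only counterpart of the example above, assuming the interface that C++11 eventually adopted. Its names differ from the proposal shown here: test_and_set, compare_exchange_strong, and fetch_or are C++11 spellings, and atomic_ops_demo is a hypothetical function name.

```cpp
#include <atomic>

long atomic_ops_demo()
{
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    std::atomic<long> v(1);

    if (f.test_and_set())        // returns the previous state (false on first call)
        f.clear();

    long t = v.load(std::memory_order_acquire);   // t == 1
    v.compare_exchange_strong(t, t | 2);          // v: 1 -> 3, succeeds because v == t
    v.fetch_or(4);                                // v: 3 -> 7
    v.fetch_add(1);                               // v: 7 -> 8
    ++v;                                          // v: 8 -> 9, atomic increment
    v &= 12;                                      // v: 9 & 12 -> 8
    return v.load();
}
```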

Atomic operations must be lock-free to be used in signals.

A macro will tell you if a type is lock-free.
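
As a sketch of such a query, assuming the C++11 facilities rather than the macro proposed at the time: ATOMIC_LONG_LOCK_FREE is 0 (never lock-free), 1 (sometimes), or 2 (always), and is_lock_free() answers for a particular object. The function names are hypothetical.

```cpp
#include <atomic>

// Compile-time answer: 0 = never, 1 = sometimes, 2 = always lock-free.
int long_lock_free_macro()
{
    return ATOMIC_LONG_LOCK_FREE;
}

// Run-time answer for one particular object.
bool long_is_lock_free()
{
    std::atomic<long> v(0);
    return v.is_lock_free();
}
```

Code intended for signal handlers would check for the value 2 before relying on the type.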

Atomic operations must be address-free to be used between processes.

If an operation is lock-free, it must also be address-free.

Sequential consistency is still not settled.

x and y are atomic and initially 0
thread 1: atomic_store( &x, 1 )
thread 2: atomic_store( &y, 1 )
thread 3: if ( atomic_load( &x ) == 1 && atomic_load( &y ) == 0 )
thread 4: if ( atomic_load( &y ) == 1 && atomic_load( &x ) == 0 )

Are the two conditions mutually exclusive?

That is, is there a total store order?
Some hardware designers say no.
Some hardware designers say yes.
Experts doubt whether mortals can program effectively without total store order.
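
The four-thread question can be tried out with the sketch below, in the form where each reader checks that it saw one store but not yet the other. It assumes C++11 std::atomic, whose default operations are sequentially consistent, so the two readers can never disagree about the order of the stores. iriw_disagreement is a hypothetical name.

```cpp
#include <atomic>
#include <thread>

// Returns true if, in any run, the two readers observed the stores
// to x and y in opposite orders.
bool iriw_disagreement(int runs)
{
    for (int i = 0; i < runs; ++i) {
        std::atomic<int> x(0), y(0);
        bool saw_x_first = false, saw_y_first = false;

        std::thread t1([&] { x.store(1); });
        std::thread t2([&] { y.store(1); });
        std::thread t3([&] { saw_x_first = x.load() == 1 && y.load() == 0; });
        std::thread t4([&] { saw_y_first = y.load() == 1 && x.load() == 0; });
        t1.join(); t2.join(); t3.join(); t4.join();

        if (saw_x_first && saw_y_first)
            return true;    // no single total order of the two stores
    }
    return false;
}
```

With weaker orderings such as acquire and release only, a disagreement would be permitted on some hardware.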

Thread-Local Storage

At least 5 vendors already implement the proposed facility.

Some have a slightly different syntax.

Define a new thread storage duration.

New __thread storage class.

Storage is unique to each thread.

Addresses of thread variables are not constant.

Thread storage is accessible to other threads.
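
A brief sketch of these properties, assuming the thread_local keyword that C++11 later adopted in place of the __thread spelling shown here. counter and thread_local_demo are hypothetical names.

```cpp
#include <thread>

thread_local int counter = 0;   // one instance per thread

int thread_local_demo()
{
    counter = 5;                     // the calling thread's instance
    int* main_copy = &counter;       // address is computed at run time
    int other_value = -1;
    bool same_address = true;

    std::thread t([&] {
        same_address = (&counter == main_copy);  // false: a different object
        other_value = counter;                   // this thread's fresh copy, 0
        *main_copy += 1;                         // another thread's storage is
                                                 //   still reachable by pointer
    });
    t.join();

    return same_address ? -1 : other_value * 100 + counter;
}
```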

Dynamic Initialization and Destruction

Initialization and destruction of static-duration variables is tricky.

Without implicit synchronization, there is the potential for data races.
With implicit synchronization, there is the potential for deadlock.

This problem does not exist in C.
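
One instance of the implicit-synchronization choice, sketched under the assumption of C++11 rules, where the first use of a function-local static is synchronized by the implementation. registry, instance, and concurrent_first_use are hypothetical names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> constructions(0);

struct registry
{
    registry() { ++constructions; }   // count how often the constructor runs
};

registry& instance()
{
    static registry r;   // dynamic initialization on first use
    return r;
}

int concurrent_first_use()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i)
        threads.emplace_back([] { instance(); });
    for (auto& t : threads)
        t.join();
    return constructions.load();   // exactly one thread ran the constructor
}
```

The deadlock risk remains: a constructor that waits on another thread which is itself blocked on the same initialization never completes.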

Thread Semantic Model

Initiate a thread with a fork on a function call.

Join waits for the function to return.

Is there a test for ready for join?
Is there a join with a timeout?
Does join return a function value?
In C++, what happens to threads that finish via an exception?

Mutexes provide mutual exclusion.

The standard will have at least a simple mutex.
The standard may have read-write mutexes.
The standard may have reentrant mutexes.
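
A minimal sketch of the simple mutex, assuming the std::mutex and std::lock_guard types that C++11 later provided. locked_count is a hypothetical name.

```cpp
#include <mutex>
#include <thread>
#include <vector>

int locked_count(int threads, int increments)
{
    int total = 0;        // shared, non-atomic
    std::mutex m;         // protects total
    std::vector<std::thread> pool;

    for (int i = 0; i < threads; ++i)
        pool.emplace_back([&] {
            for (int j = 0; j < increments; ++j) {
                std::lock_guard<std::mutex> hold(m);  // acquire; release at scope exit
                ++total;
            }
        });
    for (auto& t : pool)
        t.join();
    return total;
}
```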

Condition variables enable the monitor paradigm.

Threads may wait on a condition variable,
- giving up their hold on the mutex.
Is there a timed wait?
Threads may notify a condition variable;
- notified threads reacquire the mutex and
- must reevaluate any condition.

Thread termination is voluntary.

Return from outermost function.
Likely to have some form of cooperative termination.
Possibly have some form of synchronous cancellation.
Asynchronous cancellation has strong opposition.
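
Cooperative termination can be approximated today with a shared flag, as in this sketch. It assumes C++11 std::atomic and std::thread and is not a facility proposed on these slides; run_until_stopped is a hypothetical name.

```cpp
#include <atomic>
#include <thread>

long run_until_stopped()
{
    std::atomic<bool> stop_requested(false);
    std::atomic<bool> started(false);
    long iterations = 0;

    std::thread worker([&] {
        do {
            ++iterations;                 // one unit of work
            started.store(true);
            std::this_thread::yield();
        } while (!stop_requested.load()); // the worker polls, then returns voluntarily
    });

    while (!started.load())               // wait until the worker is running
        std::this_thread::yield();
    stop_requested.store(true);           // request, not force, termination
    worker.join();
    return iterations;
}
```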

Thread scheduling is limited.

Some form of yield is likely to be all.

Current Thread Implementation Approach

The thread model is based on a full C++ library implementation.

templates, functors, bind, and destructors


std::thread::handle my_handle =
	std::thread::create( std::bind( my_func, 1, "two" ) );
other_work();
std::thread::join( my_handle );
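
For a version that compiles today, the same create/join shape can be sketched with the C++11 std::thread class, which replaced the handle-based interface shown above. my_func's body, my_result, and create_and_join are hypothetical.

```cpp
#include <functional>
#include <string>
#include <thread>

std::string my_result;   // written by the thread, read after join

void my_func(int n, const char* s)
{
    my_result = std::to_string(n) + s;
}

std::string create_and_join()
{
    // Fork: the bound call runs in a new thread.
    std::thread worker(std::bind(my_func, 1, "two"));
    // ... other work may proceed here ...
    worker.join();       // join: wait for my_func to return
    return my_result;
}
```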

Locks hold a mutex within a given scope.

A variable represents the mutex acquire/release pair.
The release occurs in the destructor for the variable.

class buffer
{
    int head, tail, store[10];
    std::thread::mutex_timed mutex;
    std::thread::condition not_full, not_empty;
public:
    buffer() : head( 0 ) , tail( 0 ) { }

    void insert( int arg )
    {
        std::timeout wake( 1, 0 );
        lock scoped( mutex );
        while ( (head+1)%10 == tail )
            if ( not_full.timed_wait( wake ) )
                throw "buffer full too long";
        store[head] = arg; head = (head+1)%10;
        not_empty.notify();
    }
};
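
A runnable approximation of the buffer, assuming the C++11 library (std::mutex, std::unique_lock, std::condition_variable) instead of the draft names above. bounded_buffer, mtx, remove, and the limit parameter are hypothetical additions for illustration.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

class bounded_buffer
{
    int head, tail, store[10];
    std::mutex mtx;
    std::condition_variable not_full, not_empty;
public:
    bounded_buffer() : head( 0 ), tail( 0 ) { }

    void insert( int arg,
                 std::chrono::milliseconds limit = std::chrono::milliseconds( 1000 ) )
    {
        std::unique_lock<std::mutex> scoped( mtx );   // released in the destructor
        // wait_for re-checks the predicate after every wakeup and
        // returns false if it is still unsatisfied at the timeout.
        if ( !not_full.wait_for( scoped, limit,
                                 [this] { return (head+1)%10 != tail; } ) )
            throw std::runtime_error( "buffer full too long" );
        store[head] = arg; head = (head+1)%10;
        not_empty.notify_one();
    }

    int remove( std::chrono::milliseconds limit = std::chrono::milliseconds( 1000 ) )
    {
        std::unique_lock<std::mutex> scoped( mtx );
        if ( !not_empty.wait_for( scoped, limit,
                                  [this] { return head != tail; } ) )
            throw std::runtime_error( "buffer empty too long" );
        int value = store[tail]; tail = (tail+1)%10;
        not_full.notify_one();
        return value;
    }
};
```

One slot is always left unused so that head == tail unambiguously means empty; the buffer therefore holds at most nine values.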

Rejected Thread Implementation Approach

The C++ committee rejected a syntax-based approach.

It would have been C compatible.

The approach provided new operators for fork and join.

int function( int argument )
{
    int join pending = fork work1( argument );
    // work1 continues concurrently
    int value = work2( argument );
    return value + join pending;
}
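
The same fork/join shape can be expressed as a library call, sketched here with std::async and std::future from C++11. These are assumptions relative to this document, and the bodies of work1 and work2 and the name fork_join_sum are hypothetical.

```cpp
#include <future>

int work1( int a ) { return a * 2; }   // hypothetical work
int work2( int a ) { return a + 1; }   // hypothetical work

int fork_join_sum( int argument )
{
    // "fork": work1 starts on another thread.
    std::future<int> pending =
        std::async( std::launch::async, work1, argument );
    // work1 continues concurrently
    int value = work2( argument );
    // "join": get() waits for work1 and yields its return value.
    return value + pending.get();
}
```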

The rejected approach extended the set of control statements to manage synchronization.

struct buffer
{
    int head, tail, store[10];
    mutex_timed mutex;
    condition not_full, not_empty;
};

void buffer_insert( struct buffer *ptr, int arg )
{
    lock( ptr->mutex )
    {
        timeout wake = { 1, 0 };
        wait( ptr->not_full;
               (ptr->head+1)%10 != ptr->tail;
               wake )
        {
            ptr->store[ptr->head] = arg;
            ptr->head = (ptr->head+1)%10;
            notify( ptr->not_empty; 1 );
        }
        else
            failure( "buffer full too long" );
    }
    else
        failure( "too much buffer contention" );
}

Higher-Level Facilities

Higher-level facilities may be built on the above primitives.

e.g. thread pools, thread groups, parallel iterators, etc.

The committee has concerns that these facilities are not adequately field tested.

Probably will not be in the next C++ standard.
Probably will be in the next C++ library technical report.