| Document Number: | N2907=09-0097 | 
| Date: | 2009-06-18 | 
| Author: | Anthony Williams, Just Software Solutions Ltd | 
This paper discusses a suggestion I made on the LWG reflector
  and cpp-thread mailing list to address the issues raised in N2880
  surrounding the lifetime of thread_local variables.
The basic idea of this proposal is that the lifetime
  of thread_local variables is tied to the lifetime of an
  instance of the new class thread_local_context. Each
  thread has an implicit instance of such a class constructed prior to
  the invocation of the thread function, and destroyed after
  completion of the thread function, but additional instances can be
  created in order to deliberately limit the lifetime
  of thread_local variables: when
  a thread_local_context object is destroyed, all
  the thread_local variables tied to it are also
  destroyed.
This enables us to address several of the concerns of
  N2880. Firstly, if we use a mechanism other than thread::join
  to wait for a thread to complete its work — such as waiting for a
  unique_future to be ready — then N2880 correctly
  highlights that under the current working paper the destructors
  of thread_local variables will still be running after
  the waiting thread has resumed. By judicious use
  of a thread_local_context instance and block scoping,
  we can ensure that the thread_local variables are
  destroyed before the future value is set. e.g.
#include <future>
#include <iostream>
#include <thread>
int find_the_answer();
void thread_func(std::promise<int> * p)
{
    int local_result;
    {
        thread_local_context context; // create a new context for thread_locals
        local_result=find_the_answer();
    } // destroy thread_local variables along with the context object
    p->set_value(local_result);
}
int main()
{
    std::promise<int> p;
    std::thread t(thread_func,&p);
    t.detach(); // we're going to wait on the future
    std::cout<<p.get_future().get()<<std::endl;
}
When the call to get() returns, we know that not only
  is the future value ready, but the thread_local
  variables on the other thread have also been destroyed.
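The behaviour that motivates this pattern can be observed in standard C++, with no thread_local_context involved. The following is a minimal sketch rather than part of the proposal, and tracker, destroyed and the other names are hypothetical. It records whether the destructor of a thread_local object has run just before the thread function returns, and again after join():

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical flags used only to observe when the destructor runs.
std::atomic<bool> destroyed{false};
std::atomic<bool> destroyed_before_return{true};

struct tracker {
    ~tracker() { destroyed = true; }  // runs when the owning thread exits
    void touch() {}
};

void thread_func() {
    thread_local tracker t;  // constructed when control first passes here
    t.touch();
    // Last statement of the thread function: t has not been destroyed yet.
    destroyed_before_return = destroyed.load();
}

// True if the destructor ran after the thread function finished its work
// but before join() returned.
bool destructor_runs_after_thread_function() {
    destroyed = false;
    destroyed_before_return = true;
    std::thread worker(thread_func);
    worker.join();
    return !destroyed_before_return.load() && destroyed.load();
}
```

Anything the thread function signals from its body, such as setting a promise value, therefore happens before these destructors run, which is the gap the nested context in the example above is meant to close.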
A second concern of N2880 was the potential for accumulating vast
  amounts of thread_local variables when reusing threads
  for multiple independent tasks, such as when implementing a thread
  pool. Under such circumstances, the thread pool implementation can
  wrap each task inside a scope containing a
  thread_local_context variable to ensure that when a
  task is completed its thread_local variables are
  destroyed in a timely fashion. e.g.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
std::mutex task_mutex;
std::queue<std::function<void()>> tasks;
std::condition_variable task_cond;
bool done=false;
void worker_thread()
{
    std::unique_lock<std::mutex> lk(task_mutex);
    while(!done)
    {
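        // also wake when done is set, so shutdown cannot block here forever
        task_cond.wait(lk,[]{return done||!tasks.empty();});
        if(tasks.empty())
            break; // woken for shutdown with no work left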
        std::function<void()> task=tasks.front();
        tasks.pop();
        lk.unlock();
        {
            thread_local_context context;
            task();
        }
        lk.lock();
    }
}
With this scheme, the thread_local variables are
  destroyed between each task invocation when
  the thread_local_context object is destroyed, so if the
  sets of variables used by the tasks do not overlap then the problem
  of increasing memory usage is avoided.
Obviously, such a class would have to be tightly integrated with
  the mechanism for thread_local variables used by a
  compiler, so that they can be destroyed at the appropriate points,
  and constructed again if necessary. This is a key point: for
  the second scenario to work, if
  a thread_local_context is destroyed and a fresh one
  constructed, then any thread_local variables used during
  the lifetime of the new context object must be created afresh, even if
  they were already created and destroyed during the lifetime of a
  prior context object on the same thread.
This does mean that implementations are pretty much restricted to
  initializing thread_local variables on first use, with
  a mechanism that allows the destructor
  of thread_local_context objects to reset that "first
  use" flag. If the thread_local_context is implemented
  with compiler intrinsics then the compiler may still be able to
  find optimization opportunities that allow batching of
  initializations or less-frequent checking of the "first use"
  flag.
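The "first use" mechanism can be approximated in ordinary library code, which may make the requirement easier to picture. The following is only a sketch under stated assumptions: scoped_context and lazy_local are hypothetical names, not part of the proposal, and a real thread_local_context would need compiler support to cover every thread_local variable rather than only those wrapped explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <utility>
#include <vector>

// Hypothetical user-level stand-in for thread_local_context. It only knows
// about objects that register a cleanup through at_context_exit().
class scoped_context {
public:
    scoped_context() : first_(cleanups().size()) {}
    scoped_context(const scoped_context&) = delete;
    scoped_context& operator=(const scoped_context&) = delete;

    ~scoped_context() {
        // Run the cleanups registered during this context, newest first.
        auto& list = cleanups();
        while (list.size() > first_) {
            std::function<void()> cleanup = std::move(list.back());
            list.pop_back();
            cleanup();
        }
    }

    static void at_context_exit(std::function<void()> cleanup) {
        cleanups().push_back(std::move(cleanup));
    }

private:
    static std::vector<std::function<void()>>& cleanups() {
        thread_local std::vector<std::function<void()>> list;
        return list;
    }
    std::size_t first_;  // number of cleanups that belong to outer scopes
};

// Hypothetical wrapper: the object is constructed on first use, and the null
// pointer plays the role of the "first use" flag that the context resets.
// Simplification: an object first used outside any scoped_context is never
// destroyed by this sketch.
template <typename T>
class lazy_local {
public:
    T& get() {
        if (!object_) {
            object_ = new (storage_) T();
            scoped_context::at_context_exit([this] {
                object_->~T();
                object_ = nullptr;  // reset the "first use" flag
            });
        }
        return *object_;
    }

private:
    alignas(T) unsigned char storage_[sizeof(T)];
    T* object_ = nullptr;
};

int constructions = 0;  // counts how often scratch is constructed
struct scratch {
    scratch() { ++constructions; }
    int value = 0;
};

thread_local lazy_local<scratch> per_task;

// Runs two "tasks" on the calling thread, each inside its own context.
bool each_task_gets_a_fresh_object() {
    constructions = 0;
    int stale_values_seen = 0;
    for (int task = 0; task < 2; ++task) {
        scoped_context context;
        if (per_task.get().value != 0) ++stale_values_seen;
        per_task.get().value = 42;  // must not leak into the next task
    }
    return constructions == 2 && stale_values_seen == 0;
}
```

Each loop iteration plays the role of one task in worker_thread: the object is constructed once per context, and the value written by the first task is gone when the second task runs.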
thread_local_context object lifetimes
There are interesting issues surrounding the behaviour of code
  with nested thread_local_context objects. Is such
  nesting allowed at all? What happens to thread_local
  variables that have already been assigned values when
  a thread_local_context object is constructed? What
  about pointers to such variables?
I believe there are several possible answers to these questions, and I will address each in turn.
Certainly it could be argued that things are simpler if nesting is
  disallowed, and the use cases primarily point
  to thread_local_context being used high up in the call
  chain either directly in the thread function or not many levels
  down. However, I think this is an unnecessary restriction. What I do
  believe is important, however, is that lifetimes are properly nested,
  and a couple of simple rules should be enforced:
1. Destruction of a thread_local_context object should
    be done on the same thread as construction, and
2. Destruction of thread_local_context objects must be
    in the reverse order of construction.
If these rules are not obeyed then std::terminate
  should be called in the destructor of
  the thread_local_context object being executed when the
  violation is discovered.
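One way these checks might be implemented is with a per-thread pointer to the innermost live context. The sketch below illustrates only the lifetime checks and does not manage any thread_local variables; checked_context and may_be_destroyed_now are hypothetical names introduced for the illustration.

```cpp
#include <cassert>
#include <exception>
#include <thread>

// Hypothetical class showing only the nesting checks.
class checked_context {
public:
    checked_context()
        : owner_(std::this_thread::get_id()), parent_(innermost()) {
        innermost() = this;  // this context is now the innermost one
    }
    checked_context(const checked_context&) = delete;
    checked_context& operator=(const checked_context&) = delete;

    // True when destroying *this now would obey both rules: same thread as
    // construction, and no context constructed later is still alive.
    bool may_be_destroyed_now() const {
        return owner_ == std::this_thread::get_id() && innermost() == this;
    }

    ~checked_context() {
        if (!may_be_destroyed_now())
            std::terminate();  // violation discovered in the destructor
        innermost() = parent_;
    }

private:
    static checked_context*& innermost() {
        thread_local checked_context* top = nullptr;
        return top;
    }
    std::thread::id owner_;
    checked_context* parent_;  // context that was innermost at construction
};

// Exercises properly nested lifetimes without triggering std::terminate.
bool nesting_checks_behave() {
    checked_context outer;
    bool outer_ok_alone = outer.may_be_destroyed_now();
    bool outer_blocked = false;
    bool inner_ok = false;
    {
        checked_context inner;
        outer_blocked = !outer.may_be_destroyed_now();  // inner is still alive
        inner_ok = inner.may_be_destroyed_now();
    }
    return outer_ok_alone && outer_blocked && inner_ok &&
           outer.may_be_destroyed_now();
}
```

Block-scoped automatic objects satisfy both rules by construction, so the checks only matter for contexts that are heap-allocated or held as members of objects handed to another thread.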
What happens to thread_local variables with values
  assigned prior to construction of
  a thread_local_context?
The importance of this question can be neatly demonstrated by the following example. Note that this example does not use a nested context, but the same issues apply, and the answer should be the same in examples that do use nested contexts (if we permit them).
#include <iostream>
static thread_local int i=0;
int main()
{
    i=42;
    {
        thread_local_context context;
        std::cout<<i<<",";
        i=123;
    }
    std::cout<<i<<std::endl;
}
What does this program print?
I can see use cases for both option 1 (42,123) and option 2
  (0,42). Option 3 is only there as a straw man — the whole
  point of the context objects is that thread_local
  objects created within the lifetime of the context object are then
  destroyed when the context object is destroyed.
Though potentially tempting, I think that undefined behaviour or
  termination is undesirable as it would be hard to identify the
  problem when looking at the source code, and it would be easy to
  trigger such behaviour by calling a function that
  used thread_local prior to the construction of the
  context.
So, which of options 1 and 2 do we go for? I favour option 2: the
  construction of the context object creates a "clean slate"
  for thread_local variables.
The downside of doing so is that any library that
  uses thread_local data structures as a cache for
  optimization purposes (such as an allocator with thread-local heaps)
  will have to recreate those structures within each context, even
  though it might be desirable to preserve such structures across
  contexts. For example, with the worker_thread in the
  code above it might be desirable to preserve per-thread heaps across
  task invocations to avoid repeatedly constructing/destructing the
  heap. 
However, I believe that this downside is outweighed by the clarity
  of the code: with option 2, within a
  new thread_local_context you know that you have a
  "clean slate", and that no thread_local variables have
  values left from another scope. With option 1, our worker
  thread example would suddenly start "leaking" values from one task
  to another if a variable happened to be used in the code outside
  the context. With option 2 this is not possible, as each task gets a
  new copy of all the variables.
What about pointers to thread_local variables?
Let's look at our example again, but this time we'll also store the
  address of i in a normal local variable p,
  and dereference this pointer inside the context.
#include <iostream>
static thread_local int i=0;
int main()
{
    i=42;
    int* p=&i;
    {
        thread_local_context context;
        std::cout<<i<<",";
        std::cout<<*p<<",";
        *p=99;
        i=123;
    }
    std::cout<<i<<std::endl;
}
What does this example print now?
I believe that options 1 and 2 here are the behaviours that best
  correspond to options 1 and 2 for the lifetime issues: if we
  preserve values from the parent context (option 1)
  then *p and i refer to the same
  variable. On the other hand, if we go for option 2 (the "clean
  slate" option), then *p refers to the variable from the
  outer context, whereas in the nested context i refers
  to the new variable (which thus has a different address.)
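The "clean slate" reading of this example can be modelled in ordinary C++ to check the reasoning about values and addresses. The sketch below is a single-threaded model of one variable only, not an implementation of the proposal: modelled_int stands in for the thread_local int i, modelled_context stands in for thread_local_context, and both names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical model of ONE thread_local int under the "clean slate" option.
class modelled_int {
public:
    explicit modelled_int(int initial) : initial_(initial) { enter_context(); }
    // A new context gets a freshly initialized copy with its own address.
    void enter_context() {
        copies_.push_back(std::unique_ptr<int>(new int(initial_)));
    }
    // Leaving a context destroys that context's copy.
    void leave_context() { copies_.pop_back(); }
    // Naming the variable always refers to the innermost context's copy.
    int& current() { return *copies_.back(); }

private:
    int initial_;
    std::vector<std::unique_ptr<int>> copies_;  // heap slots: stable addresses
};

// Hypothetical stand-in for thread_local_context, tied to a single variable.
class modelled_context {
public:
    explicit modelled_context(modelled_int& v) : variable_(v) {
        variable_.enter_context();
    }
    ~modelled_context() { variable_.leave_context(); }
    modelled_context(const modelled_context&) = delete;
    modelled_context& operator=(const modelled_context&) = delete;

private:
    modelled_int& variable_;
};

// Mirrors the pointer example, writing to a string instead of std::cout.
std::string pointer_example_under_clean_slate() {
    std::ostringstream out;
    modelled_int i(0);  // models: static thread_local int i=0;
    i.current() = 42;
    int* p = &i.current();
    {
        modelled_context context(i);
        assert(p != &i.current());  // i now names a different object
        out << i.current() << ",";  // the new, freshly initialized copy
        out << *p << ",";           // still the outer context's copy
        *p = 99;
        i.current() = 123;
    }
    out << i.current();             // the outer copy, as modified through p
    return out.str();
}
```

In this model the function returns "0,42,99": the write through p is visible after the context ends, while the assignment of 123 is discarded along with the inner copy.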
I think the third and fourth alternatives are understandable from
  an implementation perspective if we go for the "clean slate" option,
  but not desirable. The third alternative corresponds to an
  implementation that magically saves the values of
  the thread_local variables when the new context is
  initialized, and reuses the same addresses to refer to the value of
  that thread_local variable in the current context. For
  example, this could be done on a segmented architecture
  where thread_local variables live in a special segment,
  and that segment is remapped for the new context, and the mapping
  restored when the context is destroyed. However, I think this is
  undesirable behaviour — we allow pointers
  to thread_local variables to be passed between threads,
  and I think this is directly analogous: we should also allow
  pointers to thread_local variables to be passed between
  contexts in a single thread. The fourth option (Undefined
  behaviour) is just a "give implementors freedom" option, but I
  think it is undesirable for the reasons just given, and because I
  think undefined behaviour should not be introduced without very
  good cause.
std::packaged_task and the proposed std::async function
If this proposal is adopted, then it could be used as part of an
  implementation of std::async (as proposed in N2889 and
  N2901) to ensure that the associated future did not become ready
  before the thread-local variables for the asynchronous task had been
  destroyed.
This proposal could also be integrated
  with std::packaged_task to ensure that the contained
  task was run in its own context, and that the context was destroyed
  (and the future result value stored) before the future became
  ready. This would allow end users to write a simple function for
  spawning a task with a return value on a new thread without having
  to worry about the issue of destruction of thread-local
  variables. However, it could potentially yield surprising behaviour
  if the task was invoked directly on an existing thread, particularly
  if the "clean slate" option was chosen.
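The ordering that such an integration would need to guarantee can be sketched with std::promise alone. The helper below is an assumption-laden illustration, not proposed wording or a real std::packaged_task implementation: logging_context is a hypothetical stand-in for thread_local_context that merely records when it is destroyed, and run_scoped, run_in_own_context and event_log are likewise hypothetical names.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <utility>

std::string event_log;  // hypothetical record of the order of events

// Stand-in for thread_local_context: only logs its own destruction.
struct logging_context {
    ~logging_context() { event_log += "context destroyed;"; }
};

// Runs the task inside its own context. The context is destroyed when this
// function returns, after the result has been computed.
template <typename Result, typename Task>
Result run_scoped(Task& task) {
    logging_context context;
    return task();
}

// Makes the future ready only after the task's context has been destroyed.
// Exception propagation is left out of this sketch.
template <typename Result, typename Task>
void run_in_own_context(std::promise<Result>& result, Task task) {
    Result value = run_scoped<Result>(task);
    event_log += "future made ready;";
    result.set_value(std::move(value));
}

// Runs a task on the calling thread and returns the recorded order of events.
std::string order_of_events() {
    event_log.clear();
    std::promise<int> result;
    std::future<int> ready = result.get_future();
    run_in_own_context(result, [] {
        event_log += "task ran;";
        return 42;
    });
    assert(ready.get() == 42);
    return event_log;
}
```

The demonstration invokes the task directly on the calling thread, which is exactly the situation the paragraph above flags as potentially surprising: with a real thread_local_context and the "clean slate" option, the caller's own thread_local values would not be visible inside the task.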
I have no proposed wording at this time. If the committee agrees to proceed with this, then I can work to provide wording.
Thanks to Alberto Ganesh Barbati, Peter Dimov, Lawrence Crowl, Beman Dawes and others who have commented on this proposal on the mailing lists and via personal email.