Document Number:	N3989
Date:	2014-05-23
Revises:	N3960
Editor:	Jared Hoberock NVIDIA Corporation jhoberock@nvidia.com

Execution policies

2.1

In general

[parallel.execpol.general]

This subclause describes classes that represent execution policies. An execution policy is an object that expresses the requirements on the ordering of functions invoked as a consequence of the invocation of a standard algorithm. Execution policies afford standard algorithms the discretion to execute in parallel.

[ Example:

std::vector<int> v = ...

// standard sequential sort
std::sort(v.begin(), v.end());
std::sort(std::begin(v), std::end(v));

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, v.begin(), v.end());
sort(seq, std::begin(v), std::end(v));

// permitting parallel execution
sort(par, v.begin(), v.end());
sort(par, std::begin(v), std::end(v));

// permitting vectorization as well
sort(vec, v.begin(), v.end());
sort(vec, std::begin(v), std::end(v));

// sort with dynamically-selected execution
size_t threshold = ...
execution_policy exec = seq;
if(v.size() > threshold)
{
  exec = par;
}

sort(exec, v.begin(), v.end());
sort(exec, std::begin(v), std::end(v));

— end example ]

[ Note: Because different parallel architectures may require idiosyncratic parameters for efficient execution, implementations of the Standard Library should provide additional execution policies to those described in this Technical Specification as extensions. — end note ]

2.2

Header `<experimental/execution_policy>` synopsis

[parallel.execpol.synop]

namespace std {
namespace experimental {
namespace parallel {
  // 2.3, Execution policy type trait
  template<class T> struct is_execution_policy;

  // 2.4, Sequential execution policy
  class sequential_execution_policy;

  // 2.5, Parallel execution policy
  class parallel_execution_policy;

  // 2.6, Vector execution policy
  class vector_execution_policy;

  // 2.7, Dynamic execution policy
  class execution_policy;
}
}
}

2.3

Execution policy type trait

[parallel.execpol.type]

namespace std {
namespace experimental {
namespace parallel {
  template<class T> struct is_execution_policy
    : integral_constant<bool, see below> { };
}
}
}

is_execution_policy can be used to detect parallel execution policies for the purpose of excluding function signatures from otherwise ambiguous overload resolution participation.

If T is the type of a standard or implementation-defined execution policy, is_execution_policy<T> shall be publicly derived from integral_constant<bool,true>, otherwise from integral_constant<bool,false>.

The behavior of a program that adds specializations for is_execution_policy is undefined.
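A minimal sketch of how such a trait could be laid out, assuming one explicit specialization per policy type. The namespace `sketch` and the stand-in policy classes are hypothetical and only mirror the names in this Technical Specification; a real implementation may define the trait differently.

```cpp
#include <type_traits>

namespace sketch {  // hypothetical namespace, standing in for std::experimental::parallel

// Stand-ins for the policy classes of 2.4 through 2.6.
class sequential_execution_policy {};
class parallel_execution_policy {};
class vector_execution_policy {};

// Primary template: an arbitrary type is not an execution policy.
template<class T>
struct is_execution_policy : std::integral_constant<bool, false> {};

// One explicit specialization per recognized policy type.
template<>
struct is_execution_policy<sequential_execution_policy>
    : std::integral_constant<bool, true> {};
template<>
struct is_execution_policy<parallel_execution_policy>
    : std::integral_constant<bool, true> {};
template<>
struct is_execution_policy<vector_execution_policy>
    : std::integral_constant<bool, true> {};

}  // namespace sketch

// The trait is usable in constant expressions.
static_assert(sketch::is_execution_policy<sketch::parallel_execution_policy>::value,
              "parallel_execution_policy is an execution policy");
static_assert(!sketch::is_execution_policy<int>::value,
              "int is not an execution policy");
```

Because the result is a compile-time constant, the trait can feed `std::enable_if` to remove a function template from overload resolution.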

2.4

Sequential execution policy

[parallel.execpol.seq]

namespace std {
namespace experimental {
namespace parallel {
  class sequential_execution_policy{};
}
}
}

The class sequential_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm's execution may not be parallelized.

2.5

Parallel execution policy

[parallel.execpol.par]

namespace std {
namespace experimental {
namespace parallel {

 class parallel_execution_policy{};
}
}
}

The class parallel_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be parallelized.

2.6

Vector execution policy

[parallel.execpol.vec]

namespace std {
namespace experimental {
namespace parallel {

 class vector_execution_policy{};
}
}
}

The class vector_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be vectorized.

2.7

Dynamic execution policy

[parallel.execpol.dynamic]

namespace std {
namespace experimental {
namespace parallel {

  class execution_policy
  {
    public:
      // 2.7.1, execution_policy construct/assign
      template<class T> execution_policy(const T& exec);
      template<class T> execution_policy& operator=(const T& exec);

      // 2.7.2, execution_policy object access
      const type_info& type() const noexcept;
      template<class T> T* get() noexcept;
      template<class T> const T* get() const noexcept;
  };
}
}
}

The class execution_policy is a dynamic container for execution policy objects. execution_policy allows dynamic control over standard algorithm execution.

[ Example:

std::vector<float> sort_me = ...
        
using namespace std::experimental::parallel;
execution_policy exec = seq;

if(sort_me.size() > threshold)
{
  exec = par;
}

sort(exec, sort_me.begin(), sort_me.end());
sort(exec, std::begin(sort_me), std::end(sort_me));

— end example ]

Objects of type execution_policy shall be constructible and assignable from objects of type T for which is_execution_policy<T>::value is true.

2.7.1

`execution_policy` construct/assign

[parallel.execpol.con]

template<class T> execution_policy(const T& exec);

Effects:: Constructs an execution_policy object with a copy of exec's state.
Requires:: is_execution_policy<T>::value is true.
Remarks:: This constructor shall not participate in overload resolution unless is_execution_policy<T>::value is true.

template<class T> execution_policy& operator=(const T& exec);

Effects:: Assigns a copy of exec's state to *this.
Returns:: *this.
Requires:: is_execution_policy<T>::value is true.
Remarks:: This operator shall not participate in overload resolution unless is_execution_policy<T>::value is true.

2.7.2

`execution_policy` object access

[parallel.execpol.access]


          const type_info& type() const noexcept;

Returns:: typeid(T), such that T is the type of the execution policy object contained by *this.


          template<class T> T* get() noexcept;
          template<class T> const T* get() const noexcept;

Returns:: If type() == typeid(T), a pointer to the stored execution policy object; otherwise a null pointer.
Requires:: is_execution_policy<T>::value is true.
Remarks:: This function shall not participate in overload resolution unless is_execution_policy<T>::value is true.
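The following is a rough sketch of one way a type-erased policy holder with this interface might be written. It is an illustration under stated assumptions, not the specified implementation: the namespace `sketch`, the stand-in policy classes, the helper `choose_policy`, and the use of `std::shared_ptr<void>` for storage are all hypothetical choices made here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>
#include <typeinfo>

namespace sketch {  // hypothetical namespace

class sequential_execution_policy {};
class parallel_execution_policy {};

template<class T> struct is_execution_policy : std::false_type {};
template<> struct is_execution_policy<sequential_execution_policy> : std::true_type {};
template<> struct is_execution_policy<parallel_execution_policy> : std::true_type {};

class execution_policy {
public:
    // Construct from any recognized policy; other types are rejected by enable_if.
    template<class T,
             class = typename std::enable_if<is_execution_policy<T>::value>::type>
    execution_policy(const T& exec)
        : type_(&typeid(T)), object_(std::make_shared<T>(exec)) {}

    template<class T,
             class = typename std::enable_if<is_execution_policy<T>::value>::type>
    execution_policy& operator=(const T& exec) {
        type_ = &typeid(T);
        object_ = std::make_shared<T>(exec);
        return *this;
    }

    // Type of the contained policy object.
    const std::type_info& type() const noexcept { return *type_; }

    // Pointer to the contained object if it has type T, otherwise null.
    template<class T,
             class = typename std::enable_if<is_execution_policy<T>::value>::type>
    T* get() noexcept {
        return *type_ == typeid(T) ? static_cast<T*>(object_.get()) : nullptr;
    }
    template<class T,
             class = typename std::enable_if<is_execution_policy<T>::value>::type>
    const T* get() const noexcept {
        return *type_ == typeid(T) ? static_cast<const T*>(object_.get()) : nullptr;
    }

private:
    const std::type_info* type_;
    // Copies of execution_policy share the stored object in this sketch,
    // which is acceptable only because the stand-in policies are stateless.
    std::shared_ptr<void> object_;
};

// Hypothetical helper: choose a policy from the problem size at run time.
inline execution_policy choose_policy(std::size_t n, std::size_t threshold) {
    execution_policy exec = sequential_execution_policy();
    if (n > threshold) {
        exec = parallel_execution_policy();
    }
    return exec;
}

}  // namespace sketch

// Usage: select a policy dynamically, then inspect what is stored.
inline void dynamic_policy_example() {
    sketch::execution_policy exec = sketch::choose_policy(1000, 100);
    assert(exec.type() == typeid(sketch::parallel_execution_policy));
    assert(exec.get<sketch::parallel_execution_policy>() != nullptr);
    assert(exec.get<sketch::sequential_execution_policy>() == nullptr);
}
```

An algorithm receiving such an object could test `get<T>()` for each policy type it supports and dispatch to the matching implementation.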

2.8

Execution policy objects

[parallel.execpol.objects]

namespace std {
namespace experimental {
namespace parallel {
  constexpr sequential_execution_policy seq = sequential_execution_policy();
  constexpr parallel_execution_policy   par = parallel_execution_policy();
  constexpr vector_execution_policy     vec = vector_execution_policy();
}
}
}

The header <experimental/execution_policy> declares a global object associated with each type of execution policy defined by this Technical Specification.

Parallel algorithms

[parallel.alg]

4.1

In general

[parallel.alg.general]

This clause describes components that C++ programs may use to perform operations on containers and other sequences in parallel.

4.1.1

Effect of execution policies on algorithm execution

[parallel.alg.general.exec]

Parallel algorithms have template parameters named ExecutionPolicy which describe the manner in which the execution of these algorithms may be parallelized and the manner in which they apply user-provided function objects.

The applications of function objects in parallel algorithms invoked with an execution policy object of type sequential_execution_policy execute in sequential order in the calling thread.

The applications of function objects in parallel algorithms invoked with an execution policy object of type parallel_execution_policy are permitted to execute in an unordered fashion in unspecified threads, and indeterminately sequenced within each thread. [ Note: It is the caller's responsibility to ensure correctness, for example that the invocation does not introduce data races or deadlocks. — end note ]

[ Example:

using namespace std::experimental::parallel;
int a[] = {0,1};
std::vector<int> v;
for_each(par, std::begin(a), std::end(a), [&](int i) {
  v.push_back(i*2+1);
});

The program above has a data race because of the unsynchronized access to the container v. — end example ]

[ Example:

using namespace std::experimental::parallel;
std::atomic<int> x{0};
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int n) {
  x.fetch_add(1, std::memory_order_relaxed);
  // spin wait for another iteration to change the value of x
  while (x.load(std::memory_order_relaxed) == 1) { }
});

The above example depends on the order of execution of the iterations, and is therefore undefined (may deadlock). — end example ]

[ Example:

using namespace std::experimental::parallel;
int x = 0;
std::mutex m;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});

The above example synchronizes access to object x ensuring that it is incremented correctly. — end example ]

The applications of function objects in parallel algorithms invoked with an execution policy of type vector_execution_policy are permitted to execute in an unordered fashion in unspecified threads, and unsequenced within each thread. [ Note: As a consequence, function objects governed by the vector_execution_policy policy must not synchronize with each other. Specifically, they must not acquire locks. — end note ]

[ Example:

using namespace std::experimental::parallel;
int x = 0;
std::mutex m;
int a[] = {1,2};
for_each(vec, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});

The above program is invalid because the applications of the function object are not guaranteed to run on different threads. — end example ]

[ Note: The application of the function object may result in two consecutive calls to m.lock on the same thread, which may deadlock. — end note ]

[ Note: The semantics of the parallel_execution_policy or the vector_execution_policy invocation allow the implementation to fall back to sequential execution if the system cannot parallelize an algorithm invocation due to lack of resources. — end note ]

A parallel algorithm invoked with an execution policy object of type parallel_execution_policy or vector_execution_policy may apply iterator member functions of a stronger category than its specification requires, if such functions exist. In this case, the application of these member functions is subject to the provisions stated above for parallel_execution_policy and vector_execution_policy, respectively.

[ Note: For example, an algorithm whose specification requires InputIterator but receives a concrete iterator of the category RandomAccessIterator may use operator[]. In this case, it is the algorithm caller's responsibility to ensure operator[] is race-free. — end note ]

Algorithms invoked with an execution policy object of type execution_policy execute internally as if invoked with the contained execution policy object.

The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are unspecified.

4.1.2

`ExecutionPolicy` algorithm overloads

[parallel.alg.overloads]

Parallel algorithms coexist alongside their sequential counterparts as overloads distinguished by a formal template parameter named ExecutionPolicy. This is the first template parameter and corresponds to the parallel algorithm's first function parameter, whose type is ExecutionPolicy&&.

Unless otherwise specified, the semantics of ExecutionPolicy algorithm overloads are identical to those of their overloads without an ExecutionPolicy parameter.

Parallel algorithms shall not participate in overload resolution unless is_execution_policy<ExecutionPolicy>::value is true.
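To illustrate why this constraint matters, here is a hedged sketch built around a hypothetical `equal_sketch` overload pair. Both overloads take four arguments, so without the constraint a call such as `equal_sketch(a.begin(), a.end(), b.begin(), pred)` would be expected to match both templates. The namespace `sketch`, the `result` struct, and the use of `std::decay` on the forwarding-reference parameter are assumptions of this example, not text from this Technical Specification.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <type_traits>
#include <vector>

namespace sketch {  // hypothetical namespace

struct sequential_execution_policy {};
constexpr sequential_execution_policy seq{};

template<class T> struct is_execution_policy : std::false_type {};
template<> struct is_execution_policy<sequential_execution_policy> : std::true_type {};

// Records the answer and which overload produced it (for demonstration only).
struct result {
    bool equal;
    bool used_policy_overload;
};

// Sequential form: (first1, last1, first2, pred).
template<class InputIterator1, class InputIterator2, class BinaryPredicate>
result equal_sketch(InputIterator1 first1, InputIterator1 last1,
                    InputIterator2 first2, BinaryPredicate pred) {
    return { std::equal(first1, last1, first2, pred), false };
}

// ExecutionPolicy form: (exec, first1, last1, first2).
// The defaulted template argument removes this overload from overload
// resolution when the first argument is not an execution policy.
template<class ExecutionPolicy, class InputIterator1, class InputIterator2,
         class = typename std::enable_if<is_execution_policy<
             typename std::decay<ExecutionPolicy>::type>::value>::type>
result equal_sketch(ExecutionPolicy&&, InputIterator1 first1,
                    InputIterator1 last1, InputIterator2 first2) {
    return { std::equal(first1, last1, first2), true };
}

}  // namespace sketch

inline void overload_example() {
    std::vector<int> a{1, 2, 3};
    std::vector<int> b{1, 2, 3};

    // All three iterators have the same type, so only the constraint
    // keeps the ExecutionPolicy overload out of consideration here.
    sketch::result r1 =
        sketch::equal_sketch(a.begin(), a.end(), b.begin(), std::equal_to<int>());
    assert(r1.equal && !r1.used_policy_overload);

    // Passing a policy object selects the ExecutionPolicy overload.
    sketch::result r2 = sketch::equal_sketch(sketch::seq, a.begin(), a.end(), b.begin());
    assert(r2.equal && r2.used_policy_overload);
}
```

Since `seq` is an lvalue, `ExecutionPolicy` deduces to a reference type here, which is why this sketch applies `std::decay` before consulting the trait.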

The algorithms listed in Table 1 shall have ExecutionPolicy overloads.

Table 1 —
`adjacent_difference`	`adjacent_find`	`all_of`	`any_of`
`copy`	`copy_if`	`copy_n`	`count`
`count_if`	`equal`	`exclusive_scan`	`fill`
`fill_n`	`find`	`find_end`	`find_first_of`
`find_if`	`find_if_not`	`for_each`	`for_each_n`
`generate`	`generate_n`	`includes`	`inclusive_scan`
`inner_product`	`inplace_merge`	`is_heap`	`is_heap_until`
`is_partitioned`	`is_sorted`	`is_sorted_until`	`lexicographical_compare`
`max_element`	`merge`	`min_element`	`minmax_element`
`mismatch`	`move`	`none_of`	`nth_element`
`partial_sort`	`partial_sort_copy`	`partition`	`partition_copy`
`reduce`	`remove`	`remove_copy`	`remove_copy_if`
`remove_if`	`replace`	`replace_copy`	`replace_copy_if`
`replace_if`	`reverse`	`reverse_copy`	`rotate`
`rotate_copy`	`search`	`search_n`	`set_difference`
`set_intersection`	`set_symmetric_difference`	`set_union`	`sort`
`stable_partition`	`stable_sort`	`swap_ranges`	`transform`
`uninitialized_copy`	`uninitialized_copy_n`	`uninitialized_fill`	`uninitialized_fill_n`
`unique`	`unique_copy`

4.2

Definitions

[parallel.alg.defns]

Define GENERALIZED_SUM(op, a_1, ..., a_N) as follows:

a_1 when N is 1
op(GENERALIZED_SUM(op, b_1, ..., b_M), GENERALIZED_SUM(op, b_{M+1}, ..., b_N)) where
- b_1, ..., b_N may be any permutation of a_1, ..., a_N and
- 0 < M < N.

Define GENERALIZED_NONCOMMUTATIVE_SUM(op, a_1, ..., a_N) as follows:

a_1 when N is 1
op(GENERALIZED_NONCOMMUTATIVE_SUM(op, a_1, ..., a_M), GENERALIZED_NONCOMMUTATIVE_SUM(op, a_{M+1}, ..., a_N)) where 0 < M < N.
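As a worked illustration (not part of the specification), the case N = 3 can be expanded by hand, assuming the two operands partition the sequence so that the first covers elements 1 through M and the second covers elements M+1 through N. The abbreviation GNS for GENERALIZED_NONCOMMUTATIVE_SUM and the choice of subtraction as op are made only for this example.

```latex
\begin{align*}
M = 1:&\quad \mathrm{GNS}(op, a_1, a_2, a_3) = op\bigl(a_1,\ op(a_2, a_3)\bigr) \\
M = 2:&\quad \mathrm{GNS}(op, a_1, a_2, a_3) = op\bigl(op(a_1, a_2),\ a_3\bigr) \\[4pt]
\text{With } op(x, y) = x - y \text{ and } (a_1, a_2, a_3) = (1, 2, 3):&\quad
1 - (2 - 3) = 2, \qquad (1 - 2) - 3 = -4
\end{align*}
```

Both groupings are permitted, so the result is determined only when op is associative. GENERALIZED_SUM additionally allows the operands to be permuted first, for example op(a_3, op(a_1, a_2)), so it is determined only when op is also commutative.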

4.3

Novel algorithms

[parallel.alg.novel]

This subclause describes novel algorithms introduced by this Technical Specification.

4.3.1

Header `<experimental/algorithm>` synopsis

[parallel.alg.novel.algorithms.synop]

namespace std {
namespace experimental {
namespace parallel {
  template<class ExecutionPolicy,
           class InputIterator, class Function>
    void for_each(ExecutionPolicy&& exec,
                  InputIterator first, InputIterator last,
                  Function f);
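  template<class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n,
                             Function f);
  template<class ExecutionPolicy,
           class InputIterator, class Size, class Function>
    InputIterator for_each_n(ExecutionPolicy&& exec,
                             InputIterator first, Size n,
                             Function f);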
}
}
}

4.3.2

For each

[parallel.alg.novel.foreach]


          template<class ExecutionPolicy,
                   class InputIterator, class Function>
            void for_each(ExecutionPolicy&& exec,
                          InputIterator first, InputIterator last,
                          Function f);

Effects:: Applies f to the result of dereferencing every iterator in the range [first,last). [ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Complexity:: Applies f exactly last - first times.
Remarks:: If f returns a result, the result is ignored.
Notes:: Unlike its sequential form, the parallel overload of for_each does not return a copy of its Function parameter, since parallelization may not permit efficient state accumulation. Also unlike its sequential form, the parallel overload of for_each requires Function to meet the requirements of CopyConstructible rather than only MoveConstructible.


          template<class InputIterator, class Size, class Function>
            InputIterator for_each_n(InputIterator first, Size n,
                                     Function f);

Requires:: Function shall meet the requirements of MoveConstructible. [ Note: Function need not meet the requirements of CopyConstructible. — end note ]
Effects:: Applies f to the result of dereferencing every iterator in the range [first,first + n), starting from first and proceeding to first + n - 1. [ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Returns:: first + n for non-negative values of n and first for negative values.
Remarks:: If f returns a result, the result is ignored.


          template<class ExecutionPolicy,
                   class InputIterator, class Size, class Function>
            InputIterator for_each_n(ExecutionPolicy&& exec,
                                     InputIterator first, Size n,
                                     Function f);

Effects:: Applies f to the result of dereferencing every iterator in the range [first,first + n), starting from first and proceeding to first + n - 1. [ Note: If the type of first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]
Returns:: first + n for non-negative values of n and first for negative values.
Remarks:: If f returns a result, the result is ignored.
Notes:: Unlike its sequential form, the parallel overload of for_each_n requires Function to meet the requirements of CopyConstructible rather than only MoveConstructible.
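A minimal sequential sketch of the for_each_n semantics described above, written under the assumption that a plain loop is an acceptable reference behavior. The name `for_each_n_sketch` and the namespace `sketch` are hypothetical; the sketch does not attempt any parallel execution.

```cpp
#include <cassert>
#include <vector>

namespace sketch {  // hypothetical namespace

// Applies f to the first n elements starting at first and returns the
// iterator one past the last element visited. A negative n visits nothing
// and returns first unchanged.
template<class InputIterator, class Size, class Function>
InputIterator for_each_n_sketch(InputIterator first, Size n, Function f) {
    for (Size i = 0; i < n; ++i, ++first) {
        f(*first);
    }
    return first;
}

}  // namespace sketch

inline void for_each_n_example() {
    std::vector<int> v{1, 2, 3, 4};
    int sum = 0;

    // Visit only the first three elements.
    auto it = sketch::for_each_n_sketch(v.begin(), 3, [&sum](int x) { sum += x; });
    assert(sum == 6);
    assert(it == v.begin() + 3);

    // A negative count leaves the iterator where it started.
    auto same = sketch::for_each_n_sketch(v.begin(), -1, [](int) {});
    assert(same == v.begin());
}
```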

4.3.3

Header `<experimental/numeric>`

[parallel.alg.novel.numeric.synop]

namespace std {
namespace experimental {
namespace parallel {
  template<class InputIterator>
    typename iterator_traits<InputIterator>::value_type
      reduce(InputIterator first, InputIterator last);
  template<class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last, T init);
  template<class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result);
  template<class InputIterator, class OutputIterator,
           class T>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init);
  template<class InputIterator, class OutputIterator,
           class T, class BinaryOperation>
    OutputIterator
      exclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init, BinaryOperation binary_op);

  template<class InputIterator, class OutputIterator>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result);
  template<class InputIterator, class OutputIterator,
           class BinaryOperation>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     BinaryOperation binary_op);
  template<class InputIterator, class OutputIterator,
           class T, class BinaryOperation>
    OutputIterator
      inclusive_scan(InputIterator first, InputIterator last,
                     OutputIterator result,
                     T init, BinaryOperation binary_op);
}
}
}

4.3.4

Reduce

[parallel.alg.novel.reduce]


          template<class InputIterator>
            typename iterator_traits<InputIterator>::value_type
              reduce(InputIterator first, InputIterator last);

Returns:: reduce(first, last, typename iterator_traits<InputIterator>::value_type{})
Requires:: typename iterator_traits<InputIterator>::value_type{} shall be a valid expression. The operator+ function associated with iterator_traits<InputIterator>::value_type shall not invalidate iterators or subranges, nor modify elements in the range [first,last).
Complexity:: O(last - first) applications of operator+.
Notes:: The primary difference between reduce and accumulate is that the behavior of reduce may be non-deterministic for non-associative or non-commutative operator+.


          template<class InputIterator, class T>
            T reduce(InputIterator first, InputIterator last, T init);

Returns:: reduce(first, last, init, plus<>())
Requires:: The operator+ function associated with T shall not invalidate iterators or subranges, nor modify elements in the range [first,last).
Complexity:: O(last - first) applications of operator+.
Notes:: The primary difference between reduce and accumulate is that the behavior of reduce may be non-deterministic for non-associative or non-commutative operator+.


          template<class InputIterator, class T, class BinaryOperation>
            T reduce(InputIterator first, InputIterator last, T init,
                     BinaryOperation binary_op);

Returns:: GENERALIZED_SUM(binary_op, init, *first, ..., *(first + (last - first) - 1)).
Requires:: binary_op shall not invalidate iterators or subranges, nor modify elements in the range [first,last).
Complexity:: O(last - first) applications of binary_op.
Notes:: The primary difference between reduce and accumulate is that the behavior of reduce may be non-deterministic for non-associative or non-commutative binary_op.
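The difference from accumulate can be made concrete with a small single-threaded sketch. The hypothetical `reduce_sketch` below combines its operands by recursive halving, which is one grouping that GENERALIZED_SUM permits; an actual implementation may group, order, or parallelize differently. The names `sketch`, `tree_sum`, and `reduce_sketch` are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

namespace sketch {  // hypothetical namespace

// Combines items[lo, hi) by recursive halving. Precondition: lo < hi.
template<class T, class BinaryOperation>
T tree_sum(const std::vector<T>& items, std::size_t lo, std::size_t hi,
           BinaryOperation binary_op) {
    if (hi - lo == 1) {
        return items[lo];
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    return binary_op(tree_sum(items, lo, mid, binary_op),
                     tree_sum(items, mid, hi, binary_op));
}

// Treats init followed by the input range as the operand list and
// reduces it with a balanced grouping instead of a left fold.
template<class InputIterator, class T, class BinaryOperation>
T reduce_sketch(InputIterator first, InputIterator last, T init,
                BinaryOperation binary_op) {
    std::vector<T> items;
    items.push_back(init);
    items.insert(items.end(), first, last);
    return tree_sum(items, 0, items.size(), binary_op);
}

}  // namespace sketch

inline void reduce_example() {
    const std::vector<int> v{1, 2, 3, 4};

    // Addition is associative and commutative: both groupings agree.
    assert(sketch::reduce_sketch(v.begin(), v.end(), 0, std::plus<int>()) == 10);
    assert(std::accumulate(v.begin(), v.end(), 0, std::plus<int>()) == 10);

    // Subtraction is neither: the left fold and the balanced grouping differ.
    assert(std::accumulate(v.begin(), v.end(), 0, std::minus<int>()) == -10);
    assert(sketch::reduce_sketch(v.begin(), v.end(), 0, std::minus<int>()) == -4);
}
```

With subtraction, the left fold computes ((((0 - 1) - 2) - 3) - 4), while the balanced grouping computes (0 - 1) - (2 - (3 - 4)).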

4.3.5

Exclusive scan

[parallel.alg.novel.exclusive.scan]


          template<class InputIterator, class OutputIterator,
                   class T>
            OutputIterator
              exclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             T init);

Returns:: exclusive_scan(first, last, result, init, plus<>())
Requires:: The operator+ function associated with iterator_traits<InputIterator>::value_type shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).
Complexity:: O(last - first) applications of operator+.
Notes:: The primary difference between exclusive_scan and inclusive_scan is that exclusive_scan excludes the ith input element from the ith sum. If the operator+ function is not mathematically associative, the behavior of exclusive_scan may be non-deterministic.


          template<class InputIterator, class OutputIterator,
                   class T, class BinaryOperation>
            OutputIterator
              exclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             T init, BinaryOperation binary_op);

Effects:: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result) - 1)).
Returns:: The end of the resulting range beginning at result.
Requires:: binary_op shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).
Complexity:: O(last - first) applications of binary_op.
Notes:: The primary difference between exclusive_scan and inclusive_scan is that exclusive_scan excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of exclusive_scan may be non-deterministic.

4.3.6

Inclusive scan

[parallel.alg.novel.inclusive.scan]


          template<class InputIterator, class OutputIterator>
            OutputIterator
              inclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result);

Returns:: inclusive_scan(first, last, result, plus<>())
Requires:: The operator+ function associated with iterator_traits<InputIterator>::value_type shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).
Complexity:: O(last - first) applications of operator+.
Notes:: The primary difference between exclusive_scan and inclusive_scan is that exclusive_scan excludes the ith input element from the ith sum. If the operator+ function is not mathematically associative, the behavior of inclusive_scan may be non-deterministic.


          template<class InputIterator, class OutputIterator,
                   class BinaryOperation>
            OutputIterator
              inclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             BinaryOperation binary_op);
          
          template<class InputIterator, class OutputIterator,
                   class T, class BinaryOperation>
            OutputIterator
              inclusive_scan(InputIterator first, InputIterator last,
                             OutputIterator result,
                             T init, BinaryOperation binary_op);

Effects:: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, *first, ..., *(first + (i - result))) or GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *first, ..., *(first + (i - result))) if init is provided.
Returns:: The end of the resulting range beginning at result.
Requires:: binary_op shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).
Complexity:: O(last - first) applications of binary_op.
Notes:: The primary difference between exclusive_scan and inclusive_scan is that inclusive_scan includes the ith input element in the ith sum. If binary_op is not mathematically associative, the behavior of inclusive_scan may be non-deterministic.
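The contrast between the two scans can be shown with sequential reference sketches. These are illustrations only: they use a left-to-right grouping, which is one of the groupings GENERALIZED_NONCOMMUTATIVE_SUM allows, and the names `exclusive_scan_sketch`, `inclusive_scan_sketch`, and the namespace `sketch` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <vector>

namespace sketch {  // hypothetical namespace

// Output i receives the sum of init and the inputs before position i.
template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
OutputIterator exclusive_scan_sketch(InputIterator first, InputIterator last,
                                     OutputIterator result, T init,
                                     BinaryOperation binary_op) {
    T running = init;
    for (; first != last; ++first, ++result) {
        // Read the input before writing, so the scan also works in place.
        T next = binary_op(running, *first);
        *result = running;
        running = next;
    }
    return result;
}

// Output i receives the sum of init and the inputs up to and including position i.
template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
OutputIterator inclusive_scan_sketch(InputIterator first, InputIterator last,
                                     OutputIterator result, T init,
                                     BinaryOperation binary_op) {
    T running = init;
    for (; first != last; ++first, ++result) {
        running = binary_op(running, *first);
        *result = running;
    }
    return result;
}

}  // namespace sketch

inline void scan_example() {
    const std::vector<int> in{1, 2, 3, 4};
    std::vector<int> ex(in.size());
    std::vector<int> inc(in.size());

    sketch::exclusive_scan_sketch(in.begin(), in.end(), ex.begin(), 0, std::plus<int>());
    sketch::inclusive_scan_sketch(in.begin(), in.end(), inc.begin(), 0, std::plus<int>());

    assert((ex == std::vector<int>{0, 1, 3, 6}));
    assert((inc == std::vector<int>{1, 3, 6, 10}));
}
```

For the same input and init, each inclusive result equals the matching exclusive result combined with the current input element.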

Working Draft, Technical Specification for C++ Extensions for Parallelism

General

Scope

Normative references

Namespaces and headers

Terms and definitions

Execution policies

In general

Header `<experimental/execution_policy>` synopsis

Execution policy type trait

Sequential execution policy

Parallel execution policy

Vector execution policy

Dynamic execution policy

`execution_policy` construct/assign

`execution_policy` object access

Execution policy objects

Parallel exceptions

Exception reporting behavior

Header `<experimental/exception_list>` synopsis

Parallel algorithms

In general

Effect of execution policies on algorithm execution

`ExecutionPolicy` algorithm overloads

Definitions

Novel algorithms

Header `<experimental/algorithm>` synopsis

For each

Header `<experimental/numeric>`

Reduce

Exclusive scan

Inclusive scan