1. Overview
US 161 and US 184 identify issues relating to the ExecutionPolicy (e.g.
parallel) signatures of the four pre-C++17 sequence algorithms (accumulate, inner_product, partial_sum and adjacent_difference) found in <numeric>.
All of these issues revolve around the strength of the ordering requirements of
these algorithms and the new <numeric> algorithms with
weaker ordering guarantees that were introduced in the Parallelism TS v1 and
C++17 (reduce, transform_reduce, inclusive_scan, exclusive_scan, transform_inclusive_scan and transform_exclusive_scan).
This paper proposes two types of changes:
- 
     Addition/removal of <numeric>sequence algorithm signatures.
- 
     Changes to the order of parameters of <numeric>sequence algorithm signatures.
1.1. inner_product Cannot Be Parallelized
   Currently, accumulate, inner_product and partial_sum use
"accumulator-style" wording:
26.8.2 Accumulate [accumulate]:template <class InputIterator, class T> T accumulate(InputIterator first, InputIterator last, T init); template <class InputIterator, class T, class BinaryOperation> T accumulate(InputIterator first, InputIterator last, T init, BinaryOperation binary_op);1 Effects: Computes its result by initializing the accumulator
accwith the initial valueinitand then modifies it withacc = acc + *ioracc = binary_op(acc, *i)for every iteratoriin the range[first, last)in order.2 Requires:
Tshall meet the requirements ofCopyConstructible(Table 22) andCopyAssignable(Table 24) types. In the range[first, last],binary_opshall neither modify elements nor invalidate iterators or subranges.
26.8.5 Inner product [inner.product]:template <class InputIterator1, class InputIterator2, class T> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2);1 Effects: Computes its result by initializing the accumulator
accwith the initial valueinitand then modifying it withacc = acc + (*i1) * (*i2)oracc = binary_op1(acc, binary_op2(*i1, *i2))for every iteratori1in the range[first1, last1)and iteratori2in the range[first2, first2 + (last1 - first1))in order.2 Requires:
Tshall meet the requirements ofCopyConstructible(Table 22) andCopyAssignable(Table 24) types. In the ranges[first1, last1]and[first2, first2 + (last1 - first1)]binary_op1andbinary_op2shall neither modify elements nor invalidate iterators or subranges.
26.8.6 Partial sum [partial.sum]template <class InputIterator, class OutputIterator> OutputIterator partial_sum(InputIterator first, InputIterator last, OutputIterator result); template <class InputIterator, class OutputIterator, class BinaryOperation> OutputIterator partial_sum(InputIterator first, InputIterator last, OutputIterator result, BinaryOperation binary_op);1 Effects: For a non-empty range, the function creates an accumulator
accwhose type isInputIterator's value type, initializes it with*first, and assigns the result to*result. For every iteratoriin[first + 1, last)in order,accis then modified byacc = acc + *ioracc = binary_op(acc, *i)and the result is assigned to*(result + (i - first)).2 Returns:
result + (last - first).3 Complexity: Exactly
(last - first) - 1applications of the binary operation.4 Requires:
InputIterator's value type shall be constructible from the type of*first. The result of the expressionacc + *iorbinary_op(acc, *i)shall be implicitly convertible toInputIterator's value type.accshall be writable (24.2.1) to theresultoutput iterator. In the ranges[first, last]and[result, result + (last - first)]binary_opshall neither modify elements nor invalidate iterators or subranges.5 Remarks:
resultmay be equal tofirst.
This wording specifies that these algorithms perform the following operations:
- 
     Create an accumulator accwhich is initialized withinit.
- 
     For (the) iterator(s) in the range(s) in order, modify accby applying an update operation which takesaccand the dereferenced value(s) of the iterator(s) as arguments.
The ordering requirement is necessary to ensure that these algorithms are well-defined for non-associative and/or non-commutative operations, such as floating-point addition and multiplication (non-associative and commutative), and subtraction (non-associative and non-commutative).
Currently, the wording forces each iteration to depend on the prior iteration to ensure the correct ordering of accumulation. This introduces a loop-carried dependency that makes it impossible to parallelize.
To parallelize accumulate and inner_product we need to be able to
re-order applications of the operations, partition the workload into sub-tasks
and then combine the results of the sub-tasks together using the operator.
This, however, would give a non-deterministic result for non-associative or
non-commutative operations.
For partial_sum, re-ordering is not possible due to the nature of the
algorithm. However, partitioning is possible. Thus, you get a non-deterministic
result for non-associative operations but it is fine for non-commutative
operations.
To allow parallelization of accumulate and partial_sum without relaxing
the existing ordering, new algorithms with weaker ordering were introduced
instead of adding ExecutionPolicy signatures of the existing ones: reduce (a weaker accumulate) and inclusive_scan (a weaker partial_sum).
However, for inner_product, no new algorithm with weaker ordering was
introduced and an ExecutionPolicy signature was added instead. US 184
points out that this signature is not useful as it cannot be parallelized due
to the ordering constraints.
1.2. inner_product is a Form of transform_reduce
   Just as reduce is the parallelizable counterpart of accumulate (same basic
operation but without a specific ordering), inner_product has a natural
counter-part. Consider this possible implementation of inner_product:
template <class InputIt1, class InputIt2, class T, class ReductionOperation, class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op, BinaryTransformOperation transform_op) { while (first1 != last1) { init = reduce_op(init, transform_op(*first1, *first2)); ++first1; ++first2; } return value; }
The application of transform_op to the sequence is a binary transform:
template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op) { while (first1 != last1) { *o_first++ = transform_op(*first1++, *first2++); } return o_first; }
And the application of reduce_op accumulates the transformed values:
template <class InputIt, class T, class ReductionOperation> T accumulate(InputIt first, InputIt last, T init, ReductionOperation reduce_op) { for (; first != last; ++first) { init = reduce_op(init, *first); } return init; }
    So, inner_product is an 
    accumulatesreduces
of a transformed sequence. It’s a transform_reduce!.
   
1.3. transform_reduce is Missing an Overload
   However, transform_reduce is missing an signature for a binary transform;
only signatures with unary transform operations are currently specified by the
C++17 CD. I call this missing signature binary-binary transform_reduce, since both the reduction and transformation
operation are binary operations.
This signature, which is equivalent to parallel inner_product with weaker
ordering, is very useful. Many of the typical examples of transform_reduce usage that I’ve seen use tricks to perform a binary-binary transform_reduce using the unary-binary transform_reduce that is in the C++17 CD.
The typical transform_reduce dot product example (similar to what is
found in the original transform_reduce proposal [N3960]) looks like this:
std::vector<std::tuple<double, double> > XY = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. XY.begin(), XY.end(), // Unary transformation operation. [](std::tuple<double, double> const& xy) { // Array of struct means this is tricky to execute on vector hardware: // // memory layout: X[0] Y[0] X[1] Y[1] X[2] Y[2] X[3] Y[3] ... // // op #0: load a pack of Xs (they aren’t contiguous; the load will be // strided, may need to access multiple // cache lines and may be harder for the // hardware prefetcher to handle) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by vector of Ys return std::get<0>(xy) * std::get(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
Note that this is array-of-structs NOT struct-of-arrays. The HPX and THRUST dot product examples use iterator tricks (zip iterators or counting iterators) to switch to a struct-of-arrays scheme. The zip iterator trick is shown below, using Boost:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. boost::make_zip_iterator(X.begin(), Y.begin()), boost::make_zip_iterator(X.end(), Y.end()), // Unary transformation operation. [](auto&& xy) // std::tuple<double, double>-esque { // Struct of arrays means this is easier to execute on vector hardware: // // memory layout: X[0] X[1] X[2] X[3] ... Y[0] Y[1] Y[2] Y[3] ... // // op #0: load a pack of Xs (elements are contiguous, load will access // only one cache line and the hardware // prefetcher will easily track the memory // stream) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by the pack of Ys return std::get<0>(xy) * std::get<1>(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
More examples of this pattern can be found in HPX (zip iterator example and counting iterator example) and THRUST (zip iterator example).
1.4. transform_reduce Parameter Order is Odd
   The missing binary-binary transform_reduce would look like this:
template <class InputIt1, class InputIt2, class BinaryTransformOperation, class T, class ReductionOperation> // Always binary. T transform_reduce(InputIt1 first1, InputIt1 last1, InputIt2 first2, BinaryTransformOperation transform_op, T init, ReductionOperation reduce_op);
Note that the order of parameters, which is consistent with the other transform_reduce signatures, is inconsistent with inner_product, transform and accumulate:
template <class InputIt1, class UnaryTransformOperation, class T, class ReductionOperation> // Always binary. T transform_reduce(InputIt1 first1, InputIt1 last1, UnaryTransformOperation transform_op, T init, ReductionOperation reduce_op); template <class InputIt1, class InputIt2, class T> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init); template <class InputIt1, class InputIt2, class T, class ReductionOperation, // Always binary. class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op, BinaryTransformOperation transform_op); template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op); template <class InputIt1, class T, class ReductionOperation> T accumulate(InputIt1 first1, InputIt1 last1, T init, ReductionOperation reduce_op);
The inconsistencies are:
- 
     In inner_product,accumulateandreduce(which is not shown but has the same interface asaccumulate), theinitparameter comes after the iterator parameters and before the operation parameters. Intransform_reduce, it is in between the transformation operation parameter and the reduction operation parameter.
- 
     In inner_product, the reduction operation parameter comes before the transformation operation parameter. Intransform_reduce, the reduction operation parameter comes before the transformation operation parameter.
1.5. transform_inclusive_scan and transform_exclusive_scan Parameter Order is Odd
   Like transform_reduce, the new transform_inclusive_scan and transform_exclusive_scan algorithms have parameter orders that are
inconsistent with existing algorithms that they are closely related to, like accumulate, partial_sum and inner_product. The specific issues:
- 
     In exclusive_scan, theinitparameter comes after the iterator parameters and before the operation parameters. Intransform_exclusive_scan, theinitparameter is after the transformation operation parameter and the reduction operation parameter.
- 
     In inclusive_scan, the reduction operation parameter comes before the transformation operation parameter. Intransform_inclusive_scanandtransform_exclusive_scan, the reduction operation parameter comes before the transformation operation parameter.
2. Proposed Resolution
The following changes would resolve the issues with the <numeric> raised by US 161 and US 184:
- 
     Remove the ExecutionPolicysignature forinner_product.
- 
     Add binary-binary transform_reducesignatures (four signatures: with and withoutExecutionPolicyparameters, and with and without defaulted operations).
- 
     Change the parameter order for all transform_reducesignatures so that:- 
       The initial value parameter comes after the iterator parameters and before the operation parameters. 
- 
       The reduction operation parameter comes before the transformation operation parameters. 
 
- 
       
- 
     Change the parameter order for transform_*_scansignatures so that:- 
       For transform_exclusive_scan, the initial value parameter comes after the iterator parameters and before the operation parameters.
- 
       For transform_inclusive_scanandtransform_exclusive_scan, the reduction operation parameter comes before the transformation operation parameters.
 
- 
       
With these changes, a parallel dot product over struct-of-arrays data can be written as:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce(std::par_unseq, x.begin(), x.end(), y.begin(), double(0.0));
A parallel word count could be written with binary-binary transform_reduce as:
bool is_word_beginning(char left, char right) { return std::isspace(left) && !std::isspace(right); } std::size_t word_count(std::string_view s) { if (s.empty()) return 0; // Count the number of characters that start a new word. return std::transform_reduce( std::par_unseq, // "Left" input sequence: s[0], s[1], ..., s[s.size() - 2] s.begin(), s.end() - 1, // "Right" input sequence: s[1], s[2], ..., s[s.size() - 1] s.begin() + 1, // Initial value for reduction: if the first character // is not a space, then it’s the beginning of a word. std::size_t(!std::isspace(s.front()) ? 1 : 0), // Binary transformation operation. std::plus<std::size_t>(), // Binary transformation operation: Return 1 when we hit // a new word. is_word_beginning ); }
3. FAQ
Q: Why does this need to be fixed now?
- 
     inner_productcurrently has ordering guarantees which prevent parallelization. If we fix them in the future, it will have to be a breaking change.
- 
     The order of parameters in some of the new algorithms from the Parallelism TS v1 is inconsistent. If we fix them in the future, it will have to be a breaking change. 
Q: The initial value parameter comes before the operation parameters in exclusive_scan, but after the operation parameters in inclusive_scan. Isn’t this also inconsistent?
Initially, I thought this was inconsistent. However, I’ve since learned from
LWG that this was done intentionally. The goal is to keep parameters fixed
relative to the other signatures for that algorithm. For example, since inclusive_scan does not require an initial value parameter, it would be odd
to change from:
OutputIt inclusive_scan(InputIt, InputIt, OutputIt); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, BinaryOperation); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, BinaryOperation, T init);
to this:
OutputIt inclusive_scan(InputIt, InputIt, OutputIt); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, BinaryOperation); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, T init, BinaryOperation);
because it makes the order of BinaryOperation inconsistent.
Q: Why does this paper not address the parallel signature of adjacent_difference?
adjacent_difference uses accumulator style wording and lives in
<numeric>, but it is not truly an accumulation, and has none of the
associativity and commutativity concerns of the other <numeric>
algorithms. SG1 preferred to follow the approach taken with for_each and
relax the ordering requirement for the parallel signature of adjacent_difference instead of adding a new algorithm with weaker guarantees. [P0574r1] contains the necessary changes for adjacent_difference.
4. Wording
The following changes are relative to the latest working paper as of 03-02-2017 ([N4640]) with the changes from [P0467r2] applied.
The � character is used to denote a placeholder section number which the editor shall choose.
4.1. Wording for transform_reduce/inner_product
   Apply the following changes to the declarations of transform_reduce in the <numeric> header synopsis in 26.8.1 [numerics.ops.overview]:
template <class InputIterator1, class InputIterator2, class T> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2); template <class InputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(InputIterator first, InputIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T> T transform_reduce(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2); template <class ExecutionPolicy, class ForwardIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator first, ForwardIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class InputIterator1, class InputIterator2, class T> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2);template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class T>T inner_product(ExecutionPolicy&& exec, // see 25.2.5ForwardIterator1 first1, ForwardIterator1 last1,ForwardIterator2 first2, T init);template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class T, class BinaryOperation1, class BinaryOperation2>T inner_product(ExecutionPolicy&& exec, // see 25.2.5ForwardIterator1 first1, ForwardIterator1 last1,ForwardIterator2 first2, T init,BinaryOperation1 binary_op1,BinaryOperation2 binary_op2);
Apply the following changes to the definitions of transform_reduce in 26.8.4 [transform.reduce]:
26.8.4 Transform reduce [transform.reduce]:template <class InputIterator1, class InputIterator2, class T> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T> T transform_reduce(ExecutionPolicy&& exec, ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init);� Effects: Equivalent to:return transform_reduce(first1, last1, first2, init, plus<>(), multiplies<>());template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(ExecutionPolicy&& exec, ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2);� Returns:GENERALIZED_SUM(binary_op1, init, binary_op2(*i, *(first2 + (i - first1))), ...)for every iteratoriin[first1, last1).� Requires: Neither
binary_op1norbinary_op2shall invalidate subranges, or modify elements in the ranges[first1, last1)and[first2, first2 + (last1 - first1)).� Complexity:
O(last1 - first1)applications each ofbinary_op1andbinary_op2.template <class InputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(InputIterator first, InputIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(ExecutionPolicy&& exec, ForwardIterator first, ForwardIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op);1 Returns:
GENERALIZED_SUM(binary_op, init, unary_op(*i), ...)for every iteratoriin[first, last).2 Requires: Neither
unary_opnorbinary_opshall invalidate subranges, or modify elements in the range[first, last).3 Complexity:
O(last - first)applications each ofunary_opandbinary_op.4 Notes:
transform_reducedoes not applyunary_optoinit.
Direction to the Editor: Reorder the declarations and definitions of inner_product to come before the declarations of transform_reduce() in the <numeric> header synopsis and class definitions in 26.8 to
maintain consistency with accumulate/reduce and partial_sum/inclusive_scan.
4.2. Wording for transform_exclusive_scan
   Apply the following changes to the declarations of transform_exclusive_scan in the <numeric> header synopsis in 26.8.1
[numerics.ops.overview]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_exclusive_scan(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op);
Apply the following changes to the definitions of transform_exclusive_scan in 26.8.9 [transform.exclusive.scan]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_exclusive_scan(ExecutionPolicy&& exec, ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op);
4.3. Wording for transform_inclusive_scan
   Apply the following changes to the declarations of transform_inclusive_scan in the <numeric> header synopsis in 26.8.1
[numerics.ops.overview]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init);
Apply the following changes to the definitions of transform_inclusive_scan in 26.8.10 [transform.inclusive.scan]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init);
5. Acknowledgement
Thanks to:
- 
     Agustín Bergé for helping identify this issue. 
- 
     Alisdair Meredith for assisting with wording and tackling the related problem with adjacent_difference.
- 
     Hartmut Kaiser, JF Bastien, Michael Garland and Jared Hoberock for providing feedback.