Doc. No.: | N 1665 |
---|---|
Date: | 2012-11-01 |
Reply to: | Clark Nelson |
This specification encompasses C++ as well as C. As a result, many things are said that are not relevant to C, and for simplicity, the terminology and conventions of C++ are used where they differ from those of C.
Any use of pragma syntax in this document may be interpreted as a placeholder for a syntax more suitable for standardization.
Cilk is a trademark of Intel Corporation in the U.S. and/or other countries.
More information about Intel® Cilk™ Plus can be found at cilk.com
This document is part of the Intel® Cilk™ Plus Language Specification. The language specification comprises a set of technical specifications describing the language and the run-time support for the language. Together, these documents provide the detail needed to implement a compliant compiler. At this time the language specification contains these parts:
This document defines the Intel® Cilk™ Plus extension to C and C++. The language extension is supported by a run-time, user-mode, work-stealing task scheduler which is not directly exposed to the application programmer. However, some of the semantics of the language and some of the guarantees provided require specific behavior of the task scheduler. The programmer-visible parts of the language include the following constructs:

Keywords (_Cilk_spawn, _Cilk_sync and _Cilk_for) to express tasking.

An implementation of the language may take advantage of all parallelism resources available in the hardware. On a typical CPU, these include at least multiple cores and vector units. Some of the language constructs, e.g. _Cilk_spawn, utilize only core parallelism; some, e.g. SIMD loops, utilize only vector parallelism; and some, e.g. elemental functions, utilize both. The defined behavior of every deterministic Cilk program is the same as the behavior of a similar C or C++ program known as the serialization.
While execution of a C or C++ program may be considered as a linear sequence of statements, execution of a tasking program is in general a directed acyclic graph. Parallel control flow may yield a new kind of undefined behavior, a data race, whereby parts of the program that may execute in parallel access the same memory location in an indeterminate order, with at least one of the accesses being a write access. In addition, if an exception is thrown, code may still be executed that would not have been executed in a serial execution.
Cilk Plus adds the following new keywords:
_Cilk_sync
_Cilk_spawn
_Cilk_for
A program that uses these keywords other than as defined in the grammar extension below is ill-formed.
The header <cilk/cilk.h> defines the following aliases for the Cilk keywords:

#define cilk_spawn _Cilk_spawn
#define cilk_sync  _Cilk_sync
#define cilk_for   _Cilk_for
The three keywords are used in the following new productions:

_Cilk_sync ;

The call production of the grammar is modified to permit the keyword _Cilk_spawn before the expression denoting the function to be called:

_Cilk_spawnopt postfix-expression ( expression-listopt )
Consecutive _Cilk_spawn
tokens are not permitted. The postfix-expression
following _Cilk_spawn
is called a spawned function. The
spawned function may be a normal function call, a member-function call, or the function-call
(parentheses) operator of a function object (functor) or a call to a lambda expression.
Overloaded operators other than the parentheses operator may be spawned only by
using the function-call notation (e.g., operator+(arg1,arg2)
). There
shall be no more than one _Cilk_spawn
within a full expression. A function
that contains a spawn statement is called a spawning function.
A program is ill-formed if the _Cilk_spawn form of this expression appears other than in one of the following contexts:

as the entire body of an expression statement,
as the entire right-hand side of an assignment expression that is the entire body of an expression statement, or
as the entire initializer-clause in a simple declaration.

(A _Cilk_spawn expression may be permitted in more contexts in the future.)
A statement with a _Cilk_spawn on the right-hand side of an assignment or declaration is called an assignment spawn or initializer spawn, respectively, and the object assigned or initialized by the spawn is called the receiver.
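For illustration, the following sketch (not normative; fib and g are hypothetical functions) shows the three permitted contexts and the associated terminology:

int fib(int n);   // hypothetical spawned functions
void g();

int result;

void example()
{
    _Cilk_spawn g();                  // spawn as an expression statement
    result = _Cilk_spawn fib(20);     // assignment spawn; result is the receiver
    int local = _Cilk_spawn fib(10);  // initializer spawn; local is the receiver
    _Cilk_sync;                       // wait for the children before using the receivers
}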
The iteration-statement is extended by adding another form of for loop:

grainsize-pragma:
  # pragma cilk grainsize = expression new-line

_Cilk_for ( for-init-decl ; condition ; expression ) statement
_Cilk_for ( assignment-expression ; condition ; expression ) statement

A strand is a serially-executed sequence of instructions that does not contain a spawn point or sync point (as defined below). At a spawn point, one strand (the initial strand) ends and two strands (the new strands) begin. The initial strand runs in series with each of the new strands but the new strands may run in parallel with each other. At a sync point, one or more strands (the initial strands) end and one strand (the new strand) begins. The initial strands may run in parallel with one another but each of the initial strands runs in series with the new strand. A single strand can be subdivided into a sequence of shorter strands in any manner that is convenient for modeling the computation. A maximal strand is one that cannot be included in a longer strand.
The strands in an execution of a program form a directed acyclic graph (DAG) in which spawn points and sync points comprise the vertices and the strands comprise the directed edges, with time defining the direction of each edge. (In an alternative DAG representation, sometimes seen in the literature, the strands comprise the vertices and the dependencies between the strands comprise the edges.)
The behavior of a deterministic Intel® Cilk™ Plus program is defined in terms of its serialization, as defined in this section. If the serialization has undefined behavior, the Intel® Cilk™ Plus program also has undefined behavior.
The strands in an execution of a program are ordered according to the order of execution
of the equivalent code in the program's serialization. Given two strands, the earlier
strand is defined as the strand that would execute first in the serial execution
of the same program with the same inputs, even though the two strands may execute
in either order or concurrently in the actual parallel execution. Similarly, the
terms earliest,
latest,
and later
are used to designate strands
according to their serial ordering. The terms left,
leftmost,
right,
and rightmost
are equivalent to earlier,
earliest,
later,
and latest,
respectively.
The serialization of a pure C or C++ program is itself.
If a C or C++ program has defined behavior and does not use the tasking keywords or library functions, it is a Cilk Plus program with the same defined behavior.
The serializations of _Cilk_spawn and _Cilk_sync are empty.

If a Cilk Plus program has defined deterministic behavior, then that behavior is the same as the behavior of the C or C++ program derived from the original by removing all instances of the keywords _Cilk_spawn and _Cilk_sync.
The serialization of _Cilk_for is for.

If a Cilk Plus program has defined deterministic behavior, then that behavior is the same as the behavior of the C or C++ program derived from the original by replacing each instance of the _Cilk_for keyword with for.
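For example (an illustrative sketch, not normative text), a parallel loop and its serialization:

void vector_add(int n, int a[], const int b[], const int c[])
{
    _Cilk_for (int i = 0; i < n; ++i)   // parallel loop
        a[i] = b[i] + c[i];
}

// Serialization: the same function with _Cilk_for replaced by for.
void vector_add_serial(int n, int a[], const int b[], const int c[])
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}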
A spawning block is a region of the program subject to special rules. Spawning blocks
may be nested. The body of a nested spawning block is not part of the outer spawning
block. Spawning blocks never partially overlap. The body of a spawning function
is a spawning block. A _Cilk_for
statement is a spawning block and
the body of the _Cilk_for
loop is a (nested) spawning block.
Every spawning block includes an implicit _Cilk_sync
executed on exit
from the block, including abnormal exit due to an exception. Destructors for automatic
objects with scope ending at the end of the spawning block are invoked before the
implicit _Cilk_sync
. The receiver is assigned or initialized to the
return value before executing the implicit _Cilk_sync
at the end of
a function. An implicit or explicit _Cilk_sync
within a nested spawning
block will synchronize with _Cilk_spawn
statements only within that
spawning block, and not with _Cilk_spawn
statements in the surrounding
spawning block.
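For example, the following sketch (illustrative only; f and g are hypothetical functions) shows an initializer spawn, an explicit sync, and the implicit sync on exit from the spawning block:

int f();
int g();

int sum()
{
    int x = _Cilk_spawn f();   // child: call f() and assign its result to x
    int y = g();               // continuation; may run in parallel with the child
    _Cilk_sync;                // x is guaranteed to have been assigned after this sync
    return x + y;
}                              // implicit _Cilk_sync on exit from the spawning block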
_Cilk_for Loops

To simplify the grammar, some restrictions on _Cilk_for loops are stated here in text form. Where a constraint on an expression is expressed grammatically, parentheses around a required expression or sub-expression are allowed.
The three items inside parentheses in the grammar, separated by semicolons, are the initialization, condition, and increment. (Cilk Plus imposes no constraints on the form of the initialization, beyond those of the base language.) We also need to define range-based Cilk-for loops.
The condition shall have one of the following forms:
expression < expression
expression > expression
expression <= expression
expression >= expression
expression != expression

Exactly one of the operands of the comparison operator shall be just the name of the loop's control variable. The operand that is not the control variable is called the limit.
The loop increment shall have the following form:
increment:
  single-increment
  increment , single-increment

single-increment:
  ++ identifier
  identifier ++
  -- identifier
  identifier --
  identifier += expression
  identifier = identifier + expression
  identifier = expression + identifier
  identifier -= expression
  identifier = identifier - expression

Each comma in the grammar of increment shall represent a use of a built-in comma operator. Where identifier occurs more than once in the grammar of a single-increment, the same variable shall be named as each occurrence. The variable modified by the top-level operation of a single-increment is an induction variable. Each induction variable shall have integral, pointer or class type, shall have automatic storage duration, and shall be declared either in the initialization clause of the loop or local to the smallest spawning block containing the loop. Any expression operand of a single-increment shall have integral or enumeration type.
In exactly one single-increment, the identifier shall name the loop's control variable. A loop shall have exactly one control variable. (The control variable of a loop is determined by considering the loop's condition and increment together.)
The following table indicates the stride corresponding to each syntactic form of single-increment.

Syntax | Stride |
---|---|
++ identifier | +1 |
identifier ++ | +1 |
-- identifier | -1 |
identifier -- | -1 |
identifier += expression | expression |
identifier = identifier + expression | expression |
identifier = expression + identifier | expression |
identifier -= expression | -( expression ) |
identifier = identifier - expression | -( expression ) |
The notion of stride exists for exposition only and does not need to be computed. In particular, for the case of identifier -= expression, a program may be well-formed even if expression is unsigned.
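For example (an illustrative sketch), the following loop satisfies these constraints:

void copy_even(int n, int a[], const int b[])
{
    // i is the control variable, n is the limit, and the
    // single-increment i += 2 has stride +2.
    _Cilk_for (int i = 0; i < n; i += 2)
        a[i] = b[i];
}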
A program that contains a return, break, or goto statement that would transfer control into or out of a _Cilk_for loop is ill-formed.
Within each iteration of the loop body, the name of each induction variable refers to a new object, as if the name were declared as an object within the body of the loop, with automatic storage duration and with the const-qualified version of the type of the original object. If an induction variable is declared before the loop initialization, then the final value of the induction variable is the same as for the serialization of the program.
The type of an induction variable shall be copy constructible. (For the purpose of specification, all C types are considered copy constructible.) For a single-increment that modifies variable V, if it uses operator +, ++ or +=, the expression

V += (difference_type)(incr)

shall be well-formed; if it uses operator -, -- or -=, the expression

V -= (difference_type)(incr)

shall be well-formed. The loop is a use of the required operator function.
The initialization, condition, and increment parts of a _Cilk_for
shall
meet all of the semantic requirements of the corresponding serial for
statement. In addition, depending on the syntactic form of the condition, a _Cilk_for
adds the following requirements on the types of the control variable (var),
limit, and stride, and the loop count is computed as follows,
evaluated in infinite integer precision. (In the following table, first
is the value of var immediately after initialization, if any.)
Condition syntax | Requirements | Loop count |
---|---|---|
var < limit, limit > var | (limit) - (first) shall be well-formed and shall yield an integral difference_type; stride shall be > 0 | (( limit ) - ( first ) + (( stride ) - 1)) / ( stride ) |
var > limit, limit < var | (first) - (limit) shall be well-formed and shall yield an integral difference_type; stride shall be < 0 | (( first ) - ( limit ) + ((-( stride )) - 1)) / (-( stride )) |
var <= limit, limit >= var | (limit) - (first) shall be well-formed and shall yield an integral difference_type; stride shall be > 0 | (( limit ) - ( first ) + ( stride )) / ( stride ) |
var >= limit, limit <= var | (first) - (limit) shall be well-formed and shall yield an integral difference_type; stride shall be < 0 | (( first ) - ( limit ) + (-( stride ))) / (-( stride )) |
var != limit, limit != var | (limit) - (first) and (first) - (limit) shall be well-formed and yield the same integral difference_type; stride shall be != 0 | if stride is positive then (( limit ) - ( first )) / ( stride ) else (( first ) - ( limit )) / (-( stride )) |
If the stride does not meet the requirements in the table above, the behavior is undefined. If this condition can be determined statically, the compiler is encouraged (but not required) to issue a warning. (Note that the incorrect loop might occur in an unexecuted branch, e.g., of a function template, and thus should not cause a compilation failure in all cases.)
If an induction variable is modified other than as a side effect of evaluating the loop increment expression, the behavior of the program is undefined.
If X and Y are values of an induction variable that occur in consecutive evaluations of the loop condition in the serialization, then
((limit)-
X)-
((limit)-
Y)
evaluated in infinite integer precision, shall equal the stride associated with the induction variable. If the condition expression is true on entry to the loop, then the loop count shall be non-negative.
Programmer note: Unsigned wraparound is not allowed.
The increment and limit expressions may be evaluated fewer times than in the serialization. If different evaluations of the same expression yield different values, the behavior of the program is undefined.
The copy constructor for the control variable may be executed more times than in the serialization.
If evaluation of the increment or limit expression, or a required operator+=
or operator-=
throws an exception, the behavior of the program is undefined.
If the loop body throws an exception that is not caught within the same iteration
of the loop, it is unspecified which other loop iterations execute. If multiple
loop iterations throw exceptions that are not caught in the loop body, the _Cilk_for
statement throws the exception that would have occurred first in the serialization
of the program.
A _Cilk_for
iteration-statement may optionally be preceded
by a grainsize-pragma. The grainsize pragma shall immediately precede
a _Cilk_for
loop and may not appear anywhere else in a program, except
that other pragmas that appertain to the _Cilk_for
loop may appear
between the grainsize-pragma and the _Cilk_for
loop. The
expression in the grainsize pragma shall evaluate to a type convertible to signed
long
. The presence of the pragma provides a hint to the runtime specifying
the number of serial iterations desired in each chunk of the parallel loop. The
grainsize expression is evaluated at runtime. If there is no grainsize pragma, or
if the grainsize evaluates to 0, then the runtime will pick a grainsize using its
own internal heuristics. If the grainsize evaluates to a negative value, the behavior
is unspecified. (The meaning of negative grainsizes is reserved for future extensions.)
The grainsize pragma applies only to the _Cilk_for
statement that immediately
follows it — the grain sizes for other _Cilk_for
statements
are not affected.
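For example (illustrative only; process is a hypothetical function):

void process(int i);

void run(int n)
{
    // Hint to the runtime: roughly 128 serial iterations per chunk.
    #pragma cilk grainsize = 128
    _Cilk_for (int i = 0; i < n; ++i)
        process(i);
}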
The _Cilk_spawn
keyword suggests to the implementation that an executed
statement or part of a statement may be run in parallel with following statements.
A consequence of this parallelism is that the program may exhibit undefined behavior
not present in the serialization. Execution of a _Cilk_spawn
keyword
is called a spawn. Execution of a _Cilk_sync
statement is
called a sync. A statement that contains a spawn is called a spawning
statement.
The following sync of a _Cilk_spawn
refers to the next _Cilk_sync
executed (dynamically, not lexically)
in the same spawning block. Which spawn the sync follows is implied from context.
The following sync may be the implicit _Cilk_sync
at the end of a spawning
block.
A spawn point is a C sequence point at which a control flow fork is considered to have taken place. Any operations within the spawning expression that are not required by the C/C++ standards to be sequenced after the spawn point shall be executed before the spawn point. The strand that begins at the statement immediately following the spawning statement (in execution order) is called the continuation of the spawn. The sequence of operations within the spawning statement that are sequenced after the spawn point comprise the child of the spawn. The scheduler may execute the child and the continuation in parallel. Informally, the parent is the spawning block containing the initial strand, the spawning statements, and their continuations but excluding the children of all of the spawns. The children of the spawns within a single spawning block are siblings of one another.
The spawn points associated with different spawning statements are as follows:

A _Cilk_for loop is a spawning statement with a spawn point at the end of the loop condition test.

A statement or declaration containing a _Cilk_spawn has a spawn point at the sequence point at the call to the spawned function. Any unnamed temporary variables created prior to the spawn point are not destroyed until after the spawn point (i.e., the destructors are invoked in the child).

For example, in the following two statements:
x[g()] = _Cilk_spawn f(a + b);
a++;
The call to function f
is the spawn point and the statement a++;
is the continuation. The expression a + b
and the initialization of
the temporary variable holding that value, and the evaluation of x[g()]
take place before the spawn point. The execution of f
, the assignment
to x[g()]
, and the destruction of the temporary variable holding
a + b
take place in the child.
If a statement is followed by an implicit sync, that sync is the spawn continuation.
Programmer note: The sequencing may be more clear if
x = _Cilk_spawn f(a + b);
is considered to mean
{
    // Evaluate arguments and receiver address before spawn point
    T tmp = a + b;    // T is the type of a + b
    U &r = x[g()];    // U is the type of x[0]
    _Cilk_spawn { r = f(tmp); tmp.~T(); }
}
A setjmp
/longjmp
call pair within the same spawning block
has undefined behavior if a spawn or sync is executed between the setjmp
and the longjmp
. A setjmp
/longjmp
call pair
that crosses a spawning block boundary has undefined behavior. A goto
statement is not permitted to enter or exit a spawning block.
A sync statement indicates that all children of the current spawning block must finish
executing before execution may continue within the spawning block. The new strand
coming out of the _Cilk_sync
is not running in parallel with any child
strands, but may still be running in parallel with parent and sibling strands (other
children of the calling function).
There is an implicit sync at the end of every spawning block. If a spawning statement appears within a try block, a sync is implicitly executed at the end of that try block, as if the body of the try were a spawning block. If a spawning block has no children at the time of a sync, then the sync has no observable effect. (The compiler may elide an explicit or implicit sync if it can statically determine that the sync will have no observable effect.)
Programmer note: Because implicit syncs follow destructors, writing
_Cilk_sync
at the end of a function may produce a different effect
than the implicit sync. In particular, if an assignment spawn or initializer spawn
is used to modify a local variable, the function will generally need an explicit
_Cilk_sync
to avoid a race between assignment to the local variable
by the spawned function and destruction of the local variable by the parent function.
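For example (an illustrative sketch; make_name and log are hypothetical functions):

#include <string>

std::string make_name();
void log(const std::string& s);

void f()
{
    std::string s = _Cilk_spawn make_name();  // initializer spawn into a local
    // ... other work in the continuation ...
    _Cilk_sync;  // without this explicit sync, the implicit sync would run only after
                 // s's destructor, racing the child's assignment with the destruction
    log(s);
}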
There is an implicit _Cilk_sync
before a try-block, and before
a throw
, after the exception object has been constructed.
If a spawned function terminates with an exception, the exception propagates from the point of the corresponding sync.
When several exceptions are pending and not yet caught, later exception objects (in the serial execution order of the program) are destructed in an unspecified order before the earliest exception is caught.
Cilk defines a category of objects called hyperobjects
. Hyperobjects allow
thread-safe access to shared objects by giving each parallel strand a separate instance
of the object.
Parallel code uses a hyperobject by performing a hyperobject lookup operation. The hyperobject lookup returns a reference to an object, called a view, that is guaranteed not to be shared with any other active strands in the program. The sequencing of a hyperobject lookup within an expression is not specified. The runtime system creates a view when needed, using callback functions provided by the hyperobject type. When strands synchronize, the hyperobject views are merged into a single view, using another callback function provided by the hyperobject type.
The view of a hyperobject visible to a program may change at any spawn or sync (including
the implicit spawns and syncs within a _Cilk_for
loop). The identity
(address) of the view does not change within a single strand. The view of a given hyperobject visible within a given strand is said to be associated with that strand. A hyperobject has the same view before the first spawn within a spawning
block as after a sync within the same spawning block, even though the thread ID
may not be the same (i.e., hyperobject views are not tied to threads). A hyperobject
has the same view upon entering and leaving a _Cilk_for
loop and within
the first iteration (at least) of the _Cilk_for
loop. A special view
is associated with a hyperobject when the hyperobject is initially created. This
special view is called the leftmost view or earliest view
because it is always visible to the leftmost (earliest) descendent in the depth-first,
left-to-right traversal of the program's spawn tree. The leftmost view is given
an initial value when the hyperobject is created.
Programmer note: If two expressions compute the same address for a view, then they have not been scheduled in parallel. This property yields one of the simplest ways by which a program can observe the runtime behavior of the scheduler.
Implementation note: An implementation can optimize hyperobject lookups by performing them only when a view has (or might have) changed. This optimization can be facilitated by attaching implementation-specific attributes to the hyperobject creation, lookup, and/or destruction operations.
The vast majority of hyperobjects belong to a category known as reducers.
Each reducer type provides a reduce
callback operation that merges
two views in a manner specific to the reducer. For a pair of views V1
and V2, the result of calling reduce(
V1,
V2)
is notated as V1⊗V2.
Each reducer also provides an identity
callback operation that initializes
a new view.
The reduce
callback for a classical
reducer implements an operation
⊗ such that (a⊗b)⊗c==a⊗(b⊗c)
(i.e., ⊗ is associative). The view-initialization callback for such a reducer
sets the view to an identity value I such that I⊗v==v
and v⊗I==v for any value v of value_type.
Given an associative ⊗ and an identity I, the triplet (value_type,
⊗, I) describes a mathematical monoid. For example,
(int
, +
, 0
) is a monoid, as is (list
,
concatenate
, empty). If each individual view, R,
of a classical reducer is modified using only expressions that are equivalent to
R←R⊗v (where v is of
value_type), then the reducer computes the same value in the parallel
program as would be computed in the serialization of the program. (In actuality,
the ⊗
in the expression R←R⊗v
can represent a set of mutually-associative operations. For example, +=
and -=
are mutually associative.) For example, a spawned function or
_Cilk_for
body can append items onto the view of a list reducer with
monoid (list
, concatenate
, empty). At the end
of the parallel section of code, the reducer's view contains the same list items
in the same order as would be generated in a serial execution of the same code.
Given a set of strands entering a sync, S1,S2,S3,…Sn,
associated with views V1,V2,V3,…Vn,
respectively such that Si is earlier in the serial ordering
than Si+1, a single view, W, emerges from the sync
with value W←V1⊗V2⊗V3⊗…⊗Vn,
such that the left-to-right order is maintained but the grouping (associativity)
of the operations is unspecified. The timing of this reduction
is unspecified
— in particular, subsequences typically will be computed asynchronously as
child tasks complete. Every view except the one emerging from the sync is destroyed
after the merge. If any of the strands does not have an associated view, then the
invocation of the reduce
callback function can be elided (i.e., the
missing view is treated as an identity).
A strand is never associated with more than one view for a given reducer, but multiple strands can be associated with the same view if those strands are not scheduled in parallel (at run time). Specifically, for a given reducer, the association of a strand to a view of the reducer obeys the following rules:
If the continuation of a spawn is scheduled in parallel with the child, the continuation is associated with a new view, initialized using identity. The implementation may create the new view at any time up until the first hyperobject lookup following the spawn. If the continuation strand does not perform a hyperobject lookup, then the implementation is not required to create a view for that strand.

Even before the final reduction, the leftmost view of a reducer will contain the same value as in the serial execution. Other views, however, will contain partial values that are different from the serial execution.
If ⊗ is not associative or if identity
does not yield a true
identity value then the result of a set of reductions will be non-deterministic
(i.e., it will vary based on runtime scheduling). Such non-classical
reducers
are nevertheless occasionally useful. Note that, for a classical reducer, the ⊗
operator needs to be associative, but does not need to be commutative.
Note: The syntax described here is the syntax used in the Intel products. Intel is considering a different syntax for the future, either in addition to or instead of the syntax described below.
At present, reducers are the only kind of hyperobject supported. In C++, every reducer
hyperobject has a hyperobject type, which is an instantiation of the cilk::reducer
class template. The cilk::reducer
class template has a single template
type parameter, Monoid
, which shall be a class type.
To define a reducer, a program defines a monoid class with public members representing the monoid (T, ⊗, identity) as follows:

value_type | typedef for T |
reduce(value_type* left, value_type* right) | evaluate *left = *left ⊗ *right |
identity(value_type* p) | construct identity object at *p |
destroy(value_type* p) | call the destructor on the object *p |
allocate(size_t size) | return a pointer to size bytes of raw memory |
deallocate(value_type* p) | deallocate the raw memory at *p |
If any of the above functions do not modify the state of the monoid (most monoids
carry no state), then those functions may be declared static
or const
.
The monoid type may derive from an instantiation of cilk::monoid_base<
T>
,
which defines value_type
and provides default implementations for
identity
, destroy
, allocate
, and deallocate
.
The derived class needs to define reduce
and override only those functions
for which the default is incorrect.
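For example, a summing monoid and a reducer built from it might be sketched as follows (illustrative only; sum_monoid and parallel_sum are not part of any library, and the example assumes the cilk::reducer and cilk::monoid_base templates are provided by <cilk/reducer.h>):

#include <cilk/reducer.h>

struct sum_monoid : cilk::monoid_base<long>
{
    // *left = *left (+) *right for the monoid (long, +, 0); the default
    // identity inherited from monoid_base is assumed to construct a zero
    // value, which is the correct identity for addition.
    static void reduce(long* left, long* right) { *left += *right; }
};

long parallel_sum(const long* a, int n)
{
    cilk::reducer<sum_monoid> total(0L);   // leftmost view initialized to 0
    _Cilk_for (int i = 0; i < n; ++i)
        total() += a[i];                   // hyperobject lookup; += matches the monoid operation
    return total();                        // reduced value, read after the loop's implicit sync
}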
For a given monoid, M, the type cilk::reducer<
M>
defines a hyperobject type. The cilk::reducer
class template provides
constructors, a destructor, and (const
and non-const
versions
of) value_type& operator()
and value_type& view()
,
both of which return a reference to the current view.
A hyperobject is created by defining an instance of cilk::reducer<
M>
:
cilk::reducer<M> hv(args);
Where args is a list of M::value_type
constructor
arguments used to initialize the leftmost view of hv
. A hyperobject
lookup is performed by invoking the member function, view
or member
operator()
on the hyperobject, as in the following examples:
hv.view().append(elem); hv().append(elem);
In these examples, append
is an operation to be applied to the current
view of hv
, and is presumably consistent with the associative operation
defined in the monoid, M.
Modifying a hyperobject view in a way that is not consistent with the associative operation in the monoid can lead to subtle bugs. For example, addition is not associative with multiplication, so performing a multiplication on the view of a summing reducer will almost certainly produce incorrect results. To prevent this kind of error, it is common to wrap reducers in proxy classes that expose only the valid associative operations. All of the reducers included in the standard reducer library have such wrappers.
An object of type M::value_type
is constructed by the
reducer
constructor. This object is called the initial view or leftmost view
of the hyperobject. When a hyperobject goes out of scope, the destructor is called
on the leftmost view. It is unspecified whether M::allocate
and M::deallocate
are called to allocate and deallocate
the leftmost view (they are not called in the current Intel implementation).
The implementation may create a view at any spawn that has been scheduled in parallel,
or may lazily defer creation until the first access within a strand. The implementation
creates a view by calling M::allocate
followed by M::identity
.
(This is in addition to the initial view created by construction of the hyperobject.)
The calls to M::allocate
and M::identity
are part of the strand for the purpose of establishing the absence of a data race.
At any sync or at the end of any spawned (child) function, the runtime may merge
two views by calling M::reduce(
left,
right)
, where right is the earliest remaining
view that is later than left. The M::reduce
function
is expected to store the merged result in the left view. After the merge,
the runtime destroys the right view by calling M::destroy
followed by M::deallocate
. Every view except the leftmost
view is passed exactly once as the second argument to reduce
. The calls
to M::reduce
, M::destroy
and M::deallocate
happen after completion of both of the strands that formerly owned the left and
right views.
If a monoid member function executes a hyperobject lookup (directly or through a function call), the behavior of the program is undefined.
For purposes of establishing the absence of a data race, a hyperobject view is considered a distinct object in each parallel strand. A hyperobject lookup is considered a read of the hyperobject.
Note: The syntax described here is the syntax used in the Intel products. Intel is considering a different syntax for the future, either in addition to or instead of the syntax described below.
The C mechanism for defining and using hyperobjects depends on a small number of
typedefs and preprocessor macros provided in the Cilk library. C does not have the
template capabilities of C++ and thus has a less abstract hyperobject syntax. Unlike
C++, each C hyperobject variable is unique — there is no named type that
unites similar hyperobjects. There is, however, an implicit hyperobject type
defined by the operations that comprise the hyperobjects' monoid. The provided macros
facilitate creating reducer variables, which are the only type of hyperobject currently
supported. The terms reducer
and hyperobject
are used interchangeably
in this section.
To define a C reducer, the program defines three functions representing operations on a monoid (T, ⊗, identity):
void T_reduce(void* r, void* left, void* right);
void T_identity(void* r, void* view);
void T_destroy(void* r, void* view);
The names of these functions are for illustration purposes only and must be chosen, as usual, to avoid conflicts with other identifiers. The purposes of these functions are as follows:
T_reduce | Evaluate *(T*) left = *(T*) left ⊗ *(T*) right |
T_identity | Initialize a T value to identity |
T_destroy | Clean up (destroy) a T value |
The r argument to each of these functions is a pointer to the actual reducer
variable and is usually ignored. Since most C types do not require cleanup on destruction,
the T_destroy function often does nothing. As a convenience, the Cilk
library makes this common implementation available as a library function, __cilkrts_hyperobject_noop_destroy
.
A reducer, hv
, is defined and given an initial value, init,
using the CILK_C_DECLARE_REDUCER
and CILK_C_INIT_REDUCER
macros as follows:
CILK_C_DECLARE_REDUCER(T) hv = CILK_C_INIT_REDUCER(T_identity, T_reduce, T_destroy, init);
The init expression is used to initialize the leftmost reducer view. The
CILK_C_DECLARE_REDUCER
macro defines a struct
and can
be used in a typedef
or extern
declaration as well:
extern CILK_C_DECLARE_REDUCER(T) hv;
The CILK_C_INIT_REDUCER
macro expands to a static initializer for a
hyperobject of any type. After initialization, the leftmost view of the reducer
is available as hv.value
.
If a reducer is local to a function, it shall be registered before first use using
the CILK_C_REGISTER_REDUCER
macro and unregistered after its last use
using the CILK_C_UNREGISTER_REDUCER
macro:
CILK_C_REGISTER_REDUCER(hv);
/* use hv here */
CILK_C_UNREGISTER_REDUCER(hv);
For the purpose of registration and unregistration, first use and
last use are defined with respect to the serialization of the program. The
reducer view immediately before unregistration shall be the same (have the same
address) as the reducer view immediately after registration. In practice, this means
that any spawns after the registration have been synced before the unregistration
and that no spawns before the registration have been synced before the unregistration.
Registration and unregistration are optional for reducers declared in global scope.
The value
member of the reducer continues to be available after unregistration,
but a hyperobject lookup on an unregistered reducer results in undefined behavior
unless the reducer is registered again.
A hyperobject lookup is performed using the REDUCER_VIEW
macro:
REDUCER_VIEW(hv) += expr;
As in the case of a C++ reducer, modifying a reducer other than through the correct associative operations can cause bugs. Unfortunately, C does not have sufficient abstraction mechanisms to prevent this kind of error. Nevertheless, the Cilk library provides wrapper macros to simplify the declaration and initialization, though not the safety, of library-provided reducers in C. For example, you can define and initialize a summing reducer this way:
CILK_C_DECLARE_REDUCER(long) hv = REDUCER_OPADD_INIT(long, 0);
A C reducer can be declared, defined, and accessed within C++ code, but a C++ reducer cannot be used within C code.
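A fuller sketch tying these macros together (illustrative only; long_identity, long_reduce, long_destroy and parallel_sum are hypothetical names, and the example assumes the macros are provided by <cilk/reducer.h>):

#include <cilk/cilk.h>
#include <cilk/reducer.h>

void long_identity(void* r, void* view)            { *(long*)view = 0; }
void long_reduce(void* r, void* left, void* right) { *(long*)left += *(long*)right; }
void long_destroy(void* r, void* view)             { /* nothing to clean up for long */ }

long parallel_sum(const long* a, int n)
{
    CILK_C_DECLARE_REDUCER(long) total =
        CILK_C_INIT_REDUCER(long_identity, long_reduce, long_destroy, 0L);
    CILK_C_REGISTER_REDUCER(total);

    _Cilk_for (int i = 0; i < n; ++i)
        REDUCER_VIEW(total) += a[i];

    long result = REDUCER_VIEW(total);   /* leftmost view holds the reduced value */
    CILK_C_UNREGISTER_REDUCER(total);
    return result;
}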
The macro CILK_C_DECLARE_REDUCER(T)
defines a struct
with a data member of type T, named value
. The macro CILK_C_INIT_REDUCER(I,R,D,V)
expands to a braced-init-list appropriate for initializing a variable,
hv, of structure type declared with CILK_C_DECLARE_REDUCER(T)
such that hv, can be recognized by the runtime system as a C reducer
with value type T, identity function I, reduction function
R, destroy function D, and initial value V.
Invoking CILK_C_REGISTER_REDUCER(hv)
makes a call into the
runtime system that registers hv.value
as the initial, or
leftmost, view of the C hyperobject hv. The macro CILK_C_UNREGISTER_REDUCER(hv)
makes a call into the runtime system that removes hyperobject hv from
the runtime system's internal map. Attempting to access hv after it has
been unregistered will result in undefined behavior. If a hyperobject is never registered,
the leftmost view will be associated with the program strand before the very first
spawn in the program and will follow the leftmost branch of the execution DAG. This
association is typically useful only for hyperobjects in global scope.
The implementation may create a view at any spawn that has been scheduled in parallel,
or may lazily defer creation until the first access within a strand. The implementation
creates a view by allocating it with malloc
, then calling the identity
function specified in the reducer initialization. (This is in addition to the initial
view created by construction of the reducer.) The call to the identity function
is part of the strand for the purpose of establishing the absence of a data race.
At any sync or at the end of any spawned (child) function, the runtime may merge
two views by calling the reduction function (specified in the reducer initialization)
on the values left and right, where right is the
earliest remaining view that is later than left. The reduction function
is expected to store the merged result in the left view. After the merge,
the runtime destroys the right view by calling the destroy function for
the hyperobject, then deallocates it using free
. Every view except
the leftmost view is passed exactly once as the second argument to the reduction function.
The calls to reduction and destroy functions happen after completion of both of
the strands that formerly owned the left and right views.
If a monoid function executes a hyperobject lookup, the behavior of the program is undefined.
For purposes of establishing the absence of a data race, a hyperobject view is considered a distinct object in each parallel strand. A hyperobject lookup is considered a read of the hyperobject.
This section provides a specification for the array notation portion of the Intel® Cilk™ Plus language extension. Array notation is intended to allow users to directly express high level parallel vector array operations in their programs. This assists the compiler in performing vectorization and auto-parallelization. From the users' point of view, they will see more predictable vectorization, improved performance and better hardware resource utilization. Array notation is an extension of the standard C/C++ languages, including features that are designed for easy expression of array operations and simplified parallel function invocation.
The section expression selects multiple array elements for a data-parallel operation.
The syntax of a section expression is as follows:
postfix-expression [ section-triplet ]

section-triplet:
  expression : expression
  expression : expression : expression
  :

Each of the expressions in a section triplet shall have integer type. The postfix expression in a section expression shall have type "pointer to complete object type"; the type of the section expression is that object type (i.e. the same type as the corresponding simple subscript expression; there is no "section type").
The sequence of expressions delimited by the brackets in a section expression is termed a section triplet (even when there are fewer than three expressions). The expressions in a triplet are interpreted, respectively as: begin, length, and stride. Each section triplet represents a sequence of subscript values, starting at begin, with length elements, where each element increases by stride:
begin, begin + stride, begin + stride * 2, …, begin + stride * (length - 1)
When no stride expression is present, the value of stride is 1. When the triplet contains no expressions (i.e. consists entirely of a single colon), the value of begin is 0, and the value of length is the number of elements in the array being subscripted. If this shorthand is used, the type of the array being subscripted shall have a declared size (which can be non-constant in the case of a VLA).
For example, A[0:3:2]
refers to elements 0, 2, and 4 of the array
A
. A[:]
refers to the entire array A
, assuming
A
is a one-dimensional array with known upper bound.
The expressions in a triplet are converted to ptrdiff_t
.
If stride is negative, then begin identifies the uppermost index. If length is less than or equal to zero, the sequence of subscript values is empty.
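For example (illustrative):

int A[10];
... = A[9:10:-1];  // begin 9, length 10, stride -1: A[9], A[8], ..., A[0]
... = A[5:0];      // length 0: an empty sequence; no elements are selected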
Every expression has a rank, determined as follows. (Determination of common rank is commutative and associative; the common rank of more than two expressions can be determined by arbitrarily pairing expressions and intermediate results.)

The rank of a section expression (postfix-expression [ section-triplet ]) is one greater than the rank of its postfix-expression operand. The rank of each expression in a section triplet shall be zero.

The rank of a simple subscript expression (postfix-expression [ expression ]) is the sum of the ranks of its operand expressions. The rank of the subscript operand shall not be greater than one.

In an assignment expression, if the right operand has nonzero rank, the left operand shall have the same rank as the right operand.
Examples:
Expression | Rank |
---|---|
A[3:4][0:10] | 2 |
A[3][0:10] | 1 |
A[3:4][0] | 1 |
A[:][:] | 2 |
A[3][0] | 0 |
An array section is an lvalue postfix expression with rank greater than zero.
For example, A[0:3][0:4]
refers to 12 elements in the two-dimensional
array A, starting at row 0, column 0, and ending at row 2, column 3.
Examples of section expressions:
int *p;
int A[n][m];
p[:] = ...     // not valid: p has no declared size.
A[:][:] = ...  // the entire 2D array A.
p[1:5] = ...   // p[1], p[2], ... p[5].
Every section triplet of an array section has a relative rank, defined as its ordinal number among the other triplets, from left to right, starting at 0.
Examples:
Expression | Relative rank of 0:10 |
---|---|
A[:][0:10][:] | 1 |
A[0:10][:][:] | 0 |
A[:][:][0:10] | 2 |
A[i[:]][j[0:10]][k[:]] | 1 |
The shape of an array section is defined as a vector (length_0, length_1, ..., length_(n-1)), where n is the rank of the array section and length_i is the length of the triplet with relative rank i. The size of an array section is the product of the length values of its shape.
A full expression shall have rank zero, unless it appears in an expression statement
or as the controlling expression of an if
statement.
In general, a statement containing a section expression is executed once for each element of the section; some operations of these executions may be performed in parallel. However, in such a statement, a sub-expression with rank zero is evaluated only once. For example:
a[:] = f1(b[:]) + f2(c) + d;
f1
is called for each element of b
. f2(c)
is evaluated once; that value is used to compute the new value of each element of
a
. If f2
changes the value of d
, the value
used for d
in this example is unspecified. If f1
changes
the value of d
, the behavior is likely to be undefined.
When two array sections are required to have common rank, if they do not have the same shape, the behavior is undefined.
When two expressions are required to have common rank, and one of them has zero rank, the expression with zero rank is evaluated only once; the resulting value is used as many times as necessary to fully execute the containing statement.
An array section assignment is a parallel operation that modifies every element of the array section on the left-hand side.
When the left operand of assignment has non-zero rank, the assignment of each element is unsequenced with respect to the assignment of every other element, and the value computation for each element is unsequenced with respect to the value computation for every other element.
For example, if any section on the right-hand side of an assignment overlaps with the left-hand side, and the overlap is not complete, the behavior is undefined.
a[0:10] = a[10:10];      // no overlap; well-defined
a[0:10:2] = a[1:10:2];   // no overlap; well-defined
a[0:10] = a[0:10] + 1;   // complete overlap; well-defined
a[0:10] = a[1:10];       // incomplete overlap; undefined
// Copy elements 10->19 in A to elements 0->9 in B.
B[0:10] = A[10:10];

// Undefined behavior. Triplets 0:10 and 0:100 are not the same size.
B[0:10] = A[0:100];

// Transpose row 0, columns 0-9 of A, into column 0, rows 0-9 of B.
B[0:10][0] = A[0][0:10];

// Copy the specified array section in the 2nd and 3rd dimensions of A into
// the 1st and 4th dimensions of B.
B[0:10][0][0][0:5] = A[3][0:10][0:5][5];

// Undefined behavior. The corresponding triplets with the same relative rank
// (0:9, 0:5) and (0:5, 0:9) do not have the same number of elements.
B[0:9][0:5] = A[0:5][0:9];

// OK since the triplets on both sides have the same number of elements.
// The 5 elements in A are scattered to the even-index elements in B
// (0,2,4...8). The values of the odd-index elements (1,3,5...7) are not
// changed.
B[0:5:2] = A[0:5];
When applied to an array section or sections, the C/C++ arithmetic operators are applied to each element of the array section(s). Note that multiplication is performed element-wise and does not correspond to traditional vector/matrix multiplication.
// Set all elements of A to 1.0.
A[:] = 1.0;

// Set every element of 3x3 shape in A to the value of B[0].
A[0:3][0:3] = B[0];

// Error: the number of dimensions (rank) must match, or be equal to 0.
A[0:2][0:2] = B[0:2];

// Element-wise addition of all elements in A and B, resulting in C.
C[:] = A[:] + B[:];

// Element-wise multiplication of all elements in A and B, resulting in C.
C[:] = A[:] * B[:];

// Add elements 10->19 from A with elements 0->9 from B and place in
// elements 20->29 in C.
C[20:10] = A[10:10] + B[0:10];

// Element-wise addition of the first 10 elements in column 2 of A and
// column 3 of B, resulting in column 0 of C.
C[0:10][0] = A[0:10][2] + B[0:10][3];

// Matrix addition of the 2x2 matrices in A and B starting at A[3][3]
// and B[5][5], placed in C starting at C[0][0].
C[0:2][0:2] = A[3:2][3:2] + B[5:2][5:2];

// Add the array section along the 1st and 2nd dimensions of B to the
// elements in the array section along the 2nd and 3rd dimensions of A,
// placing them in an array section in C.
C[0:9][0][0:9] = A[0][0:9][0:9] + B[0:9][0:9][4];

// Element-wise addition of first 10 elements in A and B, resulting in A.
A[0:10] = A[0:10] + B[0:10];

// Element-wise negation of first 10 elements in A, resulting in A.
A[0:10] = -A[0:10];

// Multiply every element in A by 2.
A[:] *= 2;

// Add one to each element in A.
A[:]++;

// Element-wise equality test of B and C, resulting in an array of
// Boolean values, which are placed in A.
A[:] = B[:] == C[:];
If a function is called with an array section argument, the function is mapped, or called with successive elements in the array. For example:
type fn(type arg1, type2 arg2);
type in[N], out[N];
type2 in2[N];
out[x:y:z] = fn(in[x:y:z], in2[x:y:z]);
The function fn
is mapped over array sections of arrays in
and in2
. The function fn
will be called with arguments
(in[x], in2[x])
, with arguments (in[x+z], in2[x+z])
, etc.
The results of the function calls are collected into a section of array out
.
The executions of function calls mapped from a given statement are unsequenced with respect to one another.
The reduction operations accumulate a result over all the values in an array section. There are specialized reductions for specific operations, and two general reductions which perform a user-specified operation. A reduction operation resembles a generic function, which can take a variety of argument types; in some cases, the return type of the operation matches the type of the section argument.
The section argument of a reduction operation shall have rank greater than zero. The rank of a reduction operation is zero.
Each of the specialized reduction operations resembles a function that takes a single section argument, whose (element) type shall be an arithmetic type. The specialized reduction operations are listed here:
Reduction operation | Result value | Result type | Result value when the argument is an empty section |
---|---|---|---|
__sec_reduce_add | The sum of the argument values | The argument type | Zero |
__sec_reduce_mul | The product of the argument values | The argument type | One |
__sec_reduce_max | The maximum argument value | The argument type | The minimum value representable in the result type |
__sec_reduce_min | The minimum argument value | The argument type | The maximum value representable in the result type |
__sec_reduce_max_ind | The index of the maximum argument value | intptr_t | Unspecified |
__sec_reduce_min_ind | The index of the minimum argument value | intptr_t | Unspecified |
__sec_reduce_all_zero | One if all argument values are zero; zero otherwise | int | One |
__sec_reduce_all_nonzero | One if all argument values are nonzero; zero otherwise | int | One |
__sec_reduce_any_zero | One if any argument value is zero; zero otherwise | int | Zero |
__sec_reduce_any_nonzero | One if any argument value is nonzero; zero otherwise | int | Zero |
For __sec_reduce_max_ind
and __sec_reduce_min_ind
, the
rank of the argument shall be exactly one.
The general reduction operations are described by these pseudo-declarations, where T represents a type:
T __sec_reduce(T initial, T section, T (*operation)(T, T));
void __sec_reduce_mutating(T &result, T section, T2 (*operation)(T *, T));
The second argument is the section argument; the result type of the reduction is the element type of the section. The result type shall be copy constructible.
The first argument is the initial value for the operation. Its type shall be the
result type of the reduction. For __sec_reduce_mutating
, it is also
the object into which the result is computed, and it shall be an lvalue.
The third argument identifies the abstract operation for which to perform the reduction.
It shall designate a function that takes two arguments having the same type as the
result of the reduction. For __sec_reduce
, it shall return its result
as a value of the result type of the reduction. For __sec_reduce_mutating
,
its first argument will refer to an object of the result type of the reduction,
into which it shall compute its result.
In C, the operation argument shall have pointer-to-function type. For __sec_reduce
,
its first argument will also be a value of the result type of the reduction; for
__sec_reduce_mutating
, its first argument will be a pointer to an object
of the result type of the reduction.
In C++, if the result type of the reduction is a class type, then names in the operation
argument are looked up as if used in a member function of that class. If the operation
argument is an id-expression referring to a set of overloaded functions,
overload resolution is performed as for a binary operator; i.e. overload resolution
is used to determine whether to invoke the operation as a member function (e.g. a.f(b)) or as a non-member function (e.g. f(a, b)).
Otherwise, the operation argument shall be a callable object; if it has pointer-to-member
type, then the operation is invoked as a member function call, otherwise it is invoked
with two arguments.
Invocations of the function designated by the operation argument are unsequenced with respect to one another. It is unspecified how the elements of the section, the initial/result object, and any introduced temporary objects, are paired by calls to the operation function, except that if the rank of the section is one and the operation function is associative, then the result is the same as for left-to-right reduction, where the initial value is taken as leftmost, and the element with index begin + stride * (length - 1) is taken as rightmost.
type fn(type in1, type in2);
type in[N], out;
out = __sec_reduce(initial, in[x:y:z], fn);
The reduction will be computed analogously to:
tmp = initial;
for each element X of in[x:y:z]
    tmp = fn(tmp, X);
The result of the reduction will be the final value of tmp.
For example, the two reduction operations given here compute the same result:
double add(double in1, double in2) { return in1+in2; }

out = __sec_reduce(0, in[x:y:z], add);  // accumulate using add()
out = __sec_reduce_add(in[x:y:z]);      // accumulate using built-in +
The compiler may produce more optimized code when the specialized reduction operations are used.
In writing code that uses array sections, it is sometimes useful to explicitly reference the indices of the individual elements in a section. For example, the user may wish to fill an array with a function of the element index, rather than a single value.
Conceptually, an array section operation can be thought of as expanding into a loop
with an implicit index variable for each relative rank of the section. For each
relative rank, the value of the implicit index variable ranges between zero and
one less than the length of the triplet with that relative rank. The __sec_implicit_index
operation returns the value of the implicit index variable for a specified relative
rank. It behaves as a function with the following declaration:
intptr_t __sec_implicit_index(int relative_rank);
The argument shall be an integer constant expression. For purposes of rank checking, the rank of an implicit index operation is zero, although it is reevaluated for each element, like an expression of rank one.
int A[10], B[10][10];

// A[0] = 0, A[1] = 1, A[2] = 2,...
A[:] = __sec_implicit_index(0);

// B[i][j] = i+j;
B[:][:] = __sec_implicit_index(0) + __sec_implicit_index(1);

// The length of each dimension is 2. The value of __sec_implicit_index
// is either 0 or 1, regardless of the subscripts of the affected elements.
A[i:2:s][j:2:t] = __sec_implicit_index(0) ^ __sec_implicit_index(1);
if statements

If the rank of the controlling expression of an if
clause is nonzero,
the rank of every full expression in every substatement of that if
statement (including substatements of those substatements, recursively) shall equal
the rank of the controlling expression of the if
clause. If the shape
of a full expression of a substatement does not match the shape of the controlling
expression, the behavior is undefined.
When the controlling expression of an if
statement has nonzero rank,
the entire if
statement, including its dependent statements, is executed
for each element of the shape of the controlling expression.
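For example (an illustrative sketch):

void choose(void)
{
    int A[10], B[10], C[10];

    // Executed element-wise: where B[i] > 0, A[i] is assigned from B[i];
    // otherwise A[i] is assigned from C[i].
    if (B[:] > 0)
        A[:] = B[:];
    else
        A[:] = C[:];
}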
The array notation enables a vector kernel
style of programming, where vector
code is encapsulated within a function, with a parameterized vector length. In this
case, concurrency happens within a single function invocation, unlike when mapping
function calls over an array section, where concurrency happens between function
invocations.
The following example illustrates how to combine array section vectorization inside a function body with threading for parallel function calls. The vector length m is 256 in this example.
void saxpy_vec(int m, float a, float x[restrict m], float y[m])
{
    y[:] += a * x[:];
}

int main(void)
{
    float a[2048], b[2048];
    _Cilk_for (int i = 0; i < 2048; i += 256)
        saxpy_vec(256, 2.0, (a + i), (b + i));
}
By writing the function explicitly with array arguments, the programmer can write code with easily customizable vector lengths and runtime model choices.
Please note that functions cannot return array section values. It may be helpful to return a pointer to the array as in standard C/C++ and take sections of the return value in the caller.
This section describes the elemental functions portion of the Cilk Plus language. Elemental functions are a data parallel programming construct. The use of elemental functions consists of the following three steps:

Write a function using standard C/C++ scalar syntax to describe the operation on a single element.

Annotate the function declaration with a vector attribute, with optional clauses to be described below, so that the compiler generates a vector variant of the function, which operates on a vector of elements instead of a single one.

Invoke the function with array arguments. The function is invoked with vectors of arguments iteratively until the whole array of arguments is processed. Each such invocation is of a single vector variant of the function.
The semantics of an elemental function are that the execution order among its invocations is unsequenced.
The execution of an elemental function depends on the language construct used at the invocation site, as follows.
for
loop, then the compiler
may invoke a vector variant instead of the original function, and iterate in the
loop fewer times. While in this context the replacement is always correct, whether
the replacement will actually be done is implementation dependent and is subject
to performance heuristics.#pragma simd
to the C/C++ for loop will ensure that a vector
variant of the function is called. Instances of the function will be invoked iteratively
in a single execution strand until all elements of the array arguments have been
processed._Cilk_for
loop, then a vector variant
is used, and the execution of the _Cilk_for
loop is as described in
the _Cilk_for
section. This may result in multiple invocations of the
function executing concurrently.Implementation note: The compiler will generate three translations of
the source code into machine code. One is for the compliant version, which operates
on a single element per invocation. The second the short vector variant, that operates
on multiple elements at a time. The number of elements to be used in by the short
vector variant of the function is determined by the types of arguments and return
value of the function, and the width of the vector registers in the target CPU.
The programmer can use a vectorlength
clause (described below) to override
the compiler's choice. The third version is designed to receive an additional argument
which is a mask argument. This argument is implicit and the application programmer
does not explicitly provide it. The mask argument is required for the cases in which
the function is called from a loop under a conditional statement. The application
programmer can use a mask
clause to suppress the generation of either
the second or the third version.
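The masked (third) version matters for call sites such as the following sketch, where an elemental function is invoked under a condition inside a SIMD loop; the function ef and its body are invented for this example:

    __attribute__((vector))
    float ef(float x) { return 0.5f * x; }   // hypothetical elemental function

    void run(int n, float out[n], const float in[n])
    {
        #pragma simd
        for (int i = 0; i < n; ++i)
            if (in[i] > 0.0f)
                out[i] = ef(in[i]);   // conditional call: served by the masked variant
    }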
vector attribute
The vector
attribute is a general-purpose attribute, applicable to functions.
For compatibility with the Microsoft compiler, it can appear in a __declspec; for compatibility with the GNU compiler, it can appear in an __attribute__; in C++0X, it can appear in an attribute-specifier.
vector
vector ( elemental-clausesopt )
elemental-clauses:
    elemental-clause
    elemental-clauses , elemental-clause
For each vector
attribute, one vector variant of the vectorized function
is created. The vector
attribute and its associated clauses are considered
part of the function signature. If the function declaration is inconsistent with
the function prototype with respect to these clauses, then the behavior is undefined.
An elemental function shall not have an exception specification.
processor clause
processor ( processor-name )
processor-name: one of
pentium_4
pentium_4_sse3
core2_duo_sse3
core_2_duo_sse_4_1
core_i7_sse4_2
Directs the compiler to create a vector version of the function for the given target processor. The default processor is taken from the implicit or explicit processor- or architecture-specific flag in the compiler command line. The target processor defines both the vector ISA that the compiler is allowed to use, and the width of the vectors.
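For example, a declaration such as the following sketch (using the Microsoft-style __declspec spelling and one of the processor names listed above; the function name is invented) requests a vector variant tuned for that target:

    // Request a vector variant generated for the core_i7_sse4_2 target;
    // the scalar version is unaffected.
    __declspec(vector(processor(core_i7_sse4_2)))
    float scale(float x, float factor);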
vectorlength clauses
vectorlengthfor ( type-name )
vectorlength ( constant-expression-list )
constant-expression-list:
    constant-expression
    constant-expression-list , constant-expression
A constant-expression in a vectorlength
clause shall be a
valid integer constant expression. In the context of a vector
attribute,
the constant-expression-list in a vectorlength
clause shall
have only one constant-expression.
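As a hedged example (identifiers invented), the vectorlength clause can override the vector length the compiler would otherwise derive, and the vectorlengthfor clause can name the characteristic type explicitly:

    // Explicitly request a variant that processes 8 elements per invocation.
    __attribute__((vector(vectorlength(8))))
    float halve(float x);

    // Derive the vector length from the type double rather than from the
    // return type or parameters.
    __attribute__((vector(vectorlengthfor(double))))
    float mix(float x, float y);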
Every vector variant of an elemental function has a vector length (VL). If the
vectorlength
clause is used, the VL is the value of the argument of that
clause. Otherwise, the function's characteristic type is found. If the
vectorlengthfor
clause is used, the characteristic type is the type
specified as its argument; otherwise, if the return type is not void
,
then the characteristic type is the return type; otherwise, if the function has
any non-uniform, non-linear parameters, then the characteristic type is the type
of the first such parameter; otherwise, the characteristic type is int
.
The VL is then determined from the characteristic type and the processor's vector
register. For example:
If the target processor has Intel® SSE2 and the characteristic type is int, the VL is 4.
If the target processor has Intel® AVX, the VL is 4 if the characteristic type is int. (Integer vector operations in Intel® AVX are performed on XMM.)
If the target processor has Intel® AVX, the VL is 8 if the characteristic type is float.
If the target processor has Intel® AVX, the VL is 4 if the characteristic type is double.
uniform clause
uniform ( uniform-param-list )
uniform-param-list:
    parameter-name
    uniform-param-list , parameter-name
The identifier in a parameter-name shall match the name of a parameter of the function to which the containing clause applies; the parameter shall have integral or pointer type.
If the value of an argument corresponding to a uniform
parameter differs
between invocations from a single execution of a loop, the behavior is undefined.
(The values of these parameters can be broadcast to all iterations as a performance
optimization.)
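For instance, in the following sketch (names invented) the scale factor does not change across the invocations arising from one loop execution, so it is declared uniform:

    // 'a' has the same value in every invocation from a given loop execution,
    // so it can be broadcast once instead of being passed as a vector.
    __attribute__((vector(uniform(a))))
    float scale(float a, float x)
    {
        return a * x;
    }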
linear clause
linear ( linear-param-list )
linear-param-list:
    elemental-linear-param
    linear-param-list , elemental-linear-param
elemental-linear-param:
    parameter-name
    parameter-name : elemental-linear-step
elemental-linear-step:
    constant-expression
    parameter-name
A constant-expression in an elemental-linear-step shall be
an integer constant expression. A parameter referenced as an elemental-linear-step
shall be the subject of a uniform
clause.
No parameter of an elemental function shall be the subject of more than one
uniform
or linear
clause.
For a linear
parameter, if the corresponding argument values in consecutive
iterations (in the serial version of the program) do not differ by the value of
the elemental-linear-step, the behavior is undefined.
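A common pattern, sketched below with invented names, combines uniform and linear: a base pointer that is the same for all invocations and an index that advances by 1 between consecutive invocations, which allows unit-stride vector accesses:

    // 'x' is uniform (same pointer in every invocation); 'i' is linear with
    // step 1, so x[i] is a contiguous access across consecutive invocations.
    __attribute__((vector(uniform(x), linear(i:1))))
    float load_elem(const float *x, int i)
    {
        return x[i];
    }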
mask clauses
mask
nomask
By default, for every vector variant declared, two implementations are provided:
one especially suitable for conditional invocation (i.e. masked), and another especially
suitable for unconditional invocation (i.e. unmasked). If all invocations are conditional,
generation of the unmasked implementation can be suppressed using the mask
clause. Similarly, if all invocations are unconditional, generation of the masked
implementation can be suppressed using the nomask
clause. (Using both
clauses together has no effect: both implementations are provided. This is also
the default.)
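For example, if a function is known to be called only unconditionally, the masked implementation can be omitted, as in this sketch (names invented):

    // All call sites are unconditional, so only the unmasked implementation
    // is needed; nomask suppresses the masked one.
    __attribute__((vector(nomask)))
    float square(float x)
    {
        return x * x;
    }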
This section describes the SIMD loop portion of the Cilk Plus language. The SIMD pragma can be applied to a loop, to indicate that the iterations of the loop can be divided into contiguous chunks of a specific length, and that within each chunk, multiple iterations can be executed concurrently.
# pragma simd simd-clausesopt new-line iteration-statement

simd-openmp-data-clause: | OpenMP 3.0 reference
---|---
data-privatization-clause | [2.9.3.3 private clause]
data-privatization-in-clause | [2.9.3.4 firstprivate clause]
data-privatization-out-clause | [2.9.3.5 lastprivate clause]
data-reduction-clause | [2.9.3.6 reduction clause]
The iteration-statement following #pragma simd
shall be a
for
loop. The loop's control clause and body are subject to the same
restrictions as in a _Cilk_for
loop. The loop control variable shall
have integer or pointer type.
The syntax and semantics of the various simd-openmp-data-clauses are detailed in the OpenMP specification. (http://www.openmp.org/mp-documents/spec30.pdf, Section 2.9.3).
vectorlength clauses
If the vectorlengthfor
clause is used, the length of the chunk (vector
length, or VL) is computed as for an elemental function, using the vectorlengthfor
argument type as the characteristic type. If not, a VL is selected to try to generate
the most efficient vector code. If the vectorlength
clause is used,
the VL is selected from among the values of its arguments.
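As a hedged sketch (the function name is invented), a SIMD loop can combine a vectorlength clause with an OpenMP-style reduction clause:

    float dot(int n, const float *a, const float *b)
    {
        float sum = 0.0f;
        // Chunks of 8 iterations; 'sum' is combined across chunks by the
        // reduction clause.
        #pragma simd vectorlength(8) reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }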
linear clause
linear ( simd-linear-variable-list )
simd-linear-variable-list:
    simd-linear-variable
    simd-linear-variable-list , simd-linear-variable
simd-linear-variable:
    id-expression
    id-expression : simd-linear-step
simd-linear-step:
    conditional-expression
The id-expression in a simd-linear-variable shall designate
a variable with scalar type. The conditional-expression in a simd-linear-step
shall either satisfy the requirements of an integer constant expression, or be a
reference to a variable with integer type. In any SIMD loop, no variable shall be
the subject of more than one linear
clause, and no variable shall be
the subject of a linear
clause and also an OpenMP data clause.
If a simd-linear-variable has no associated simd-linear-step, the constant value 1 is used for the simd-linear-step. If the simd-linear-step refers to a variable, and the variable is modified during the execution of the loop, or if the value of the object designated by the id-expression in a simd-linear-variable does not increase by the value of the expression in the associated simd-linear-step in each iteration of the loop, the behavior is undefined.
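The following sketch (names invented) uses an induction variable that advances by exactly 2 in every iteration, which the linear clause communicates to the vectorizer:

    void gather_even(int n, float *dst, const float *src)
    {
        int j = 0;
        // 'j' increases by exactly 2 in each iteration, as promised by linear(j:2).
        #pragma simd linear(j:2)
        for (int i = 0; i < n; ++i) {
            dst[i] = src[j];
            j += 2;
        }
    }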
The following language constructs shall not appear within the body of an elemental function, nor in a SIMD loop:
goto statement
_Cilk_spawn expression
_Cilk_for statement
try statement
setjmp
If execution of a function called from an elemental function or SIMD loop terminates
with an exception or a call to longjmp
, the behavior is undefined.
Intel is considering removal of some of these restrictions in the future.
Note: Because invocations of an elemental function, and iterations of a SIMD loop, are implicitly allowed to be concurrent, modifying any non-atomic non-local object carries the potential for a data race, and therefore undefined behavior.
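As a hedged illustration of this note (identifiers invented), the following loop modifies a non-atomic non-local object from iterations that are allowed to run concurrently, so it contains a data race and its behavior is undefined; a reduction clause would be the well-defined alternative:

    int count = 0;   // non-atomic, non-local

    void count_positive(int n, const float *a)
    {
        #pragma simd
        for (int i = 0; i < n; ++i)
            if (a[i] > 0.0f)
                count++;   // data race: undefined behavior
    }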