Doc. No.: WG14/N1315
Date: 2008-07-30
Reply to: Hans-J. Boehm
Phone: +1-650-857-3406
Email: Hans.Boehm@hp.com
N1315: Rationale for the C++ working paper definition of "memory location".
This paper outlines the reasoning behind the definition of "memory location"
used in the C++ concurrency memory model
(see C++ papers
N2429 and
N2480)
in the current C++ working paper.
It is in part a reaction to concern expressed
by members of the C committee (including Rich Peterson and Clark Nelson),
and in a liaison report,
that this definition may be too restrictive on the implementation.
This is intended to reflect the arguments for the current C++ solution;
it is not intended to represent HP's position on the C memory model.
This follows substantial email discussion with some C committee members
on the issue. Although that discussion has not so far resulted in agreement,
it has certainly helped to clarify the issues, and hence contributed to
this paper.
Background: The significance of "memory locations"
In the C++ working paper concurrency memory model, a memory location
is essentially the smallest unit of memory that we can guarantee to
be updated independently of surrounding memory. Objects that share
the same memory location may not be independently updatable; updating
one may be implemented by reading a larger unit of memory including the
other one, replacing a piece of the larger unit, typically in a register,
and then storing the larger unit back. This implies that two objects
sharing a memory location cannot safely be updated simultaneously by two threads.
For example, consider the declaration
struct { char a; char b; } s;
If a and b share the same memory location, and they
are each updated by loading and then storing the entire struct,
then a simultaneous update of a and b might be performed
by
- Copying the entire struct into a temporary in thread 1.
- Updating the a field in the temporary.
- Copying the struct into a temporary in thread 2, updating b
in the temporary, and writing the temporary back, still in thread 2.
- Copying thread 1's temporary back.
This effectively loses thread 2's update of b.
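A minimal sketch of this interleaving in C, assuming the compiler implements
each field store as a load-modify-store of the entire struct (the helper
names are illustrative, not from the working paper):

    struct S { char a; char b; } s;

    /* Hypothetical compiled form of "s.a = x;" on hardware without
     * byte stores: load the whole struct, replace one byte in a
     * temporary, and store the whole struct back. */
    void store_a(char x) {
        struct S tmp = s;   /* load the entire struct            */
        tmp.a = x;          /* replace one byte in the temporary */
        s = tmp;            /* store the entire struct back      */
    }

    void store_b(char y) {
        struct S tmp = s;
        tmp.b = y;
        s = tmp;
    }

If thread 2 executes all of store_b between thread 1's load and store
inside store_a, thread 1's final store writes back the stale value of b,
losing thread 2's update.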
This becomes an issue if, for example, the two fields a
and b are protected by different locks, and hence may be
accessed concurrently.
With the usual definition of data race,
if a and b share a memory location,
then we may introduce a race even if, say, b is protected by a lock,
and a is never written.
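For instance (a sketch assuming b's store compiles to a whole-struct
read-modify-write; the lock name and helper functions are illustrative):

    #include <pthread.h>

    struct { char a; char b; } s;
    pthread_mutex_t b_lock = PTHREAD_MUTEX_INITIALIZER;

    void set_b(char v) {
        pthread_mutex_lock(&b_lock);
        s.b = v;   /* may compile to a read-modify-write that also
                      rewrites s.a */
        pthread_mutex_unlock(&b_lock);
    }

    char read_a(void) { return s.a; }   /* called without b_lock */

Even though a is never written at the source level, the implicit rewrite
of a inside set_b races with concurrent calls to read_a.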
This is also potentially an issue for
adjacent array elements, either under similar conditions, or because
work on disjoint sections of the array is performed in parallel
by different threads.
From a pure programming convenience perspective, it would be optimal if
each scalar object and bit-field were its own memory location. Unfortunately,
in the case of bit-fields, this is generally considered impractical, since
little existing hardware supports independent updates of individual bits
without the use of atomic read-modify-write operations. And the latter
are generally too expensive to be practical for this application.
As a result, the memory model in the C++ working paper defines each
scalar object to be a separate memory location, but considers an
entire contiguous sequence of (non-zero length) bit-fields to be
a single location, i.e. it allows simultaneous updates of bit-fields
to interfere with each other.
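Concretely (an illustrative declaration, not an example from the working
paper):

    struct X {
        char a;       /* a separate memory location                 */
        int  b : 5;   /* b and c are adjacent non-zero-length       */
        int  c : 11;  /* bit-fields: together one memory location   */
        int    : 0;   /* zero-length bit-field ends the sequence    */
        int  d : 8;   /* a new, separate memory location            */
    };

Here a, d, and the pair {b, c} can each be updated independently, but
simultaneous updates of b and c by different threads may interfere.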
We believe that there are only two ways in which major commercial
compilers for general purpose processors currently deviate from this:
- Updates of bit-fields commonly overwrite adjacent small non-bit-fields.
On all modern mainstream machines this is avoidable at a cost of generating
several smaller stores instead of a single load and store. This theoretically
may incur some slowdown. But experiments by Sandya Mannarswamy at HP on
SPECint for PA-RISC and Itanium found differences in run-time produced
by these changes in bit-field compilation to be significantly under 1%,
with varying signs, and all well within
measurement noise. (This is unsurprising,
since most programmers generally have a favorable alignment in mind when
they declare structs with bit-fields, and the added overhead is minor in any
case. We know of no reason to expect different results on other
architectures.)
- Compilers for the Alpha processor tend to avoid byte stores by default
for the sake of compatibility with the earliest generations
of Alpha processors, which did not support them. Alpha processors since
the 21164A (introduced in late 1995) have supported them, and
all Alpha machines have since been discontinued by HP. Thus they do not
appear to be of concern for a 2010 (?) standard.
It is conceivable that there exist other multiprocessors for which
the C++ restrictions are a serious issue, but we are not aware of them.
The C++ memory model is
not difficult to implement on X86, ARM, PowerPC, MIPS, IBM
Mainframes, SPARC, Motorola 68K, and various others.
It imposes strictly weaker hardware requirements than the Java
memory model, and is therefore implementable on any hardware that
supports Java.
A Note on uniprocessors
The C++ memory model is implementable at
no cost in any environment that does not really support multiple
concurrent threads. So long as atomic variables and variables of
type sig_atomic_t are sufficiently aligned (and I don't
believe this is a new restriction for sig_atomic_t), such
environments do not allow the user to observe whether stores of small
objects rewrite adjacent bytes. Thus the concurrency memory model
imposes no added constraints.
At more cost, this can be extended to uniprocessors
running multithreaded code, even in those rare cases in which
the processor does not directly support
byte stores. This approach uses restartable atomic sequences
("Fast mutual exclusion for uniprocessors", Bershad, Redell, and Ellis,
ASPLOS '92). It requires that byte stores be implemented
by the usual load-replace-store sequence, but the compiler must generate
enough additional information, and must obey some additional code generation
constraints, so that the OS can restart the sequence on a context
switch. (The detailed implementation involves a time-space tradeoff.
The easiest solution is to compile all byte stores to a function call,
so that there is only one such sequence. This clearly involves some
run-time cost, but typically far less than generating an atomic
read-modify-write sequence. The time cost can be converted to a
space cost by duplicating the sequence, and generating a table
describing the locations of the sequences.)
In cases in which the thread-switching code is willing to trust client code,
simpler solutions are also possible. The client code can just set a
flag during the execution of "atomic sequences" like byte stores, which
should not be interrupted by a context switch.
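A minimal sketch of this flag-based scheme, assuming a little-endian
uniprocessor whose scheduler checks the flag before preempting (the names
and the word-splicing details are illustrative):

    #include <stdint.h>

    volatile int in_atomic_sequence = 0;  /* checked by the scheduler */

    void byte_store(uint8_t *p, uint8_t v) {
        /* Aligned 32-bit word containing *p, and the byte's offset. */
        uint32_t *w = (uint32_t *)((uintptr_t)p & ~(uintptr_t)3);
        unsigned shift = (unsigned)((uintptr_t)p & 3) * 8;

        in_atomic_sequence = 1;              /* no context switch here */
        uint32_t word = *w;                  /* load the whole word    */
        word &= ~((uint32_t)0xff << shift);  /* clear the target byte  */
        word |= (uint32_t)v << shift;        /* splice in the new byte */
        *w = word;                           /* store the word back    */
        in_atomic_sequence = 0;
    }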
The argument for a "coarser" definition of "memory location"
My understanding is that many members of the C committee feel uncomfortable
with the above solution, particularly in the context of C,
since it may slow down existing code when it
is recompiled. This is particularly true when we consider
compilers that fully exploit the freedom given by the current Posix
specification, which appears to leave the definition of "memory location"
completely up to the compiler.
A particularly egregious case (pointed out by David Keaton)
may occur on hardware that supports vector
instructions, but requires vector stores to unconditionally store
every element of the vector. Thus the loop
for (i = 1; i < n; i++) { if (a[i] != 0.0) a[i] = 1.0/a[i]; }
could be vectorized under existing rules, but not easily under the
proposed rules. Under existing rules, the above loop could arguably
be transformed to
for (i = 1; i < n; i++) { a[i] = (a[i] != 0.0 ? 1.0/a[i] : a[i]); }
which could be directly vectorized. This transformation is
disallowed under C++ working paper rules, since it introduces stores to
the elements a[i] that were originally zero.
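To see why the introduced stores matter, consider the following sketch
(the thread division and ownership convention are assumptions for
illustration, not part of the example above):

    enum { N = 100 };
    double a[N];

    /* Thread 1 runs the transformed loop: it now stores to every
     * element, including those that were 0.0 and that the original
     * loop would have left untouched. */
    void thread1(void) {
        for (int i = 1; i < N; i++)
            a[i] = (a[i] != 0.0 ? 1.0 / a[i] : a[i]);
    }

    /* Thread 2, by program convention, owns the elements that are
     * zero. Under the original loop this is race-free; after the
     * transformation, thread 1's blind store of 0.0 to a[5] can
     * overwrite this update. */
    void thread2(void) {
        a[5] = 42.0;
    }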
We all believe that the resulting
performance difference could be substantial (though experiments might
be useful). Thus it is at least conceivable that simply recompiling an
existing C program, on a system that previously implemented the above
transformation, would result in a substantial slowdown, requiring
existing code to be rewritten.
Why I still believe the current C++ definition is correct
It is possible to "coarsen" the C++ definition in relatively minimal
ways, notably by allowing bit-field updates to overwrite adjacent fields
whose type is smaller than the bit-field type. This would bring the
definition into line with existing practice on most platforms.
It would deviate from Java and the current C++ working paper definition
only for structs involving bit-fields, which do not
exist in Java.
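For concreteness, the kind of declaration such a coarsening would affect
(an illustrative example):

    struct T {
        short s;       /* short is smaller than the bit-field's type */
        int   bf : 7;
    };
    /* Under the coarsened rule, an update of bf may be compiled as a
     * load and store of a unit that also contains s, so a concurrent,
     * properly synchronized update of s could be lost. */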
But, on balance, this still appears to be an undesirable trade-off:
- As we pointed out above, all evidence suggests that the performance
benefit is negligible. This would not allow something like the
vectorization transformation above. In very rare cases, it would allow
fewer stores and bit masking operations to be generated, at the cost of
adding a load.
- It adds a source of completely unexpected, and nearly untestable, errors.
It is relatively easy to explain that individual bit-fields cannot be
independently updated, and thus must be separated by e.g. a zero-length
bit-field to enable concurrent access. I doubt that many programmers
expect this behavior for, say, a short field followed by
a bit-field.
- It is critically important for the industry that we make it easier to
write parallel, and specifically multi-threaded, programs, so that a larger
variety of applications see continued performance improvement from multicore
processors, where they previously saw performance improvements from added
clock speed. I believe it is clearly worth the modest amount of
required compiler engineering effort to avoid unnecessary pitfalls like this.
A much more drastic alternative would be to revert to something like
the classic Posix rules, which (arguably) allow the above vectorization
example. I believe that, in spite of the much more substantial performance
advantages, this would be a disastrous mistake, and would make C++
essentially unusable for multithreaded programming:
- Although detailed information is hard to come by, as far as I can tell,
no existing mainstream compilers implement transformations such as these,
except possibly when the user requests unsafe optimizations. I have heard
discussion in the past of compiler merging of non-adjacent field updates.
As far as I can tell, this is no longer performed by current production
compilers. Based on comments from several compiler teams, optimizations
along these lines have tended to be removed in response to complaints
from programmers writing multithreaded code, often particularly
from kernel developers, who have so far probably been the most
aggressive authors of such code. I expect that
as more application programmers need to aggressively parallelize code
to take advantage of multicore processors, they will generate more of these
complaints.
Thus the effect of a carefully worded "coarse" definition of memory location
in the language standard is likely to be that either it will be ignored,
or it will cause mainstream compilers to implement optimizations that have
so far largely been considered unsafe. The latter is likely to cause far
more serious code breakage than performance regressions from failed
vectorization.
- Such a "coarse" memory location definition, especially for arrays,
breaks even the simplest programs that try to use threading for
parallelization. Consider a program that increments each element in
an array by forking N threads, each of which increments
the elements of a
chunk containing roughly one Nth of the array elements.
Such a coarse definition would allow data races
at the chunk boundaries, and a much more complex algorithm would be needed
(a sketch of the simple version appears after this list). Given the
inherent difficulty of writing such programs, I don't believe we can
tolerate such added complexity. I would be unwilling to program
in such a language.
- It is unclear how much of the pressure to perform optimizations
such as the above vectorization example is coming from platforms that
would be seriously affected, i.e. that can support threads and
cache-coherent shared memory across multiple, probably homogeneous,
processors.
- More philosophically, the parallel semantics of a piece of program
text are fundamentally different from the sequential ones. If we do want
to support concurrency, we need to respect the parallel semantics, which
unavoidably restricts some compiler transformations. The two versions
of the for-loop in the vectorization example have identical
sequential semantics, but different parallel semantics. If the programmer
wants vectorization, and the second version has acceptable semantics in
the presence of threads, (s)he needs to write the second version, which
is not seriously more complicated than the first.
- These rules would be unexpectedly different from those learned by
beginning programmers, who are likely to have been exposed to the Java
rules. Problems introduced by failing to note the difference are likely to
be very intermittent, and may be missed during testing.
- If the presence of data races in parallel
programs depends on the sizes of integral types, it becomes much more
difficult to write parallel algorithms that are templatized with respect to
the type. This would greatly hinder the development of parallel libraries
for C++.
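As promised above, a minimal pthreads sketch of the simple chunked
increment (the chunk arithmetic and all names are illustrative):

    #include <pthread.h>

    #define LEN      1000
    #define NTHREADS 4

    static int a[LEN];

    /* Each thread increments one contiguous chunk of the array. */
    static void *inc_chunk(void *arg) {
        long t = (long)arg;
        long lo = t * LEN / NTHREADS, hi = (t + 1) * LEN / NTHREADS;
        for (long i = lo; i < hi; i++)
            a[i]++;                     /* touches only this chunk */
        return 0;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], 0, inc_chunk, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], 0);
        return 0;
    }

Under the C++ working-paper rules this program is race-free; under a
coarse definition in which adjacent array elements may share a memory
location, the updates at chunk boundaries could race.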
What about attributes for shared data?
One "compromise" that came up during earlier discussions is to allow
potentially shared data to be annotated with some kind of attribute or
type qualifier, so that it can be treated specially, and a less coarse
definition of "memory location" can be applied. After some thought, I
no longer believe this is at all feasible, except possibly as an
attribute that explicitly allows coarser accesses. But even that
would have limited effectiveness.
Threads like those defined by pthreads are fundamentally based on the
assumption that there is no difference between variables accessed
only by a single thread and those protected by proper mutual exclusion.
This allows abstract data types (e.g. a C++ vector or map) to be implemented
in a way that can be used both for thread-private and for shared, but
lock-protected, data. A consequence of this is that shared data, e.g.
the array in a C++ vector, is often allocated and accessed by code that
inherently cannot know whether the data is shared. Hence there is no
way to consistently supply correct attributes; the best we might be able to
do is to declare certain data to be unshared in those instances in
which we happen to know this at the point of declaration.