N2293: Alignment requirements for memory management functions

Summary

The alignment requirements of malloc, calloc, and realloc are somewhat confusingly phrased, in a way that affects small allocations (sizes less than _Alignof(max_align_t)). Some readers (and implementations) interpret them to demand _Alignof(max_align_t)-alignment even for allocation sizes that could not hold an object with that alignment. We call this the "strong-alignment" reading. Other readers (and implementations) interpret them as requiring the returned memory to be aligned only enough to accommodate those types that could inhabit the returned memory. In particular, since sizeof(T) >= _Alignof(T) for all portably defined types T, allocations with sizes smaller than _Alignof(max_align_t) need only be aligned to the largest power of two less than or equal to the requested size. We call this the "weak-alignment" reading.
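The difference between the two readings can be sketched as a pair of functions giving the alignment each one demands for a request of a given size. This is only an illustration: the names weak_required_alignment and strong_required_alignment are hypothetical, and the sketch assumes (as above) that sizeof(T) >= _Alignof(T) for every type of interest.

```c
#include <assert.h>
#include <stddef.h>

/* Weak-alignment reading: the largest power of two that is less than or
   equal to the requested size, capped at _Alignof(max_align_t). */
size_t weak_required_alignment(size_t size)
{
    assert(size > 0); /* malloc(0) is a separate edge case */
    size_t align = 1;
    while (align < _Alignof(max_align_t) && align * 2 <= size)
        align *= 2;
    return align;
}

/* Strong-alignment reading: every successful allocation is aligned to
   _Alignof(max_align_t), regardless of the requested size. */
size_t strong_required_alignment(size_t size)
{
    (void)size;
    return _Alignof(max_align_t);
}
```

For a 3-byte request, for example, the weak reading asks only for 2-byte alignment, while the strong reading asks for _Alignof(max_align_t), whatever that is on the target.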

Many implementations provide only weak-alignment guarantees, and in some cases these guarantees are important in conserving both CPU and memory. Therefore, we propose clarifying the wording such that the weak-alignment implementation is unambiguously allowed. Strong-alignment implementations remain correct, and may document the guarantee as an extension.

The current wording

C17, 7.22.3 states: "The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated".

The two clauses of interest here are "it may be assigned to a pointer to any type of object with a fundamental alignment requirement" and "and then used to access such an object or an array of such objects in the space allocated". Strong-alignment readers tend to see both clauses as independent, whereas weak-alignment readers tend to see the second as an adverbial clause modifying the first. Neither reading is completely satisfactory. The strong-alignment view implies an independent clause with a tense mismatch or a missing verb; the weak-alignment view uses "and then" when "then" would suffice, which could imply a sequencing at odds with the interpretation.

In the author's view, attempting to declare one of these two parsings correct is a waste of time. Many different implementers, fluent in English and acting in good faith, have parsed the sentence both ways. The question is how to reconcile the divergence.

Implementations and implementer opinions

The author has experimented with and inspected the source code for as many different implementations as possible. These are his best guesses based on consultations with maintainers and the observed behavior of implementations (if possible), or source-code inspection alone (if not).

Strong-alignment environments:

Glibc
OS X libc
Musl libc
OpenBSD libc

Weak-alignment environments:

Windows CRT
QNX libc (whether or not this manifests depends on the specific compiler and target CPU).
Jemalloc-based libcs
- FreeBSD
- NetBSD
- Bionic (Android's libc)
uClibc
A number of server-oriented allocators intended to work alongside a platform libc
- Intel TBB malloc
- Tcmalloc
- Freestanding jemalloc

I also reached out to those maintainers with whom I could find a mutual professional connection:

The Windows CRT maintainers are weak-alignment readers, believe that the strong-alignment interpretation rules out efficient implementations, and would be unable to comply with it (choosing instead to remain non-conforming).
Jemalloc could not become a strong-alignment implementation without nontrivial regressions (Note: the author is a jemalloc maintainer).
The same is true of tcmalloc.
Glibc's lack of a corresponding corporate hierarchy makes it somewhat ambiguous who counts as a malloc maintainer. Jonathan Wakely (of Red Hat) replies: "I take the strong-alignment reading. I'm not against changing the standard in the other direction, I just think it currently requires strong-alignment (although I agree it's not entirely clear and it's not unreasonable to read it the other way)".

(There's no filtering of responses here; every group of maintainers I was able to contact is included, except for one that was unable to obtain legal permission to comment publicly).

This is not the first time this vagueness has been addressed; the response to DR75 endorses the strong-alignment interpretation. However, that response never became normative (via inclusion in a technical corrigendum or international standard). Moreover, it was written before the introduction of the notion of fundamental alignments, and thereby implies that all returned pointers must be suitably aligned for all types, including vector types, types with implementation-specific alignment specifiers, etc. Most compilers with which the author is familiar never provided such a strong guarantee in between DR75 and C11.

Proposed wording

We propose rewording so that the weak-alignment interpretation is specified, and replacing:

"The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated"

with

"The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and size less than or equal to the size requested. It may then be used to access such an object or an array of such objects in the space allocated"

This makes all the implementations listed in the previous section conforming.
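As a rough illustration of what the proposed wording asks of an implementation, the following sketch checks a returned pointer against every fundamental alignment that a type fitting in the requested size could have. The function names are hypothetical, and the sketch makes two assumptions that the standard does not guarantee: that converting a pointer to uintptr_t exposes its address bits, and that a type's alignment never exceeds its size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* True if p is aligned to `align`, which must be a power of two. */
static bool is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Checks p against each fundamental alignment no larger than `size`. */
bool conforms_to_proposed_wording(const void *p, size_t size)
{
    for (size_t a = 1; a <= _Alignof(max_align_t) && a <= size; a *= 2)
        if (!is_aligned(p, a))
            return false;
    return true;
}

/* Allocates `size` bytes and checks the result. A failed allocation
   passes vacuously: the wording constrains only successful ones. */
bool malloc_conforms(size_t size)
{
    void *p = malloc(size);
    bool ok = (p == NULL) || conforms_to_proposed_wording(p, size);
    free(p);
    return ok;
}
```

Under this check a 1-byte allocation may land at any address, whereas an allocation of _Alignof(max_align_t) bytes or more must still be fully aligned.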

In subsequent sections, we'll argue that this change does not induce code breakage, and that changing the weak-alignment implementations to be strong-alignment ones would cause significant performance regressions.

Code breakage

A strong-alignment reader might argue that the proposed rewording breaks existing code; alignment assumptions that they read as holding in all conforming implementations now only hold given an implementation-level extension. However, even granting their parsing of the wording, this guarantee was only ever true in theory, not in practice. The number of weak-alignment implementations meant that the code in question was never meaningfully portable except to those implementations that opted in to the guarantee in the first place.

Nevertheless, in an attempt to quantify the amount of such non-portable code, I examined the FreeBSD ports project. Many (over 30,000, though not all of them written in C) third-party programs are available for installation on FreeBSD systems. Because many of these programs are developed primarily on other operating systems, the ports project includes the patches required to allow projects to work on FreeBSD. If a program uses the strong-alignment assumption, we would expect to see a patch involving malloc, realloc, calloc, or max_align_t in the ports project, tweaking the size argument (since FreeBSD's libc malloc is weak-alignment). Manual inspection of all instances of these strings in these patches indicated that this was not the case.

However, it may still be useful to consider a few specific scenarios:

The most obvious situation in which code written assuming strong-alignment could misbehave when run on a weak-alignment implementation involves strategies that use the low pointer bits of heap objects to encode metadata. To try to measure the number of such instances, I used the Debian code search functionality, which indexes all the open-source projects included in the Debian archive (over 18,000 packages, and 140 GB of source code). I searched through all occurrences of max_align_t trying to find instances that might rely on alignment in this way. In case this assumption arrived through other means (such as using _Alignof(some_other_primitive_type)), I also searched for related phrases, like "tag bits", "pointer tag", etc. In all cases (all of them language runtimes or mutex implementations), the required alignment was otherwise implied by the contents of the pointed-to object (either because of an explicit _Alignas, the presence of a sufficiently-aligned data member, or merely the size of the pointed-to object).
Another possibility is casting a pointer to a small allocation to a pointer to a type with large alignment. If the pointer is less aligned than required for the larger type, this violates language rules. Trying to find instances of this pattern via regex searching is infeasible (it requires understanding of language semantics). However, note that any such code is incorrect if applied to (e.g.) stack data, and that (regardless of origin) using such a pointer without casting it back is a strict-aliasing violation. Moreover, the only common use case of this pattern (in the author's experience), where an S* is cast to a T* (e.g. in a pointer-to-first-element cast) and then back, is still correct in a weak-alignment implementation. Additionally, Facebook has a large codebase containing both C and C++ (and both proprietary and open-source code), which it tests in build modes containing dynamic alignment checks. No breakages due to weak malloc alignment have been reported to the malloc team.
A third possibility is code that uses implementation-specific extensions to create types that violate alignment/size relationships (e.g. following typedef __attribute__((aligned(16))) char aligned_char_t, _Alignof(aligned_char_t) is 16 while sizeof(aligned_char_t) is 1 on GCC). These types already violate language rules (for example, in array indexing), and, being non-standard, don't need to have standards-defined semantics. I also searched for variants of this pattern in Debian packages. Unfortunately, the number of results (over 9,000) was too large to examine every occurrence manually; however, in every case I did examine (a few hundred), no breakage would occur (most instances are static data, intended to loosen rather than strengthen alignment, or applied only to large objects, a la typedef float vec4[4] __attribute__ ((__aligned__(16)));).
Code allocating an array of dynamic size containing highly-aligned objects may try to allocate via ptr = malloc(num_objects * object_size), with num_objects set to 0 (i.e. equivalent to malloc(0)). It might subsequently rely on the alignment guarantee it receives. Instances of this pattern are difficult to quantify, and it's enough of an edge case that the author was not motivated to try. If such code is deemed valuable, then specifying that the pointer returned from malloc(0), if non-null, should be _Alignof(max_align_t)-aligned is straightforward to require.
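The first scenario above, in its benign form, might look like the following sketch. The struct, macro, and function names are hypothetical and not taken from any surveyed project. The sketch assumes that 8 is a fundamental alignment on the target and that pointers round-trip through uintptr_t with their low bits modified, which holds on mainstream platforms but is not guaranteed by the standard. Because the alignment is requested on the type itself, sizeof(struct node) is a multiple of 8, so malloc(sizeof(struct node)) yields an 8-byte-aligned pointer under either reading.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node type whose alignment is stated explicitly. */
struct node {
    _Alignas(8) int value;
};

/* The three low bits of an 8-byte-aligned address are always zero. */
#define TAG_MASK ((uintptr_t)7)

/* Packs a small tag into the unused low bits of the pointer. */
uintptr_t tag_pointer(struct node *p, unsigned tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);
    assert(tag <= TAG_MASK);
    return (uintptr_t)p | tag;
}

/* Recovers the original pointer by clearing the tag bits. */
struct node *untag_pointer(uintptr_t tagged)
{
    return (struct node *)(tagged & ~TAG_MASK);
}

/* Extracts the tag stored by tag_pointer. */
unsigned pointer_tag(uintptr_t tagged)
{
    return (unsigned)(tagged & TAG_MASK);
}
```

Code of this shape does not depend on the strong-alignment reading, since the free low bits follow from the object's own size and alignment rather than from a blanket guarantee on all allocations.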

In summary: code that requires strong-alignment assumptions for correctness is rare enough that the above search was unable to find it. Any such code that does exist relies on platform-dependent behavior in practice if not in theory, and therefore is not any more or less portable with the proposed rewording than in the status quo.

Performance considerations

Another way of dealing with the split would be to clarify in the strong-alignment direction, and demand that the weak-alignment implementations change their behavior. However, doing so would impose nontrivial costs on those implementations. The weak-alignment interpretation has a number of performance advantages over the strong-alignment one.

The minimum alignment of an object is a floor on its size. Requiring those objects to be larger increases memory consumption, as well as cache and TLB misses (on architectures with caches and TLBs). On implementations that use boundary tags, the effective minimum allocation size is doubled, to 2 * _Alignof(max_align_t).
Object size can be a useful heuristic for lifetime, so that, in implementations using slab/magazine allocation, the chance of being able to reuse a region of memory for a different size of allocation increases.
Many concurrency-oriented allocators shard mutexes by size class. The smallest size classes tend to have the hottest mutexes; forcing all allocations of sizes less than or equal to _Alignof(max_align_t) to use the same mutex increases contention on it.
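To make the size-class effect concrete, here is a sketch of how a hypothetical size-class allocator might round small requests under each reading. It is not jemalloc's actual scheme; the function names and the power-of-two classes for small sizes are assumptions chosen for illustration, and overflow for very large requests is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Strong reading: every class must be a multiple of
   _Alignof(max_align_t), so all tiny requests share one class. */
size_t strong_size_class(size_t size)
{
    assert(size > 0);
    size_t a = _Alignof(max_align_t);
    return (size + a - 1) / a * a;
}

/* Weak reading: requests below _Alignof(max_align_t) may round up to
   the next power of two, giving several smaller classes. Objects of
   class c packed back to back in an aligned slab are c-aligned. */
size_t weak_size_class(size_t size)
{
    assert(size > 0);
    size_t a = _Alignof(max_align_t);
    if (size >= a)
        return (size + a - 1) / a * a;
    size_t c = 1;
    while (c < size)
        c *= 2;
    return c;
}
```

Where _Alignof(max_align_t) is 16, requests of 1, 2, 4, and 8 bytes fall into four distinct classes under the weak scheme, each of which could have its own mutex, but into a single 16-byte class under the strong one.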

Strong-alignment implementations are also valid weak-alignment implementations, and therefore an implementation that obtains performance improvements from a strong-alignment guarantee can of course continue to provide them.

Note that most compiler-level optimizations don't benefit from assuming strong-alignment. The optimizer cannot in general assume that a pointer it sees came from malloc unless it sees it returned from malloc or passed to free (and, in the former case, it's free to expand the size of the allocation if desired). We are concerned only with small allocations here, so having a small alignment cost multiplied by iteration through a loop is not a concern. Moreover, situations in which an operation on an N/2-byte object can be sped up when it is known that the object is N-byte aligned are relatively uncommon.

To attempt to quantify the possible regression, I measured the performance changes on an allocator that can be configured to provide either the strong or the weak alignment guarantees (jemalloc), on two large (heaps in the 10s of GBs) proprietary server processes. Each showed a 1.1% regression in resident set size, and 0.7% and 0.8% regressions in CPU consumption.

To get numbers based only on publicly available data, I measured GCC 4.8.5's runtime and memory consumption when compiling GCC 4.0.0's sources concatenated into a single file (the former version being chosen because it is the default compiler provided by my operating system, and the latter because of its easy availability at https://people.csail.mit.edu/smcc/projects/single-file-programs/). These numbers are for three runs, after an initial “burn-in” run for both versions:

Runtime (strong-alignment): 156.30 / 151.18 / 154.04 seconds
RSS (strong-alignment): 1.133 / 1.147 / 1.147 GB
Runtime (weak-alignment): 149.93 / 147.44 / 147.67 seconds
RSS (weak-alignment): 1.114 / 1.142 / 1.125 GB

Comparing best-to-best run in each category gives a 2.5% runtime increase and 1.7% RSS increase from enforcing strong alignment. Median-to-median gives a 4.3% runtime increase and 2.0% RSS increase. Worst-to-worst gives a 4.2% runtime increase and 0.4% RSS increase.
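The comparisons can be recomputed from the measurements listed above with a few lines of C. The helper names below are hypothetical; the arrays simply restate the three runs per configuration.

```c
#include <assert.h>

/* Measurements from the three runs of each configuration. */
static const double strong_runtime[3] = {156.30, 151.18, 154.04}; /* seconds */
static const double weak_runtime[3]   = {149.93, 147.44, 147.67}; /* seconds */
static const double strong_rss[3]     = {1.133, 1.147, 1.147};    /* GB */
static const double weak_rss[3]       = {1.114, 1.142, 1.125};    /* GB */

/* Smallest of three values. */
double min3(const double v[3])
{
    double m = v[0];
    if (v[1] < m) m = v[1];
    if (v[2] < m) m = v[2];
    return m;
}

/* Largest of three values. */
double max3(const double v[3])
{
    double m = v[0];
    if (v[1] > m) m = v[1];
    if (v[2] > m) m = v[2];
    return m;
}

/* Middle of three values: the sum minus the two extremes. */
double median3(const double v[3])
{
    return v[0] + v[1] + v[2] - min3(v) - max3(v);
}

/* Percentage by which the strong figure exceeds the weak one. */
double percent_increase(double strong, double weak)
{
    assert(weak > 0.0);
    return (strong / weak - 1.0) * 100.0;
}
```

For instance, percent_increase(min3(strong_runtime), min3(weak_runtime)) gives the best-to-best runtime comparison, and substituting median3 or max3 gives the other two.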

Conclusion

Regardless of whether or not strong-alignment is required in the Platonic reading of C as standardized, it is not provided in C as implemented. Clarifying the wording in the strong-alignment direction imposes performance regressions on existing implementations. Leaving the wording unchanged confuses users and implementers. Clarifying the wording in the weak-alignment direction avoids these costs, and does not change the portability of programs in practice.