2021-03-24
Several heap allocators1 (hereafter, "allocators") expose as extensions variants of free
that accept, in addition to the address of the allocation, an additional argument "reminding" the allocator of the size of that allocation. These extensions can reduce deallocation cost by 30%, allow extra security-hardening functionality, and currently ship in several implementations. We propose standardizing these functions, so that these improvements are available portably.
Would deallocate the storage pointed to by ptr
, informing the allocator that the storage was allocated with a call to one of malloc
, realloc
, or calloc
with the associated size (or, in the case of calloc
, with two arguments having size as their product).
Would behave similarly, but could be used for storage allocated by aligned_alloc
.
Under the API we propose, the following code snippet:
void *buf = malloc(size);
use(buf, size);
free(buf);
void *aligned_buf = aligned_alloc(alignment, size);
use_aligned(buf, size, alignment);
free(buf);
Could be written equivalently (but perhaps more performantly, or with more safety checking) as:
void *buf = malloc(size);
use(buf, size);
free_sized(buf, size);
void *aligned_buf = aligned_alloc(size, alignment);
use_aligned(buf, size, alignment);
free_aligned_sized(buf, alignment, size);
Implementing both proposed functions by falling back to free
is valid, so the worst-case implementation overhead is minimal.
Many modern allocators divide memory into discrete sizes, keeping their metadata out-of-band. It is also common to cache recently-freed allocations per-thread, or per-core. When deallocating, this type of allocator must determine the true size of the object to put it into the correct per-thread or per-core cache. Looking up the size given just a pointer requires walking a hashtable or radix tree, which requires several memory loads and (often) cache misses. The combination of caching and out-of-band metadata motivate the inclusion of the sized-deallocation extensions.
Many times, when an application is deallocating storage, it already "knows" the size of that storage in some application-specific way This information can be passed to the allocator such that it can modify the appropriate free list without the typical metadata accesses to look up the size of the allocation. This lookup can be the largest single cost of deallocation; in some popular allocators taking up 30% of cycles along the deallocation pathways23 .
Moreover, sized-deallocation can be an aid in testing and security-hardening. If storage is deallocated with an incorrect size, either heap corruption has occurred, or the application has incorrectly tracked the sizes of its allocations, risking a heap overrun vulnerability. In either case, the process can be aborted (or the error otherwise managed). When the equivalent C++ functionality was deployed in both Google and Facebook’s codebases, it exposed a significant number of correctness bugs caused by incorrect type-tracking in applications. With a standardized name, tools like Address Sanitizer can flag these sorts of errors, as they do in C++ code.
Allocators which do not benefit from the additional size information, can easily provide trivial implementations of free_sized
and free_aligned_sized
which simply ignore it:
void free_sized(void *ptr, size_t /* unused */ size) {
free(ptr);
}
void free_aligned_sized(void *ptr, size_t /* unused */ alignment, size_t /* unused */ size) {
free(ptr);
}
Because sized-deallocation is non-standard, portable code has to resort to #ifdef
s and linker hacks to use such functionality only conditionally. For example, the deallocation wrappers in BoringSSL requires a declaration for a weak version of one such extension, sdallocx
, which is checked against NULL
at runtime4:
The true cost is likely also paid in performance; most users simply don't bother using non-portable extensions, and get slower or less secure programs as a result.
C is often a shared substrate on which other languages build; Rust and most C++ implementations delegate their language-specific allocation interfaces to the C ones (e.g. C++’s operator new
/ operator delete
are typically implemented as calls to malloc
and free
). Scripting languages often delegate to the underlying heap allocator. Even though these languages support sized-deallocation, they can't inform the underlying allocator of sizes portably. To maximally utilize underlying functionality, each language implementation needs to have custom support for each libc the implementation might be used alongside (one per available extension). Standardizing names for these solves this M:N problem.
Two allocators (TCMalloc and jemalloc) expose this functionality under the name sdallocx
. The latter is the system allocator for FreeBSD, where sdallocx
is exposed as an extension.
GrapheneOS (a security-focused mobile OS compatible with Android apps and devices) uses a hardened allocator that exposes free_sized
, for the heap consistency checking purposes indicated above.
Mimalloc (a malloc replacement from Microsoft research, embedded in various language runtimes) exposes these functions as mi_free_size
and mi_free_size_aligned
.
libumem ships with Solaris and illumos, and exposes sized deallocation via umem_free
5.
C++ exposes equivalent functionality via additional overloads of operator delete
as of C++146.
The Rust allocator interface is implicitly sized (i.e. all deallocations are sized-deallocations).
We reached out to the allocator authors we were able to; we’ve summarized discussions here.
TCMalloc and jemalloc use this functionality on their fast paths, and regard it as a vital performance improvement (the authors list includes maintainers from both projects). Both do security hardening using size hints, in varying degrees depending on configuration (e.g. "always validate"/"never validate"/"check down paths where we touch metadata anyways, like when flushing thread-local caches").
The GrapheneOS hardened malloc implementer already implements free_sized
under the described semantics (using it for the security checking purposes), and supports standardization. He notes the C++ benefits as well.
The OpenBSD implementer sees the ability to specify the size to free
as a good thing, though he is interested more in the security benefits than in the performance ones.
The Windows C runtime delegates to the underlying kernel library HeapAlloc
/HeapFree
APIs, and the maintainers like the idea of being able to turn on size checking as a testing mechanism in debug builds. The current HeapAlloc
implementation wouldn’t see a performance benefit from avoiding the metadata lookups, and the maintainers are unsure they would trust the user-provided size without checking. They would rather the free_sized
implementation call some new underlying HeapAlloc
functionality than just delegate to free
. This would let them catch any user bugs now, and avoid uncovering latent bugs if at some point in the future they decide to use the size hints for a performance improvement. They’re willing to add such support functions to the HeapAlloc
API.
The current glibc implementation wouldn't see a performance benefit from sized deallocation, but might after some future enhancements the maintainers are considering. They like the security hardening possibilities, but wouldn’t risk trusting the user-specified size without validating it.
The musl maintainer is generally skeptical of API extensions, and leans that way here as well. He notes that the performance and security advantages can be in direct tension -- if an allocator trusts user sizes uncritically, then they also present a whole new attack vector. He sees some value in the increased hardening possibilities.
To be added to the memory management functions section (7.22.3, relative to n2596)
free_sized
function1.
2. If ptr
is the result obtained from a call to malloc(size)
, realloc(old_ptr, size)
, or calloc(nmemb, memb_size)
, where nmemb * memb_size
is equal to size
, this function behaves equivalently to free(ptr)
. Otherwise, the result is undefined.
3. NOTE: A conforming implementation may simply ignore size
and call free
.
4. NOTE: The result of an aligned_alloc
call may not be passed to free_sized
.
5. Implementations may provide extensions to query the usable size of an allocation, or to determine the usable size of the allocation that would result if a request for some other size were to succeed. Such implementations should allow passing the resulting usable size as the size
parameter, and provide functionality equivalent to free
in such cases.
6. The free_sized
function returns no value.
free_aligned_sized
function1.
2. If ptr
is the result obtained from a call to aligned_alloc(alignment, size)
, this function behaves equivalently to free(ptr)
. Otherwise, the result is undefined.
3. NOTE: A conforming implementation may simply ignore size
and alignment
and call free
.
4. NOTE: The result of a malloc
, calloc
, or realloc
call may not be passed to free_aligned_sized
.
5. Implementations may provide extensions to query the usable size of an allocation, or to determine the usable size of the allocation that would result if a request for some other size were to succeed. Such implementations should allow passing the resulting usable size as the size
parameter, and provide functionality equivalent to free
in such cases.
6. The free_aligned_sized
function returns no value.
It’s hard to prove that any given name added to the standard library won’t collide with existing code. There are two Google hits for free_sized
are in allocator implementations, and match the proposed prototype. One is from a seemingly unmaintained user library that declares it as a static function; it would have to change if recompiled against this declaration of free_sized
, but there would not be any ABI breakage if the library were simply relinked. free_aligned_sized
appears to be a globally unique name, at least as far as Google is concerned.
sized_free
and aligned_sized_free
are other reasonable choices, with the former having somewhat more Google-able name conflicts.
These semantics disallow passing an allocation obtained from alloc_aligned
to free_sized
(i.e. omitting the alignment
parameter during deallocation), or vice versa. So, if a program keeps track of the size of allocation requested, but not whether or not it was allocated with user-specified alignment, it has to use free
to deallocate the storage rather than free_sized
. This may seem unintuitive; informing the allocator of a true fact is disallowed.
This stems from an implementation constraint (one shared by all existing implementations the authors are aware of); many allocators implement aligned allocation by increasing the effective size of the returned storage. Passing the smaller size back to free_sized
would cause an incorrect size class computation, rendering the performance and security benefits unobtainable.
An alternate approach would be to only include free_aligned_sized
, and require allocations obtained from malloc
, realloc
, or calloc
to pass 0
for alignment
. This shaves down the specification by a function, but at the cost of making the remaining function somewhat more complex, and using a dynamic property (the value of alignment
) to pass information available statically (the source of the allocation). It also forces languages built on C to go through a trampoline even for the common case of allocations with default alignment, which imposes modest performance costs on them.
This wording has free_aligned_sized
parallel alloc_aligned
, and its parameter order is (ptr, alignment, size)
. C++ and Rust instead make their argument ordering (ptr, size, alignment)
. By matching aligned_alloc
, we disallow those languages from implementing their deallocation redirects purely in the linker (i.e. they need a host-language trampoline function that swaps their second and third argument before calling free_aligned_sized
).
An alternative to consider would be renaming to free_sized_aligned
, and swapping the second and third arguments. Practically, a small enough fraction of deallocations will use this function that it’s unclear that this overhead matters.
Existing systems often deploy custom allocators by using ELF symbol interposition. This means that existing system administrators may be preloading custom allocators to provide developer tooling functionality or improved performance with certain applications given a particular workload (e.g. Using LD_PRELOAD
to load jemalloc). Newly deployed applications that use the sized deallocation provided by the system C library will likely fail to operate correctly unless the interposers are also updated to match the system C library, or the C library takes care to handle this case specially. In deploying a new system C library that provides the new sized deallocation API the surrounding ecosystem should be coordinated to ensure the distribution is consistently providing interposers for the new API where required. The C library has avenues to address this issue on its own: at the time of the first call to the libc malloc
, the allocator can set a flag indicating that it’s the libc allocator that’s in use. The libc free_sized
and free_aligned_sized
implementations can check the flag and delegate to free
if some other allocator is providing the in-use definition of malloc.
Some reviewers made comments along the lines of the musl maintainer: that while performance-focused allocators might become more performant with sized allocation, it might be at the cost of security, since uncritically trusting user-provided sizes may present a new attack vector. Likewise, security-focused allocators might become more secure with sized deallocation, but without a performance improvement (since they need to validate sizes anyway).
While individual implementations might choose those portions of the design space, it's worth noting that both types of allocators can improve along both dimensions simultaneously; making the size information immediately available can improve instruction-level parallelism for redzone checking in security-focused allocators (by breaking address dependencies between cache misses), and can allow extra security checking in performance-focused ones.
calloc
overflow handlingThe current wording of calloc
does not directly address the possibility that the product of its arguments overflows a size_t
, but it nevertheless returns a non-NULL pointer (one whose size we cannot describe with a size_t
). In such cases, the size
argument passed to free_sized
would be smaller than the true size of the allocation. There are a few ways we could handle this.
calloc
to return allocations larger than SIZE_T_MAX
would not be able to assume that the passed-in size
parameter is not the result of an overflow (and, say, fall back to free
). Practically, no existing allocators we're aware of allow this sort of pathologically large allocation in the first place, so this is not a "real" implementation issue.calloc
is already required to return NULL
in such cases. The argument for this is that calloc
must return enough space for nmemb
objects of size size
. Such an array is itself an object whose size can be obtained via sizeof
, and must therefore fit in a size_t
. Since such an array is too large to exist, calloc
is not permitted to succeed.calloc
requirements to explicitly specify that it must return NULL
if computing nmemb * size
would overflow a size_t
. This already describes most (all?) existing implementations, and allowing objects to exist for which ordinary indexing operations overflow size_t
s presents security issues of the form calloc
is meant to solve in the first place.free_sized
requirements so that size
is required to be equal to the mathematical product of nmemb
and size
. This would make passing such an overflowed value to free_sized
undefined behavior.We recommend the first approach for the purposes of this paper, but would encourage consideration of the third as well.
C standard library implementations include dynamic storage allocation, but often make this functionality replaceable with alternate definitions of malloc
, free
, etc. supplied by a different vendor. We have tried to be consistent in our terminology, and use "implementation" to refer to a whole C implementation in its entirety, and "allocator" to refer to any particular implementation of aligned_alloc
, calloc
, free
, malloc
, realloc
specifically.↩
This is based on the savings in the deployment experience of the jemalloc and TCMalloc teams. This opportunity has been rediscovered many times by many different vendors; see e.g. https://www.samxi.org/papers/kanev_asplos2017.pdf or https://research.fb.com/wp-content/uploads/2020/03/Accelerometer-Understanding-Acceleration-Opportunities-for-Data-Center-Overheads-at-Hyperscale.pdf .↩
The true savings are probably even higher; avoiding these metadata lookups reduces the allocator’s cache footprint generally, so that other parts of the program benefit as well.↩
See https://github.com/google/boringssl/blob/94634a72b1cbc327dccfefd577436576992c644e/crypto/mem.c#L104 and the surrounding code.↩
https://docs.oracle.com/cd/E88353_01/html/E37843/umem-free-3malloc.html↩
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3536.html↩