1. Changelog
1.1. Revision 5 - April 12th, 2022
-
Additional syntax changes based on feedback from Joseph Myers, Hubert Tong, and users.
-
Minor wording tweaks and typo clean up.
-
An implementation is available in Godbolt (as of the last revision as well, and noted below).
-
The paper’s source code has been refactored:
-
Separated WG21 paper from WG14 paper.
-
Core paper together (rationale, reasoning), included in both C and C++ papers since rationale is identical.
-
Changed __has_embed to match feedback from the last standards meeting, namely that an empty resource returns 2 instead of 1 (but both decay to a truthy value during preprocessor conditional inclusion expressions). Modified the wording and the prose in § 4.4 __has_embed.
-
As a reaction to this, the is_empty embed parameter is an optional part of the proposal, as explained in § 4.2.1.3 Empty Signifier. This did affect a user in an impactful manner: the new functionality is fine for them, but has some downsides w.r.t. "repeating yourself".
-
The wording for the limit parameter (§ 7.5 Add 3 new sub clauses as §6.10.✨.1 through §6.10.✨.3, under §6.10.✨ Binary resource inclusion) adjusted to perform macro expansion, at least once. Exact wording may need help.
1.2. Revision 4 - February 7th, 2022
-
Clean up syntax.
-
Reimplement and deploy extension in Clang to ensure an implementation of named parameters works.
-
Change wording to encapsulate the new fixes.
-
Removed C++ wording to focus on C wording for this document.
1.3. Revision 3 - May 15th, 2021 (WG14)
-
Added post C meeting fixes to prepare for hopeful success next meeting.
-
Added 2 more examples to C and C++ wording.
-
Vastly improved wording and reduced ambiguities in syntax and semantics.
-
Fixed various wording issues.
1.4. Revision 2 - October 25th, 2020
-
Added post C++ meeting notes and discussion.
-
Removed type or bit specifications from the #embed directive.
-
Moved "Type Flexibility" section and related notes to the Appendix as they are now unpursued.
1.5. Revision 1 - April 10th, 2020
-
Added post C meeting notes and discussion.
-
Added discussion of potential endianness.
-
Improved wording section at the end to be more detailed in handling preprocessor (which does not understand types).
1.6. Revision 0 - January 5th, 2020
-
Initial release! 🎉
2. Polls & Votes
The votes for the C Committee are as follows:
-
Y: Ye
-
N: Nay
-
A: Abstain
2.1. January/February 2022 C Meeting
"Does WG14 want the embed parameter specification as shown in N2898?"
Y | N | A |
---|---|---|
12 | 2 | 8 |
From the January/February 2022 Meeting Minutes, Summary of Decisions:
WG14 wants the embed parameter specification as shown in N2898.
We interpret this as consensus. We keep the parameters but make the one that folks were questioning (is_empty) optional in response to the feedback during and after the meeting.
2.2. December 2020 Virtual C Meeting
"Do we want to allow #embed to appear in any context that is different from an initialization of a character array?"
Y | N | A |
---|---|---|
5 | 8 | 6 |
"Leaning in the direction of no but not clear." The paper author after consideration chose to keep this as-is right now. Discussion of the feature meant that trying to ban this from different contexts meant that a naïve, separated-preprocessor implementation would be banned and it would require special compiler magic to diagnose. Others pointed out that just trying to leave it "unspecified whether it works outside of the initialization of an array or not" is very dangerous to portability. The author agrees with this assessment and therefore will leave it as-is. The goal of this feature is to enable implementers to use the magic if they so choose, as an implementation detail and a Quality of Implementation selling point. Vendors who provide a simple expansion may not see improvements to throughput and speed of translation but that is their choice as an implementer. Therefore, we cannot do anything which would require them or any preprocessor implementation to traffic in magic directives unless they want to.
2.3. April 2020 Virtual C Meeting
"We want to have a proper preprocessor #embed over a #pragma-based directive."
This had UNANIMOUS CONSENT to pursue a proper preprocessor directive and NOT use the #pragma syntax. It is noted that the author deems this to be the best decision!
The following poll was later superseded in the C and C++ Committees.
"We want to specify embed as using a bits-per-element specification rather than a type-based specification." (2-way poll.)
Y | N | A |
---|---|---|
10 | 2 | 3 |
-
Y: 10 bits-per-element (Ye)
-
N: 2 type-based (Nay)
-
A: 4 Abstain (Abstain)
This poll will be a bit harder to accommodate properly. Using a constant expression that produces a numeric constant means that the max-length specifier is now ambiguous. The syntax of the directive may need to change to accommodate further exploration.
3. Introduction
For well over 40 years, people have been trying to plant data into executables for varying reasons. Whether it is to provide a base image with which to flash hardware in a hard reset, icons that get packaged with an application, or scripts that are intrinsically tied to the program at compilation time, there has always been a strong need to couple and ship binary data with an application.
Neither C nor C++ makes this easy for users to do, resulting in many individuals reaching for utilities such as xxd, writing Python scripts, or engaging in highly platform-specific linker calls to set up extern variables pointing at their data. Each of these approaches comes with benefits and drawbacks. For example, while working with the linker directly allows injection of very large amounts of data (5 MB and upwards), it does not allow accessing that data at any other point except runtime. Conversely, doing all of these things portably across systems and additionally maintaining the dependencies of all these resources and files in build systems both like and unlike make is a tedious task.
Thusly, we propose a new preprocessor directive whose sole purpose is to be #include, but for binary data: #embed.
3.1. Motivation
The reason this needs a new language feature is simple: current source-level encodings of "producing binary" to the compiler are incredibly inefficient both ergonomically and mechanically. Creating a brace-delimited list of numerics in C comes with baggage in the form of how numbers and lists are formatted. C’s preprocessor and the forcing of tokenization also forces an unavoidable cost to lexer and parser handling of values.
Therefore, using arrays with specific initialized values of any significant size becomes borderline impossible. One would think this old problem would be work-around-able in a succinct manner. Given how old this desire is (that comp.std.c thread is not even the oldest recorded feature request), proper solutions would have arisen. Unfortunately, that could not be farther from the truth. Even the compilers themselves suffer build time and memory usage degradation, as contributors to the LLVM compiler ran the gamut of the biggest problems that motivate this proposal in a matter of a week or two earlier this very year. Luke is not alone in his frustrations: developers all over suffer from the inability to include binary in their program quickly and perform exceptional gymnastics to get around the compiler’s inability to handle these cases.
C developer progress is impeded regarding the inability to handle this use case, and it leaves both old and new programmers wanting.
3.2. But How Expensive Is This?
Many different options as opposed to this proposal were seriously evaluated. Implementations were attempted in at least 2 production-use compilers, and more in private. To give an idea of usage and size, here are results for various compilers on a machine with the following specification:
-
Intel Core i7 @ 2.60 GHz
-
24.0 GB RAM
-
Debian Sid or Windows 10
-
Method: Execute command hundreds of times, stare extremely hard at htop/Task Manager
While the time and Measure-Command utilities work well for getting accurate timing information and can be run several times in a loop to produce a good average value, tracking memory consumption without intrusive efforts was much harder and thus relied on OS reporting with fixed-interval probes. Memory usage is therefore approximate and may not represent the actual maximum of consumed memory. All of these are using the latest compiler built from source if available, or the latest technology preview if available. Optimizations at
(GCC & Clang style)/
(MSVC style) or equivalent were employed to generate the final executable.
3.2.1. Speed
Strategy | 40 kilobytes | 400 kilobytes | 4 megabytes | 40 megabytes |
---|---|---|---|---|
#embed GCC | 0.236 s | 0.231 s | 0.300 s | 1.069 s |
xxd-generated GCC | 0.406 s | 2.135 s | 23.567 s | 225.290 s |
xxd-generated Clang | 0.366 s | 1.063 s | 8.309 s | 83.250 s |
xxd-generated MSVC | 0.552 s | 3.806 s | 52.397 s | Out of Memory |
3.2.2. Memory Size
Strategy | 40 kilobytes | 400 kilobytes | 4 megabytes | 40 megabytes |
---|---|---|---|---|
#embed GCC | 17.26 MB | 17.96 MB | 53.42 MB | 341.72 MB |
xxd-generated GCC | 24.85 MB | 134.34 MB | 1,347.00 MB | 12,622.00 MB |
xxd-generated Clang | 41.83 MB | 103.76 MB | 718.00 MB | 7,116.00 MB |
xxd-generated MSVC | ~48.60 MB | ~477.30 MB | ~5,280.00 MB | Out of Memory |
3.2.3. Analysis
The numbers here are not reassuring that compiler developers can reduce the memory and compilation time burdens with regard to large initializer lists. Furthermore, privately owned compilers and other static analysis tools perform almost exponentially worse here, taking vastly more memory and thrashing CPUs to 100% for several minutes (to sometimes several hours if e.g. the Swap is engaged due to lack of main memory). Every compiler must always consume a certain amount of memory in a relationship directly linear to the number of tokens produced. After that, it is largely implementation-dependent what happens to the data.
The GNU Compiler Collection (GCC) uses a tree representation and has many places where it spawns extra "garbage", as it's called in the various bug reports and work items from implementers. There has been a 16+ year effort on the part of GCC to reduce its memory usage and speed up initializers (C Bug Report and C++ Bug Report). Significant improvements have been made and there is plenty of room for GCC to improve here with respect to compiler and memory size. Somewhat unfortunately, one of the current changes in flight for GCC is the removal of all location information beyond the 256th initializer of large arrays in order to save on space. This technique is not viable for static analysis compilers that promise to recreate source code exactly as was written, and therefore discarding location or token information for large initializers is not a viable cross-implementation strategy.
LLVM’s Clang, on the other hand, is much more optimized. They maintain a much better scaling and ratio but still suffer the pain of their token overhead and Abstract Syntax Tree representation, though to a much lesser degree than GCC. A bug report was filed but talk from two prominent LLVM/Clang developers made it clear that optimizing things any further would require an extremely large refactor of parser internals with a lot of added functionality, with potentially dubious gains. As part of this proposal, the implementation provided does attempt to do some of these optimizations, and follows some of the work done in this post to try and prove memory and file size savings. (The savings in trying to optimize parsing large array literals were "around 10%", compared to the order-of-magnitude gains from
#embed and similar techniques).
Microsoft Visual C (MSVC) scales the worst of all the compilers, even when given the benefit of being on its native operating system. Both Clang and GCC outperform MSVC on Windows 10 or WINE as of the time of writing.
Linker tricks on all platforms perform better with time (though slower than the #embed implementation), but force the data to be optimizer-opaque (even on the most aggressive "Link Time Optimization" or "Whole Program Optimization" modes compilers had). Linker tricks are also exceptionally non-portable: whether it is the .incbin assembly command supported by certain compilers, specific invocations of rc.exe/objcopy or others, non-portability plagues their usefulness in writing Cross-Platform C (see Appendix for listing of techniques). This makes C decidedly unlike the "portable assembler" advertised by its proponents (and my Professors and co-workers).
4. Design
There are two design goals at play here, sculpted to specifically cover industry standard practices with build systems and C programs.
The first is to enable developers to get binary content quickly and easily into their applications. This can be icons/images, scripts, tiny sound effects, hardcoded firmware binaries, and more. In order to support this use case, this feature was designed for simplicity and builds upon widespread existing practice.
The second is extensibility. We recognize that talking to arbitrary places on either the file system, network, or similar has different requirements. After feedback from an implementer about syntax for extensions, we reached out to various users of the beta builds or custom builds using #embed-like things. It turns out that many of them, since they are the ones building and in some cases patching over/maintaining their compiler, have needs for extensible attributes that can be passed to #embed directives. Therefore, we structured the syntax in a way that is favorable to "simple" scanning tools but powerful enough to handle arbitrary directives and future extension points.
4.1. Goal: Simplicity and Familiarity
Providing a directive that mirrors #include makes it natural and easy to understand and use this new directive. It accepts both chevron-delimited (<>) and quote-delimited ("") strings like #include does. This matches the way people have been generating files to #include in their programs, libraries and applications: matching the semantics here preserves the same mental model. This makes it easy to teach and use, since it follows the same principles:
    /* default is unsigned char */
    const unsigned char icon_display_data[] = {
    #embed "art.png"
    };

    /* any type which can be initialized from integer constant expressions will do */
    const char reset_blob[] = {
    #embed "data.bin"
    };
Because of its design, it also lends itself to being usable in a wide variety of contexts and with a wide variety of vendor extensions. For example:
    /* attributes work just as well */
    const signed char aligned_data_str[] __attribute__ ((aligned (8))) = {
    #embed "attributes.xml"
    };
The above code obeys the alignment requirements for an implementation that understands GCC directives, without needing to add special support in the #embed directive for it: it is just another array initializer, like everything else.
4.1.1. Existing Practice - Search Paths
It follows the same implementation experience guidelines as #include
by leaving the search paths implementation defined, with the understanding that implementations are not monsters and will generally provide
/
and other related flags as their users require for their systems. This gives implementers the space they need to serve the needs of their constituency.
4.1.2. Existing Practice - Discoverable and Distributable
Build systems today understand the make dependency format, typically through use of the compiler flags -MD and friends. This sees widespread support, from CMake, Meson and Bazel to ninja and make. Even VC++ has a version of this flag -- /showIncludes -- that gets parsed by build systems.
This preprocessor directive fits perfectly into existing build architecture by being discoverable in the same way with the same tooling formats. It also blends perfectly with existing distributed build systems which preprocess their files with -frewrite-includes before sending them up to the build farm, as icecc and distcc do.
4.2. Syntax
The syntax for this feature is for an extensible preprocessor directive. The general form is:
where
refers to the syntax of
/
/
/
that is already part of the grammar. The syntax takes after many existing extensions in many preprocessor implementations and specifications, including OpenMP, Clang #pragmas, Microsoft #pragmas, and more. The named parameters were a recommendation by an implementer.
This syntax keeps the header-name, enclosed in angle brackets or quotation marks, first to allow a "simple" preprocessing tool to quickly scan for all the necessary dependency names without having to parse any of the names or parameters that come after. Both standard names and vendor/implementation-specific names can also be accommodated in the list of naked attributes, allowing for specific vendor extensions in a consistent manner while the standard can take the normal
names.
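To illustrate why keeping the resource name first helps "simple" tools, here is a minimal sketch of a line scanner that pulls the name out of an #embed line and ignores everything after it. The function name scan_embed_name and its return convention are assumptions made for this example, not part of the proposal, and the sketch deliberately gives up on macro-expanded names, which need a real preprocessor.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 and copies the resource name (without its
 * delimiters) into `name` when `line` is an #embed directive whose name is
 * written literally as <...> or "..."; returns 0 otherwise. Parameters that
 * follow the name are never inspected. */
int scan_embed_name(const char *line, char *name, size_t name_size)
{
    const char *p = line;
    while (*p == ' ' || *p == '\t') ++p;
    if (*p != '#') return 0;
    ++p;
    while (*p == ' ' || *p == '\t') ++p;
    if (strncmp(p, "embed", 5) != 0) return 0;
    p += 5;
    /* Reject longer identifiers such as "embedded". */
    if (isalnum((unsigned char)*p) || *p == '_') return 0;
    while (*p == ' ' || *p == '\t') ++p;

    char closing;
    if (*p == '<') closing = '>';
    else if (*p == '"') closing = '"';
    else return 0; /* macro-expanded name: out of scope for this sketch */
    ++p;

    const char *end = strchr(p, closing);
    if (end == NULL) return 0;
    size_t length = (size_t)(end - p);
    if (length == 0 || length >= name_size) return 0;
    memcpy(name, p, length);
    name[length] = '\0';
    return 1;
}
```

A dependency scanner built this way never has to understand limit, prefix, or any vendor parameter to list the files a translation unit depends on.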
4.2.1. Parameters
One of the things that’s critical about #embed is that, because it works with binary resources, those resources have characteristics very much different from source and header files present in a typical filesystem. There may be need for authentication (possibly networked), permission, access, additional processing (new-line normalization), and more that can be somewhat similarly specified through the implementation-defined parameters already available through the C and C++ Standards' "fopen" function.
However, adding a "mode" string similar to fopen
, while extensible, is archaic and hard to check. Therefore, the syntax allows for multiple "named expressions", encapsulated in parentheses, and marked with :: as a form of "namespacing" identifiers similar to [[ ]] attribute-style syntax. However, parameters do not have the balanced square bracket [[ ]] delimiters, and just use the
form with an optional parentheses-enclosed list of arguments.
Some example attributes include interpreting the binary data as "text" rather than a bitstream with
, providing authenticated access with
,
to change the element of each entry produced, and more. These are all things vendors have indicated they might support for their use cases.
4.2.1.1. Limit Parameter
The earliest adopters and testers of the implementation reported problems when trying to access POSIX-style
devices and pseudo-files that do not have a logical limitation. These "infinity files" served as the motivation for introducing the "limit" parameter; there are a number of resources which are logically infinite and thus having a compiler read all of the data would result in an Out of Memory error, much like with #include if someone did #include "/dev/urandom".
The limit parameter is specified after the resource name in #embed, like so:
    const int please_dont_oom_kill_me[] = {
    #embed "/dev/urandom" limit(512)
    };
This prevents locking compilers in an infinite loop of reading from potentially limitless resources. Note the parameter is a hard upper bound, and not an exact requirement. A resource may expand to a 16-element list rather than a 512-element list, and that is entirely expected behavior. The limit is the number of elements allowed up to the maximum for this type.
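The "upper bound, not exact requirement" behaviour can be sketched with a run-time analogue. The code below is only a model under stated assumptions: byte_source, endless_source, finite_source, and take_with_limit are hypothetical names invented for this illustration, and a real implementation applies the limit during translation rather than at run time.

```c
#include <stddef.h>

/* A byte source returns the next byte (0..255), or -1 once it is exhausted. */
typedef int (*byte_source)(void *state);

/* Stand-in for an "infinity file" such as /dev/urandom: never runs out. */
int endless_source(void *state)
{
    unsigned *counter = state;
    return (int)((*counter)++ & 0xFFu);
}

/* Stand-in for an ordinary file: `*state` holds the number of bytes left. */
int finite_source(void *state)
{
    unsigned *remaining = state;
    if (*remaining == 0) {
        return -1;
    }
    --*remaining;
    return 0x2A;
}

/* Collects at most `limit` elements and stops early if the source ends
 * first, so the result may be shorter than `limit` but never longer. */
size_t take_with_limit(byte_source next, void *state,
                       unsigned char *out, size_t limit)
{
    size_t count = 0;
    int byte;
    while (count < limit && (byte = next(state)) != -1) {
        out[count++] = (unsigned char)byte;
    }
    return count;
}
```

With a limit of 512, the endless source yields exactly 512 elements, while a 16-byte source yields 16, matching the 16-element case described above.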
This does not provide a form of "timeout" for e.g. resources stored on a Network File System or an inactivity limit or similar. Implementations that utilize support for more robust handling of resource location schemes like Uniform Resource Identifiers (URIs) that may interface with resources that take extensive amounts of time to locate should provide implementation-defined extensions for timeout or inactivity checks.
4.2.1.2. Non-Empty Prefix and Suffix
Something pointed out by others using this preprocessor directive is a problem similar to __VA_ARGS__: when placing this parameter with other tokens before or after the #embed directive, it sometimes made it hard to properly anticipate whether a file was empty or not.
The #embed proposal includes a prefix and suffix entry that applies if and only if the resource is non-empty:
    const unsigned char null_terminated_file_data[] = {
    #embed "might_be_empty.txt" \
        prefix(0xEF, 0xBB, 0xBF, ) /* UTF-8 BOM */ \
        suffix(,)
        0 // always null-terminated
    };
prefix and suffix only work if the #embed resource is not empty. If a user wants a prefix or suffix that appears unconditionally, they can simply type the tokens they want before and after: there is nothing to be gained from adding a standards-mandated prefix and suffix that works in both the empty and non-empty case.
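As a rough model of that rule, the sketch below builds the text an expansion would contain from three strings. It is not how a preprocessor works internally (it manipulates text rather than tokens), and model_prefix_suffix is a hypothetical name chosen for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model: `elements` is the comma-separated list the resource
 * itself expands to ("" for an empty resource). Writes the modelled
 * expansion into `out` and returns its length, or -1 if it does not fit. */
int model_prefix_suffix(const char *prefix, const char *elements,
                        const char *suffix, char *out, size_t out_size)
{
    if (out_size == 0) {
        return -1;
    }
    if (elements[0] == '\0') {
        /* Empty resource: both prefix and suffix are dropped. */
        out[0] = '\0';
        return 0;
    }
    int written = snprintf(out, out_size, "%s%s%s", prefix, elements, suffix);
    if (written < 0 || (size_t)written >= out_size) {
        return -1;
    }
    return written;
}
```

For the null-terminated example above, a two-byte resource would model as the BOM bytes, the two elements, and the trailing comma, after which the unconditional 0 completes the initializer; an empty resource would leave only the 0.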
4.2.1.3. Empty Signifier
This is for the case when the given resource exists, but it is empty. This allows a user to have a sequence of tokens between the parentheses passed to the is_empty parameter here:
.
If
exists but is empty, this will replace the directive with the (potentially macro expanded) contents between the parentheses of the is_empty parameter. This can also be combined with a limit(0) parameter to always have the is_empty tokens returned. This can be useful for macro-expanded integer constant expressions that may end up being 0.
An example program
:
    int main () {
    #define SOME_CONSTANT 0
        return
    #embed </dev/urandom> is_empty(0) limit(SOME_CONSTANT)
        ;
    }
This program will expand to the equivalent of return 0; if SOME_CONSTANT is 0, or a single (random) unsigned char value if it is 1. (If SOME_CONSTANT is greater than 1, it produces a comma-delimited list of integers, which gets treated as a sequence to the comma operator after the return keyword. Some compilers warn about the left-hand operands having no effect.)
Previously, this was the only way to detect that the resource was empty. This functionality can be substituted by using __has_embed with the same contents and specifically checking for a return value of 2. While this change creates some repeating-yourself friction in the identifier, there was only 1 user who actually needed the is_empty signifier, and that was only because they were using it to replace it with a very particularly sized and shaped data array. The __has_embed technique worked just fine for them as well at the cost of some repetition (to check for embed parameters), and after some discussion with the user it was deemed okay to switch to this syntax, since during the discussion of #embed in the January/February 2022 WG14 C Standards Committee Meeting it was commented on that there were too many signifiers.
We do not want to entirely lose that user’s use case, however, so we have made the is_empty parameter an optional part of the wording, to be voted on as a separate piece.
4.3. Constant Expressions
Both C and C++ compilers have rich constant folding capabilities. While C compilers only acknowledge a fraction of what is possible by larger implementations like MSVC, Clang, and GCC, C++ has an entire built-in compile-time programming bit, called constexpr
. Most typical solutions cannot be used as constant expressions because they are hidden behind run-time or link-time mechanisms (
, or the resource compiler
on Windows, or the static library archiving tools). This means that many algorithms and data components which could strongly benefit from having direct access to the values of the integer constants do not because the compiler cannot "see" the data, or because Whole Program Optimization cannot be aggressive enough to do anything with those values at that point in the compilation (i.e., during the final linking stage).
This makes #embed especially powerful, since it guarantees these values are available as if they were written as a sequence of integers whose values fit within an unsigned char.
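A small sketch of what that visibility buys in C. The array below is written out by hand as a stand-in for an #embed "art.png" expansion (its contents are the standard eight-byte PNG signature), so the example compiles on translators without the directive; the names icon, ICON_SIZE, icon_size, and icon_has_png_signature are assumptions made for this illustration.

```c
#include <stddef.h>

/* Stand-in for `#embed "art.png"`: the eight-byte PNG file signature. */
static const unsigned char icon[] = {
    0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
};

/* Because the translator sees the whole initializer, the element count is
 * an integer constant expression and can be checked during translation. */
enum { ICON_SIZE = sizeof(icon) / sizeof(icon[0]) };
_Static_assert(ICON_SIZE == 8, "a PNG signature is eight bytes long");

size_t icon_size(void)
{
    return ICON_SIZE;
}

/* The values are ordinary array elements, visible to the optimizer. */
int icon_has_png_signature(void)
{
    return icon[0] == 0x89 && icon[1] == 'P' && icon[2] == 'N' && icon[3] == 'G';
}
```

With link-time techniques, the size is typically only known as a symbol resolved by the linker, so a check like the _Static_assert above is not possible.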
4.4. __has_embed
C and C++ both support __has_include. It makes sense to have an analogous __has_embed identifier. It can take a chevron-delimited or quote-delimited resource name identifier, as well as additional arguments to let vendors pass in any additional arguments they need to properly access the file (following the same attribute-like parameters passed to the directive).
__has_embed evaluates to:
-
0 if the resource is not found or any parameter in the embed-parameter-list does not exist; or,
-
1 if the resource is found, it is not empty, and the parameters in the embed-parameter-list (including the vendor-specific ones) are supported; or,
-
2 if the resource is found, it is empty, and the parameters in the embed-parameter-list (including the vendor-specific ones) are supported.
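The three outcomes can be written down as a tiny decision function. This is only a model of the list above: model_has_embed and the enumerator names are hypothetical, and a real implementation evaluates __has_embed inside the preprocessor rather than through a function call.

```c
#include <stdbool.h>

/* Hypothetical names for the three results described in the list above. */
enum {
    EMBED_NOT_FOUND = 0, /* missing resource or unsupported parameter */
    EMBED_FOUND     = 1, /* resource found and not empty */
    EMBED_EMPTY     = 2  /* resource found but empty */
};

/* Models the value of a __has_embed expression. */
int model_has_embed(bool resource_found, bool resource_empty,
                    bool parameters_supported)
{
    if (!resource_found || !parameters_supported) {
        return EMBED_NOT_FOUND;
    }
    return resource_empty ? EMBED_EMPTY : EMBED_FOUND;
}
```

Both 1 and 2 are non-zero, so a plain #if __has_embed(...) check treats an empty resource as present; only code that cares about emptiness needs to compare against 2.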
This may raise questions of "TOCTTOU" (Time of Check to Time of Use) problems, but we already have these problems between __has_include and #include. They are also already solved by existing implementations. For example, the LLVM/Clang compiler uses FileManager and SourceManager abstractions which cache files. GCC’s "libcpp" will cache already-opened files (up to a limit). Any TOCTTOU problems have already been managed and provided for using the current #include infrastructure of these compilers, and if any compiler wants a more streamlined and consistent experience they should deploy whatever Quality of Implementation (QoI) they see fit to achieve that goal.
Finally, note that this directive DOES expand to 0 if given a parameter that the implementation does not support. This makes it easier to determine if a given vendor-specific embed directive is supported. In fact, support can be checked in most cases by using a combination of __FILE__ and __has_embed:
    int main () {
    #if __has_embed (__FILE__ clang::element_type(short))
        // load "short" values directly from memory
        short meow[] = {
    #embed "bits.bin" clang::element_type(short)
        };
    #else
        // no support for implementation-specific
        // clang::element_type parameter
        unsigned char meow_bytes[] = {
    #embed "bits.bin"
        };
        unsigned short meow[] = {
            /* parse meow_bytes into short values by-hand! */
        };
    #endif
        return 0;
    }
For the C proposal, the wording for __has_embed returning 2 is optional, as it depends on whether or not the C Committee would like to solve this problem in one specific direction or another.
4.5. Bit Blasting: Endianness
What would happen if you did fread into an int?
that’s my answer 🙂
– Isabella Muerte
It’s a simple answer. While we may not be reading into int, the idea here is that the interpretation of the directive is meant to get as close to directly copying the bitstream as is possible. A compiler-magic based implementation like the ones provided as part of this paper have no endianness issues, but an implementation which writes out integer literals may need to be careful of host vs. target endianness to make sure it serializes correctly to the final binary. As a litmus test, the following code -- given a suitably sized "foo.bin" resource -- should return 0:
    #include <cstdio>
    #include <cstring>

    int main () {
        const unsigned char foo0[] = {
    #embed "foo.bin"
        };
        // not const: fread fills this buffer at run time
        unsigned char foo1[sizeof(foo0)];

        // binary mode, so the bytes match what the translator saw
        std::FILE* fp = std::fopen("foo.bin", "rb");
        if (fp == nullptr) {
            return 1;
        }
        std::size_t foo1_read = std::fread(foo1, 1, sizeof(foo1), fp);
        std::fclose(fp);
        if (foo1_read != sizeof(foo1)) {
            return 1;
        }
        if (std::memcmp(&foo0[0], &foo1[0], sizeof(foo0)) != 0) {
            return 1;
        }
        return 0;
    }
If the same file, "foo.bin", is used during both translation and execution, this program should always return 0. This is what the wording below attempts to achieve. Note that this is always a concern already, due to CHAR_BIT and other target environment-specific variables that already exist; implementations have always been responsible for handling differences between the host and the target and this directive is no different. If the CHAR_BIT of the host vs. the target is the same, then the directive is simpler. If it is not, then an implementation will have to perform translation.
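For byte-sized elements there is nothing to reorder, so the concern mostly appears once elements are wider than a byte (for example, under a vendor parameter that produces short values). The sketch below shows the choice such an implementation would face when writing out integer literals. It assumes 8-bit bytes, and bytes_to_u16 is a hypothetical helper, not anything specified by this paper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, assuming 8-bit bytes: combines consecutive byte
 * pairs from the resource into 16-bit element values. `big_endian` selects
 * the byte order of the target whose in-memory layout should match the
 * resource bit-for-bit. Returns the number of elements written. */
size_t bytes_to_u16(const unsigned char *bytes, size_t byte_count,
                    int big_endian, uint16_t *out)
{
    size_t element_count = byte_count / 2;
    for (size_t i = 0; i < element_count; ++i) {
        unsigned first = bytes[2 * i];
        unsigned second = bytes[2 * i + 1];
        unsigned high = big_endian ? first : second;
        unsigned low = big_endian ? second : first;
        out[i] = (uint16_t)((high << 8) | low);
    }
    return element_count;
}
```

The same four resource bytes therefore become different literal values depending on the target, which is exactly what keeps the fread comparison above returning 0.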
5. Implementation Experience
An implementation of this functionality is available in branches of both GCC and Clang, accessible right now with an internet connection through the online utility Compiler Explorer. The Clang compiler with this functionality is called "x86-64 clang (thephd.dev)" in the Compiler Explorer UI:
    int main () {
        return
    #embed </dev/urandom> limit(1)
        ;
    }
6. Alternative Syntax
There were previous concerns about the syntax using pragma-like syntax and more. WG14 voted to keep the syntax as a plain #embed preprocessor directive, unanimously.
Previously, different syntax was used to specify the limit and other kinds of parameters. These have been normalized to be a suffix of attribute-like parameters, at the request of an implementer and the C++ Standards Committee discussion of the paper in June 2021. It has had hugely positive feedback and users have reported the new syntax to be clearer, while other implementers have stated this is much better for them and the platforms for which they intend to add additional embed parameters.
7. Wording
This wording is relative to C’s latest working draft.
Editor’s Note: The ✨ characters are intentional. They represent stand-ins to be replaced by the editor.
7.1. Modify 6.4, paragraph 4
If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token. There is one exception to this rule:
[removed] header name preprocessing tokens are recognized only within #include preprocessing directives, in __has_include expressions, and in implementation-defined locations within #pragma directives.
[added] header name preprocessing tokens are recognized only within #include and #embed preprocessing directives, in __has_include and __has_embed expressions, as well as in implementation-defined locations within #pragma directives.
In such contexts, a sequence of characters that could be either a header name or a string literal is recognized as the former.
7.2. Add another control-line production to §6.10 Preprocessing Directives, Syntax, paragraph 1
control-line:
…
# embed pp-tokens new-line
…
embed-parameter-list:
attribute
embed-parameter-list attribute
7.3. Modify §6.10.1 Conditional inclusion to include a new "has-embed-expression" production by modifying paragraph 1, then modify the following paragraphs:
Syntax
…
has-include-expression:
__has_include ( header-name )
__has_include ( header-name-tokens )
has-embed-expression:
__has_embed ( header-name embed-parameter-list )
__has_embed ( header-name-tokens embed-parameter-list )
…
The expression that controls conditional inclusion shall be an integer constant expression except that: identifiers (including those lexically identical to keywords) are interpreted as described below182) and it may contain zero or more defined macro expressions, has_include expressions, has_embed expressions, and/or has_c_attribute expressions as unary operator expressions.
…
The second forms of the has_include expression and has_embed expression are considered only if the first form does not match, in which case the preprocessing tokens are processed just as in normal text.
…
The resource (6.10.✨) identified by the header-name preprocessing token sequence in each contained has_embed expression is searched for as if those preprocessing tokens were the pp-tokens in a #embed directive, except that no further macro expansion is performed. Such a directive shall satisfy the syntactic requirements of a #embed directive. The has_embed expression evaluates to:
— 0 if the search fails or if any of the embed parameters in the embed parameter list specified are not supported by the implementation for the #embed directive; or,
— 1 if the search for the resource succeeds and all embed parameters in the embed parameter list specified are supported by the implementation for the #embed directive.
…
Semantics
The #ifdef, #ifndef, #elifdef, and #elifndef, and the defined conditional inclusion operator, shall treat __has_include, __has_embed, and __has_c_attribute as if they were the name of defined macros. The identifiers __has_include, __has_embed, and __has_c_attribute shall not appear in any context not mentioned in this subclause.
…
EXAMPLE: A combination of __FILE__ (6.10.8.1) and __has_embed could be used to check for support of specific implementation extensions for the #embed directive’s parameters.

    #if __has_embed(__FILE__ ext::token(0xB055))
    #define DESCRIPTION "Supports extended token embed"
    #else
    #define DESCRIPTION "Does not support extended token embed"
    #endif

EXAMPLE: The below snippet uses __has_embed to check for support of a specific implementation-defined embed parameter, and otherwise uses standard behavior to produce the same effect.

    void parse_into_s(short* ptr, const unsigned char* ptr_bytes, unsigned long long size);

    int main () {
    #if __has_embed ("bits.bin" ds9000::element_type(short))
        /* Implementation extension: create short integers from the */
        /* translation environment resource into */
        /* a sequence of integer constants */
        short meow[] = {
    #embed "bits.bin" ds9000::element_type(short)
        };
    #elif __has_embed ("bits.bin")
        /* no support for implementation-specific */
        /* ds9000::element_type(short) parameter */
        const unsigned char meow_bytes[] = {
    #embed "bits.bin"
        };
        short meow[sizeof(meow_bytes) / sizeof(short)] = {};
        /* parse meow_bytes into short values by-hand! */
        parse_into_s(meow, meow_bytes, sizeof(meow_bytes));
    #else
    #error "cannot find bits.bin resource"
    #endif
        return (int)(meow[0] + meow[(sizeof(meow) / sizeof(*meow)) - 1]);
    }

…
Forward references: … Mandatory macros (6.10.8.1) …
7.4. Add a new sub clause as §6.10.✨ to §6.10 Preprocessing Directives, preferably after §6.10.2 Source file inclusion
6.10.✨ Binary resource inclusion
Description
A resource is a source of data accessible from the translation environment. An embed parameter is a single attribute in the embed parameter list. A resource has an implementation resource width, which is the implementation-defined size in bits of the located resource. It also has a resource width, which is either:
— the number of bits as computed from the optionally-provided limit embed parameter (6.10.✨.1), if present; or,
— the implementation resource width.
An embed parameter list is a whitespace-delimited list of attributes which may modify the result of the replacement for the #embed preprocessing directive.
Constraints
An #embed directive shall identify a resource that can be processed by the implementation as a binary data sequence given the provided embed parameters.
Embed parameters not specified in this document shall be implementation-defined. Implementation-defined embed parameters may change the below-defined semantics of the directive; otherwise, #embed directives which do not contain implementation-defined embed parameters shall behave as described in this document.
A resource is considered empty when its resource width is zero.
Let embed element width be either:
— an integer constant expression greater than zero determined by an implementation-defined embed parameter; or,
— CHAR_BIT.
The result of (resource width) % (embed element width) shall be zero.FN0✨)
Semantics
The expansion of a #embed directive is a token sequence formed from the list of integer constant expressions described below. The group of tokens for each integer constant expression in the list is separated in the token sequence from the group of tokens for the previous integer constant expression in the list by a comma. The sequence neither begins nor ends in a comma. If the list of integer constant expressions is empty, the token sequence is empty. The directive is replaced by its expansion and, with the presence of certain embed parameters, additional or replacement token sequences.
A preprocessing directive of the form
# embed < h-char-sequence > embed-parameter-listopt new-line
searches a sequence of implementation-defined places for a resource identified uniquely by the specified sequence between the < and >. The search for the named resource is done in an implementation-defined manner.
A preprocessing directive of the form
# embed " q-char-sequence " embed-parameter-listopt new-line
searches a sequence of implementation-defined places for a resource identified uniquely by the specified sequence between the " delimiters. The search for the named resource is done in an implementation-defined manner. If this search is not supported, or if the search fails, the directive is reprocessed as if it read
# embed < h-char-sequence > embed-parameter-listopt new-line
with the identical contained q-char-sequence (including > characters, if any) from the original directive.
Either form of the #embed directive specified previously behaves as specified below. The values of the integer constant expressions in the expanded sequence are determined by an implementation-defined mapping of the resource’s data. Each integer constant expression’s value is in the range from 0 to (2^(embed element width)) - 1, inclusive.FN1✨)
If the list of integer constant expressions:
— is used to initialize an array of a type compatible with unsigned char or, if char is an unsigned type, char; and,
— the embed element width is equivalent to CHAR_BIT (5.2.4.2.1),
then the contents of the initialized elements of the array are as-if the resource’s binary data was fread into the array at translation time.
A preprocessing directive of the form
# embed pp-tokens new-line
(that does not match one of the two previous forms) is permitted. The preprocessing tokens after embed in the directive are processed just as in normal text. (Each identifier currently defined as a macro name is replaced by its replacement list of preprocessing tokens.) The directive resulting after all replacements shall match one of the two previous forms.FN2✨) The method by which a sequence of preprocessing tokens between a < and a > preprocessing token pair or a pair of " characters is combined into a single resource name preprocessing token is implementation-defined.
An embed parameter with an attribute token that is one of the following is a standard embed parameter:
limit
prefix
suffix
FN0✨) This constraint helps ensure data is neither filled with padding values nor truncated in a given environment, and helps ensure the data is portable with respect to usages of memcpy with character type arrays initialized from the data.
FN1✨) For example, an embed element width of 8 will yield a range of values from 0 to 255, inclusive.
FN2✨) Note that adjacent string literals are not concatenated into a single string literal (see the translation phases in 5.1.1.2); thus, an expansion that results in two string literals is an invalid directive.
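The width constraint and the value range above can be illustrated with a little arithmetic. The following is only a sketch under stated assumptions: the function names embed_element_count and embed_max_value are hypothetical and appear nowhere in the proposed wording, and widths are modelled as plain unsigned integers.

```c
/* Number of integer constant expressions a resource would expand to,
   or -1 when the resource width is not a multiple of the embed element
   width (the constraint violation described above).
   Hypothetical helper, for illustration only. */
long long embed_element_count(unsigned long long resource_width_bits,
                              unsigned element_width_bits) {
    if (element_width_bits == 0) {
        return -1; /* the embed element width must be greater than zero */
    }
    if (resource_width_bits % element_width_bits != 0) {
        return -1; /* (resource width) % (embed element width) != 0 */
    }
    return (long long)(resource_width_bits / element_width_bits);
}

/* Largest value a single element may take: 2^(embed element width) - 1.
   Sketch limitation: widths of 64 or more saturate at the largest
   64-bit value instead of computing a wider result. */
unsigned long long embed_max_value(unsigned element_width_bits) {
    if (element_width_bits >= 64) {
        return 0xFFFFFFFFFFFFFFFFULL;
    }
    return (1ULL << element_width_bits) - 1;
}
```

Under this model, a 3-bit resource with an 8-bit element width is rejected, while a 24-bit resource yields three elements, each in the range 0 to 255.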
Recommended Practice
The #embed directive is meant to translate binary data in resources to a sequence of integer constant expressions in a way that preserves the value of the resource’s bit stream where possible.
Implementations should take into account translation-time bit and byte orders as well as execution time bit and byte orders to more appropriately represent the resource’s binary data from the directive. This maximizes the chance that, if the resource referenced at translation time through the #embed directive is the same one accessed through execution-time means, the data that is, e.g., fread or similarly read into contiguous storage will compare bit-for-bit equal to an array of character type initialized from an #embed directive’s expanded contents.
Implementations are encouraged to diagnose embed parameters that they do not process or understand, with the understanding that __has_embed can be used to check if an implementation supports a given embed parameter.
EXAMPLE 1 Placing a small image resource.
#include <stddef.h>

void have_you_any_wool(const unsigned char*, size_t);

int main (int, char*[]) {
    const unsigned char baa_baa[] = {
#embed "black_sheep.ico"
    };

    have_you_any_wool(baa_baa, sizeof(baa_baa));

    return 0;
}
EXAMPLE 2 Checking the first 4 elements of a sound resource.
#include <assert.h>

int main (int, char*[]) {
    const char sound_signature[] = {
#embed <sdk/jump.wav> limit(2+2)
    };

    // verify PCM WAV resource
    assert(sound_signature[0] == 'R');
    assert(sound_signature[1] == 'I');
    assert(sound_signature[2] == 'F');
    assert(sound_signature[3] == 'F');
    assert(sizeof(sound_signature) == 4);

    return 0;
}
EXAMPLE 3 Constraint violation for a resource which is too small.
int main (int, char*[]) {
    const unsigned char coefficients[] = {
#embed "only_3_bits.bin" // constraint violation
    };

    return 0;
}
EXAMPLE 4 Extra elements added to array initializer.
#include <string.h>

#ifndef SHADER_TARGET
#define SHADER_TARGET "edith-impl.glsl"
#endif

extern char* null_term_shader_data;

void fill_in_data () {
    const char internal_data[] = {
#embed SHADER_TARGET \
    suffix(,)
    0
    };

    strcpy(null_term_shader_data, internal_data);
}
EXAMPLE 5 Initialization of non-arrays.
Non-array types can still be initialized since the directive produces a comma-delimited list of integer constant expressions, a single integer constant expression, or nothing.
int main () {
    int i = {
#embed "i.dat"
    }; /* i value is [0, 2^(embed element width)) from first entry */
    int i2 =
#embed "i.dat"
    ; /* valid if i.dat produces 1 value,
         i2 value is [0, 2^(embed element width)) */
    struct s {
        double a, b, c;
        struct { double e, f, g; };
        double h, i, j;
    };
    struct s x = {
    /* initializes each element in order according to
       initialization rules with comma-separated list
       of integer constant expressions inside of braces */
#embed "s.dat"
    };
    return 0;
}
EXAMPLE 6 Equivalency of bit sequence and bit order.
#include <string.h>
#include <stddef.h>
#include <stdio.h>

int main() {
    const unsigned char embed_data[] = {
#embed <data.dat>
    };

    const size_t f_size = sizeof(embed_data);
    unsigned char f_data[f_size];
    FILE* f_source = fopen("data.dat", "rb");
    if (f_source == NULL)
        return 1;
    char* f_ptr = (char*)&f_data[0];
    if (fread(f_ptr, 1, f_size, f_source) != f_size) {
        fclose(f_source);
        return 1;
    }
    fclose(f_source);

    int is_same = memcmp(&embed_data[0], f_ptr, f_size);
    // if both operations refer to the same resource/file at
    // execution time and translation time, "is_same" should be 0
    return is_same == 0 ? 0 : 1;
}
EXAMPLE 7 A potential constraint violation from a resource that may not have enough information in an environment that has a CHAR_BIT greater than 24.
int main (int, char*[]) {
    const unsigned char arr[] = {
#embed "24_bits.bin" limit(1) // may be a constraint violation
    };

    return 0;
}
EXAMPLE 8 A null-terminated character array with a prefix value and suffix set of additional tokens when the resource is not empty.
#include <string.h>
#include <assert.h>

#ifndef SHADER_TARGET
#define SHADER_TARGET "ches.glsl"
#endif

extern char* merp;

void init_data () {
    const char whl[] = {
#embed SHADER_TARGET \
    prefix(0xEF, 0xBB, 0xBF, ) /* UTF-8 BOM */ \
    suffix(,)
    0
    };
    // always null terminated,
    // contains BOM if not-empty
    int is_good = (sizeof(whl) == 1 && whl[0] == '\0')
        || (whl[0] == '\xEF' && whl[1] == '\xBB'
            && whl[2] == '\xBF' && whl[sizeof(whl) - 1] == '\0');
    assert(is_good);
    strcpy(merp, whl);
}
EXAMPLE 9 This resource is considered empty due to the limit(0) embed parameter, always. This program always returns 0, even if the resource is searched for and found successfully by the implementation.
int main () {
    return
#embed </owo/uwurandom> limit(0) prefix(1) is_empty(0)
    ;
    // becomes:
    // return 0;
}
EXAMPLE 10 This resource is considered empty due to the limit(0) embed parameter, always, including in __has_embed clauses.
int main () {
#if __has_embed(</owo/uwurandom> limit(0) prefix(1)) == 2
    // if </owo/uwurandom> exists, this
    // token sequence is always taken.
    return 0;
#else
    // the resource does not exist
    #error "The resource does not exist"
#endif
}
EXAMPLE 11 Similar to a previous example, except it illustrates macro expansion specifically done for the limit(…) parameter.
#include <assert.h>

#define TWO_PLUS_TWO 2+2

int main (int, char*[]) {
    const char sound_signature[] = {
    /* the token sequence within the parentheses
       for the "limit" parameter undergoes macro
       expansion, at least once, resulting in
#embed <sdk/jump.wav> limit(2+2)
    */
#embed <sdk/jump.wav> limit(TWO_PLUS_TWO)
    };

    // verify PCM WAV resource
    assert(sound_signature[0] == 'R');
    assert(sound_signature[1] == 'I');
    assert(sound_signature[2] == 'F');
    assert(sound_signature[3] == 'F');
    assert(sizeof(sound_signature) == 4);

    return 0;
}
7.5. Add 3 new sub clauses as §6.10.✨.1 through §6.10.✨.3, under §6.10.✨ Binary resource inclusion
6.10.✨.1 limit parameter
Constraints
It may appear zero, one, or multiple times in the embed parameter list. The most recent in lexical order applies and the others shall be ignored. Its attribute argument clause shall be present and have the form:
( balanced-token-sequence )
and shall be an integer constant expression.
The token defined shall not appear within the balanced-token-sequence.
Semantics
The embed parameter with an attribute token limit denotes a balanced token sequence that will be used to compute the resource width. The balanced token sequence is evaluated after it is processed at least once as normal text, using the same rules for conditional inclusion (6.10.1), with the exception that any defined macro expressions are not permitted.
The resource width is:
— 0, if the integer constant expression evaluates to 0; or,
— the implementation resource width if it is less than the embed element width multiplied by the integer constant expression; or,
— the embed element width multiplied by the integer constant expression, if it is less than or equal to the implementation resource width.
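The three cases for the resource width amount to taking the smaller of the implementation resource width and the requested width. A minimal sketch follows, assuming all widths are counted in bits and fit in an unsigned long long; the name sketch_limited_width is hypothetical and not part of the wording.

```c
#include <limits.h>

/* Resource width after applying limit(N), following the three cases above.
   Hypothetical helper for illustration; all widths are in bits. */
unsigned long long sketch_limited_width(unsigned long long impl_width,
                                        unsigned long long element_width,
                                        unsigned long long limit) {
    if (limit == 0) {
        return 0; /* limit(0): the resource is considered empty */
    }
    if (element_width == 0) {
        return 0; /* defensive only: the element width is required to be > 0 */
    }
    if (limit > ULLONG_MAX / element_width) {
        /* the product would overflow, so it certainly exceeds impl_width */
        return impl_width;
    }
    unsigned long long requested = element_width * limit;
    return impl_width < requested ? impl_width : requested;
}
```

For instance, with an 8-bit embed element width, limit(2) on a 32-bit resource gives a width of 16 bits (two elements), while limit(100) leaves the width at the full 32 bits.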
6.10.✨.2 prefix parameter
Constraints
It may appear zero, one, or multiple times in the embed parameter list. The most recent in lexical order applies and the others are ignored. Its attribute argument clause shall be present and have the form:
( balanced-token-sequenceopt )
Semantics
The embed parameter with an attribute token prefix denotes a balanced token sequence within its attribute argument clause that will be placed immediately before the result of the associated #embed directive’s expansion, if any.
If the resource is empty, then prefix has no effect and is ignored.
6.10.✨.3 suffix parameter
Constraints
It may appear zero, one, or multiple times in the embed parameter list. The most recent in lexical order applies and the others are ignored. Its attribute argument clause shall be present and have the form:
( balanced-token-sequenceopt )
Semantics
The embed parameter with an attribute token suffix denotes a balanced token sequence within its attribute argument clause that will be placed immediately after the result of the associated #embed directive’s expansion.
If the resource is empty, then suffix has no effect and is ignored.
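The effect of prefix and suffix on the expansion can be modelled as ordinary text assembly. The sketch below is an illustration only: the names append_text and sketch_embed_expansion are hypothetical, the resource is assumed to be a sequence of 8-bit elements, and a real implementation produces preprocessing tokens rather than a character string.

```c
#include <stdio.h>
#include <string.h>

/* Appends `text` to `out`; returns -1 if it would not fit. */
static int append_text(char *out, size_t cap, size_t *used, const char *text) {
    size_t len = strlen(text);
    if (len >= cap - *used) {
        return -1; /* no room left for the text plus terminator */
    }
    memcpy(out + *used, text, len + 1);
    *used += len;
    return 0;
}

/* Hypothetical model of an #embed expansion, written out as text.
   `prefix` and `suffix` may be NULL when the parameter is absent.
   Returns the length written, or -1 if `out` is too small. */
int sketch_embed_expansion(char *out, size_t cap,
                           const unsigned char *data, size_t count,
                           const char *prefix, const char *suffix) {
    size_t used = 0;
    if (cap == 0) {
        return -1;
    }
    out[0] = '\0';
    if (count == 0) {
        return 0; /* empty resource: prefix and suffix are both ignored */
    }
    if (prefix != NULL && append_text(out, cap, &used, prefix) != 0) {
        return -1;
    }
    for (size_t i = 0; i < count; ++i) {
        char number[16];
        /* elements are comma-separated, with no leading or trailing comma */
        snprintf(number, sizeof number, "%s%u", i == 0 ? "" : ",",
                 (unsigned)data[i]);
        if (append_text(out, cap, &used, number) != 0) {
            return -1;
        }
    }
    if (suffix != NULL && append_text(out, cap, &used, suffix) != 0) {
        return -1;
    }
    return (int)used;
}
```

With the two bytes of "Hi", a prefix of `0xEF,` and a suffix of `,`, this model writes `0xEF,72,105,`; with an empty resource it writes nothing at all, which is why the examples that use suffix(,) place the trailing 0 outside the directive.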
7.6. OPTIONAL Modify §6.10.1 Conditional inclusion for __has_embed expressions to return 2 alongside the above changes in paragraph 6
…The resource (6.10.✨) identified by the header-name preprocessing token sequence in each contained has_embed expression is searched for as if those preprocessing tokens were the pp-tokens in a #embed directive, except that no further macro expansion is performed. Such a directive shall satisfy the syntactic requirements of a #embed directive. The has_embed expression evaluates to:
— 0 if the search fails or if any of the embed parameters in the embed parameter list specified are not supported by the implementation for the #embed directive; or,
— 1 if the search for the resource succeeds and all embed parameters in the embed parameter list specified are supported by the implementation for the #embed directive and the resource is not empty; or,
— 2 if the search for the resource succeeds and all embed parameters in the embed parameter list specified are supported by the implementation for the #embed directive and the resource is empty.
…
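The three outcomes can be summarized as a small decision function. This is a sketch of the rule only, not of any implementation: the name sketch_has_embed and its parameters are hypothetical, and the inputs stand in for facts that only the preprocessor knows.

```c
/* Value of a has_embed expression under the optional wording above.
   Hypothetical helper: `found` and `all_params_supported` are treated as
   booleans, `resource_width_bits` is the resource width after any limit. */
int sketch_has_embed(int found, int all_params_supported,
                     unsigned long long resource_width_bits) {
    if (!found || !all_params_supported) {
        return 0; /* search failed, or some embed parameter is unsupported */
    }
    if (resource_width_bits == 0) {
        return 2; /* found and supported, but the resource is empty */
    }
    return 1;     /* found, supported, and not empty */
}
```

Both 1 and 2 are nonzero, so a plain `#if __has_embed(...)` test still succeeds for an empty resource; only code that compares against 2 explicitly distinguishes the empty case.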
7.7. OPTIONAL Add 1 new sub clause as §6.10.✨.4, under §6.10.✨ Binary resource inclusion and add an additional modification to the above changes' paragraph 14
This portion of the proposal must be approved with a separate vote. This does not happen if the previous vote to accept does not exist.
An embed parameter with an attribute token that is one of the following is a standard embed parameter:
limit
prefix
suffix
is_empty
6.10.✨.4 is_empty parameter
Constraints
It may appear zero, one, or multiple times in the embed parameter list. The most recent in lexical order applies and the others shall be ignored. Its attribute argument clause shall be present and have the form:
( balanced-token-sequenceopt )
and shall be an integer constant expression.
Semantics
The embed parameter with an attribute token is_empty denotes a balanced token sequence within its attribute argument clause that will replace the #embed directive entirely.
If the resource is not empty, then is_empty has no effect and is ignored.
8. Acknowledgements
Thank you to Alex Gilding for bolstering this proposal with additional ideas and motivation. Thank you to Aaron Ballman, David Keaton, and Rajan Bhakta for early feedback on this proposal. Thank you to the
for bouncing lots of ideas off the idea in their Discord. Thank you to Hubert Tong for refining the proposal’s implementation-defined extension points.
Thank you to the Lounge<C++> for their continued support, and to rmf for the valuable early implementation feedback.
9. Appendix
9.1. Existing Tools
This section categorizes some of the platform-specific techniques used to work with C++ and some of the challenges they face. Other techniques used include pre-processing data, link-time based tooling, and assembly-time runtime loading. They are detailed below, for a complete picture of today’s landscape of options. They include both C and C++ options.
9.1.1. Pre-Processing Tools
-
Run the tool over the data (xxd -i xxd_data.bin > xxd_data.h) to obtain the generated file (xxd_data.h) and add a null terminator if necessary:
unsigned char xxd_data_bin[] = {
    0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x2c, 0x20, 0x57,
    0x6f, 0x72, 0x6c, 0x64, 0x0a, 0x00
};
unsigned int xxd_data_bin_len = 13;
-
Compile main.c:
#include <stdlib.h>
#include <stdio.h>

// prefix as const,
// even if it generates some warnings in g++/clang++
const
#include "xxd_data.h"

int main() {
    // a C-style cast keeps this file valid as both C and C++
    const char* data = (const char*)xxd_data_bin;
    puts(data); // Hello, World!
    return 0;
}
Others still use python or other small scripting languages as part of their build process, outputting data in the exact C++ format that they require.
There are problems with the xxd -i or similar tool-based approach. Tokenizing and parsing data-as-source-code adds an enormous overhead to actually reading and making that data available.
Binary data as C(++) arrays carries the overhead of having to comma-delimit every single byte present. It also requires that the compiler verify every entry in that array is a valid literal or entry according to the C++ language.
This scales poorly with larger files, and build times suffer for any non-trivial binary file, especially when it scales into megabytes in size (e.g., firmware and similar).
9.1.2. python
Other companies are forced to create their own ad-hoc tools to embed data and files into their C++ code. MongoDB uses a custom python script, just to format their data for compiler consumption:
import os
import sys

def jsToHeader(target, source):
    outFile = target
    h = [
        '#include "mongo/base/string_data.h"',
        '#include "mongo/scripting/engine.h"',
        'namespace mongo {',
        'namespace JSFiles{',
    ]

    def lineToChars(s):
        return ','.join(str(ord(c)) for c in (s.rstrip() + '\n')) + ','

    for s in source:
        filename = str(s)
        objname = os.path.split(filename)[1].split('.')[0]
        stringname = '_jscode_raw_' + objname

        h.append('constexpr char ' + stringname + "[] = {")

        with open(filename, 'r') as f:
            for line in f:
                h.append(lineToChars(line))

        h.append("0};")
        # symbols aren't exported w/o this
        h.append('extern const JSFile %s;' % objname)
        h.append('const JSFile %s = { "%s", StringData(%s, sizeof(%s) - 1) };' %
                 (objname, filename.replace('\\', '/'), stringname, stringname))

    h.append("} // namespace JSFiles")
    h.append("} // namespace mongo")
    h.append("")

    text = '\n'.join(h)

    with open(outFile, 'wb') as out:
        try:
            out.write(text)
        finally:
            out.close()

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Must specify [target] [source] "
        sys.exit(1)
    jsToHeader(sys.argv[1], sys.argv[2:])
MongoDB were brave enough to share their code with me and make public the things they have to do: other companies have shared many similar concerns, but do not have the same bravery. We thank MongoDB for sharing.
9.1.3. ld
A complete example (does not compile on Visual C++):
-
Have a file ld_data.bin with the contents Hello, World!.
-
Run ld -r -b binary -o ld_data.o ld_data.bin.
-
Compile the following main.cpp with gcc -std=c++17 ld_data.o main.cpp:
#include <stdlib.h>
#include <stdio.h>

#define STRINGIZE_(x) #x
#define STRINGIZE(x) STRINGIZE_(x)

#ifdef __APPLE__
#include <mach-o/getsect.h>

#define DECLARE_LD_(LNAME) extern const unsigned char _section$__DATA__##LNAME[];
#define LD_NAME_(LNAME) _section$__DATA__##LNAME
#define LD_SIZE_(LNAME) (getsectbyname("__DATA", "__" STRINGIZE(LNAME))->size)
#define DECLARE_LD(LNAME) DECLARE_LD_(LNAME)
#define LD_NAME(LNAME) LD_NAME_(LNAME)
#define LD_SIZE(LNAME) LD_SIZE_(LNAME)

#elif (defined __MINGW32__) /* mingw */

#define DECLARE_LD_(LNAME) \
    extern const unsigned char binary_##LNAME##_start[]; \
    extern const unsigned char binary_##LNAME##_end[];
#define LD_NAME_(LNAME) binary_##LNAME##_start
#define LD_SIZE_(LNAME) ((binary_##LNAME##_end) - (binary_##LNAME##_start))
#define DECLARE_LD(LNAME) DECLARE_LD_(LNAME)
#define LD_NAME(LNAME) LD_NAME_(LNAME)
#define LD_SIZE(LNAME) LD_SIZE_(LNAME)

#else /* gnu/linux ld */

#define DECLARE_LD_(LNAME) \
    extern const unsigned char _binary_##LNAME##_start[]; \
    extern const unsigned char _binary_##LNAME##_end[];
#define LD_NAME_(LNAME) _binary_##LNAME##_start
#define LD_SIZE_(LNAME) ((_binary_##LNAME##_end) - (_binary_##LNAME##_start))
#define DECLARE_LD(LNAME) DECLARE_LD_(LNAME)
#define LD_NAME(LNAME) LD_NAME_(LNAME)
#define LD_SIZE(LNAME) LD_SIZE_(LNAME)
#endif

DECLARE_LD(ld_data_bin);

int main() {
    const char* p_data = reinterpret_cast<const char*>(LD_NAME(ld_data_bin));
    // impossible, not null-terminated
    //puts(p_data);
    // must copy instead
    return 0;
}
This scales a little bit better in terms of raw compilation time but is shockingly OS, vendor and platform specific in ways that novice developers would not be able to handle fully. The macros are required to erase differences, lest subtle differences in naming destroy one’s ability to use these macros effectively. We omitted the code for handling VC++ resource files because it is even more verbose than what is present here.
N.B.: Because these declarations are extern, the values in the array cannot be accessed at compilation/translation-time.
9.1.4. incbin
There is a tool called incbin which is a 3rd party attempt at pulling files in at "assembly time". Its approach is incredibly similar to ld, with the caveat that files must be shipped with their binary. It unfortunately falls prey to the same problems of cross-platform woes when dealing with Visual C, requiring additional pre-processing to work out in full.
9.1.5. Type Flexibility
Note: As per the vote in the September C++ Evolution Working Group Meeting, Type Flexibility is not being pursued in the preprocessor for various implementation and support splitting concerns.
A type can be specified after the #embed to view the data in a very specific manner. This allows data to be initialized as exactly that type.
Type flexibility was not pursued for various implementation concerns. Chief among them was single-purpose preprocessors that did not have access to frontend information. This meant it was very hard to make a system that was both preprocessor conformant but did not require e.g.
information at the point of preprocessor invocation. Therefore, the type flexibility feature was pulled from
and will be conglomerated in other additions such as
or
.
/* specify a type-name to change array type */
const int shorten_flac[] = {
#embed int "stripped_music.flac"
};
The contents of the resource are mapped in an implementation-defined manner to the data, such that it will use
bits for each element. If the file does not have enough bits to fill out a multiple of
bits, then a diagnostic is required. Furthermore, we require that the type passed to
must be one of the following fundamental types, signed or unsigned, spelled exactly in this manner:
-
char, unsigned char, signed char
-
short, unsigned short, signed short
-
int, unsigned int, signed int
-
long, unsigned long, signed long
-
long long, unsigned long long, signed long long
More types can be supported by the implementation if the implementation so chooses (both the GCC and Clang prototypes described below support more than this). The reason exactly these types are required is because these are the only types for which there is a suitable way to obtain their size at pre-processor time. Quoting from §5.2.4.2.1, paragraph 1:
The values given below shall be replaced by constant expressions suitable for use in #if preprocessing directives.
This means that the types above have a specific size that can be properly initialized by a preprocessor entirely independent of a proper C frontend, without needing to know more than how to be a preprocessor. Originally, the proposal required that every use of
is accompanied by a
(or, in the case of C++,
). Instead, the proposal now lets the implementation "figure it out" on an implementation-by-implementation basis.
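As an illustration of why those types are usable from a standalone preprocessor, the width of a type can be derived in #if directives from the limits.h constants alone. The sketch below is an assumption-laden example rather than anything the proposal specifies: the macro SKETCH_UINT_BITS and the function sketch_uint_bits are hypothetical names, only three common widths are checked, and unsigned int is assumed to have no padding bits.

```c
#include <limits.h>

/* Derive the width of unsigned int using only constants that are
   valid in #if directives (hypothetical macro, for illustration). */
#if UINT_MAX == 0xFFFFu
#define SKETCH_UINT_BITS 16
#elif UINT_MAX == 0xFFFFFFFFu
#define SKETCH_UINT_BITS 32
#elif UINT_MAX == 0xFFFFFFFFFFFFFFFFu
#define SKETCH_UINT_BITS 64
#else
#define SKETCH_UINT_BITS 0 /* width not covered by this sketch */
#endif

/* Exposes the preprocessor-computed width to ordinary code. */
unsigned sketch_uint_bits(void) {
    return SKETCH_UINT_BITS;
}
```

No sizeof and no type information from the compiler frontend is needed here, which is the property the quoted paragraph of the standard provides for the fundamental integer types.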