Document Number: N2658
Submitter: Aaron Peter Bachmann
Submission Date: 2021-12-30
Make pointer type casting useful without negatively impacting performance - updates N2484
Pointer-type-punning is useful, but it is mostly undefined behavior according to the standards ISO/IEC 9899:1999 ... ISO/IEC 9899:2018; before that, aspects of it were implementation-defined. Therefore, to achieve our intended goals as programmers we have to resort to additional, non-portable extensions that compilers normally provide, use another programming language (often assembler), or use work-arounds. Strict aliasing rules are also useful: they allow for more efficient code without extra work for the programmer. We can allow type-punning in many cases without negatively impacting performance.
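As a minimal sketch of why strict aliasing helps performance (an illustrative function, not taken from the proposal text), consider:

/* Under the current aliasing rules a compiler may assume the store
   through cnt never modifies *val, because float and unsigned are
   incompatible types. It may therefore load *val once, outside the
   loop. Without strict aliasing, *val must be re-loaded on every
   iteration. */
float sum_n(float *val, unsigned *cnt, unsigned n){
    float s = 0.0f;
    for (unsigned i = 0; i < n; i++){
        s += *val;  /* hoistable under strict aliasing */
        *cnt += 1;  /* cannot alias *val under the current rules */
    }
    return s;
}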
Several considerations motivate the proposal:
- memset(), memcpy(), strlen(), strcpy(), ... malloc() and friends shall be implementable in plain C.
- memcpy() is the canonical, righteous way to deal with the representation of objects, but it has its problems. Wherever memcpy() is introduced, programs become harder to read, the number of source lines increases, and the programs become less efficient in terms of runtime, code-size, stack-consumption, and latency (see the sketch after this list).
- memcpy() cannot reasonably be used in an implementation of memcpy() itself.
- For good performance, memcpy() must not reside in the same flash as the calling code - yet often there is only one. If code and data are both in RAM they may compete for the same bus, introducing additional inefficiency.
- Part of the problem is the generality of memcpy(), which has to be able to copy objects of arbitrary type.
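The contrast between the two idioms can be sketched as follows (illustrative helper functions, assuming sizeof(float)==sizeof(unsigned)). bits_memcpy() is well defined today but relies on the optimizer to remove the call; bits_cast() is the direct formulation this proposal makes well defined:

#include <string.h>
unsigned bits_memcpy(const float *f){
    unsigned u;
    memcpy(&u, f, sizeof u); /* well defined, but we must pray it is inlined */
    return u;
}
unsigned bits_cast(const float *f){
    return *(const unsigned *)f; /* undefined behavior today */
}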
Work-arounds available today include:
- Where the absence of aliasing matters for optimization, the programmer can already express it explicitly with restrict.
- gcc and clang accept -fno-strict-aliasing. Some C-compilers assume no strict aliasing either by default or as the only option: pcc, tcc, the original C-compiler by DMR, Microsoft C-compilers, ...
- gcc and compatible compilers provide __attribute__((__may_alias__)) (see the sketch after this list).
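A sketch of the gcc-style work-around (non-portable; illustrative function name, assuming sizeof(float)==sizeof(unsigned)):

typedef unsigned __attribute__((__may_alias__)) ua32;
/* Flips the sign bit of a float via its representation. Accesses
   through a may_alias-qualified type are exempt from the strict
   aliasing rules, but only on compilers supporting the attribute. */
float negate_bits(float x){
    ua32 *p = (ua32 *)&x;
    *p ^= 1u << 31;
    return x;
}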
The proposal: make object-pointer-casting (casting a pointer to one type to a pointer to a different type) valid and well defined. Accesses via such a pointer shall be valid as well, provided we honor the other restrictions (const, alignment, ...).
#include <limits.h>

void f1(float **f, unsigned **u){
    _Static_assert(sizeof(float)==sizeof(unsigned),"precondition violated");
    _Static_assert(4==sizeof(unsigned),"precondition violated");
    _Static_assert(8==CHAR_BIT,"precondition violated");
    unsigned u32;
#ifdef DO_SOMETHING_INVALID
    *f=*u;             // invalid
    *u=*f;             // invalid
    *f=*(float**)u;    // invalid
    *u=*(unsigned**)f; // invalid
#else
    (void)u;
#endif
    float *fp=*f;
    unsigned *up=*u;
    u32=*(unsigned*)fp; // valid according to the proposal ... reading
    u32^=1u<<31;        // flip the sign bit of the representation
    // Under the restrictions given above, a compiler not seeing the
    // implementation of the function but only its prototype must
    // already assume: **f may be changed (as float); in this case,
    // if *f and *u point to the same storage, **f=-**f.
    // Note: *(float*)up=u32 would store the converted value (float)u32,
    // so the bits are reinterpreted as float first.
    *(float*)up=*(float*)&u32; // valid according to the proposal ... writing
}
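A hypothetical caller (illustrative, not part of the proposal) makes the aliasing case in the comment concrete:

void caller(void){
    float x = 1.0f;
    float *fp = &x;
    unsigned *up = (unsigned *)&x; // the cast the proposal makes useful
    f1(&fp, &up);                  // reads and rewrites x via both types
    // afterwards x == -1.0f
}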
#include <string.h>
#include <limits.h>
#include <stdint.h>
#define ALIGN (sizeof(size_t))
#define ONES ((size_t)-1/UCHAR_MAX) // 0x01010101 for 32 bit integer
#define HIGHS (ONES * (UCHAR_MAX/2+1)) // 0x80808080 for 32 bit integer
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS) // nonzero iff a byte of x is 0: only 0 borrows while its own high bit is 0
// harmless in practice but undefined, because we eventually read beyond the end of the object!
//#define HELP_THE_COMPILER
#if defined(HELP_THE_COMPILER) && defined(__GNUC__)
#define AA(p,a) __builtin_assume_aligned(p,a) // non-portable hint
#else
#define AA(p,a) (p)
#endif
size_t strlen_1(const char *s){
union U {
size_t v;
char c[sizeof(size_t)];
} u;
const char *a = s;
for (; (uintptr_t)s % ALIGN; s++){
if (!*s)
return s-a;
}
// pray the compiler removes the memcpy() call and uses one size_t-wide access
for (;memcpy(&u.c,AA(s,ALIGN),sizeof(size_t)), !HASZERO(u.v); s+=sizeof(size_t))
;
for (; *s; s++)
;
return s-a;
}
size_t strlen_2(const char *s){
const char *a = s;
for (; (uintptr_t)s % ALIGN; s++){
if (!*s)
return s-a;
}
for (; !HASZERO(*(size_t*)s); s+=sizeof(size_t)) // code matching this proposal
;
for (; *s; s++)
;
return s-a;
}
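For illustration, a small test harness - assumed to be appended to the same translation unit - compares both variants against the library strlen():

#include <assert.h>
#include <stdio.h>
int main(void){
    static const char *const tests[] =
        {"", "a", "abc", "exactly8", "a fairly long string crossing several words"};
    for (size_t i = 0; i < sizeof tests/sizeof tests[0]; i++){
        assert(strlen_1(tests[i]) == strlen(tests[i]));
        assert(strlen_2(tests[i]) == strlen(tests[i])); // UB today, valid under the proposal
    }
    puts("ok");
}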
All compilers tested fail to recognize that the 2nd loop in strlen_1() is aligned unless the programmer provides a hint in a non-portable way. The generated code is poor for many CPUs without support for misaligned access. We show examples of the code produced for a few targets with a few compilers.
For strlen_2() it is immediately clear - without further analysis - that the pointer must be properly aligned, because otherwise the code would be invalid even if this proposal gets accepted.
The proposal covers the immediate use of a cast only; retaining differently typed pointers to the same object remains invalid:

static float *Fp=(float*)&Something;
static unsigned *Uu=(unsigned*)&Something; // invalid

Several consequences have to be considered, for example:
- For portable programs, [u]int32_t must be assumed to alias almost any other integer type, except for long long: it could be int or long, and since short can have 32 bits, it could even be short. The same applies to the other [u]intx_t-types. (See the first sketch after this list.)
- malloc() shall be implementable, too. For that we have to allow aliasing in other situations as well. This may also be possible without significantly restricting the usefulness of strict aliasing; this bullet point only sketches a potential loophole built on char*: if we additionally allow any object of type char to be accessed via any_type*, we can implement malloc() and friends. This would still not allow any_type1* to alias any_type2*, provided any_type1* and any_type2* are mutually different and different from char*. Within the lifetime of the object pointed to by a char* we would allow that char* to alias any_type1*, any_type2*, ..., but the uses of the aliases any_type1*, any_type2*, ... must be sequenced. (See the allocator sketch after this list.)
- unrestrict could be introduced in addition or as an alternative.
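The first sketch (illustrative function name) shows why portable code cannot know which types uint32_t aliases:

#include <stdint.h>
/* If uint32_t is a typedef for unsigned int, u and v may alias and the
   function must return 2. If uint32_t is a typedef for unsigned long,
   the compiler may assume the pointers do not alias and return 1.
   Portable code can rely on neither behavior. */
uint32_t which(unsigned *u, uint32_t *v){
    *u = 1;
    *v = 2;
    return *u;
}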
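The second sketch, a minimal bump allocator handing out storage from a static char array (hypothetical helper name), illustrates the loophole: under the current rules the caller must not use the returned storage as any_type*, because the declared type of the storage is char.

#include <stddef.h>
static _Alignas(max_align_t) char pool[1024]; // backing storage of type char
static size_t used;
/* Hands out maximally aligned chunks from pool. A plain-C malloc()
   needs exactly the relaxation sketched above: the caller accesses
   this char storage through pointers to arbitrary types. */
void *pool_alloc(size_t n){
    size_t rounded = (n + _Alignof(max_align_t) - 1)
                   & ~(_Alignof(max_align_t) - 1);
    if (rounded < n || sizeof pool - used < rounded)
        return NULL; // overflow or out of memory
    void *p = pool + used;
    used += rounded;
    return p;
}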
The wording changes are relative to N2596 [3].
An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:89)
In the listings below, comments mark the inner loops and point out the more interesting code parts.
strlen_1() ARM gcc 11.2 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft
strlen_1:
movs r3, r0
movs r2, #3
push {r4, lr}
.L2:
tst r3, r2
bne .L5
ldr r4, .L11
.L6:
// memcpy() is inlined, but we have only byte-loads
ldrb r2, [r3, #1]
ldrb r1, [r3]
lsls r2, r2, #8
orrs r2, r1
ldrb r1, [r3, #2]
lsls r1, r1, #16
orrs r1, r2
ldrb r2, [r3, #3]
lsls r2, r2, #24
orrs r2, r1
ldr r1, .L11+4
adds r1, r2, r1
bics r1, r2
tst r1, r4
beq .L7
.L8:
ldrb r2, [r3]
cmp r2, #0
beq .L10
adds r3, r3, #1
b .L8
.L5:
ldrb r1, [r3]
cmp r1, #0
bne .L3
.L10:
subs r0, r3, r0
pop {r4, pc}
.L3:
adds r3, r3, #1
b .L2
.L7:
adds r3, r3, #4
b .L6
.L11:
.word -2139062144
.word -16843009
strlen_1() armv7-a clang 11.0.1 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft
strlen_1:
push {r4, r5, r6, lr}
lsls r1, r0, #30
mov r1, r0
beq .LBB0_6
ldrb r1, [r0]
cmp r1, #0
beq .LBB0_10
adds r1, r0, #1
.LBB0_3:
@ =>This Inner Loop Header: Depth=1
lsls r2, r1, #30
beq .LBB0_6
adds r2, r1, #1
ldrb r1, [r1]
cmp r1, #0
mov r1, r2
bne .LBB0_3
subs r1, r2, #4
adds r1, r1, #3
b .LBB0_11
.LBB0_6:
subs r1, r1, #4
ldr r2, .LCPI0_0
ldr r3, .LCPI0_1
.LBB0_7:
// memcpy() is inlined, but we have only byte-loads
@ =>This Inner Loop Header: Depth=1
ldrb r4, [r1, #4]
ldrb r5, [r1, #5]
lsls r5, r5, #8
adds r4, r5, r4
ldrb r5, [r1, #6]
ldrb r6, [r1, #7]
lsls r6, r6, #8
adds r5, r6, r5
lsls r5, r5, #16
adds r4, r5, r4
adds r5, r4, r3
mov r6, r2
bics r6, r4
adds r1, r1, #4
tst r6, r5
beq .LBB0_7
lsls r2, r4, #24
beq .LBB0_11
.LBB0_9:
@ =>This Inner Loop Header: Depth=1
ldrb r2, [r1, #1]
adds r1, r1, #1
cmp r2, #0
bne .LBB0_9
b .LBB0_11
.LBB0_10:
mov r1, r0
.LBB0_11:
subs r0, r1, r0
pop {r4, r5, r6, pc}
.LCPI0_0:
.long 2155905152 @ 0x80808080
.LCPI0_1:
.long 4278124287 @ 0xfefefeff
strlen_1() MIPS64 gcc 5.4 -Os
strlen_1:
move $2,$4
.L2:
andi $3,$2,0x7
beq $3,$0,.L13
li $5,-8454144 # 0xffffffffff7f0000
lb $3,0($2)
beq $3,$0,.L11
nop
b .L2
daddiu $2,$2,1
.L13:
li $3,-16711680 # 0xffffffffff010000
daddiu $5,$5,32639
daddiu $3,$3,257
dsll $5,$5,16
dsll $3,$3,16
daddiu $5,$5,32639
daddiu $3,$3,257
dsll $5,$5,17
dsll $3,$3,23
ori $5,$5,0xfeff
ori $3,$3,0x8080
.L6:
// memcpy() is inlined
// 2 instructions for unaligned access of 8 bytes ... nice
ldl $6,0($2)
ldr $6,7($2)
daddu $7,$6,$5
nor $6,$0,$6
and $6,$7,$6
and $6,$6,$3
bnel $6,$0,.L14
lb $3,0($2)
b .L6
daddiu $2,$2,8
// off-topic: instruction in delay-slot
.L8:
lb $3,0($2)
.L14:
beq $3,$0,.L11
nop
b .L8
daddiu $2,$2,1
.L11:
j $31
dsubu $2,$2,$4
strlen_1() ARM gcc 11.2 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft -DHELP_THE_COMPILER
strlen_1:
movs r3, r0
movs r2, #3
push {r4, lr}
.L2:
tst r3, r2
bne .L5
ldr r4, .L11
.L6:
// with help we have word loads even with memcpy()
ldr r1, [r3] // off-topic: better use ldm r3!,{r1}
ldr r2, .L11+4 // off-topic: bad: loading a constant PC-relative in a loop!
adds r2, r1, r2
bics r2, r1
tst r2, r4
beq .L7
// off-topic: 2nd branch in the loop; with the change above, jump to .L6
.L8:
ldrb r2, [r3]
cmp r2, #0
beq .L10
adds r3, r3, #1
b .L8
.L5:
ldrb r1, [r3]
cmp r1, #0
bne .L3
.L10:
subs r0, r3, r0
pop {r4, pc}
.L3:
adds r3, r3, #1
b .L2
.L7:
adds r3, r3, #4
// off-topic: rather use ldm r3!,{r1} instead of ldr r1, [r3]
b .L6
.L11:
.word -2139062144
.word -16843009
strlen_2() ARM gcc 11.2 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft
strlen_2:
movs r3, r0
movs r2, #3
push {r4, lr}
.L2:
tst r3, r2
bne .L5
ldr r4, .L11
.L6:
ldr r1, [r3]
// the proposal improved code-quality
ldr r2, .L11+4
// off-topic: unrelated problems already outlined above remain
adds r2, r1, r2
bics r2, r1
tst r2, r4
beq .L7
// off-topic: unrelated problems already outlined above remain
.L8:
ldrb r2, [r3]
cmp r2, #0
beq .L10
adds r3, r3, #1
b .L8
.L5:
ldrb r1, [r3]
cmp r1, #0
bne .L3
.L10:
subs r0, r3, r0
pop {r4, pc}
.L3:
adds r3, r3, #1
b .L2
.L7:
adds r3, r3, #4
// off-topic: unrelated problems already outlined above remain
b .L6
.L11:
.word -2139062144
.word -16843009
strlen_2() armv7-a clang 11.0.1 -Os -mthumb -mcpu=cortex-m0 -mfloat-abi=soft
strlen_2:
push {r4, r5, r6, lr}
lsls r1, r0, #30
mov r1, r0
beq .LBB0_6
ldrb r1, [r0]
cmp r1, #0
beq .LBB0_10
adds r1, r0, #1
.LBB0_3:
@ =>This Inner Loop Header: Depth=1
lsls r2, r1, #30
beq .LBB0_6
adds r2, r1, #1
ldrb r1, [r1]
cmp r1, #0
mov r1, r2
bne .LBB0_3
subs r1, r2, #4
adds r1, r1, #3
b .LBB0_11
.LBB0_6:
subs r1, r1, #4
ldr r2, .LCPI0_0
ldr r3, .LCPI0_1
.LBB0_7:
@ =>This Inner Loop Header: Depth=1
ldr r4, [r1, #4] // the proposal improved code-quality
adds r5, r4, r3
mov r6, r2
bics r6, r4
adds r1, r1, #4
// off-topic: with ldm this does not need to be in the loop
tst r6, r5
beq .LBB0_7
lsls r2, r4, #24
beq .LBB0_11
.LBB0_9:
@ =>This Inner Loop Header: Depth=1
ldrb r2, [r1, #1]
adds r1, r1, #1
cmp r2, #0
bne .LBB0_9
b .LBB0_11
.LBB0_10:
mov r1, r0
.LBB0_11:
subs r0, r1, r0
pop {r4, r5, r6, pc}
.LCPI0_0:
.long 2155905152 @ 0x80808080
.LCPI0_1:
.long 4278124287 @ 0xfefefeff
Brief discussion of the results
For ARM Cortex-M0 a human would use 6 instructions in the inner loop (9 cycles). With the proposal in place, compilers use 7...8 instructions and 10...14 cycles in the inner loop (assuming no additional penalty due to wait-states for the PC-relative load). Without the proposal, compilers use 16...17 instructions (22...25 cycles) in the inner loop.
Using uint64_t instead of size_t could further increase the discrepancy.
While this is certainly something a compiler could do for the relatively simple examples given, none of the compilers used in the experiments does it. Experiments with other compiler options and other compilers (both older ones and ones from other vendors), carried out in summer 2021, gave similar results.
The same applies to many other CPUs without hardware support for misaligned memory accesses.
The asm-listings given above were produced using godbolt.org [4].
I want to thank Martin Uecker for coming up with the idea of the "immediate cast", for providing most of the normative text, and for giving additional valuable advice. Without him this proposal would not exist in its current form.
[1] N2484, 2020-02-17, Bachmann, Make pointer type casting useful without negatively impacting performance
[2] N2279, 2018-07-11, Yodaiken, Proposal to make aliasing consistent
[3] N2596, 2020-12-12, Meneide, C2x Working Draft
[4] Compiler Explorer, https://godbolt.org/