JTC1/SC22/WG14
N938
WG14 Document: N938
BDTI Proposal for Fixed-Point Arithmetic in Embedded C
John R. Hauser
Berkeley Design Technology, Inc.
2001 March 16
BDTI has been writing software for and evaluating the performance of DSPs
(digital signal processors) for almost a decade, and we are naturally
interested in WG14's effort to standardize C extensions for fixed-point
arithmetic as part of its "Embedded C" work item. To promote discussion on
this topic and help foster the best possible standard, we would like to put
forth a proposal for the fixed-point types and arithmetic that differs in
some respects from the working draft WDTR18037. This document consists of
some preliminary comments followed by an outline of the specification we
would prefer.
----------------------------------------------------------------------------
Preliminary Commentary
- As was discussed in Toronto, BDTI would like the syntax to be such that
the embedded C features can all be implemented directly in C++ without
extra compiler support. Of course, for actual application development,
a C or C++ compiler would need to treat the new features specially or
they won't be efficient. But making the syntax compatible with existing
C++ has two positive effects: (1) It makes the C++ people happier by
allowing them to be compatible without having to extend C++ and change
all of their compilers. (2) Equally important, it gives everyone with
access to a C++ compiler a convenient prototyping platform, even if their
C compilers don't support the embedded C features.
- We approve of the basic plan to have separate "type-A" and "type-B"
fixed-point types, with three nominal sizes of each ("short", normal, and
"long"). As defined by Willem Wakker, type-A fixed-point formats have
no integer bits and support a range of -1 to 1, while type-B formats have
both integer and fractional bits.
- It is BDTI's view that unsigned fixed-point is not worth pursuing at this
time. While unsigned fixed-point types can sometimes be useful, they
really only buy one extra bit of precision over the signed types. To
take the type-A case (and echoing Willem), the corresponding unsigned
fixed-point types would either have the same number of fraction bits and
one integer bit to cover the range 0 to 2, or they would have an extra
fraction bit and cover the range 0 to 1. The first style would be more
consistent with the correspondence between signed and unsigned integers;
however, the range of 0 to 2 has no obvious justification other than to
maintain alignment with the signed fixed-point format.
Some DSPs don't fully support an unsigned fractional type in hardware.
For those that do, the main benefit is to permit multi-word operations to
be built from processor primitives more conveniently. For such purposes,
the location of the binary point in individual operations becomes rather
murky. The C language already has unsigned integers that can be used for
constructing multi-word operations about as effectively.
Because one bit isn't much, it's our experience that true unsigned
fixed-point formats rarely come up. Note that the choice not to require
unsigned fixed-point types has precedent in the fact that there are no
unsigned _floating-point_ types in C, either. We advocate postponing the
standardization of unsigned fixed-point (and thus coercing all Embedded C
implementations to provide it) until there is a clearer need for it.
- As was briefly touched on at the Toronto meeting, a saturation attribute
is most conveniently attached to fixed-point types, even though
saturation would ideally be an attribute of operations, not types.
There's no convenient syntax we know of for attaching a "sat" keyword to
operators like "+" and "=".
- As with unsigned fixed-point, we don't see enough need for saturating
_integer_ arithmetic to justify standardizing it at this time. We were
unable to think of a single instance where support for this feature would
significantly benefit one of our applications.
- Besides saturation, some DSP algorithms require modular wrap-around on
overflow (like the wrap-around that C requires for unsigned integers).
We propose that there be an attribute for this that can be attached to
types in the same manner as the saturation attribute.
============================================================================
The following outlines our proposal for fixed-point extensions for C. BDTI
has developed a prototype implementation of this system in C++ and has been
building experience using it to express DSP calculations.
Additional comments in the text below are enclosed in boxes.
----------------------------------------------------------------------------
Fixed-Point Types
Eighteen new fixed-point type names are defined:
short_fract short_fract_sat short_fract_mod
fract fract_sat fract_mod
long_fract long_fract_sat long_fract_mod
short_accum short_accum_sat short_accum_mod
accum accum_sat accum_mod
long_accum long_accum_sat long_accum_mod
All of these fixed-point types are signed; no unsigned fixed-point types are
defined. The "_sat" and "_mod" types are identical to their respective base
types except for the treatment of overflows as explained later.
The "fract" types by definition have no integer bits. The binary point in
the "fract" types is always just to the left of the most significant non-
sign bit, so that "fract" types support values in the range of -1 to 1.
(The endpoints -1 and 1 are not necessarily representable exactly by a
"fract" type.) The minimum number of bits for each "fract" type (including
the sign bit) is as follows:
short_fract: 8 bits
fract: 16 bits
long_fract: 32 bits
The "accum" types have integer bits as well as fractional bits. Each
"accum" type (short, normal, and long) must have exactly the same number of
fractional bits as its corresponding "fract" type, plus a minimum of four
integer bits. (It is not required that the three "accum" types all have the
same number of integer bits.)
+---------------------------------------------------------------------------
| Willem and others have proposed names such as "short accum", together
| with a separate "sat" keyword. Although this would be more consistent
| with existing C, names requiring more than one token have always been
| problematic because they can't be defined either through "typedef"s
| or macros in C or with classes in C++. Without the ability to declare
| the proper type names, a working implementation can't be created in C++.
| This is important because we expect most C compilers _not_ to support
| the embedded extensions, so a C++ implementation will be the only one
| available to many people.
+---------------------------------------------------------------------------
----------------------------------------------------------------------------
Overflow and Rounding
Conversion of a real arithmetic value to a fixed-point type may overflow
and/or may require rounding. When the source value does not fit within
the range of the fixed-point type, the conversion overflows. Two different
behaviors are defined for overflow:
- Saturation: If the fixed-point type is a "_sat" type, the source value
is replaced by the closest available fixed-point value (which will be
either the maximal negative or maximal positive value of the fixed-point
type).
- Modular wrap-around: If the fixed-point type is a "_mod" type, the
source value is replaced by a value in the range of -(2^I) to 2^I that is
congruent (in the mathematical sense) to the source value modulo 2^(I+1),
where I is the number of integer bits in the fixed-point format. For
example, for "fract" types, the source value is replaced by a value in
the range -1 to 1 that is congruent to the source value modulo 2. (This
has the effect of discarding all the bits above the most significant bit
of the fixed-point format. This is exactly the same sort of overflow
behavior required of unsigned integers.)
Otherwise, if the fixed-point type is not a "_sat" or "_mod" type, overflow
is undefined (and may legitimately terminate the program, for example).
If, after overflow handling, the source value cannot be represented exactly
by the fixed-point type, the source value is rounded to either the closest
fixed-point value greater than the source value (rounded up) or to the
closest fixed-point value less than the source value (rounded down).
Whether rounding is up or down is implementation-defined and may differ for
different values and different situations.
----------------------------------------------------------------------------
Type Conversions
All conversions between a fixed-point type and another arithmetic type
(which can be another fixed-point type) are defined. Overflow and
rounding are handled according to the usual rules for the destination type.
Conversions from a fixed-point to an integer type round toward zero, just as
conversions from floating-point to integer do. The rounding of conversions
from fixed-point to floating-point is unspecified.
----------------------------------------------------------------------------
Operations
The usual operators for basic arithmetic and for comparisons (add, subtract,
multiply, divide, equal, less than, etc.) are defined to work with fixed-
point values. In addition, fixed-point shifts left and right are defined
to be equivalent to multiplications and divisions by a power of two. The
"roundfx" operator is provided for explicitly rounding a fixed-point value
to a number of fractional bits. Finally, a "bits" operator allows the bits
of a fixed-point value to be accessed and manipulated as an integer, for
situations not covered by the other operations. All of these operations are
detailed in the subsections below.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Basic arithmetic
The standard binary arithmetic operators +, -, *, and / accept fixed-point
operands as follows:
- If both operands are fixed-point, the two operand types must be
"comparable" according to the following partial order (or lattice):
short_fract <= fract <= long_fract
short_accum <= accum <= long_accum
short_fract <= short_accum
fract <= accum
long_fract <= long_accum
(This means that one of the types must be guaranteed to have at least
as many integer bits and as many fractional bits as the other type, and
thus able to hold all values of the other type.) The "_sat" or "_mod"
attributes do not affect whether two fixed-point types are comparable.
It is a constraint error for the two fixed-point types not to be
comparable according to this partial order. The type of the result is
the larger type, with the adoption of any "_sat" or "_mod" attribute
from the other operand. For example, if the operands of an addition have
types "long_accum" and "fract_sat", the result type of the operation is
"long_accum_sat". It is a constraint error for one fixed-point operand
to be a "_sat" type and the other a "_mod" type.
+---------------------------------------------------------------------------
| If one of the fixed-point types isn't strictly "bigger" than the other,
| there's no certain way to choose the result type. Rather than invent
| an arbitrary rule, we've simply prohibited the questionable cases.
| Programmers can achieve what they want unambiguously through the use of
| type casts.
+---------------------------------------------------------------------------
- If one operand is fixed-point and the other an integer, the result type
is that of the fixed-point operand.
- It is a constraint error for one operand to be fixed-point and the other
floating-point.
+---------------------------------------------------------------------------
| The prohibition against mixing fixed-point and floating-point may come as
| a surprise, but it serves an important purpose. Imagine a program written
| for an embedded processor with no floating-point hardware. Fixed-point
| types are used in the program for certain non-integer calculations.
| Assume that some numeric constants appear in the code, e.g.,
| "a * 0.333333", where the variable "a" has fixed-point type. Now, if
| fixed-point and floating-point operands were allowed to mix, this would
| (presumably) count as a floating-point multiplication, which might take
| hundreds of clock cycles to emulate instead of the two- or three-cycle
| fixed-point multiplication the programmer intended. The problem is that
| obvious constants like 0.333333 have floating-point type in C, so any
| inadvertent use of them as operands will drag in floating-point emulation
| software and suck up clock cycles. From the perspective of the embedded
| software developer, these situations are bugs that have to be found and
| eliminated.
|
| It's a safe bet that embedded software that uses fixed-point arithmetic
| is going to be run on machines without efficient floating-point. Having
| fixed-point values automatically promote to floating-point is not a
| convenience; it's an annoyance working against the programmer. Rather
| than think of fixed-point as being between integers and floating-point,
| it's more helpful to think of it as an _alternative_ to floating-point.
| With that view, an operation with both fixed-point and floating-point
| operands has no preferred result type. Of course, programmers can easily
| use explicit type casts and assignments to convert between the two forms
| when needed.
+---------------------------------------------------------------------------
In all cases above, a fixed-point arithmetic operation is done exactly
(according to its mathematical definition), and then overflow handling
and rounding is performed for the result type as explained in the earlier
section _Overflow_and_Rounding_. An integer operand is _not_ first
converted to fixed-point before the operation is performed.
+---------------------------------------------------------------------------
| The proposal not to promote integers to fixed-point to balance the
| operands is clearly a departure from the way C is normally defined
| and, in particular, the way the same operations work when integer and
| floating-point operands are mixed. At BDTI, we've struggled with this
| inconsistency, but we keep running into a fundamental problem: integer
| values often can't be promoted honestly to fixed-point types. None of
| the "fract" types have _any_ integer bits, and some DSPs will have as
| few as four integer bits in their "accum" types. On such a machine, it's
| impossible to promote an integer bigger than 8 to any fixed-point type,
| which leaves only a limited range of integers to work with. Consider, for
| example, how you divide a fixed-point value by an integer variable whose
| value could be as large as, say, 15.
|
| The floating-point types have the advantage that (on all machines I know)
| the _range_ of all the integers fits within even the smallest floating-
| point type, so promoting an integer to floating-point at worst suffers
| a rounding error (and often not even that). This is absolutely not the
| case for the fixed-point types. On the other hand, unlike floating-point,
| from a hardware perspective, fixed-point and integer operations are nearly
| the same thing. Thus, it's no trouble for the hardware to mix integer
| and fixed-point operands and perform the calculation directly, as we've
| proposed. Since it seems the only argument against this approach is the
| wrinkle it incurs on the C language, we think the balance of the argument
| favors defining mixed fixed-point and integer operations as working
| directly without the integer argument being promoted to fixed-point first.
+---------------------------------------------------------------------------
Division by zero is undefined.
The standard assignment operators +=, -=, *=, and /= are defined in the
usual way when either operand is fixed-point. Note, in particular, that,
given the declarations
fract_sat a;
fract_mod b;
the expression "a += b" is a constraint violation for the same reason that
"a + b" is.
The pre- and postfix ++ and -- operators have their usual meaning of adding
or subtracting the integer 1 to/from the operand and returning the value
before or after the addition as the result.
The unary plus (+) and negation (-) operations are defined for fixed-point
as well. The result type is the same as that of the operand. The negation
operation is equivalent to subtracting the operand from the integer zero.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparisons
The standard comparison operators (<, <=, ==, >=, >, and !=) accept fixed-
point operands. For consistency, the rules for mixing operands of different
types are the same as for the basic arithmetic operations, except that it
is permitted to compare a fixed-point value of "_sat" type with a fixed-
point value of "_mod" type. (Note this means a fixed-point value cannot be
compared to a floating-point value without a type cast.) Fixed-point and
integer values are compared directly; the integer operand is not first
converted to fixed-point before the comparison is made.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Shifts left and right
Shifts of fixed-point values using the standard << and >> operators are
defined to be equivalent to multiplication or division by a power of two.
The right operand is converted to type "int" and must be nonnegative and
less than the total number of bits of the fixed-point operand (the left
operand). The result type is the same as that of the fixed-point operand.
An exact result is calculated and then converted to the result type in the
same way as the other fixed-point arithmetic operations.
The standard assignment operators <<= and >>= are defined in the usual way
when the left operand is fixed-point.
+---------------------------------------------------------------------------
| Shifts of fixed-point values are common, particularly for block floating-
| point, and need to be supported.
+---------------------------------------------------------------------------
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
roundfx
The "roundfx" operator rounds a fixed-point operand to a specified number of
fractional bits. The operator has a function-like syntax:
roundfx ( <fixed-value> , <num-fract-bits> )
The result type is the type of the first operand, which must be fixed-point.
The second operand is converted to type "int" and must be nonnegative and
less than the number of fractional bits in the fixed-point type. The fixed-
point value is rounded to the specified number of fractional bits in an
implementation-defined way, and this rounded value is returned as the
result. Fractional bits beyond the rounding point are set to zero in the
result.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Accessing the bits of the fixed-point type
The "bits" operator provides access to the bits of a fixed-point value as an
integer. The operator has a function-like syntax:
bits ( <fixed-value> )
The single operand must be a fixed-point type. When used as an expression,
the "bits" operator returns an integer equal to the fixed-point value
multiplied by 2^F, where F is the number of fractional bits in the fixed-
point type. The result type is a signed integer type big enough to hold all
valid result values for the given operand fixed-point type. For example, if
"fract" is 16 bits, then after the declaration
fract a = 0.5;
the value of "bits(a)" is 0.5 * 2^16 = 0x4000.
If the operand to "bits" is an lvalue, the same construct is also an lvalue
that can be used as the destination of an assignment. Continuing the
example above, the assignment
bits(a) = 0x2000;
sets "a" to the fixed-point value 0.25. The assignment causes the value on
the right hand side to be converted to an integer. If this integer value is
too large for the fixed-point type of the "bits" operand, overflow handling
occurs in the usual manner according to the fixed-point type, as defined
earlier in the section _Overflow_and_Rounding_.
The "bits" construct cannot have its address taken (by the "&" operator).
+---------------------------------------------------------------------------
| It is often necessary to access a fixed-point value as an integer in this
| way, and the other arithmetic operations don't provide as convenient a
| means to do this.
+---------------------------------------------------------------------------
----------------------------------------------------------------------------
Fixed-Point Constant Expressions
Fixed-point constant expressions whose type is neither a "_sat" nor a "_mod"
type are evaulated by the compiler with saturation on overflow.
+---------------------------------------------------------------------------
| Fixed-point constant expressions can be constructed using type casts, as
| in "(fract) 1.0".
+---------------------------------------------------------------------------
----------------------------------------------------------------------------