JTC1/SC22/WG14
N808
WG14/N808 J11/98-007
DR 63 (and 56): Floating-Point accuracy
Proposed wording for the Response for DR 63 (and 56).
The following wording should be incorporated into a future
version of the standard in section 5.2.4.2.2 <float.h>, page
25, between paragraphs 3 and 4:
The accuracy of the floating-point operations (+, -, *, /)
and of the math library (<math.h> and <complex.h>) functions
that return a floating-point result is implementation
defined. The implementation shall document the accuracy as
the worst case error in terms of units in the last place
(ULPs), or it shall say that the accuracy is "unknown."
Rationale (section 5.2.4.2.2)
In documenting the accuracy of the math library functions,
an implementation is allowed to break the domain of a
function into disjoint regions and document the worst case
error in each region. An implementation should document the
accuracy of the complex operations and functions by listing
the worst case error in both the real and the imaginary
parts.
The error between two floating-point numbers (such as
computed and infinitly precise) in ULPs (Unit in the Last
Places) depends on the radix and the precision used in
representing the number, but not the exponent. Thus, the
ULP of a value x, with exponent e, precision p, and radix r,
is r**(1-p). With a decimal radix and 3 digits of
precision, the computed value 3.14e-2 differs from the value
3.1416e-2 by 0.16 ULPs.
The C9X committee discussed the idea of allowing the
programmer to find out the accuracy of floating-point
operations and math functions during compilation (say via
macro symbols) or during execution (via a function call),
but neither got enough support (even though this is what
some users want) to warrant the change to the standard.
Either would require over one hundred macro symbols to name
every math function (such as ULP_SINF, ULP_SIN, and ULP_SIND
just for the sin function) and that was a large change for a
small utility.
The values in <float.h> should be in terms of the hardware
representation used to store floating-point values (not in
terms of the effective accuracy of operations). That is, a
poor quality implementation cannot say (in <float.h>) that
it has less precision than the hardware representation, so
that it could claim 1-ULP accuracy for its operations.
The committee could not agree on upper limits on accuracy
that all conforming imlementations must meet, eg, addition
is no worst than 2 ULPs for all implementations. It is a
quality of implementation issue.
Implementations that conform to IEC-559 have 0.5 ULP
accuracy in round to nearest mode, and 1.0 ULP accuracy in
the other three rounding modes, for the basic arithmetic
operations and sqrt. For other floating-point arithmetics,
it is a rare implementation that has worse than 1-ULP
accuracy for the basic arithmetic operations.
The accuracy of binary-decimal converions and format
conversions are discussed elsewhere in the standard.
For the math library functions, currently, fast correctly
rounded 0.5 ULP accuracy is still a research problem. Some
imlementation provide two math libraries: one being fast and
sloppy, the other being slow and accurate.
Some math functions, such as those that do argument
reduction modulo an approximation of pi, have good accuracy
for small arguments, and poor accuracy for large arguments.
It is not unusual for an implementation of the trig
functions to have zero bits correct in the computed result
for large arguments. That is why the committee allows an
implementation to break the domain of the function into
disjoint sections and specify the worst case accuracy in
each section.
The value of an ULP of a positive number x that is a power
of the radix is the difference between the next
representable number (in the model) larger than x and x,
rather than the difference between x and the closest
representable number smaller than x. These two potential
definitions of an ULP's value differ by a factor of the
radix.