In *Table-Driven Implementation of the Exponential Function
in IEEE Floating-Point Arithmetic*, Ping Tak Peter Tang described
an implementation of an almost correctly-rounded elementary function in
single and double precision. John Harrison later did a complete formal proof
in HOL Light of the implementation in *Floating point verification
in HOL Light: the exponential function*.

Here we just focus on the tedious computation of the rounding error. We
consider neither the argument reduction nor the reconstruction part (trivial
anyway, excepted when the end result is subnormal). We want to prove that, in
the C code below, the absolute error between `e`

and the exponential
`E0`

of `R0`

(the ideal reduced argument) is less than
0.54 ulp. Variable `n`

is an integer and all the other variables are
single-precision floating-point numbers.

r2 = -n * l2; r = r1 + r2; q = r * r * (a1 + r * a2); p = r1 + (r2 + q); s = s1 + s2; e = s1 + (s2 + s * p);

Let us note `R`

the computed reduced argument and
`S`

the stored value of the ideal constant `S0`

. There
are 32 such constants. For the sake of simplicity, we only consider the
second constant: [math]. `E`

is
the value of the expression `e`

computed with an infinitely
precise arithmetic. `Z`

is the absolute error between the
polynomial [math] and
[math] for [math].

a1 = 8388676b-24; a2 = 11184876b-26; l2 = 12566158b-48; s1 = 8572288b-23; s2 = 13833605b-44; r2 rnd= -n * l2; r rnd= r1 + r2; q rnd= r * r * (a1 + r * a2); p rnd= r1 + (r2 + q); s rnd= s1 + s2; e rnd= s1 + (s2 + s * p); R = r1 + r2; S = s1 + s2; E0 = S0 * (1 + R0 + a1 * R0 * R0 + a2 * R0 * R0 * R0 + Z); { Z in [-55b-39,55b-39] /\ S - S0 in [-1b-41,1b-41] /\ R - R0 in [-1b-34,1b-34] /\ R in [0,0.0217] /\ n in [-10176,10176] -> e in ? /\ e - E0 in ? }

Please note the way `Z`

is introduced. Its definition is
backward: `Z`

is a real number such that `E0`

is the
product of `S0`

and the exponential of `R0`

. It makes for
clearer definitions and it avoids having to deal with divisions.

Results: e in [4282253b-22 {1.02097, 2^(0.0299396)}, 8768135b-23 {1.04524, 2^(0.0638374)}] e - E0 in [-13458043620277891b-59 {-0.023346, -2^(-5.42068)}, 3364512538651833b-57 {0.023346, 2^(-5.42068)}]

Gappa is able to bound both expressions. While the bounds for
`e`

seem sensible, the bounds for `e - E0`

are
grossly overestimated. This overestimation comes from the difference
between the structures of `e`

and `E0`

. To improve
the bounds on the error `e - E0`

, we split it into two parts:
a round-off error and a term that combines both discretization and
truncation errors. The round-off error is expressed by introducing a term
`E`

with the same structure as `e`

but without any
rounding operator.

E = s1 + (s2 + S * (r1 + (r2 + R * R * (a1 + R * a2))));

So the round-off error is `e - E`

, while the other term
is `E - E0`

. As before, the expressions `E`

and
`E0`

are structurally different, so Gappa grossly
overestimates the bounds of `E - E0`

. This time, we introduce a
term `Er`

with the same structure as `E0`

but equal
to `E`

. Since `Z`

has no corresponding term in
`E`

, we insert an artificial term `0`

in
`Er`

to obtain the same structure.

Er = S * (1 + R + a1 * R * R + a2 * R * R * R + 0);

Finally, we tell Gappa how to compute `e - E0`

using
`E`

and `Er`

.

e - E0 -> (e - E) + (Er - E0);

Note that, rather than using a hint, this equality could also have been indicated as a hypothesis of the logical formula.

@rnd = float< ieee_32, ne >; a1 = 8388676b-24; a2 = 11184876b-26; l2 = 12566158b-48; s1 = 8572288b-23; s2 = 13833605b-44; r2 rnd= -n * l2; r rnd= r1 + r2; q rnd= r * r * (a1 + r * a2); p rnd= r1 + (r2 + q); s rnd= s1 + s2; e rnd= s1 + (s2 + s * p); R = r1 + r2; S = s1 + s2; E = s1 + (s2 + S * (r1 + (r2 + R * R * (a1 + R * a2)))); Er = S * (1 + R + a1 * R * R + a2 * R * R * R + 0); E0 = S0 * (1 + R0 + a1 * R0 * R0 + a2 * R0 * R0 * R0 + Z); { Z in [-55b-39,55b-39] /\ S - S0 in [-1b-41,1b-41] /\ R - R0 in [-1b-34,1b-34] /\ R in [0,0.0217] /\ n in [-10176,10176] -> e in ? /\ e - E0 in ? } e - E0 -> (e - E) + (Er - E0);

Gappa answers that the error is bounded by 0.535 ulp. This is consistent with the bounds computed by Tang and Harrison.

Results: e in [8572295b-23 {1.0219, 2^(0.0312489)}, 4380173b-22 {1.04431, 2^(0.0625575)}] e - E0 in [-75807082762648785b-80 {-6.27061e-08, -2^(-23.9268)}, 154166255364809243b-81 {6.37617e-08, 2^(-23.9027)}]