7. The Yacas arithmetic library


7.1 Introduction

Yacas comes with its own arbitrary-precision arithmetic library, to reduce dependencies on other software.

This part describes how the arithmetic library is embedded into Yacas.


7.2 The link between the interpreter and the arithmetic library

The Yacas interpreter has the concept of an atom, an object which has a string representation. Numbers are also atoms and are initially entered into Yacas as decimal strings.
There are functions to work with numbers in non-decimal bases, but direct input/output of numbers is supported only in decimal notation.
As soon as a calculation needs to be performed, the string representation is used to construct an object representing the number, in an internal representation that the arithmetic library can work with.

The basic layout is as follows: there is one class, BigNumber, that offers basic numerical functions (arithmetic operations such as addition and multiplication) through a set of class methods. Integers and floating-point numbers are handled by the same class.

The BigNumber class delegates the actual arithmetic operations to the auxiliary classes BigInt and BigFloat. These two classes are direct wrappers of an underlying arithmetic library. The library implements a particular internal representation of numbers.

The responsibility of the class BigNumber is to perform precision tracking, floating-point formatting, error reporting, type checking and so on, while BigInt and BigFloat only concern themselves with low-level arithmetic operations on integer and floating-point numbers respectively. In this way Yacas isolates higher-level features like precision tracking from the lower-level arithmetic operations. The number objects in a library should only be able to convert themselves to/from a string and perform basic arithmetic. It should be easy to wrap a generic arithmetic library into a BigNumber implementation.

It is impossible to have several alternative number libraries operating at the same time. [In principle, one might write the classes BigInt and BigFloat as wrappers of two different arithmetic libraries, one for integers and the other for floats, but at any rate one cannot have two different libraries for integers at the same time.] Having several libraries in the same Yacas session does not seem to be very useful; it would also incur a lot of overhead because one would have to convert the numbers from one internal library representation to another. For performance benchmarking or for testing purposes one can compile separate versions of Yacas configured with different arithmetic libraries.

To embed an arbitrary-precision arithmetic library into Yacas, one needs to write two wrapper classes, BigInt and BigFloat. (Alternatively, one could write a full BigNumber wrapper class but that would result in code duplication unless the library happens to implement a large portion of the BigNumber API. There is already a reference implementation of BigNumber through BigInt and BigFloat in the file numbers.cpp.) The required API for the BigNumber class is described below.
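As a rough illustration of this division of labour, the following schematic sketch (with hypothetical class and member names, not the actual Yacas source) shows how a high-level number class might dispatch an addition to the two low-level wrappers while keeping the type bookkeeping to itself.

// Schematic only: hypothetical names, not the Yacas BigNumber/BigInt/BigFloat API.
struct MyInt   { void Add(const MyInt& a, const MyInt& b); };                 // wraps the library's integer type
struct MyFloat { void Add(const MyFloat& a, const MyFloat& b, long bits); };  // wraps the library's float type

struct MyNumber
{
    bool    isInteger;   // type tracking lives in the high-level class, not in the wrappers
    MyInt   i;
    MyFloat f;

    void Add(const MyNumber& x, const MyNumber& y, long precisionBits)
    {
        isInteger = x.isInteger && y.isInteger;
        if (isInteger)
            i.Add(x.i, y.i);                  // integer addition is always exact
        else
            f.Add(x.f, y.f, precisionBits);   // float addition, truncated to the target precision
    }
};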


7.3 Interface of the BigNumber class

The following C++ code demonstrates how to use the objects of the BigNumber class.

// Calculate z=x+y where x=10 and y=15
BigNumber x("10",100,10);
BigNumber y("15",100,10);
BigNumber z;
z.Add(x,y,10);
// convert the result to a string
LispString  str;
z.ToString(str,10);
The behaviour is such that in the above example z will contain the result of adding x and y, without modifying x or y. This is equivalent to z:=x+y in Yacas.

A calculation might modify one of its arguments. This might happen when one argument passed in is actually the object performing the calculation itself. For example, if a calculation

x.Add(x,y);
were issued, the result would be assigned to x and the old value of x would be discarded. This is equivalent to the Yacas code x:=x+y. In this case a specific implementation might opt to perform the operation destructively ("in-place"). Some operations can be performed much more efficiently in-place, without copying the arguments; among them are, for example, Negate, Add, ShiftLeft, ShiftRight.

Therefore, all class methods of BigNumber that allow a BigNumber object as an argument should behave correctly when called destructively on the same BigNumber object. The result must be exactly the same as if all arguments were copied to temporary locations before performing tasks on them, with no other side-effects. For instance, if the specific object representing the number inside the numeric class is shared with other objects, it should not allow the destructive operation, as then other objects might start behaving differently.

The basic arithmetic class BigNumber defines some simple arithmetic operations, through which other more elaborate functions can be built. Particular implementations of the multiple-precision library are wrapped by the BigNumber class, and the rest of the Yacas core should only use the BigNumber API.

This API will not be completely exposed to Yacas scripts, because some of these functions are too low-level. Among the low-level functions, only those that are very useful for optimization will be available to the Yacas scripts. (For the functions that seem to be useful for Yacas, suggested Yacas bindings are given below.) But the full API will be available to C++ plugins, so that multiple-precision algorithms could be efficiently implemented when performance is critical. Intermediate-level arithmetic functions such as MathAdd, MathDiv, MathMod and so on could be implemented either in the Yacas core or in plugins, through this low-level API. The library scripts will be able to transform numerical expressions such as x:=y+z into calls of these intermediate-level functions.

Here we list the basic arithmetic operations that need to be implemented by a multiple-precision class BigNumber. The operations are divided into several categories for convenience. Equivalent Yacas script code is given, as well as examples of C++ usage.

1. Input/output operations.

2. Basic object manipulation.
These operations, as a rule, do not need to change the numerical value of the object.

3. Basic arithmetic operations.
Note that here "precision" always means the number of significant bits, i.e. digits in the base 2, not decimal digits.

4. Auxiliary operations.
Some additional operations useful for optimization purposes. These operations can be efficiently implemented with a binary-based internal representation of big numbers.

For floating-point numbers, BitCount should return the binary exponent of the number (with sign), like the integer output of the standard C function frexp. More formally: if n=BitCount(x), and x!=0, then 1/2<=Abs(x)*2^(-n)<1. The bit count of an integer zero or a floating zero is arbitrarily defined to be 1. (This function is essentially an optimized binary logarithm.) C++:
x.BitCount();
Yacas:
MathBitCount(x)
Note: the return type of the bit count is a platform signed integer (C++ type long).
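As a standalone illustration of this convention, the following small C++ program uses the standard frexp function on a machine double; for x=6.25 the returned exponent is 3, and indeed 1/2<=6.25*2^(-3)<1, which is the value that BitCount reports for the corresponding big float.

#include <cstdio>
#include <cmath>

int main()
{
    int e;
    double mantissa = std::frexp(6.25, &e);    // 6.25 = 0.78125 * 2^3
    std::printf("%g * 2^%d\n", mantissa, e);   // prints: 0.78125 * 2^3
    return 0;
}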

The API includes only the most basic operations. All other mathematical functions such as GCD, power, logarithm, cosine and so on, can be efficiently implemented using this basic interface.

Note that generally the arithmetic functions will set the type of the resulting object to the type of the result of the operation. For example, operations that only apply to integers (Mod, BitAnd etc.) will set the type of the resulting object to integer, even if that object previously held a float. The results of these operations on non-integer arguments are undefined.


7.4 Precision of arithmetic operations

All operations on integers are exact. Integers must grow or shrink when necessary, limited only by system memory. But floating-point numbers need some precision management.

In some arithmetic operations (add, multiply, divide) the working precision is given explicitly. For example,
x.Add(y,z,100)
will add y to z and put the result into x, truncating it to at most 100 bits of mantissa, if necessary. (The precision is given in bits, not in decimal digits, because when dealing with low-level operations it is much more natural to think in terms of bits.) If the numbers y, z have fewer than 100 bits of mantissa each, then their sum will not be precise to all 100 bits. That is fine; but it is important that the sum should not contain more than 100 bits. Floating-point values, unlike integers, only grow up to the given number of significant bits and then a round-off must occur. Otherwise we would be wasting a lot of time on computations with many meaningless digits.


Automatic precision tracking

The precision of arithmetic operations on floating-point numbers can be maintained automatically. A rigorous way to do it would be to represent each imprecise real number x by an interval with rational bounds within which x is guaranteed to be. This is usually called "interval arithmetic." A result of an interval-arithmetic calculation is "exact" in the sense that the actual (unknown) number x is always within the resulting interval. However, interval arithmetic is computationally expensive and at any rate the width of the resulting interval is not guaranteed to be small enough for a particular application.

For the Yacas arithmetic library, a "poor man's interval arithmetic" is proposed where the precision is represented by the "number of correct bits". The precision is not tracked exactly but almost always adequately. The purpose of this kind of rough precision tracking is to catch a critical roundoff error or to indicate an unexpected loss of precision in numerical calculations.

Suppose we have two floating-point numbers x and y and we know that they have certain numbers of correct mantissa bits, say m and n. In other words, x is an approximation to an unknown real number x'=x*(1+delta) and we know that Abs(delta)<2^(-m); and similarly y'=y*(1+epsilon) with Abs(epsilon)<2^(-n). Here delta and epsilon are the relative errors for x and y. Typically delta and epsilon are much smaller than 1.

Suppose that every floating-point number knows the number of its correct digits. We can symbolically represent such numbers as pairs {x,m} or {y,n}. When we perform an arithmetic operation on numbers, we need to update the precision component as well.

Now we shall consider the basic arithmetic operations to see how the precision is updated.

Multiplication
If we need to multiply x and y, the correct answer is x'*y' but we only know an approximation to it, x*y. We can estimate the precision from x'*y'=x*y*(1+delta)*(1+epsilon), from which it follows that the relative error of x*y is at most delta+epsilon (neglecting the second-order term delta*epsilon). But we only represent the relative errors by the number of bits. The whole idea of the simplified precision tracking is to avoid costly operations connected with precision. So instead of tracking the number delta+epsilon exactly, we represent it roughly: either set the error of x*y to the larger of the errors of x and y, or double the error.

More formally, we have the estimates Abs(delta)<2^(-m), Abs(epsilon)<2^(-n) and we need a similar estimate Abs(r)<2^(-p) for r=delta+epsilon.

If the two numbers x and y have the same number of correct bits, we should double the error (i.e. decrease the number of significant bits by 1). But if they don't have the same number of bits, we cannot really estimate the error very well. To be on the safe side, we might double the error if the numbers x and y have almost the same number of significant bits, and leave the error constant if the numbers of significant bits of x and y are very different.

The answer expressed as a formula is p=Min(m,n) if Abs(m-n)>=D and p=Min(m,n)-1 otherwise. Here D is a constant that expresses our tolerance for error. In the current implementation, D=1.
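A minimal sketch of this rule for two nonzero floats (a hypothetical helper function, not part of the BigNumber API):

#include <cstdlib>

// p = Min(m,n) if Abs(m-n) >= D, and p = Min(m,n)-1 otherwise; here D = 1.
long MultiplyPrecision(long m, long n)
{
    const long D = 1;                 // tolerance constant from the text
    long p = (m < n) ? m : n;         // Min(m,n)
    if (std::labs(m - n) < D)         // errors of comparable size: double the error
        p -= 1;
    return p;
}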

If one of the operands is a floating zero {0.,m} (see below) and the other operand is {x,n}, then p=m-BitCount(x)+1. This is the same formula as above, if we pretend that the bit count of {0.,m} is equal to 1-m.

Division
Division is multiplication by the inverse number. When we take the inverse of x*(1+delta), we obtain approximately 1/x*(1-delta). The relative precision does not change when we take the inverse. So the handling of precision is exactly the same as for the multiplication.

Addition
Addition is more complicated because the absolute rather than the relative precision plays the main role, and because there may be roundoff errors associated with subtracting almost equal numbers.

Formally, we have the relative precision r of x+y as

r=(delta*x+epsilon*y)/(x+y).

We have the bounds on delta and epsilon:

Abs(delta)<2^(-m) and Abs(epsilon)<2^(-n),

and we need to find a bit bound on r, i.e. an integer p such that Abs(r)<2^(-p). But we cannot estimate p without first computing x+y and analyzing the relative magnitudes of x and y. To perform this estimate, we need the bit counts of x, y, and x+y. Let these bit counts be a, b and c, so that Abs(x)<2^a, Abs(y)<2^b, and 2^(c-1)<=Abs(x+y)<2^c. (At first we assume that x!=0, y!=0, and x+y!=0.) Now we can estimate r as

r<=Abs((x*2^(-m))/(x+y))+Abs((y*2^(-n))/(x+y))<=2^(a+1-m-c)+2^(b+1-n-c).

This is formally similar to multiplying two numbers with m+c-a-1 and n+c-b-1 correct bits respectively. As in the case of multiplication, we may take the minimum of the two bit counts, or decrease it by one (doubling the error) if they are almost equal.

Note that there is one important case when we can estimate the precision better than this. Suppose x and y have the same sign; then there is no cancellation when we compute x+y. The above formula for r gives an estimate

r<Max(Abs(delta),Abs(epsilon))

and therefore the precision of the result is at least p=Min(m,n).

If one of the operands is a floating zero represented by x={0.,m} (see below), then the calculation of the error is formally the same as in the case x={1.,m}. This is as if the bit count of {0.,m} were equal to 1 (unlike the case of multiplication).

Finally, if the sum x+y is a floating zero but x!=0 and y!=0, then it must be that a=b. In that case we represent x+y as {0.,p}, where p=Min(m,n)-a.

Computations with a given target precision
Using these rules, we can maintain a bound on the numerical errors of all calculations. But sometimes we know in advance that we shall not be needing any more than a certain number of digits of the answer, and we would like to avoid an unnecessarily high precision and reduce the computation time. How can we combine an explicitly specified precision, for example, in the function
x.Add(y,z,100)
with the automatic precision tracking?

We should truncate one or both of the arguments to a smaller precision before starting the operation. For the multiplication as well as for the addition, the precision tracking involves a comparison of two binary exponents 2^(-g) and 2^(-h) to obtain an estimate on 2^(-g)+2^(-h). Here g and h are some integers that are easy to obtain during the computation. For instance, the multiplication involves g=m and h=n. This comparison will immediately show which of the arguments dominates the error.

The ideal situation would be when one of these exponentials is much smaller than the other, but not very much smaller (that would be a waste of precision). In other words, we should aim for Abs(g-h)<8 or so, where 8 is the number of guard bits we would like to maintain. (Generally it is a good idea to have at least 8 guard bits; somewhat more guard bits do not slow down the calculation very much, but 200 guard bits would be surely an overkill.) Then the number that is much more precise than necessary can be truncated.

For example, if we find that g=250 and h=150, then we can safely truncate x to 160 bits or so; if, in addition, we need only 130 bits of final precision, then we could truncate both x and y to about 140 bits.

Note that when we need to subtract two almost equal numbers, there will be a necessary loss of precision, and it may be impossible to decide on the target precision before performing the subtraction. Therefore the subtraction will have to be performed using all available digits.

The floating zero
There is a difference between an integer zero and a floating-point zero. An integer zero is exact, so the result of 0*1.1 is exactly zero (also an integer). However, x:=1.1-1.1 is a floating-point zero (a "floating zero" for short) of which we can only be sure about the first digit after the decimal point, i.e. x=0.0. The number x might represent 0.01 or -0.02 for all we know.

It is impossible to track the relative precision of a floating zero, but it is possible to track the absolute precision. Suppose we store the bit count of the absolute precision, just as we store the bit count of the relative precision with nonzero floats. Thus we represent a floating zero as a pair {0.,n} where n is an integer, and the meaning of this is a number between -2^(-n) and 2^(-n).

We can now perform some arithmetic operations on the floating zero. Addition and multiplication are handled similarly to the non-zero case, except that we interpret n as the absolute error rather than the relative error. This does not present any problems. For example, the error estimate for addition is the same as if we had the number 1 with relative error 2^(-n) instead of {0.,n}. With multiplication of {x,m} by {0.,n}, the result is again a floating zero {0.,p}, and the new estimate of absolute precision is p=n-BitCount(x)+1.

The division by the floating zero, negative powers, and the logarithm of the floating zero are not representable in our arithmetic because, interpreted as intervals, they would correspond to infinite ranges. The bit count of the floating zero is therefore undefined. However, we can define a positive power of the floating zero (the result is again a floating zero).

The sign of the floating zero is defined as (integer) 0. (Then we can quickly check whether a given number is a zero.)


Comparison of floats

Suppose we need to compare floating-point numbers x and y. In the strict mathematical sense this is an unsolvable problem because we may need in principle arbitrarily many digits of x and y before we can say that they are equal. In other words, "zero-testing is uncomputable". So we need to relax the mathematical rigor somewhat.

Suppose that x=12.0 and y=12.00. Then in fact x might represent a number such as 12.01, while y might represent 11.999. There may be two approaches: first, "12.0" is not equal to "12.00" because x and y might represent different numbers. Second, "12.0" is equal to "12.00" because x and y might also represent equal numbers. A logical continuation of the first approach is that "12.0" is not even equal to another copy of "12.0" because they might represent different numbers, e.g. if we compute x=6.0+6.0 and y=24.0/2.0, the roundoff errors might be different.

Here is an illustration in support of the idea that the comparison 12.0=12 should return True. Suppose we are writing an algorithm for computing the power, x^y. This is much faster if y is an integer because we can use the binary squaring algorithm. So we need to detect whether y is an integer. Now suppose we are given x=13.3 and y=12.0. Clearly we should use the integer powering algorithm, even though technically y is a float. (To be sure, we should check that the integer powering algorithm generates enough significant digits.)

However, the opposite approach is also completely possible: no two floating-point numbers should be considered equal, except perhaps when one is a bit-for-bit exact copy of the other and when we haven't yet performed any arithmetic on them.

It seems that no algorithm really needs a test for equality of floats. The two useful comparisons on floats x, y seem to be the following:

Given these predicates, it seems that any floating-point algorithm can be implemented just as efficiently as with any "reasonable" definition of the floating-point equality.


How to increase the working precision

Suppose that in a Yacas session we declare Builtin'Precision'Set(5), write x:=0.1, and then increase precision to 10 digits. What is x now? There are several approaches:

1) The number x stays the same but further calculations are done with 10 digits. In terms of the internal binary representation, the number is padded with binary zeros. This means that now e.g. 1+x will not be equal to 1.1 but to something like 1.100000381 (to 10 digits). And actually x itself should evaluate to 0.1000003815 now. This was 0.1 to 5 digits but it looks a little different if we print it to 10 digits. (A "binary-padded decimal".)

This problem may look horrible at first sight -- "how come I can't write 0.1 any more??" -- but this seems so because we are used to calculations in decimals with a fixed precision, and the operation such as "increase precision by 10 digits" is largely unfamiliar to us except in decimals. This seems to be mostly a cosmetic problem. In a real calculation, we shouldn't be writing "0.1" when we need an exact number 1/10. When we request to increase precision in the middle of a calculation, this mistake surfaces and gives unexpected results.

2) When precision is increased, the number x takes its decimal representation, pads it with zeros, and converts back to the internal representation, just so that the appearance of "1.100000381" does not jar our eyes. (Note that the number x does not become "more precise" if we pad it with decimal zeros instead of binary zeros, unless we made a mistake and wrote "0.1" instead of the exact fraction 1/10.)

With this approach, each number x that doesn't currently have enough digits must change in a complicated way. This will mean a performance hit in all calculations that require dynamically changing precision (Newton's method and some other fast algorithms require this). In these calculations, the roundoff error introduced by "1.100000381" is automatically compensated and the algorithm will work equally well no matter how we extend x to more digits; but it's a lot slower to go through the decimal representation every time.

3) The approach currently being implemented in Yacas is a compromise between the above two. We distinguish number objects that were given by the user as decimal strings (and not as results of calculations), for instance x:=1.2, from number objects that are results of calculations, for instance y:=1.2*1.4. Objects of the first kind are interpreted as exact rational numbers given by a decimal fraction, while objects of the second kind are interpreted as inexact floating-point numbers known to a limited precision. Suppose x and y are first assigned as indicated, with the precision of 5 digits each, then the precision is increased to 10 digits and x and y are used in some calculation. At this point x will be converted from the string representation "1.2" to 10 decimal digits, effectively making 1.2 a shorthand for 1.200000000. But the value of y will be binary-padded in some efficient way and may be different from 1.680000000.

In this way, efficiency is not lost (there are no repeated conversions from binary to decimal and back), and yet the cosmetic problem of binary-padded decimals does not appear. An explicitly given decimal string such as "1.2" is interpreted as a shorthand for 1.2000... with as many zeroes as needed for any currently selected precision. But numbers that are results of arithmetic operations are not converted back to a decimal representation for zero-padding. Here are some example calculations:

In> Builtin'Precision'Set(5)
Out> True
In> x:=1.2
Out> 1.2
In> y:=N(1/3)
Out> 0.33333
The number y is a result of a calculation and has a limited precision. Now we shall increase the precision:
In> Builtin'Precision'Set(20)
Out> True
In> y
Out> 0.33333
The number y is printed with 5 digits, because it knows that it has only 5 correct digits.
In> y+0
Out> 0.33333333325572311878
In a calculation, y was binary-padded, so the last digits are incorrect.
In> x+0
Out> 1.2
However, x has not been padded and remains an "exact" 1.2.
In> z:=y+0
Out> 0.33333333325572311878
In> Builtin'Precision'Set(40)
Out> True
In> z
Out> 0.33333333325572311878
Now we can see how the number z is padded again:
In> z+0
Out> 0.33333333325572311878204345703125


The meaning of the Builtin'Precision'Set() call

The user calls Builtin'Precision'Set() to specify the "wanted number of digits." We could use different interpretations of the user's wish: The first interpretation is that Builtin'Precision'Set(10) means "I want all answers of all calculations to contain 10 correct digits". The second interpretation is "I want all calculations with floating-point numbers done using at least 10 digits".

Suppose we have floating-point numbers x and y, known only to 2 and 3 significant digits respectively. For example, x=1.6 and y=2.00. These x and y are results of previous calculations and we do not have any more digits than this. If we now say
Builtin'Precision'Set(10);
x*y;
then clearly the system cannot satisfy the first interpretation because there aren't enough digits of x and y to find 10 digits of x*y. But we can satisfy the second interpretation, even if we print "3.2028214767" instead of the expected 3.2. The garbage after the third digit is unavoidable and harmless unless our calculation really depends on having 10 correct digits of x*y. The garbage digits can be suppressed when the number is printed, so that the user will never see them. But if our calculation depends on the way we choose the extra digits, then we are using a bad algorithm.

The first interpretation of Builtin'Precision'Set() is only possible to satisfy if we are given a self-contained calculation with integer numbers. For example,
N(Sin(Sqrt(3/2)-10^(20)), 50)
This can be computed to 50 digits with some effort, but only if we are smart enough to use 70 digits in the calculation of the argument of Sin(). (This level of smartness is currently not implemented in the N function.) The result of this calculation will have 50 digits and not a digit more; we cannot put the result inside another expression and expect full precision in all cases. This seems to be a separate task, "compute something with n digits no matter what", and not a general routine to be followed at all times.

So it seems that the second interpretation of Builtin'Precision'Set(n), namely: "please use n digits in all calculations now", is more sensible as a general-purpose prescription.

But this interpretation does not mean that all numbers will be printed with n digits. Let's look at a particular case (for simplicity we are talking about decimal digits but in the implementation they will be binary digits). Suppose we have x precise to 10 digits and y precise to 20 digits, and the user says Builtin'Precision'Set(50) and z:=1.4+x*y. What happens now in this calculation? (Assume that x and y are small numbers of order 1; the other cases are similar.)

First, the number "1.4" is now interpreted as being precise to 50 digits, i.e. "1.4000000...0" but not more than 50 digits.

Then we compute x*y using their internal representations. The result is good only to 10 digits, and it knows this. We do not compute 50 digits of the product x*y, it would be pointless and a waste of time.

Then we add x*y to 1.4000...0. The sum, however, will be precise only to 10 digits. We can do one of two things now: (a) we could pad x*y with 40 more zero digits and obtain a 50-digit result. However, this result will only be correct to 10 digits. (b) we could truncate 1.4 to 10 digits (1.400000000) and obtain the sum to 10 digits. In both cases the result will "know" that it only has 10 correct digits.

It seems that the option (b) is better because we do not waste time with extra digits.

The result is a number that is precise to 10 digits. However, the user wants to see this result with 50 digits. Even if we chose the option (a), we would have had some bogus digits, in effect, 40 digits of somewhat random round-off error. Should we print 10 correct digits and 40 bogus digits? It seems better to print only 10 correct digits in this case.

If we choose this route, then the only effect of Builtin'Precision'Set(50) will be to interpret a literal constant 1.4 as a 50-digit number. All other numbers already know their real precision and will not invent any bogus digits.

In some calculations, however, we do want to explicitly extend the precision of a number to some more digits. For example, in Newton's method we are given a first approximation x[0] to a root of f(x)=0 and we want to have more digits of that root. Then we need to pad x[0] with some more digits and re-evaluate f(x[0]) to more digits (this is needed to get a better approximation to the root). This padding operation seems rather special and directed at a particular number, not at all numbers at once. For example, if f(x) itself contains some floating-point numbers, then we should be unable to evaluate it with higher precision than allowed by the precision of these numbers. So it seems that we need access to these two low-level operations: the padding and the query of current precision. The proposed interface is GetExactBits(x) and SetExactBits(x,n). These operations are directed at a particular number object x.


Summary of arbitrary-precision semantics


Formal definitions of precision tracking

Here we shall consider arithmetic operations on floats x and y, represented as pairs {x,m} and {y,n}. The result of the operation is z, represented as a pair {z,p}. Here x, y, z are floats and m, n, p are integers.

We give formulae for p in terms of x, y, m, and n. Sometimes the bit count of a number x is needed; it is denoted B(x) for brevity.

Formal definitions
A pair {x,m} where x is a floating-point value and m is an integer value (the "number of correct bits") denotes a real number between x*(1-2^(-m)) and x*(1+2^(-m)) when x!=0, and a real number between -2^(-m) and 2^(-m) when x=0 (a "floating zero").

The bit count B(x) is an integer function of x defined for real x!=0 by

B(x):=1+Floor(Ln(Abs(x))/Ln(2)).

This function also satisfies

2^(B(x)-1)<=Abs(x)<2^B(x).

For example, B(1/4)= -1, B(1)=B(3/2)=1, B(4)=3. The bit count of zero is arbitrarily set to 1. For integer x, the value B(x) is the number of bits needed to write the binary representation of x.

The bit count function can usually be computed in constant time, because the usual representation of long numbers is by arrays of platform integers and a binary exponent. The length of the array of digits is usually available at no computational cost.
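For example, with a hypothetical representation consisting of an array of 32-bit limbs (most significant limb nonzero) and a binary exponent e, so that the value equals the limbs read as an integer times 2^e, the bit count can be read off without any long-number arithmetic:

#include <cstdint>

long BitCountSketch(const uint32_t* limbs, int nlimbs, long e)
{
    uint32_t top = limbs[nlimbs - 1];
    int bits = 0;
    while (top) { ++bits; top >>= 1; }        // bits in the most significant limb (at most 32 steps)
    return e + 32L * (nlimbs - 1) + bits;     // B(x) = (number of mantissa bits) + exponent
}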

The absolute error Delta[x] of {x,n} is of order Abs(x)*2^(-n). Given the bit count of x, this can be estimated as

2^(B(x)-n-1)<=Delta[x]<2^(B(x)-n).

So the bit count of Delta[x] is B(x)-n.

Floor()
The function Floor({x,m}) gives an integer result if there are enough digits to determine it exactly, and otherwise returns the unchanged floating-point number. The condition for Floor({x,m}) to give an exact result is

m>=B(x).

BecomeFloat()
The function BecomeFloat(n) will convert an integer to a float with at least n digits of precision. If x is the original integer value, then the result is {x,p} where p=Max(n,B(x)).

Underflow check
It is possible to have a number {x,n} with x!=0 such that {0.,m}={x,n} for some m. This would mean that the floating zero {0.,m} is not precise enough to be distinguished from {x,n}, i.e.

Abs(x)<2^(-m).

This situation is normal. But it would be meaningless to have a number {x,n} with x!=0 and a precision interval that contains 0. Such {x,n} will in effect be equal to any zero {0.,m}, because we do not know enough digits of x to distinguish {x,n} from zero.

From the definition of {x,n} with x!=0 it follows that 0 can be within the precision interval only if n<= -1. Therefore, we should transform any number {x,n} such that x!=0 and n<= -1 into a floating zero {0.,p} where

p=n-B(x).

(Now it is not necessarily true that p>=0.) This check should be performed at any point where a new precision estimate n is obtained for a number x and where a cancellation may occur (e.g. after a subtraction). Then we may assume that any given float is already reduced to zero if possible.
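A minimal sketch of this check, with a machine double standing in for the long float (the actual code would work on the BigNumber representation and use its BitCount):

#include <cmath>

struct Approx { double value; long prec; };   // a number {x,n}: value plus bit-precision

void ReduceToZeroIfNeeded(Approx& a)
{
    if (a.value != 0 && a.prec <= -1)
    {
        int B;
        std::frexp(a.value, &B);     // bit count B(x) of the value
        a.prec  = a.prec - B;        // p = n - B(x)
        a.value = 0;                 // the number becomes the floating zero {0., p}
    }
}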

Equals()
We need to compare {x,m} and {y,n}.

First, we can quickly check that the values x and y have the same nonzero signs and the same bit counts, B(x)=B(y). If x>0 and y<0 or vice versa, or if B(x)!=B(y), then the two numbers are definitely unequal. We can also check whether both x=y=0; if this is the case, then we know that {x,m}={y,n} because any two zeros are equal.

However, a floating zero can be sometimes equal to a nonzero number. So we should now exclude this possibility: {0.,m}={y,n} if and only if Abs(y)<2^(-m). This condition is equivalent to

B(y)<= -m.

If these checks do not provide the answer, the only possibility left is when x!=0 and y!=0 and B(x)=B(y).

Now we can consider two cases: (1) both x and y are floats, (2) one is a float and the other is an integer.

In the first case, {x,m}={y,n} if and only if the following condition holds:

Abs(x-y)<Max(2^(-m)*Abs(x),2^(-n)*Abs(y)).

This is a somewhat complicated condition but its evaluation does not require any long multiplications, only long additions, bit shifts and comparisons.

It is now necessary to compute x-y (one long addition); this computation needs to be done with Min(m,n) bits of precision.

After computing x-y, we can avoid the full evaluation of the complicated condition by first checking some easier conditions on x-y. If x-y=0 as floating-point numbers ("exact cancellation"), then certainly {x,m}={y,n}. Otherwise we can assume that x-y!=0 and check:

If neither of these conditions can give us the answer, we have to evaluate the full condition by computing Abs(x)*2^(-m) and Abs(y)*2^(-n) and comparing with Abs(x-y).

In the second case, one of the numbers is an integer x and the other is a float {y,n}. Then x={y,n} if and only if

Abs(x-y)<2^(-n)*Abs(y).

For the computation of x-y, we need to convert x into a float with precision of n digits, i.e. replace the integer x by a float {x,n}. Then we may use the procedure for the first case (two floats) instead of implementing a separate comparison procedure for integers.
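The sequence of checks can be summarized in the following sketch, where machine doubles stand in for the long floats and m, n are the numbers of correct bits; the real implementation performs the same comparisons using only long additions, shifts and comparisons.

#include <cmath>

bool EqualsSketch(double x, long m, double y, long n)
{
    if (x == 0 && y == 0) return true;                       // any two zeros are equal
    // a floating zero {0.,m} equals {y,n} if and only if Abs(y) < 2^(-m)
    if (x == 0) return std::fabs(y) < std::ldexp(1.0, (int)(-m));
    if (y == 0) return std::fabs(x) < std::ldexp(1.0, (int)(-n));
    if ((x > 0) != (y > 0)) return false;                     // different signs: unequal
    int bx, by;
    std::frexp(x, &bx);
    std::frexp(y, &by);
    if (bx != by) return false;                               // different bit counts: unequal
    // general case: Abs(x-y) < Max(2^(-m)*Abs(x), 2^(-n)*Abs(y))
    double tol = std::fmax(std::ldexp(std::fabs(x), (int)(-m)),
                           std::ldexp(std::fabs(y), (int)(-n)));
    return std::fabs(x - y) < tol;
}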

LessThan()
If {x,m}={y,n} according to the comparison function Equals(), then the predicate LessThan is false. Otherwise it is true if and only if x<y as floats.

IsIntValue()
To check whether {x,n} has an integer value within its precision, we first need to check that {x,n} has enough digits to compute Floor(x) accurately. If not (if n<B(x)), then we conclude that x has an integer value. Otherwise we compute y:=x-Floor(x) as a float value (without precision control) to n bits. If y is exactly zero as a float value, then x has an integer value. Otherwise {x,n} has an integer value if and only if B(y)< -n.

This procedure is basically the same as comparing {x,n} with Floor(x).
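In sketch form (again with a machine double standing in for the long float):

#include <cmath>

bool IsIntValueSketch(double x, long n)
{
    int Bx = 0;
    if (x != 0) std::frexp(x, &Bx);      // bit count B(x)
    if (n < Bx) return true;             // not enough digits to resolve a fractional part
    double y = x - std::floor(x);        // fractional part, computed without precision control
    if (y == 0) return true;             // exactly an integer
    int By;
    std::frexp(y, &By);
    return By < (int)(-n);               // fractional part lies below the error bound
}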

Sign()
The sign of {x,n} is defined as the sign of the float value x. (The number {x,n} should have been reduced to a floating zero if necessary.)

Addition and subtraction (Add, Negate)
We need to add {x,m} and {y,n} to get the result {z,p}. Subtraction is the same as addition, except we negate the second number. When we negate a number, its precision never changes.

First consider the case when x+y!=0.

If x is zero, i.e. {0.,m} (but x+y!=0), then the situation with precision is the same as if x were {1.,m}, because then the relative precision is equal to the absolute precision. In that case we take the bit count of x as B(0)=1 and proceed by the same route.

First, we should decide whether it is necessary to add the given numbers at all. It may be unnecessary if, for example, x+y is indistinguishable from x within the precision of x (we may say that a "total underflow" occurred during the addition). To check for this, we need to estimate the absolute errors of x and y:

2^(B(x)-m-1)<=Delta[x]<2^(B(x)-m),

2^(B(y)-n-1)<=Delta[y]<2^(B(y)-n).

Addition is not necessary if Abs(x)<=Delta[y] or if Abs(y)<=Delta[x]. Since we should rather perform an addition than wrongly dismiss it as unnecessary, we should use a sufficient condition here: if

B(x)<=B(y)-n-1

then we can neglect x and set z=y, p=n-Dist(B(x),B(y)-n-1). (We subtract one bit from the precision of y in case the magnitude of x is close to the absolute error of y.) Also, if

B(y)<=B(x)-m-1

then we can neglect y and set z=x, p=m-Dist(B(y),B(x)-m-1).

Suppose none of these checks were successful. Now, the float value z=x+y needs to be calculated. To find it, we need the target precision of only

1+Max(B(x),B(y))-Max(B(x)-m,B(y)-n)

bits. (An easier upper bound on this is 1+Max(m,n) but this is wasteful when x and y have very different precisions.)

Then we compute B(z) and determine the precision p as

p=Min(m-B(x),n-B(y))+B(z)-1-Dist(m-B(x),n-B(y)),

where the auxiliary function Dist(a,b) is defined as 0 when Abs(a-b)>2 and 1 otherwise.
The definition of Dist(a,b) is necessarily approximate; if we replace 2 by a larger number, we shall be overestimating the error in more cases.

In the case when x and y have the same sign, we have a potentially better estimate p=Min(m,n). We should take this value if it is larger than the value obtained from the above formula.

Also, the above formula is underestimating the precision of the result by 1 bit if the result and the absolute error are dominated by one of the summands. In this case the absolute error should be unchanged save for the Dist term, i.e. the above formula needs to be incremented by 1. The condition for this is B(x)>B(y) and B(x)-m>B(y)-n, or the same for y instead of x.

The result is now {z,p}.

Note that the obtained value of p may be negative (total underflow) even though we have first checked for underflow. In that case, we need to transform {z,p} into a floating zero, as usual.

Now consider the case when z:=x+y=0.

This is only possible when B(x)=B(y). Then the result is {0.,p} where p is found as

p=1+Min(m,n)-B(x)-Dist(m,n).

Note that this is the same formula as in the general case, if we define B(z)=B(0):=1. Therefore with this definition of the bit count one can use one formula for the precision of addition in all cases.

If the addition needs to be performed with a given maximum precision P, and it turns out that p>P, then we may truncate the final result to P digits and set its precision to P instead. (It is advisable to leave a few bits untruncated as guard bits.) However, the first operation z:=x+y must be performed with the precision specified above, or else we run the danger of losing significant digits of z.
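Putting the formulas of this subsection together, a hypothetical helper for the precision of the sum of two nonzero floats with a nonzero sum might look as follows (the underflow checks, the final reduction to a floating zero, and the one-bit correction for a dominant summand are left out for brevity).

#include <cstdlib>

// Dist(a,b) = 0 when Abs(a-b) > 2 and 1 otherwise, as defined above.
static long Dist(long a, long b) { return (std::labs(a - b) > 2) ? 0 : 1; }

// m, n: correct bits of x and y; Bx, By, Bz: bit counts of x, y and z = x + y.
long AddPrecision(long m, long Bx, long n, long By, long Bz, bool sameSign)
{
    long ex = m - Bx, ey = n - By;                       // minus the bit counts of the absolute errors
    long p  = ((ex < ey) ? ex : ey) + Bz - 1 - Dist(ex, ey);
    long pSameSign = (m < n) ? m : n;                    // no cancellation: at least Min(m,n)
    if (sameSign && pSameSign > p)
        p = pSameSign;
    return p;
}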

Adding integers to floats
If an integer x needs to be added to a float {y,n}, then we should formally use the same procedure as if x had infinitely many precise bits. In practice we can take some shortcuts.

It is enough to convert the integer to a float {x,m} with a certain finite precision m and then follow the general procedure for adding floats. The precision m must be large enough so that the absolute error of {x,m} is smaller than the absolute error of {y,n}: B(x)-m<=B(y)-n-1, hence

m>=1+n+B(x)-B(y).

In practice we may allow for a few guard bits over the minimum m given by this formula.

Sometimes the formula gives a negative value for the minimum m; this means underflow while adding the integer (e.g. adding 1 to 1.11e150). In this case we do not need to perform any addition at all.
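As a sketch, the choice of m from this formula with a few guard bits added (the function name and the guard-bit count are illustrative):

// Bx, By are the bit counts of the integer x and of the float value y; n is the precision of {y,n}.
// A negative return value signals that the integer can be neglected entirely.
long PrecisionForIntegerAddend(long Bx, long By, long n)
{
    const long guard = 8;             // a few guard bits (an arbitrary choice here)
    long m = 1 + n + Bx - By;         // m >= 1 + n + B(x) - B(y)
    return (m < 0) ? -1 : m + guard;
}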

Multiplication
We need to multiply {x,m} and {y,n} to get the result {z,p}.

First consider the case when x!=0 and y!=0. The resulting value is z=x*y and the precision is

p=Min(m,n)-Dist(m,n).

If one of the numbers is an integer x, and the other is a float {y,n}, it is enough to convert x to a float with somewhat more than n bits, e.g. {x,n+3}, so that the Dist function does not decrement the precision of the result.

Now consider the case when {x,m}={0.,m} but y!=0. The result z=0 and the resulting precision is

p=m-B(y)+1.

Finally, consider the case when {x,m}={0.,m} and {y,n}={0.,n}. The result z=0 and the resulting precision is

p=m+n.

The last two formulae are the same if we define the bit count of {0.,m} as 1-m. This differs from the "standard" definition of B(0)=1. (The "standard" definition is convenient for the handling of addition.) With this non-standard definition, we may use the unified formula

p=2-B(x)-B(y)

for the case when one of x, y is a floating zero.

If the multiplication needs to be performed to a given target precision P which is larger than the estimate p, then we can save time by truncating both operands to P digits before performing the multiplication. (It is advisable to leave a few bits untruncated as guard bits.)
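Collecting the cases of this subsection into one hypothetical helper (using the non-standard bit count 1-m for a floating zero {0.,m}):

#include <cstdlib>

// m, n: precisions of the operands; Bx, By: their bit counts (ignored for floating zeros).
long MultiplyPrecisionFull(bool xIsZero, long m, long Bx,
                           bool yIsZero, long n, long By)
{
    if (xIsZero || yIsZero)
    {
        long bx = xIsZero ? (1 - m) : Bx;    // B({0.,m}) := 1 - m
        long by = yIsZero ? (1 - n) : By;
        return 2 - bx - by;                  // unified formula p = 2 - B(x) - B(y)
    }
    long p = (m < n) ? m : n;                // p = Min(m,n)
    if (std::labs(m - n) <= 2)               // ... minus Dist(m,n)
        p -= 1;
    return p;
}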

Division
Division is handled essentially in the same way as multiplication. The relative precision of x/y is the same as the relative precision of x*y as long as both x!=0 and y!=0.

When x=0 and y!=0, the result of division {0.,m}/{y,n} is a floating zero {0.,p} where p=m+B(y)-1. When x is an integer zero, the result is also an integer zero.

Division by an integer zero or by a floating zero is not permitted. The code should signal a zero division error.

ShiftLeft(), ShiftRight()
These operations efficiently multiply a number by a positive or negative power of 2. Since 2 is an exact integer, the precision handling is similar to that of multiplication of floats by integers.

If the number {x,n} is nonzero, then only x changes by shifting but n does not change; if {x,n} is a floating zero, then x does not change and n is decremented (ShiftLeft) or incremented (ShiftRight) by the shift amount:
{x, n} << s = {x<<s, n};
{0.,n} << s = {0., n-s};
{x, n} >> s = {x>>s, n};
{0.,n} >> s = {0., n+s};


7.5 Implementation notes


Large exponents

The BigNumber API does not support large exponents for floating-point numbers. A floating-point number x is equivalent to two integers M, N such that x=M*2^N. Here M is the (denormalized) mantissa and N is the (binary) exponent. The integer M must be a "big integer" that may represent thousands of significant bits. But the exponent N is a platform signed integer (C++ type long), whose range covers at least -2^31 to 2^31; this allows a vastly larger exponent range than that of platform floating-point types. One would expect that this range of exponents is enough for most real-world applications. In the future this limitation may be relaxed if one uses a 64-bit platform. (A 64-bit platform seems to be a better choice for heavy-duty multiple-precision computations than a 32-bit platform.) However, code should not depend on having a 64-bit exponent range.

We could have implemented the exponent N as a big integer but this would be inefficient most of the time, slowing down the calculations. Arithmetic with floating-point numbers requires only very simple operations on their exponents (basically, addition and comparisons). These operations would be dominated by the overhead of dealing with big integers, compared with platform integers.

A known issue with limited exponents is the floating-point overflow and exponent underflow. (This is not the same underflow as with adding 1 to a very small number.) When the exponent becomes too large to be represented by a platform signed integer type, the code must signal an overflow error (e.g. if the exponent is above 2^31) or an underflow error (e.g. if the exponent is negative and below -2^31).


Library versions of mathematical functions

It is usually the case that a multiple-precision library implements some basic mathematical functions such as the square root. A library implementation may be already available and more efficient than an implementation using the API of the wrapper class BigNumber. In this case it is desirable to wrap the library implementation of the mathematical function, rather than use a suboptimal implementation. This could be done in two ways.

First, we recognize that we shall only have one particular numerical library linked with Yacas, and we do not have to compile our implementation of the square root if this library already contains a good implementation. We can use conditional compilation directives (#ifdef) to exclude our square root code and to insert a library wrapper instead. This scheme could be automated, so that appropriate #defines are automatically created for all functions that are already available in the given multiple-precision library, and the corresponding Yacas kernel code that uses the BigNumber API is automatically replaced by library wrappers.

Second, we might compile the library wrapper as a plugin, replacing the script-level square root function with a plugin-supplied function. This solution is easier in some ways because it doesn't require any changes to the Yacas core, only to the script library. However, the library wrapper will only be available to the Yacas scripts and not to the Yacas core functions. The basic assumption of the plugin architecture is that plugins can provide new external objects and functions to the scripts, but plugins cannot modify anything in the kernel. So plugins can replace a function defined in the scripts, but cannot replace a kernel function. Suppose that some other function, such as a computation of the elliptic integral which heavily uses the square root, were implemented in the core using the BigNumber API. Then it will not be able to use the square root function supplied by the plugin because it has been already compiled into the Yacas kernel.

Third, we might put all functions that use the basic API (MathSqrt, MathSin etc.) into the script library and not into the Yacas kernel. When Yacas is compiled with a particular numerical library, the functions available from the library will also be compiled as the kernel versions of MathSqrt, MathPower and so on (using conditional compilation or configured at build time). Since Yacas tries to call the kernel functions before the script library functions, the available kernel versions of MathSqrt etc. will supersede the script versions, but other functions such as BesselJ will be used from the script library. The only drawback of this scheme is that a plugin will not be able to use the faster versions of the functions, unless the plugin was compiled specifically with the requirement of the particular numerical library.

So it appears that either the first or the third solution is viable.


Converting from bits to digits and back

One task frequently needed by the arithmetic library is to convert a precision in (decimal) digits to binary bits and back. (We consider the decimal base to be specific; the same considerations apply to conversions between any other bases.) The kernel implements auxiliary routines bits_to_digits and digits_to_bits for this purpose.

Suppose that the mantissa of a floating-point number is known to d decimal digits. It means that the relative error is no more than 0.5*10^(-d). The mantissa is represented internally as a binary number. The number b of precise bits of mantissa should be determined from the equation 10^(-d)=2^(-b), which gives b=d*Ln(10)/Ln(2).

One potential problem with the conversions is that of incorrect rounding. It is impossible to represent d decimal digits by some exact number b of binary bits. Therefore the actual value of b must be a little different from the theoretical one. Then suppose we perform the inverse operation on b to obtain the corresponding number of precise decimal digits; there is a danger that we shall obtain a number d' that is different from d.

To avoid this danger, the following trick is used. The binary base 2 is the least of all possible bases, so successive powers of 2 are more frequent than successive powers of 10 or of any other base. Therefore for any power of 10 there will be a unique power of 2 that is the first one above it.

The recipe to obtain this power of 2 is simple: one should round d*Ln(10)/Ln(2) upwards using the Ceil function, but b*Ln(2)/Ln(10) should be rounded downwards using the Floor function.

This procedure will make sure that the number of bits b is high enough to represent all information in the d decimal digits; at the same time, the number d will be correctly restored from b. So when a user requests d decimal digits of precision, Yacas may simply compute the corresponding value of b and store it. The precision of b digits is enough to hold the required information, and the precision d can be easily computed given b.
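A sketch of the two conversions following this recipe for the decimal base (the names echo the kernel helpers mentioned above, but the signatures and bodies here are only illustrative):

#include <cmath>

long digits_to_bits(long d)
{
    return (long)std::ceil(d * 3.3219280948873623);     // d * Ln(10)/Ln(2), rounded upwards
}

long bits_to_digits(long b)
{
    return (long)std::floor(b * 0.30102999566398120);   // b * Ln(2)/Ln(10), rounded downwards
}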


The internal storage of BigNumber objects

An object of type BigNumber represents a number (and contains all information relevant to the number), and offers an interface to operations on it, dispatching the operations to an underlying arbitrary precision arithmetic library.

Higher up, Yacas only knows about objects derived from LispObject. Specifically, there are objects of class LispAtom which represent an atom.

Symbolic and string atoms are uniquely represented by the result returned by the String() method. For number atoms, there is a separate class, LispNumber. Objects of class LispNumber also have a String() method in case a string representation of a number is needed, but the main uniquely identifying piece of information is the object of class BigNumber stored inside a LispNumber object. This object is accessed using the Number() method of class LispNumber.

The life cycle of a LispNumber is as follows:

In order to fully support the LispNumber object, the function in the kernel that determines if two objects are the same needs to know about LispNumber. This is required to get valid behaviour. Pattern matching for instance uses comparisons of this type, so comparisons are performed often and need to be efficient.

The other functions working on numbers can, in principle, call the String() method, but that induces conversions from BigNumber to string, which are relatively expensive operations. For efficiency reasons, the functions dealing with numeric input should call the Number() method, operate on the BigNumber returned, and return a LispNumber constructed with a BigNumber. A function can call String() and return a LispNumber constructed with a string representation, but it will be less efficient.

Precision tracking inside LispNumber
There are various subtle details when dealing with precision. A number gets constructed with a certain precision, but a higher precision might be needed later on. That is the reason there is the aPrecision argument to the Number() method.

When a BigNumber is constructed from a decimal string, one has to specify a desired precision (in decimal digits). Internally, BigNumber objects store numbers in binary and will allocate enough bits to cover the desired precision. However, if the given string has more digits than the given precision, the BigNumber object will not truncate it but will allocate more bits so that the information given in the decimal string is not lost. If later the string representation of the BigNumber object is requested, the produced string will match the string from which the BigNumber object was created.

Internally, the BigNumber object knows how many precise bits it has. The number of precise digits might be greater than the currently requested precision. But a truncation of precision will only occur when performing arithmetic operations. This behavior is desired, for example:

In> Builtin'Precision'Set(6)
Out> True;
In> x:=1.23456789
Out> 1.23456789;
In> x+1.111
Out> 2.345567;
In> x
Out> 1.23456789;
In this example, we would like to keep all information we have about x and not truncate it to 6 digits. But when we add a number to x, the result is only precise to 6 digits.

This behavior is implemented by storing the string representation "1.23456789" in the LispNumber object x. When an arithmetic calculation such as x+1.111 is requested, the Number method is called on x. This method, when called for the first time, converts the string representation into a BigNumber object. That BigNumber object will have 28 bits to cover the 9 significant digits of the number, not the 19 bits normally required for 6 decimal digits of precision. But the result of an arithmetic calculation is not computed with more than 6 decimal digits. Later when x needs to be printed, the full string representation is available so it is printed.

If now we increase precision to 20 digits, the object x will be interpreted as 1.23456789 with 12 zeros at the end.

In> Builtin'Precision'Set(20)
Out> True;
In> x+0.000000000001
Out> 1.234567890001;
This behavior is more intuitive to people who are used to decimal fractions.