#### The discovery of decimal arithmetic

The discovery of decimal arithmetic in ancient India, together with the well-known schemes for long multiplication and long division, surely must rank as one of the most important discoveries in the history of science. The date of this discovery, by an unknown Indian mathematician or group of mathematicians, was recently pushed back to the third century CE, based on the recent dating of the Bakhshali manuscript, but it probably happened earlier, perhaps around 0 CE.

#### Arithmetic on modern computers

Computers, of course, do not use decimal arithmetic in computation (except input and output for human eyes). Instead they use binary arithmetic, either in an integer format, where the bits in a computer word represent a binary integer (with perhaps one bit as a sign), or else in floating-point format, where part of the computer word represents the data and part represents an exponent — the binary equivalent of writing a value in scientific notation, such as $1.2345 \times 10^{67}$. A widely-used 64-bit integer format can represent integers up to roughly $4.5 \times 10^{18}$, whereas a widely-used 64-bit floating-point format can represent numbers from about $10^{-308}$ to $10^{308}$, with nearly 16-significant-digit accuracy. Arithmetic on such numbers is typically done in computer hardware using binary variations of the basic schemes. For additional details on these formats, see this Wikipedia article.

For some scientific and engineering applications, however, even 16-digit accuracy is not sufficient, and so such applications rely on “multiprecision” arithmetic — software extensions to the standard hardware arithmetic operations. Cryptography, which is typically performed numerous times each day in one’s smartphone or laptop, requires computations to be done on integers as large as several thousand digits. Some computations in pure mathematics and mathematical physics research require extremely high-precision floating-point arithmetic — tens of thousands of digits in some cases (see for example this paper). Researchers exploring properties of the binary and decimal expansions of numbers such as $\pi$ have computed with millions, billions or even trillions of digits — the most recent record computation of $\pi$ was to 31 trillion digits (actually 31,415,926,535,897 decimal digits, which as a point of interest, happens to be the first 14 digits in the decimal expansion of $\pi$).

#### Karatsuba’s algorithm for multiplication

It was widely believed, even as late as the 1950s, that there was no fundamentally faster way of multiplying large numbers than the ancient scheme we learned in school, or a minor variation such as performing the operations in base $2^{32}$ rather than base ten. Then in 1960, the young Russian mathematician Anatoly Karatsuba found a clever technique for performing very high precision multiplication. By dividing each of the inputs into two substrings, he could find the product, as least to the precision often needed, with only three substring multiplications instead of four. By continuing recursively to break down the input numbers in this fashion, he obtained a scheme whose cost increased only by the approximate factor $n^{1.58}$ instead of $n^2$ with the ordinary scheme, where $n$ is the number of computer words. But this was soon overshadowed by a scheme first formalized by Schonhage and Strassen, which we now describe.

#### The fast Fourier transform and multiplication

A major breakthrough in performing such computations came with the realization that the fast Fourier transform (FFT), which was originally developed to process signal data, could be used to dramatically accelerate high-precision multiplication. The fast Fourier transform is merely a clever computer algorithm to rapidly calculate the frequency spectrum of a string of data interpreted as a signal. In effect, one’s ear performs a Fourier transform (in analog, not in digital) when it distinguishes the pitch contour of a musical note or a person’s voice.

The idea for FFT-based multiplication is, first of all, to represent a very high precision number as a string of computer words, each containing, say, 32 successive bits of its binary expansion (i.e., each entry is an integer between $0$ and $2^{32} – 1$). Then two such multiprecision numbers can be multiplied by observing that their product computed the old-fashioned way (in base $2^{32}$ instead of base ten) is nothing more than the “linear convolution” of the two strings of computer words, which convolution can be performed efficiently using an FFT. In particular, the FFT-based scheme for multiplying two high-precision numbers $A = (a_0, a_1, a_2, \cdots, a_{n-1})$ and $B = (b_0, b_1, b_2, \cdots, b_{n-1})$ is the following:

- Extend each of the $n$-long input strings $A$ and $B$ to length $2n$ by padding with zeroes.
- Perform an FFT on each of the extended $A$ and $B$ strings to produce $2n$-long complex strings denoted $F(A)$ and $F(B)$.
- Multiply the strings $F(A)$ and $F(B)$ together, term-by-term, as complex numbers.
- Perform an inverse FFT on the resulting $2n$-long string to yield a $2n$-long string $C$, i.e., $C = F^{-1}[F(A) \cdot F(B)]$. Round each entry of $C$ to the nearest integer.
- Starting at the end of $C$, release carries in each term, i.e., leave only the final 32 bits in each word, with the higher-order bits added to the previous word or words as necessary.
- The result $D$ is the product of $A$ and $B$, represented as a $2n$-long string of 32-bit integers.

Note, however, that several important details were omitted here. For instance, the product of two 32-bit numbers can be computed exactly using a 64-bit hardware multiply operation, but numerous such 64-bit values must be added together when computing an FFT, requiring somewhat more than 64-bit precision. Also, for optimal performance it is important to take advantage of the fact that both input strings $A$ and $B$ are real numbers rather than complex, and that the final inverse FFT returns real numbers in $C$. Finally, the outline above omitted mention on how to manage roundoff error when performing these operations. One remedy is to divide the data into even smaller chunks, say only 16 or 24 bits per word; another remedy is perform these computations not in the field of complex numbers using floating-point arithmetic, but instead in the field of integers modulo a prime number, so that roundoff error is not a factor. For additional details, see this paper.

With a fast scheme such as FFT-based multiplication in hand, division can be performed by Newton iterations, with a level of precision that approximately doubles with each iteration. This scheme reduces the cost of high-precision division to only somewhat more than twice that of high-precision multiplication. A similar scheme can be used to compute high-precision square roots. See this paper for details.

Schonhage and Strassen, in carefully analyzing a certain variation of the FFT-based multiplication scheme, found that for large $n$ (where $n$ is the number of computer words) the total computational cost, in terms of hardware operations, scales as $n \cdot \log(n) \cdot \log(\log(n))$. In practice, the FFT-based scheme is extremely fast for high-precision multiplication, since one often can utilize highly tuned library routines to perform the FFT operations.

#### An n log(n) algorithm for multiplication

Ever since the Schoenhage-Strassen scheme was discovered and formalized, researchers have wondered if this was truly the end of the road. For example, can we remove the final $\log(\log(n))$ factor and just have $n \cdot \log(n)$ asymptotic complexity?

Well that day has now arrived. In a brilliant new paper by David Harvey of the University of New South Wales, Australia, and Joris van der Hoeven of the French National Centre for Scientific Research, France, the authors have defined a new algorithm with exactly that property.

The Harvey-van der Hoeven paper approaches the problem by splitting the problem into smaller problems, applying the FFT multiple times, and replacing more multiplications with additions and subtractions, all in a very clever scheme that eliminates the nagging $\log(\log(n))$ factor. The authors note that their result does not prove that there is no algorithm even faster than theirs. So maybe an even faster one will be found. Or not.

Either way, don’t put away your multiplication tables quite yet. The Harvey-van der Hoeven scheme is only superior to FFT-based multiplication for exceedingly high precision. But it is an important theoretical result that has wide-ranging implications in the field of computational complexity.

For some additional details, see this well-written Quanta article by Kevin Hartnett.