#### Computational Physics Basics: Piecewise and Linear Interpolation

Posted 24th February 2022 by Holger Schmitz

One of the main challenges of computational physics is the problem of representing continuous functions in time and space using the finite resources supplied by the computer. A mathematical function of one or more continuous variables naturally has an infinite number of degrees of freedom. These need to be reduced in some manner to be stored in the finite memory available. Maybe the most intuitive way of achieving this goal is by sampling the function at a discrete set of points. We can store the values of the function as a lookup table in memory. It is then straightforward to retrieve the values at the sampling points. However, in many cases, the function values at arbitrary points between the sampling points are needed. It is then necessary to interpolate the function from the given data.

Apart from the interpolation problem, the pointwise discretisation of a function raises another problem. In some cases, the domain over which the function is required is not known in advance. The computer only stores a finite set of points and these points can cover only a finite domain. Extrapolation can be used if the asymptotic behaviour of the function is known. Also, clever spacing of the sample points or transformations of the domain can aid in improving the accuracy of the interpolated and extrapolated function values.

In this post, I will be talking about the interpolation of functions in a single variable. Functions with a higher-dimensional domain will be the subject of a future post.

### Functions of a single variable

A function of a single variable, \(f(x)\), can be discretised by specifying the function values at sample locations \(x_i\), where \(i=1 \ldots N\). For now, we don’t require these locations to be evenly spaced but I will assume that they are sorted. This means that \(x_i < x_{i+1}\) for all \(i\). Let’s define the function values, \(y_i\), as \[

y_i = f(x_i).

\] The intuitive idea behind this discretisation is that the function values can be thought of as a number of measurements. The \(y_i\) provide incomplete information about the function. To reconstruct the function over a continuous domain an interpolation scheme needs to be specified.

#### Piecewise Constant Interpolation

The simplest interpolation scheme is the piecewise constant interpolation, also known as the nearest neighbour interpolation. Given a location \(x\) the goal is to find a value of \(i\) such that \[

|x-x_i| \le |x-x_j| \quad \text{for all} \quad j\ne i.

\] In other words, \(x_i\) is the sample location that is closest to \(x\) when compared to the other sample locations. Then, define the interpolation function \(p_0\) as \[

p_0(x) = f(x_i)

\] with \(x_i\) as defined above. The value of the interpolation is simply the value of the sampled function at the sample point closest to \(x\).

The left plot in the figure above shows some smooth function in black and a number of sample points. The case where 10 sample points are taken is shown by the diamonds and the case for 20 sample points is shown by the circles. Also shown are the nearest neighbour interpolations for these two cases. The red curve shows the interpolated function for 10 samples and the blue curve is for the case of 20 samples. The right plot in the figure shows the difference between the original function and the interpolations. Again, the red curve is for the case of 10 samples and the blue curve is for the case of 20 samples. We can see that the piecewise constant interpolation is crude and the errors are quite large.

As expected, the error is smaller when the number of samples is increased. To analyse exactly how big the error is, consider the residual for the zero-order interpolation \[

R_0(x) = f(x) – p_0(x) = f(x) – f(x_i).

\] The first step to analyse the magnitude of the residual is to perform a Taylor expansion of the residual around the point \(x_i\). We only need the zero order term. Using Taylor’s Theorem and the Cauchy form of the remainder, one can write \[

R_0(x) = \left[ f(x_i) + f'(\xi_c)(x – x_i)\right] – f(x_i).

\] The term in the brackets is the Taylor expansion of \(f(x)\), and \(\xi_c\) is some value that lies between \(x_i\) and \(x\) and depends on the value of \(x\). Let’s define the distance between two samples with \(h=x_{i+1}-x_i\). Assume for the moment that all samples are equidistant. It is not difficult to generalise the arguments for the case when the support points are not equidistant. This means, the maximum value of \(x – x_i\) is half of the distance between two samples, i.e. \[

x – x_i \le \frac{h}{2}.

\] It os also clear that \(f'(\xi_c) \le |f'(x)|_{\mathrm{max}}\), where the maximum is over the interval \(|x-x_i| \le h/2\). The final result for an estimate of the residual error is \[

|R_0(x)| \le\frac{h}{2} |f'(x)|_{\mathrm{max}}

\]

#### Linear Interpolation

As we saw above, the piecewise interpolation is easy to implement but the errors can be quite large. Most of the time, linear interpolation is a much better alternative. For functions of a single argument, as we are considering here, the computational expense is not much higher than the piecewise interpolation but the resulting accuracy is much better. Given a location \(x\), first find \(i\) such that \[

x_i \le x < x_{i+1}.

\] Then the linear interpolation function \(p_1\) can be defined as \[

p_1(x) = \frac{x_{i+1} – x}{x_{i+1} – x_i} f(x_i)

+ \frac{x – x_i}{x_{i+1} – x_i} f(x_{i+1}).

\] The function \(p_1\) at a point \(x\) can be viewed as a weighted average of the original function values at the neighbouring points \(x_i\) and \(x_{i+1}\). It can be easily seen that \(p(x_i) = f(x_i)\) for all \(i\), i.e. the interpolation goes through the sample points exactly.

The left plot in the figure above shows the same function \(f(x)\) as the figure in the previous section but now together with the linear interpolations for 10 samples (red curve) and 20 samples (blue curve). One can immediately see that the linear interpolation resembles the original function much more closely. The right plot shows the error for the two interpolations. The error is much smaller when compared to the error for the piecewise interpolation. For the 10 sample interpolation, the maximum absolute error of the linear interpolation is about 0.45 compared to a value of over 1.5 for the nearest neighbour interpolation. What’s more, going from 10 to 20 samples improves the error substantially.

One can again try to quantify the error of the linear approximation using Taylor’s Theorem. The first step is to use the Mean Value Theorem that states that there is a point \(x_c\) between \(x_i\) and \(x_{i+1}\) that satisfies \[

f'(x_c) = \frac{ f(x_{i+1}) – f(x_i) }{ x_{i+1} – x_i }.

\] Consider now the error of the linear approximation, \[

R_1(x) = f(x) – p_1(x) = f(x) – \left[\frac{x_{i+1} – x}{x_{i+1} – x_i} f(x_i)

+ \frac{x – x_i}{x_{i+1} – x_i} f(x_{i+1})\right].

\] The derivative of the error is \[

R’_1(x) = f'(x) – \frac{ f(x_{i+1}) – f(x_i) }{ x_{i+1} – x_i }.

\] The Mean Value Theorem implies that the derivative of the error at \(x_c\) is zero and the error is at its maximum at that point. In other words, to estimate the maximum error, we only need to find an upper bound of \(|R(x_c)|\).

We now perform a Taylor expansion of the error around \(x_c\). Using again the Cauchy form of the remainder, we find \[

R(x) = R(x_c) + xR'(x_c) + \frac{1}{2}R’^\prime(\xi_c)(x-\xi_c)(x-x_c).

\] The second term on the right hand side is zero by construction, and we have \[

R(x) = R(x_c) + \frac{1}{2}R’^\prime(\xi_c)(x-\xi_c)(x-x_c).

\] Let \(h\) again denote the distance between the two points, \(h=x_{i+1} – x_i\). We assume that \(x_c – x_i < h/2\) and use the equation above to calculate \(R(x_i)\) which we know is zero. If \(x_c\) was closer to \(x_{i+1}\) we would have to calculate \(R(x_{i+1})\) but otherwise the argument would remain the same. So, \[

R(x_i) = 0 = R(x_c) + \frac{1}{2}R’^\prime(\xi_c)(x_i-\xi_c)(x_i-x_c)

\] from which we get \[

|R(x_c)| = \frac{1}{2}|R’^\prime(\xi_c)(x_i-\xi_c)(x_i-x_c)|.

\] To get an upper estimate of the remainder that does not depend on \(x_c\) or \(\xi_c\) we can use the fact that both \(x_i-\xi_c \le h/2\) and \(x_i-x_c \le h/2\). We also know that \(|R(x)| \le |R(x_c)|\) over the interval from \(x_i\) to \(x_{i+1}\) and \(|R’^\prime(\xi_c)| = |f’^\prime(\xi_c)| \le |f’^\prime(x)|_{\mathrm{max}}\). Given all this, we end up with \[

|R(x)| \le \frac{h^2}{8}|f’^\prime(x)|_{\mathrm{max}}.

\]

The error of the linear interpolation scales with \(h^2\), in contrast to \(h\) for the piecewise constant interpolation. This means that increasing the number of samples gives us much more profit in terms of accuracy. Linear interpolation is often the method of choice because of its relative simplicity combined with reasonable accuracy. In a future post, I will be looking at higher-order interpolations. These higher-order schemes will scale even better with the number of samples but this improvement comes at a cost. We will see that the price to be paid is not only a higher computational expense but also the introduction of spurious oscillations that are not present in the original data.

#### Computational Physics: Truncation and Rounding Errors

Posted 15th October 2021 by Holger Schmitz

In a previous post, I talked about accuracy and precision in numerical calculations. Ideally one would like to perform calculations that are perfect in these two aspects. However, this is almost never possible in practical situations. The reduction of accuracy or precision is due to two numerical errors. These errors can be classified into two main groups, round-off errors and truncation errors.

#### Rounding Error

Round-off errors occur due to the limits of numerical precision at which numbers are stored in the computer. As I discussed here a 32-bit floating-point number for example can only store up to 7 or 8 decimal digits. Not just the final result but every intermediate result of a calculation will be rounded to this precision. In some cases, this can result in a much lower precision of the final result. One instance where round-off errors can become a problem happens when the result of a calculation is given by the difference of two large numbers.

#### Truncation Error

Truncation errors occur because of approximations the numerical scheme makes with respect to the underlying model. The name truncation error stems from the fact that in most schemes the underlying model is first expressed as an infinite series which is then truncated allowing it to be calculated on a computer.

### Example: Approximating Pi

Let’s start with a simple task. Use a series to approximate the value of \(\pi\).

#### Naive summation

One of the traditional ways of calculating \(\pi\) is by using the \(\arctan\) function together with the identity \[

\arctan(1) = \frac{\pi}{4}.

\] One can expand \(\arctan\) into its Taylor series, \[

\arctan(x)

= x – \frac{x^3}{3} +\frac{x^5}{5} – \frac{x^7}{7} + \ldots

= \sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{2n+1}.

\] The terms of the series become increasingly smaller and you could try to add up all the terms up to some maximum \(N\) in the hope that the remaining infinite sum is small and can be neglected. Inserting \(x=1\) into the sum will give you an approximation for \(\pi\), \[

\pi \approx 4\sum_{n=0}^N \frac{(-1)^n }{2n+1}.

\]

Here are e implementations for this approximation in C++, Python and JavaScript.

**C++**

```
double pi_summation_slow(int N) {
double sum = 0.0;
int sign = 1;
for (int i=0; i<N; ++i) {
sum += sign/(2*i + 1.0);
sign = -sign;
}
return 4*sum;
}
```

**Python**

```
def pi_summation_slow(N):
sum = 0.0
sign = 1
for i in range(0,N):
sum = sum + sign/(2*i + 1.0)
sign = -sign
return 4*sum
```

**JavaScript**

```
function pi_summation_slow(N) {
let sum = 0.0;
let sign = 1;
for (let i=0; i<N; ++i) {
sum += sign/(2*i + 1.0);
sign = -sign;
}
return 4*sum;
}
```

Let’s call this function with \(N=10\). All the results I am showing here are calculated using a Python implementation. We get a result of around 3.0418. The relative error is 0.0318 and is, of course, unacceptable. This error falls under the category of truncation errors because it is caused by not summing up enough terms of the Taylor series. Calling the function with \(N=1000\) gives us a result of 3.14059 with a relative error of \(3.183\times 10^{-4}\). The error has improved but is still far off from the possible \(10^{-14}\) to \(10^{-15}\) achievable in double-precision arithmetic. The figure below shows how the relative error decreases with the number of iterations.

From this curve, one h long m wl hat the error decreases with \(1/N\). If one extrapolates the curve, one finds that it would take \(10^{14}\) iterations to reach an error below \(10^{-14}\). Even if this was computationally feasible, the round-off errors of such a long sum would eventually prevent the error from being lowered to this limit.

#### Improvements using Machin’s formula

The technique of calculating \(\pi\) can be improved in two ways. Firstly, instead of using the Taylor series, you can use Euler’s series for the \(\arctan\) function.

\[

\arctan(x) = \sum_{n=0}^\infty \frac{2^{2n} (n!)^2}{(2n + 1)!} \frac{x^{2n + 1}}{(1 + x^2)^{n + 1}}.

\]

This series converges much more quickly than the Taylor series. The other way to improve convergence is to use trigonometric identities to come up with formulas that converge more quickly. One of the classic equations is the Machin formula for \(\pi\), first discovered by John Machin in 1706, \[

\frac{\pi}{4} = 4 \arctan \frac{1}{5} – \arctan \frac{1}{239}

\] Here are the implementations for this formula.

**C++**

```
double pi_summation_fast(int order) {
using boost::math::factorial;
double sum = 0.0;
for (unsigned int n=0; n<order; ++n) {
double f = factorial<double>(n);
double common = pow(2.0, 2*n)*f*f/factorial<double>(2*n + 1);
double A = pow(25./26., n+1)/pow(5., 2*n+1);
double B = pow(239.*239. / (239.*239. + 1.), n+1)/pow(239., 2*n+1);
sum += common*( 4*A - B );
}
return 4*sum;
}
```

**Python**

```
def pi_summation_fast(N):
sum = 0.0
for n in range(0,N):
f = factorial(n)
common = math.pow(2.0, 2*n)*f*f/factorial(2*n + 1)
A = math.pow(25/26, n+1)/math.pow(5, 2*n+1)
B = math.pow(239*239 / (239*239 + 1), n+1)/math.pow(239, 2*n+1)
sum = sum + common*( 4*A - B )
return 4*sum;
```

**JavaScript**

```
function pi_summation_fast(N) {
let sum = 0.0;
for (let n=0; n<N; ++n) {
const f = factorial(n);
const common = Math.pow(2.0, 2*n)*f*f/factorial(2*n + 1);
const A = pow(25/26, n+1)/pow(5, 2*n+1);
const B = pow(239*239 / (239*239 + 1), n+1)/pow(239, 2*n+1);
sum += common*( 4*A - B );
}
return 4*sum;
}
```

The table below shows the computed values for \(\pi\) together with the relative error. You can see that each iteration reduces the error by more than an order of magnitude and only a few iterations are necessary to achieve machine precision accuracy.

N | \(S_N\) | error |
---|---|---|

1 | 3.060186968243409 | 0.02591223443732105 |

2 | 3.139082236428362 | 0.0007990906009289966 |

3 | 3.141509789149037 | 2.6376570705797483e-05 |

4 | 3.141589818359699 | 9.024817686074192e-07 |

5 | 3.141592554401089 | 3.157274505454055e-08 |

6 | 3.141592650066872 | 1.1213806035463463e-09 |

7 | 3.1415926534632903 | 4.0267094489200705e-11 |

8 | 3.1415926535852132 | 1.4578249079970333e-12 |

9 | 3.1415926535896266 | 5.3009244691058615e-14 |

10 | 3.1415926535897873 | 1.8376538159566985e-15 |

11 | 3.141592653589793 | 0.0 |

### Example: Calculating sin(x)

Calculate the value of \(\sin(x)\) using it’s Taylor series around x=0.

The Taylor series for \(\sin(x)\) is \[

\sin x = \sum_{n=0}^\infty \frac{(-1)^n}{(2n+1)!}x^{2n+1}.

\] this series is much more well-behaved than the Taylor series for \(\arctan\) we saw above. Because of the factorial in the denominator, the individual terms of this series will converge reasonably quickly. Here are some naive implementations of this function where the infinite sum has been replaced by a sum from zero to \(N\).

**C++**

```
double taylor_sin(double x, int order)
{
using boost::math::factorial;
double sum = 0.0;
int sign = 1;
for (unsigned int n=0; n<order; ++n)
{
sum += sign*pow(x, 2*n + 1)/factorial<double>(2*n +1);
sign = -sign;
}
return sum;
}
```

**Python**

```
def taylor_sin(x, N):
sum = 0.0
sign = 1
for n in range(0,N):
sum = sum + sign*math.pow(x, 2*n + 1)/factorial(2*n + 1)
sign = -sign
return sum
```

**JavaScript**

```
function taylor_sin(x, N) {
let sum = 0.0;
let sign = 1;
for (let n=0; n<N; n++) {
sum += sign*pow(x, 2*n + 1)/factorial(2*n +1);
sign = -sign;
}
return sum;
}
```

A good test for this function is the evaluation of \(\sin(x)\) at values \(x = k\pi\), where \(k\) is an integer. We know that \(\sin(k\pi) = 0\) and the return value from the numeric function can directly be used as the absolute error of the computation. The figure below shows results for some values of \(k\) plotted against \(N\).

For small values of \(k\), this series converges relatively quickly. But for larger \(k\) you can see that more and more terms are needed. The error even grows first before being reduced. Just like the example above, the truncation error requires large values of \(N\) to reach a good accuracy of the result. In practice, you would not calculate the \(\sin\) function this way. Instead you would make use of known properties, such as \(\sin(2k\pi + x) = \sin(x)\) for integer \(k\), to transform the argument into a range where fast convergence is guaranteed.

However, I would like to continue my analysis of this function because it shows two more interesting pitfalls when performing long sums. First, you will notice that the curves in the figure above show dashed lines for \(N>85\). This is because the implementation I showed above will actually fail with a range error. The `pow`

function and the factorial both start producing numbers that exceed the valid range of `double`

floating-point numbers. The quotient of the two, on the other hand, remains well-behaved. It is, therefore, better to write the Taylor series using a recursive definition of the terms.

\[

\sin x = \sum_{n=0}^\infty a_n(x),

\] with \[

a_0 = x

\] and \[

a_{n} = -\frac{x^2}{2n(2n+1)}a_{n-1}

\]

The implementations are given again below.

**C++**

```
double taylor_sin_opt(double x, int order)
{
double sum = x;
double an = x;
for (unsigned int n=1; n<order; ++n)
{
an = -x*x*an/(2*n*(2*n+1));
sum += an;
}
return sum;
}
```

**Python**

```
def taylor_sin_opt(x, N):
sum = x
an = x
for n in range(1,N):
an = -x*x*an/(2*n*(2*n+1))
sum = sum + an
return sum
```

**JavaScript**

```
function taylor_sin_opt(x, N) {
let sum = x;
let an = x;
for (let n=1; n<N; n++) {
an = -x*x*an/(2*n*(2*n+1));
sum += an;
}
return sum;
}
```

The other takeaway from the graphs of the errors is that they don’t always converge to machine accuracy. The reason for this originates from fact that the initial terms of the sum can be quite large but with opposite signs. They should cancel each other out exactly, but they don’t because of numerical round-off errors.

#### Multi Slit Interference

Posted 11th September 2021 by Holger Schmitz

This week I have been playing around with my electromagnetic wave simulation code MPulse. The code simulates Maxwell’s equations using an algorithm known as finite-difference time-domain, FDTD for short. Here, I am simulating the wave interference pattern behind the screen with n slits. For $n=1$ there is only one slit and no interference at all. The slit is less than the wavelength wide so you can see the wave spreading out at a large angle. When $n>1$ the slits are 5 wavelengths apart. With $n=2$ you have the classical double-slit experiment. At the right side of the simulation box, the well-known interference pattern can be observed. For 3 slits the pattern close to the slits becomes more complicated. It settles down to a stable pattern much further to the right. For numbers 4, 5 and 6 the patterns become more and more interesting. You would expect the development of sharper lines but it turns out that, for these larger numbers, the simulation box just wasn’t big enough to see the final far-field pattern.

MPulse is open-source and is available at https://github.com/holgerschmitz/MPulse.

The code uses MPI parallelization and was run on 24 cores for this simulation.

MPulse is a work in progress and any help from enthusiastic developers is greatly appreciated.

#### Computational Physics Basics: Accuracy and Precision

Posted 24th August 2021 by Holger Schmitz

Problems in physics almost always require us to solve mathematical equations with real-valued solutions, and more often than not we want to find functional dependencies of some quantity of a real-valued domain. Numerical solutions to these problems will only ever be approximations to the exact solutions. When a numerical outcome of the calculation is obtained it is important to be able to quantify to what extent it represents the answer that was sought. Two measures of quality are often used to describe numerical solutions: accuracy and precision. Accuracy tells us how will a result agrees with the true value and precision tells us how reproducible the result is. In the standard use of these terms, accuracy and precision are independent of each other.

#### Accuracy

Accuracy refers to the degree to which the outcome of a calculation or measurement agrees with the true value. The technical definition of accuracy can be a little confusing because it is somewhat different from the everyday use of the word. Consider a measurement that can be carried out many times. A high accuracy implies that, on average, the measured value will be close to the true value. It does not mean that each individual measurement is near the true value. There can be a lot of spread in the measurements. But if we only perform the measurement often enough, we can obtain a reliable outcome.

#### Precision

Precision refers to the degree to which multiple measurements agree with each other. The term precision in this sense is orthogonal to the notion of accuracy. When carrying out a measurement many times high precision implies that the outcomes will have a small spread. The measurements will be reliable in the sense that they are similar. But they don’t necessarily have to reflect the true value of whatever is being measured.

#### Accuracy vs Precision

To fully grasp the concept of accuracy vs precision it is helpful to look at these two plots. The crosses represent measurements whereas the line represents the true value. In the plot above, the measurements are spread out but they all lie around the true value. These measurements can be said to have low precision but high accuracy. In the plot below, all measurements agree with each other, but they do not represent the true value. In this case, we have high precision but low accuracy.

A moral can be gained from this: just because you always get the same answer doesn’t mean the answer is correct.

When thinking about numerical methods you might object that calculations are deterministic. Therefore the outcome of repeating a calculation will always be the same. But there is a large class of algorithms that are not quite so deterministic. They might depend on an initial guess or even explicitly on some sequence of pseudo-random numbers. In these cases, repeating the calculation with a different guess or random sequence will lead to a different result.

#### Computational Physics Basics: Floating Point Numbers

Posted 25th May 2021 by Holger Schmitz

In a previous contribution, I have shown you that computers are naturally suited to store finite length integer numbers. Most quantities in physics, on the other hand, are real numbers. Computers can store real numbers only with finite precision. Like storing integers, each representation of a real number is stored in a finite number of bits. Two aspects need to be considered. The precision of the stored number is the number of significant decimal places that can be represented. Higher precision means that the error of the representation can be made smaller. Bur precision is not the only aspect that needs consideration. Often, physical quantities can be very large or very small. The electron charge in SI units, for example, is roughly $1.602\times10^{-19}$C. Using a fixed point decimal format to represent this number would require a large number of unnecessary zeros to be stored. Therefore, the *range* of numbers that can be represented is also important.

In the decimal system, we already have a notation that can capture very large and very small numbers and I have used it to write down the electron charge in the example above. The scientific notation writes a number as a product of a mantissa and a power of 10. The value of the electron charge (without units) is written as

$$1.602\times10^{-19}.$$

Here 1.602 is the mantissa (or the significand) and -19 is the exponent. The general form is

$$m\times 10^n.$$

The mantissa, $m$, will always be between 1 and 10 and the exponent, $n$, has to be chosen accordingly. This format can straight away be translated into the binary system. Here, any number can be written as

$$m\times2^n,$$

with $1\le m<2$. Both $m$ and $n$ can be stored in binary form.

#### Memory layout of floating-point numbers, the IEEE 754 standard

In most modern computers, numbers are stored using 64 bits but some architectures, like smartphones, might only use 32 bits. For a given number of bits, a decision has to be made on how many bits should be used to store the mantissa and how many bits should be used for the exponent. The IEEE 754 standard sets out the memory layout for various size floating-point representations and almost all hardware supports these specifications. The following table shows the number of bits for mantissa and exponent for some IEEE 754 number formats.

Bits | Name | Sign bit | Mantissa bits, m | Exponent bits, p | Exponent bias | Decimal digits |
---|---|---|---|---|---|---|

16 | half-precision | 1 | 10 | 5 | 15 | 3.31 |

32 | single precision | 1 | 23 | 8 | 127 | 7.22 |

64 | double precision | 1 | 52 | 11 | 1023 | 15.95 |

128 | quadruple precision | 1 | 112 | 15 | 16383 | 34.02 |

The layout of the bits is as follows. The first, most significant bit represents the sign of the number. A 0 indicates a positive number and a 1 indicates a negative number. The next $p$ bits store the exponent. The exponent is not stored as a signed integer, but as an unsigned integer with offset. This offset, or bias, is chosen to be $2^p – 1$ so that a leading zero followed by all ones corresponds to an exponent of 0.

The remaining bits store the mantissa. The mantissa is always between 1 and less than 2. This means that, in binary, the leading bit is always equal to one and doesn’t need to be stored. The $m$ bits, therefore, only store the fractional part of the mantissa. This allows for one extra bit to improve the precision of the number.

**Example**

#### Infinity and NaN

The IEEE 754 standard defines special numbers that represent infinity and the not-a-number state. Infinity is used to show that a result of a computation has exceeded the allowed range. It can also result from a division by zero. Infinity is represented by the maximum exponent, i.e. all $p$ bits of the exponent are set to 1. In addition, the $m$ bits of the mantissa are set to 0. The sign bit is still used for infinity. This means it is possible to store a `+Inf`

and a `-Inf`

value.

**Example**

The special state `NaN`

is used to store results that are not defined or can’t otherwise be represented. For example, the operation $\sqrt{-1}$ will result in a not-a-number state. Similar to infinity, it is represented by setting the $p$ exponent bits to 1. To distinguish it from infinity, the mantissa can have any non-zero value.

#### Subnormal Numbers

As stated above, all numbers in the regular range will be represented by a mantissa between 1 and 2 so that the leading bit is always 1. Numbers very close to zero will have a small exponent value. Once the exponent is exactly zero, it is better to explicitly store all bits of the mantissa and allow the first bit to be zero. This allows even smaller numbers to be represented than would otherwise be possible. Extending the range in this way comes at the cost of reduced precision of the stored number.

**Example**

#### Floating Point Numbers in Python, C++, and JavaScript

Both Python and JavaScript exclusively store floating-point numbers using 64-bit precision. In fact, in JavaScript, **all** numbers are stored as 64-bit floating-point, even integers. This is the reason for the fact that integers in JavaScript only have 53 bits. They are stored in the mantissa of the 64-bit floating-point number.

C++ offers a choice of different precisions

Type | Alternative Name | Number of Bits |
---|---|---|

`float` |
single precision | usually 32 bits |

`double` |
double precision | usually 64 bits |

`long double` |
extended precision | architecture-dependent, not IEEE 754, usually 80 bits |

#### Spherical Blast Wave Simulation

Posted 2nd December 2020 by Holger Schmitz

Here is another animation that I made using the Vellamo fluid code. It shows two very simple simulations of spherical blast waves. The first simulation has been carried out in two dimensions. The second one shows a very similar setup but in three dimensions.

You might have seen some youtube videos on blast waves and dimensional analysis on the Sixty Symbols channel or on Numberphile. The criterion for the dimensional analysis given in those videos is true for strong blast waves. The simulation that I carried out, looks at the later stage of these waves when the energy peters out and the strong shock is replaced by a compression wave that travels at a constant velocity. You can still see some of the self-similar behaviour of the Sedov-Taylor solution during the very early stages of the explosion. But after the speed of the shock has slowed down to the sound speed, the compression wave continues to travel at the speed of sound, gradually losing its energy.

The video shows the energy density over time. The energy density includes the thermal energy as well as the kinetic energy of the gas.

For those of you who are interested in the maths and the physics, the code simulates the Euler equations of a compressible fluid. These equations are a model for an ideal adiabatic gas. For more information about the Euler equations check out my previous post.

#### Computational Physics Basics: Integers in C++, Python, and JavaScript

Posted 5th August 2020 by Holger Schmitz

In a previous post, I wrote about the way that the computer stores and processes integers. This description referred to the basic architecture of the processor. In this post, I want to talk about how different programming languages present integers to the developer. Programming languages add a layer of abstraction and in different languages that abstraction may be less or more pronounced. The languages I will be considering here are C++, Python, and JavaScript.

### Integers in C++

C++ is a language that is very close to the machine architecture compared to other, more modern languages. The data that C++ operates on is stored in the machine’s memory and C++ has direct access to this memory. This means that the C++ integer types are exact representations of the integer types determined by the processor architecture.

The following integer datatypes exist in C++

Type | Alternative Names | Number of Bits | G++ on Intel 64 bit (default) |
---|---|---|---|

`char` |
at least 8 | 8 | |

`short int` |
`short` |
at least 16 | 16 |

`int` |
at least 16 | 32 | |

`long int` |
`long` |
at least 32 | 64 |

`long long int` |
`long long` |
at least 64 | 64 |

This table does not give the exact size of the datatypes because the C++ standard does not specify the sizes but only lower limits. It is also required that the larger types must not use fewer bits than the smaller types. The exact number of bits used is up to the compiler and may also be changed by compiler options. To find out more about the regular integer types you can look at this reference page.

The reason for not specifying exact sizes for datatypes is the fact that C++ code will be compiled down to machine code. If you compile your code on a 16 bit processor the plain `int`

type will naturally be limited to 16 bits. On a 64 bit processor on the other hand, it would not make sense to have this limitation.

Each of these datatypes is signed by default. It is possible to add the `signed`

qualifier before the type name to make it clear that a signed type is being used. The `unsigned`

qualifier creates an unsigned variant of any of the types. Here are some examples of variable declarations.

char c; // typically 8 bit unsigned int i = 42; // an unsigned integer initialised to 42 signed long l; // the same as "long l" or "long int l"

As stated above, the C++ standard does not specify the exact size of the integer types. This can cause bugs when developing code that should be run on different architectures or compiled with different compilers. To overcome these problems, the C++ standard library defines a number of integer types that have a guaranteed size. The table below gives an overview of these types.

Signed Type | Unsigned Type | Number of Bits |
---|---|---|

`int8_t` |
`uint8_t` |
8 |

`int16_t` |
`uint16_t` |
16 |

`int32_t` |
`uint32_t` |
32 |

`int64_t` |
`uint64_t` |
64 |

More details on these and similar types can be found here.

The code below prints a 64 bit `int64_t`

using the binary notation. As the name suggests, the `bitset`

class interprets the memory of the data passed to it as a bitset. The bitset can be written into an output stream and will show up as binary data.

#include <bitset> void printBinaryLong(int64_t num) { std::cout << std::bitset<64>(num) << std::endl; }

### Integers in Python

Unlike C++, Python hides the underlying architecture of the machine. In order to discuss integers in Python, we first have to make clear which version of Python we are talking about. Python 2 and Python 3 handle integers in a different way. The Python interpreter itself is written in C which can be regarded in many ways as a subset of C++. In Python 2, the integer type was a direct reflection of the `long int`

type in C. This meant that integers could be either 32 or 64 bit, depending on which machine a program was running on.

This machine dependence was considered bad design and was replaced be a more machine independent datatype in Python 3. Python 3 integers are quite complex data structures that allow storage of arbitrary size numbers but also contain optimizations for smaller numbers.

It is not strictly necessary to understand how Python 3 integers are stored internally to work with Python but in some cases it can be useful to have knowledge about the underlying complexities that are involved. For a small range of integers, ranging from -5 to 256, integer objects are pre-allocated. This means that, an assignment such as

n = 25

will not create the number 25 in memory. Instead, the variable `n`

is made to reference a pre-allocated piece of memory that already contained the number 25. Consider now a statement that might appear at some other place in the program.

a = 12 b = a + 13

The value of `b`

is clearly 25 but this number is not stored separately. After these lines `b`

will reference the exact same memory address that `n`

was referencing earlier. For numbers outside this range, Python 3 will allocate memory for each integer variable separately.

Larger integers are stored in arbitrary length arrays of the C `int`

type. This type can be either 16 or 32 bits long but Python only uses either 15 or 30 bits of each of these "digits". In the following, 32 bit `int`

s are assumed but everything can be easily translated to 16 bit.

Numbers between −(2^{30} − 1) and 2^{30} − 1 are stored in a single `int`

. Negative numbers are not stored as two’s complement. Instead the sign of the number is stored separately. All mathematical operations on numbers in this range can be carried out in the same way as on regular machine integers. For larger numbers, multiple 30 bit digits are needed. Mathamatical operations on these large integers operate digit by digit. In this case, the unused bits in each digit come in handy as carry values.

### Integers in JavaScript

Compared to most other high level languages JavaScript stands out in how it deals with integers. At a low level, JavaScript does not store integers at all. Instead, it stores all numbers in floating point format. I will discuss the details of the floating point format in a future post. When using a number in an integer context, JavaScript allows exact integer representation of a number up to 53 bit integer. Any integer larger than 53 bits will suffer from rounding errors because of its internal representation.

const a = 25; const b = a / 2;

In this example, `a`

will have a value of 25. Unlike C++, JavaScript does not perform integer divisions. This means the value stored in `b`

will be 12.5.

JavaScript allows bitwise operations only on 32 bit integers. When a bitwise operation is performed on a number JavaScript first converts the floating point number to a 32 bit signed integer using two’s complement. The result of the operation is subsequently converted back to a floating point format before being stored.

#### The SIR Model for the Spread of Infectious Diseases

Posted 26th April 2020 by Holger Schmitz

In the current Coronavirus crisis, everybody is talking about flattening *“the curve”*. In the news, you will often see graphs of the total number of cases or the total number of deaths over time. So you may be forgiven to think that these are the curves that everybody is trying to flatten. In fact, what epidemiologists mean by *the curve* is the graph of the number of actively infected people over time. This curve is important because it determines the load that is placed on the healthcare system of a country. The current number of cases determines how many hospital beds, how many ventilators, and how much healthcare personnel are needed.

Mathematics and computer simulations play an important role in estimating how the disease will spread, how many people will be affected, and how much resources are needed. They also allow predicting the effects of different measures to control the spread. For example, the current lockdown in many countries around the world is reducing the number of people that an infected individual can pass the virus on to. It is important to know how effective this measure is. One of the big questions is when it is safe to relax the isolation of people and how much it would affect the spread if individual businesses re-open.

Before continuing, I have to add a disclaimer. I am interested in mathematics but I am not an expert epidemiologist. The models I am showing you here are very simple starting points for simulating the spread of diseases. They can give you some idea on how parameters like the infection rate and recovery rate influence the overall number of infected individuals. But they should not be used to draw any quantitative conclusions.

#### The SIR Model

To get a basic feel for the way infections spread through a population, epidemiologists have developed simple mathematical models. Probably the first model you will hear about in this context is the **SIR model**. The SIR model is a version of a compartmental model. This means that the total population is divided up into separate compartments. The quantity $S$ denotes the number of susceptible individuals. These are the people that are not infected and also don’t have any immunity to the disease. $I$ is the number of infected individuals and $R$ is the number of individuals that are not infectious but also can’t get the disease. Most scientists denote the $R$ to mean removed as it includes both people who have recovered and are immune but also those that have died. Due to the current sensitivity of the subject, many people prefer to call $R$ the recovered population.

Compartmental models define rates by which individuals change from one population to another. The SIR model has two rates, the rate of infection and the rate of recovery. The absolute rate of infection is proportional to the number of infected people. On average, each infected individual will pass the infection to a number of people in a given time interval. This number is usually called $\beta$. However, if an infected individual passes the virus to a recovered person, the infection will not spread. The probability of passing the infection on is given by $S/N$ where $N$ is the total population $N=S+I+R$. Putting this together, the absolute rate of infection is

$$\frac{\beta I S}{N}.$$

The rate of recovery is slightly more simple. Each infected individual will recover with some probability $\gamma$ in a given time interval. The absolute rate of recovery is then expressed as

$$\gamma I.$$

The infection rate reduces the number of susceptible individuals $S$ and increases the number of infected individuals $I$. The recovery rate reduces the number of infected individuals $I$ and increases the number of recovered individuals $R$. The complete set of rate equations is then

$$\begin{eqnarray}

\frac{dS}{dt} &=& – \frac{\beta I S}{N}, \\

\frac{dI}{dt} &=& \frac{\beta I S}{N} – \gamma I, \\

\frac{dR}{dt} &=& \gamma I.

\end{eqnarray}$$

The ratio of the coefficients $\beta$ and $\gamma$ is known as the basic reproduction ratio.

$$R_0 = \frac{\beta}{\gamma}$$.

The $R_0$ is important because it determines whether the infection will spread exponentially or eventually die out.

I have implemented a little JavaScript app that integrates the SIR equations and shows the development of the populations over time. Feel free to play around with the sliders and explore how the parameters influence the spread.

I encourage you to play around with the parameters to see how the model behaves. For an infection rate of 1 and a recovery rate of 0.5, the populations stabilise when about 80% of the population has been infected and has recovered. The maximum of the infectious population, the $I$ curve, reaches about 16%. If you reduce the infection rate, the $I$ curve flattens, prolonging the time over which the disease is spreading but reducing the maximum number of infected individuals at any one time.

#### The SEIR Model

One of the major assumptions in the SIR model is that an infected individual can immediately spread the infection. A refinement of the model is the addition of a population, $E$, of exposed individuals. These are people that are infected but are not yet infectious. The SEIR model introduces another rate, $a$, at which exposed individuals turn infectious. The quantity $a$ can be understood as the inverse of the average incubation period. The absolute rate at which exposed individuals become infectious is

$$a E.$$

The complete set of equations of the SEIR model are then as follows.

$$\begin{eqnarray}

\frac{dS}{dt} &=& – \frac{\beta I S}{N}, \\

\frac{dE}{dt} &=& \frac{\beta I S}{N} – a E, \\

\frac{dI}{dt} &=& a E – \gamma I, \\

\frac{dR}{dt} &=& \gamma I.

\end{eqnarray}$$

The SEIR model is also implemented in the app above. Simply pick *SEIR Model* from the dropdown menu and start exploring.

#### The SEIR Model with Delay

The SEIR model above assumes that an individual, once exposed, can immediately turn infectious. The constant rate $a$ implies that the probability of changing from the exposed state to the infectious state is the same on day one of being exposed as it is on day ten. This might not be realistic because diseases typically have some incubation period. Only after some number of days after being exposed will an individual become infectious. One can model this kind of behaviour with a time delay. Let’s say that after a given incubation period $\tau$, every exposed individual will turn infectious. The absolute rate at which exposed individuals become infectious is then given by

$$\frac{\beta I(t-\tau) S(t-\tau)}{N}.$$

Here the $S(t-\tau)$ means taking the value of the susceptible individuals not at the current time, but at a time in the past with a delay of $\tau$. The complete set of equations of the SEIR model with delay are then as follows.

$$\begin{eqnarray}

\frac{dS}{dt} &=& – \frac{\beta I(t) S(t)}{N}, \\

\frac{dE}{dt} &=& \frac{\beta I(t) S(t)}{N} – \frac{\beta I(t-\tau) S(t-\tau)}{N}, \\

\frac{dI}{dt} &=& \frac{\beta I(t-\tau) S(t-\tau)}{N} – \gamma I(t), \\

\frac{dR}{dt} &=& \gamma I(t).

\end{eqnarray}$$

I have written the time dependence explicitly for all quantities on the right-hand side to make it clear how the time delay should be applied.

You can choose this model in the app above by selecting *SEIR Model with Delay* from the dropdown menu.

#### Some Conclusions

The SEIR model and the SEIR model with delay both introduce a population of exposed individuals that are not yet infectious. This draws out the spread of the disease over a longer time. It also slightly reduces the maximum of the infectious population curve $I$. Introducing a time delay doesn’t change the curves too much. But for long incubation periods, the curve of infectious individuals can have multiple maxima. So at some time, it may look like the disease has stopped spreading while in reality, a next wave is just about to start. The two versions of the SEIR model are two extremes and the truth lies somewhere in between these two.

I have to stress again that I am not an epidemiology expert and that the models presented here are very simple models. For any meaningful prediction of the spread of a real disease, much more complex models are needed. These models must include real data about the number of contacts that different parts of the population have between each other.

The code for the application above is available on

GitHub

#### Computational Physics Basics: How Integers are Stored

Posted 4th April 2020 by Holger Schmitz

### Unsigned Integers

Computers use binary representations to store various types of data. In the context of computational physics, it is important to understand how numerical values are stored. To start, let’s take a look at non-negative integer numbers. These unsigned integers can simply be translated into their binary representation. The binary number-format is similar to the all-familiar decimal format with the main difference that there are only two values for the digits, not ten. The two values are **0** and **1**. Numbers are written in the same way as decimal numbers only that the place values of each digit are now powers of 2. For example, the following 4-digit numbers show the values of the first four

0 0 0 1 decimal value 2^{0} = 1

0 0 1 0 decimal value 2^{1} = 2

0 1 0 0 decimal value 2^{2} = 4

1 0 0 0 decimal value 2^{3} = 8

The binary digits are called **bits** and in modern computers, the bits are grouped in units of 8. Each unit of 8 bits is called a **byte** and can contain values between 0 and 2^{8} − 1 = 255. Of course, 255 is not a very large number and for most applications, larger numbers are needed. Most modern computer architectures support integers with 32 bits and 64 bits. Unsigned 32-bit integers range from 0 to 2^{32} − 1 = 4, 294, 967, 295 ≈ 4.3 × 10^{9} and unsigned 64-bit integers range from 0 to 2^{64} − 1 = 18, 446, 744, 073, 709, 551, 615 ≈ 1.8 × 10^{19}. It is worthwhile noting that many GPU architectures currently don’t natively support 64-bit numbers.

The computer’s processor contains registers that can store binary numbers. Thus a 64-bit processor contains 64-bit registers and has machine instructions that perform numerical operations on those registers. As an example, consider the addition operation. In binary, two numbers are added in much the same way as using long addition in decimal. Consider the addition of two 64 bit integers 7013356221863432502 + 884350303838366524. In binary, this is written as follows.

01100001,01010100,01110010,01010011,01001111,01110010,00010001,00110110 + 00001100,01000101,11010111,11101010,01110101,01001011,01101011,00111100 --------------------------------------------------------------------------- 01101101,10011010,01001010,00111101,11000100,10111101,01111100,01110010

The process of adding two numbers is simple. From right to left, the digits of the two numbers are added. If the result is two or more, there will be a carry-over which is added to the next digit on the left.

You could add integers of any size using this prescription but, of course, in the computer numbers are limited by the number of bits they contain. Consider the following binary addition of (2^{64} − 1) and 1 .

11111111,11111111,11111111,11111111,11111111,11111111,11111111,11111111 + 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001 --------------------------------------------------------------------------- 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

If you were dealing with mathematical integers, you would expect to see an extra digit `1`

on the left. The computer cannot store that bit in the register containing the result but stores the extra bit in a special **carry flag**. In many computer languages, this unintentional overflow will go undetected and the programmer has to take care that numerical operations do not lead to unintended results.

### Signed Integers

The example above shows that adding two non-zero numbers can result in 0. This can be exploited to define negative numbers. In general, given a number *a*, the negative − *a* is defined as the number that solves the equation

*a* + ( − *a*) = 0.

Mathematically, the *N*-bit integers can be seen as the group of integers modulo 2^{N}. This means that for any number *a* ∈ {0, …, 2^{N} − 1} the number − *a* can be defined as

− *a* = 2^{N} − *a* ∈ {0, …, 2^{N} − 1}.

By convention, all numbers whose highest value binary bit is zero are considered positive. Those numbers whose highest value bit is one are considered negative. This makes the addition and subtraction of signed integers straightforward as the processor does not need to implement different algorithms for positive or negative numbers. Signed 32-bit integers range from − 2, 147, 483, 648 to 2, 147, 483, 647, and 64-bit integers range from − 9, 223, 372, 036, 854, 775, 808 to 9, 223, 372, 036, 854, 775, 807.

This format of storing negative numbers is called the **two’s complement** format. The reason for this name becomes obvious when observing how to transform a positive number to its negative.

01100001,01010100,01110010,01010011,01001111,01110010,00010001,00110110 (7013356221863432502) 10011110,10101011,10001101,10101100,10110000,10001101,11101110,11001010 (-7013356221863432502)

To invert a number, first, invert all its bits and then add 1. This simple rule of taking the two’s complement can be easily implemented in the processor’s hardware. Because of the simplicity of this prescription, and the fact that adding a negative number follows the same algorithm as adding a positive number, two’s complement is de-facto the only format used to store negative integers on modern processors.

### Exercises

- Show that taking the two’s complement of an
*N*-bit number*a*does indeed result in the negative −*a*if the addition of two numbers is defined as the addition modulo 2^{N}. - Find out how integers are represented in the programming language of your choice. Does this directly reflect the representation of the underlying architecture? I will be writing another post about this topic soon.
- Most processors have native commands for multiplying two integers. The result of multiplying the numbers in two
*N*-bit registers are stored in two*N*-bit result registers representing the high and low bits of the result. Show that the resulting 2*N*bits will always be enough to store the result. - Show how the multiplication of two numbers can be implemented using only the bit-shift operator and conditional addition based on the bit that has been shifted out of the register. The bit-shift operator simply shifts all bits of a register to the left or right.