# Parallel Spectral Numerical Methods/Finite Precision Arithmetic

Because computers have a fixed amount of memory, floating-point numbers can only be stored with a finite number of digits of precision. This limits the accuracy to which the solution of a numerical problem can be obtained in finite time. Most computers use binary IEEE 754 arithmetic to perform numerical calculations. Other formats exist, but this is the one most relevant to us. For more on this topic, see a textbook on numerical methods such as Bradie[1].
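As a rough illustration of what a finite number of digits means in practice, the following Python sketch estimates the machine epsilon (the gap between 1 and the next larger representable number) in double and single precision. The helper names `to_single` and `machine_epsilon` are hypothetical, and single precision is only emulated here by round-tripping values through the standard `struct` module, on the assumption that the platform stores Python floats as IEEE 754 binary64.

```python
import struct
import sys


def to_single(x: float) -> float:
    """Round a Python float (assumed binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def machine_epsilon(round_fn=float) -> float:
    """Halve eps until 1 + eps/2 is no longer distinguishable from 1.

    round_fn models the storage format: float for double precision,
    to_single for (emulated) single precision.
    """
    eps = 1.0
    while round_fn(1.0 + eps / 2.0) != 1.0:
        eps /= 2.0
    return eps


eps_double = machine_epsilon()           # double precision
eps_single = machine_epsilon(to_single)  # emulated single precision

print("double:", eps_double, "(sys.float_info.epsilon =", sys.float_info.epsilon, ")")
print("single:", eps_single)
```

Any quantity smaller than roughly this fraction of the number it is added to is lost to rounding, which is the effect the exercises below explore.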

## Exercises

1) Download the most recent IEEE 754 standard from http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=2355 (see also http://grouper.ieee.org/groups/754/). Unfortunately, the links to the official standard require either IEEE membership or a subscription. If you have neither, the Wikipedia page http://en.wikipedia.org/wiki/IEEE_754-2008 contains the information you will need to answer the questions below. (These links were correct as of 1 April 2012. Should they no longer be active, the information can be found with a search engine or in a numerical analysis textbook such as Bradie[2].)
a) In this standard, what are the range and the precision of numbers in:
i) Single precision
ii) Double precision
b) What does the standard specify for quadruple precision?
c) What does the standard specify about how elementary functions should be computed? How does this affect the portability of programs?
2) Suppose we discretize a function for ${\displaystyle x\in [-1,1]}$. For what values of ${\displaystyle \epsilon }$ is
${\displaystyle \epsilon \log \left(\cosh \left({\frac {x}{\epsilon }}\right)\right)=\lvert x\rvert }$

in:

i) Single precision?
ii) Double precision?
3) Suppose we discretize a function for ${\displaystyle x\in [-1,1]}$. For what values of ${\displaystyle \epsilon }$ is
${\displaystyle \tanh \left({\frac {x}{\epsilon }}\right)={\begin{cases}1\quad x\geq 0\\-1\quad x<0\end{cases}}}$

in:

i) Single precision?
ii) Double precision?
4)
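a) What is the magnitude of the largest 4-byte integer that can be stored? (Integer storage is determined by the programming language and hardware rather than by the IEEE 754 floating-point formats; a signed two's complement representation is the usual assumption.)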
b) Suppose you are doing a simulation with ${\displaystyle N^{3}}$ grid points and need to calculate ${\displaystyle N^{3}}$. If ${\displaystyle N}$ is stored as a 4-byte integer, what is the largest value of ${\displaystyle N}$ for which ${\displaystyle N^{3}}$ can also be stored as a 4-byte integer?
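
Exercises 3 and 4 can also be explored numerically. The Python sketch below is one possible starting point, not a solution. The helper names `to_single`, `tanh_saturates` and `fits_in_int32` are hypothetical, a signed two's complement 4-byte integer is assumed, and single precision is only emulated by rounding a double precision result, which may differ slightly from evaluating `tanh` natively in single precision.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a Python float (assumed binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tanh_saturates(x: float, eps: float, single: bool = False) -> bool:
    """Return True if tanh(x/eps) is stored as exactly +1 or -1."""
    t = math.tanh(x / eps)
    if single:
        # Assumption: emulate single precision by rounding the double result.
        t = to_single(t)
    return abs(t) == 1.0


def fits_in_int32(value: int) -> bool:
    """Return True if value lies in the signed 4-byte two's complement range.

    Python integers have arbitrary precision, so the range check is done
    explicitly instead of relying on overflow.
    """
    return -2**31 <= value <= 2**31 - 1


# Try a few step widths for a point near the right end of the interval.
for eps in (1.0, 0.1, 0.02):
    print(eps, tanh_saturates(1.0, eps), tanh_saturates(1.0, eps, single=True))

# Try a few grid sizes.
for n in (100, 1000, 2000):
    print(n, fits_in_int32(n**3))
```

Scanning `eps` and `N` more finely, and repeating the `tanh` test for points close to zero, narrows down the thresholds asked for in the exercises.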

## References

Bradie, B. (2006). A Friendly Introduction to Numerical Analysis. Pearson.

http://en.wikibooks.org/wiki/Floating_Point