MATLAB Applications Part 1: Numerical Methods
July 1, 2025
Introduction
In my previous guide, I referenced how MATLAB is used for:
- modelling: creating representations of a scientific phenomena
- scientific computing: using the capabilities of a computer to analyze these models and solve difficult problems
How did MATLAB get to this point? Well, some of the intellectual fathers of modern-day computing have contributed to MATLAB, so it has one of the most intriguing histories of any piece of software. It truly reflects scientific computing and engineering at their best: relentless revision and collaboration.
Recall that the numerical linear algebra package known as LINPACK, developed by many computer scientists and mathematicians, is at the heart of MATLAB. One person who contributed to LINPACK was Cleve Moler, who went on to create MATLAB for his students. Back then, it was only a matrix calculator. This early version of MATLAB was worked on by him and George E. Forsythe, the founder of Stanford's computer science department. The first versions were distributed to a variety of universities and national labs, and MATLAB was ultimately released as a commercial product after Moler met Stanford electrical engineering grad student John N. Little. Little and his friend Steve Bangert went on to rewrite MATLAB in C. After that, the trio founded MathWorks in Portola Valley, California, nestled in the beautiful hills just behind Stanford. Later, the first ten copies of MATLAB were bought by numerical analyst Nick Trefethen at a little school known as MIT. The rest is history.
In the first part of this guide, we will develop an understanding of numerical analysis and work through the most important numerical methods.
In the second part, we will learn a new tool called Simulink. Finally, we will go through a medley of MATLAB applications in science and engineering. The content here will hopefully allow you to integrate programming into anything you work on and also pursue your own projects in MATLAB.
Please check out Bibliography for all of the wonderful sources I used to build this guide. Any specific information not pulled from them is adapted from MATLAB's blogs and documentation. If you have any questions or concerns, please contact me using the socials listed on my home page.
1. Intro to Numerical Analysis
Around 2600 years ago, Pythagoras and the Greeks believed that the universe could be expressed entirely through natural numbers and ratios between them. Some time later, their world view was disturbed when they discovered, using the Greek ladder technique, that √2 is irrational. And yet, over a millennium before even this, the Babylonians and Indians had methods to approximate √2. The Babylonians were sophisticated mathematicians, known for developing their own numerical system. A Babylonian tablet, YBC (Yale Babylonian Collection) 7289, is one example of just how impressive their work was. In base-60 (can you think of anything in our lives today that uses the number 60?), one diagonal contains the numbers 1; 24, 51, 10. After converting to decimal, this refers to the following fraction, which is incredibly close to √2:
1 + 24/60 + 51/60^2 + 10/60^3 = 30547/21600 ≈ 1.41421296
Why did they record 1; 24, 51, 10? Well, notice how the square also has a side labeled 30. Multiplying these two values together gets the value along the other diagonal, 42; 25, 35 (about 42.4264 in decimal). Given a side length of 30, they used the approximation of √2
to find the diagonal of the square. Note: In Babylonian mathematics, division by 2 was equivalent to multiplication by 30, since 2 and 30 are reciprocal values in their number system
. This allowed them to scale values depending on what they were solving for. How were they able to calculate this? Well, they could have used a process modeled by the following:
where
- n is the number you want to find the square root of
- a is the approximation to √n
- B is a "correction" term chosen so that it improves the estimate
Let's try evaluating this for
:
Our approximation is too small.
Let's try evaluating this for
:
This approximation is too large but a little closer.
They likely ended up with a more accurate recursive method known as the Babylonian method. This is also known as, and perhaps more accurately so, Heron's method:
x_{n+1} = (1/2) * (x_n + S/x_n)
where
- S is the number you want to find the square root of
- x_n is your current guess for √S
- x_{n+1} is then your improved guess for √S
Let's iterate through:
After just three iterations, we're at a much closer approximation.
percent_Err = (((1 / 2) * (1.417 + (2 / 1.417)) - sqrt(2)) / sqrt(2)) * 100
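Here's a minimal sketch of that iteration in MATLAB; the starting guess of 1.5 and the three-iteration count are my own choices for illustration:
S = 2;                          % number whose square root we want
x = 1.5;                        % initial guess
for k = 1:3
    x = (1/2) * (x + S/x);      % average the guess with S divided by the guess
end
relErr = (x - sqrt(S)) / sqrt(S) % relative error after three iterations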
Later along the way, this ended up as Newton's method:
x_{n+1} = x_n - f(x_n)/f'(x_n)
If we plug in f(x) = x^2 - S, we get:
x_{n+1} = x_n - (x_n^2 - S)/(2*x_n) = (1/2) * (x_n + S/x_n)
In other words, the Babylonian method is Newton's method for a function in the form of f(x) = x^2 - S, which has the root x = √S. Anyway, there's very rich history here that I skipped, but the important idea is that this is arguably an early form of numerical analysis.
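As a hedged sketch of the general method, here is Newton's iteration written with anonymous functions; f and df are my own choices, picked to reproduce the square-root case above:
f  = @(x) x.^2 - 2;      % function whose root we want (the root is sqrt(2))
df = @(x) 2*x;           % its derivative
x  = 1.5;                % initial guess
for k = 1:5
    x = x - f(x)/df(x);  % Newton update
end
err = x - sqrt(2)        % should be on the order of machine precision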
Numerical analysis isn't something that is super easy to define. Essentially, it's the study of using algorithms to solve problems involving continuous functions. Continuous functions involve real or complex variables, which cannot be represented on computers to complete accuracy because of the way they store information. Thus, through the convergence of approximations, numerical analysis is used to find unknowns to the best of our ability.
1.1 Numbers and Rounding: Unsigned and Signed Integers
Recall the following number systems:
Types of Numbers | Definition | Description |
Natural numbers | ℕ = {1, 2, 3, ...} | The set of all positive whole numbers |
Unsigned integers | {0, 1, 2, 3, ...} | An extension of the natural numbers that includes zero |
Integers | ℤ = {..., -2, -1, 0, 1, 2, ...} | An extension of the unsigned integers that includes negative whole numbers |
Rational numbers | ℚ = {p/q : p, q ∈ ℤ, q ≠ 0} | An extension of the integers to include fractions |
Real numbers | ℝ | An extension of the rational numbers to include irrational numbers |
Complex numbers | ℂ = {a + bi : a, b ∈ ℝ} | An extension of the real numbers to include imaginary numbers |
And recall the 16 fundamental array-based data types in MATLAB:
Name of Class | Description | Usage |
logical | Stores 1 (true) or 0 (false) | logical() |
string | String arrays stored | " " |
char | Character arrays stored | ' ' |
uint8 | 8-bit unsigned integer array | uint8() |
uint16 | 16-bit unsigned integer array | uint16() |
uint32 | 32-bit unsigned integer array | uint32() |
uint64 | 64-bit unsigned integer array | uint64() |
int8 | 8-bit signed integer array | int8() |
int16 | 16-bit signed integer array | int16() |
int32 | 32-bit signed integer array | int32() |
int64 | 64-bit signed integer array | int64() |
single | single-precision floating point arrays | single() |
double | double-precision floating point arrays (numeric default) | double() |
table | table array that stores tabular data | table() |
cell | cell array that groups related data of different types | { } |
structure | structure array that groups related data via data fields | struct |
Here's a flowchart to visualize this:
Let's begin with unsigned integers.
The one digit number 7 can be broken down like this:
The two digit number 15 can be broken down like this:
The three digit number 100 can be broken down like this:
The four digit number 8192 can be broken down like this:
How can this be represented through code?
p = (10 * ones(1,4)).^[0:1:3]
In other words, we represented a decimal number as a vector. This is important because MATLAB stores everything as arrays.
Alternatively, we can go the other way and rebuild the number from its digits.
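Here's one hedged sketch of what that might look like; the digit vector d (least significant digit first, to match p) is my own illustration for 8192:
d = [2 9 1 8];                 % digits of 8192, least significant first
p = (10*ones(1,4)).^[0:1:3];   % powers of 10: [1 10 100 1000]
n = sum(d .* p)                % weighted sum of digits, returns 8192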
Okay, instead of base-10, let's represent numbers in base-2 (binary).
Let's convert the 1-bit binary number 0 to decimal:
Let's convert the 1-bit binary number 1 to decimal:
Let's convert the 2-bit binary number 10 to decimal:
Let's convert the 2-bit binary number 11 to decimal:
Let's convert the 4-bit binary number 1001 to decimal:
The 8-bit binary number 00001001 is equivalent to the 4-bit binary number 1001. There are different ways to represent binary numbers depending on the number of bits. Here's a table with the decimal integers 0 through 15 and their equivalents in 4-bit binary representation:
Decimal | 4-Bit Binary |
0 | 0000 |
1 | 0001 |
2 | 0010 |
3 | 0011 |
4 | 0100 |
5 | 0101 |
6 | 0110 |
7 | 0111 |
8 | 1000 |
9 | 1001 |
10 | 1010 |
11 | 1011 |
12 | 1100 |
13 | 1101 |
14 | 1110 |
15 | 1111 |
Try representing the decimal number 16 using 4-bit binary. That's right, you cannot. 4-bit binary numbers can only represent a total of 16 (2^4) possible values, including 0. These 4-bit binary numbers are also known as nibbles. The largest possible number represented by an 8-bit binary number is 11111111, which is 2^8 - 1 = 255.
8-bit binary numbers can represent a total of 256 (2^8) values, including 0.
"8-bit" in computing refers to early systems where 8 bits was a reasonable compromise in efficiency. This affects audio quality, resolution, and processing power. You may recognize 8-bit music as "Chiptune" or arcade music and 8-bit images as pixel art.
How can binary be represented through code? MATLAB has a built-in function for this:
d = [0; 1; 2; 3; 9; 255];
b = dec2bin(d)
b =
'00000000'
'00000001'
'00000010'
'00000011'
'00001001'
'11111111'
We can also go backwards:
Let's try converting the decimal number 100 to binary.
b = [1, 1, 0, 0, 1, 0, 0];
p = (2*ones(1,7)).^[6:-1:0];
Let's look at the data representation.
Using fliplr(b), we can flip b left to right (reverse the order of its entries).
Indeed, logic can be involved.
Let's convert it into a logical array:
Let's find the powers and calculate it again:
p = (2*ones(1,7)).^[6:-1:0];
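A short sketch of how those pieces might fit together, reusing the variable names from the snippets above:
b = [1, 1, 0, 0, 1, 0, 0];     % binary digits of 100, most significant first
p = (2*ones(1,7)).^[6:-1:0];   % powers of 2: [64 32 16 8 4 2 1]
sum(b .* p)                    % weighted sum, returns 100
L = logical(b);                % the same digits as a logical array
sum(p(L))                      % logical indexing picks out the powers that are "on", also 100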
Representing binary numbers as logic is important because they can then be found through hardware. Logic gates are built through transistors, which are semiconductor devices that store electronic signals. Transistors represent numbers in computers using binary code. Each transistor can be either on (1) or off (0). Combined, these sequences form binary numbers.
Another way to represent unsigned numbers is through base-16, or hexadecimal.
Decimal | Hexadecimal |
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 6 |
7 | 7 |
8 | 8 |
9 | 9 |
10 | a |
11 | b |
12 | c |
13 | d |
14 | e |
15 | f |
For instance, d in hexadecimal is broken down as such:
You can convert quickly from hexadecimal to binary as such:
in binary
in binary
in binary
Let's look at an example:
We will convert it to binary:
in binary
in binary
in binary
Let's try MATLAB's built-in functions:
Let's try using format hex:
y is stored using the double class and y4 is stored using the uint16 class.
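The original values of y and y4 are not shown here, so the sketch below uses 1000 purely as a stand-in:
dec2hex(1000)        % returns '3E8': decimal to hexadecimal text
hex2dec('3E8')       % returns 1000: and back again
format hex           % display numbers as raw hexadecimal bit patterns
y  = 1000            % double: displays as 408f400000000000
y4 = uint16(1000)    % uint16: displays as 03e8
format short         % restore the default display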
Hexadecimal can also be used to "spell" out words. Here's an example of one used in debugging.
Hexadecimal Code | Decimal | Word | Indicates |
0xDEADBEEF | 3735928559 | deadbeef | software crash |
Going back to decimal, let's think about the magnitudes of the values that it can represent:
Number of Digits d | Form | Min Value | Max Value |
1 | | 0 | |
2 | | | |
3 | | | |
n | | | |
From this, let's come up with a formula for the number of decimal digits:
Let
where n is an integer and
. Then there is some
such that
Thus,
This can be represented as n = floor(log10(N)) + 1.
This gives us n, the number of decimal digits needed to represent N
. Let's test this in MATLAB. How many digits are needed to represent
?
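The specific value tested in the original isn't shown, so here's a hedged check using 8192 as a stand-in:
N = 8192;                  % stand-in value
n = floor(log10(N)) + 1    % number of decimal digits, returns 4
length(num2str(N))         % cross-check by counting characters, also 4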
Let's think about this for binary:
Number of Digits b (bits) | Form | Min Value | Max Value |
1 | | 0 | |
2 | | | |
3 | | | |
n | | | |
From this, let's come up with a formula for the number of bits:
Let
and
. Then there is some
such that
Thus,
This can be represented as n = floor(log2(N)) + 1.
This gives us n, the number of bits needed to represent N
. Let's test this in MATLAB:
Let's try this for a 25 digit number.
We can check our work by finding the length of N.
What would be outputted for n if we had N = 2^24?
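Here's a quick check answering that question:
N = 2^24;
n = floor(log2(N)) + 1     % returns 25: 2^24 is a 1 followed by 24 zeros
length(dec2bin(N))         % cross-check by counting binary digits, also 25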
We can see that our formula comes from an inequality where an unsigned integer can be bounded between two powers of 2.
For instance,
We simplified this expression using logarithms to get our formula.
Let's look at the following proposition:
Let
with
. We have the expression:
This is an increasing function with
where z is the domain of the function.
Let's prove this proposition. A function is increasing on an interval if the derivative is positive on that interval.
Let
. We can simplify:
Let's now take the derivative of both sides:
Keep in mind that
. And the domain of
. Thus,
Because the derivative is positive, we know that y is increasing.
We have proved the proposition that if
, then
.
We used the floor function earlier. Let's take a closer look at it.
For division with integers, the following is true:
With remainders, this can be rewritten as:
This leads to the following:
where 
Let's try using the floor function.
help floor
floor - Round toward negative infinity
This MATLAB function rounds each element of X to the nearest integer
less than or equal to that element.
Syntax
Y = floor(X)
Y = floor(t)
Y = floor(t,unit)
Input Arguments
X - Input array
scalar | vector | matrix | multidimensional array | table |
timetable
t - Input duration
duration array
unit - Unit of time
'seconds' (default) | 'minutes' | 'hours' | 'days' | 'years'
Examples
Round Matrix Elements Toward Negative Infinity
Round Duration Values Toward Negative Infinity
See also ceil, fix, round
Introduced in MATLAB before R2006a
Documentation for floor
Other uses of floor
Let's create a vector, floor each value, and then plot the relationship:
title('Floored Value Versus Actual Value')
How can we interpret this? Well, for the first set of values, notice how any x from -10 to -9.5 is rounded off to -10:
Let's look at the floor and rem functions. If we take the floor of
, we get the following: The actual value of
is
, but floor returns 12. This is because it rounds down. Let's take a look at rem.
Note the following:
Indeed, q * D + r = N.
Another true expression can be found:
Furthermore,
The rem function might remind you of the modulo operator (mod).
What's the difference? Well, let's look at a graph:
M = [mod(V(:),3),rem(V(:),3)];
title('Difference Between Rem and Mod')
legend('mod','rem','location','best')
Additionally, here's a table:
T = table(V, modVals, remVals, ...
'VariableNames', {'V', 'mod', 'rem'});
disp(T)
V mod rem
___ ___ ___
-10 2 -1
-9 0 0
-8 1 -2
-7 2 -1
-6 0 0
-5 1 -2
-4 2 -1
-3 0 0
-2 1 -2
-1 2 -1
0 0 0
1 1 1
2 2 2
3 0 0
4 1 1
5 2 2
6 0 0
7 1 1
8 2 2
9 0 0
10 1 1
The outputs change when negative integers are involved.
Here's a table where a is the numerator and m is the denominator:
Aspect | mod(a,m) | rem(a,m) |
Purpose | returns the remainder after division | returns the remainder after division |
Rounding | Uses floor(a/m) (rounds toward negative infinity) | Uses fix(a/m) (rounds toward 0) |
Output | Always 0 or same sign as m (the divisor) | Always 0 or same sign as a (the dividend) |
a < 0, m > 0 | Returns a nonnegative number | Returns a nonpositive number |
m = 0 | Returns a | Returns NaN |
Let's derive mod.
We have the following:
Remainder r is 0 if N is even and 1 if N is odd.
Okay, let's test for an even N:
evenOddTester = 2 * q + r
Let's test for an odd N:
evenOddTester = 2 * q + r
In both cases, we get that the output of evenOddTester is equivalent to N. Thus, it is working as planned.
Okay, so where is the output of modulo in evenOddTester? It's in r. Let's rearrange the function:
r = N - 2*q
where q is the integer quotient. Given the condition that r must satisfy 0 <= r < 2 and the fact that we want an integer quotient q, should we be using floor division or fix? We should be using floor. Thus, q = floor(N/2), because floor keeps the remainder nonnegative even when N is negative. So we now have the following:
r = N - 2*floor(N/2)
Let's test this out:
We now have a working modulo function.
For more than just evens and odds, we can generalize it:
r = N - D*floor(N/D)
where D is the divisor.
Okay, let's verify that our function works. First, let's get an expected output: 32/5 = 6.4. 5 fits into 32 exactly 6 times, meaning that 30 is accounted for. We can see from this that there is a remainder of 2. We can also multiply the fractional leftover 0.4 by the denominator (divisor) 5 to get 2. Thus, 32 mod 5 = 2. Now, let's test it in code:
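A minimal check in MATLAB:
N = 32; D = 5;
q = floor(N/D)     % integer quotient, returns 6
r = N - D*q        % remainder, returns 2
mod(N, D)          % built-in check, also returns 2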
Knowing that arrays are the backbone of MATLAB, let's write out a definition for the modulo operation in MATLAB:
b = mod(a,m) returns the remainder after dividing dividend a by divisor m.
It can be expressed as b = a - m.*floor(a./m).
We can now test mod using a vector of integers:
Here's a review problem: The output of mod(a,m) would represent something if it were a logical vector. What would it represent as a logical vector? Once you figure that out, code up the logical vector.
Let's test mod using positive and negative integers:
a = [-12, -7, -4, -1, 46, 93];
Note that nonzero outputs are always positive if the divisor is positive.
Let's test mod using a negative divisor:
a = [-12, -7, -4, -1, 46, 93];
Note that nonzero outputs are always negative if the divisor is negative.
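The divisor used in the original tests is not shown, so here is a hedged sketch assuming a divisor of 5:
a = [-12, -7, -4, -1, 46, 93];
mod(a, 5)      % positive divisor: 3 3 1 4 1 3  (nonzero results take the divisor's sign)
mod(a, -5)     % negative divisor: -2 -2 -4 -1 -4 -2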
Let's test mod using floating-point values:
theta = [0.0 3.2 6.2 7.6 9.0 4*pi];
m = 2*pi;
mod(theta,m)
0 3.2000 6.2000 1.3168 2.7168 0
Note that mod attempts to compensate for floating-point round-off effects to produce exact integer results whenever possible.
We've successfully used floor to derive the modulo operator. Let's now use floor to code up a general algorithm for converting an integer from decimal to binary.
Let's convert 27 into binary:
We can recognize that:
Thus, the number of bits n is given by floor(log2(27)) + 1 = 5.
27 can be represented by n = 5 bits b:
Okay, now let's assign each of those bits a power of 2. Here's a table:
Power of 2 | Decimal Value |
2^0 | 1 |
2^1 | 2 |
2^2 | 4 |
2^3 | 8 |
2^4 | 16 |
2^5 | 32 |
2^6 | 64 |
2^7 | 128 |
2^8 | 256 |
2^9 | 512 |
2^10 | 1024 |
2^11 | 2048 |
2^12 | 4096 |
2^13 | 8192 |
2^14 | 16384 |
2^15 | 32768 |
2^16 | 65536 |
From this, we have the following table, going from the most significant bit to least significant:
Thus, 27 in the binary form of
is 11011.
Let's convert 149 into binary:
We can recognize that:
Thus, the number of bits n is given by floor(log2(149)) + 1 = 8.
149 can be represented by n = 8 bits b:
Now let's assign each of those bits a power of 2:
From this, we have the following table:
Thus, 149 in the binary form of
is 10010101.
Let's convert 19229 into binary:
We can recognize that:
Thus, the number of bits n is given by floor(log2(19229)) + 1 = 15.
19229 can be represented by n = 15 bits b:
Now let's assign each of those bits a power of 2:
From this, we have the following table:
Thus, 19229 in the binary form of
is 100101100011101.
Let's move over to MATLAB and build this out:
y = 19229; % example input from the conversion we just did by hand
n = floor(log2(y))+1; % number of bits needed to represent y in binary
b = false(1,n); % initialize the binary vector using only logical zeros
tmp = y; % gives us a copy of y to manipulate while preserving our original
for i = n-1:-1:0 % loop from most significant to least significant bit
    if tmp >= 2^i % does 2^i fit into what is left of tmp?
        b(1, n - i) = true; % set the corresponding bit from the conditional to 1
        tmp = tmp - 2^i; % subtract 2^i from tmp because it's now accounted for
    end
end
p = (2*ones(1,n)).^[n-1:-1:0]; % powers of 2, most significant first
sum(p(b)) % reconstructs the original number, should be equal to y
bin_str = char(b + 48) % convert the logical binary vector into a string
We've just built an algorithm that expresses the fact that all integers can be represented as powers of 2 plus remainders. This idea is essential to how computers store integers and how MATLAB operates. We also worked with the floor function, which is an essential tool in rounding and thus scientific computation. However, most of our work revolved around unsigned integers. Let's move on to signed integers.
If we want to store positive and negative integers using bits, we can't just put a negative sign somewhere in the binary representation. We also want to make sure arithmetic works properly. When performing binary addition, the difference from decimal addition is that you carry values of 2 instead of values of 10. Let's take a look at some binary addition:
In 1 + 1 = 10, notice how we go from 1-bit integers to a 2-bit integer.
Let's try adding 120 + 100 in binary:
We went from two 8-bit integers to an 8-bit integer.
Let's try adding 200 + 100 in binary:
We went from two 8-bit integers to a 9-bit integer.
Indeed, 300 requires a minimum of 9 bits to represent.
When we try adding up the previous two examples using the uint8 data class...
ex1 = uint8(120) + uint8(100)
ex2 = uint8(200) + uint8(100)
Notice how ex1 yields the expected 220. But ex2 gives 255, the largest value the class can hold (MATLAB saturates rather than wrapping around). The number 300 cannot be represented in 8 bits. This is called overflow. We can fix this problem by using a different data class like uint16. This expands the bit length.
ex2 = uint16(200) + uint16(100)
Unfortunately, we can't do that with negative integers. Let's look at an example of how to represent signed integers in binary. First, let's do it with a positive integer as we know how to already:
Now, let's do it with a negative integer. You want to flip the bits:
11100011
And then add 1, which gives 11100100.
This representation is known as two's complement. Most computers represent integers this way.
In two's complement, a leading 1 means the integer is negative.
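As a hedged illustration (the value -28 is inferred from the flipped bit pattern 11100011 above), typecast lets us peek at how MATLAB stores a signed integer in two's complement:
dec2bin(typecast(int8(28), 'uint8'), 8)    % returns '00011100'
dec2bin(typecast(int8(-28), 'uint8'), 8)   % returns '11100100': flip the bits of 28 and add 1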
Addition and subtraction in two's complement is pretty simple. Here's what it looks like for a positive integer and a negative integer. We are adding
.
Let's try adding two integers in two's complement that result in a sum of 0. We are adding
.
Overflow also exists when representing integers via two's complement. Let's say we add up
.
The sum of two positive integers yielded a negative integer. This means we have overflow. If you add two numbers with the same sign and get a result with the opposite sign, that means you have overflow. To fix this, you can use a larger data class with more bits. This is known as sign extension.
We have 113 in 8-bit:
01110001
Let's expand it to 16-bit with placeholders:
This is a positive number, so the sign bit (0) moves to the new leftmost position. Then, we fill the rest with copies of the sign bit. In this case, we have a positive number, so we fill it in with zeroes. If it were negative, we'd fill it in with all ones.
0000000001110001
Doing the same for 91 yields 0000000001011011. Let's now add our numbers, now appropriate for int16, together.
16 bits can represent a significantly larger range of values, so we have no more overflow.
Let's do it with negative numbers:
91 in binary is 01011011. Flip the bits and then add 1 to get 10100101. We now have -91. 113 in binary is 01110001. Flip the bits and then add 1 to get 10001111. We now have -113.
-91 in 16-bit is 1111111110100101. -113 in 16-bit is 1111111110001111.
Now, let's combine it all to evaluate -91 + (-113) = -204.
Let's review our work. Using int8 does not work and results in overflow:
Using int16 gives the correct answer:
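A short sketch of both checks; note that MATLAB saturates integer overflow at the largest (or smallest) representable value rather than wrapping around:
int8(113) + int8(91)       % saturates at 127, the largest int8 value (overflow)
int16(113) + int16(91)     % returns 204, the correct sum
int16(-113) + int16(-91)   % returns -204, matching the two's complement work above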
You may have deduced that there is a threshold for the range of numbers that can be represented in two's complement using a certain number of bits n. A general rule is:
-2^(n-1) <= x <= 2^(n-1) - 1
For instance, in 8-bit, the range would be -128 to 127.
Notice how we can go all the way to -128 but only go up to 127. We can represent one more negative number than positive because there is not enough space to represent the positive counterpart of the most negative number. Here is an example in 3-bit:
Decimal | Binary |
-4 | 100 |
-3 | 101 |
-2 | 110 |
-1 | 111 |
0 | 000 |
1 | 001 |
2 | 010 |
3 | 011 |
The allowed range is -4 to 3. 100 is reserved for -4. We cannot go beyond 3 mathematically. Instead, we'd wrap around back to -4. You can think of the two's complement system for a certain number of bits like a circle. Here, there are 2^3 = 8 possible combinations.
If you are performing a sequence of arithmetic, it's okay if the intermediate operations result in overflow as long as the final result is within the bounds. For instance, let's say we want to do
in this system. For the first operation
, there is overflow because
. For the second operation
, we "revert" the overflow and get an answer
. Even though there was overflow in the intermediate step, the final result falls within the allowed range, so the operation is valid.
1.2 What Are Fixed-Point and Floating-Point Numbers?
Our previous work with unsigned and signed integers is part of a larger class of data representation that deals with the ways computers store and manipulate data. The first class is a fixed-precision method made up of fixed-point number systems. A fixed-point number simply means there is a constant number of digits stored on both sides of the radix (often decimal) point. You must decide where the point goes.
It's a good time to talk about how fractions work in these number systems. For any rational number x, we have x = p/q, where p and q represent the numerator and denominator respectively, with q ≠ 0.
For instance, the fraction
can be represented by a 4-digit decimal number:
We can use a decimal point to separate this representation into two parts:
From this, we can come up with a general notation for representing fractions:
Let
be nonnegative integers. Then, we can write
in a finite decimal expansion
Each digit is
for all
. The total number of decimal digits is thus
where f is the number of decimal digits.
Let's look at an example for
.
The integer part of our number 8142 is expressed using
decimal digits therefore
. We can see that
decimal digits after the decimal point. All in all,
We cannot write finite decimal expansions for certain fractions because of repeating decimals.
for instance is written as
. For this, we'd have a different form where
Let's try this out both ways. We have
.
Remember that decimal points are a system for humans to understand. We can multiply by a power of 10 and then subtract to move the repeating decimal to the fractional part of the number:
Since
has 7 decimal digits in the fractional part and we've already moved over 4 digits, let's shift over another 3 digits:
Now we can do
to get
where the fractional parts cancel out.
Using this, we can also find the rational fractional form:
More in line with our general notation, we can also express the number
as a summation.
First, we can break down the number as
Recall the formula of a geometric series:
We can then write out what we have as:
Let's put the summation (which represents the repeating part) in a form more similar to the formula above:
We can rewrite the sum as such, where we factor out
:
The inside can be represented as a geometric series of general form
. Let
and
:
We can see that
. We can then say that
We now have the number with an exact fractional form for our repeating decimal.
The general form of a number with a repeating decimal looks like this:
for a sequence of
repeating digits
Let I be the integer part, A be the non-repeating part after the decimal point (with n digits), and B be the repeating part (with r digits). We now have
x = I + A/10^n + B/(10^n * (10^r - 1))
Trying this out for our previous number
, we can see that
,
with
, and
with
. When we plug these values in to our general formula, we get the following:
This matches our previous work.
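Here's a small hedged numeric check of that general formula using my own example, 2.3171717..., rather than the number worked above (I = 2, A = 3 with n = 1 non-repeating digit, B = 17 with r = 2 repeating digits):
I = 2; A = 3; n = 1; B = 17; r = 2;
x = I + A/10^n + B/(10^n * (10^r - 1))   % returns 2.3172, i.e., 2.31717...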
The above is for infinite repeating decimals, but what happens if we have a finite decimal? We should be able to distinguish between the two. We can say that the decimal expansion of
is finite if and only if its fractional representation can be expressed with a denominator in the form of
where j and k are nonnegative integers.
If we expand
, we will eventually arrive at the following:
In other words, a finite decimal can be written as an integer divided by a power of 10, and since 10^k = 2^k * 5^k, we can conclude that the fraction representation of a finite decimal must have a denominator that is a product of twos and fives.
Binary representations are what the computer deals in, so let's now work with binary fractions. First, we will define the general form of a finite binary expansion. Let
be nonnegative integers. Then
can be written in the form
As this is a binary representation we have each bit
where
. Similarly to before, we know that the total number of bits
.
Here's an example dealing with the fraction
. It can be written as
. In binary form, we get
. In binary, recall that decimal 3 is equivalent to binary 11.
For another example, we have
. This can be rewritten as
In binary form, we get
. Once again, recall that decimal 19 is equivalent to binary 10011.
Here, not all rational numbers can be expressed using a finite binary expansion either because of repeating decimals. For instance,
We can write an infinite binary expansion in the form
Let's try this out for an example where
. First, we do a similar process where we multiply by powers of 2 and then subtract so that we are no longer representing the number through infinite repeating bits:
We then shift again:
Now, subtracting eliminates the repeating bits:
For the fractional representation of a number with infinite repeating bits, we have the following general form of a number:
Let I be the integer part, A be the non-repeating binary fraction with n bits, and B be the repeating binary fraction with r bits. Then,
Try plugging in the example we did above with
. Lastly, we can say that a rational number has a finite binary expansion if its denominator is a power of 2. That is, the binary expansion is finite if and only if the denominator is in the form
where
.
I think the toughest part of this section is figuring out how the conceptual framework of fixed-point numbers interplays with how MATLAB actually works. In other words, how can I apply this to my own work? Keep thinking about this as we move forward.
Let's do a little bit of fixed-point arithmetic. MATLAB has a special toolbox called Fixed-Point Designer that lets you really optimize and be specific about your fixed-point arithmetic. It's a little too specialized for this guide, so I won't be talking about it much after this. That being said, the fundamentals of why it works are still paramount here.
Fixed-point arithmetic is essentially the same as integer arithmetic, but with a scale factor. For instance, in Q4.4 format (4 bits for the integer part, 4 bits for the fraction), our scale factor is 2^-4 = 1/16. We would multiply the stored integers by 2^-4 to get our actual number. We would do the reverse to go from the actual number to the stored integer.
Actual Number | Scaling | Integer Stored By Fixed-Point |
3.25 | 3.25 * 2^4 = 52 | 52 |
5.5 | 5.5 * 2^4 = 88 | 88 |
1.0 | 1.0 * 2^4 = 16 | 16 |
So for addition and subtraction, perform the operations directly and then use the scale to interpret the result.
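A minimal sketch of Q4.4 addition using the stored integers from the table above:
x = 3.25;  y = 5.5;
xi = x * 2^4;  yi = y * 2^4;   % stored integers: 52 and 88
si = xi + yi                   % integer addition: 140
s  = si * 2^-4                 % reinterpret with the scale factor: 8.75 = 3.25 + 5.5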
In Fixed-Point Designer, there is a constructor called fi that lets you create a fi object. A fi object is simply storing a real number with a scaling factor.
There are different ways to use this function:
Parameters | Description |
fi | Returns a signed fi object with no value, a 16-bit word length, and a 15-bit fraction length |
fi(v) | Returns a signed fi object with value v, a 16-bit word length, and best-precision fraction length |
fi(v,s) | Returns a fi object with value v, signedness s, a 16-bit word length, and best-precision fraction length |
fi(v, s, w) | Returns a fi object with value v, signedness s, and word length w |
fi(v, s, w, f) | Returns a fi object with value v, signedness s, word length w, and fraction length f |
fi(v, s, w, slope, bias) | Returns a fi object with value v, signedness s, word length w, slope, and bias |
fi(v, s, w, slopeadjustmentfactor, fixedexponent, bias) | Returns a fi object with value v, signedness s, word length w, slopeadjustmentfactor, fixedexponent, and bias |
fi(v,T) | Returns a fi object with value v and numerictype T |
Let's go over some examples of these parameters.
- Value v refers to the number assigned to the data type of the constructor.
- Signedness s refers to whether or not the number can be negative. It stores only logical 1 or 0.
- Word length w refers to the total number of bits used to store the number, where w can be at most 65535 (individual bits are indexed with 16 bits, which is where this limit comes from).
- Fraction length f refers to the number of fractional bits, which is then used as the scale factor. 15 is the default.
- slope refers to the scalar you multiply the stored integer by to get the actual value (your real-world number). It is represented by the equation v = slope ∗ integer + bias. It differs from f because you can use non-binary scalars.
- bias lets you shift the scaling, and slopeadjustmentfactor lets you use a potentially non-binary slope; it must satisfy 1 ≤ slopeadjustmentfactor < 2. fixedexponent is also a scalar that lets you further adjust slope, where slope = slopeadjustmentfactor ∗ 2^fixedexponent. These three parameters are not used much, so we will not discuss them further.
For example...
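The actual value passed to fi is not shown here, so the call below uses 0.1 purely as a stand-in; it requires the Fixed-Point Designer toolbox.
a = fi(0.1, 0, 4, 6)   % unsigned, 4-bit word length, fraction length 6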
This would create a fixed-point number that is unsigned, 4 bits total, and of fraction length 6. Thus, the fixed-point number a is found by
. Then, the next stored integer is
. But wait, how is it possible to store a fraction length that is larger than the total number of bits? To store this value, MATLAB automatically expands the number of bits to prevent overflow; in this case the word length becomes 5 bits total. That is the end of the introduction to Fixed-Point Designer. If you are interested in using it yourself, you will have to acquire a license for the toolbox.
For multiplication, the result is "scaled twice." You multiply the two integers and the scale factor is
. Thus for two integers a and b, you would do
. In binary, the scaling factor is often a power of two. For example, let one stored integer be 125 and the scaling factor be
. Let another stored integer be 33 scaled by
. We can multiply these together to yield
. The scale factor is then
. With this, we can then do
to get
. To check our work, let's convert the original stored integers back to real numbers by doing
and
. Finally,
. The two results match.
For fixed-point division, let's just jump straight into an example. We are dividing the stored integer 3124 with a scale factor of
by the stored integer 9753 with a scale factor of
. The numerator is given by
. The denominator is given by
. The stored quotient is thus
. The final scale factor is given by
. Thus, our calculated real result is
. For a, b = stored integer values and s_a, s_b = their scale factors, we have the general formulas for fixed-point multiplication and division:
Operation | Formula | Accuracy (generally) |
Multiplication | (a * b) * (s_a * s_b) | Exact |
Division | (a / b) * (s_a / s_b) | Approximate |
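A hedged check of the multiplication rule using the Q4.4 values from earlier (stored integers 52 and 88, each with scale factor 2^-4):
ai = 52;  bi = 88;             % stored integers for 3.25 and 5.5
prod_int = ai * bi             % integer product: 4576
p = prod_int * 2^-4 * 2^-4     % apply both scale factors: 17.875
(52*2^-4) * (88*2^-4)          % direct real-number product, also 17.875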
To begin our work into floating-point numbers, let's start with an example.
Let's say you wanted to use some sort of set 𝔽 to represent the real numbers on a number line. 3 numbers in the set have been plotted already:
The number
has 3 significant digits and
has 9 significant digits. In the real world, it's generally not useful for large numbers to be represented with more significant digits than small ones. If these were experimental data, it would mean that the
in
is the digit we are not fully confident in. But if we were to change it to say
, it wouldn't make too much of a difference. If we did the same for
and it became
, that would be orders of magnitude more noticeable. Thus, this way of representing numbers is not very effective for scientific computing.
Using the same number line, let's start over. Let's zoom in on the interval from 1 to 2. For the following interval, there are
values because of binary representation. But the exact number isn't as important as the concept. Just imagine that there is a certain amount of values on this interval.
Let's say we were to multiply those
values by 2. They would then end up on the interval from 2 to 4:
There are still
values on the new interval, they are just spaced out twice as far apart. The same idea applies if we were to multiply our original
by
:
To generalize this pattern, we would say that each interval can be represented as [2^e, 2^(e+1)), where e is the exponent.
We can also say that each interval has
evenly spaced values in and that the spaces are relative to the size of the interval. Suppose we have a real number
and
is the closest value in floating point to x. It then follows that the error is bounded by:
This tells us how far off we might be due to the spacing between values. Now, let's look at the same thing but the size of the spacing relative to x. This means that
.
Thus, we can remove
. We are making the upper bound more flexible which is a safe approximation:
We now have the rounding error in terms of bits n:
This special value machine epsilon
can then be calculated as
which is the relative precision of floating-point arithmetic in this case. In real computing, floating-point arithmetic has around 16 decimal digits of precision, so our generalization checks out.
Let's take a second to come up with a formal definition for floating-point numbers:
The set 𝔽 of floating-point numbers can be expressed as
where
and exponent range
. The mantissa
is the part of a floating-point number that represents its significant digits. It then follows that
where d is a fixed integer called the binary precision.
The binary representation of
uses d bits thus
. We will elaborate on this definition later.
So we've seen floating-point as an idea, but why exactly we use it hasn't been well-established. Let's move on to an example that illustrates the advantages floating-point has over fixed-point.
Let x be a fixed-point data type such that
where
is the unsigned, fixed-point number encoded using 8 digits in binary with l bits for the integer part of the binary expansion and f bits for the fractional part of the binary expansion such that
with nonnegative integers 
Input
is an 8-bit binary word. Output
is a rational number with
where
and
. The output can be expressed as a finite binary expansion.
For instance, let
. We will set
and
which lets us see that
This is in the form of the binary expansion
This is then equivalent to
. Let's now convert x to decimal:
If we were to set
and
, we would get a different value:
There are different combinations of values for
and f such that
.
Case # | Possible l | Possible f |
1 | 4 | 0 |
2 | 3 | 1 |
3 | 2 | 2 |
4 | 1 | 3 |
5 | 0 | 4 |
Let's begin with Case 1 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
Let's move on to Case 2 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
. Let's move on to Case 3 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
Let's move on to Case 4 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
Let's move on to Case 5 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
From all of these cases, we can conclude that for
, we can write the interpreted value of the finite binary expansion as
where the total number of bits 
We can then say that
Assuming 
where 
We now get an approximation in the range of
that can be encoded in our
data class using an n-bit binary string.
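Here's a short sketch that reproduces the five cases above by reinterpreting the same 16 stored words with f = 0 through 4 fractional bits:
words = 0:15;                        % all 4-bit unsigned words
for f = 0:4
    vals = words * 2^-f;             % interpreted values for this integer/fraction split
    fprintf('f = %d : min = %g, max = %g, spacing = %g\n', ...
        f, min(vals), max(vals), 2^-f);
end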
Let's try this out.
Let
. Expanding out gets us
In binary,
Let's now encode x exactly in fixed-point format.
Let's recast it with
and
. We have a new object represented by
. You may recognize the hat notation from unit vectors in physics, when you have to approximate inconsistent systems in linear algebra, or to denote the Fourier transform of a function. In this case, we are using it to refer to the programming object used to store our approximation of x.
Notice how this has created an extra bit. We could round down to
or round up to
. Let's say we round down to get our approximation:
With
, it follows that
. We can now find the absolute error in our approximation:
For a different method of dealing with the extra bit, we can round up. We will refer to this instance as
. We can evaluate and then find absolute error again:
Let's look at another instance of storing x in
:
Previously, the least significant bits (LSB) were extra. In this case, the most significant bit (MSB) is extra. We could round down to
or round up to
. Let's say we round down to get our approximation and then find absolute error:
On the other hand, rounding up would yield:
This makes sense because we approximated our exact 4-bit integer value (1010) as a 3-bit integer (110) and
. Let's summarize these cases through a table:
Data Type | Binary Value | Absolute Error | Type of Error |
| | 0 | None |
| | | LSB |
| | | LSB |
| | 8 | MSB |
| | 4 | MSB |
Which type of error is preferable between LSB and MSB?
Let's move on to another example. Let
. Then,
This can be further simplified as
. If we were to store this using 8 bits, what would happen? Well, for one, it would have to be an approximation with a good amount of error. This is inevitable, but we can still try to find the encoding with low error. Applying a strategy from the previous example, let's try a data type where
and
. We will also round up:
Because
, it then follows that
We now have an absolute error in our approximation of
. To convert this back into binary, notice that
. Then we have
as our absolute error in binary.
In our previous work, when we were given a number of bits
, one possible way to represent it was to say that
such that half of the bits in the number represented the integer part and half represented the fractional part. This would mean
,
,
, or
. We can analyze the ranges of these representations. Let's look at
.
The lower bound is
. The upper bound is
. Thus,
. More generally,
. If you do this for the other possibilities, you will notice that the upper bounds, or the max integer values, will increase very rapidly.
After all of these examples, we can better see some disadvantages of fixed-point. One of these drawbacks is wasted bits. Using
to encode
would result in the following string of bits
. In high performance environments, this is a waste of storage that will also hurt performance.
Another drawback is that the range is very specific. For instance,
has an upper bound of
. The size of this would be okay for applications, but it's not nearly robust enough. If we needed to represent larger values, we could increase l, but this would result in us having less precision with the fractional part of a number. Remember that having more significant digits with lower numbers is generally more important than having the same with large numbers.
Additionally, remember the overflow that came from our work with fixed-point arithmetic.
The solution to this is to use floating-point. Floating-point, as we saw before, functions like scientific notation. It can remove the first drawback of extra bits. Take a look at the following example:
Let's say we wanted to represent
. In fixed-point binary, this would become
. All of these zeros do not really add anything important to the representation in terms of precision. They are only there to signify the location of the radix point. And in fixed-point decimal, this would become
. We can represent the number much more efficiently by allowing the radix point to float:
We can then float the radix point four places to the right to get
. This can of course also be done in binary:
Let's look at yet another example of where floating-point is much more efficient than fixed-point.
We have the number
which is equal to
in binary. To represent this in fixed-point, we would need 8 bits for the fractional part. In floating-point, we can represent it as follows:
Floating-point requires only 4 bits for the mantissa, also known as the significand. It stores only the significant digits and uses an exponent to shift the radix point. Because our number has a finite binary expansion and all bits fit into the mantissa, this is an exact number.
Mathematically, the general format of a floating-point number in binary looks like
,
where m is the mantissa and e is the exponent value.
A normalized floating-point number would look like
,
where
. We then have
,
where f is the fractional part of the mantissa. This is equivalent to our previous formal definition of a floating-point number:
where 
Be warned that exact representations aren't often the case in floating-point; most values end up being stored as approximations.
Let's conclude this chapter with a table summarizing fixed-point and floating-point numbers:
Aspect | Fixed-Point | Floating-Point |
General form | | |
Stores | Integer part and fractional part | Sign bit, mantissa, exponent |
Analogous to | Decimal numbers scaled by a fixed decimal point | Scientific notation |
Range | Limited by fixed scaling | Controlled by dynamic exponent scaling |
Resources | Very little, inexpensive | Generally fast, depends on hardware but is more expensive |
Created by | Early computer architects and engineers | Numerical analysts |
Used for | Digital signal processing, controls systems | Modern scientific computing |
1.3 IEEE Standard for Floating-Point Arithmetic (IEEE 754)
The Institute of Electrical and Electronics Engineers (IEEE, pronounced "eye-triple-E") is an association responsible for electrical and computer engineering. As a professional organization, they set industry standards, publish research, and organize conferences across many disciplines. IEEE 754 is a technical standard for floating-point arithmetic, and is one of their most important. It was published in 1985 and changes have been made since. Before 1985, different computers could have given slightly different results for the same math. Just think about how confusing this would have been! To this day, most computers abide by IEEE 754.
In IEEE 754, there are different data classes for different levels of precision.
Level of Precision | Number of Bits |
Quarter | 8 |
Half | 16 |
Single | 32 |
Double | 64 |
Extended | 128 |
While the standard names used nowadays are binary16, binary32, binary64, and binary128 (with binary8, or quarter precision, used informally), there were many competing data classes before IEEE 754 came around. The following table details some of the historical floating-point formats:
Example Data Class | Used By |
1750A_32 | Military |
IBM_32 | IBM |
TI_32 | Texas Instruments |
Let's work with quarter-precision floats. Let the total number of bits be 8 and let the number be normalized. Then, we have the bitwise format
. The 8-bit binary word looks as follows:
The sign bit has the value:
Let's try it out. Suppose
. It then follows that
Since
,
.
means that
. And
means that
. We are thus going to store
:
Let's look at another example:
where + is the sign bit and 1 is the implied normalized leading bit. We can now say that
,
, and
.
Let's try an example with a larger number
:
We can now say that
,
, and
. The value is stored as:
Recall that IEEE 754 doesn't store the actual exponent e but rather encodes something called the biased exponent u using the bias k. In this case, k = 2^(3-1) - 1 = 3. Thus u = e + k. Plugging our values in, we can find the actual value being stored
. We can now come up with a table that summarizes how this works. Recall that u is the value that is stored by the computer for the 3-bit exponent field
. The exponent value is the interpreted value e in decimal, found by e = u - 3. This is known as excess-3.
Exponent Bits | Biased Exponent u | Exponent Value e |
000 | 0 | -3 |
001 | 1 | -2 |
010 | 2 | -1 |
011 | 3 | 0 |
100 | 4 | 1 |
101 | 5 | 2 |
110 | 6 | 3 |
111 | 7 | 4 |
This is just how you go from the exponent bits back to the actual exponent e or vice-versa.
Here is a table of some raw 8-bit words and their numeric values (the first row is the general form):
Quarter-precision 8-bit Word | Sign Bit | Exponent Code | Numerical Value (floating-point binary) | Numerical Value as Sum (decimal) | Numerical Value (decimal) |
| | | |
|
|
11100000 | 1 | 110 | | | |
11011111 | 1 | 101 | | | |
11011110 | 1 | 101 | | | |
11011101 | 1 | 101 | | | |
11011100 | 1 | 101 | | | |
11011011 | 1 | 101 | | | |
11011010 | 1 | 101 | | | |
11011001 | 1 | 101 | | | |
11011000 | 1 | 101 | | | |
... | ... | ... | ... | ... | ... |
01011000 | 0 | 101 | | | |
01011001 | 0 | 101 | | | |
01011010 | 0 | 101 | | | |
01011011 | 0 | 101 | | | |
01011100 | 0 | 101 | | | |
01011101 | 0 | 101 | | | |
01011110 | 0 | 101 | | | |
01011111 | 0 | 101 | | | |
01100000 | 0 | 110 | | | |
... | ... | ... | ... | ... | ... |
01101111 | 0 | 110 | | | |
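Here's a hedged sketch that decodes one normalized quarter-precision word from the table (1 sign bit, 3 exponent bits with bias 3, 4 fraction bits, and the hidden leading 1):
word = '01011000';                   % example word from the table above
s = word(1) - '0';                   % sign bit
u = bin2dec(word(2:4));              % biased exponent, here 5
f = bin2dec(word(5:8)) / 2^4;        % fractional part of the significand, here 0.5
x = (-1)^s * (1 + f) * 2^(u - 3)     % decoded value: 1.5 * 2^2 = 6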
Here are some notes on floating-point in MATLAB. Let's start by initializing the characterizing parameters of floating-point:
p = [4, 10, 23, 52]; % This represents the number of bits in the fraction (precision)
q = [3, 5, 8, 11]; % This represents the number of bits in the exponent
w = p + q + 1 % Word length, total number of bits
b = 2.^(q-1)-1 % This is the exponent bias for each format
Roundoff absolute error can be found by eps(x) which gives the distance from x to the next larger floating-point number:
format shortg % shortg picks the more compact of fixed-decimal or scientific notation for display
eps = 2.^(-p)
0.0625 0.00097656 1.1921e-07 2.2204e-16
This tells us that double-precision is good for around 16 decimal digits of accuracy, single for about 7, half for about 3, and quarter for about 1.
If a real number is too large, it also has overflow here and is replaced with infinity. The largest floating point number that does not overflow can be found by:
realmax = 2.^b.*(2-eps)
15.5 65504 3.4028e+38 1.7977e+308
Similarly with underflow, the smallest normalized floating point number is given by:
realmin = 2.^(-b+1)
0.25 6.1035e-05 1.1755e-38 2.2251e-308
To add to this, IEEE 754 introduced the idea of gradual underflow. When an operation produces a result smaller than realmin, instead of flushing it straight to zero, the result is represented with progressively less precision. This is done with subnormal numbers, a special category of floating-point values that are too close to zero to be normalized. The leading digit of a subnormal number is zero. They help prevent abrupt underflow.
Thus, the smallest numbers in MATLAB can be found as such:
tiny = realmin.*eps
0.015625 5.9605e-08 1.4013e-45 4.9407e-324
Here are some subnormal numbers in binary8:
Quarter-precision 8-bit Word | Sign Bit | Exponent Code | Numerical Value (floating-point binary) | Numerical Value as Sum (decimal) | Numerical Value (decimal) |
| | | |
|
|
10001111 | 1 | 000 | | | |
10001110 | 1 | 000 | | | |
10001101 | 1 | 000 | | | |
... | ... | ... | ... | ... | ... |
00001101 | 0 | 000 | | | |
00001110 | 0 | 000 | | | |
00001111 | 0 | 000 | | | |
Exact integer arithmetic can be done with floating-point numbers up to a limit, using what MATLAB calls flints (floating-point numbers whose values are integers). The largest consecutive flint is given by:
flintmax = 2./eps
32 2048 1.6777e+07 9.0072e+15
To summarize, here's a table with the properties of IEEE 754 formats.
T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];
T = table(T(:,1), T(:,2), T(:,3), T(:,4), ...
    'variablenames',{'quarter','half','single','double'}, ...
    'rownames',{'w','p','q','b','eps','realmax','realmin','tiny','flintmax'});
disp(T)
quarter half single double
________ __________ __________ ___________
w 8 16 32 64
p 4 10 23 52
q 3 5 8 11
b 3 15 127 1023
eps 0.0625 0.00097656 1.1921e-07 2.2204e-16
realmax 15.5 65504 3.4028e+38 1.7977e+308
realmin 0.25 6.1035e-05 1.1755e-38 2.2251e-308
tiny 0.015625 5.9605e-08 1.4013e-45 4.9407e-324
flintmax 32 2048 1.6777e+07 9.0072e+15
One example of extended precision format can be found in Intel CPUs, which have the capacity to use what can be referred to as binary80.
With this, let's put together a reference for the most simple of the formats, quarter-precision, or binary8. Recall the visual representation:
This is interpreted as follows:
In this, we have
a sign field
, an exponent field
where
, and significand field
which is interpreted as
where
.
And this only holds if
, which is the range of exponent codes that produce normalized floating-point numbers. What do you think the numbers 000 and 111 outside of these bounds might be reserved for?
If
, then
So far, we've interpreted the 8-bit binary floating-point number x by breaking it into a form that resembles scientific notation. We then arrived at the final decoded value of the number based on its sign bit, biased exponent, and significand. This is what the number is represented as in the memory of a computer.
Furthermore, we have
,
where 4 is the number of bits in the significand field. This tells you that f is the fractional part of the significand. To decode the significand bits, divide by
to get f. To encode a known f, multiply by
.
Keep the following in mind when you are working with normalized floating-point numbers in binary8.
The hidden leading bit doesn't always appear:
It only appears when
. Thus, this is the range for normalized floating-point numbers. Only normalized numbers use the hidden leading bit in the significand. Outside of this range you can get subnormal numbers,
, and NaN. Subnormal numbers are where
which means
. Thus we have
. The significand is
and the leading 0 is not hidden. Normalized floating-point numbers in binary formats thus have an extra bit of precision that does not need to be represented in memory. Subnormals trade precision for the ability to represent smaller numbers after the lower bound exponent is reached. For ∞ and NaN
. We have
if
and
, but we have
if
and
. Anything else yields NaN, a special error pattern triggered when at least one
for
.
You may be wondering about what controls precision and the space between different floating-point numbers. After all, they can be used to represent a number line. The distance between two adjacent floating-point numbers is not constant. Recall from the first example in this chapter that they get more spread out as the numbers get larger. For exponent e, the smallest distance you can represent is
, where p is the number of significand bits. Near 0, floating-point numbers are very close together and further from 0, they are more spread apart. As e increases, the gap between floating-point numbers is also increasing. Errors in floating-point arithmetic depend on where the numbers are located and how you are representing them. Knowing Δ can tell you the maximum rounding error for the rounding of a real number as a floating-point number. Think about the difference between relative precision and absolute precision as values grow. If you are confused, go back to the first example from this chapter.
Pushing further, consider a significand field bit string pattern
. Thus
. The floating-point number then becomes
. What would happen to the mapped value as we vary the exponent code? Let's try it out. Keep in mind that for binary8,
. Here are the results:
Value of e | Mapped Value |
| |
| |
0 | |
1 | |
2 | |
3 | |
This is how floating-point has such a wide, dynamic range of numbers. As e changes, you are changing where the radix point goes. Floating-point is really just a way to adjust the fixed-point representation of a number as you see fit.
Okay, let's move on to more IEEE 754 formats. These are two of the most important, and especially so in MATLAB:
| Data Class | Exponent Bits | Fractional Bits |
Single-precision | single (binary32) | 8 | 23 |
Double-precision | double (binary64) | 11 | 52 |
Let's showcase single-precision visually:
Similarly to quarter-precision, let's now make a shortened table (256 entries would be too repetitive with little value!) for the relationship between the exponent codes and the numerical exponent values. For 1 ≤ u ≤ 254, the exponent value is e = u - 127. Outside of those bounds we have subnormals and ±∞ or NaN.
Exponent Bits | Biased Exponent u | Exponent value e | Numerical Interpretation of x |
00000000 | 0 | n/a (subnormal) | |
00000001 | 1 | -126 | |
00000010 | 2 | -125 | |
... | ... | ... | ... |
11111101 | 253 | 126 | |
11111110 | 254 | 127 | |
11111111 | 255 | n/a | or NaN |
Let's find the range for normalized floats in single-precision:
If
and
, then
Since
,
This is the decoded value of a normalized binary32 float.
Let's showcase double-precision visually:
Let's now make a shortened table for the relationship between the exponent codes and the numerical exponent values. For 1 ≤ u ≤ 2046, the exponent value is e = u - 1023. Outside of those bounds we have subnormals and ±∞ or NaN.
Exponent Bits | Biased Exponent u | Exponent value e | Numerical Interpretation of x |
00000000000 | 0 | n/a (subnormal) | |
00000000001 | 1 | -1022 | |
00000000010 | 2 | -1021 | |
... | ... | ... | ... |
11111111101 | 2045 | 1022 | |
11111111110 | 2046 | 1023 | |
11111111111 | 2047 | n/a | or NaN |
The range for normalized floats in double-precision follows the same pattern.
If
and
then
Since
,
This is the decoded value of a normalized binary64 float.
Let's work through some problems.
Convert 40.78125 into single-precision and double-precision. Noticing that
, we can say that
. In binary, this is equal to
. We can then rewrite it further as
. Moving the radix point five points to the left gets us the normalized binary number
. We can now put it into the final single-precision form of
. We can now find individual components of this representation. Because
where
, we can find
. We can then assign bits to the exponent field. We first have the bit string as an 8-bit unsigned integer
. We can then assign each bit into the correct slots of our binary32 representation to get
. In hexademical, you would extend the representation as a raw bit string 01000010001000110010000000000000. Then
0100 | 0010 | 0010 | 0011 | 0010 | 0000 | 0000 | 0000 |
4 | 2 | 2 | 3 | 2 | 0 | 0 | 0 |
Thus, in hexadecimal we have our final IEEE 754 float, 0x42232000. It is pretty easy to convert between binary and hex, so it's common in the world of computer science. We can check our work in MATLAB:
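One simple way to do that check (this snippet is my own suggestion, not the original listing) is num2hex, which returns the raw bit pattern of a stored value as a hexadecimal string:
num2hex(single(40.78125))   % returns '42232000'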
For double-precision, recall the normalized form 1.0100011001 × 2^5. We have the same significand bits 0100011001 and the same exponent e = 5. Thus we have (−1)^0 × (1.0100011001)₂ × 2^5 as the normalized value. We can then compute the biased exponent u = e + 1023 where e = 5. This yields u = 1028. As a bit string we have the 11-bit unsigned integer 10000000100. Finally, we have 0 | 10000000100 | 0100011001 followed by 42 zeros. And now we can convert it to hexadecimal:
0100 | 0000 | 0100 | 0100 | 0110 | 0100 | |
4 | 0 | 4 | 4 | 6 | 4 | 0000000000 |
We have 0x4044640000000000.
The above two examples are for numbers that can be represented exactly. Some decimal numbers cannot be encoded exactly in binary, so they must be represented approximately. Here's an example with 9.4:
The integer part is easy: 9 = 1001₂. Now we have to convert the fractional part 0.4 to binary by repeatedly multiplying by 2 and recording the integer bit produced at each step. We can do this iteratively:
Step | Result | Bit |
0 | 0.4 × 2 = 0.8 | 0 |
1 | 0.8 × 2 = 1.6 | 1 |
2 | 0.6 × 2 = 1.2 | 1 |
3 | 0.2 × 2 = 0.4 | 0 |
4 | 0.4 × 2 = 0.8 | 0 |
5 | 0.8 × 2 = 1.6 | 1 |
6 | 0.6 × 2 = 1.2 | 1 |
7 | 0.2 × 2 = 0.4 | 0 |
Thus 0.4 = 0.0110 0110 0110...₂, a repeating pattern that never terminates. We can then finish the encoding: 9.4 = 1001.0110 0110...₂ = 1.0010110011001100...₂ × 2^3, so e = 3 and the biased exponent is u = 3 + 1023 = 1026. We now have 10000000010. Continuing, the 52-bit fraction field 0010110011001100... is filled out and rounded to nearest, so its final bits become ...1101.
And finally we have hex 0x4022cccccccccccd.
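Again, a quick check of this encoding (my own addition) is to ask MATLAB for the stored bit pattern directly:
num2hex(9.4)   % returns '4022cccccccccccd'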
There are a lot of cases where floating-point seems better because of its malleable, massive range of values. And that's true. But what about when you want to distinguish between nearby values with high precision? What happens when you want to represent very small changes? Many floating-point representations may not be able to do this, especially not with fewer resources than a fixed-point representation. This relates to how coarse the precision in a representation is. A coarse grind of salt means large and uneven granules. You wouldn't use this salt for baking, which needs to be exact and homogeneous. Instead, it's used to sprinkle on top of food because you want to taste heterogeneity. You want those bursts of flavor that stand out. In the same way, floating-point is great when you need range rather than clear uniformity, when a little bit of rounding or approximation is okay here and there and you don't really need to distinguish between minuscule values. This is what we would describe as coarse.
Recall that numerical analysis is the study of the algorithms used to solve problems in continuous mathematics. Continuous mathematics requires us to work with the real and complex numbers, ℝ and ℂ. But these numbers cannot be stored exactly on computers because computers have finite memory and use binary representations. Thus, we have to approximate them. One of the primary tools used to approximate them is IEEE 754. Solving these problems is the backbone of modern day science and engineering. Indeed, much of the world would fail to run without numerical analysis—with IEEE 754 as the standard at its core.
1.4 IEEE 754 and NVIDIA
NVIDIA is one of the most powerful and influential companies on the planet. As a leader in the semiconductor industry, it produces graphics processing units (GPUs). GPUs can accelerate everything from the standard graphics rendering to intense numerical computations with the right software. They are designed to run in parallel, making them massively efficient and scalable. These chips are the engines of video games, scientific computing, and AI.
NVIDIA Voyager and Park, NVIDIA Newsroom. © NVIDIA
GeForce RTX 5090 Graphics Card, designed to be the workhorse of AI. © NVIDIA
At the foundation of all of this is IEEE 754. Proper approximation of numbers via floating-point is essential to ensure not only accuracy but also to guarantee efficiency and reproducibility. Whether it's simulating the amount of a feedstock left in an industrial chemical process, optimizing a stock portfolio, rendering explosions for a movie, or training a neural network with billions of parameters, correct use of floating-point will allow for the best results. To begin, let's look at floating-point operations. Let's say we have three numbers stored in IEEE 754 single-precision:
These numbers are given a small nonzero tail just to showcase some of the effects of floating-point rounding.
Addition of real numbers is associative, so if we were to add these purely mathematically, the grouping would not matter and we would get:
But if we add these in floating-point, the results differ. We can break these down step by step to show the differences.
denotes the numbers that would be stored in the computer and operated on using floating-point arithmetic:
The floating-point computations deviate from the exact mathematical computations. They also differ from each other depending on where and how much you round.
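Here is a small illustration of the same effect in MATLAB, using single precision so the rounding is easy to see. These particular values are my own choice, not the numbers from the example above:
a = single(1e8); b = single(-1e8); c = single(1);  % assumed demo values
(a + b) + c   % returns 1: the large terms cancel first, then c survives
a + (b + c)   % returns 0: c is absorbed when added to the large b first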
In 2008, IEEE 754 was revised with the fused multiply-add operation (FMA). It computes a × b + c with only one rounding step at the end. If not for this, the computation would be fl(fl(a × b) + c), rounding once after the multiplication and again after the addition, which we know can produce a different result.
Let's say we have
. We want to find
. Mathematically,
. We want a floating-point value that is as close to possible as
. If we were to round twice using the non-FMA method, we would get
.
If we were to then round using FMA, we would get
.
This is more accurate.
When it comes to a final result, the place where you round matters. Rounding errors accumulate differently depending on when, where, and how often you do so. Thus, computing in high precision and then rounding in low precision is not the same as a value computed solely in low precision.
While FMA is faster and more accurate than performing the operations separately, there are instances in which you might not want it. By performing the operations separately, you can debug or interpret certain steps more easily. The other reason is that the codebase or hardware you are working with might not support FMA. That is, it might target the version of IEEE 754 from before 2008. Old subroutines written in legacy Fortran or C++ might not be compatible with FMA. That being said, by default, most GPUs operate in a fully IEEE 754-compliant mode. However, NVIDIA GPUs are configurable. You can put them in fast mode, which prioritizes performance. These are the three compiler flags:
Flag | Purpose |
-ftz=true | Denormals are flushed to zero |
-prec-div=false | Division is not computed to the nearest floating-point value |
-prec-sqrt=false | Square roots are not computed to the nearest floating-point value |
The accuracy of these operations will then be worse, as rounding may no longer be correct. This does not apply to double-precision.
NVIDIA GPUs are different from x86 CPUs, which use a dynamic floating-point control word. The GPUs instead encode rounding and precision behavior directly into each instruction. They don't have error handling (trap handlers) for floating-point exceptions like overflow, underflow, or dividing by zero. This is because the GPUs are designed for massively parallel computation, and trap handlers would stall this process and hurt performance.
Another important product of NVIDIA is called Compute Unified Device Architecture (CUDA). It is a parallel computing platform that allows software to accelerate not only graphics rendering, but also general-purpose tasks. In other words, it lets you write code that runs on your GPU instead of your CPU. CUDA can be used for a variety of applications such as deep learning with PyTorch. CUDA also has a library of math functions. While it's easy to say what the right choice is for addition and multiplication in IEEE 754, it's a lot more difficult for more complex computations. CUDA's math library strikes a good compromise between speed and accuracy. And while GPU programming with CUDA is primarily for languages like C++, MATLAB has over a thousand functions that work specifically on NVIDIA GPUs (since R2010b). If you have toolboxes like the Parallel Computing Toolbox, you can do something like the following:
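The original listing for this step isn't reproduced here, so below is a minimal sketch of the kind of workflow meant, assuming you have the Parallel Computing Toolbox and a supported NVIDIA GPU:
A = rand(4000, 'single');   % create data on the CPU
G = gpuArray(A);            % transfer it to the GPU
B = G * G';                 % this matrix multiplication runs on the GPU
result = gather(B);         % bring the result back to CPU memory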
Keep in mind that moving code across different systems—such as from a CPU to a GPU, or from an NVIDIA GPU to an AMD GPU—can result in differences in computation. This is due to variations in floating-point precision, order of operations, and different math libraries. Remember to have some sort of system that accounts for these different outputs. Familiarize yourself with the capabilities of your GPU. In MATLAB, you can begin by running gpuDevice, which reports your device's name, memory, and supported features.
The growth of large language models has driven advances in both computer architecture and the standards of numerical computation. If built correctly, ultra-low-precision data formats can deliver faster computation while maintaining high accuracy. NVIDIA is currently working on this with 8-bit floating point (FP8), building on the previous format for efficient neural network training, brain floating point 16 (BF16). FP8 has two variants, E4M3 and E5M2, which are optimized for different stages of training. When it comes to Blackwell, NVIDIA's new series of GPU architectures, MXFP8 is used to alleviate any sacrifices of accuracy through block-scaling. These innovations reflect the ongoing evolution of floating-point formats beyond the original IEEE 754 standard to meet the demands of modern AI workloads.
For more information, please check out NVIDIA's technical resources which are in Bibliography.
2. Numerical Methods
Now that we understand more about numerical analysis and computing, let's go through numerical methods. Numerical methods are techniques used to approximate mathematical processes that cannot be solved exactly, or analytically. You don't need to master all of these techniques to follow the applications in later chapters, but this will still give you "under the hood" insight into how the most demanding problems in science and engineering are being solved today. If some of the math is unfamiliar, please check out my notes on calculus, linear algebra, and differential equations. That said, please do not feel discouraged. If you've already made it this far, you've done fantastic work. Learn these math courses then come back here whenever you are ready. The material in them is a drop in the pond compared to what you can learn and then accomplish. Do not let it hold you back!
2.1 Calculus Review
This is not meant to be comprehensive or mathematically flawless. Instead, it's a way for you to revisit some content that you may not have seen since AP Calculus.
Limits tell us how functions behave as they get close to some point. The limit L of f(x) as x approaches a is written
lim_{x→a} f(x) = L.
Let's graph f(x) = (x² − 4)/(x − 2). This function is undefined at x = 2 and thus has a hole there, even though the limit as x approaches 2 is 4.
f = @(x) (x.^2 - 4)./(x - 2); % Equivalent to f(x) = x + 2 except at x = 2
x = linspace(1.5, 2.5, 100); % Values near x = 2
plot(x, f(x), 'b-'); hold on % Plot the function around the hole
plot(2, 4, 'ro') % The limit point at x = 2
xlabel('x'); ylabel('f(x)')
title('Limit as x approaches 2')
We can use a form more useful for numerical methods by saying that
lim_{h→0} f(a + h) = L.
This version is more valuable because we will often use small values of h to probe how a function behaves near a.
We can say that the function is continuous at x = a if
lim_{x→a} f(x) = f(a).
This means that the function exists (doesn't have jumps or holes) at a.
A sequence of approximations will get closer to the correct answer over multiple iterations if there is convergence. That is, a sequence {x_n} has a limit L if
lim_{n→∞} x_n = L.
This would be called a convergent sequence. In terms of sequences, f is continuous at a if f(x_n) → f(a) for every sequence {x_n} that converges to a.
If a function is continuous on the interval [a, b], and L is a number between f(a) and f(b), there exists a number c with a ≤ c ≤ b where f(c) = L.
This is known as the intermediate value theorem.
For a function f that is continuous on [a, b], there exists a lower bound m, an upper bound M, and two numbers x_m and x_M in [a, b] such that m = f(x_m) ≤ f(x) ≤ f(x_M) = M when x is in [a, b].
This is known as the extreme value theorem.
Derivatives tell us how fast a function is changing. For a function f defined on an interval containing a, f is differentiable at a if the following limit exists.
The derivative is defined as
f'(a) = lim_{h→0} (f(a + h) − f(a)) / h.
In numerical computing, we pick a small number for h and use it to approximate derivatives:
f'(a) ≈ (f(a + h) − f(a)) / h
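As a quick illustration (using an example function of my own rather than one from the text), we can approximate the derivative of sin(x) at x = 1 and compare it with the exact value cos(1):
f = @(x) sin(x);
a = 1; h = 1e-5;                  % small step size
approx = (f(a + h) - f(a)) / h    % finite-difference approximation
exact = cos(1)                    % true derivative of sin at x = 1
abs(approx - exact)               % error on the order of h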
If f is differentiable at a, then f is also continuous at a. If a function is continuous on [a, b] and differentiable on (a, b), then there exists some value c in (a, b) such that
f'(c) = (f(b) − f(a)) / (b − a)
This is the mean value theorem, and it says that there is a point in the interval where the instantaneous rate of change is equivalent to the average rate of change. More specifically, the slope of the tangent line to f at x = c is equal to the slope of the secant line through (a, f(a)) and (b, f(b)). Here's an example:
For the continuous function f(x) = sin(x) on the interval [0.1, 2.1], the mean value theorem guarantees that there exists a point c in (0.1, 2.1) such that
f'(c) = (sin(2.1) − sin(0.1)) / (2.1 − 0.1).
Using f'(x) = cos(x), we can solve for c: c = arccos((sin(2.1) − sin(0.1)) / 2) ≈ 1.179.
At c, the instantaneous slope of f is equal to the average slope over the interval.
f = @(x) sin(x); % Function from this example
a = 0.1; b = 2.1; % Interval endpoints
c_mvt = acos((f(b) - f(a)) / (b - a)); % Solve cos(c) = average slope
y_mvt = f(c_mvt); % Function value at c
m_secant = (f(b) - f(a)) / (b - a);
c_secant = f(a) - m_secant * a;
tangent_line = @(x) m_secant * (x - c_mvt) + y_mvt; % Define tangent line at c
x = linspace(0, 2.5, 300);
y = f(x); % Curve values for plotting
plot(x, y, 'b-', 'LineWidth', 1.5); hold on
plot(x, m_secant * x + c_secant, 'r--', 'LineWidth', 1.5); % Secant
plot(x, tangent_line(x), 'g-.', 'LineWidth', 1.5); % Tangent
plot(c_mvt, y_mvt, 'ko', 'MarkerSize', 8, 'MarkerFaceColor', 'k'); % MVT point c
xlabel('x'); ylabel('f(x)');
title('MVT on f(x) = sin(x) over [0.1, 2.1]');
legend('f(x) = sin(x)', 'Secant line', 'Tangent at c', 'MVT point c', 'Location', 'southeast');
Integration tells us how much a function accumulates. The area under a curve from a to b for a function f(x) can be found as
∫_a^b f(x) dx.
Recall that there are different methods for numerical integration such as the trapezoidal rule:
∫_a^b f(x) dx ≈ (h/2) [f(x_0) + 2f(x_1) + 2f(x_2) + ... + 2f(x_{n−1}) + f(x_n)],
where h = (b − a)/n for a number of subintervals n.
(Figure: trapezoidal rule approximation for an example function, interval, and number of subintervals.)
And Simpson's rule:
∫_a^b f(x) dx ≈ (h/3) [f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + ... + 4f(x_{n−1}) + f(x_n)],
where h = (b − a)/n for an even number of subintervals n.
(Figure: Simpson's rule approximation for an example function, interval, and number of subintervals.)
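As a sketch of how you might apply these rules in MATLAB (the integrand and interval below are my own choices, not the ones pictured), trapz implements the composite trapezoidal rule on sampled data, and integral gives a high-accuracy reference value:
f = @(x) exp(-x.^2);        % assumed example integrand
x = linspace(0, 1, 11);     % n = 10 subintervals
I_trap = trapz(x, f(x))     % composite trapezoidal rule estimate
I_ref = integral(f, 0, 1)   % adaptive quadrature for comparison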
If f is continuous over [a, b] and F is an antiderivative of f on [a, b], then we have the following, which is known as the first fundamental theorem of calculus:
∫_a^b f(x) dx = F(b) − F(a)
If f is continuous over [a, b] and F(x) = ∫_a^x f(t) dt, then we have the second fundamental theorem of calculus:
F'(x) = f(x)
For a function f that is continuous over [a, b], there exists a number c in [a, b] such that
f(c) = (1/(b − a)) ∫_a^b f(x) dx,
where f(c) is the average value of f over [a, b]. This is known as the mean value theorem for integrals.
A series is the sum of an infinite sequence of terms:
a_1 + a_2 + a_3 + ...
The nth partial sum is given by S_n = a_1 + a_2 + ... + a_n. The infinite series converges if and only if the sequence of partial sums S_n converges to a limit S:
lim_{n→∞} S_n = S
If the series does not converge, we say that it diverges. The process of sequences converging towards a target is the backbone of iterative algorithms in numerical analysis. Series are the sum of a sequence and thus help us build approximations for functions.
If a function f is differentiable at x = a, it can be approximated using an nth-order polynomial:
P_n(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)² + ... + (f^(n)(a)/n!)(x − a)^n
2.2 Computational Errors
Like most of the equations you see in your science classes, the models depicted by numerical methods are often approximations. If deviations are systematically high or low, there is likely something not solely due to chance that is affecting your model. If the errors are more randomly distributed and tightly grouped around the prediction, your model is still likely accurate. We've thrown around the terms accuracy and precision a lot, so let's clarify what the two mean now. Accuracy refers to how close your computed value is to the true value in nature. Precision refers to how close the values are to each other. Let's say you are performing an experiment to calculate acceleration due to gravity g on Earth and get this final value:
Your result is accurate because it is pretty close to the standard value of 9.81 m/s². We can't say anything about precision here because it is only a single measurement. However, if you were to get this set of data:
Your experiment data are precise because they are close to each other but not accurate because they are far from the true value. Generally, if your data show precision over time, this can give you confidence that your results are reliable and trending towards accuracy.
Here's a visual with targets:
- For low accuracy and low precision, your data are not close to the actual value (green center) nor are they closely grouped.
- For low accuracy and high precision, your data are not close to the actual value, but they are closely grouped.
- For high accuracy and low precision, your data are generally close to the center, thus they are accurate. But because they are not closely grouped, they are imprecise.
- For high accuracy and high precision, your data are close to the center and tightly grouped.
We've talked about error in our representations of real numbers before, but here are the actual terms and some formulas:
E_t = |true value − approximation|
This would be known as absolute error. On the other hand, relative error is
ε_t = |true value − approximation| / |true value|.
Multiplying this by 100% would get you percent relative error. In scientific computing, we often will not know the true answer beforehand. After all, if we knew all the answers, we wouldn't have to perform many calculations. Thus, we can normalize error using the best possible estimate:
ε_a = |approximate error| / |best estimate| × 100%
In the case that we are iterating, our current approximation will be based on the previous approximation. Over time, we should get closer to the actual value because our approximations should be getting better and better. Here, percent error is given by
ε_a = |(current approximation − previous approximation) / current approximation| × 100%
I've included absolute values in these equations, which means that they only apply to the magnitude of the error. You can take them off if you want signed error. Generally, we don't care about the sign of the error. Instead, we care about whether the magnitude of the percent relative error is below a set tolerance ε_s. If we are iterating towards a solution, we would stop once ε_a < ε_s. Take for example a Maclaurin series expansion for e^x, where
e^x = 1 + x + x²/2! + x³/3! + ... + x^n/n!.
As we keep iterating, we keep adding more terms to the sequence and the approximation gets closer and closer to the true value of e^x. We can use the following equation to determine what would be considered a sufficiently accurate result:
ε_s = (0.5 × 10^(2−n)) %, where n is the number of significant figures we want to trust.
This yields what is known as the stopping criterion.
Plugging in our values (estimating e^0.5 to three significant figures, so n = 3) gets us ε_s = 0.05 %. We would keep iterating until ε_a < 0.05 %. The true value is e^0.5 = 1.648721... If we were to use the single term e^0.5 ≈ 1 as our first estimate, we would have ε_t = 39.3 %. Then with the second estimate 1 + 0.5 = 1.5, we would have ε_t = 9.02 %. We can then calculate the approximate error between these two estimates. The approximate estimate of this error would be ε_a = |(1.5 − 1)/1.5| × 100 % = 33.3 %. Our value of ε_a is still not below the threshold, so we would keep iterating through. After 6 terms, our result is 1.6486979 and ε_a = 0.0158 %, which is below 0.05 %. This is a sufficient approximation. Notice how the number of significant digits that agree with the true value has increased to 5. This is because our equations for the approximate estimate of error and the stopping criterion are conservative. They usually guarantee a result that is at least as good as what they yield, if not better.
In MATLAB, we can describe the series expansion as e^x ≈ 1 + x + x²/2! + ... + x^n/n!, adding one term per iteration. The function expoApprox passes the value to be evaluated x with stopping criterion es and a maximum allowable number of iterations maxit. In other words, it keeps adding terms until the approximate relative error is below the threshold or the number of iterations reaches a point where performance will be hurt. Convergence might never happen if there is something wrong with the program, so maxit essentially prevents infinite loops. In less extreme cases, it might just take a really long time to reach convergence, so it also can prevent very long runtimes.
function [fx, ea, iter] = expoApprox(x, es, maxit)
% expoApprox - Approximate e^x using Maclaurin series
% x - value at which to evaluate e^x
% es - stopping criterion, default = 0.0001%
% maxit - maximum number of iterations, default = 50
% fx - estimated value of e^x
% ea - approximate relative error (%)
% iter - number of iterations performed
if nargin < 2 || isempty(es), es = 0.0001; end % handling for if the user did not provide a second nonempty input
if nargin < 3 || isempty(maxit), maxit = 50; end % handling for if the user did not provide a third nonempty input
iter = 1; sol = 1; ea = 100; % start from the first term of the series
while (ea > es && iter <= maxit)
    solold = sol; % store the current estimate in solold
    sol = sol + x^iter / factorial(iter); % add the next term to sol
    iter = iter + 1; % continue to the next iteration
    if sol ~= 0 % error checking to prevent dividing by zero
        ea = abs((sol - solold) / sol) * 100;
    end
end
fx = sol; % return the final estimate
end
[x_val, error, iterations] = expoApprox(1, 1e-6, 100)
x_val =
2.718281826198493
error =
9.216155641522974e-07
This result means that after 12 iterations, a result of 2.718281826198493 comes with an approximate error estimate of 9.216 × 10^(−7) %.
We've already worked through roundoff error extensively, but I want to give a quick list of places it can occur in when doing arithmetic:
- Adding two numbers where the result requires a mantissa that is longer than what can be stored. In this case, some bits must be rounded off. This can manifest as subtractive cancellation, which happens when you subtract a number from a number that it is very close to in value. Because their leading digits are the same, you are left with a very imprecise number.
- Very lengthy computations will skew towards the more recently computed values.
- When you add a large and small number, the smaller number's decimal point can be shifted so far to the left (to match the exponent of the larger one) that it becomes essentially zero in the calculation.
- Smearing occurs when the individual terms in a summation are larger than the actual summation.
- Some infinite series, especially slowly converging ones, are prone to roundoff error.
Here's an example of what happens when we try to add a very large and a very small number:
Suppose we want to add 0.0010 to 4000. In floating-point, we will use a 4-digit mantissa and a 1-digit exponent:
0.4000 × 10^4 + 0.1000 × 10^(−2)
The exponents don't match, so we will have to shift 0.1000 × 10^(−2) to 0.0000001 × 10^4. And we can now add:
0.4000001 × 10^4
We can only store 4 digits in the mantissa, so we have to round this to 0.4000 × 10^4. We have lost the 0.0010 due to rounding, like it almost never existed in the calculation in the first place. It contributed zero to the end result.
Truncation errors occur when a mathematical procedure is cut short. This happens when you use an approximation in place of the proper mathematical expression. The origin of this is when people would replace functions with a Taylor series. The Taylor series is extremely valuable in numerical analysis because by the Taylor theorem, any smooth function can be approximated as a polynomial.
The Taylor series expansion looks like
f(x_{i+1}) = f(x_i) + f'(x_i)h + (f''(x_i)/2!)h² + (f'''(x_i)/3!)h³ + ... + (f^(n)(x_i)/n!)h^n + R_n, where h = x_{i+1} − x_i.
If the function has more curvature, you would have to keep adding more derivatives.
The remainder term R_n for a truncated Taylor series can be expressed as
R_n = (f^(n+1)(ζ)/(n+1)!) h^(n+1),
where ζ is some value of x between x_i and x_{i+1}. In general, an nth-order Taylor series expansion will be exact for an nth-order polynomial. For some differentiable and continuous functions such as exponentials and sinusoids, a finite number of terms will not yield an exact estimate. You would need an infinite number of terms for the series to converge. The number of terms that is considered sufficient is dependent on the size of the remainder. The problem with this is that ζ is not known exactly, and neither is f(x), because that is the function we are trying to approximate! Not knowing the latter means we can't calculate f^(n+1)(ζ), which is necessary to evaluate R_n exactly. We can, however, control how far away from x_i we want to evaluate the function, which means we can control the step size h and the number of terms in the expansion. As a result, we can express the remainder in the more useful form of
R_n = O(h^(n+1))
Big O notation describes the limiting behavior of a function as its argument approaches a value. In computer science, it is used to assess the speed of an algorithm as the input grows. In this case, we are controlling the step size h, and the truncation error is on the order of O(h^(n+1)). If the error is O(h), cutting h in half would halve the error. If the error is O(h²), cutting h in half would quarter the error. Let's work through some examples.
We will use Taylor series expansions up to order n = 6 to approximate f(x) = cos(x) at x_{i+1} = π/3 centered at x_i = π/4. We know that x_{i+1} − x_i = π/3 − π/4, thus h = π/12. Using the true function, we can acquire the correct value of cos(π/3) = 0.5. The zero-order approximation is cos(π/3) ≈ cos(π/4) = 0.707106781.
We can find percent relative error:
We will now add the first derivative term
for the first-order approximation:
We will now add the second derivative term
for the second-order approximation:
Jumping ahead to
we can get the sixth-order approximation:
We can also ignore the remainder by truncating it:
We can use the Taylor series to estimate truncation errors. We can expand the function f(x_{i+1}) as a Taylor series:
f(x_{i+1}) = f(x_i) + f'(x_i)(x_{i+1} − x_i) + R_1
Let's now truncate the series after the first derivative term to get f(x_{i+1}) ≅ f(x_i) + f'(x_i)(x_{i+1} − x_i).
Then we have the following:
f'(x_i) = (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i) − R_1 / (x_{i+1} − x_i)
Using h = x_{i+1} − x_i and R_1/h = O(h), we can further simplify this as such:
f'(x_i) = (f(x_{i+1}) − f(x_i)) / h + O(h)
The truncation error is proportional to step size h. Thus, if we halve step size, we would expect the error of the derivative to also halve. More generally, this can be written as
f'(x_i) = Δf_i / h + O(h), where Δf_i = f(x_{i+1}) − f(x_i).
This is known as a finite difference, or more specifically a forward difference. There is also a backward difference, which uses the previous point before x_i to approximate the derivative,
f'(x_i) = (f(x_i) − f(x_{i−1})) / h + O(h),
and a central difference that uses points on both sides of x_i:
f'(x_i) = (f(x_{i+1}) − f(x_{i−1})) / (2h) + O(h²)
This method is more accurate.
For now, let's find finite difference approximations to estimate
for
. We will say that
and
. The true value is
.
We can now find the differences:
Central difference is significantly more accurate.
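The function and step size from the worked example above aren't reproduced here, but the following sketch (with a test function of my own) shows how the three differences compare in MATLAB:
f = @(x) exp(x);                      % assumed test function; f'(x) = exp(x)
x = 1; h = 0.1;
fwd = (f(x + h) - f(x)) / h;          % forward difference, O(h)
bwd = (f(x) - f(x - h)) / h;          % backward difference, O(h)
ctr = (f(x + h) - f(x - h)) / (2*h);  % central difference, O(h^2)
errs = abs([fwd bwd ctr] - exp(1))    % the central-difference error is much smaller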
Total numerical error is the summation of the truncation and roundoff errors. This introduces a dilemma: the best way to reduce truncation error is to reduce the step size, while the best way to reduce roundoff error is to increase the number of significant figures carried. Reducing step size can lead to subtractive cancellation, so the truncation errors would decrease at the cost of the roundoff errors increasing. Thankfully, in MATLAB this is rarely a problem because computations are carried out in double precision by default.
The central difference approximation of a first derivative can be written as
f'(x_i) = (f(x_{i+1}) − f(x_{i−1})) / (2h) − (f'''(ξ)/6) h².
Adding on roundoff error such that f(x_{i+1}) = f̃(x_{i+1}) + e_{i+1} and f(x_{i−1}) = f̃(x_{i−1}) + e_{i−1}, we have
f'(x_i) = (f̃(x_{i+1}) − f̃(x_{i−1})) / (2h) + (e_{i+1} − e_{i−1}) / (2h) − (f'''(ξ)/6) h²,
where f̃ denotes the rounded function values and e denotes the associated roundoff error. Assuming an upper bound on roundoff error ε, the maximum possible value of the difference e_{i+1} − e_{i−1} is 2ε. We will also assume that the third derivative has a maximum absolute value of M. Then, we have an upper bound on the absolute value of the total error:
|E| ≤ ε/h + (M/6) h²
From this we can differentiate with respect to h and set the result equal to zero to find an optimal step size:
h_opt = (3ε/M)^(1/3)
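A minimal numerical experiment (my own, with an assumed test function) makes this tradeoff visible: as h shrinks, the truncation error falls until roundoff takes over and the total error grows again.
f = @(x) sin(x);  x = 1;             % assumed test function with known derivative cos(1)
h = 10.^(-1:-1:-12);                  % step sizes from 1e-1 down to 1e-12
d = (f(x + h) - f(x - h)) ./ (2*h);   % central difference estimates
err = abs(d - cos(1));                % total error versus the true derivative
loglog(h, err, 'o-'), xlabel('h'), ylabel('error')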
It's difficult to come up with a concrete, generalized way to estimate error. Every type of problem is different. Two good practices are to avoid subtractive cancellation and to operate on the smallest numbers first; both help prevent loss of significance. In large computations, it can be incredibly tedious to track total numerical error. In that case, you should move forward and focus on evaluating the accuracy of your results. How errors are introduced and how much they grow during a computation is known as the numerical stability of an algorithm.
2.3 Root-finding
A nonlinear equation is any equation where the change of the output is not proportional to the change of the input. Systems of nonlinear equations are particularly important to scientists and engineers because most real systems are nonlinear in nature. In the simplest case, we might want to find the solution of f(x) = 0, because that would mean finding a root. For instance, let's find the mass of a bungee jumper with a drag coefficient of 0.25 kg/m and a velocity of 36 m/s after 4 s of free fall.
cd = 0.25; g = 9.81; v = 36; t = 4;
mp = linspace(50, 200); % candidate masses in kg
fp = sqrt(g*mp/cd).*tanh(sqrt(g*cd./mp)*t)-v;
plot(mp, fp), grid on
xlabel('Mass (kg)'); ylabel('f(m)')
title('Bungee Jumper Mass')
Visually, we can estimate that the root lies somewhere before 150. We will just say m ≈ 145 kg. We can check the validity of this using the following:
sqrt(g*145/cd)*tanh(sqrt(g*cd/145)*t)-v
That is fairly close to zero. We can also plug it into an equation to find velocity:
sqrt(g*145/cd)*tanh(sqrt(g*cd/145)*t)
This is fairly close to the given fall velocity of 36 m/s. Alternatively, we can use linear interpolation to approximate the root. Since we know there's a root somewhere between two sampled points m_1 and m_2 where the function changes sign, we can draw a straight line through (m_1, f(m_1)) and (m_2, f(m_2)) and find where it crosses the axis. Solving for where the line equals zero, we get
m_root ≈ m_1 − f(m_1)(m_2 − m_1) / (f(m_2) − f(m_1)).
We can code this up:
idx = find(fp(1:end-1).*fp(2:end) < 0); % Find the index where the function changes sign (crosses zero)
m1 = mp(idx); % First mass value, before the sign change
m2 = mp(idx+1); % Second mass value, after the sign change
f1 = fp(idx); % Gets the function value at m1
f2 = fp(idx+1); % Gets the function value at m2
approxRoot = m1 - f1.*(m2 - m1)./(f2 - f1)
This tells us that the approximate root is 142.7400. We can check the validity of this using the following:
sqrt(g*approxRoot/cd)*tanh(sqrt(g*cd/approxRoot)*t)-v
That is much closer to zero. We can also plug it into an equation to find velocity:
sqrt(g*approxRoot/cd)*tanh(sqrt(g*cd/approxRoot)*t)
This is very close to the given fall velocity of 36 m/s. The first method is imprecise and relies on a single human guess. The second uses the sampled data itself and gets much closer.
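We will solve this same problem more carefully with bracketing and open methods below, but as a quick sanity check (an addition of mine, not part of the original walkthrough) you can already hand the function to MATLAB's fzero, which is discussed later in this chapter:
cd = 0.25; g = 9.81; v = 36; t = 4;
fm = @(m) sqrt(g*m/cd).*tanh(sqrt(g*cd./m)*t) - v;
m_root = fzero(fm, 140)   % returns approximately 142.74 kg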
There are two categories of root-finding. Bracketing methods are based on two initial guesses on either side of the root. Open methods rely on at least one initial guess. What makes these better is that they are iterative. Assuming a situation with no errors, bracketing methods will converge slowly but always find a root. On the other hand, open methods can diverge but can find a root faster if there is one.
In the bungee jump example, we observed that the function changed signs on opposite sides of the root. Generally, if f(x) is a real function that is continuous on [x_l, x_u] where f(x_l) has a sign opposite to f(x_u), we can say that
f(x_l) · f(x_u) < 0.
This would mean that there is at least one real root on [x_l, x_u]. One problem with incremental methods starts with what you choose as the increment length. An increment length that is too small can result in a very long search. An increment length that is too long might miss roots. We can use the following program for an incremental search that locates the roots of a function func from xmin to xmax, and apply it to identify brackets within [3, 6] for the function f(x) = sin(10x) + cos(3x):
function xb = incsearch(func,xmin,xmax,ns)
% incsearch: incremental search root locator
% xb = incsearch(func,xmin,xmax,ns):
% finds brackets of x that contain sign changes
% of a function on an interval
% func = name of function
% xmin, xmax = endpoints of interval
% ns = number of subintervals (default = 50)
% xb(k,1) = the lower bound of the kth sign change
% xb(k,2) = the upper bound of the kth sign change
% if no brackets found, xb = []
if nargin < 3, error('at least 3 arguments required'), end
if nargin < 4, ns = 50; end % if ns blank set to 50
x = linspace(xmin,xmax,ns); % evaluation points across the interval
f = func(x); % function values at each point
nb = 0; xb = []; % xb is null unless sign change detected
for k = 1:length(x)-1
    if sign(f(k)) ~= sign(f(k+1)) % check for sign change
        nb = nb + 1;
        xb(nb,1) = x(k); % store the bracket endpoints
        xb(nb,2) = x(k+1);
    end
end
if isempty(xb) % display that no brackets were found
    disp('no brackets found')
    disp('check interval or increase ns')
else
    disp(['number of brackets:', num2str(nb)])
    disp(xb)
end
end
Plotting this reveals the root locations:
x = linspace(3, 6, 1000); % fine grid for plotting
f = sin(10*x) + cos(3*x);
plot(x,f,'b','LineWidth',1.5); hold on; grid on
xlabel('x'); ylabel('f(x)')
title('Function Plot with Root Brackets')
xb = incsearch(@(x) sin(10*x) + cos(3*x), 3, 6, 50);
number of brackets:5
3.244897959183673 3.306122448979592
3.306122448979592 3.367346938775510
3.734693877551020 3.795918367346939
4.653061224489796 4.714285714285714
5.632653061224490 5.693877551020408
x_bracket = xb(:); % bracket endpoints returned by incsearch
y_bracket = sin(10*x_bracket) + cos(3*x_bracket);
plot(x_bracket, y_bracket, 'ks', 'MarkerFaceColor', 'r')
The bisection method is a variation of the incremental search method where the interval is always divided in half. If a function changes sign over an interval, the function value at the midpoint is evaluated. The root must then lie within the half of the interval where the sign change occurs, and that subinterval becomes the interval for the next iteration. This is repeated until the required precision is met. In other words, the endpoints of the interval are moved closer and closer together until the zero is bracketed as tightly as needed. Here is the formal theorem:
For a continuous function f on [a, b] where f(a) and f(b) have opposite signs, there exists a number c in [a, b] such that f(c) = 0. If {c_n} represents the sequence of midpoints generated by the bisection method, then for each iteration n, we have
|c_n − c| ≤ (b − a) / 2^n.
The sequence {c_n} gets exponentially closer to the true root c every iteration. That is,
lim_{n→∞} c_n = c.
We can try this out for f(x) = x·sin(x) − 1 on the initial interval [0, 2]. For each iteration k, we have left endpoint a_k, right endpoint b_k, and midpoint c_k = (a_k + b_k)/2. The function value at the midpoint is f(c_k). If f(a_k)·f(c_k) < 0, the root is between a_k and c_k. Thus b_{k+1} = c_k. If this isn't true, the root is between c_k and b_k, so a_{k+1} = c_k. And then this repeats.
f = @(x) x.*sin(x) - 1; % function from this example
a = 0; b = 2; % initial bracket
fprintf('k\t a_k\t\t c_k\t\t b_k\t\t f(c_k)\n');
for k = 0:8
    c = (a + b)/2; fc = f(c); % midpoint and its function value
    fprintf('%d\t %.6f\t %.6f\t %.6f\t %.6f\n', k, a, c, b, fc);
    if f(a)*fc < 0, b = c; else, a = c; end % keep the half containing the sign change
end
0 0.000000 1.000000 2.000000 -0.158529
1 1.000000 1.500000 2.000000 0.496242
2 1.000000 1.250000 1.500000 0.186231
3 1.000000 1.125000 1.250000 0.015051
4 1.000000 1.062500 1.125000 -0.071827
5 1.062500 1.093750 1.125000 -0.028362
6 1.093750 1.109375 1.125000 -0.006643
7 1.109375 1.117188 1.125000 0.004208
8 1.109375 1.113281 1.117188 -0.001216
The midpoint is an approximation of the root. The actual root is approximately 1.1142. As you can see, f(c_k) tells us how close we are getting after each iteration. Even though its sign flips back and forth, remember that we care about the magnitude, which keeps getting smaller.
Another method is known as false position, which is very similar to linear interpolation. Rather than bisecting the interval to find a midpoint, it locates the root by joining (x_l, f(x_l)) and (x_u, f(x_u)) with a straight line. Using similar triangles, the intersection of the straight line with the x-axis can be estimated as
x_r = x_u − f(x_u)(x_l − x_u) / (f(x_l) − f(x_u)).
x_r then replaces whichever of the two initial guesses, x_l or x_u, yields a function value with the same sign as f(x_r). This means that the values of x_l and x_u will always bracket the true root. The process is repeated until it yields a sufficient approximation.
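Here is a bare-bones sketch of that update loop, written by me as an illustration rather than taken from the text, using x·sin(x) − 1 on [0, 2] as an assumed test function with a valid starting bracket:
f = @(x) x.*sin(x) - 1;   % assumed example function with a root in [0, 2]
xl = 0; xu = 2;
for k = 1:20
    xr = xu - f(xu)*(xl - xu)/(f(xl) - f(xu)); % false-position estimate
    if f(xl)*f(xr) < 0      % root lies in [xl, xr]
        xu = xr;
    else                    % root lies in [xr, xu]
        xl = xr;
    end
end
xr                          % converges toward the root near 1.1142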
Let's use the bisection and false position methods to solve the bungee problem.
First, we need to guess two values of the unknown. Based on the Bungee Jumper Mass graph, we can see that the function has to change sign between 50 and 200. We could narrow down the region more, but let's be safe and use those two values. Our initial estimate of the root is the midpoint:
x_r = (50 + 200)/2 = 125
Setting the exact value of the root to be 142.7376 means that our percent relative error is about 12.4%.
Let's now multiply the function value at the lower bound by the function value at the midpoint:
f(50)·f(125) ≈ (−4.58)(−0.40) > 0
Because the product is greater than zero, we know that no sign change occurs here. We now know that the root must be between the new lower bound 125 and the upper bound 200, which stays the same. Thus, we have our next iteration
x_r = (125 + 200)/2 = 162.5
Continuing on to the next iteration, the sign change now lies between 125 and 162.5, so we have
x_r = (125 + 162.5)/2 = 143.75
This is a pretty reasonable approximation, but you can continue if you need more precision.
Let's now repeat it using the false position method starting with guesses of
and
. This means that
and
. We now have our first iteration:
For our second iteration, we start with
. We now know that
is the upper limit. Evaluating our function values, we have
and
. Then we can carry out the second iteration:
.
For the third iteration, we have
and
. Then we can plug in to get
While false position generally performs better than bisection, there are some cases, like this one, where that is not true. If the initial interval is not chosen carefully, false position can be slow because one endpoint gets stuck and the estimates creep toward the root from only one side. This happens if the function value at one endpoint is much further away from zero than the function value at the other, or vice-versa. Functions with significant curvature can be difficult for even the false position method. On the other hand, bisection struggles with flat functions.
For bracketing methods, the root is bounded by a lower and upper bound. Iterations of these methods, as you've now seen, will always converge towards the true value of the root. Let's move on to open methods, where that is not necessarily the case. Instead, your values do not have to bracket the root, and the iterations can diverge. That being said, when they do converge, they converge much faster.
Perhaps the most famous of all root-finding methods is the Newton-Raphson method. You essentially take an initial guess x_i near the root and then find the derivative at that point. What you are doing is using a tangent line to predict the root of a function. Geometrically, this means extending the tangent line at (x_i, f(x_i)) until it intersects the x-axis. The formula is
x_{i+1} = x_i − f(x_i) / f'(x_i).
Let's say we have
. Our first guess will be
. The first derivative is simply
. We can now evaluate the formula:
We can then iterate through:
i | x_i | ε_t (%) |
0 | 0 | 100 |
1 | | |
2 | | |
3 | | |
4 | | |
This converges very fast.
Let's work through another example. We want to find the root of
. We will pick
. Then we can iterate through:
For
and
, I had to include more significant digits to showcase the difference! These are identical for the first eight decimal places, which is a sufficient answer for the root. We can now try evaluating the bungee jumper example with the Newton-Raphson method. Recall that the function to be evaluated is
f(m) = sqrt(g·m/c_d) · tanh(sqrt(g·c_d/m) · t) − v
We can now use the following program:
function [root, ea, iter] = newtraph(func, dfunc, xr, es, maxit, varargin)
% newtraph: Newton-Raphson root location zeroes
% [root, ea, iter] = newtraph(func, dfunc, xr, es, maxit, p1, p2, ...):
% Uses Newton-Raphson method to find the root of func
% func = function handle for f(x)
% dfunc = function handle for derivative f'(x)
% xr = initial guess
% es = desired relative error (default = 0.0001%)
% maxit = maximum allowable iterations (default = 50)
% p1, p2, ... = additional parameters passed to func
% root = estimated root
% ea = approximate relative error (%)
% iter = number of iterations
if nargin < 3
    error('At least 3 input arguments required');
end
if nargin < 4 || isempty(es), es = 0.0001; end
if nargin < 5 || isempty(maxit), maxit = 50; end
iter = 0; ea = 100; % initialize the iteration counter and error
while (ea > es && iter < maxit)
    xr_old = xr;
    xr = xr_old - func(xr_old) / dfunc(xr_old); % Newton-Raphson update
    iter = iter + 1;
    if xr ~= 0
        ea = abs((xr - xr_old) / xr) * 100;
    end
end
root = xr;
end
y = @(m) sqrt(9.81*m/0.25).*tanh(sqrt(9.81*0.25./m)*4) - 36;
dy = @(m) (1/2)*sqrt(9.81/(m*0.25)).*tanh(sqrt(9.81*0.25./m)*4) - ( (1/2)*4*9.81./(2*m) ).*sech(sqrt(9.81*0.25./m)*4).^2;
newtraph(y, dy, 140, 0.00001)
One problem with the Newton-Raphson method is that you need to be able to calculate the derivative. Some functions have very complicated and long derivatives, which can potentially hurt performance. The solution is to use a secant line instead of a tangent line. Aptly, this is known as the secant method:
x_{i+1} = x_i − f(x_i)(x_{i−1} − x_i) / (f(x_{i−1}) − f(x_i))
Let's practice by finding the root of
. Let
and
. Then we can iterate through:
Notice how we are at the point where we have two iterations that agree with each other up to a certain number of decimal places. We have found a root sufficiently close to the true value which is
.
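Since the individual iterates of this example are not reproduced above, here is a generic secant loop of my own, using the bungee function from earlier in the chapter as an assumed test case with two starting guesses:
g = 9.81; cd = 0.25; t = 4; v = 36;
f = @(m) sqrt(g*m/cd).*tanh(sqrt(g*cd./m)*t) - v;
x0 = 50; x1 = 200;   % two initial guesses; no derivative is needed
for k = 1:10
    x2 = x1 - f(x1)*(x0 - x1)/(f(x0) - f(x1)); % secant update
    x0 = x1;  x1 = x2;
end
x1                   % approaches the root near 142.74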
There is something called the order of a root-finding algorithm. It refers to how quickly the method will converge. Generally, the order α is equal to the natural log of one error ratio over the natural log of another error ratio:
α ≈ ln(E_{n+1}/E_n) / ln(E_n/E_{n−1})
If we have the following errors
,
,
, the order could then be calculated as such:
This means that Newton's method converges quadratically (α ≈ 2). However, this isn't always the case; remember that Newton's method doesn't even necessarily converge.* We can also find the order of the secant method using the same formula as before. We will use
,
, and
. Thus we have
One interesting fact is that the order of bisection is α = 1, i.e., linear convergence. Compare this with the open methods: α = 2 for Newton-Raphson and α ≈ 1.618 for the secant method. Higher order means faster convergence, so if the circumstances allow for it, Newton's method is the best here. But if, let's say, the derivative is too complicated, you can try the secant method. And if you want the slower but always reliable method, go with bisection.
* As long as you pick a value sufficiently close to the root, it should converge.
What if you wanted a compromise? Well, there are just so many root-finding algorithms out there. The hybrid approach is known as Brent's method, which comes closest to having the reliability of the bracketing methods with the speed of the open methods. To begin, let's go through some history. Richard Brent's method is actually based on an earlier algorithm known as Dekker's method from Theodorus Dekker. His method starts similarly to some others: you have a root on the interval [a, b]. The sign of f(a) has to be opposite to the sign of f(b). If |f(a)| < |f(b)|, you have to swap a and b because you want to keep b as the better choice. Then, c becomes the previous value of a. We have a midpoint m = (b + c)/2, and we try to compute the secant intercept s using b and c. After updating your variables, you will repeat this process again until f(b) is zero or the bracket becomes smaller than your tolerance. Let's work through an example for
with
,
, and
:
a | b | | Midpoint m | Secant Intercept s | | | or |
1 | 2 | 1 | | | | 1 | |
2 | | 2 | | | 1 | | |
2 | | | | | 1 | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | 0 |
We finally have two values such that
. One cool thing about Dekker's method is that it includes handling for roundoff errors, overflow, and underflow. The main reason for this is that it can always fall back to bisection which is more numerically robust. Interestingly, one of the people who originally commented on Dekker's paper was George E. Forsythe, who I discuss in Introduction. To paraphrase, he praises Dekker's paper for
- making no assumptions about the smoothness of f
- being at least as good as bisection
- having the same fast convergence properties as the secant method for smooth functions
- not having the dangers of the secant method when there are multiple roots
Overall, he regards it as "safe and effective" but "also rather subtle" and thus "[meriting] careful study." Later came Brent's method, which tries interpolating with a parabola. If that doesn't work, it falls back to the secant method, and then bisection. The most important aspect of his method is known as inverse quadratic interpolation, which is quite similar to the secant method. First, you find the polynomial with the lowest degree that fits a set of points. With inverse quadratic interpolation, you fit a sideways parabola (x as a quadratic function of y) through three points. The main problems with inverse quadratic interpolation are that it is unreliable on its own, fails when two of the y values are identical, and requires you to start close. Thus, it is only ever really used as a part of Brent's method. Brent's method uses the same values a, b, and c as Dekker's method, but it tries to compute s using inverse quadratic interpolation through those three points. If this fails, it tries to compute s using b and c with the secant method. If all else fails, s becomes the midpoint m, which is bisection. After updating the points, you test whether s lies between b and m. Then you repeat until f(b) is zero, f(s) is zero, or the bracket is smaller than your tolerance. This prevents a lot of potential iteration. His method also has logic to handle roundoff error, overflow, and underflow. It also handles functions that aren't well-behaved with a new variable d, the previous value of c. Generally, it always has the option of falling back to bisection. To summarize, note that
- Dekker's method is just as fast as secant method when the function is well-behaved.
- Brent's method is just as fast as inverse quadratic interpolation and secant when well-behaved.
- Dekker's method and Brent's method both converge faster than linearly.
- Brent's method is at least as fast as Dekker's for well-behaved functions and is guaranteed to converge in a reasonable number of steps for any function.
In MATLAB, the fzero function uses the Brent-Dekker method. It lets you search for an interval with a sign change. This program from Cleve Moler essentially starts with your initial guess and then searches outward to find an interval where the function changes sign. In the example, we want to find the root of the function f(x) = 2 − x². We know that this really is at x = √2 ≈ 1.4142. We are starting with the point x = 1, and then expanding outward until we get the sign change. Thus, the root is bracketed in the interval [0.547452, 1.45255] shown below. After that, Brent's method does its work and converges on 1.4142.
function [a,b] = signchange(f,x)
% SIGNCHANGE [a,b] = signchange(f,x) seeks an
% interval [a,b] where f(x) changes sign.
a = x; b = x;
dx = x/50; % initial search step based on the guess
while sign(f(a)) == sign(f(b))
    dx = sqrt(2)*dx; % expand the search interval geometrically
    a = x - dx;
    b = x + dx;
end
end
f = @(x) 2 - x.^2; % the function from this example
opt = optimset('display','iter');
fzero(f,1,opt)
Search for an interval around 1 containing a sign change:
Func-count a f(a) b f(b) Procedure
1 1 1 1 1 initial interval
3 0.971716 1.05577 1.02828 0.942631 search
5 0.96 1.0784 1.04 0.9184 search
7 0.943431 1.10994 1.05657 0.883663 search
9 0.92 1.1536 1.08 0.8336 search
11 0.886863 1.21347 1.11314 0.760926 search
13 0.84 1.2944 1.16 0.6544 search
15 0.773726 1.40135 1.22627 0.496252 search
17 0.68 1.5376 1.32 0.2576 search
19 0.547452 1.7003 1.45255 -0.109897 search
Search for a zero in the interval [0.547452, 1.45255]:
Func-count x f(x) Procedure
19 1.45255 -0.109897 initial
20 1.3976 0.0467142 interpolation
21 1.41399 0.000631974 interpolation
22 1.41421 -9.95025e-08 interpolation
23 1.41421 7.86149e-12 interpolation
24 1.41421 -4.44089e-16 interpolation
25 1.41421 -4.44089e-16 interpolation
Zero found in the interval [0.547452, 1.45255]
2.4 Linear Algebra
Let's say that three clowns at the circus are dangling from the ceiling. They are held up by ropes with unstretched lengths of 20 m, 20 m, and 20 m (top, middle, and bottom). Each rope is fully extended but unstretched. After they are released, the clowns will eventually come to their equilibrium positions. We want to compute the displacement of each of the clowns. We will assume that each rope behaves as a linear spring that follows Hooke's law.
When fully extended, they look like this:
We can assign each clown a free-body diagram as such:
We can then use Newton's second law to balance the forces for each jumper:
m_1 d²x_1/dt² = m_1 g + k_2(x_2 − x_1) − k_1 x_1
m_2 d²x_2/dt² = m_2 g + k_3(x_3 − x_2) + k_2(x_1 − x_2)
m_3 d²x_3/dt² = m_3 g + k_3(x_2 − x_3)
where m_i is the mass of jumper i in kilograms, t is time in seconds, k_j is the spring constant for rope j in Newtons per meter, x_i is the displacement of jumper i measured downward from the equilibrium position in meters, and g is gravitational acceleration in meters per second squared. We want the steady-state solution, so we will set the second derivatives to zero. Then, we have a system of linear equations:
(k_1 + k_2)x_1 − k_2 x_2 = m_1 g
−k_2 x_1 + (k_2 + k_3)x_2 − k_3 x_3 = m_2 g
−k_3 x_2 + k_3 x_3 = m_3 g
We will now solve this problem in MATLAB using these givens:
Jumper | Mass (kg) | Spring Constant (N/m) | Unstretched Cord Length (m) |
Top (1) | 60 | 50 | 20 |
Middle (2) | 70 | 100 | 20 |
Bottom (3) | 80 | 50 | 20 |
We can now substitute these into our system:
150x_1 − 100x_2 = 588.6
−100x_1 + 150x_2 − 50x_3 = 686.7
−50x_2 + 50x_3 = 784.8
We will now enter the coefficient matrix and the right-hand-side vector:
K = [150, -100, 0; -100, 150, -50; 0, -50, 50]
150 -100 0
-100 150 -50
0 -50 50
mg = [588.6; 686.7; 784.8]
588.6000
686.7000
784.8000
Let's solve:
x = K\mg
Equivalently, this solves Kx = mg, giving displacements x_1 = 41.202 m, x_2 = 55.917 m, and x_3 = 71.613 m.
The unstretched ropes between each clown are 20 m, so their initial positions below the ceiling are 20 m, 40 m, and 60 m. Lastly, their final positions can be calculated as the initial positions plus the displacements: 61.202 m, 95.917 m, and 131.613 m.
These results make sense. The first rope stretches the longest because it has the lowest spring constant and is carrying the weight of all three clowns. While the second rope is being pulled on by more weight force than the third, it has a higher spring constant and therefore stretches less. Thus, the second and third ropes stretch about the same.
Recall that Gaussian elimination is when you combine equations to eliminate unknowns. The general steps are as follows:
- Combine the coefficient matrix and the right-hand-side vector into an augmented matrix.
- Forward elimination
- Back substitution
We will perform Gaussian elimination on the following augmented matrix as an example:
[ 1  3  4 | 1 ]
[ 2  1  3 | 2 ]
[ 4  7  2 | 3 ]
Multiply row 1 by 2 and subtract from row 2 to eliminate the first entry in row 2:
[ 1  3  4 | 1 ]
[ 0 -5 -5 | 0 ]
[ 4  7  2 | 3 ]
Then multiply row 1 by 4 and subtract from row 3:
[ 1  3  4  |  1 ]
[ 0 -5 -5  |  0 ]
[ 0 -5 -14 | -1 ]
Then subtract row 2 from row 3 to eliminate the second entry in the third row:
[ 1  3  4 |  1 ]
[ 0 -5 -5 |  0 ]
[ 0  0 -9 | -1 ]
The matrix is now in upper triangular form. We can rewrite it as a system:
x_1 + 3x_2 + 4x_3 = 1
−5x_2 − 5x_3 = 0
−9x_3 = −1
Back substitution yields x_3 = 1/9 ≈ 0.1111, x_2 = −1/9 ≈ −0.1111, and x_1 = 8/9 ≈ 0.8889.
Without the likes of pivoting, we can use this program for what is known as naive Gauss elimination:
function x = GaussNaive(A, b)
% GaussNaive: naive Gaussian elimination (no pivoting)
% A = coefficient matrix, b = right-hand side vector
[m, n] = size(A);
if m ~= n, error('Matrix A must be square!'); end
nb = n + 1; % Total number of columns in augmented matrix
Aug = [A, b(:)]; % Build the augmented matrix
for k = 1:n-1 % Forward elimination
    for i = k+1:n
        factor = Aug(i, k) / Aug(k, k);
        Aug(i, k:nb) = Aug(i, k:nb) - factor * Aug(k, k:nb);
    end
end
x = zeros(n, 1); % Back substitution
x(n) = Aug(n, nb) / Aug(n, n);
for i = n-1:-1:1
    x(i) = (Aug(i, nb) - Aug(i, i+1:n) * x(i+1:n)) / Aug(i, i);
end
end
A = [1 3 4; 2 1 3; 4 7 2];
Naive Gaussian elimination fails for some matrices because, to normalize a row, you may have to divide by zero. You may also have to divide by a pivot element that is very small, which can result in roundoff error. Thus, before each row is normalized, you should find the coefficient with the largest magnitude in the column at or below the pivot and make that the pivot element. You can then switch the rows so that the largest element is now the pivot element. This is known as partial pivoting. If you search both columns and rows and then switch, it would be known as complete pivoting. Complete pivoting is rarely used because it adds a lot of computational complexity for little benefit. Most of the positives come from partial pivoting alone. Thus, to avoid dividing by zero and to minimize roundoff error, partial pivoting is the best strategy.
In MATLAB, you can easily perform Gaussian elimination with pivoting using the backslash (\) operator:
This is an incredibly complex operator that analyzes the structure of the coefficient matrix and then applies the proper technique for the system.
One interesting quirk of MATLAB is in the operation A\A, which solves the matrix equation AX = A. For a square, nonsingular A, the unknown matrix X is equivalent to the identity matrix I. For an underdetermined system, where A has more columns than rows, A\A is not necessarily I. For an overdetermined system, where A has more rows than columns and full column rank, A\A yields I. There are some special cases for A\A in MATLAB. In the case of a square and nonsingular* matrix,
A\A = I
* Nonsingular simply means full rank; for a square matrix, that means the rank equals its dimension n.
If A contains zeros on the diagonal or is the zero matrix,
A\A = NaN
If A has rank less than n,
A\A ~= I
A rank-deficient matrix cannot produce an identity through the backslash operator; after all, rank-deficient matrices aren't invertible.
We can test A\A on magic squares to see how rank affects the solution. Odd-sized magic squares are full rank and give results that are very close to the identity matrix:
nmax = 25; n_odd = 3:2:nmax; % odd magic-square sizes (nmax chosen here as 25)
r = zeros(length(n_odd),1);
e = zeros(length(n_odd),1);
for idx = 1:length(n_odd)
    A = magic(n_odd(idx));
    r(idx) = rank(A); e(idx) = norm(A\A - eye(n_odd(idx)));
end
subplot(2,1,1), bar(n_odd, r, 'b')
xlabel('n (Odd Matrix Size)')
title('Rank vs Odd Matrix Size')
set(gca, 'Position', [0.13, 0.6, 0.775, 0.3])
subplot(2,1,2), bar(n_odd, e, 'r')
xlabel('n (Odd Matrix Size)')
title('Error (norm(A\\A - I))')
set(gca, 'Position', [0.13, 0.15, 0.775, 0.3])
Singly even magic squares (n divisible by 2 but not 4) are rank-deficient, so their A\A deviates from the identity matrix. Notice the computational error:
n_singly_even = 2:2:nmax;
n_singly_even = n_singly_even(mod(n_singly_even,4) ~= 0);
r = zeros(length(n_singly_even),1);
e = zeros(length(n_singly_even),1);
for idx = 1:length(n_singly_even)
    A = magic(n_singly_even(idx));
    r(idx) = rank(A); e(idx) = norm(A\A - eye(n_singly_even(idx)));
end
subplot(2,1,1), bar(n_singly_even, r, 'b')
xlabel('n (Singly Even Sizes)')
title('Rank vs Singly Even Matrix Size')
set(gca, 'Position', [0.13, 0.6, 0.775, 0.3])
subplot(2,1,2), bar(n_singly_even, e, 'r')
xlabel('n (Singly Even Sizes)')
title('Error (norm(A\\A - I))')
set(gca, 'Position', [0.13, 0.15, 0.775, 0.3])
Doubly even magic squares (n divisible by 4) are even more rank-deficient. I was unable to reproduce the code from Moler here, so I am just using his image. The error plots are an absolute mess:
Going back to A\A, there's a function called pinv that returns the Moore-Penrose pseudoinverse, which is used when a system doesn't have a unique solution. That is, it is used for underdetermined or overdetermined systems. Just like the backslash operator, it can be used in numerical linear algebra, but it is specifically useful when a system has many solutions and you want to find the one with the smallest norm. It doesn't magically fix rank-deficiency. Instead, it avoids the NaN results the backslash operator can produce by returning the minimum-norm least squares solution for rank-deficient matrices:
A = [1 2; 2 4]; % Rank-deficient
b = [3; 6]; % Lies in the column space of A so solutions exist
disp('Solution using backslash:')
Solution using backslash:
disp('Solution using pseudoinverse:')
Solution using pseudoinverse:
You should only use it when you want a least squares solution or if you want to find the solution vector with the smallest length.
If that was a little abstract, don't worry about it.
Let's look at an example where we use Gaussian elimination to solve the following system:
0.0003x_1 + 3.0000x_2 = 2.0001
1.0000x_1 + 1.0000x_2 = 1.0000
The first pivot element a_11 = 0.0003 is very close to zero. We will divide equation (1) by 0.0003 to get
x_1 + 10000x_2 = 6667,
which can then be used to eliminate x_1 from equation (2). It now is
−9999x_2 = −6666,
which can be solved to get x_2 = 2/3. We can now substitute this into equation (1) to get
x_1 = (2.0001 − 3(2/3)) / 0.0003 = 1/3.
We will now table the results:
Significant Figures | x_2 | x_1 | Percent Relative Error of x_1 |
3 | 0.667 | −3.33 | 1099 |
4 | 0.6667 | 0.0000 | 100 |
5 | 0.66667 | 0.30000 | 10 |
6 | 0.666667 | 0.330000 | 1 |
7 | 0.6666667 | 0.3330000 | 0.1 |
Subtractive cancellation makes the result very dependent on the number of significant figures. Instead, we can solve the equations in reverse order to allow the row with the larger pivot element to be normalized:
1.0000x_1 + 1.0000x_2 = 1.0000
0.0003x_1 + 3.0000x_2 = 2.0001
Elimination and substitution still yields x_2 = 2/3. We can now get x_1 = 1 − 2/3 = 1/3. We will now table the pivoted results:
Significant Figures | x_2 | x_1 | Percent Relative Error of x_1 |
3 | 0.667 | 0.333 | 0.1 |
4 | 0.6667 | 0.3333 | 0.01 |
5 | 0.66667 | 0.33333 | 0.001 |
6 | 0.666667 | 0.333333 | 0.0001 |
7 | 0.6666667 | 0.3333333 | 0.00001 |
The pivot strategy has much less error. In MATLAB, we can first solve this using naive Gaussian elimination:
eq1 = [0.0003 3.0000; 1.0000 1.0000]; % coefficient matrix
eq2 = [2.0001; 1.0000]; % right-hand-side vector
x = GaussNaive(eq1, eq2)
0.333333333334037
0.666666666666667
We can instead use pivoting to yield a far more accurate result:
x = eq1\eq2
0.333333333333333
0.666666666666667
Gaussian elimination is essentially a way to get a matrix into a form you know as row echelon form (REF) where
- All nonzero rows are at the bottom.
- Each leading entry (pivot) of a row is in a column to the right of the leading entry of the row above it.
- All entries in a column below a leading entry are zeros.
A matrix is in reduced row echelon form (RREF) if
- It is in row echelon form.
- The leading entry in each nonzero row is 1.
- Each leading 1 is the only nonzero entry in its column.
Both REF and RREF make solving basic linear systems much easier. In MATLAB, we have the function rref. We can use it to solve this system:
A = [1 3 4; 2 1 3; 4 7 2];
b = [1; 2; 3]; % right-hand side from the earlier example
Ab = [A, b]; % augmented matrix
[rref_Ab, pivot] = rref(Ab)
1.000000000000000 0 0 0.888888888888889
0 1.000000000000000 0 -0.111111111111111
0 0 1.000000000000000 0.111111111111111
Thus, the first three columns are fully reduced and the fourth column gives us our solution x to Ax = b.
RREF is generally not used in scientific or engineering applications because it's not computationally efficient for very large systems.
By now, hopefully you are beginning to realize that linear algebra is generally easy in MATLAB. The hardest part of it all is correctly modeling your scientific and engineering problems as systems of linear equations.
LU (Lower Upper) factorization involves decomposing a coefficient matrix into two triangular matrices L and U. LU decomposition is very similar to Gaussian elimination; the only difference is that you do not work with the augmented matrix. It can let you simplify the solution to a system of linear equations. Solving Ax = b reduces to solving Ly = b and then Ux = y, because then LUx = Ax = b. In Gaussian elimination, you transform a matrix A to upper triangular form U in n − 1 stages. For each stage k during Gaussian elimination, multiples of row k are added to the later rows to eliminate the elements below the diagonal in column k:
a_ij ← a_ij − l_ik a_kj,
where l_ik = a_ik / a_kk is the multiplier used to eliminate below the pivot. These multipliers l_ik become the subdiagonal entries in L while the final upper triangular matrix becomes U.
We can decompose our matrix A using lu:
[L,U] = lu(A) % This calculates the multipliers, updates rows, and builds LU
0.250000000000000 -0.500000000000000 1.000000000000000
0.500000000000000 1.000000000000000 0
1.000000000000000 0 0
4.000000000000000 7.000000000000000 2.000000000000000
0 -2.500000000000000 2.000000000000000
0 0 4.500000000000000
Then we can solve it as follows:
% Now for forward and back substitution
y = L\b
3.000000000000000
0.500000000000000
0.500000000000000
x = U\y
0.888888888888889
-0.111111111111111
0.111111111111111
Just as with Gaussian elimination, partial pivoting is important in LU factorization. Use the following steps (a short MATLAB sketch follows this list):
- The LU factorization with pivoting of a matrix A is PA = LU. The upper triangular matrix U is generated by elimination with partial pivoting while the multipliers are stored in L, and the permutation matrix P keeps track of the row switches.
- This step is forward substitution. We need to adjust b to match the row swaps. The matrices L and P are used to perform elimination on b and generate the right-hand-side vector d. The system now looks like Ld = Pb.
- Using back substitution, we now solve Ux = d for the solution x.
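In MATLAB, lu can return the permutation matrix directly, so the three steps map onto three short lines. This sketch (variable names are mine) reuses the matrix A factored above together with the right-hand side b = [1; 2; 3] implied by the earlier forward- and back-substitution results:
A = [1 3 4; 2 1 3; 4 7 2];  b = [1; 2; 3];
[L, U, P] = lu(A);   % PA = LU with partial pivoting
d = L \ (P*b);       % forward substitution on the permuted right-hand side
x = U \ d            % back substitution recovers the same solution as before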
Let's look at an example of LU factorization with pivoting. We will compute the LU factorization for the same system, in the form Ax = b, that we did Gaussian elimination with:
A = [0.0003 3.0000; 1.0000 1.0000], b = [2.0001; 1.0000]
We will set up the permutation matrix:
P = [1 0; 0 1]
Pivoting is necessary, so we will switch the rows of both:
A = [1.0000 1.0000; 0.0003 3.0000], P = [0 1; 1 0]
We can now perform elimination. We will eliminate a_21 = 0.0003 by subtracting the factor l_21 = 0.0003/1 = 0.0003 times the first row from the second row of A. As a result, we can find the new value of a_22 = 3 − 0.0003(1) = 2.9997:
L = [1 0; 0.0003 1], U = [1 1; 0 2.9997]
We will now reorder b according to P. This is a permutation:
Pb = [1.0000; 2.0001]
Now for forward substitution:
Ld = Pb
Solving yields d_1 = 1 and d_2 = 2.0001 − 0.0003(1) = 1.9998. The system is currently
Ux = d, that is, [1 1; 0 2.9997][x_1; x_2] = [1; 1.9998]
And lastly, applying back substitution gets us x_2 = 1.9998/2.9997 ≈ 0.66667 and x_1 = 1 − x_2 ≈ 0.33333.
When solving linear systems, LU factorization isn't always the best choice. As a matter of fact, QR factorization is often more numerically stable. QR factorization is where you break a matrix A into two parts. An m×n matrix A with linearly independent columns can be factored as A = QR, where Q is an m×n matrix with orthonormal columns (QᵀQ = I) and R is an n×n upper triangular matrix with nonzero diagonal elements. For example, we can factor the 4×3 matrix below into the form A = QR.
Coding this up, we get
A = [-1, -1, 1; 1, 3, 3; -1, -1, 5; 1, 3, 7];
[Q, R] = qr(A); % Full QR factorization via Householder reflections
disp(Q);
-0.500000000000000 -0.500000000000000 0.500000000000000 0.500000000000000
0.500000000000000 -0.500000000000000 0.500000000000000 -0.500000000000000
-0.500000000000000 -0.500000000000000 -0.500000000000000 -0.500000000000000
0.500000000000000 -0.500000000000000 -0.500000000000000 0.500000000000000
disp(R);
2.000000000000000 4.000000000000000 2.000000000000000
0 -2.000000000000000 -8.000000000000000
0 0 -4.000000000000000
0 0 0
Notice that MATLAB flips the signs of some of the elements in R. This is because MATLAB prioritizes numerical stability and computational speed. In math textbooks, it's common for the diagonal elements of R to be positive, which makes the factorization easy to recognize. There are a few main algorithms for QR factorization: classical Gram-Schmidt, which is the one you likely know from your linear algebra class, modified Gram-Schmidt, and Householder. Both Gram-Schmidt variants are poorly suited to scientific computation because they lose orthogonality to rounding error (the classical version much more so than the modified one). The Householder algorithm is the most common in scientific computing and is what MATLAB's qr function uses. Below is a table of each algorithm's complexity in floating-point operations (flops):
Algorithm | Complexity (FLOPS) |
Gram-Schmidt | |
Modified Gram-Schmidt | |
Householder | |
The whole idea of the Householder algorithm is to reflect an entire column and its corresponding rows across a hyperplane. A hyperplane is the generalization of a flat plane to any dimension: a subspace whose dimension is one less than that of the surrounding space. One example would be a line in 2D or a plane in 3D. Hyperplanes specifically are used because reflection across one is the easiest way to zero out multiple components of a vector in one shot while preserving lengths and orthogonality, since a reflection is an orthogonal linear map.
The purpose of this is to zero out all of the entries below the diagonal of a column:
where Q-factor
is orthogonal and constructed as a product of orthogonal matrices
. Q represents the first n columns of the full matrix and
denotes the remaining
columns, turning Q into a full square orthogonal matrix. It is also known as the residual. Each H term is an m-by-m orthogonal reflector given by H = I - 2vv', where v is a unit vector. H is orthogonal and symmetric. The product Hx is a reflection of x across a hyperplane perpendicular to v. The way the algorithm works is that for a column k from row k down to the bottom, it builds a vector v_k
that defines a Householder reflector. This reflector is then applied to the rest of the matrix, zeroing out all entries below the diagonal in that column. After that, the matrix is updated. In other words, for the first column of R, you start with column 1, move it towards a multiple of the first basis vector, normalize the result to get
, and then apply the reflection to the matrix. This process repeats for each column until the matrix becomes upper triangular. For example, given the previous matrix
We compute reflectors
that triangularize A:
The first column of R is found by first computing a reflector that maps the first column of A to the basis vector
. We have the first column:
where w is an intermediate vector used to construct the Householder reflector.
Then, we reflect by overwriting A with the product of
and A to get
.
The next two updates for the second and third columns of R are
.
MATLAB does not store the Q-matrix during this process. It only holds the vectors
to save storage.
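If you are curious what this looks like in code, here is a minimal, simplified sketch of Householder QR applied to the same A as above; in practice you should just call MATLAB's qr, which is far more robust:
A = [-1, -1, 1; 1, 3, 3; -1, -1, 5; 1, 3, 7];
[m, n] = size(A);
R = A;
Q = eye(m);
for k = 1:n
    x = R(k:m, k);
    v = x;
    s = 1; if x(1) < 0, s = -1; end
    v(1) = v(1) + s*norm(x); % Send column k to a multiple of the first basis vector
    v = v / norm(v);         % Unit vector defining the reflector H = I - 2vv'
    R(k:m, k:n) = R(k:m, k:n) - 2*v*(v'*R(k:m, k:n)); % Apply the reflector to the rest of R
    Q(:, k:m) = Q(:, k:m) - 2*(Q(:, k:m)*v)*v';       % Accumulate Q as a product of reflectors
end
% Q*R reproduces A up to roundoff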
Before we move on, here are some useful tables. If you want to work with these more thoroughly, check out MATLAB Introductory Guide which is linked in Introduction.
Operator/Function | Purpose |
3*x | Multiply each element of x by 3 |
x+2 | Add 2 to each element of x |
x+y | Element-wise addition of vectors x and y |
A*y | Product of a matrix A and a vector y |
A*B | Product of a matrix A and a matrix B |
A.*B | Element-wise product of two matrices A and B |
A^B | Square matrix A raised to the power B |
A.^B | Each element of A raised to the corresponding power in B |
A' | Transpose of A |
inv(A) | Inverse of A |
det(A) | Determinant of A |
eig(A) | Eigenvalues of A |
[V,D] = eig(A) | Returns a diagonal matrix D of eigenvalues and a matrix V whose columns are the corresponding eigenvectors |
size(A) | Size of A |
Traversal | Gets |
x(2:10) | The 2nd to the 10th elements of x |
x(2:end) | The 2nd to the last element of x |
x(1:3:end) | Every third element of x from the 1st to the last |
x(end) | The last element of x |
A(5,:) | The 5th row of A |
A(:,5) | The 5th column of A |
A(5, 1:3) | The 1st to 3rd elements in the 5th row |
A(end,:) | The last row of A |
A(:,end) | The last column of A |
Up until now, we've focused on very direct methods that stem from the strategies you may have learned in your first linear algebra class. However, when in the domain of numerical linear algebra, iterative methods tend to reign supreme. The most commonly used iterative method is known as the Gauss-Seidel method.
We have a 3x3 linear system
that looks like
Assuming the diagonal elements are all nonzero, we have the following at each iteration:
where j and
are the current and previous iterations. Similarly to the process for root-finding, initial guesses must be made for each x. To make things simple, we can assume that they are all zero. We can then substitute those zeros into our equation for
to calculate a new value for
. We then substitute
and
into
to yield a new value for
. After repeating that for
, this loop is done. We would then repeat the process until we converge to a solution that is close enough to the true value. The criterion is that for each
,
Let's work through an example where we use the Gauss-Seidel method to solve the following system:
The true solution to this system is
,
, and
. First, let's solve each equation for its unknown on the diagonal:
Assuming that
and
are equal to zero, we can now find that
.
Using
and
, we can find
.
Finally,
.
We can repeat for the second iteration to get the following:
We can now find the error estimates:
Recall that our equation for error estimate outputs a conservative standard for convergence. Nevertheless, additional iterations could be computed to improve our answers:
A = [3 -0.1 -0.2; 0.1 7 -0.3; 0.3 -0.2 10];
b = [7.85; -19.3; 71.4]; % Right-hand side (A times the true solution [3; -2.5; 7])
x = zeros(3,1);          % Initial guesses of zero
% tol = 1.0e-6; set this to whatever suits your needs
for k = 1:5
    x_old = x;
    for i = 1:3
        sum1 = A(i,1:i-1) * x(1:i-1);         % Uses values already updated this iteration
        sum2 = A(i,i+1:end) * x_old(i+1:end); % Uses values from the previous iteration
        x(i) = (b(i) - sum1 - sum2) / A(i,i);
    end
    fprintf('After iteration %d: (x_1, x_2, x_3) = \n', k)
    disp(x')
    % if you wanted to add some sort of convergence check...
    % if norm(x - x_old, inf) < tol
    %     fprintf('Converged after %d iterations.\n', k);
    %     break
    % end
end
After iteration 1: (x_1, x_2, x_3) =
2.616666666666667 -2.794523809523810 7.005609523809524
After iteration 2: (x_1, x_2, x_3) =
2.990556507936508 -2.499624684807256 7.000290811065760
After iteration 3: (x_1, x_2, x_3) =
3.000031897910809 -2.499987992353051 6.999999283215615
After iteration 4: (x_1, x_2, x_3) =
3.000000352469273 -2.500000035754606 6.999999988710830
After iteration 5: (x_1, x_2, x_3) =
2.999999998055568 -2.500000000456044 7.000000000049212
With that, let's work towards a general form of the Gauss-Seidel method. We have a general set of equations:
Assuming the diagonal elements are all nonzero, we have the following at each iteration:
In summation form,
,
where
.
There is another iterative technique called the Jacobi method that does not use updated values within the same iteration. For each iteration j, we have:
where
. We do not overwrite
with
because it requires all of the
values to complete the calculation. Thus, the Jacobi method requires more storage. It also converges slower than Gauss-Seidel.
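For comparison, here is a minimal Jacobi sketch on the same system as above; notice that every update uses only the previous iteration's values, which is why the whole vector x_old must be kept around:
A = [3 -0.1 -0.2; 0.1 7 -0.3; 0.3 -0.2 10];
b = [7.85; -19.3; 71.4];
x = zeros(3,1);
for k = 1:5
    x_old = x; % The entire previous iterate must be stored
    for i = 1:3
        x(i) = (b(i) - A(i,:)*x_old + A(i,i)*x_old(i)) / A(i,i);
    end
end
disp(x') % Converges to the same solution, just more slowly than Gauss-Seidel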
Note: Eigenvalues will be covered in a future project.
2.5 Systems of Nonlinear Equations
In Section 2.3, we worked through root-finding as a means of introducing nonlinear equations. Now that we have more practice with iterative methods, we will use them to solve nonlinear systems. Recall that nonlinear systems can have no solution, an infinite number of solutions, or a finite number of solutions. And to solve them, we cannot use Gaussian elimination or any other methods designed specifically for linear systems of equations. This section will focus on fixed-point iteration (which has nothing to do with fixed-point representation) and Newton's method.
In fixed-point iteration, the first step is to rewrite your equations to isolate variables. Let's say we have the following system:
This can be rewritten as such:
or 
You must pick a starting value for X. At each iteration n, you label the update as
or 
This can also be seen as
.
Then, you iterate through until
or whatever is deemed sufficient for your purposes. For an example, we will work with the following system:
Here is the solution out:
Our first step is to rewrite these equations. There are many ways to rewrite them. For now, we will settle for the following:
Equation (1)
Equation (2)
We will start iterating with
. Then,
Zooming in to Quadrant I of the system's graph, we can plot where our iterations land:
Notice how the steps are getting smaller each iteration. We can compute the distance between two points using a triangle:
where c is of course equivalent to 
We can now compute the distance between each iteration:
n | 1 | 2 | 3 | 4 | ... | 12 | 13 |
| | | | | ... | | |
distance  |
| | | | ... | | |
Unfortunately, this reasoning isn't technically correct because these aren't just Cartesian points. Mathematically, the iterates should be treated as vectors living in a Banach space. A Banach space is simply a vector space where you can measure the length or size of vectors and where limits are well behaved, meaning that it is a complete* normed vector space. So instead, we would have to compute the norms of the differences between successive vectors.
We can think of our iteration points as vectors:
Then we have the equation for the
-norm
Let's now compute the distances between the vectors correctly:
* Another condition for a Banach space X is as follows:
For a sequence of vectors
, suppose the series converges:
Then, the vector series
must converge in X. This is known as completeness. Every finite-dimensional normed vector space like
or
is guaranteed to be complete and is thus a Banach space.
The condition for stopping the iteration looks something like
or we reach some
. If we wanted to find the other roots, we would have to rewrite the original system in a different way. For instance, rewriting the system as
and
would result in convergence at
!
However, rewriting the system as
and
would result in divergence. When our norms are getting smaller like this
or 
We're seeing that the distance between the steps is shrinking. This is known as a Cauchy sequence. By the definition of a Banach space, any Cauchy sequence in the space must converge to a point within the space. This is what makes fixed-point iteration work.
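As a concrete sketch of the loop itself, here is one way to code fixed-point iteration for a 2x2 system. The update functions g1 and g2 below are placeholders standing in for whatever rearranged equations your own system produces:
g1 = @(x, y) sqrt(10 - x*y);        % Placeholder rearrangement for x
g2 = @(x, y) sqrt((57 - y)/(3*x));  % Placeholder rearrangement for y
x = 1.5; y = 3.5;                   % Initial guesses
tol = 1e-8;
for n = 1:100
    x_new = g1(x, y);
    y_new = g2(x_new, y);           % Use the freshly updated x
    step = norm([x_new - x; y_new - y], inf);
    x = x_new; y = y_new;
    if step < tol, break, end
end
fprintf('Converged to (%.6f, %.6f) after %d iterations\n', x, y, n)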
When fixed-point iteration fails or converges too slowly, we can use Newton's method for systems of nonlinear equations. Let's begin with an example using the same system as before. We will denote it as
Let's take the point
and evaluate it:
We can think of F as a transformation that acts on an input to yield an output. We want to find the inputs that make F yield an output of
. How do we approach these values? Well, first, let's take a look at how changing our inputs can change our output:
These are simply moving our inputs to the right in the x-direction and upwards in the y-direction respectively. Now, the way we can generalize this for the entire function is using a Jacobian. We have to take the partial derivatives of every function in F for every variable in F:
In this case, our Jacobian is
We can try plugging values in
Let's try multiplying our Jacobian by the small changes in input:
The following is true:
Note that the larger the change in input, the less accurate the Jacobian is.
We can use the following equation for Newton's method:
Now, let's get a system we can solve:
This is now a linear system, so we can use our previous methods.
We have
. This is telling us what change in the input results in a change to the output. Let's try using this to change our input:
The instance where we subtract
is much more useful. From this we can get the general equation for updates:
We can use this to iterate:
Now, let's see what the transformation yields:
That's not quite close enough. Let's now acquire a new value through the Jacobian:
And let's iterate through again with it:
Let's see what the transformation yields:
We stop the process once
or
or of course if the max number of iterations occurs. At this point, we are pretty close to our root:
Let's look at another example using the same system but with a different initial guess and a set ε:
Here is a table of the results:
We can stop at around
iterations. To do this in MATLAB, you can of course use the backslash operator as I did above:
Then we can get
. You may also see it in the form of
using the inverse Jacobian. Then, using the fact that
where I is an identity matrix, we can say that
. Finally, we have a new form of Newton's method:
This would need to be solved symbolically and is generally not the best method for working in MATLAB.
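Here is a minimal Newton sketch for a 2x2 system; the function F and its Jacobian J below are placeholders, so substitute your own system:
F = @(v) [v(1)^2 + v(1)*v(2) - 10;      % Placeholder system F(v) = 0
          v(2) + 3*v(1)*v(2)^2 - 57];
J = @(v) [2*v(1) + v(2),  v(1);         % Its Jacobian, computed by hand
          3*v(2)^2,       1 + 6*v(1)*v(2)];
v = [1.5; 3.5];                         % Initial guess
for k = 1:20
    dv = -J(v) \ F(v);                  % Solve J*dv = -F with the backslash operator
    v = v + dv;
    if norm(dv, inf) < 1e-10, break, end
end
fprintf('Root near (%.6f, %.6f) after %d iterations\n', v(1), v(2), k)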
If needed for specific applications, you can find the Jacobian using the Symbolic Math Toolbox if you have it:
J = jacobian(F, [x, y])
J =

2.6 Curve Fitting and Polynomials
Polynomials are thrown around in many textbooks, but they are far from an afterthought. They are one of the oldest and most fundamental objects in mathematics. A polynomial is simply a mathematical expression consisting of variables and coefficients where the only operations are addition, subtraction, multiplication, and exponentiation with variables raised to nonnegative integer powers. A general nth-degree polynomial looks like this:
where x is the variable, a terms are coefficients, and n is the degree of the polynomial.
Polynomials are present in advanced mathematics, particularly algebraic geometry, where they are used to construct objects in the realm of abstract algebra, and their solutions define complex geometric structures.
For our purposes, polynomials are used for the likes of root-finding, interpolation, and curve fitting. MATLAB represents polynomials as row vectors where each entry is a coefficient listed from the highest degree term to the constant
. For instance,
is represented in MATLAB as a row vector of its coefficients. There are several functions you can use to handle polynomials, such as:
- polyval to evaluate polynomials
- roots to find polynomial roots. Let's say we want to find roots for the equation
:
- poly which can build polynomials from known roots. Let's say we have the roots 2, -1, and 4. We can build a polynomial from this:
r = [2 -1 4]; % Roots of (x-2)(x+1)(x-4)
p = poly(r) % Expanding this out gives us x^3 - 5x^2 + 2x + 8, which matches the coefficients
- We can also multiply and perform polynomial long division with conv and deconv, as shown in the sketch below.
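Here is a quick sketch tying these functions together for the polynomial x^3 - 5x^2 + 2x + 8 from above:
p = [1 -5 2 8];                 % Coefficients of x^3 - 5x^2 + 2x + 8
polyval(p, 3)                   % Evaluate at x = 3 (returns -4)
roots(p)                        % Returns the roots 4, 2, and -1
q = conv([1 -2], [1 1])         % Multiply (x-2)(x+1) to get x^2 - x - 2
[quotient, remainder] = deconv(p, [1 -2]) % Divide by (x-2): quotient x^2 - 3x - 4, remainder 0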
You already know what the mean, median, and mode are. Measures of spread are slightly more complex. The range of a data set is easy to evaluate, but simple measures like it are generally not reliable on their own. The most common measure of spread is the standard deviation s, which describes how spread out data are around the mean:
where
is the sum of squares, the sum of the squared differences between data points and the mean. Thus the formula is
For a set of data with very spread out individual measurements,
and thus s will be large. If they are grouped tightly,
and thus s will be small. There is also a term known as variance which is a measure of just how spread out the data is:

The equations for standard deviation and variance both involve a quantity called the degrees of freedom, which here is n - 1. This is another term that comes up a lot, and you have to adjust your definition of it slightly depending on context, whether it's chemistry, robotics, or statistics. In statistics, it refers to the number of independent data points available to estimate a parameter, that is, the number of values free to vary once constraints such as the mean are fixed. Dividing by the degrees of freedom helps make sure that our calculations of variance and standard deviation do not underestimate the variability of a data set. In MATLAB, we have a variety of functions to help with basic statistics such as mean, median, mode, std (standard deviation), var (variance), and rms (root mean square value).
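Here is a quick sketch of those functions on a small made-up data set (the values are arbitrary):
data = [2.3 4.1 3.9 5.6 4.4 3.2]; % Arbitrary sample data
mean(data)
median(data)
mode(data)
std(data)  % Sample standard deviation (divides by n - 1)
var(data)  % Sample variance
rms(data)  % Root mean square value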
There are some numerical methods like Monte Carlo simulations or stochastic modeling that rely on random numbers that follow specific probability distributions. MATLAB makes it easy to generate random numbers for simulations or testing through
rand which generates numbers with a uniform distribution,
rand(1,5)
0.8143 0.2435 0.9293 0.3500 0.1966
randn which generates numbers that have a normal distribution,
randn(1,5)
-1.4224 0.4882 -0.1774 -0.1961 1.4193
and randperm which returns a row vector containing a random permutation of integers
p = randperm(6) % Generates a random permutation of unique integers from 1 to 6
p = randperm(8,4) % Generates a random permutation of 4 unique integers from 1 to 8
MATLAB also allows you to control the random number stream of these functions through rng. The general syntax for this is rng(seed). This is a complex function, but it lets you specify what you want to initialize the random number generator with. You can also control the algorithm through rng(seed,generator). The main purpose of this is to let you introduce some reproducibility to your random number testing.
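For example, seeding the generator makes a "random" experiment repeatable (the seed value 42 below is an arbitrary choice):
rng(42)          % Seed the generator
a = rand(1,3);
rng(42)          % Reset to the same seed
b = rand(1,3);
isequal(a, b)    % Returns logical 1: the two streams are identical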
Curve fitting is the process of finding a function that best matches a series of data points. We don't often have the exact equation that governs a system, especially when working with experimental data. Thus, you have to analyze the situation and determine parameters. Curve fitting involves regression, which is all about the statistics used to model the relationship between a dependent variable and one or more independent variables. The whole purpose of regression is to find a function that minimizes error between the predicted and actual values. Linear least squares is one of the most simple methods, where you can fit a line to data. It's called "least squares" because it is all about minimizing the sum of squared differences between the points predicted by your model and their true values. Given a set of data points
, we want to find the line

where m is slope, b is the intercept, and e is error. This is also known as the residual between the model and the observations, which we can solve for specifically by rearranging the equation as
. Thus, the residual is the difference between the true value y and the approximate value
. One way to find the most accurate line is to minimize the sum of the residual errors. That is,
where n is the total number of points.
To determine the values of b and m, we have to differentiate with respect to each unknown coefficient:
We can set these derivatives equal to zero to find the minimum
:
Noticing that
, we can rewrite the equations as normal equations:
We can solve these to acquire
We can combine everything to then acquire
This uses the means of y and x.
Let's fit a straight line to the following data:
i | | | | |
1 | 10 | 25 | 100 | 250 |
2 | 20 | 70 | 400 | 1400 |
3 | 30 | 380 | 900 | 11400 |
4 | 40 | 550 | 1600 | 22000 |
5 | 50 | 610 | 2500 | 30500 |
6 | 60 | 1220 | 3600 | 73200 |
7 | 70 | 830 | 4900 | 58100 |
8 | 80 | 1450 | 6400 | 116000 |
∑ | 360 | 5135 | 20400 | 312850 |
First, let's compute the means:
and 
Now, let's calculate the slope and intercept:
We now have the equation
The data in the table can be interpreted as the relationship between force F and velocity v, so we now have
. We can now visualize this through a force-velocity graph: v = [10 20 30 40 50 60 70 80];
F = [25 70 380 550 610 1220 830 1450];
p = polyfit(v, F, 1); % First-order (straight line) least squares fit
v_fit = linspace(min(v), max(v), 100);
F_fit = polyval(p, v_fit);
plot(v, F, 'o', v_fit, F_fit, '-')
title('Force vs. Velocity')
legend({'Data Points', 'Least Squares Fit'}, 'Location', 'northwest', 'FontSize', 8)
While the line fit the data well, there is still error. We can calculate the standard deviation as such:
The standard deviation for the regression line
, also known as the standard error of the estimate, can be calculated as follows:
We know that our estimate is reasonable because
. To measure how accurate our model is, we can find the coefficient of determination
as follows:
This essentially means that
of the uncertainty has been explained by the model. Keep in mind that a
value close to
is not the be-all and end-all by any means. While it suggests a good fit, it still only measures how well the model handles variability. It does not prove that the relationship is truly linear. It's simply a handy way to measure correlation.
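As a quick numeric check, these quantities can be computed directly from a polyfit fit of the same data; the results may differ slightly from the hand calculations above in the last digits:
v = [10 20 30 40 50 60 70 80];
F = [25 70 380 550 610 1220 830 1450];
p = polyfit(v, F, 1);          % Linear least squares fit
F_pred = polyval(p, v);
n = length(v);
Sr = sum((F - F_pred).^2);     % Sum of squared residuals
St = sum((F - mean(F)).^2);    % Total sum of squares about the mean
s_yx = sqrt(Sr / (n - 2))      % Standard error of the estimate
r2 = 1 - Sr/St                 % Coefficient of determination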
You of course know that most of the real world cannot be modeled by a straight line. Common nonlinear models such as the exponential model
or the power law relationship
can be linearized which allows you to analyze them using linear regression. We can easily linearize them by manipulating them algebraically:
We can then analyze the same numbers as last time but using linear regression:
x = [10 20 30 40 50 60 70 80];
y = [25 70 380 550 610 1220 830 1450];
% Log-transform the data for linearization
ln_x = log(x);
ln_y = log(y);
% Linear least squares fit on the log-transformed data
p = polyfit(ln_x, ln_y, 1);
b = p(1);      % Slope of the line = exponent of the power law
ln_a = p(2);   % Intercept of the line = ln(a)
a = exp(ln_a);
% Display linearized form
fprintf('Least squares line: ln(y) = %.4f + %.4f * ln(x)\n', ln_a, b);
Least squares line: ln(y) = -1.2941 + 1.9842 * ln(x)
fprintf('Power law form: y = %.4f * x^%.4f\n', a, b);
Power law form: y = 0.2741 * x^1.9842
% Plot original data and fitted curve
x_fit = linspace(min(x), max(x), 100);
y_fit = a * x_fit.^b; % Evaluate the fitted power law
plot(x, y, 'o', x_fit, y_fit, '-')
legend('Data Points', 'Power Law Fit', 'Location', 'northwest')
title('Power Law Fit to Data')
We can also plot the linearized data:
fit_y = polyval(p, ln_x);
plot(ln_x, ln_y, 'o', ln_x, fit_y, '-')
legend('Transformed Data', 'Linear Fit', 'Location', 'northwest')
title('Linearized Fit (ln(y) vs ln(x))')
We began with curve fitting with linear models. Now, let's move on to curve fitting for polynomials. Let's say we have a quadratic:
The sum of the squares of the residuals is
We will differentiate just as we did with the linear model:
We can then set these equations equal to zero and rearrange to find the normal equations:
You would then solve for the coefficients of the unknowns
,
, and
.
We can extend this case of a quadratic to a general mth degree polynomial
The standard error here is given by
We have lost
degrees of freedom which equals the number of coefficients estimated. That means that more terms in a polynomial will result in losing degrees of freedom. This is all accounted for in our standard error. Let's work through an example:
From this, we can compute the following:
Using this, we can get a linear system:
Using the backslash operator, we can solve for the coefficients:
A = [6, 15, 55; 15, 55, 225; 55, 225, 979];
b = [152.6; 585.6; 2488.8];
a = A\b % Solve the normal equations for [a0; a1; a2]
We have
,
, and
. Therefore, the least squares quadratic equation is
The standard error of the estimate is
And the coefficient of determination is
Thus,
. This means that
of the uncertainty has been explained by the model. We can now graph:
% Original data from the example
x = [0 1 2 3 4 5];
y = [2.1 7.7 13.6 27.2 40.9 61.1];
% Normal equations matrix (from your previous system)
A = [6, 15, 55; 15, 55, 225; 55, 225, 979];
b = [152.6; 585.6; 2488.8];
% Solve for coefficients [a0; a1; a2]
a = A\b;
% Reorder coefficients for polyval: [a2, a1, a0]
p = flipud(a)';
% Generate smooth x-values for plotting the parabola
x_fit = linspace(min(x), max(x), 100);
y_fit = polyval(p, x_fit);
% Plot original data and fitted curve
plot(x, y, 'o', x_fit, y_fit, '-')
title('Quadratic Least Squares Fit')
legend({'Data Points', 'Quadratic Fit'}, 'Location', 'northwest', 'FontSize', 8)
If we had more than one independent variable such that we had a function in the form of
, we'd have to use multiple linear regression. We still make this fit a linear model through the same process as before. It just means we would have to use partial differentiation to acquire the derivatives of
. Setting them equal to zero would then allow us to obtain the linear model:
All of these models are different versions of what is known as a general linear least squares model.
Note: Because of time and length constraints, I was unable to include an example of multiple linear regression. I was also unable to cover fourier analysis, which is where you fit data using sinusoidal functions. However, fourier analysis will be covered in a future project.
We used interpolation before for root-finding, and now we are going to introduce polynomial interpolation. The whole purpose of this is that, given a number of data points, you can find one polynomial that passes through all the points. We will be going over the Lagrange interpolating polynomial.
Let's say that we have a linear interpolating polynomial which is the weighted average of two values
and
connected by a straight line:
where
and
are the weighting coefficients. That is,
If we substitute these into the linear interpolating polynomial, we get
This is a first-order Lagrange interpolating polynomial. We can also find the general form of a second-order Lagrange interpolating polynomial for three values
,
, and
. The second-order polynomial looks like
. Then, we need to find each L term. This is done the same way as before, where each term passes through only one point and is equal to zero at the other two. They would be as follows:
Substituting these results in the second-order Lagrange interpolating polynomial:
The first term is equal to
at
and equal to zero at
and
. The second term is equal to
at
and equal to zero at
and
. The third term is equal to
at
and equal to zero at
and
.
Both of these cases, as well as higher-order Lagrange polynomials, can be represented by a general form:
where
.
Let's work through an example. We have the following data, which represent the density of unused motor oil at
.
We will obtain the estimate at
using the first-order polynomial:
We can also use the second-order polynomial to yield a more accurate guess:
We can now use the following function Lagrange to find the density:
function f_est = Lagrange(x_data, y_data, x_target)
% Lagrange - Estimates value at x_target using Lagrange interpolation.
% x_data - Vector of known x-values (e.g., temperatures)
% y_data - Vector of known y-values (e.g., densities)
% x_target - Target x-value to interpolate at
% f_est - Interpolated y-value at x_target
n = length(x_data);
f_est = 0;
for j = 1:n
    L_j = 1;
    for i = [1:j-1, j+1:n]   % Skip i == j
        L_j = L_j * (x_target - x_data(i)) / (x_data(j) - x_data(i));
    end
    f_est = f_est + y_data(j) * L_j;
end
end

x_data = [0 20 40]; % Temperatures in degrees C
y_data = [3.85 0.800 0.212];
x_target = 15;
density_at_15 = Lagrange(x_data, y_data, x_target);
fprintf('Estimated density at %.1f°C is %.4f g/cm^3\n', x_target, density_at_15);
Estimated density at 15.0°C is 1.3317 g/cm^3
Higher-order polynomials can be generally accurate for some points but can be way off on others. Furthermore, they tend to be very sensitive to roundoff error. These deviations are known as oscillations. Carl Runge, a German mathematician (also known for Runge-Kutta methods which we will discuss in Section 2.9), came up with a function now known as Runge's function. As you increase the number of interpolation points, the polynomial can oscillate wildly at the edges of the interval. This is known as Runge's phenomenon:
f = @(x) 1 ./ (1 + 25 * x.^2);
% Define evaluation x-values for smooth curve
x_fine = linspace(-1, 1, 500);
y_true = f(x_fine);
% 5-point Lagrange interpolation
x_5 = linspace(-1, 1, 5);
y_5 = f(x_5);
y_5_interp = zeros(size(x_fine));
for k = 1:length(x_fine)
    y_5_interp(k) = Lagrange(x_5, y_5, x_fine(k));
end
% 10-point Lagrange interpolation
x_10 = linspace(-1, 1, 10);
y_10 = f(x_10);
y_10_interp = zeros(size(x_fine));
for k = 1:length(x_fine)
    y_10_interp(k) = Lagrange(x_10, y_10, x_fine(k));
end
plot(x_fine, y_true, 'k-', 'LineWidth', 2); hold on;
plot(x_fine, y_5_interp, 'r--', 'LineWidth', 1.5);
plot(x_fine, y_10_interp, 'b-.', 'LineWidth', 1.5);
plot(x_5, y_5, 'ro', x_10, y_10, 'bs');
xlabel('x'); ylabel('f(x)');
legend({'True Function', '5-Point Lagrange', '10-Point Lagrange', '5-pt Nodes', '10-pt Nodes'}, ...
'Location', 'eastoutside', 'FontSize', 8);
title('Runge''s Phenomenon');
A better alternative is to use splines. A spline is a piecewise-defined polynomial: instead of fitting one polynomial over the entire interval, you stitch together multiple low-degree polynomials that each cover a subset of the data points. This approach avoids the large oscillations we saw with high-order polynomials. The simplest spline is a linear spline, which connects the data points with straight lines. The main problem with this approach is that the result lacks smoothness. Quadratic splines improve on this by using second-order polynomials for each segment. The most widely used spline in numerical methods is the cubic spline, which offers the best compromise between smoothness and a lack of oscillations. If you go higher, the oscillations inherent in higher-order polynomials can begin to appear.
Cubic splines are continuous at the points where they are stitched together, which are known as knots. Not only are the function values continuous, but the first and second derivatives are also continuous at each knot, ensuring a smooth curve. A special case is known as a natural cubic spline, where the second derivative is equal to zero at the end points.
The general form of a linear spline is
,
where each interval i has its own spline function
,
is the intercept, and
is the slope of the straight line connecting the points. From this, we can get
Skipping ahead to cubic splines, we have
where for n data points, there are
intervals and
unknown coefficients to evaluate. Also keep in mind that
is the distance from the left endpoint of the interval. This is what localizes the polynomial to the interval. MATLAB has the very convenient function spline which is a cubic spline. The syntax looks like this:
s = spline(x, y, x_query)
where x is a vector of known x-values, y is a vector of known y-values, and x_query is a vector of x-values where you want to evaluate the spline.
Here is a simple example:
x = [0 1 2.5 3.6 5 7 8.1 10];
y = sin(x); % Hypothetical sample values (the original data are not shown)
x_query = linspace(0, 10, 200);
y_interp = spline(x, y, x_query);
plot(x, y, 'o', x_query, y_interp, '-')
title('Cubic Spline Interpolation')
legend('Data Points', 'Spline Fit', 'Location', 'southoutside', 'Orientation', 'horizontal')
On the other hand, if we were to use interp1 to get a linear spline, we would have a graph that lacks smoothness:
y_linear = interp1(x, y, x_query, 'linear');
plot(x, y, 'o', x_query, y_linear, '-')
title('Linear Spline Interpolation')
legend('Data Points', 'Linear Spline Fit', 'Location', 'southoutside', 'Orientation', 'horizontal')
There is also a function called pchip, which stands for Piecewise Cubic Hermite Interpolating Polynomial. This function also uses cubic polynomials, but the second derivative is not required to be continuous. The spline function produces a true cubic spline with continuous second derivatives, whereas pchip does not. The main advantage of pchip is that it is what is known as a shape-preserving fit: it avoids overshooting. On the other hand, spline prioritizes a smooth curve but can overshoot. There is also a fourth method called makima, which stands for modified Akima piecewise cubic Hermite interpolation. It is a good compromise between the lack of oscillations of pchip and the smoothness of spline.
p = pchip(x, y, x_query);
s = spline(x, y, x_query);
m = makima(x, y, x_query);
plot(x, y, 'o', x_query, p, '-', x_query, s, '-.', x_query, m, '--')
title('Comparison of Interpolation Methods')
legend('Sample Points', 'pchip', 'spline', 'makima', 'Location', 'southeast')
2.7 Numerical Differentiation
Numerical differentiation is used to approximate the derivative of a function that does not have an easy analytical derivative, if you only have discrete data points from an experiment, or if you have no equation at all. Recall the idea of finite differences. There are three approaches:
Forward difference is where you use the point ahead of x:
Backward difference is where you use the point behind x:
Central difference is where you use points on both sides of x:
Forward and backward difference are
accurate and central difference is
accurate, meaning that it is a much more accurate choice. Keep in mind that including more terms from the Taylor expansion will mean more accuracy. Truncating the Taylor series introduces error. For a central difference we have
where
. By expanding the Taylor series to include more points
, we can acquire a fourth-order central difference formula of order
:
where
. Let's try this out for
. We want to approximate
. The true value of this is
. Let's begin with step size
using the second-order central difference formula:
Using the fourth-order central difference formula, we get
We will repeat these calculations for different step sizes and compare the error:
Step Size h | Second-Order | Error | Fourth-Order | Error |
| | | | |
| | | | |
| | | | |
| | | | |
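To make the comparison concrete in code, here is a small sketch on a test function of my own choosing (f(x) = e^x at x0 = 1, not the function from the example above):
f = @(x) exp(x);               % Test function whose true derivative is also exp(x)
x0 = 1; true_val = exp(1);
for h = [0.5 0.25 0.125]
    d2 = (f(x0+h) - f(x0-h)) / (2*h);                               % O(h^2) central difference
    d4 = (-f(x0+2*h) + 8*f(x0+h) - 8*f(x0-h) + f(x0-2*h)) / (12*h); % O(h^4) central difference
    fprintf('h = %.3f: 2nd-order error %.2e, 4th-order error %.2e\n', ...
        h, abs(d2 - true_val), abs(d4 - true_val))
end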
The fourth-order central difference method is generally more accurate. Let's look at the formulas for higher-order derivatives.
This same idea holds true for the forward difference method too:
Lower-Order | Error | Higher-Order (more Taylor series terms) | Error |
| | | |
| | | |
| | | |
| | | |
And it holds for the backward difference method:
Lower-Order | Error | Higher-Order (more Taylor series terms) | Error |
| | | |
| | | |
| | | |
| | | |
And lastly, here's the rest of the central difference method:
Lower-Order | Error | Higher-Order (more Taylor series terms) | Error |
| | | |
| | | |
| | | |
| | | |
We can use the following data to test out these three finite difference methods. We are trying to find the derivative of
where
and
. The true value is
. We need to first run some calculations:
and 
and 
and 
and 
and 
Let's now compute the results:
Richardson extrapolation uses two derivative estimates to compute a more accurate approximation. Let's say we have derivative approximations
and
with step sizes
and
. Then, we have the formula for Richardson extrapolation:
For an approximation using central difference methods that is
, the application of Richardson extrapolation will yield a new derivative estimate that is
. Let's apply this using the same function as we did previously. We have
where
,
, and
. Let's now calculate the first derivative estimates:
We can then combine these to yield
which also gives the true answer. In the case with the higher-order central difference method, recall that we got
. This answer is exact because the function being analyzed was a fourth-order polynomial. And we fitted a fourth-order polynomial through the data points to find the formula we used to evaluate it. The Richardson extrapolation yielded
, which is also exact, and this is a result of how the method works; it fitted a higher-order polynomial through the data and then evaluated the derivatives through central divided differences. Both of these cases were lucky because in reality, both of these will yield estimates with error far more often than not. Keep in mind that both of these examples worked because the data were evenly spaced, and we also knew the functions. In a lot of cases, neither of these will be the case. One way to deal with this would be to fit a Lagrange interpolating polynomial to a set of adjacent points that are around the point you want to differentiate at. Another problem can be the presence or creation of errors. Experiments can yield data with errors, and roundoff errors can occur during differentiation. The best way to approach this is to use least squares regression to fit a smooth, differentiable function to your data. If you don't know where to start, using lower-order polynomial regression is the safest first choice.
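Here is a minimal sketch of Richardson extrapolation, again on a placeholder test function, assuming the two estimates are O(h^2) central differences with h2 = h1/2:
f = @(x) exp(x);                         % Placeholder test function
x0 = 1; h1 = 0.5; h2 = h1/2;
D1 = (f(x0 + h1) - f(x0 - h1)) / (2*h1); % O(h1^2) estimate
D2 = (f(x0 + h2) - f(x0 - h2)) / (2*h2); % O(h2^2) estimate
D = (4*D2 - D1) / 3                      % Extrapolated estimate, O(h^4)
fprintf('Errors: %.2e, %.2e, %.2e\n', abs(D1-exp(1)), abs(D2-exp(1)), abs(D-exp(1)))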
MATLAB has two functions that can help you find the derivatives of data (both also have many other purposes outside of numerical derivatives). First, we have diff, which calculates the difference between adjacent elements in a vector of length n. It will return a vector of length n - 1. For instance, let's try differentiating the following function from x = 0 to x = 0.8:
f = @(x) 0.2 + 25*x - 200*x.^2 + 675*x.^3 - 900*x.^4 + 400*x.^5; % Express f(x) as a function
x = 0:0.1:0.8; % Generate evenly spaced values from 0 to 0.8
y = f(x);      % Evaluate the function at each point
n = length(x);
% These compute differences between the pairs of elements in x and y.
% Take a look at what they output.
diff(x)
0.1000 0.1000 0.1000 0.1000 0.1000 0.1000 0.1000 0.1000
diff(y)
1.0890 -0.0010 0.3190 0.8490 0.8690 0.1390 -1.1010 -2.1310
% Let's calculate the difference approximations using vector division
d = diff(y) ./ diff(x);
% The vector d now contains derivative estimates corresponding to the
% midpoint between adjacent elements, almost like slope
% We need to store the x-values corresponding to the midpoints so we can
% plot the estimates against the true derivative
xm = (x(1:n-1) + x(2:n)) ./ 2;
% Let's use the actual derivative for a comparison
xa = 0:0.01:0.8; % Finer grid for the analytical derivative
ya = 25 - 400*xa + 3*675*xa.^2 - 4*900*xa.^3 + 5*400*xa.^4;
% Plot the numerical derivative vs the true derivative
plot(xm, d, 'o', xa, ya, '-')
title('diff Derivative Comparison')
legend('Estimates', 'True Derivative')
On the other hand, the gradient function estimates derivatives at the sample points themselves: gradient(y,h) returns numerical derivative estimates for the vector y with spacing h, while the symbolic form gradient(f,v) returns the gradient vector of a symbolic scalar field f with respect to the vector v in Cartesian coordinates.
The gradient vector of
with respect to
is the vector of the first partial derivatives of f:
In other words, it calculates an approximate derivative at every value in your vector rather than at midpoints. I will do this using the same function as before:
dy = gradient(y,0.1)
10.8900 5.4400 1.5900 5.8400 8.5900 5.0400 -4.8100 -16.1600 -21.3100
plot(x, dy, 'o', xa, ya, '-')
title('gradient Derivative Comparison')
legend('Estimates', 'True Derivative')
In this case, gradient is less accurate than diff. This is because gradient uses wider intervals. That being said, there's no clear answer on which is more accurate. It depends on your function and needs. I think gradient is just so much more versatile and has more features, so I would generally use it more.
You can also use it to symbolically calculate a vector of partial derivatives for a multivariable function:
syms x y z % Define symbolic variables
f = 2*y*z*sin(x) + 3*x*sin(z)*cos(y);
v = [x, y, z]; % Vector of variables to differentiate with respect to
gradient(f, v)
ans =

We can also use it to create vector fields:
f = 2*y*z*sin(x) + 3*x*sin(z)*cos(y);
grad_f = gradient(f, [x, y, z]); % Symbolic gradient vector
% Substitute z = 1 to reduce to a 2D vector field in x-y space
grad_f_2D = subs(grad_f, z, 1); % Fix z=1 for visualization
% Convert the two components to MATLAB functions for numeric evaluation
fx = matlabFunction(grad_f_2D(1), 'Vars', [x y]);
fy = matlabFunction(grad_f_2D(2), 'Vars', [x y]);
% Create grid points for x and y
[X, Y] = meshgrid(-2:0.2:2, -2:0.2:2);
% Evaluate the gradient components over the grid
U = fx(X, Y);
V = fy(X, Y);
quiver(X, Y, U, V)
title('Gradient Vector Field')
That is really awesome, isn't it?
2.8 Numerical Integration
Numerical integration is used to approximate definite integrals that cannot be solved analytically.
From your first calculus class, you may remember the trapezoidal rule. Please see Section 2.1 if you do not. The trapezoidal rule belongs to a larger family of formulas known as the Newton-Cotes formulas. These are the most common methods of numerical integration. The Newton-Cotes strategy is very simple; you replace a more complicated function or data set with a polynomial that is easy to integrate:
where
is an nth-degree polynomial in the form
. For the trapezoidal method, we have a first-order polynomial:
As I'm sure you remember, there is error with the trapezoidal rule. We can make an estimate for truncation error with
Let's numerically integrate
from
to
. We will set the true value as
. We simply plug in our values:
We can calculate
. This is absolutely horrible, but keep in mind that this is exact error because we have the true value. In most cases, we won't because there's basically no point in numerically integrating if we have the true value. We need the second derivative which is
. Then, the average value of the second derivative can be computed as:
We can now find the approximate error:
We used the formula for
, but keep in mind that the average second derivative is not necessarily a sufficient approximation of
. This is why we denote it with
instead. Even with approximate error, this method is clearly not very accurate. We can improve the trapezoidal rule by dividing the bounds of integration into segments and applying the method to each individual section. Let
and
:
Then, we can substitute in the formula for the trapezoidal rule to get
where
. By summing the individual errors for each segment, we can get error:
where i is an individual segment and
. Alternatively,
Finally, we can get the approximate error
This method is known as the composite trapezoidal rule. Let's try it out for the same function as last time but with
and
:
This is an improvement, but it's still not quite accurate enough. We would need to increase the number of segments. For
and
, we have
and
. This is a much better estimate, but it takes too many steps to get to.
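If you want to experiment with the composite trapezoidal rule yourself, here is a minimal sketch; the integrand e^x, the bounds, and the number of segments are placeholders of my own choosing rather than the example above:
f = @(x) exp(x);      % Placeholder integrand
a = 0; b = 1; n = 8;  % Bounds and number of segments
x = linspace(a, b, n+1);
h = (b - a) / n;
I = h * (f(x(1)) + 2*sum(f(x(2:end-1))) + f(x(end))) / 2
I_check = integral(f, a, b) % MATLAB's adaptive quadrature for comparison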
Another method is known as Simpson's rule. There are different versions of Simpson's rule: one for second-order polynomials called Simpson's 1/3 rule and one for third-order polynomials called Simpson's 3/8 rule. The formula for Simpson's 1/3 rule is
with truncation error of
. Trying this out for the same function that we have been using yields
. This comes with an error of
. This comes with an estimated error of
. Our estimated error matches our error because this deals with fifth-order polynomials. The trapezoidal rule fits straight lines between points, but Simpson's rule fits parabolas between points. This makes it generally more accurate. Thus, our results make sense. That being said, it is more restrictive; it requires evenly spaced points and an even number of intervals.
There is a composite version of Simpson's 1/3 rule which is given by
with
. We can apply this to the same function to yield
. This comes with an error of
. This comes with an estimated error of
which matches our error. This is even more accurate. For Simpson's 3/8 rule, we have
with
. This rule is more accurate than Simpson's 1/3 rule because it uses a higher-order polynomial. Keep in mind that you can combine both rules by using either for a certain number of segments and then adding them together. Higher-order formulas do exist, but they are generally unnecessary for applications in science and engineering. It's funny that your AP calculus teacher (the blame is actually more on College Board than them) likely spent more time on symbolic integration than numerical integration. They likely spent a week breezing through methods for approximating integrals despite the fact that these are what are used in engineering and scientific computing!
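Here is a minimal sketch of the composite Simpson's 1/3 rule on a placeholder integrand of my own choosing (note that n must be even):
f = @(x) exp(x);       % Placeholder integrand
a = 0; b = 1; n = 8;   % n must be even for the composite 1/3 rule
x = linspace(a, b, n+1);
h = (b - a) / n;
I = h/3 * (f(x(1)) + 4*sum(f(x(2:2:end-1))) + 2*sum(f(x(3:2:end-2))) + f(x(end)))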
All of the methods we've used so far have been for evenly spaced points. There are a lot of situations where this isn't the case. We can use a similar form of the composite trapezoidal rule but with nonconstant values of h to represent the width of each segment i:
MATLAB has a function called trapz that uses the trapezoidal method. For I = trapz(x,y), it integrates y with scaling specified by x. If
is concave up, trapz will overestimate the value of the integral and vice versa. Here's an example with uniform spacing (the integrand here is an assumed example, since the original is not shown):
x = 0:pi/100:pi; % Domain vector
y = sin(x);      % Integrand values
I = trapz(x,y)   % Approximates the integral of sin(x) from 0 to pi (true value 2)
Here's an example with nonuniform spacing where the rows of y represent velocity data taken at times in x:
y = [5.2, 7.7, 9.6, 13.2;
I = trapz(x,y,2) % Each entry in vector I represents the integral for that corresponding row of y
Here, we use a third parameter dim to specify the dimension to integrate along. In this case, we have dim = 2 because the data is in the rows of y.
If you only have numeric data and don't know the function, you can also use trapz to evaluate double or triple integrals. Let's say we want to approximate
We would simply nest function calls to trapz:
I = trapz(y, trapz(x, F, 2))
For a triple integral
: [X, Y, Z] = meshgrid(x, y, z);
I = trapz(z, trapz(y, trapz(x, F, 2), 1), 3)
We also have the poorly named function cumtrapz that computes cumulative integrals. It essentially is the same thing as trapz, but it tells you the total accumulated area up to a specific point.
cumulative_area = cumtrapz(x, y);
plot(x, cumulative_area, 'o-')
title('Cumulative Integral of {e^x} Using cumtrapz')
We can also use cumtrapz to evaluate double integrals:
I_cum = cumtrapz(y, cumtrapz(x, F, 2)) % This is the cumulative area at each of the points
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0.3321 0.6584 0.9791 1.2944 1.6045 1.9096 2.2099 2.5056 2.7969 3.0840 3.3671 3.6464 3.9221 4.1944 4.4635 4.7296 4.9929 5.2536 5.5119 5.7680 6.0221 6.2744 6.5251 6.7744 7.0225 7.2696 7.5159 7.7616 8.0069 8.2520 8.4971 8.7424 8.9881 9.2344 9.4815 9.7296 9.9789 10.2296 10.4819 10.7360 10.9921 11.2504 11.5111 11.7744 12.0405 12.3096 12.5819 12.8576 13.1369
0 0.6544 1.2972 1.9288 2.5496 3.1600 3.7604 4.3512 4.9328 5.5056 6.0700 6.6264 7.1752 7.7168 8.2516 8.7800 9.3024 9.8192 10.3308 10.8376 11.3400 11.8384 12.3332 12.8248 13.3136 13.8000 14.2844 14.7672 15.2488 15.7296 16.2100 16.6904 17.1712 17.6528 18.1356 18.6200 19.1064 19.5952 20.0868 20.5816 21.0800 21.5824 22.0892 22.6008 23.1176 23.6400 24.1684 24.7032 25.2448 25.7936
0 0.9671 1.9168 2.8497 3.7664 4.6675 5.5536 6.4253 7.2832 8.1279 8.9600 9.7801 10.5888 11.3867 12.1744 12.9525 13.7216 14.4823 15.2352 15.9809 16.7200 17.4531 18.1808 18.9037 19.6224 20.3375 21.0496 21.7593 22.4672 23.1739 23.8800 24.5861 25.2928 26.0007 26.7104 27.4225 28.1376 28.8563 29.5792 30.3069 31.0400 31.7791 32.5248 33.2777 34.0384 34.8075 35.5856 36.3733 37.1712 37.9799
0 1.2704 2.5176 3.7424 4.9456 6.1280 7.2904 8.4336 9.5584 10.6656 11.7560 12.8304 13.8896 14.9344 15.9656 16.9840 17.9904 18.9856 19.9704 20.9456 21.9120 22.8704 23.8216 24.7664 25.7056 26.6400 27.5704 28.4976 29.4224 30.3456 31.2680 32.1904 33.1136 34.0384 34.9656 35.8960 36.8304 37.7696 38.7144 39.6656 40.6240 41.5904 42.5656 43.5504 44.5456 45.5520 46.5704 47.6016 48.6464 49.7056
0 1.5645 3.1000 4.6075 6.0880 7.5425 8.9720 10.3775 11.7600 13.1205 14.4600 15.7795 17.0800 18.3625 19.6280 20.8775 22.1120 23.3325 24.5400 25.7355 26.9200 28.0945 29.2600 30.4175 31.5680 32.7125 33.8520 34.9875 36.1200 37.2505 38.3800 39.5095 40.6400 41.7725 42.9080 44.0475 45.1920 46.3425 47.5000 48.6655 49.8400 51.0245 52.2200 53.4275 54.6480 55.8825 57.1320 58.3975 59.6800 60.9805
0 1.8496 3.6644 5.4456 7.1944 8.9120 10.5996 12.2584 13.8896 15.4944 17.0740 18.6296 20.1624 21.6736 23.1644 24.6360 26.0896 27.5264 28.9476 30.3544 31.7480 33.1296 34.5004 35.8616 37.2144 38.5600 39.8996 41.2344 42.5656 43.8944 45.2220 46.5496 47.8784 49.2096 50.5444 51.8840 53.2296 54.5824 55.9436 57.3144 58.6960 60.0896 61.4964 62.9176 64.3544 65.8080 67.2796 68.7704 70.2816 71.8144
0 2.1259 4.2112 6.2573 8.2656 10.2375 12.1744 14.0777 15.9488 17.7891 19.6000 21.3829 23.1392 24.8703 26.5776 28.2625 29.9264 31.5707 33.1968 34.8061 36.4000 37.9799 39.5472 41.1033 42.6496 44.1875 45.7184 47.2437 48.7648 50.2831 51.8000 53.3169 54.8352 56.3563 57.8816 59.4125 60.9504 62.4967 64.0528 65.6201 67.2000 68.7939 70.4032 72.0293 73.6736 75.3375 77.0224 78.7297 80.4608 82.2171
0 2.3936 4.7408 7.0432 9.3024 11.5200 13.6976 15.8368 17.9392 20.0064 22.0400 24.0416 26.0128 27.9552 29.8704 31.7600 33.6256 35.4688 37.2912 39.0944 40.8800 42.6496 44.4048 46.1472 47.8784 49.6000 51.3136 53.0208 54.7232 56.4224 58.1200 59.8176 61.5168 63.2192 64.9264 66.6400 68.3616 70.0928 71.8352 73.5904 75.3600 77.1456 78.9488 80.7712 82.6144 84.4800 86.3696 88.2848 90.2272 92.1984
0 2.6529 5.2536 7.8039 10.3056 12.7605 15.1704 17.5371 19.8624 22.1481 24.3960 26.6079 28.7856 30.9309 33.0456 35.1315 37.1904 39.2241 41.2344 43.2231 45.1920 47.1429 49.0776 50.9979 52.9056 54.8025 56.6904 58.5711 60.4464 62.3181 64.1880 66.0579 67.9296 69.8049 71.6856 73.5735 75.4704 77.3781 79.2984 81.2331 83.1840 85.1529 87.1416 89.1519 91.1856 93.2445 95.3304 97.4451 99.5904 101.7681
I = I(end) % Final cumulative value (total integral)
We can create a beautiful visualization for the function
as a 3D surface and the surface of cumulative integration where a red star marks the final area: I_cum = cumtrapz(y,cumtrapz(x,F,2));
surf(X,Y,F,'EdgeColor','none')
surf(X,Y,I_cum,'FaceAlpha',0.5,'EdgeColor','none')
plot3(X(end),Y(end),I_cum(end),'r*')
title('cumtrapz Visualization')
MATLAB used to have dblquad and triplequad for double and triple integrals of known functions. These have been replaced with integral, integral2, and integral3.
The general syntax for each is as follows:
q = integral(fun,xmin,xmax,Name,Value)
q = integral2(fun,xmin,xmax,ymin,ymax,Name,Value)
q = integral3(fun,xmin,xmax,ymin,ymax,zmin,zmax,Name,Value)
This may look like a lot of parameters, but keep in mind that you only really need up to xmax, ymax, and zmax respectively.
Here's an example for integral:
Here's an example for integral2:
I2 = integral2(f2, -3, 3, -5, 5)
Here's an example for integral3:
f3 = @(x,y,z) x.*y + z.^2;
I3 = integral3(f3, 0,1, 0,2, -1,1)
2.9 Ordinary Differential Equations
Numerically solving ordinary differential equations (ODEs) is one of the most important parts of scientific computing because real-world systems are rarely modeled by differential equations that have easy closed-form solutions. First, we will look at ODEs in their most basic form of
.
We will be working specifically with initial value problems (IVPs). That is,
. We can say that on some interval
, there exists a differentiable function
such that
and
for all
. In other words, the solution curve
must pass through the initial point
and we know the slope there, which is given by
. From this, we can visualize the solutions of ODEs through slope fields across a region of interest. A slope field essentially defines slopes over the region. The process is to start at an initial point
, move a small horizontal step of size h in the direction of the slope
, and then move vertically by
such that the desired slope is being matched. After that, you will be at the next point on the solution curve
, and thus the process repeats. The slope field for
over the rectangle (domain)
with two initial values
and
can be coded up as such:
f = @(t, y) (t - y) / 2;               % Slope function
[T, Y] = meshgrid(0:0.25:5, 0:0.25:5); % Grid over the region of interest (assumed 0 <= t, y <= 5)
dy = f(T, Y);                          % Slope at each grid point
dt = ones(size(T));                    % Unit horizontal step
% Normalize vectors for quiver
mag = sqrt(dt.^2 + dy.^2);
dt = dt ./ mag;
dy = dy ./ mag;
quiver(T, Y, dt, dy, 0.5)
hold on
title('Slope Field for dy/dt = (t - y)/2')
% Plot the two exact solution curves from the example
t_fine = linspace(0, 5, 100);
y1 = 3*exp(-t_fine/2) - 2 + t_fine;
y2 = 6*exp(-t_fine/2) - 2 + t_fine;
plot(t_fine, y1, 'r', 'LineWidth', 2)
plot(t_fine, y2, 'b', 'LineWidth', 2)
legend('Slope Field', 'Solution: y(0)=1', 'Solution: y(0)=4', 'Location', 'southoutside')
The Lipschitz condition is a way of ensuring that small changes in y don't cause a continuous function to change too fast. A function
that is continuous on a rectangle R satisfies the Lipschitz condition provided that a Lipschitz constant
exists such that
,
where
. In other words, if
with respect to y is bounded by a constant L, the function satisfies the Lipschitz constant. And additionally, if
is continuous, then the IVP has a locally defined solution. This is known as the Picard–Lindelöf theorem, the Cauchy-Lipschitz theorem, or most commonly the existence and uniqueness theorem. Like many other numerical methods, the general approach is to construct an algorithm consisting of multiple steps that converge to a solution. The most famous of these is Euler's method. The first derivative provides an estimate of the slope at
:
From this, we can get the formula for each update in Euler's method:
The goal is to step forward from
to
by following the slope at each step. Euler's method is a first-order method, which means that the local error at each step is proportional to the square of the step size h, while the total (global) error is proportional to h itself. Reducing the step size will improve accuracy but increase computation time because more iterations will be needed. Let's work through an IVP problem using Euler's method. We have
from
to 4 with a step size of 1. We are given the condition
. The exact y can be determined analytically as
. Then, the true solution is
. Let's now implement Euler's method:
Thus, the relative percent error is
.
We would then continue on for the next three steps. Results are tabulated:
Let's look at the graph:
t = 0:1:4;
y_true = [2.00000, 6.19463, 14.84392, 33.67717, 75.33896];
y_euler = [2.00000, 5.00000, 11.40216, 25.51321, 56.84931];
plot(t, y_true, 'r-', 'LineWidth', 2)
hold on
plot(t, y_euler, 'bo-', 'LineWidth', 2, 'MarkerSize', 6)
title('Euler’s Method Approximation vs. True Solution')
legend('True Solution', 'Euler Approximation', 'Location', 'southoutside')
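If you want to reproduce the y_euler values above programmatically, here is a minimal Euler sketch; the slope function dy/dt = 4e^(0.8t) - 0.5y with y(0) = 2 is inferred from the tabulated values rather than quoted from the original problem statement:
f = @(t, y) 4*exp(0.8*t) - 0.5*y; % Slope function consistent with the values above
h = 1; t = 0:h:4;
y = zeros(size(t)); y(1) = 2;
for i = 1:length(t)-1
    y(i+1) = y(i) + h * f(t(i), y(i)); % Step forward along the current slope
end
disp(y) % 2  5.0000  11.4022  25.5132  56.8493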
As you can see, the error here is non-negligible. Numerical solutions of ODEs are subject to two types of error: truncation and roundoff. Error of Euler's method is synonymous with error of the expansion of a Taylor series. The solution to an ODE with continuous derivatives can be represented as a Taylor series about a starting value
:
where
,
is the remainder term, and
. We can combine these equations to yield an alternate form that uses big O notation:
This says the truncation error is proportional to
. We can also cut out some of the true solution by subtracting
:
For sufficiently small h, the higher-order terms are usually negligible, and thus approximate local truncation error is
.
There is also a metric known as stability. Stability refers to the rate at which errors grow. If error grows exponentially for a problem with a bounded solution, the numerical solution is unstable. For a differential equation
with
, we can use Euler's method to solve:
Combining, we can get the update formula:
where
is known as the amplification factor. If
, the solution is considered stable. This is technically a very simple version of von Neumann stability analysis applied to ODEs, which is essentially the analysis of amplification factors in the context of partial differential equations (PDEs). PDEs will be covered in a future project.
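Here is a small sketch of this stability behavior using Euler's method on the test problem dy/dt = -a*y; the decay constant a and the step sizes are placeholders of my own choosing:
a = 10; y0 = 1; tEnd = 5;
for h = [0.05 0.15 0.25]              % The last step size violates |1 - a*h| <= 1
    t = 0:h:tEnd;
    y = zeros(size(t)); y(1) = y0;
    for k = 1:length(t)-1
        y(k+1) = (1 - a*h) * y(k);    % Amplification factor (1 - a*h) at each step
    end
    fprintf('h = %.2f: |1 - a*h| = %.2f, final y = %.3g\n', h, abs(1 - a*h), y(end))
end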
One way to improve on Euler's method is to factor in a second evaluation of slope. In other words, we determine two derivatives for the interval: one at the beginning and another at the end. Then, you average the two derivatives. This is known as Heun's method. In Euler's method, the slope at the beginning of an interval is given by
, which is then extrapolated linearly to
to yield
. In Euler's method, we would stop here. In Heun's method, we continue. This is now known as the predictor equation. We can now get an equation that gives slope at the end of the interval:
where
denotes that it is a prediction. We can combine the equation for slope at the beginning of an interval and slope at the end of an interval to get an average:
Using Euler's method, we can then extrapolate linearly from
to
:
This is known as the corrector equation. Keep in mind that if the ODE is solely a function of the independent variable t, you don't really need the predictor equation.
The corrector equation in this case is
.
Error of the Heun method is similar to the trapezoidal rule. Recall that the trapezoidal rule is given by
.
If we solve the ODE
using integration like this:
We can then substitute this result into the equation for the trapezoidal rule to get
.
This is identical to the corrector equation in Heun's method. Thus, local truncation error is given by the same equation:
We will use Heun's method to work through part of the same function that was used in the Euler's method example. We will add a stopping criterion of
so that corrector iterations do not go on forever. First, we calculate the slope at
as
. Then, we use the predictor to compute a value at 1:
This is the result that would be obtained by Euler's method. This comes with a percent relative error of
. To improve the estimate for
, we will use the value
to predict the slope at the end of the interval:
Combining this with the initial slope, we can calculate average slope over the interval from
to 1:
We will then substitute this into the corrector equation to get the prediction at
:
,
which comes with a true percent relative error of
and 
We can then improve our value by substituting the new result into the corrector:
,
which comes with a true percent relative error of
and
. The next iteration gives
where
and
. After 12 iterations, the result at
is
which has
. This is the point at which the approximate error is below the stopping criterion. I've added both results for the Heun method to the previous plot:
t = 0:1:4;
y_true = [2.00000, 6.19463, 14.84392, 33.67717, 75.33896];
y_euler = [2.00000, 5.00000, 11.40216, 25.51321, 56.84931];
y_heun = [2.00000, 6.70108, 16.31978, 37.19925, 83.33777];
y_heun_iter = [2.00000, 6.36087, 15.30224, 34.74328, 77.73510];
plot(t, y_true, 'r-', 'LineWidth', 2, 'DisplayName', 'True Solution')
hold on
plot(t, y_euler, 'bo-', 'LineWidth', 2, 'MarkerSize', 6, 'DisplayName', 'Euler Approximation')
plot(t, y_heun, 'gs-', 'LineWidth', 2, 'MarkerSize', 6, 'DisplayName', 'Heun (No Iteration)')
plot(t, y_heun_iter, 'yd-', 'LineWidth', 2, 'MarkerSize', 6, 'DisplayName', 'Heun (With Iteration)')
title('Comparison of All Methods')
legend('Location', 'southoutside')
Runge-Kutta methods (RK) are the workhorse of numerical ODE solvers. They are just as accurate as the Taylor series methods but do not require you to calculate higher-order derivatives. There are many variations of Runge-Kutta, but they can all be generalized as
.
where ϕ is called an increment function. It is essentially a weighted sum of the slope estimates over the interval. It can be written generally as
.
where the a terms are constants and the k terms are
where the p and q terms are constants.
The second-order version of Runge-Kutta (RK2) is
where
and
. If we assume that
,
, and
, then RK2 becomes
where
and
. This is also known as the midpoint method.
The most popular version of Runge-Kutta is by far the fourth-order version (RK4):
where
,
,
, and
.
For ODEs that are solely a function of t, RK4 reduces to Simpson's 1/3 rule, just as Heun's method reduces to the trapezoidal rule. The idea behind all of them is to evaluate either the function or its derivative (slope) at multiple points and then combine those estimates using a weighted average.
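Here is a minimal sketch of classical RK4 as a reusable MATLAB function; the function name rk4 and its interface are my own choices for illustration. It accepts vector-valued y so it can also be reused for the systems of ODEs discussed later.
function [t, y] = rk4(f, tspan, y0, h)
% RK4: classical fourth-order Runge-Kutta for dy/dt = f(t, y)
% f must return a column vector of derivatives (a scalar works too)
t = (tspan(1):h:tspan(2))';
y = zeros(length(t), numel(y0));
y(1, :) = y0(:)';
for i = 1:length(t)-1
    k1 = f(t(i),       y(i, :)');            % slope at the beginning of the interval
    k2 = f(t(i) + h/2, y(i, :)' + h/2*k1);   % slope at the midpoint using k1
    k3 = f(t(i) + h/2, y(i, :)' + h/2*k2);   % slope at the midpoint using k2
    k4 = f(t(i) + h,   y(i, :)' + h*k3);     % slope at the end of the interval
    y(i+1, :) = y(i, :) + h/6*(k1 + 2*k2 + 2*k3 + k4)';
end
end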
Let's use RK4 to solve
by integrating from
to 1 with a step size of 1. The initial condition is
. First, let's calculate the slope at the beginning of the interval:
We can then use this value to find a value of y and then slope at the midpoint:
We can then use this value to find another value of y and another slope at the midpoint:
We can then use this value to find a value of y and then slope at the end of the interval:
Now, we will combine these four slope estimates to yield an average slope. Then, we will use it to make the final prediction at the end of the interval:
This is a satisfactory estimate of the true value
. We have
. It's possible to develop higher-order RK methods. For instance, the fifth-order RK method is written as
.
There is a standard way of organizing the coefficients of Runge-Kutta methods known as the Butcher tableau:
This is a table that organizes the coefficients of an RK method. The c-values tell you where in the step each slope is evaluated, the a-values tell you how previous k-values feed into each new slope, and the b-values are the final weights used to combine all the k-values into the next y. Here is the one for RK4:
Going from row 1 to row 4, here is what the c-values tell you:
- 0 means that you evaluate k1 at the start of the interval, t_i.
- 1/2 means that you evaluate k2 halfway through the step, at t_i + h/2.
- 1/2 means that you evaluate k3 halfway through the step as well, but using the new y-value predicted with k2.
- 1 means that you evaluate k4 at the end of the interval, t_i + h.
Going from row 1 to row 4, here is what the a-values tell you:
- All zeros means that k1 doesn't depend on any previous slopes. This makes sense because it is the first slope value.
- 1/2 means that k2 depends only on k1 and is scaled by 1/2.
- 1/2 means that k3 depends only on k2 and is scaled by 1/2.
- 1 means that k4 depends only on k3 and is scaled by 1.
Going from left to right along the bottom row, here is what the b-values tell you:
- 1/6 is the weight for k1.
- 1/3 is the weight for k2.
- 1/3 is the weight for k3.
- 1/6 is the weight for k4.
This of course is represented in
. Can you figure out which methods these two are?
The first is RK3 and the second is Heun's third-order method.
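One way to see how the tableau fully specifies a method is to write a single explicit RK step driven directly by the arrays (A, b, c). This is my own generic sketch for a scalar ODE, not something from the original text, shown here with the RK4 coefficients.
function ynew = rk_step(f, t, y, h, A, b, c)
% RK_STEP: one explicit Runge-Kutta step defined by its Butcher tableau (A, b, c)
% Works for explicit methods only (A is strictly lower triangular)
s = length(b);                  % number of stages
k = zeros(s, 1);                % slope estimates
for i = 1:s
    k(i) = f(t + c(i)*h, y + h*(A(i, 1:i-1)*k(1:i-1)));
end
ynew = y + h*(b*k);             % weighted average of the slopes
end
Feeding it the RK4 coefficients reproduces one step of the classical method:
c = [0 1/2 1/2 1];
A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0];
b = [1/6 1/3 1/3 1/6];
ynew = rk_step(@(t, y) -0.5*y, 0, 2, 1, A, b, c);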
All of the problems we've solved so far are single equations. In real-world applications, you will often have a system of ODEs. Such a system can be represented generally as
All of the methods we have discussed so far can be applied to systems of ODEs. The only difference is that y becomes a vector and your slope function returns a vector of derivatives.
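For instance, here is a hypothetical two-equation system (a simple harmonic oscillator, not one from the text) solved with the rk4 sketch from earlier, just to show that the slope function now returns a column vector:
f = @(t, y) [y(2); -y(1)];              % dy1/dt = y2, dy2/dt = -y1
[t, y] = rk4(f, [0 10], [1 0], 0.1);    % y0 is now a vector of initial conditions
plot(t, y(:,1), t, y(:,2))
legend('y_1', 'y_2')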
Let's work through an example. These equations for velocity and displacement, respectively, come from the bungee jumper system:
We can express these as ODEs:
Let's first solve for the slopes at the beginning of the interval:
where
is the ith value of k for the jth dependent variable. Next, we will calculate the first values of x and v at the midpoint of the first step:
We will then use these values to calculate the first set of midpoint slopes:
We will then use these values to find the second set of predicted midpoint values and slopes:
These are then used to find the predicted values and slopes at the end of the interval:
We can now plug these in to RK4:
The true values are
and
. From our approximations, we have
and
.
Let's say that the solution to a numerical ODE changes very slowly. For the first half of the interval, over many steps, you aren't getting much change, so a large step size would be sufficient. But what happens if the results then start to change rapidly? You would need a small step size to track the fast change. Using that small step size everywhere, however, means wasting effort on the slowly changing first half of the interval. A dilemma like this reflects one key drawback of solving a numerical ODE with a constant step size. Thankfully, there are adaptive methods that can automatically adjust the step size. MATLAB's ode45 (Runge-Kutta 4,5) function is the standard for solving ODEs and handles all of this for us. Keep in mind that ode45 only works well for non-stiff ODEs. Non-stiff ODEs are well behaved, whereas stiff ODEs contain components that change on very different time scales and force explicit solvers like ode45 to take extremely small steps to remain stable. For stiff ODEs, it's best to use a function like ode15s.
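As a quick illustration of the stiff case, here is a sketch (my own addition, using the van der Pol oscillator with a large damping parameter, a standard stiff test problem). ode15s handles it comfortably, while ode45 would be forced to take an enormous number of tiny steps over the same time span:
mu = 1000;                                              % large mu makes the problem stiff
vdp = @(t, y) [y(2); mu*(1 - y(1)^2)*y(2) - y(1)];      % van der Pol equations as a system
[t, y] = ode15s(vdp, [0 3000], [2; 0]);                 % stiff solver
plot(t, y(:,1))
title('van der Pol Oscillator (stiff), solved with ode15s')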
Let's start from the top. While MATLAB is mostly for numeric work, it can also perform symbolic work, as we've seen previously. It can solve simple differential equations symbolically through the function dsolve. Let's say we wanted to solve the first-order differential equation
: Using older syntax, we would do this:
s = dsolve('Dy = y*x', 'x')
s =

I'm including the older syntax because you will still run into it in older code and references. Note that dsolve in either form is part of the Symbolic Math Toolbox, which you might not have.
Let's instead do it using the modern syntax, where we first declare y as a symbolic function of x:
syms y(x)
eqn = diff(y, x) == x*y;
sol = dsolve(eqn)
sol =

The modern syntax is easier to read, easier to work with, and is more in line with other languages.
We could also turn this into an IVP with
. Using older syntax, we would do this:
eqn1 = 'Dy = y*x';
y = dsolve(eqn1, 'y(1)=1', 'x')
y =

Let's do it using the modern syntax:
ode = diff(y, x) == x*y;
cond = y(1) == 1;
sol = dsolve(ode, cond)
sol =

Let's solve the second-order equation
where
and
. Using older syntax, we would do this:
eqn2 = 'D2y + 8*Dy + 2*y = cos(x)';
inits2 = 'y(0) = 0, Dy(0) = 1';
y = dsolve(eqn2, inits2, 'x')
y =

title('Solution to Second-Order ODE')
Using the modern syntax we would have this:
syms y(x)
ode = diff(y, x, 2) + 8*diff(y, x) + 2*y == cos(x);
cond1 = y(0) == 0;
cond2 = subs(diff(y, x), x, 0) == 1;
sol = dsolve(ode, [cond1, cond2]);
x_vals = linspace(0, 1, 200);
f_numeric = matlabFunction(sol);
y_vals = f_numeric(x_vals);
plot(x_vals, y_vals)
title('Solution to Second-Order ODE using syms')
One application of differential equations is population dynamics in biological systems. One of the most famous models, which you likely already know, is the Lotka-Volterra predator-prey model:
, 
, 
where prey birth rate
, predation rate coefficient
, predator death rate
, predator reproduction efficiency
,
, and
. These values are based on real data collected by the Hudson's Bay Company between 1900 and 1920. Here is the function:
function yprime = lv(t, y)
% LV: Lotka-Volterra predator-prey equations
% y(1) = prey population, y(2) = predator population
% a, b, r, and c should be set to the parameter values given above
yprime = [a*y(1) - b*y(1)*y(2); -r*y(2) + c*y(1)*y(2)];
To solve this, we finally get to use ode45:
[t, y] = ode45(@lv, [0 20], [30 4])
t =
0
0.1156
0.2311
0.3467
0.4623
0.8430
1.2236
1.6043
1.9850
2.3258
y =
30.0000 4.0000
31.5464 3.9882
33.1727 3.9959
34.8809 4.0242
36.6723 4.0745
43.1440 4.4378
50.5185 5.1629
58.5240 6.4846
66.4499 8.8629
72.6721 12.3253
plot(t, y(:,1), t, y(:,2), 'LineWidth', 2)   % column 1 = prey, column 2 = predators
title('Lotka-Volterra Predator-Prey Model')
legend('Prey', 'Predator', 'Location', 'southoutside')
The vector t holds the time values where the system was evaluated. The matrix y holds the corresponding prey and predator populations. Because this is from the Canadian ecosystem, we can assume that the prey and predators are snowshoe hares and Canadian lynxes respectively. For instance, at t = 0.3467 years, we have y(1) = 34.8809 and y(2) = 4.0242. This means that the prey population has grown from its initial value of 30 thousand hares to approximately 34.8809 thousand hares. The predator population has increased from 4 thousand lynxes to approximately 4.0242 thousand lynxes. We can match this up with the real data:
(Figure: hare and lynx population data, from Fundamentals of Ecology by Odum)
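If you want the populations at a time that ode45 did not happen to return, one option (my own addition, not something done in the original example) is to call ode45 with a single output so it returns a solution structure, and then evaluate that structure anywhere with deval:
sol = ode45(@lv, [0 20], [30 4]);   % same problem, but returned as a solution structure
pops = deval(sol, 10)               % prey and predator populations (in thousands) at t = 10 years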
Let's work through another example, this time for a system of Lorenz equations:
where Prandtl number
, Rayleigh number
,
. We will set the initial conditions to
,
, and
. The Lorenz system is a model for atmospheric convection that is famous for exhibiting chaotic behavior even though it is completely deterministic. You may know the analogy of "a butterfly flapping its wings in Brazil causing a tornado in Texas." This is the phenomenon Lorenz was studying, and you may have heard of it as the butterfly effect.
Here is the function:
function xprime = lorenz(t, x)
% LORENZ: Computes the derivatives for the Lorenz system
sigma = 10; rho = 28; beta = 8/3;   % classic chaotic parameter values
xprime = [-sigma * x(1) + sigma * x(2);
          rho * x(1) - x(2) - x(1) * x(3);
          -beta * x(3) + x(1) * x(2)];
And now let's use ode45 to solve:
tspan = [0 20];    % Arbitrary time span
x0 = [-8 8 27];    % Initial conditions
[t, x] = ode45(@lorenz, tspan, x0)
t =
0
0.0025
0.0050
0.0075
0.0100
0.0226
0.0352
0.0477
0.0603
0.0716
x =
-8.0000 8.0000 27.0000
-7.6036 7.9571 26.6639
-7.2182 7.9093 26.3387
-6.8436 7.8574 26.0242
-6.4796 7.8021 25.7199
-4.8119 7.4981 24.3412
-3.3774 7.1889 23.1694
-2.1473 6.9160 22.1667
-1.0914 6.7046 21.3016
-0.2618 6.5763 20.6165
t is a vector of time points and x is a matrix with 3 columns for each variable.
Let's plot it:
title('Lorenz Strange Attractor')
This is Lorenz's strange attractor, the attractor associated with the butterfly effect.
We can also plot each component of the solution as a function of t:
sgtitle('Lorenz System Components vs Time')
Here's a 3D model of Lorenz's strange attractor. Please try running it in MATLAB yourself!
figure('Color', 'k')   % dark background so the yellow curve and white labels show up
plot3(x(:,1), x(:,2), x(:,3), 'y', 'LineWidth', 1.5)
xlabel('x', 'Color', 'w')
ylabel('y', 'Color', 'w')
zlabel('z', 'Color', 'w')
title('Lorenz Strange Attractor', 'Color', 'w')
set(gca, 'Color', 'k', 'XColor', 'w', 'YColor', 'w', 'ZColor', 'w')
Here's the code for a fun animation of a point moving along Lorenz's strange attractor over time. Please try running it in MATLAB yourself!
lorenz_animated = @(t, x) [10 * (x(2) - x(1));              % dx/dt (sigma = 10, as in the lorenz function above)
                           28 * x(1) - x(2) - x(1) * x(3);  % dy/dt
                           x(1) * x(2) - (8/3) * x(3)];     % dz/dt, matching -beta*z + x*y from before
[t, x] = ode45(lorenz_animated, tspan, x0);
figure('Color', 'k')                                        % dark background for the animation
hold on
plot(x(:,1), x(:,3), 'Color', [0.3 0.3 0.3])                % faint trace of the full attractor
xlabel('x', 'Color', 'w')
ylabel('z', 'Color', 'w')
title('Lorenz Strange Attractor Animation', 'Color', 'w')
set(gca, 'Color', 'k', 'XColor', 'w', 'YColor', 'w')
h = plot(NaN, NaN, 'yo', 'MarkerFaceColor', 'y', 'MarkerSize', 6);
for k = 1:length(t)                                         % move the point along the trajectory
    set(h, 'XData', x(k,1), 'YData', x(k,3));
    drawnow
end
Boundary Value Problems (BVPs) are not included in too many introductory courses on ODEs. BVPs are differential equations that are subject to constraints at their extremes. That is, they have conditions at both the beginning and end of their intervals. This is in contrast with IVPs, which only have conditions at the beginning of their intervals. In MATLAB, we use the function bvp4c, which is a fourth-order method, or bvp5c, which is a fifth-order method.
Let's test this out in MATLAB. First, here is our system:
We have boundary conditions
and
. The exact analytical solution to this system is
Let's define the system, its boundary conditions, and the analytical Jacobian as functions:
function dydx = bvpfcn(x,y) % System: y(1)' = y(2), y(2)' = -2*y(2)/x - y(1)/x^4
dydx = [y(2); -2*y(2)/x - y(1)/x^4];
function res = bcfcn(ya,yb) % Boundary conditions: y = 0 at x = 1/(3*pi) and y = sin(1) at x = 1
res = [ya(1); yb(1) - sin(1)];
function dfdy = jac(x,y) % Analytical Jacobian for f
dfdy = [0, 1; -1/x^4, -2/x];
Next, let's supply the analytical Jacobian and set options with crude error tolerances and solver statistics turned on:
opts = bvpset('FJacobian',@jac,'RelTol',0.1,'AbsTol',0.1,'Stats','on');
Let's now create an initial guess. We will specify a constant function with an initial mesh of 10 points in the interval from 1/(3*pi) to 1:
xmesh = linspace(1/(3*pi), 1, 10);
solinit = bvpinit(xmesh, [1; 1]);
Let's now solve the equation using both functions:
sol4c = bvp4c(@bvpfcn, @bcfcn, solinit, opts);
The solution was obtained on a mesh of 9 points.
The maximum residual is 9.794e-02.
There were 157 calls to the ODE function.
There were 28 calls to the BC function.
sol5c = bvp5c(@bvpfcn, @bcfcn, solinit, opts);
The solution was obtained on a mesh of 11 points.
The maximum error is 6.742e-02.
There were 244 calls to the ODE function.
There were 29 calls to the BC function.
Let's plot:
xplot = linspace(1/(3*pi),1,200);
yplot = [sin(1./xplot); -cos(1./xplot)./xplot.^2];
plot(xplot,yplot(1,:),'k',sol4c.x,sol4c.y(1,:),'r*',sol5c.x,sol5c.y(1,:),'bo')
title('Comparison of BVP Solvers with Crude Error Tolerance')
legend('True','BVP4C','BVP5C')
As is often the trend with higher-order methods, bvp5c is more accurate but requires more computational effort. bvp5c controls the true error in the calculation directly, whereas bvp4c controls it only indirectly.
A strange attractor is the shape that a chaotic system tends to trace out over time, though it does so in an extremely sensitive and unpredictable manner. Strange attractors are characterized by their fractal structure and showcase the beauty of nonlinear dynamical systems. The butterfly effect is really interesting because it shows how tiny differences can lead to massive events. It belongs to a larger area of study known as chaos theory, which is all about finding hidden structure in what may seem like randomness. This might seem really out there, and in some ways it is, but remember that a majority of the real world is governed by nonlinear systems.
Indeed, the real world rarely hands us systems with easy, closed-form solutions. To understand it, we have to linearize, approximate, and iterate step by step. And that is why numerical methods are the heart of science and engineering, and thus of nearly everything around you. Your phone, the internet, medicine, the food you buy at the grocery store, bridges, planes, and the very ground you walk on would all be infeasible without years and years of refining algorithms. We do all this to solve the unsolvable, build the unbuildable, and make our chaotic world just a little more understandable.
Bibliography
MATLAB Applications Part 1 uses and was inspired by the following sources. Sources are listed in the order in which they first appear. Some may reappear later.
Chapter 1:
Chapter 2: