MATLAB Applications Part 1: Numerical Methods
July 1, 2025
Introduction
In my previous guide, I referenced how MATLAB is used for:
- modelling: creating representations of a scientific phenomena
- scientific computing: using the capabilities of a computer to analyze these models and solve difficult problems
How did MATLAB get to this point? Well, some of the intellectual fathers of modern-day computing have contributed to MATLAB, so it has one of the most intriguing histories of any piece of software. It truly reflects scientific computing and engineering at their best: relentless revision and collaboration.
Recall that the numerical linear algebra package known as LINPACK, developed by many computer scientists and mathematicians, is at the heart of MATLAB. One person who contributed to LINPACK was Cleve Moler, who went on to create MATLAB for his students. Back then, it was only a matrix calculator. This early version of MATLAB was worked on by him and George E. Forsythe, the founder of Stanford's computer science department. The first versions were distributed to a variety of universities and national labs, and MATLAB was ultimately released as a commercial product after Moler met Stanford electrical engineering grad student John N. Little. Little and his friend Steve Bangert went on to rewrite MATLAB in C. After that, the trio founded MathWorks in Portola Valley, California, nestled in the beautiful hills just behind Stanford. Later, the first ten copies of MATLAB were bought by numerical analyst Nick Trefethen at a little school known as MIT. The rest is history.
In the first part of this guide, we will develop an understanding of numerical analysis and work through the most important numerical methods.
In the second part, we will learn a new tool called Simulink. Finally, we will go through a medley of MATLAB applications in science and engineering. The content here will hopefully allow you to integrate programming into anything you work on and also pursue your own projects in MATLAB.
Please check out Bibliography for all of the wonderful sources I used to build this guide. Any specific information not pulled from them is adapted from MATLAB's blogs and documentation. If you have any questions or concerns, please contact me using the socials listed on my home page.
1. Intro to Numerical Analysis
Around 2600 years ago, Pythagoras and the Greeks believed that the universe could be expressed entirely through natural numbers and ratios between them. Some time later, their world view was disturbed when they discovered, using the Greek ladder technique, that √2 is irrational. And yet, over a millennium before even this, the Babylonians and Indians had methods to approximate √2. The Babylonians were sophisticated mathematicians, known for developing their own numerical system. A Babylonian tablet, YBC (Yale Babylonian Collection) 7289, is one example of just how impressive their work was. In base-60 (can you think of anything in our lives today that uses the number 60?), one diagonal contains the numbers 1; 24, 51, 10. After converting to decimal, this refers to the following fraction, which is incredibly close to √2:
1 + 24/60 + 51/60^2 + 10/60^3 = 30547/21600 ≈ 1.41421296
Why did they record 1; 24, 51, 10? Well, notice how the square also has a side labeled 30. Multiplying these two values together gets the value along the other diagonal, 42; 25, 35 (about 42.4264 in decimal). Given a side length of 30, they used the approximation of √2
to find the diagonal of the square. Note: In Babylonian mathematics, division by 2 was equivalent to multiplication by 30, since 2 and 30 are reciprocal values in their number system
. This allowed them to scale values depending on what they were solving for. How were they able to calculate this? Well, they could have used a process modeled by the following:
where
- n is the number you want to find the square root of
- a is the approximation to √n
- B is a "correction" term chosen so that it improves the estimate
Let's try evaluating this for
:
Our approximation is too small.
Let's try evaluating this for
:
This approximation is too large but a little closer.
They likely ended up with a more accurate recursive method known as the Babylonian method. This is also known as, and perhaps more accurately so, Heron's method:
x_{n+1} = (1/2) * (x_n + S/x_n)
where
- S is the number you want to find the square root of
- x_n is your current guess for √S
- x_{n+1} is then your improved guess for √S
Let's iterate through:
After just three iterations, we're at a much closer approximation.
percent_Err = (((1 / 2) * (1.417 + (2 / 1.417)) - sqrt(2)) / sqrt(2)) * 100
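Here's a minimal sketch of that iteration in MATLAB; the starting guess of 1.5 and the three-iteration count are my own choices for illustration:
S = 2;                          % number whose square root we want
x = 1.5;                        % initial guess
for k = 1:3
    x = (1/2) * (x + S/x);      % average the guess with S divided by the guess
end
relErr = (x - sqrt(S)) / sqrt(S) % relative error after three iterations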
Later along the way, this ended up as Newton's method:
x_{n+1} = x_n - f(x_n)/f'(x_n)
If we plug in f(x) = x^2 - S, we get:
x_{n+1} = x_n - (x_n^2 - S)/(2*x_n) = (1/2) * (x_n + S/x_n)
In other words, the Babylonian method is Newton's method for a function in the form of f(x) = x^2 - S, which has the root x = √S. Anyway, there's very rich history here that I skipped, but the important idea is that this is arguably an early form of numerical analysis.
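As a hedged sketch of the general method, here is Newton's iteration written with anonymous functions; f and df are my own choices, picked to reproduce the square-root case above:
f  = @(x) x.^2 - 2;      % function whose root we want (the root is sqrt(2))
df = @(x) 2*x;           % its derivative
x  = 1.5;                % initial guess
for k = 1:5
    x = x - f(x)/df(x);  % Newton update
end
err = x - sqrt(2)        % should be on the order of machine precision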
Numerical analysis isn't something that is super easy to define. Essentially, it's the study of using algorithms to solve problems involving continuous functions. Continuous functions involve real or complex variables, which cannot be represented on computers to complete accuracy because of the way they store information. Thus, through the convergence of approximations, numerical analysis is used to find unknowns to the best of our ability.
1.1 Numbers and Rounding: Unsigned and Signed Integers
Recall the following number systems:
Types of Numbers | Definition | Description |
Natural numbers | ℕ = {1, 2, 3, ...} | The set of all positive whole numbers |
Unsigned integers | {0, 1, 2, 3, ...} | An extension of the natural numbers that includes zero |
Integers | ℤ = {..., -2, -1, 0, 1, 2, ...} | An extension of the unsigned integers that includes negative whole numbers |
Rational numbers | ℚ = {p/q : p, q ∈ ℤ, q ≠ 0} | An extension of the integers to include fractions |
Real numbers | ℝ | An extension of the rational numbers to include irrational numbers |
Complex numbers | ℂ = {a + bi : a, b ∈ ℝ} | An extension of the real numbers to include imaginary numbers |
And recall the 16 fundamental array-based data types in MATLAB:
Name of Class | Description | Usage |
logical | Stores 1 (true) or 0 (false) | logical() |
string | String arrays stored | " " |
char | Character arrays stored | ' ' |
uint8 | 8-bit unsigned integer array | uint8() |
uint16 | 16-bit unsigned integer array | uint16() |
uint32 | 32-bit unsigned integer array | uint32() |
uint64 | 64-bit unsigned integer array | uint64() |
int8 | 8-bit signed integer array | int8() |
int16 | 16-bit signed integer array | int16() |
int32 | 32-bit signed integer array | int32() |
int64 | 64-bit signed integer array | int64() |
single | single-precision floating point arrays | single() |
double | double-precision floating point arrays (numeric default) | double() |
table | table array that stores tabular data | table() |
cell | cell array that groups related data of different types | { } |
structure | structure array that groups related data via data fields | struct |
Here's a flowchart to visualize this:
Let's begin with unsigned integers.
The one digit number 7 can be broken down like this:
The two digit number 15 can be broken down like this:
The three digit number 100 can be broken down like this:
The four digit number 8192 can be broken down like this:
How can this be represented through code?
p = (10 * ones(1,4)).^[0:1:3]
In other words, we represented a decimal number as a vector. This is important because MATLAB stores everything as arrays.
Alternatively, we can go the other way and rebuild the number from its digits.
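Here's one hedged sketch of what that might look like; the digit vector d (least significant digit first, to match p) is my own illustration for 8192:
d = [2 9 1 8];                 % digits of 8192, least significant first
p = (10*ones(1,4)).^[0:1:3];   % powers of 10: [1 10 100 1000]
n = sum(d .* p)                % weighted sum of digits, returns 8192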
Okay, instead of base-10, let's represent numbers in base-2 (binary).
Let's convert the 1-bit binary number 0 to decimal:
Let's convert the 1-bit binary number 1 to decimal:
Let's convert the 2-bit binary number 10 to decimal:
Let's convert the 2-bit binary number 11 to decimal:
Let's convert the 4-bit binary number 1001 to decimal:
The 8-bit binary number 00001001 is equivalent to the 4-bit binary number 1001. There are different ways to represent binary numbers depending on the number of bits. Here's a table with the decimal integers 0 through 15 and their equivalents in 4-bit binary representation:
Decimal | 4-Bit Binary |
0 | 0000 |
1 | 0001 |
2 | 0010 |
3 | 0011 |
4 | 0100 |
5 | 0101 |
6 | 0110 |
7 | 0111 |
8 | 1000 |
9 | 1001 |
10 | 1010 |
11 | 1011 |
12 | 1100 |
13 | 1101 |
14 | 1110 |
15 | 1111 |
Try representing the decimal number 16 using 4-bit binary. That's right, you cannot. 4-bit binary numbers can only represent a total of 16 (2^4) possible values, including 0. These 4-bit binary numbers are also known as nibbles. The largest possible number represented by an 8-bit binary number is 11111111, which is 2^8 - 1 = 255.
8-bit binary numbers can represent a total of 256 (2^8) values, including 0.
"8-bit" in computing refers to early systems where 8 bits was a reasonable compromise in efficiency. This affects audio quality, resolution, and processing power. You may recognize 8-bit music as "Chiptune" or arcade music and 8-bit images as pixel art.
How can binary be represented through code? MATLAB has a built-in function for this:
d = [0; 1; 2; 3; 9; 255];
b = dec2bin(d)
b =
'00000000'
'00000001'
'00000010'
'00000011'
'00001001'
'11111111'
We can also go backwards:
Let's try converting the decimal number 100 to binary.
b = [1, 1, 0, 0, 1, 0, 0];
p = (2*ones(1,7)).^[6:-1:0];
Let's look at the data representation.
Using fliplr(b), we can flip b left to right (reverse the order of its entries).
Indeed, logic can be involved.
Let's convert it into a logical array:
Let's find the powers and calculate it again:
p = (2*ones(1,7)).^[6:-1:0];
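A short sketch of how those pieces might fit together, reusing the variable names from the snippets above:
b = [1, 1, 0, 0, 1, 0, 0];     % binary digits of 100, most significant first
p = (2*ones(1,7)).^[6:-1:0];   % powers of 2: [64 32 16 8 4 2 1]
sum(b .* p)                    % weighted sum, returns 100
L = logical(b);                % the same digits as a logical array
sum(p(L))                      % logical indexing picks out the powers that are "on", also 100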
Representing binary numbers as logic is important because they can then be found through hardware. Logic gates are built through transistors, which are semiconductor devices that store electronic signals. Transistors represent numbers in computers using binary code. Each transistor can be either on (1) or off (0). Combined, these sequences form binary numbers.
Another way to represent unsigned numbers is through base-16, or hexadecimal.
Decimal | Hexadecimal |
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 6 |
7 | 7 |
8 | 8 |
9 | 9 |
10 | a |
11 | b |
12 | c |
13 | d |
14 | e |
15 | f |
For instance, d in hexadecimal is broken down as such:
You can convert quickly from hexadecimal to binary as such:
in binary
in binary
in binary
Let's look at an example:
We will convert it to binary:
in binary
in binary
in binary
Let's try MATLAB's built-in functions:
Let's try using format hex:
y is stored using the double class and y4 is stored using the uint16 class.
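The original values of y and y4 are not shown here, so the sketch below uses 1000 purely as a stand-in:
dec2hex(1000)        % returns '3E8': decimal to hexadecimal text
hex2dec('3E8')       % returns 1000: and back again
format hex           % display numbers as raw hexadecimal bit patterns
y  = 1000            % double: displays as 408f400000000000
y4 = uint16(1000)    % uint16: displays as 03e8
format short         % restore the default display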
Hexadecimal can also be used to "spell" out words. Here's an example of one used in debugging.
Hexadecimal Code | Decimal | Word | Indicates |
0xDEADBEEF | 3735928559 | deadbeef | software crash |
Going back to decimal, let's think about the magnitudes of the values that it can represent:
Number of Digits d | Form | Min Value | Max Value |
1 | | 0 | |
2 | | | |
3 | | | |
n | | | |
From this, let's come up with a formula for the number of decimal digits:
Let
where n is an integer and
. Then there is some
such that
Thus,
This can be represented as n = floor(log10(N)) + 1.
This gives us n, the number of decimal digits needed to represent N
. Let's test this in MATLAB. How many digits are needed to represent
?
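The specific value tested in the original isn't shown, so here's a hedged check using 8192 as a stand-in:
N = 8192;                  % stand-in value
n = floor(log10(N)) + 1    % number of decimal digits, returns 4
length(num2str(N))         % cross-check by counting characters, also 4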
Let's think about this for binary:
Number of Digits b (bits) | Form | Min Value | Max Value |
1 | | 0 | |
2 | | | |
3 | | | |
n | | | |
From this, let's come up with a formula for the number of bits:
Let
and
. Then there is some
such that
Thus,
This can be represented as n = floor(log2(N)) + 1.
This gives us n, the number of bits needed to represent N
. Let's test this in MATLAB:
Let's try this for a 25 digit number.
We can check our work by finding the length of N.
What would be outputted for n if we had N = 2^24?
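Here's a quick check answering that question:
N = 2^24;
n = floor(log2(N)) + 1     % returns 25: 2^24 is a 1 followed by 24 zeros
length(dec2bin(N))         % cross-check by counting binary digits, also 25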
We can see that our formula comes from an inequality where an unsigned integer can be bounded between two powers of 2.
For instance,
We simplified this expression using logarithms to get our formula.
Let's look at the following proposition:
Let
with
. We have the expression:
This is an increasing function with
where z is the domain of the function.
Let's prove this proposition. A function is increasing on an interval if the derivative is positive on that interval.
Let
. We can simplify:
Let's now take the derivative of both sides:
Keep in mind that
. And the domain of
. Thus,
Because the derivative is positive, we know that y is increasing.
We have proved the proposition that if
, then
.
We used the floor function earlier. Let's take a closer look at it.
For division with integers, the following is true:
With remainders, this can be rewritten as:
This leads to the following:
where 
Let's try using the floor function.
help floor
floor - Round toward negative infinity
This MATLAB function rounds each element of X to the nearest integer
less than or equal to that element.
Syntax
Y = floor(X)
Y = floor(t)
Y = floor(t,unit)
Input Arguments
X - Input array
scalar | vector | matrix | multidimensional array | table |
timetable
t - Input duration
duration array
unit - Unit of time
'seconds' (default) | 'minutes' | 'hours' | 'days' | 'years'
Examples
Round Matrix Elements Toward Negative Infinity
Round Duration Values Toward Negative Infinity
See also ceil, fix, round
Introduced in MATLAB before R2006a
Documentation for floor
Other uses of floor
Let's create a vector, floor each value, and then plot the relationship:
title('Floored Value Versus Actual Value')
How can we interpret this? Well, for the first set of values, notice how any x from -10 to -9.5 is rounded off to -10:
Let's look at the floor and rem functions. If we take the floor of
, we get the following: The actual value of
is
, but floor returns 12. This is because it rounds down. Let's take a look at rem.
Note the following:
Indeed, q * D + r = N.
Another true expression can be found:
Furthermore,
The rem function might remind you of the modulo operator (mod).
What's the difference? Well, let's look at a graph:
M = [mod(V(:),3),rem(V(:),3)];
title('Difference Between Rem and Mod')
legend('mod','rem','location','best')
Additionally, here's a table:
T = table(V, modVals, remVals, ...
'VariableNames', {'V', 'mod', 'rem'});
disp(T)
V mod rem
___ ___ ___
-10 2 -1
-9 0 0
-8 1 -2
-7 2 -1
-6 0 0
-5 1 -2
-4 2 -1
-3 0 0
-2 1 -2
-1 2 -1
0 0 0
1 1 1
2 2 2
3 0 0
4 1 1
5 2 2
6 0 0
7 1 1
8 2 2
9 0 0
10 1 1
The outputs change when negative integers are involved.
Here's a table where a is the numerator and m is the denominator:
Aspect | mod(a,m) | rem(a,m) |
Purpose | returns the remainder after division | returns the remainder after division |
Rounding | Uses floor(a/m) (rounds toward negative infinity) | Uses fix(a/m) (rounds toward 0) |
Output | Always 0 or same sign as m (the divisor) | Always 0 or same sign as a (the dividend) |
a < 0, m > 0 | Returns a nonnegative number | Returns a nonpositive number |
m = 0 | Returns a | Returns NaN |
Let's derive mod.
We have the following:
Remainder r is 0 if N is even and 1 if N is odd.
Okay, let's test for an even N:
evenOddTester = 2 * q + r
Let's test for an odd N:
evenOddTester = 2 * q + r
In both cases, we get that the output of evenOddTester is equivalent to N. Thus, it is working as planned.
Okay, so where is the output of modulo in evenOddTester? It's in r. Let's rearrange the function:
r = N - 2*q
where q is the integer quotient. Given the condition that r must satisfy 0 <= r < 2 and the fact that we want an integer quotient q, should we be using floor division or fix? We should be using floor. Thus, q = floor(N/2), because floor keeps the remainder nonnegative even when N is negative. So we now have the following:
r = N - 2*floor(N/2)
Let's test this out:
We now have a working modulo function.
For more than just evens and odds, we can generalize it:
r = N - D*floor(N/D)
where D is the divisor.
Okay, let's verify that our function works. First, let's get an expected output: 32/5 = 6.4. 5 fits into 32 exactly 6 times, meaning that 30 is accounted for. We can see from this that there is a remainder of 2. We can also multiply the fractional leftover 0.4 by the denominator (divisor) 5 to get 2. Thus, 32 mod 5 = 2. Now, let's test it in code:
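A minimal check in MATLAB:
N = 32; D = 5;
q = floor(N/D)     % integer quotient, returns 6
r = N - D*q        % remainder, returns 2
mod(N, D)          % built-in check, also returns 2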
Knowing that arrays are the backbone of MATLAB, let's write out a definition for the modulo operation in MATLAB:
b = mod(a,m) returns the remainder after dividing dividend a by divisor m.
It can be expressed as b = a - m.*floor(a./m).
We can now test mod using a vector of integers:
Here's a review problem: The output of mod(a,m) would represent something if it were a logical vector. What would it represent as a logical vector? Once you figure that out, code up the logical vector.
Let's test mod using positive and negative integers:
a = [-12, -7, -4, -1, 46, 93];
Note that nonzero outputs are always positive if the divisor is positive.
Let's test mod using a negative divisor:
a = [-12, -7, -4, -1, 46, 93];
Note that nonzero outputs are always negative if the divisor is negative.
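The divisor used in the original tests is not shown, so here is a hedged sketch assuming a divisor of 5:
a = [-12, -7, -4, -1, 46, 93];
mod(a, 5)      % positive divisor: 3 3 1 4 1 3  (nonzero results take the divisor's sign)
mod(a, -5)     % negative divisor: -2 -2 -4 -1 -4 -2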
Let's test mod using floating-point values:
theta = [0.0 3.2 6.2 7.6 9.0 4*pi];
m = 2*pi;
mod(theta,m)
0 3.2000 6.2000 1.3168 2.7168 0
Note that mod attempts to compensate for floating-point round-off effects to produce exact integer results whenever possible.
We've successfully used floor to derive the modulo operator. Let's now use floor to code up a general algorithm for converting an integer from decimal to binary.
Let's convert 27 into binary:
We can recognize that:
Thus, the number of bits n is given by floor(log2(27)) + 1 = 5.
27 can be represented by n = 5 bits b:
Okay, now let's assign each of those bits a power of 2. Here's a table:
Power of 2 | Decimal Value |
2^0 | 1 |
2^1 | 2 |
2^2 | 4 |
2^3 | 8 |
2^4 | 16 |
2^5 | 32 |
2^6 | 64 |
2^7 | 128 |
2^8 | 256 |
2^9 | 512 |
2^10 | 1024 |
2^11 | 2048 |
2^12 | 4096 |
2^13 | 8192 |
2^14 | 16384 |
2^15 | 32768 |
2^16 | 65536 |
From this, we have the following table, going from the most significant bit to least significant:
Thus, 27 in the binary form of
is 11011.
Let's convert 149 into binary:
We can recognize that:
Thus, the number of bits n is given by floor(log2(149)) + 1 = 8.
149 can be represented by n = 8 bits b:
Now let's assign each of those bits a power of 2:
From this, we have the following table:
Thus, 149 in the binary form of
is 10010101.
Let's convert 19229 into binary:
We can recognize that:
Thus, the number of bits n is given by floor(log2(19229)) + 1 = 15.
19229 can be represented by n = 15 bits b:
Now let's assign each of those bits a power of 2:
From this, we have the following table:
Thus, 19229 in the binary form of
is 100101100011101.
Let's move over to MATLAB and build this out:
y = 19229; % example input from the conversion we just did by hand
n = floor(log2(y))+1; % number of bits needed to represent y in binary
b = false(1,n); % initialize the binary vector using only logical zeros
tmp = y; % gives us a copy of y to manipulate while preserving our original
for i = n-1:-1:0 % loop from most significant to least significant bit
    if tmp >= 2^i % does 2^i fit into what is left of tmp?
        b(1, n - i) = true; % set the corresponding bit from the conditional to 1
        tmp = tmp - 2^i; % subtract 2^i from tmp because it's now accounted for
    end
end
p = (2*ones(1,n)).^[n-1:-1:0]; % powers of 2, most significant first
sum(p(b)) % reconstructs the original number, should be equal to y
bin_str = char(b + 48) % convert the logical binary vector into a string
We've just built an algorithm that expresses the fact that all integers can be represented as powers of 2 plus remainders. This idea is essential to how computers store integers and how MATLAB operates. We also worked with the floor function, which is an essential tool in rounding and thus scientific computation. However, most of our work revolved around unsigned integers. Let's move on to signed integers.
If we want to store positive and negative integers using bits, we can't just put a negative sign somewhere in the binary representation. We also want to make sure arithmetic works properly. When performing binary addition, the difference from decimal addition is that you carry values of 2 instead of values of 10. Let's take a look at some binary addition:
In 1 + 1 = 10, notice how we go from 1-bit integers to a 2-bit integer.
Let's try adding 120 + 100 in binary:
We went from two 8-bit integers to an 8-bit integer.
Let's try adding 200 + 100 in binary:
We went from two 8-bit integers to a 9-bit integer.
Indeed, 300 requires a minimum of 9 bits to represent.
When we try adding up the previous two examples using the uint8 data class...
ex1 = uint8(120) + uint8(100)
ex2 = uint8(200) + uint8(100)
Notice how ex1 yields the expected 220. But ex2 gives 255, the largest value the class can hold (MATLAB saturates rather than wrapping around). The number 300 cannot be represented in 8 bits. This is called overflow. We can fix this problem by using a different data class like uint16. This expands the bit length.
ex2 = uint16(200) + uint16(100)
Unfortunately, we can't do that with negative integers. Let's look at an example of how to represent signed integers in binary. First, let's do it with a positive integer as we know how to already:
Now, let's do it with a negative integer. You want to flip the bits:
11100011
And then add 1, which gives 11100100.
This representation is known as two's complement. Most computers represent integers this way.
In two's complement, a leading 1 means the integer is negative.
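As a hedged illustration (the value -28 is inferred from the flipped bit pattern 11100011 above), typecast lets us peek at how MATLAB stores a signed integer in two's complement:
dec2bin(typecast(int8(28), 'uint8'), 8)    % returns '00011100'
dec2bin(typecast(int8(-28), 'uint8'), 8)   % returns '11100100': flip the bits of 28 and add 1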
Addition and subtraction in two's complement is pretty simple. Here's what it looks like for a positive integer and a negative integer. We are adding
.
Let's try adding two integers in two's complement that result in a sum of 0. We are adding
.
Overflow also exists when representing integers via two's complement. Let's say we add up
.
The sum of two positive integers yielded a negative integer. This means we have overflow. If you add two numbers with the same sign and get a result with the opposite sign, that means you have overflow. To fix this, you can use a larger data class with more bits. This is known as sign extension.
We have 113 in 8-bit:
01110001
Let's expand it to 16-bit with placeholders:
This is a positive number, so the sign bit (0) moves to the new leftmost position. Then, we fill the rest with copies of the sign bit. In this case, we have a positive number, so we fill it in with zeroes. If it were negative, we'd fill it in with all ones.
0000000001110001
Doing the same for 91 yields 0000000001011011. Let's now add our numbers, now appropriate for int16, together.
16 bits can represent a significantly larger range of values, so we have no more overflow.
Let's do it with negative numbers:
91 in binary is 01011011. Flip the bits and then add 1 to get 10100101. We now have -91. 113 in binary is 01110001. Flip the bits and then add 1 to get 10001111. We now have -113.
-91 in 16-bit is 1111111110100101. -113 in 16-bit is 1111111110001111.
Now, let's combine it all to evaluate -91 + (-113) = -204.
Let's review our work. Using int8 does not work and results in overflow:
Using int16 gives the correct answer:
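A short sketch of both checks; note that MATLAB saturates integer overflow at the largest (or smallest) representable value rather than wrapping around:
int8(113) + int8(91)       % saturates at 127, the largest int8 value (overflow)
int16(113) + int16(91)     % returns 204, the correct sum
int16(-113) + int16(-91)   % returns -204, matching the two's complement work above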
You may have deduced that there is a threshold for the range of numbers that can be represented in two's complement using a certain number of bits n. A general rule is:
-2^(n-1) <= x <= 2^(n-1) - 1
For instance, in 8-bit, the range would be -128 to 127.
Notice how we can go all the way to -128 but only go up to 127. We can represent one more negative number than positive because there is not enough space to represent the positive counterpart of the most negative number. Here is an example in 3-bit:
Decimal | Binary |
-4 | 100 |
-3 | 101 |
-2 | 110 |
-1 | 111 |
0 | 000 |
1 | 001 |
2 | 010 |
3 | 011 |
The allowed range is -4 to 3. 100 is reserved for -4. We cannot go beyond 3 mathematically. Instead, we'd wrap around back to -4. You can think of the two's complement system for a certain number of bits like a circle. Here, there are 2^3 = 8 possible combinations.
If you are performing a sequence of arithmetic, it's okay if the intermediate operations result in overflow as long as the final result is within the bounds. For instance, let's say we want to do
in this system. For the first operation
, there is overflow because
. For the second operation
, we "revert" the overflow and get an answer
. Even though there was overflow in the intermediate step, the final result falls within the allowed range, so the operation is valid.
1.2 What Are Fixed-Point and Floating-Point Numbers?
Our previous work with unsigned and signed integers is part of a larger class of data representation that deals with the ways computers store and manipulate data. The first class is a fixed-precision method made up of fixed-point number systems. A fixed-point number simply means there is a constant number of digits stored on both sides of the radix (often decimal) point. You must decide where the point goes.
It's a good time to talk about how fractions work in these number systems. For any rational number x, we have x = p/q, where p and q represent the numerator and denominator respectively, with q ≠ 0.
For instance, the fraction
can be represented by a 4-digit decimal number:
We can use a decimal point to separate this representation into two parts:
From this, we can come up with a general notation for representing fractions:
Let
be nonnegative integers. Then, we can write
in a finite decimal expansion
Each digit is
for all
. The total number of decimal digits is thus
where f is the number of decimal digits.
Let's look at an example for
.
The integer part of our number 8142 is expressed using
decimal digits therefore
. We can see that
decimal digits after the decimal point. All in all,
We cannot write finite decimal expansions for certain fractions because of repeating decimals.
for instance is written as
. For this, we'd have a different form where
Let's try this out both ways. We have
.
Remember that decimal points are a system for humans to understand. We can multiply by a power of 10 and then subtract to move the repeating decimal to the fractional part of the number:
Since
has 7 decimal digits in the fractional part and we've already moved over 4 digits, let's shift over another 3 digits:
Now we can do
to get
where the fractional parts cancel out.
Using this, we can also find the rational fractional form:
More in line with our general notation, we can also express the number
as a summation.
First, we can break down the number as
Recall the formula of a geometric series:
We can then write out what we have as:
Let's put the summation (which represents the repeating part) in a form more similar to the formula above:
We can rewrite the sum as such, where we factor out
:
The inside can be represented as a geometric series of general form
. Let
and
:
We can see that
. We can then say that
We now have the number with an exact fractional form for our repeating decimal.
The general form of a number with a repeating decimal looks like this:
for a sequence of
repeating digits
Let I be the integer part, A be the non-repeating part after the decimal point (with n digits), and B be the repeating part (with r digits). We now have
x = I + A/10^n + B/(10^n * (10^r - 1))
Trying this out for our previous number
, we can see that
,
with
, and
with
. When we plug these values in to our general formula, we get the following:
This matches our previous work.
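Here's a small hedged numeric check of that general formula using my own example, 2.3171717..., rather than the number worked above (I = 2, A = 3 with n = 1 non-repeating digit, B = 17 with r = 2 repeating digits):
I = 2; A = 3; n = 1; B = 17; r = 2;
x = I + A/10^n + B/(10^n * (10^r - 1))   % returns 2.3172, i.e., 2.31717...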
The above is for infinite repeating decimals, but what happens if we have a finite decimal? We should be able to distinguish between the two. We can say that the decimal expansion of
is finite if and only if its fractional representation can be expressed with a denominator in the form of
where j and k are nonnegative integers.
If we expand
, we will eventually arrive at the following:
In other words, a finite decimal can be written as an integer divided by a power of 10, and since 10^k = 2^k * 5^k, we can conclude that the fraction representation of a finite decimal must have a denominator that is a product of twos and fives.
Binary representations are what the computer deals in, so let's now work with binary fractions. First, we will define the general form of a finite binary expansion. Let
be nonnegative integers. Then
can be written in the form
As this is a binary representation we have each bit
where
. Similarly to before, we know that the total number of bits
.
Here's an example dealing with the fraction
. It can be written as
. In binary form, we get
. In binary, recall that decimal 3 is equivalent to binary 11.
For another example, we have
. This can be rewritten as
In binary form, we get
. Once again, recall that decimal 19 is equivalent to binary 10011.
Here, not all rational numbers can be expressed using a finite binary expansion either because of repeating decimals. For instance,
We can write an infinite binary expansion in the form
Let's try this out for an example where
. First, we do a similar process where we multiply by powers of 2 and then subtract so that we are no longer representing the number through infinite repeating bits:
We then shift again:
Now, subtracting eliminates the repeating bits:
For the fractional representation of a number with infinite repeating bits, we have the following general form of a number:
Let I be the integer part, A be the non-repeating binary fraction with n bits, and B be the repeating binary fraction with r bits. Then,
Try plugging in the example we did above with
. Lastly, we can say that a rational number has a finite binary expansion if its denominator is a power of 2. That is, the binary expansion is finite if and only if the denominator is in the form
where
.
I think the toughest part of this section is figuring out how the conceptual framework of fixed-point numbers interplays with how MATLAB actually works. In other words, how can I apply this to my own work? Keep thinking about this as we move forward.
Let's do a little bit of fixed-point arithmetic. MATLAB has a special toolbox called Fixed-Point Designer that lets you really optimize and be specific about your fixed-point arithmetic. It's a little too specialized for this guide, so I won't be talking about it much after this. That being said, the fundamentals of why it works are still paramount here.
Fixed-point arithmetic is essentially the same as integer arithmetic, but with a scale factor. For instance, in Q4.4 format (4 bits for the integer part, 4 bits for the fraction), our scale factor is 2^-4 = 1/16. We would multiply the stored integers by 2^-4 to get our actual number. We would do the reverse to go from the actual number to the stored integer.
Actual Number | Scaling | Integer Stored By Fixed-Point |
3.25 | 3.25 * 2^4 = 52 | 52 |
5.5 | 5.5 * 2^4 = 88 | 88 |
1.0 | 1.0 * 2^4 = 16 | 16 |
So for addition and subtraction, perform the operations directly and then use the scale to interpret the result.
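A minimal sketch of Q4.4 addition using the stored integers from the table above:
x = 3.25;  y = 5.5;
xi = x * 2^4;  yi = y * 2^4;   % stored integers: 52 and 88
si = xi + yi                   % integer addition: 140
s  = si * 2^-4                 % reinterpret with the scale factor: 8.75 = 3.25 + 5.5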
In Fixed-Point Designer, there is a constructor called fi that lets you create a fi object. A fi object is simply storing a real number with a scaling factor.
There are different ways to use this function:
Parameters | Description |
fi | Returns a signed fi object with no value, a 16-bit word length, and a 15-bit fraction length |
fi(v) | Returns a signed fi object with value v, a 16-bit word length, and best-precision fraction length |
fi(v,s) | Returns a fi object with value v, signedness s, a 16-bit word length, and best-precision fraction length |
fi(v, s, w) | Returns a fi object with value v, signedness s, and word length w |
fi(v, s, w, f) | Returns a fi object with value v, signedness s, word length w, and fraction length f |
fi(v, s, w, slope, bias) | Returns a fi object with value v, signedness s, word length w, slope, and bias |
fi(v, s, w, slopeadjustmentfactor, fixedexponent, bias) | Returns a fi object with value v, signedness s, word length w, slopeadjustmentfactor, fixedexponent, and bias |
fi(v,T) | Returns a fi object with value v and numerictype T |
Let's go over some examples of these parameters.
- Value v refers to the number assigned to the data type of the constructor.
- Signedness s refers to whether or not the number can be negative. It stores only logical 1 or 0.
- Word length w refers to the total number of bits used to store the number, where w can be at most 65535 (individual bits are indexed with 16 bits, which is where this limit comes from).
- Fraction length f refers to the number of fractional bits, which is then used as the scale factor. 15 is the default.
- slope refers to the scalar you multiply the stored integer by to get the actual value (your real-world number). It is represented by the equation v = slope ∗ integer + bias. It differs from f because you can use non-binary scalars.
- bias lets you shift the scaling, and slopeadjustmentfactor lets you use a potentially non-binary slope; it must satisfy 1 ≤ slopeadjustmentfactor < 2. fixedexponent is also a scalar that lets you further adjust slope, where slope = slopeadjustmentfactor ∗ 2^fixedexponent. These three parameters are not used much, so we will not discuss them further.
For example...
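The actual value passed to fi is not shown here, so the call below uses 0.1 purely as a stand-in; it requires the Fixed-Point Designer toolbox.
a = fi(0.1, 0, 4, 6)   % unsigned, 4-bit word length, fraction length 6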
This would create a fixed-point number that is unsigned, 4 bits total, and of fraction length 6. Thus, the fixed-point number a is found by
. Then, the next stored integer is
. But wait, how is it possible to store a fraction length that is larger than the total number of bits? To store this value, MATLAB automatically expands the number of bits to prevent overflow; in this case the word length becomes 5 bits total. That is the end of the introduction to Fixed-Point Designer. If you are interested in using it yourself, you will have to acquire a license for the toolbox.
For multiplication, the result is "scaled twice." You multiply the two integers and the scale factor is
. Thus for two integers a and b, you would do
. In binary, the scaling factor is often a power of two. For example, let one stored integer be 125 and the scaling factor be
. Let another stored integer be 33 scaled by
. We can multiply these together to yield
. The scale factor is then
. With this, we can then do
to get
. To check our work, let's convert the original stored integers back to real numbers by doing
and
. Finally,
. The two results match.
For fixed-point division, let's just jump straight into an example. We are dividing the stored integer 3124 with a scale factor of
by the stored integer 9753 with a scale factor of
. The numerator is given by
. The denominator is given by
. The stored quotient is thus
. The final scale factor is given by
. Thus, our calculated real result is
. For a, b = stored integer values and s_a, s_b = their scale factors, we have the general formulas for fixed-point multiplication and division:
Operation | Formula | Accuracy (generally) |
Multiplication | (a * b) * (s_a * s_b) | Exact |
Division | (a / b) * (s_a / s_b) | Approximate |
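A hedged check of the multiplication rule using the Q4.4 values from earlier (stored integers 52 and 88, each with scale factor 2^-4):
ai = 52;  bi = 88;             % stored integers for 3.25 and 5.5
prod_int = ai * bi             % integer product: 4576
p = prod_int * 2^-4 * 2^-4     % apply both scale factors: 17.875
(52*2^-4) * (88*2^-4)          % direct real-number product, also 17.875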
To begin our work into floating-point numbers, let's start with an example.
Let's say you wanted to use some sort of set 𝔽 to represent the real numbers on a number line. 3 numbers in the set have been plotted already:
The number
has 3 significant digits and
has 9 significant digits. In the real world, it's generally not useful for large numbers to be represented with more significant digits than small ones. If these were experimental data, it would mean that the
in
is the digit we are not fully confident in. But if we were to change it to say
, it wouldn't make too much of a difference. If we did the same for
and it became
, that would be orders of magnitude more noticeable. Thus, this way of representing numbers is not very effective for scientific computing.
Using the same number line, let's start over. Let's zoom in on the interval from 1 to 2. For the following interval, there are
values because of binary representation. But the exact number isn't as important as the concept. Just imagine that there is a certain amount of values on this interval.
Let's say we were to multiply those
values by 2. They would then end up on the interval from 2 to 4:
There are still
values on the new interval, they are just spaced out twice as far apart. The same idea applies if we were to multiply our original
by
:
To generalize this pattern, we would say that each interval can be represented as [2^e, 2^(e+1)), where e is the exponent.
We can also say that each interval has
evenly spaced values in and that the spaces are relative to the size of the interval. Suppose we have a real number
and
is the closest value in floating point to x. It then follows that the error is bounded by:
This tells us how far off we might be due to the spacing between values. Now, let's look at the same thing but the size of the spacing relative to x. This means that
.
Thus, we can remove
. We are making the upper bound more flexible which is a safe approximation:
We now have the rounding error in terms of bits n:
This special value machine epsilon
can then be calculated as
which is the relative precision of floating-point arithmetic in this case. In real computing, floating-point arithmetic has around 16 decimal digits of precision, so our generalization checks out.
Let's take a second to come up with a formal definition for floating-point numbers:
The set 𝔽 of floating-point numbers can be expressed as
where
and exponent range
. The mantissa
is the part of a floating-point number that represents its significant digits. It then follows that
where d is a fixed integer called the binary precision.
The binary representation of
uses d bits thus
. We will elaborate on this definition later.
So we've seen floating-point as an idea, but why exactly we use it hasn't been well-established. Let's move on to an example that illustrates the advantages floating-point has over fixed-point.
Let x be a fixed-point data type such that
where
is the unsigned, fixed-point number encoded using 8 digits in binary with l bits for the integer part of the binary expansion and f bits for the fractional part of the binary expansion such that
with nonnegative integers 
Input
is an 8-bit binary word. Output
is a rational number with
where
and
. The output can be expressed as a finite binary expansion.
For instance, let
. We will set
and
which lets us see that
This is in the form of the binary expansion
This is then equivalent to
. Let's now convert x to decimal:
If we were to set
and
, we would get a different value:
There are different combinations of values for
and f such that
.
Case # | Possible l | Possible f |
1 | 4 | 0 |
2 | 3 | 1 |
3 | 2 | 2 |
4 | 1 | 3 |
5 | 0 | 4 |
Let's begin with Case 1 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
Let's move on to Case 2 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
. Let's move on to Case 3 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
Let's move on to Case 4 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
Let's move on to Case 5 where
. We then get the following:
where 
Here is a table of values for all of its possible 4-bit binary words:
Binary Word | Binary Value | Decimal Value |
0000 | | |
0001 | | |
0010 | | |
0011 | | |
0100 | | |
0101 | | |
0110 | | |
0111 | | |
1000 | | |
1001 | | |
1010 | | |
1011 | | |
1100 | | |
1101 | | |
1110 | | |
1111 | | |
The range for x is
. In general form that is
.
From all of these cases, we can conclude that for
, we can write the interpreted value of the finite binary expansion as
where the total number of bits 
We can then say that
Assuming 
where 
We now get an approximation in the range of
that can be encoded in our
data class using an n-bit binary string.
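Here's a short sketch that reproduces the five cases above by reinterpreting the same 16 stored words with f = 0 through 4 fractional bits:
words = 0:15;                        % all 4-bit unsigned words
for f = 0:4
    vals = words * 2^-f;             % interpreted values for this integer/fraction split
    fprintf('f = %d : min = %g, max = %g, spacing = %g\n', ...
        f, min(vals), max(vals), 2^-f);
end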
Let's try this out.
Let
. Expanding out gets us
In binary,
Let's now encode x exactly in fixed-point format.
Let's recast it with
and
. We have a new object represented by
. You may recognize the hat notation from unit vectors in physics, when you have to approximate inconsistent systems in linear algebra, or to denote the Fourier transform of a function. In this case, we are using it to refer to the programming object used to store our approximation of x.
Notice how this has created an extra bit. We could round down to
or round up to
. Let's say we round down to get our approximation:
With
, it follows that
. We can now find the absolute error in our approximation:
For a different method of dealing with the extra bit, we can round up. We will refer to this instance as
. We can evaluate and then find absolute error again:
Let's look at another instance of storing x in
:
Previously, the least significant bits (LSB) were extra. In this case, the most significant bit (MSB) is extra. We could round down to
or round up to
. Let's say we round down to get our approximation and then find absolute error:
On the other hand, rounding up would yield:
This makes sense because we approximated our exact 4-bit integer value (1010) as a 3-bit integer (110) and
. Let's summarize these cases through a table:
Data Type | Binary Value | Absolute Error | Type of Error |
| | 0 | None |
| | | LSB |
| | | LSB |
| | 8 | MSB |
| | 4 | MSB |
Which type of error is preferable between LSB and MSB?
Let's move on to another example. Let
. Then,
This can be further simplified as
. If we were to store this using 8 bits, what would happen? Well, for one, it would have to be an approximation with a good amount of error. This is inevitable, but we can still try to find the encoding with low error. Applying a strategy from the previous example, let's try a data type where
and
. We will also round up:
Because
, it then follows that
We now have an absolute error in our approximation of
. To convert this back into binary, notice that
. Then we have
as our absolute error in binary.
In our previous work, when we were given a number of bits
, one possible way to represent it was to say that
such that half of the bits in the number represented the integer part and half represented the fractional part. This would mean
,
,
, or
. We can analyze the ranges of these representations. Let's look at
.
The lower bound is
. The upper bound is
. Thus,
. More generally,
. If you do this for the other possibilities, you will notice that the upper bounds, or the max integer values, will increase very rapidly.
After all of these examples, we can better see some disadvantages of fixed-point. One of these drawbacks is wasted bits. Using
to encode
would result in the following string of bits
. In high performance environments, this is a waste of storage that will also hurt performance.
Another drawback is that the range is very specific. For instance,
has an upper bound of
. The size of this would be okay for applications, but it's not nearly robust enough. If we needed to represent larger values, we could increase l, but this would result in us having less precision with the fractional part of a number. Remember that having more significant digits with lower numbers is generally more important than having the same with large numbers.
Additionally, remember the overflow that came from our work with fixed-point arithmetic.
The solution to this is to use floating-point. Floating-point, as we saw before, functions like scientific notation. It can remove the first drawback of extra bits. Take a look at the following example:
Let's say we wanted to represent
. In fixed-point binary, this would become
. All of these zeros do not really add anything important to the representation in terms of precision. They are only there to signify the location of the radix point. And in fixed-point decimal, this would become
. We can represent the number much more efficiently by allowing the radix point to float:
We can then float the radix point four places to the right to get
. This can of course also be done in binary:
Let's look at yet another example of where floating-point is much more efficient than fixed-point.
We have the number
which is equal to
in binary. To represent this in fixed-point, we would need 8 bits for the fractional part. In floating-point, we can represent it as follows:
Floating-point requires only 4 bits for the mantissa, also known as the significand. It stores only the significant digits and uses an exponent to shift the radix point. Because our number has a finite binary expansion and all bits fit into the mantissa, this is an exact number.
Mathematically, the general format of a floating-point number in binary looks like
,
where m is the mantissa and e is the exponent value.
A normalized floating-point number would look like
,
where
. We then have
,
where f is the fractional part of the mantissa. This is equivalent to our previous formal definition of a floating-point number:
where 
Be warned that exact representations aren't often the case in floating-point; most values end up being stored as approximations.
Let's conclude this chapter with a table summarizing fixed-point and floating-point numbers:
Aspect | Fixed-Point | Floating-Point |
General form | | |
Stores | Integer part and fractional part | Sign bit, mantissa, exponent |
Analogous to | Decimal numbers scaled by a fixed decimal point | Scientific notation |
Range | Limited by fixed scaling | Controlled by dynamic exponent scaling |
Resources | Very little, inexpensive | Generally fast, depends on hardware but is more expensive |
Created by | Early computer architects and engineers | Numerical analysts |
Used for | Digital signal processing, controls systems | Modern scientific computing |
1.3 IEEE Standard for Floating-Point Arithmetic (IEEE 754)
The Institute of Electrical and Electronics Engineers (IEEE, pronounced "eye-triple-E") is an association responsible for electrical and computer engineering. As a professional organization, they set industry standards, publish research, and organize conferences across many disciplines. IEEE 754 is a technical standard for floating-point arithmetic, and is one of their most important. It was published in 1985 and changes have been made since. Before 1985, different computers could have given slightly different results for the same math. Just think about how confusing this would have been! To this day, most computers abide by IEEE 754.
In IEEE 754, there are different data classes for different levels of precision.
Level of Precision | Number of Bits |
Quarter | 8 |
Half | 16 |
Single | 32 |
Double | 64 |
Extended | 128 |
While the standard names used nowadays are binary16, binary32, binary64, and binary128 (with binary8, or quarter precision, used informally), there were many competing data classes before IEEE 754 came around. The following table details some of the historical floating-point formats:
Example Data Class | Used By |
1750A_32 | Military |
IBM_32 | IBM |
TI_32 | Texas Instruments |
Let's work with quarter-precision floats. Let the total number of bits be 8 and let the number be normalized. Then, we have the bitwise format
. The 8-bit binary word looks as follows:
The sign bit has the value:
Let's try it out. Suppose
. It then follows that
Since
,
.
means that
. And
means that
. We are thus going to store
:
Let's look at another example:
where + is the sign bit and 1 is the implied normalized leading bit. We can now say that
,
, and
.
Let's try an example with a larger number
:
We can now say that
,
, and
. The value is stored as:
Recall that IEEE 754 doesn't store the actual exponent e but rather encodes something called the biased exponent u using the bias k. In this case, k = 2^(3-1) - 1 = 3. Thus u = e + k. Plugging our values in, we can find the actual value being stored
. We can now come up with a table that summarizes how this works. Recall that u is the value that is stored by the computer for the 3-bit exponent field
. The exponent value is the interpreted value e in decimal, found by e = u - 3. This is known as excess-3.
Exponent Bits | Biased Exponent u | Exponent Value e |
000 | 0 | -3 |
001 | 1 | -2 |
010 | 2 | -1 |
011 | 3 | 0 |
100 | 4 | 1 |
101 | 5 | 2 |
110 | 6 | 3 |
111 | 7 | 4 |
This is just how you go from the exponent bits back to the actual exponent e or vice-versa.
Here is a table of some raw 8-bit words and their numeric values (the first row is the general form):
Quarter-precision 8-bit Word | Sign Bit | Exponent Code | Numerical Value (floating-point binary) | Numerical Value as Sum (decimal) | Numerical Value (decimal) |
| | | |
|
|
11100000 | 1 | 110 | | | |
11011111 | 1 | 101 | | | |
11011110 | 1 | 101 | | | |
11011101 | 1 | 101 | | | |
11011100 | 1 | 101 | | | |
11011011 | 1 | 101 | | | |
11011010 | 1 | 101 | | | |
11011001 | 1 | 101 | | | |
11011000 | 1 | 101 | | | |
... | ... | ... | ... | ... | ... |
01011000 | 0 | 101 | | | |
01011001 | 0 | 101 | | | |
01011010 | 0 | 101 | | | |
01011011 | 0 | 101 | | | |
01011100 | 0 | 101 | | | |
01011101 | 0 | 101 | | | |
01011110 | 0 | 101 | | | |
01011111 | 0 | 101 | | | |
01100000 | 0 | 110 | | | |
... | ... | ... | ... | ... | ... |
01101111 | 0 | 110 | | | |
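Here's a hedged sketch that decodes one normalized quarter-precision word from the table (1 sign bit, 3 exponent bits with bias 3, 4 fraction bits, and the hidden leading 1):
word = '01011000';                   % example word from the table above
s = word(1) - '0';                   % sign bit
u = bin2dec(word(2:4));              % biased exponent, here 5
f = bin2dec(word(5:8)) / 2^4;        % fractional part of the significand, here 0.5
x = (-1)^s * (1 + f) * 2^(u - 3)     % decoded value: 1.5 * 2^2 = 6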
Here are some notes on floating-point in MATLAB. Let's start by initializing the characterizing parameters of floating-point:
p = [4, 10, 23, 52]; % This represents the number of bits in the fraction (precision)
q = [3, 5, 8, 11]; % This represents the number of bits in the exponent
w = p + q + 1 % Word length, total number of bits
b = 2.^(q-1)-1 % This is the exponent bias for each format
Roundoff absolute error can be found by eps(x) which gives the distance from x to the next larger floating-point number:
format shortg % shortg picks the more compact of fixed-decimal or scientific notation for display
eps = 2.^(-p)
0.0625 0.00097656 1.1921e-07 2.2204e-16
This tells us that double-precision is good for around 16 decimal digits of accuracy, single for about 7, half for about 3, and quarter for about 1.
If a real number is too large, it also has overflow here and is replaced with infinity. The largest floating point number that does not overflow can be found by:
realmax = 2.^b.*(2-eps)
15.5 65504 3.4028e+38 1.7977e+308
Similarly with underflow, the smallest normalized floating point number is given by:
realmin = 2.^(-b+1)
0.25 6.1035e-05 1.1755e-38 2.2251e-308
To add to this, IEEE 754 introduced the idea of gradual underflow. When an operation produces a result smaller than realmin, instead of flushing it straight to zero, the result is represented with progressively less precision. This is done with subnormal numbers, a special category of floating-point values that are too close to zero to be normalized. The leading digit of a subnormal number is zero. They help prevent abrupt underflow.
Thus, the smallest numbers in MATLAB can be found as such:
tiny = realmin.*eps
0.015625 5.9605e-08 1.4013e-45 4.9407e-324
Here are some subnormal numbers in binary8:
Quarter-precision 8-bit Word | Sign Bit | Exponent Code | Numerical Value (floating-point binary) | Numerical Value as Sum (decimal) | Numerical Value (decimal) |
| | | |
|
|
10001111 | 1 | 000 | | | |
10001110 | 1 | 000 | | | |
10001101 | 1 | 000 | | | |
... | ... | ... | ... | ... | ... |
00001101 | 0 | 000 | | | |
00001110 | 0 | 000 | | | |
00001111 | 0 | 000 | | | |
Exact integer arithmetic can be done with floating-point numbers up to a limit, using what MATLAB calls flints (floating-point numbers whose values are integers). The largest consecutive flint is given by:
flintmax = 2./eps
32 2048 1.6777e+07 9.0072e+15
To summarize, here's a table with the properties of IEEE 754 formats.
T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];
T = table(T(:,1), T(:,2), T(:,3), T(:,4), ...
    'variablenames',{'quarter','half','single','double'}, ...
    'rownames',{'w','p','q','b','eps','realmax','realmin','tiny','flintmax'});
disp(T)
quarter half single double
________ __________ __________ ___________
w 8 16 32 64
p 4 10 23 52
q 3 5 8 11
b 3 15 127 1023
eps 0.0625 0.00097656 1.1921e-07 2.2204e-16
realmax 15.5 65504 3.4028e+38 1.7977e+308
realmin 0.25 6.1035e-05 1.1755e-38 2.2251e-308
tiny 0.015625 5.9605e-08 1.4013e-45 4.9407e-324
flintmax 32 2048 1.6777e+07 9.0072e+15
One example of extended precision format can be found in Intel CPUs, which have the capacity to use what can be referred to as binary80.
With this, let's put together a reference for the most simple of the formats, quarter-precision, or binary8. Recall the visual representation:
This is interpreted as follows:
In this, we have
a sign field
, an exponent field
where
, and significand field
which is interpreted as
where
.
And this only holds if
, which is the range of exponent codes that produce normalized floating-point numbers. What do you think the numbers 000 and 111 outside of these bounds might be reserved for?
If
, then
So far, we've interpreted the 8-bit binary floating-point number x by breaking it into a form that resembles scientific notation. We then arrived at the final decoded value of the number based on its sign bit, biased exponent, and significand. This is what the number is represented as in the memory of a computer.
Furthermore, we have
,
where 4 is the number of bits in the significand field. This tells you that f is the fractional part of the significand. To decode the significand bits, divide by
to get f. To encode a known f, multiply by
.
Keep the following in mind when you are working with normalized floating-point numbers in binary8.
The hidden leading bit doesn't always appear:
It only appears when
. Thus, this is the range for normalized floating-point numbers. Only normalized numbers use the hidden leading bit in the significand. Outside of this range you can get subnormal numbers,
, and NaN. Subnormal numbers are where
which means
. Thus we have
. The significand is
and the leading 0 is not hidden. Normalized floating-point numbers in binary formats thus have an extra bit of precision that does not need to be represented in memory. Subnormals trade precision for the ability to represent smaller numbers after the lower bound exponent is reached. For ∞ and NaN
. We have
if
and
, but we have
if
and
. Anything else yields NaN, a special error pattern triggered when at least one
for
.
You may be wondering about what controls precision and the space between different floating-point numbers. After all, they can be used to represent a number line. The distance between two adjacent floating-point numbers is not constant. Recall from the first example in this chapter that they get more spread out as the numbers get larger. For exponent e, the smallest distance you can represent is
, where p is the number of significand bits. Near 0, floating-point numbers are very close together and further from 0, they are more spread apart. As e increases, the gap between floating-point numbers is also increasing. Errors in floating-point arithmetic depend on where the numbers are located and how you are representing them. Knowing Δ can tell you the maximum rounding error for the rounding of a real number as a floating-point number. Think about the difference between relative precision and absolute precision as values grow. If you are confused, go back to the first example from this chapter.
Pushing further, consider a significand field bit string pattern
. Thus
. The floating-point number then becomes
. What would happen to the mapped value as we vary the exponent code? Let's try it out. Keep in mind that for binary8,
. Here are the results:
Value of e | Mapped Value |
| |
| |
0 | |
1 | |
2 | |
3 | |
This is how floating-point has such a wide, dynamic range of numbers. As e changes, you are changing where the radix point goes. Floating-point is really just a way to adjust the fixed-point representation of a number as you see fit.
Okay, let's move on to more IEEE 754 formats. These are two of the most important, and especially so in MATLAB:
| Data Class | Exponent Bits | Fractional Bits |
Single-precision | single (binary32) | 8 | 23 |
Double-precision | double (binary64) | 11 | 52 |
Let's showcase single-precision visually:
Similarly to quarter-precision, let's now make a shortened table (256 entries would be too repetitive with little value!) for the relationship between the exponent codes and the numerical exponent values. For 1 ≤ u ≤ 254, the exponent value is e = u - 127. Outside of those bounds we have subnormals and ±∞ or NaN.
Exponent Bits | Biased Exponent u | Exponent value e | Numerical Interpretation of x |
00000000 | 0 | n/a (subnormal) | |
00000001 | 1 | -126 | |
00000010 | 2 | -125 | |
... | ... | ... | ... |
11111101 | 253 | 126 | |
11111110 | 254 | 127 | |
11111111 | 255 | n/a | or NaN |
Let's find the range for normalized floats in single-precision:
If
and
, then
Since
,
This is the decoded value of a normalized binary32 float.
Let's showcase double-precision visually:
Let's now make a shortened table for the relationship between the exponent codes and the numerical exponent values. For 1 ≤ u ≤ 2046, the exponent value is e = u - 1023. Outside of those bounds we have subnormals and ±∞ or NaN.
Exponent Bits | Biased Exponent u | Exponent value e | Numerical Interpretation of x |
00000000000 | 0 | n/a (subnormal) | |
00000000001 | 1 | -1022 | |
00000000010 | 2 | -1021 | |
... | ... | ... | ... |
11111111101 | 2045 | 1022 | |
11111111110 | 2046 | 1023 | |
11111111111 | 2047 | n/a | or NaN |
The range for normalized floats in double-precision follows the same pattern.
If
and
then
Since
,
This is the decoded value of a normalized binary64 float.
Let's work through some problems.
Convert 40.78125 into single-precision and double-precision. Noticing that
, we can say that
. In binary, this is equal to
. We can then rewrite it further as
. Moving the radix point five points to the left gets us the normalized binary number
. We can now put it into the final single-precision form of
. We can now find individual components of this representation. Because
where
, we can find
. We can then assign bits to the exponent field. We first have the bit string as an 8-bit unsigned integer
. We can then assign each bit into the correct slots of our binary32 representation to get
. In hexademical, you would extend the representation as a raw bit string 01000010001000110010000000000000. Then
0100 | 0010 | 0010 | 0011 | 0010 | 0000 | 0000 | 0000 |
4 | 2 | 2 | 3 | 2 | 0 | 0 | 0 |
Thus, in hexadecimal we have our final IEEE 754 float, 0x42232000. It is pretty easy to convert between binary and hex, so it's common in the world of computer science. We can check our work in MATLAB:
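One simple way to do that check (this snippet is my own suggestion, not the original listing) is num2hex, which returns the raw bit pattern of a stored value as a hexadecimal string:
num2hex(single(40.78125))   % returns '42232000'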
For double-precision, recall the normalized form 1.0100011001 × 2^5. We have the same significand bits 0100011001 and the same exponent e = 5. Thus we have (−1)^0 × (1.0100011001)₂ × 2^5 as the normalized value. We can then compute the biased exponent u = e + 1023 where e = 5. This yields u = 1028. As a bit string we have the 11-bit unsigned integer 10000000100. Finally, we have 0 | 10000000100 | 0100011001 followed by 42 zeros. And now we can convert it to hexadecimal:
0100 | 0000 | 0100 | 0100 | 0110 | 0100 | |
4 | 0 | 4 | 4 | 6 | 4 | 0000000000 |
We have 0x4044640000000000.
The above two examples are for numbers that can be represented exactly. Some decimal numbers cannot be encoded exactly in binary, so they must be represented approximately. Here's an example with 9.4:
The integer part is easy: 9 = 1001₂. Now we have to convert the fractional part 0.4 to binary by repeatedly multiplying by 2 and recording the integer bit produced at each step. We can do this iteratively:
Step | Result | Bit |
0 | 0.4 × 2 = 0.8 | 0 |
1 | 0.8 × 2 = 1.6 | 1 |
2 | 0.6 × 2 = 1.2 | 1 |
3 | 0.2 × 2 = 0.4 | 0 |
4 | 0.4 × 2 = 0.8 | 0 |
5 | 0.8 × 2 = 1.6 | 1 |
6 | 0.6 × 2 = 1.2 | 1 |
7 | 0.2 × 2 = 0.4 | 0 |
Thus 0.4 = 0.0110 0110 0110...₂, a repeating pattern that never terminates. We can then finish the encoding: 9.4 = 1001.0110 0110...₂ = 1.0010110011001100...₂ × 2^3, so e = 3 and the biased exponent is u = 3 + 1023 = 1026. We now have 10000000010. Continuing, the 52-bit fraction field 0010110011001100... is filled out and rounded to nearest, so its final bits become ...1101.
And finally we have hex 0x4022cccccccccccd.
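Again, a quick check of this encoding (my own addition) is to ask MATLAB for the stored bit pattern directly:
num2hex(9.4)   % returns '4022cccccccccccd'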
There are a lot of cases where floating-point seems better because of its malleable, massive range of values. And that's true. But what about when you want to distinguish between nearby values with high precision? What happens when you want to represent very small changes? Many floating-point representations may not be able to do this, especially not with fewer resources than a fixed-point representation. This relates to how coarse the precision in a representation is. A coarse grind of salt means large and uneven granules. You wouldn't use this salt for baking, which needs to be exact and homogeneous. Instead, it's used to sprinkle on top of food because you want to taste heterogeneity. You want those bursts of flavor that stand out. In the same way, floating-point is great when you need range rather than clear uniformity, when a little bit of rounding or approximation is okay here and there and you don't really need to distinguish between minuscule values. This is what we would describe as coarse.
Recall that numerical analysis is the study of the algorithms used to solve problems in continuous mathematics. Continuous mathematics requires us to work with the real and complex numbers, ℝ and ℂ. But these numbers cannot be stored exactly on computers because computers have finite memory and use binary representations. Thus, we have to approximate them. One of the primary tools used to approximate them is IEEE 754. Solving these problems is the backbone of modern day science and engineering. Indeed, much of the world would fail to run without numerical analysis—with IEEE 754 as the standard at its core.
1.4 IEEE 754 and NVIDIA
NVIDIA is one of the most powerful and influential companies on the planet. As a leader in the semiconductor industry, it produces graphics processing units (GPUs). GPUs can accelerate everything from the standard graphics rendering to intense numerical computations with the right software. They are designed to run in parallel, making them massively efficient and scalable. These chips are the engines of video games, scientific computing, and AI.
NVIDIA Voyager and Park, NVIDIA Newsroom. © NVIDIA
GeForce RTX 5090 Graphics Card, designed to be the workhorse of AI. © NVIDIA
At the foundation of all of this is IEEE 754. Proper approximation of numbers via floating-point is essential to ensure not only accuracy but also to guarantee efficiency and reproducibility. Whether it's simulating the amount of a feedstock left in an industrial chemical process, optimizing a stock portfolio, rendering explosions for a movie, or training a neural network with billions of parameters, correct use of floating-point will allow for the best results. To begin, let's look at floating-point operations. Let's say we have three numbers stored in IEEE 754 single-precision:
These numbers are given a small nonzero tail just to showcase some of the effects of floating-point rounding.
Addition of real numbers is associative, so if we were to add these purely mathematically, the grouping would not matter and we would get:
But if we add these in floating-point, the results differ. We can break these down step by step to show the differences.
denotes the numbers that would be stored in the computer and operated on using floating-point arithmetic:
The floating-point computations deviate from the exact mathematical computations. They also differ from each other depending on where and how much you round.
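Here is a small illustration of the same effect in MATLAB, using single precision so the rounding is easy to see. These particular values are my own choice, not the numbers from the example above:
a = single(1e8); b = single(-1e8); c = single(1);  % assumed demo values
(a + b) + c   % returns 1: the large terms cancel first, then c survives
a + (b + c)   % returns 0: c is absorbed when added to the large b first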
In 2008, IEEE 754 was revised with the fused multiply-add operation (FMA). It computes a × b + c with only one rounding step at the end. If not for this, the computation would be fl(fl(a × b) + c), rounding once after the multiplication and again after the addition, which we know can produce a different result.
Let's say we have
. We want to find
. Mathematically,
. We want a floating-point value that is as close to possible as
. If we were to round twice using the non-FMA method, we would get
.
If we were to then round using FMA, we would get
.
This is more accurate.
When it comes to a final result, the place where you round matters. Rounding errors accumulate differently depending on when, where, and how often you do so. Thus, computing in high precision and then rounding in low precision is not the same as a value computed solely in low precision.
While FMA is faster and more accurate than performing the operations separately, there are instances in which you might not want it. By performing the operations separately, you can debug or interpret certain steps more easily. The other reason is that the codebase or hardware you are working with might not support FMA. That is, it might target the version of IEEE 754 from before 2008. Old subroutines written in legacy Fortran or C++ might not be compatible with FMA. That being said, by default, most GPUs operate in a fully IEEE 754-compliant mode. However, NVIDIA GPUs are configurable. You can put them in fast mode, which prioritizes performance. These are the three compiler flags:
Flag | Purpose |
-ftz=true | Denormals are flushed to zero |
-prec-div=false | Division is not computed to the nearest floating-point value |
-prec-sqrt=false | Square roots are not computed to the nearest floating-point value |
The accuracy of these operations will then be worse, as rounding may no longer be correct. This does not apply to double-precision.
NVIDIA GPUs are different from x86 CPUs, which use a dynamic floating-point control word. The GPUs instead encode rounding and precision behavior directly into each instruction. They don't have error handling (trap handlers) for floating-point exceptions like overflow, underflow, or dividing by zero. This is because the GPUs are designed for massively parallel computation, and trap handlers would stall this process and hurt performance.
Another important product of NVIDIA is called Compute Unified Device Architecture (CUDA). It is a parallel computing platform that allows software to accelerate not only graphics rendering, but also general-purpose tasks. In other words, it lets you write code that runs on your GPU instead of your CPU. CUDA can be used for a variety of applications such as deep learning with PyTorch. CUDA also has a library of math functions. While it's easy to say what the right choice is for addition and multiplication in IEEE 754, it's a lot more difficult for more complex computations. CUDA's math library strikes a good compromise between speed and accuracy. And while GPU programming with CUDA is primarily for languages like C++, MATLAB has over a thousand functions that work specifically on NVIDIA GPUs (since R2010b). If you have toolboxes like the Parallel Computing Toolbox, you can do something like the following:
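The original listing for this step isn't reproduced here, so below is a minimal sketch of the kind of workflow meant, assuming you have the Parallel Computing Toolbox and a supported NVIDIA GPU:
A = rand(4000, 'single');   % create data on the CPU
G = gpuArray(A);            % transfer it to the GPU
B = G * G';                 % this matrix multiplication runs on the GPU
result = gather(B);         % bring the result back to CPU memory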
Keep in mind that moving code across different systems—such as from a CPU to a GPU, or from an NVIDIA GPU to an AMD GPU—can result in differences in computation. This is due to variations in floating-point precision, order of operations, and different math libraries. Remember to have some sort of system that accounts for these different outputs. Familiarize yourself with the capabilities of your GPU. In MATLAB, you can begin by running gpuDevice, which reports your device's name, memory, and supported features.
The growth of large language models has driven advances in both computer architecture and the standards of numerical computation. If built correctly, ultra-low-precision data formats can deliver faster computation while maintaining high accuracy. NVIDIA is currently working on this with 8-bit floating point (FP8), building on the previous format for efficient neural network training, brain floating point 16 (BF16). FP8 has two variants, E4M3 and E5M2, which are optimized for different stages of training. When it comes to Blackwell, NVIDIA's new series of GPU architectures, MXFP8 is used to alleviate any sacrifices of accuracy through block-scaling. These innovations reflect the ongoing evolution of floating-point formats beyond the original IEEE 754 standard to meet the demands of modern AI workloads.
For more information, please check out NVIDIA's technical resources which are in Bibliography.
2. Numerical Methods
Now that we understand more about numerical analysis and computing, let's go through numerical methods. Numerical methods are techniques used to approximate mathematical processes that cannot be solved exactly, or analytically. You don't need to master all of these techniques to follow the applications in later chapters, but this will still give you "under the hood" insight into how the most demanding problems in science and engineering are being solved today. If some of the math is unfamiliar, please check out my notes on calculus, linear algebra, and differential equations. That said, please do not feel discouraged. If you've already made it this far, you've done fantastic work. Learn these math courses then come back here whenever you are ready. The material in them is a drop in the pond compared to what you can learn and then accomplish. Do not let it hold you back!
2.1 Calculus Review
This is not meant to be comprehensive or mathematically flawless. Instead, it's a way for you to revisit some content that you may not have seen since AP Calculus.
Limits tell us how functions behave as they get close to some point. The limit L of f(x) as x approaches a is written
lim_{x→a} f(x) = L.
Let's graph f(x) = (x² − 4)/(x − 2). This function is undefined at x = 2 and thus has a hole there, even though the limit as x approaches 2 is 4.
f = @(x) (x.^2 - 4)./(x - 2); % Equivalent to f(x) = x + 2 except at x = 2
x = linspace(1.5, 2.5, 100); % Values near x = 2
plot(x, f(x), 'b-'); hold on % Plot the function around the hole
plot(2, 4, 'ro') % The limit point at x = 2
xlabel('x'); ylabel('f(x)')
title('Limit as x approaches 2')
We can use a form more useful for numerical methods by saying that
lim_{h→0} f(a + h) = L.
This version is more valuable because we will often use small values of h to probe how a function behaves near a.
We can say that the function is continuous at x = a if
lim_{x→a} f(x) = f(a).
This means that the function exists (doesn't have jumps or holes) at a.
A sequence of approximations will get closer to the correct answer over multiple iterations if there is convergence. That is, a sequence {x_n} has a limit L if
lim_{n→∞} x_n = L.
This would be called a convergent sequence. In terms of sequences, f is continuous at a if f(x_n) → f(a) for every sequence {x_n} that converges to a.
If a function is continuous on the interval [a, b], and L is a number between f(a) and f(b), there exists a number c with a ≤ c ≤ b where f(c) = L.
This is known as the intermediate value theorem.
For a function f that is continuous on [a, b], there exists a lower bound m, an upper bound M, and two numbers x_m and x_M in [a, b] such that m = f(x_m) ≤ f(x) ≤ f(x_M) = M when x is in [a, b].
This is known as the extreme value theorem.
Derivatives tell us how fast a function is changing. For a function f defined on an interval containing a, f is differentiable at a if the following limit exists.
The derivative is defined as
f'(a) = lim_{h→0} (f(a + h) − f(a)) / h.
In numerical computing, we pick a small number for h and use it to approximate derivatives:
f'(a) ≈ (f(a + h) − f(a)) / h
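As a quick illustration (using an example function of my own rather than one from the text), we can approximate the derivative of sin(x) at x = 1 and compare it with the exact value cos(1):
f = @(x) sin(x);
a = 1; h = 1e-5;                  % small step size
approx = (f(a + h) - f(a)) / h    % finite-difference approximation
exact = cos(1)                    % true derivative of sin at x = 1
abs(approx - exact)               % error on the order of h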
If f is differentiable at a, then f is also continuous at a. If a function is continuous on [a, b] and differentiable on (a, b), then there exists some value c in (a, b) such that
f'(c) = (f(b) − f(a)) / (b − a)
This is the mean value theorem, and it says that there is a point in the interval where the instantaneous rate of change is equivalent to the average rate of change. More specifically, the slope of the tangent line to f at x = c is equal to the slope of the secant line through (a, f(a)) and (b, f(b)). Here's an example:
For the continuous function f(x) = sin(x) on the interval [0.1, 2.1], the mean value theorem guarantees that there exists a point c in (0.1, 2.1) such that
f'(c) = (sin(2.1) − sin(0.1)) / (2.1 − 0.1).
Using f'(x) = cos(x), we can solve for c: c = arccos((sin(2.1) − sin(0.1)) / 2) ≈ 1.179.
At c, the instantaneous slope of f is equal to the average slope over the interval.
f = @(x) sin(x); % Function from this example
a = 0.1; b = 2.1; % Interval endpoints
c_mvt = acos((f(b) - f(a)) / (b - a)); % Solve cos(c) = average slope
y_mvt = f(c_mvt); % Function value at c
m_secant = (f(b) - f(a)) / (b - a);
c_secant = f(a) - m_secant * a;
tangent_line = @(x) m_secant * (x - c_mvt) + y_mvt; % Define tangent line at c
x = linspace(0, 2.5, 300);
y = f(x); % Curve values for plotting
plot(x, y, 'b-', 'LineWidth', 1.5); hold on
plot(x, m_secant * x + c_secant, 'r--', 'LineWidth', 1.5); % Secant
plot(x, tangent_line(x), 'g-.', 'LineWidth', 1.5); % Tangent
plot(c_mvt, y_mvt, 'ko', 'MarkerSize', 8, 'MarkerFaceColor', 'k'); % MVT point c
xlabel('x'); ylabel('f(x)');
title('MVT on f(x) = sin(x) over [0.1, 2.1]');
legend('f(x) = sin(x)', 'Secant line', 'Tangent at c', 'MVT point c', 'Location', 'southeast');
Integration tells us how much a function accumulates. The area under a curve from a to b for a function f(x) can be found as
∫_a^b f(x) dx.
Recall that there are different methods for numerical integration such as the trapezoidal rule:
∫_a^b f(x) dx ≈ (h/2) [f(x_0) + 2f(x_1) + 2f(x_2) + ... + 2f(x_{n−1}) + f(x_n)],
where h = (b − a)/n for a number of subintervals n.
(Figure: trapezoidal rule approximation for an example function, interval, and number of subintervals.)
And Simpson's rule:
∫_a^b f(x) dx ≈ (h/3) [f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + ... + 4f(x_{n−1}) + f(x_n)],
where h = (b − a)/n for an even number of subintervals n.
(Figure: Simpson's rule approximation for an example function, interval, and number of subintervals.)
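As a sketch of how you might apply these rules in MATLAB (the integrand and interval below are my own choices, not the ones pictured), trapz implements the composite trapezoidal rule on sampled data, and integral gives a high-accuracy reference value:
f = @(x) exp(-x.^2);        % assumed example integrand
x = linspace(0, 1, 11);     % n = 10 subintervals
I_trap = trapz(x, f(x))     % composite trapezoidal rule estimate
I_ref = integral(f, 0, 1)   % adaptive quadrature for comparison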
If f is continuous over [a, b] and F is an antiderivative of f on [a, b], then we have the following, which is known as the first fundamental theorem of calculus:
∫_a^b f(x) dx = F(b) − F(a)
If f is continuous over [a, b] and F(x) = ∫_a^x f(t) dt, then we have the second fundamental theorem of calculus:
F'(x) = f(x)
For a function f that is continuous over [a, b], there exists a number c in [a, b] such that
f(c) = (1/(b − a)) ∫_a^b f(x) dx,
where f(c) is the average value of f over [a, b]. This is known as the mean value theorem for integrals.
A series is the sum of an infinite sequence of terms:
a_1 + a_2 + a_3 + ...
The nth partial sum is given by S_n = a_1 + a_2 + ... + a_n. The infinite series converges if and only if the sequence of partial sums S_n converges to a limit S:
lim_{n→∞} S_n = S
If the series does not converge, we say that it diverges. The process of sequences converging towards a target is the backbone of iterative algorithms in numerical analysis. Series are the sum of a sequence and thus help us build approximations for functions.
If a function f is differentiable at x = a, it can be approximated using an nth-order polynomial:
P_n(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)² + ... + (f^(n)(a)/n!)(x − a)^n
2.2 Computational Errors
Like most of the equations you see in your science classes, the models depicted by numerical methods are often approximations. If deviations are systematically high or low, there is likely something not solely due to chance that is affecting your model. If the errors are more randomly distributed and tightly grouped around the prediction, your model is still likely accurate. We've thrown around the terms accuracy and precision a lot, so let's clarify what the two mean now. Accuracy refers to how close your computed value is to the true value in nature. Precision refers to how close the values are to each other. Let's say you are performing an experiment to calculate acceleration due to gravity g on Earth and get this final value:
Your result is accurate because it is pretty close to the standard value of 9.81 m/s². We can't say anything about precision here because it is only a single measurement. However, if you were to get this set of data:
Your experiment data are precise because they are close to each other but not accurate because they are far from the true value. Generally, if your data show precision over time, this can give you confidence that your results are reliable and trending towards accuracy.
Here's a visual with targets:
- For low accuracy and low precision, your data are not close to the actual value (green center) nor are they closely grouped.
- For low accuracy and high precision, your data are not close to the actual value, but they are closely grouped.
- For high accuracy and low precision, your data are generally close to the center, thus they are accurate. But because they are not closely grouped, they are imprecise.
- For high accuracy and high precision, your data are close to the center and tightly grouped.
We've talked about error in our representations of real numbers before, but here are the actual terms and some formulas:
E_t = |true value − approximation|
This would be known as absolute error. On the other hand, relative error is
ε_t = |true value − approximation| / |true value|.
Multiplying this by 100% would get you percent relative error. In scientific computing, we often will not know the true answer beforehand. After all, if we knew all the answers, we wouldn't have to perform many calculations. Thus, we can normalize error using the best possible estimate:
ε_a = |approximate error| / |best estimate| × 100%
In the case that we are iterating, our current approximation will be based on the previous approximation. Over time, we should get closer to the actual value because our approximations should be getting better and better. Here, percent error is given by
ε_a = |(current approximation − previous approximation) / current approximation| × 100%
I've included absolute values in these equations, which means that they only apply to the magnitude of the error. You can take them off if you want signed error. Generally, we don't care about the sign of the error. Instead, we care about whether the magnitude of the percent relative error is below a set tolerance ε_s. If we are iterating towards a solution, we would stop once ε_a < ε_s. Take for example a Maclaurin series expansion for e^x, where
e^x = 1 + x + x²/2! + x³/3! + ... + x^n/n!.
As we keep iterating, we keep adding more terms to the sequence and the approximation gets closer and closer to the true value of e^x. We can use the following equation to determine what would be considered a sufficiently accurate result:
ε_s = (0.5 × 10^(2−n)) %, where n is the number of significant figures we want to trust.
This yields what is known as the stopping criterion.
Plugging in our values (estimating e^0.5 to three significant figures, so n = 3) gets us ε_s = 0.05 %. We would keep iterating until ε_a < 0.05 %. The true value is e^0.5 = 1.648721... If we were to use the single term e^0.5 ≈ 1 as our first estimate, we would have ε_t = 39.3 %. Then with the second estimate 1 + 0.5 = 1.5, we would have ε_t = 9.02 %. We can then calculate the approximate error between these two estimates. The approximate estimate of this error would be ε_a = |(1.5 − 1)/1.5| × 100 % = 33.3 %. Our value of ε_a is still not below the threshold, so we would keep iterating through. After 6 terms, our result is 1.6486979 and ε_a = 0.0158 %, which is below 0.05 %. This is a sufficient approximation. Notice how the number of significant digits that agree with the true value has increased to 5. This is because our equations for the approximate estimate of error and the stopping criterion are conservative. They usually guarantee a result that is at least as good as what they yield, if not better.
In MATLAB, we can describe the series expansion as e^x ≈ 1 + x + x²/2! + ... + x^n/n!, adding one term per iteration. The function expoApprox passes the value to be evaluated x with stopping criterion es and a maximum allowable number of iterations maxit. In other words, it keeps adding terms until the approximate relative error is below the threshold or the number of iterations reaches a point where performance will be hurt. Convergence might never happen if there is something wrong with the program, so maxit essentially prevents infinite loops. In less extreme cases, it might just take a really long time to reach convergence, so it also can prevent very long runtimes.
function [fx, ea, iter] = expoApprox(x, es, maxit)
% expoApprox - Approximate e^x using Maclaurin series
% x - value at which to evaluate e^x
% es - stopping criterion, default = 0.0001%
% maxit - maximum number of iterations, default = 50
% fx - estimated value of e^x
% ea - approximate relative error (%)
% iter - number of iterations performed
if nargin < 2 || isempty(es), es = 0.0001; end % handling for if the user did not provide a second nonempty input
if nargin < 3 || isempty(maxit), maxit = 50; end % handling for if the user did not provide a third nonempty input
iter = 1; sol = 1; ea = 100; % start from the first term of the series
while (ea > es && iter <= maxit)
    solold = sol; % store the current estimate in solold
    sol = sol + x^iter / factorial(iter); % add the next term to sol
    iter = iter + 1; % continue to the next iteration
    if sol ~= 0 % error checking to prevent dividing by zero
        ea = abs((sol - solold) / sol) * 100;
    end
end
fx = sol; % return the final estimate
end
[x_val, error, iterations] = expoApprox(1, 1e-6, 100)
x_val =
2.718281826198493
error =
9.216155641522974e-07
This result means that after 12 iterations, a result of 2.718281826198493 comes with an approximate error estimate of 9.216 × 10^(−7) %.
We've already worked through roundoff error extensively, but I want to give a quick list of places it can occur in when doing arithmetic:
- Adding two numbers where the result requires a mantissa that is longer than what can be stored. In this case, some bits must be rounded off. This can manifest as subtractive cancellation, which happens when you subtract a number from a number that it is very close to in value. Because their leading digits are the same, you are left with a very imprecise number.
- Very lengthy computations will skew towards the more recently computed values.
- When you add a large and small number, the smaller number's decimal point can be shifted so far to the left (to match the exponent of the larger one) that it becomes essentially zero in the calculation.
- Smearing occurs when the individual terms in a summation are larger than the actual summation.
- Some infinite series, especially slowly converging ones, are prone to roundoff error.
Here's an example of what happens when we try to add a very large and a very small number:
Suppose we want to add 0.0010 to 4000. In floating-point, we will use a 4-digit mantissa and a 1-digit exponent:
0.4000 × 10^4 + 0.1000 × 10^(−2)
The exponents don't match, so we will have to shift 0.1000 × 10^(−2) to 0.0000001 × 10^4. And we can now add:
0.4000001 × 10^4
We can only store 4 digits in the mantissa, so we have to round this to 0.4000 × 10^4. We have lost the 0.0010 due to rounding, like it almost never existed in the calculation in the first place. It contributed zero to the end result.
Truncation errors occur when a mathematical procedure is cut short. This happens when you use an approximation in place of the proper mathematical expression. The origin of this is when people would replace functions with a Taylor series. The Taylor series is extremely valuable in numerical analysis because by the Taylor theorem, any smooth function can be approximated as a polynomial.
The Taylor series expansion looks like
f(x_{i+1}) = f(x_i) + f'(x_i)h + (f''(x_i)/2!)h² + (f'''(x_i)/3!)h³ + ... + (f^(n)(x_i)/n!)h^n + R_n, where h = x_{i+1} − x_i.
If the function has more curvature, you would have to keep adding more derivatives.
The remainder term R_n for a truncated Taylor series can be expressed as
R_n = (f^(n+1)(ζ)/(n+1)!) h^(n+1),
where ζ is some value of x between x_i and x_{i+1}. In general, an nth-order Taylor series expansion will be exact for an nth-order polynomial. For some differentiable and continuous functions such as exponentials and sinusoids, a finite number of terms will not yield an exact estimate. You would need an infinite number of terms for the series to converge. The number of terms that is considered sufficient is dependent on the size of the remainder. The problem with this is that ζ is not known exactly, and neither is f(x), because that is the function we are trying to approximate! Not knowing the latter means we can't calculate f^(n+1)(ζ), which is necessary to evaluate R_n exactly. We can, however, control how far away from x_i we want to evaluate the function, which means we can control the step size h and the number of terms in the expansion. As a result, we can express the remainder in the more useful form of
R_n = O(h^(n+1))
Big O notation describes the limiting behavior of a function as its argument approaches a value. In computer science, it is used to assess the speed of an algorithm as the input grows. In this case, we are controlling the step size h, and the truncation error is on the order of O(h^(n+1)). If the error is O(h), cutting h in half would halve the error. If the error is O(h²), cutting h in half would quarter the error. Let's work through some examples.
We will use Taylor series expansions up to order n = 6 to approximate f(x) = cos(x) at x_{i+1} = π/3 centered at x_i = π/4. We know that x_{i+1} − x_i = π/3 − π/4, thus h = π/12. Using the true function, we can acquire the correct value of cos(π/3) = 0.5. The zero-order approximation is cos(π/3) ≈ cos(π/4) = 0.707106781.
We can find percent relative error:
We will now add the first derivative term
for the first-order approximation:
We will now add the second derivative term
for the second-order approximation:
Jumping ahead to
we can get the sixth-order approximation:
We can also ignore the remainder by truncating it:
We can use the Taylor series to estimate truncation errors. We can expand the function f(x_{i+1}) as a Taylor series:
f(x_{i+1}) = f(x_i) + f'(x_i)(x_{i+1} − x_i) + R_1
Let's now truncate the series after the first derivative term to get f(x_{i+1}) ≅ f(x_i) + f'(x_i)(x_{i+1} − x_i).
Then we have the following:
f'(x_i) = (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i) − R_1 / (x_{i+1} − x_i)
Using h = x_{i+1} − x_i and R_1/h = O(h), we can further simplify this as such:
f'(x_i) = (f(x_{i+1}) − f(x_i)) / h + O(h)
The truncation error is proportional to step size h. Thus, if we halve step size, we would expect the error of the derivative to also halve. More generally, this can be written as
f'(x_i) = Δf_i / h + O(h), where Δf_i = f(x_{i+1}) − f(x_i).
This is known as a finite difference, or more specifically a forward difference. There is also a backward difference, which uses the previous point before x_i to approximate the derivative,
f'(x_i) = (f(x_i) − f(x_{i−1})) / h + O(h),
and a central difference that uses points on both sides of x_i:
f'(x_i) = (f(x_{i+1}) − f(x_{i−1})) / (2h) + O(h²)
This method is more accurate.
For now, let's find finite difference approximations to estimate
for
. We will say that
and
. The true value is
.
We can now find the differences:
Central difference is significantly more accurate.
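The function and step size from the worked example above aren't reproduced here, but the following sketch (with a test function of my own) shows how the three differences compare in MATLAB:
f = @(x) exp(x);                      % assumed test function; f'(x) = exp(x)
x = 1; h = 0.1;
fwd = (f(x + h) - f(x)) / h;          % forward difference, O(h)
bwd = (f(x) - f(x - h)) / h;          % backward difference, O(h)
ctr = (f(x + h) - f(x - h)) / (2*h);  % central difference, O(h^2)
errs = abs([fwd bwd ctr] - exp(1))    % the central-difference error is much smaller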
Total numerical error is the summation of the truncation and roundoff errors. This introduces a dilemma: the best way to reduce truncation error is to reduce the step size, while the best way to reduce roundoff error is to increase the number of significant figures carried. Reducing step size can lead to subtractive cancellation, so the truncation errors would decrease at the cost of the roundoff errors increasing. Thankfully, in MATLAB this is rarely a problem because computations are carried out in double precision by default.
The central difference approximation of a first derivative can be written as
f'(x_i) = (f(x_{i+1}) − f(x_{i−1})) / (2h) − (f'''(ξ)/6) h².
Adding on roundoff error such that f(x_{i+1}) = f̃(x_{i+1}) + e_{i+1} and f(x_{i−1}) = f̃(x_{i−1}) + e_{i−1}, we have
f'(x_i) = (f̃(x_{i+1}) − f̃(x_{i−1})) / (2h) + (e_{i+1} − e_{i−1}) / (2h) − (f'''(ξ)/6) h²,
where f̃ denotes the rounded function values and e denotes the associated roundoff error. Assuming an upper bound on roundoff error ε, the maximum possible value of the difference e_{i+1} − e_{i−1} is 2ε. We will also assume that the third derivative has a maximum absolute value of M. Then, we have an upper bound on the absolute value of the total error:
|E| ≤ ε/h + (M/6) h²
From this we can differentiate with respect to h and set the result equal to zero to find an optimal step size:
h_opt = (3ε/M)^(1/3)
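A minimal numerical experiment (my own, with an assumed test function) makes this tradeoff visible: as h shrinks, the truncation error falls until roundoff takes over and the total error grows again.
f = @(x) sin(x);  x = 1;             % assumed test function with known derivative cos(1)
h = 10.^(-1:-1:-12);                  % step sizes from 1e-1 down to 1e-12
d = (f(x + h) - f(x - h)) ./ (2*h);   % central difference estimates
err = abs(d - cos(1));                % total error versus the true derivative
loglog(h, err, 'o-'), xlabel('h'), ylabel('error')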
It's difficult to come up with a concrete, generalized way to estimate error. Every type of problem is different. Two good practices are to avoid subtractive cancellation and to operate on the smallest numbers first; both help prevent loss of significance. In large computations, it can be incredibly tedious to track total numerical error. In that case, you should move forward and focus on evaluating the accuracy of your results. How errors are introduced and how much they grow during a computation is known as the numerical stability of an algorithm.
2.3 Root-finding
A nonlinear equation is any equation where the change of the output is not proportional to the change of the input. Systems of nonlinear equations are particularly important to scientists and engineers because most real systems are nonlinear in nature. In the simplest case, we might want to find the solution of f(x) = 0, because that would mean finding a root. For instance, let's find the mass of a bungee jumper with a drag coefficient of 0.25 kg/m and a velocity of 36 m/s after 4 s of free fall.
cd = 0.25; g = 9.81; v = 36; t = 4;
mp = linspace(50, 200); % candidate masses in kg
fp = sqrt(g*mp/cd).*tanh(sqrt(g*cd./mp)*t)-v;
plot(mp, fp), grid on
xlabel('Mass (kg)'); ylabel('f(m)')
title('Bungee Jumper Mass')
Visually, we can estimate that the root lies somewhere before 150. We will just say m ≈ 145 kg. We can check the validity of this using the following:
sqrt(g*145/cd)*tanh(sqrt(g*cd/145)*t)-v
That is fairly close to zero. We can also plug it into an equation to find velocity:
sqrt(g*145/cd)*tanh(sqrt(g*cd/145)*t)
This is fairly close to the given fall velocity of 36 m/s. Alternatively, we can use linear interpolation to approximate the root. Since we know there's a root somewhere between two sampled points m_1 and m_2 where the function changes sign, we can draw a straight line through (m_1, f(m_1)) and (m_2, f(m_2)) and find where it crosses the axis. Solving for where the line equals zero, we get
m_root ≈ m_1 − f(m_1)(m_2 − m_1) / (f(m_2) − f(m_1)).
We can code this up:
idx = find(fp(1:end-1).*fp(2:end) < 0); % Find the index where the function changes sign (crosses zero)
m1 = mp(idx); % First mass value, before the sign change
m2 = mp(idx+1); % Second mass value, after the sign change
f1 = fp(idx); % Gets the function value at m1
f2 = fp(idx+1); % Gets the function value at m2
approxRoot = m1 - f1.*(m2 - m1)./(f2 - f1)
This tells us that the approximate root is 142.7400. We can check the validity of this using the following:
sqrt(g*approxRoot/cd)*tanh(sqrt(g*cd/approxRoot)*t)-v
That is much closer to zero. We can also plug it into an equation to find velocity:
sqrt(g*approxRoot/cd)*tanh(sqrt(g*cd/approxRoot)*t)
This is very close to the given fall velocity of 36 m/s. The first method is imprecise and relies on a single human guess. The second uses the sampled data itself and gets much closer.
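We will solve this same problem more carefully with bracketing and open methods below, but as a quick sanity check (an addition of mine, not part of the original walkthrough) you can already hand the function to MATLAB's fzero, which is discussed later in this chapter:
cd = 0.25; g = 9.81; v = 36; t = 4;
fm = @(m) sqrt(g*m/cd).*tanh(sqrt(g*cd./m)*t) - v;
m_root = fzero(fm, 140)   % returns approximately 142.74 kg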
There are two categories of root-finding. Bracketing methods are based on two initial guesses on either side of the root. Open methods rely on at least one initial guess. What makes these better is that they are iterative. Assuming a situation with no errors, bracketing methods will converge slowly but always find a root. On the other hand, open methods can diverge but can find a root faster if there is one.
In the bungee jump example, we observed that the function changed signs on opposite sides of the root. Generally, if f(x) is a real function that is continuous on [x_l, x_u] where f(x_l) has a sign opposite to f(x_u), we can say that
f(x_l) · f(x_u) < 0.
This would mean that there is at least one real root on [x_l, x_u]. One problem with incremental methods starts with what you choose as the increment length. An increment length that is too small can result in a very long search. An increment length that is too long might miss roots. We can use the following program for an incremental search that locates the roots of a function func from xmin to xmax, and apply it to identify brackets within [3, 6] for the function f(x) = sin(10x) + cos(3x):
function xb = incsearch(func,xmin,xmax,ns)
% incsearch: incremental search root locator
% xb = incsearch(func,xmin,xmax,ns):
% finds brackets of x that contain sign changes
% of a function on an interval
% func = name of function
% xmin, xmax = endpoints of interval
% ns = number of subintervals (default = 50)
% xb(k,1) = the lower bound of the kth sign change
% xb(k,2) = the upper bound of the kth sign change
% if no brackets found, xb = []
if nargin < 3, error('at least 3 arguments required'), end
if nargin < 4, ns = 50; end % if ns blank set to 50
x = linspace(xmin,xmax,ns); % evaluation points across the interval
f = func(x); % function values at each point
nb = 0; xb = []; % xb is null unless sign change detected
for k = 1:length(x)-1
    if sign(f(k)) ~= sign(f(k+1)) % check for sign change
        nb = nb + 1;
        xb(nb,1) = x(k); % store the bracket endpoints
        xb(nb,2) = x(k+1);
    end
end
if isempty(xb) % display that no brackets were found
    disp('no brackets found')
    disp('check interval or increase ns')
else
    disp(['number of brackets:', num2str(nb)])
    disp(xb)
end
end
Plotting this reveals the root locations:
x = linspace(3, 6, 1000); % fine grid for plotting
f = sin(10*x) + cos(3*x);
plot(x,f,'b','LineWidth',1.5); hold on; grid on
xlabel('x'); ylabel('f(x)')
title('Function Plot with Root Brackets')
xb = incsearch(@(x) sin(10*x) + cos(3*x), 3, 6, 50);
number of brackets:5
3.244897959183673 3.306122448979592
3.306122448979592 3.367346938775510
3.734693877551020 3.795918367346939
4.653061224489796 4.714285714285714
5.632653061224490 5.693877551020408
x_bracket = xb(:); % bracket endpoints returned by incsearch
y_bracket = sin(10*x_bracket) + cos(3*x_bracket);
plot(x_bracket, y_bracket, 'ks', 'MarkerFaceColor', 'r')
The bisection method is a variation of the incremental search method where the interval is always divided in half. If a function changes sign over an interval, the function value at the midpoint is evaluated. The root must then lie within the half of the interval where the sign change occurs, and that subinterval becomes the interval for the next iteration. This is repeated until the required precision is met. In other words, the endpoints of the interval are moved closer and closer together until the zero is bracketed as tightly as needed. Here is the formal theorem:
For a continuous function f on [a, b] where f(a) and f(b) have opposite signs, there exists a number c in [a, b] such that f(c) = 0. If {c_n} represents the sequence of midpoints generated by the bisection method, then for each iteration n, we have
|c_n − c| ≤ (b − a) / 2^n.
The sequence {c_n} gets exponentially closer to the true root c every iteration. That is,
lim_{n→∞} c_n = c.
We can try this out for f(x) = x·sin(x) − 1 on the initial interval [0, 2]. For each iteration k, we have left endpoint a_k, right endpoint b_k, and midpoint c_k = (a_k + b_k)/2. The function value at the midpoint is f(c_k). If f(a_k)·f(c_k) < 0, the root is between a_k and c_k. Thus b_{k+1} = c_k. If this isn't true, the root is between c_k and b_k, so a_{k+1} = c_k. And then this repeats.
f = @(x) x.*sin(x) - 1; % function from this example
a = 0; b = 2; % initial bracket
fprintf('k\t a_k\t\t c_k\t\t b_k\t\t f(c_k)\n');
for k = 0:8
    c = (a + b)/2; fc = f(c); % midpoint and its function value
    fprintf('%d\t %.6f\t %.6f\t %.6f\t %.6f\n', k, a, c, b, fc);
    if f(a)*fc < 0, b = c; else, a = c; end % keep the half containing the sign change
end
0 0.000000 1.000000 2.000000 -0.158529
1 1.000000 1.500000 2.000000 0.496242
2 1.000000 1.250000 1.500000 0.186231
3 1.000000 1.125000 1.250000 0.015051
4 1.000000 1.062500 1.125000 -0.071827
5 1.062500 1.093750 1.125000 -0.028362
6 1.093750 1.109375 1.125000 -0.006643
7 1.109375 1.117188 1.125000 0.004208
8 1.109375 1.113281 1.117188 -0.001216
The midpoint is an approximation of the root. The actual root is approximately 1.1142. As you can see, f(c_k) tells us how close we are getting after each iteration. Even though its sign flips back and forth, remember that we care about the magnitude, which keeps getting smaller.
Another method is known as false position, which is very similar to linear interpolation. Rather than bisecting the interval to find a midpoint, it locates the root by joining (x_l, f(x_l)) and (x_u, f(x_u)) with a straight line. Using similar triangles, the intersection of the straight line with the x-axis can be estimated as
x_r = x_u − f(x_u)(x_l − x_u) / (f(x_l) − f(x_u)).
x_r then replaces whichever of the two initial guesses, x_l or x_u, yields a function value with the same sign as f(x_r). This means that the values of x_l and x_u will always bracket the true root. The process is repeated until it yields a sufficient approximation.
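Here is a bare-bones sketch of that update loop, written by me as an illustration rather than taken from the text, using x·sin(x) − 1 on [0, 2] as an assumed test function with a valid starting bracket:
f = @(x) x.*sin(x) - 1;   % assumed example function with a root in [0, 2]
xl = 0; xu = 2;
for k = 1:20
    xr = xu - f(xu)*(xl - xu)/(f(xl) - f(xu)); % false-position estimate
    if f(xl)*f(xr) < 0      % root lies in [xl, xr]
        xu = xr;
    else                    % root lies in [xr, xu]
        xl = xr;
    end
end
xr                          % converges toward the root near 1.1142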
Let's use the bisection and false position methods to solve the bungee problem.
First, we need to guess two values of the unknown. Based on the Bungee Jumper Mass graph, we can see that the function has to change sign between 50 and 200. We could narrow down the region more, but let's be safe and use those two values. Our initial estimate of the root is the midpoint:
x_r = (50 + 200)/2 = 125
Setting the exact value of the root to be 142.7376 means that our percent relative error is about 12.4%.
Let's now multiply the function value at the lower bound by the function value at the midpoint:
f(50)·f(125) ≈ (−4.58)(−0.40) > 0
Because the product is greater than zero, we know that no sign change occurs here. We now know that the root must be between the new lower bound 125 and the upper bound 200, which stays the same. Thus, we have our next iteration
x_r = (125 + 200)/2 = 162.5
Continuing on to the next iteration, the sign change now lies between 125 and 162.5, so we have
x_r = (125 + 162.5)/2 = 143.75
This is a pretty reasonable approximation, but you can continue if you need more precision.
Let's now repeat it using the false position method starting with guesses of
and
. This means that
and
. We now have our first iteration:
For our second iteration, we start with
. We now know that
is the upper limit. Evaluating our function values, we have
and
. Then we can carry out the second iteration:
.
For the third iteration, we have
and
. Then we can plug in to get
While false position generally performs better than bisection, there are some cases, like this one, where that is not true. If the initial interval is not chosen carefully, false position can be slow because one endpoint gets stuck and the estimates creep toward the root from only one side. This happens if the function value at one endpoint is much further away from zero than the function value at the other, or vice-versa. Functions with significant curvature can be difficult for even the false position method. On the other hand, bisection struggles with flat functions.
For bracketing methods, the root is bounded by a lower and upper bound. Iterations of these methods, as you've now seen, will always converge towards the true value of the root. Let's move on to open methods, where that is not necessarily the case. Instead, your values do not have to bracket the root, and the iterations can diverge. That being said, when they do converge, they converge much faster.
Perhaps the most famous of all root-finding methods is the Newton-Raphson method. You essentially take an initial guess x_i near the root and then find the derivative at that point. What you are doing is using a tangent line to predict the root of a function. Geometrically, this means extending the tangent line at (x_i, f(x_i)) until it intersects the x-axis. The formula is
x_{i+1} = x_i − f(x_i) / f'(x_i).
Let's say we have
. Our first guess will be
. The first derivative is simply
. We can now evaluate the formula:
We can then iterate through:
i | x_i | ε_t (%) |
0 | 0 | 100 |
1 | | |
2 | | |
3 | | |
4 | | |
This converges very fast.
Let's work through another example. We want to find the root of
. We will pick
. Then we can iterate through:
For
and
, I had to include more significant digits to showcase the difference! These are identical for the first eight decimal places, which is a sufficient answer for the root. We can now try evaluating the bungee jumper example with the Newton-Raphson method. Recall that the function to be evaluated is
f(m) = sqrt(g·m/c_d) · tanh(sqrt(g·c_d/m) · t) − v
We can now use the following program:
function [root, ea, iter] = newtraph(func, dfunc, xr, es, maxit, varargin)
% newtraph: Newton-Raphson root location zeroes
% [root, ea, iter] = newtraph(func, dfunc, xr, es, maxit, p1, p2, ...):
% Uses Newton-Raphson method to find the root of func
% func = function handle for f(x)
% dfunc = function handle for derivative f'(x)
% xr = initial guess
% es = desired relative error (default = 0.0001%)
% maxit = maximum allowable iterations (default = 50)
% p1, p2, ... = additional parameters passed to func
% root = estimated root
% ea = approximate relative error (%)
% iter = number of iterations
if nargin < 3
    error('At least 3 input arguments required');
end
if nargin < 4 || isempty(es), es = 0.0001; end
if nargin < 5 || isempty(maxit), maxit = 50; end
iter = 0; ea = 100; % initialize the iteration counter and error
while (ea > es && iter < maxit)
    xr_old = xr;
    xr = xr_old - func(xr_old) / dfunc(xr_old); % Newton-Raphson update
    iter = iter + 1;
    if xr ~= 0
        ea = abs((xr - xr_old) / xr) * 100;
    end
end
root = xr;
end
y = @(m) sqrt(9.81*m/0.25).*tanh(sqrt(9.81*0.25./m)*4) - 36;
dy = @(m) (1/2)*sqrt(9.81/(m*0.25)).*tanh(sqrt(9.81*0.25./m)*4) - ( (1/2)*4*9.81./(2*m) ).*sech(sqrt(9.81*0.25./m)*4).^2;
newtraph(y, dy, 140, 0.00001)
One problem with the Newton-Raphson method is that you need to be able to calculate the derivative. Some functions have very complicated and long derivatives, which can potentially hurt performance. The solution is to use a secant line instead of a tangent line. Aptly, this is known as the secant method:
x_{i+1} = x_i − f(x_i)(x_{i−1} − x_i) / (f(x_{i−1}) − f(x_i))
Let's practice by finding the root of
. Let
and
. Then we can iterate through:
Notice how we are at the point where we have two iterations that agree with each other up to a certain number of decimal places. We have found a root sufficiently close to the true value which is
.
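Since the individual iterates of this example are not reproduced above, here is a generic secant loop of my own, using the bungee function from earlier in the chapter as an assumed test case with two starting guesses:
g = 9.81; cd = 0.25; t = 4; v = 36;
f = @(m) sqrt(g*m/cd).*tanh(sqrt(g*cd./m)*t) - v;
x0 = 50; x1 = 200;   % two initial guesses; no derivative is needed
for k = 1:10
    x2 = x1 - f(x1)*(x0 - x1)/(f(x0) - f(x1)); % secant update
    x0 = x1;  x1 = x2;
end
x1                   % approaches the root near 142.74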
There is something called the order of a root-finding algorithm. It refers to how quickly the method will converge. Generally, the order α is equal to the natural log of one error ratio over the natural log of another error ratio:
α ≈ ln(E_{n+1}/E_n) / ln(E_n/E_{n−1})
If we have the following errors
,
,
, the order could then be calculated as such:
This means that Newton's method converges quadratically (α ≈ 2). However, this isn't always the case; remember that Newton's method doesn't even necessarily converge.* We can also find the order of the secant method using the same formula as before. We will use
,
, and
. Thus we have
One interesting fact is that the order of bisection is α = 1, i.e., linear convergence. Compare this with the open methods: α = 2 for Newton-Raphson and α ≈ 1.618 for the secant method. Higher order means faster convergence, so if the circumstances allow for it, Newton's method is the best here. But if, let's say, the derivative is too complicated, you can try the secant method. And if you want the slower but always reliable method, go with bisection.
* As long as you pick a value sufficiently close to the root, it should converge.
What if you wanted a compromise? Well, there are just so many root-finding algorithms out there. The hybrid approach is known as Brent's method, which comes closest to having the reliability of the bracketing methods with the speed of the open methods. To begin, let's go through some history. Richard Brent's method is actually based on an earlier algorithm known as Dekker's method from Theodorus Dekker. His method starts similarly to some others: you have a root on the interval [a, b]. The sign of f(a) has to be opposite to the sign of f(b). If |f(a)| < |f(b)|, you have to swap a and b because you want to keep b as the better choice. Then, c becomes the previous value of a. We have a midpoint m = (b + c)/2, and we try to compute the secant intercept s using b and c. After updating your variables, you will repeat this process again until f(b) is zero or the bracket becomes smaller than your tolerance. Let's work through an example for
with
,
, and
:
a | b | | Midpoint m | Secant Intercept s | | | or |
1 | 2 | 1 | | | | 1 | |
2 | | 2 | | | 1 | | |
2 | | | | | 1 | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | 0 |
We finally have two values such that
. One cool thing about Dekker's method is that it includes handling for roundoff errors, overflow, and underflow. The main reason for this is that it can always fall back to bisection which is more numerically robust. Interestingly, one of the people who originally commented on Dekker's paper was George E. Forsythe, who I discuss in Introduction. To paraphrase, he praises Dekker's paper for
- making no assumptions about the smoothness of f
- being at least as good as bisection
- having the same fast convergence properties as the secant method for smooth functions
- not having the dangers of the secant method when there are multiple roots
Overall, he regards it as "safe and effective" but "also rather subtle" and thus "[meriting] careful study." Later came Brent's method, which tries interpolating with a parabola. If that doesn't work, it falls back to the secant method, and then bisection. The most important aspect of his method is known as inverse quadratic interpolation, which is quite similar to the secant method. First, you find the polynomial with the lowest degree that fits a set of points. With inverse quadratic interpolation, you fit a sideways parabola (x as a quadratic function of y) through three points. The main problems with inverse quadratic interpolation are that it is unreliable on its own, fails when two of the y values are identical, and requires you to start close. Thus, it is only ever really used as a part of Brent's method. Brent's method uses the same values a, b, and c as Dekker's method, but it tries to compute s using inverse quadratic interpolation through those three points. If this fails, it tries to compute s using b and c with the secant method. If all else fails, s becomes the midpoint m, which is bisection. After updating the points, you test whether s lies between b and m. Then you repeat until f(b) is zero, f(s) is zero, or the bracket is smaller than your tolerance. This prevents a lot of potential iteration. His method also has logic to handle roundoff error, overflow, and underflow. It also handles functions that aren't well-behaved with a new variable d, the previous value of c. Generally, it always has the option of falling back to bisection. To summarize, note that
- Dekker's method is just as fast as secant method when the function is well-behaved.
- Brent's method is just as fast as inverse quadratic interpolation and secant when well-behaved.
- Dekker's method and Brent's method both converge faster than linearly.
- Brent's method is at least as fast as Dekker's for well-behaved functions and is guaranteed to converge in a reasonable number of steps for any function.
In MATLAB, the fzero function uses the Brent-Dekker method. It lets you search for an interval with a sign change. This program from Cleve Moler essentially starts with your initial guess and then searches outward to find an interval where the function changes sign. In the example, we want to find the root of the function f(x) = 2 − x². We know that this really is at x = √2 ≈ 1.4142. We are starting with the point x = 1, and then expanding outward until we get the sign change. Thus, the root is bracketed in the interval [0.547452, 1.45255] shown below. After that, Brent's method does its work and converges on 1.4142.
function [a,b] = signchange(f,x)
% SIGNCHANGE [a,b] = signchange(f,x) seeks an
% interval [a,b] where f(x) changes sign.
a = x; b = x;
dx = x/50; % initial search step based on the guess
while sign(f(a)) == sign(f(b))
    dx = sqrt(2)*dx; % expand the search interval geometrically
    a = x - dx;
    b = x + dx;
end
end
f = @(x) 2 - x.^2; % the function from this example
opt = optimset('display','iter');
fzero(f,1,opt)
Search for an interval around 1 containing a sign change:
Func-count a f(a) b f(b) Procedure
1 1 1 1 1 initial interval
3 0.971716 1.05577 1.02828 0.942631 search
5 0.96 1.0784 1.04 0.9184 search
7 0.943431 1.10994 1.05657 0.883663 search
9 0.92 1.1536 1.08 0.8336 search
11 0.886863 1.21347 1.11314 0.760926 search
13 0.84 1.2944 1.16 0.6544 search
15 0.773726 1.40135 1.22627 0.496252 search
17 0.68 1.5376 1.32 0.2576 search
19 0.547452 1.7003 1.45255 -0.109897 search
Search for a zero in the interval [0.547452, 1.45255]:
Func-count x f(x) Procedure
19 1.45255 -0.109897 initial
20 1.3976 0.0467142 interpolation
21 1.41399 0.000631974 interpolation
22 1.41421 -9.95025e-08 interpolation
23 1.41421 7.86149e-12 interpolation
24 1.41421 -4.44089e-16 interpolation
25 1.41421 -4.44089e-16 interpolation
Zero found in the interval [0.547452, 1.45255]
2.4 Linear Algebra
Let's say that three clowns at the circus are dangling from the ceiling. They are held up by ropes with unstretched lengths of 20 m, 20 m, and 20 m (top, middle, and bottom). Each rope is fully extended but unstretched. After they are released, the clowns will eventually come to their equilibrium positions. We want to compute the displacement of each of the clowns. We will assume that each rope behaves as a linear spring that follows Hooke's law.
When fully extended, they look like this:
We can assign each clown a free-body diagram as such:
We can then use Newton's second law to balance the forces for each jumper:
m_1 d²x_1/dt² = m_1 g + k_2(x_2 − x_1) − k_1 x_1
m_2 d²x_2/dt² = m_2 g + k_3(x_3 − x_2) + k_2(x_1 − x_2)
m_3 d²x_3/dt² = m_3 g + k_3(x_2 − x_3)
where m_i is the mass of jumper i in kilograms, t is time in seconds, k_j is the spring constant for rope j in Newtons per meter, x_i is the displacement of jumper i measured downward from the equilibrium position in meters, and g is gravitational acceleration in meters per second squared. We want the steady-state solution, so we will set the second derivatives to zero. Then, we have a system of linear equations:
(k_1 + k_2)x_1 − k_2 x_2 = m_1 g
−k_2 x_1 + (k_2 + k_3)x_2 − k_3 x_3 = m_2 g
−k_3 x_2 + k_3 x_3 = m_3 g
We will now solve this problem in MATLAB using these givens:
Jumper | Mass (kg) | Spring Constant (N/m) | Unstretched Cord Length (m) |
Top (1) | 60 | 50 | 20 |
Middle (2) | 70 | 100 | 20 |
Bottom (3) | 80 | 50 | 20 |
We can now substitute these into our system:
150x_1 − 100x_2 = 588.6
−100x_1 + 150x_2 − 50x_3 = 686.7
−50x_2 + 50x_3 = 784.8
We will now enter the coefficient matrix and the right-hand-side vector:
K = [150, -100, 0; -100, 150, -50; 0, -50, 50]
150 -100 0
-100 150 -50
0 -50 50
mg = [588.6; 686.7; 784.8]
588.6000
686.7000
784.8000
Let's solve:
x = K\mg
Equivalently, this solves Kx = mg, giving displacements x_1 = 41.202 m, x_2 = 55.917 m, and x_3 = 71.613 m.
The unstretched ropes between each clown are 20 m, so their initial positions below the ceiling are 20 m, 40 m, and 60 m. Lastly, their final positions can be calculated as the initial positions plus the displacements: 61.202 m, 95.917 m, and 131.613 m.
These results make sense. The first rope stretches the longest because it has the lowest spring constant and is carrying the weight of all three clowns. While the second rope is being pulled on by more weight force than the third, it has a higher spring constant and therefore stretches less. Thus, the second and third ropes stretch about the same.
Recall that Gaussian elimination is when you combine equations to eliminate unknowns. The general steps are as follows:
- Combine the coefficient matrix and the right-hand-side vector into an augmented matrix.
- Forward elimination
- Back substitution
We will perform Gaussian elimination on the following augmented matrix as an example:
[ 1  3  4 | 1 ]
[ 2  1  3 | 2 ]
[ 4  7  2 | 3 ]
Multiply row 1 by 2 and subtract from row 2 to eliminate the first entry in row 2:
[ 1  3  4 | 1 ]
[ 0 -5 -5 | 0 ]
[ 4  7  2 | 3 ]
Then multiply row 1 by 4 and subtract from row 3:
[ 1  3  4  |  1 ]
[ 0 -5 -5  |  0 ]
[ 0 -5 -14 | -1 ]
Then subtract row 2 from row 3 to eliminate the second entry in the third row:
[ 1  3  4 |  1 ]
[ 0 -5 -5 |  0 ]
[ 0  0 -9 | -1 ]
The matrix is now in upper triangular form. We can rewrite it as a system:
x_1 + 3x_2 + 4x_3 = 1
−5x_2 − 5x_3 = 0
−9x_3 = −1
Back substitution yields x_3 = 1/9 ≈ 0.1111, x_2 = −1/9 ≈ −0.1111, and x_1 = 8/9 ≈ 0.8889.
Without the likes of pivoting, we can use this program for what is known as naive Gauss elimination:
function x = GaussNaive(A, b)
% GaussNaive: naive Gaussian elimination (no pivoting)
% A = coefficient matrix, b = right-hand side vector
[m, n] = size(A);
if m ~= n, error('Matrix A must be square!'); end
nb = n + 1; % Total number of columns in augmented matrix
Aug = [A, b(:)]; % Build the augmented matrix
for k = 1:n-1 % Forward elimination
    for i = k+1:n
        factor = Aug(i, k) / Aug(k, k);
        Aug(i, k:nb) = Aug(i, k:nb) - factor * Aug(k, k:nb);
    end
end
x = zeros(n, 1); % Back substitution
x(n) = Aug(n, nb) / Aug(n, n);
for i = n-1:-1:1
    x(i) = (Aug(i, nb) - Aug(i, i+1:n) * x(i+1:n)) / Aug(i, i);
end
end
A = [1 3 4; 2 1 3; 4 7 2];
Naive Gaussian elimination fails for some matrices because, to normalize a row, you may have to divide by zero. You may also have to divide by a pivot element that is very small, which can result in roundoff error. Thus, before each row is normalized, you should find the coefficient with the largest magnitude in the column at or below the pivot and make that the pivot element. You can then switch the rows so that the largest element is now the pivot element. This is known as partial pivoting. If you search both columns and rows and then switch, it would be known as complete pivoting. Complete pivoting is rarely used because it adds a lot of computational complexity for little benefit. Most of the positives come from partial pivoting alone. Thus, to avoid dividing by zero and to minimize roundoff error, partial pivoting is the best strategy.
In MATLAB, you can easily perform Gaussian elimination with pivoting using the backslash (\) operator:
This is an incredibly complex operator that analyzes the structure of the coefficient matrix and then applies the proper technique for the system.
One interesting quirk of MATLAB is in the operation A\A, which solves the matrix equation AX = A. For a square, nonsingular A, the unknown matrix X is equivalent to the identity matrix I. For an underdetermined system, where A has more columns than rows, A\A is not necessarily I. For an overdetermined system, where A has more rows than columns and full column rank, A\A yields I. There are some special cases for A\A in MATLAB. In the case of a square and nonsingular* matrix,
A\A = I
* Nonsingular simply means full rank; for a square matrix, that means the rank equals its dimension n.
If A contains zeros on the diagonal or is the zero matrix,
A\A = NaN
If A has rank less than n,
A\A ~= I
A rank-deficient matrix cannot produce an identity through the backslash operator; after all, rank-deficient matrices aren't invertible.
We can test A\A on magic squares to see how rank affects the solution. Odd-sized magic squares are full rank and give results that are very close to the identity matrix:
nmax = 25; n_odd = 3:2:nmax; % odd magic-square sizes (nmax chosen here as 25)
r = zeros(length(n_odd),1);
e = zeros(length(n_odd),1);
for idx = 1:length(n_odd)
    A = magic(n_odd(idx));
    r(idx) = rank(A); e(idx) = norm(A\A - eye(n_odd(idx)));
end
subplot(2,1,1), bar(n_odd, r, 'b')
xlabel('n (Odd Matrix Size)')
title('Rank vs Odd Matrix Size')
set(gca, 'Position', [0.13, 0.6, 0.775, 0.3])
subplot(2,1,2), bar(n_odd, e, 'r')
xlabel('n (Odd Matrix Size)')
title('Error (norm(A\\A - I))')
set(gca, 'Position', [0.13, 0.15, 0.775, 0.3])
Singly even magic squares (n divisible by 2 but not 4) are rank-deficient, so their A\A deviates from the identity matrix. Notice the computational error:
n_singly_even = 2:2:nmax;
n_singly_even = n_singly_even(mod(n_singly_even,4) ~= 0);
r = zeros(length(n_singly_even),1);
e = zeros(length(n_singly_even),1);
for idx = 1:length(n_singly_even)
    A = magic(n_singly_even(idx));
    r(idx) = rank(A); e(idx) = norm(A\A - eye(n_singly_even(idx)));
end
subplot(2,1,1), bar(n_singly_even, r, 'b')
xlabel('n (Singly Even Sizes)')
title('Rank vs Singly Even Matrix Size')
set(gca, 'Position', [0.13, 0.6, 0.775, 0.3])
subplot(2,1,2), bar(n_singly_even, e, 'r')
xlabel('n (Singly Even Sizes)')
title('Error (norm(A\\A - I))')
set(gca, 'Position', [0.13, 0.15, 0.775, 0.3])
Doubly even magic squares (n divisible by 4) are even more rank-deficient. I was unable to reproduce the code from Moler here, so I am just using his image. The error plots are an absolute mess:
Going back to A\A, there's a function called pinv that returns the Moore-Penrose pseudoinverse, which is used when a system doesn't have a unique solution. That is, it is used for underdetermined or overdetermined systems. Just like the backslash operator, it can be used in numerical linear algebra, but it is specifically useful when a system has many solutions and you want to find the one with the smallest norm. It doesn't magically fix rank-deficiency. Instead, it avoids the NaN results the backslash operator can produce by returning the minimum-norm least squares solution for rank-deficient matrices:
A = [1 2; 2 4]; % Rank-deficient
b = [3; 6]; % Lies in the column space of A so solutions exist
disp('Solution using backslash:')
Solution using backslash:
disp('Solution using pseudoinverse:')
Solution using pseudoinverse:
You should only use it when you want a least squares solution or if you want to find the solution vector with the smallest length.
If that was a little abstract, don't worry about it.
Let's look at an example where we use Gaussian elimination to solve the following system:
0.0003x_1 + 3.0000x_2 = 2.0001
1.0000x_1 + 1.0000x_2 = 1.0000
The first pivot element a_11 = 0.0003 is very close to zero. We will divide equation (1) by 0.0003 to get
x_1 + 10000x_2 = 6667,
which can then be used to eliminate x_1 from equation (2). It now is
−9999x_2 = −6666,
which can be solved to get x_2 = 2/3. We can now substitute this into equation (1) to get
x_1 = (2.0001 − 3(2/3)) / 0.0003 = 1/3.
We will now table the results:
Significant Figures | x_2 | x_1 | Percent Relative Error of x_1 |
3 | 0.667 | −3.33 | 1099 |
4 | 0.6667 | 0.0000 | 100 |
5 | 0.66667 | 0.30000 | 10 |
6 | 0.666667 | 0.330000 | 1 |
7 | 0.6666667 | 0.3330000 | 0.1 |
Subtractive cancellation makes the result very dependent on the number of significant figures. Instead, we can solve the equations in reverse order to allow the row with the larger pivot element to be normalized:
1.0000x_1 + 1.0000x_2 = 1.0000
0.0003x_1 + 3.0000x_2 = 2.0001
Elimination and substitution still yields x_2 = 2/3. We can now get x_1 = 1 − 2/3 = 1/3. We will now table the pivoted results:
Significant Figures | x_2 | x_1 | Percent Relative Error of x_1 |
3 | 0.667 | 0.333 | 0.1 |
4 | 0.6667 | 0.3333 | 0.01 |
5 | 0.66667 | 0.33333 | 0.001 |
6 | 0.666667 | 0.333333 | 0.0001 |
7 | 0.6666667 | 0.3333333 | 0.00001 |
The pivot strategy has much less error. In MATLAB, we can first solve this using naive Gaussian elimination:
eq1 = [0.0003 3.0000; 1.0000 1.0000]; % coefficient matrix
eq2 = [2.0001; 1.0000]; % right-hand-side vector
x = GaussNaive(eq1, eq2)
0.333333333334037
0.666666666666667
We can instead use pivoting to yield a far more accurate result:
x = eq1\eq2
0.333333333333333
0.666666666666667
Gaussian elimination is essentially a way to get a matrix into a form you know as row echelon form (REF) where
- All nonzero rows are at the bottom.
- Each leading entry (pivot) of a row is in a column to the right of the leading entry of the row above it.
- All entries in a column below a leading entry are zeros.
A matrix is in reduced row echelon form (RREF) if
- It is in row echelon form.
- The leading entry in each nonzero row is 1.
- Each leading 1 is the only nonzero entry in its column.
Both REF and RREF make solving basic linear systems much easier. In MATLAB, we have the function rref. We can use it to solve this system:
A = [1 3 4; 2 1 3; 4 7 2];
b = [1; 2; 3]; % right-hand side from the earlier example
Ab = [A, b]; % augmented matrix
[rref_Ab, pivot] = rref(Ab)
1.000000000000000 0 0 0.888888888888889
0 1.000000000000000 0 -0.111111111111111
0 0 1.000000000000000 0.111111111111111
Thus, the first three columns are fully reduced and the fourth column gives us our solution x to Ax = b.
RREF is generally not used in scientific or engineering applications because it's not computationally efficient for very large systems.
By now, hopefully you are beginning to realize that linear algebra is generally easy in MATLAB. The hardest part of it all is correctly modeling your scientific and engineering problems as systems of linear equations.
LU (Lower Upper) factorization involves decomposing a coefficient matrix into two triangular matrices L and U. LU decomposition is very similar to Gaussian elimination; the only difference is that you do not work with the augmented matrix. It can let you simplify the solution to a system of linear equations. Solving Ax = b reduces to solving Ly = b and then Ux = y, because then LUx = Ax = b. In Gaussian elimination, you transform a matrix A to upper triangular form U in n − 1 stages. For each stage k during Gaussian elimination, multiples of row k are added to the later rows to eliminate the elements below the diagonal in column k:
a_ij ← a_ij − l_ik a_kj,
where l_ik = a_ik / a_kk is the multiplier used to eliminate below the pivot. These multipliers l_ik become the subdiagonal entries in L while the final upper triangular matrix becomes U.
We can decompose our matrix A using lu:
[L,U] = lu(A) % This calculates the multipliers, updates rows, and builds LU
0.250000000000000 -0.500000000000000 1.000000000000000
0.500000000000000 1.000000000000000 0
1.000000000000000 0 0
4.000000000000000 7.000000000000000 2.000000000000000
0 -2.500000000000000 2.000000000000000
0 0 4.500000000000000
Then we can solve it as follows:
% Now for forward and back substitution
y = L\b
3.000000000000000
0.500000000000000
0.500000000000000
x = U\y
0.888888888888889
-0.111111111111111
0.111111111111111
Just as with Gaussian elimination, partial pivoting is important in LU factorization. Use the following steps (a short MATLAB sketch follows this list):
- The LU factorization with pivoting of a matrix A is PA = LU. The upper triangular matrix U is generated by elimination with partial pivoting while the multipliers are stored in L, and the permutation matrix P keeps track of the row switches.
- This step is forward substitution. We need to adjust b to match the row swaps. The matrices L and P are used to perform elimination on b and generate the right-hand-side vector d. The system now looks like Ld = Pb.
- Using back substitution, we now solve Ux = d for the solution x.
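In MATLAB, lu can return the permutation matrix directly, so the three steps map onto three short lines. This sketch (variable names are mine) reuses the matrix A factored above together with the right-hand side b = [1; 2; 3] implied by the earlier forward- and back-substitution results:
A = [1 3 4; 2 1 3; 4 7 2];  b = [1; 2; 3];
[L, U, P] = lu(A);   % PA = LU with partial pivoting
d = L \ (P*b);       % forward substitution on the permuted right-hand side
x = U \ d            % back substitution recovers the same solution as before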
Let's look at an example of LU factorization with pivoting. We will compute the LU factorization for the same system, in the form Ax = b, that we did Gaussian elimination with:
A = [0.0003 3.0000; 1.0000 1.0000], b = [2.0001; 1.0000]
We will set up the permutation matrix:
P = [1 0; 0 1]
Pivoting is necessary, so we will switch the rows of both:
A = [1.0000 1.0000; 0.0003 3.0000], P = [0 1; 1 0]
We can now perform elimination. We will eliminate a_21 = 0.0003 by subtracting the factor l_21 = 0.0003/1 = 0.0003 times the first row from the second row of A. As a result, we can find the new value of a_22 = 3 − 0.0003(1) = 2.9997:
L = [1 0; 0.0003 1], U = [1 1; 0 2.9997]
We will now reorder b according to P. This is a permutation:
Pb = [1.0000; 2.0001]
Now for forward substitution:
Ld = Pb
Solving yields d_1 = 1 and d_2 = 2.0001 − 0.0003(1) = 1.9998. The system is currently
Ux = d, that is, [1 1; 0 2.9997][x_1; x_2] = [1; 1.9998]
And lastly, applying back substitution gets us x_2 = 1.9998/2.9997 ≈ 0.66667 and x_1 = 1 − x_2 ≈ 0.33333.
When solving linear systems, LU factorization isn't always the best choice. As a matter of fact, QR factorization is often more numerically stable. QR factorization is where you break a matrix A into two parts. An m×n matrix A with linearly independent columns can be factored as A = QR, where Q is an m×n matrix with orthonormal columns (QᵀQ = I) and R is an n×n upper triangular matrix with nonzero diagonal elements. For example, we can factor the 4×3 matrix below into the form A = QR.
Coding this up, we get
A = [-1, -1, 1; 1, 3, 3; -1, -1, 5; 1, 3, 7];
[Q, R] = qr(A); % Full QR factorization via Householder reflections
disp(Q);
-0.500000000000000 -0.500000000000000 0.500000000000000 0.500000000000000
0.500000000000000 -0.500000000000000 0.500000000000000 -0.500000000000000
-0.500000000000000 -0.500000000000000 -0.500000000000000 -0.500000000000000
0.500000000000000 -0.500000000000000 -0.500000000000000 0.500000000000000
disp(R);
2.000000000000000 4.000000000000000 2.000000000000000
0 -2.000000000000000 -8.000000000000000
0 0 -4.000000000000000
0 0 0
Notice that MATLAB flips the signs of some of the elements in R. This is because MATLAB prioritizes numerical stability and computational speed. In math textbooks, it's common for the diagonal elements of R to be positive, which makes the factorization easy to recognize. There are a few main algorithms for QR factorization: classical Gram-Schmidt, which is the one you likely know from your linear algebra class, modified Gram-Schmidt, and Householder. Both Gram-Schmidt variants are poorly suited to scientific computation because they lose orthogonality to rounding error (the classical version much more so than the modified one). The Householder algorithm is the most common in scientific computing and is what MATLAB's qr function uses. Below is a table of each algorithm's complexity in floating-point operations (flops):
Algorithm | Complexity (FLOPS) |
Gram-Schmidt | |
Modified Gram-Schmidt | |
Householder | |
The whole idea of the Householder algorithm is to reflect an entire column and its corresponding rows across a hyperplane. A hyperplane is the generalization of a flat plane to any dimension: a subspace whose dimension is one less than that of the surrounding space. One example would be a line in 2D or a plane in 3D. Hyperplanes specifically are used because reflection across one is the easiest way to zero out multiple components of a vector in one shot while preserving lengths and orthogonality, since a reflection is an orthogonal linear map.
The purpose of this is to zero out all of the entries below the diagonal of a column:
where Q-factor
is orthogonal and constructed as a product of orthogonal matrices
. Q represents the first n columns of the full matrix and
denotes the remaining
columns, turning Q into a full square orthogonal matrix. It is also known as the residual. Each H term is an m-by-m orthogonal reflector given by H = I - 2vv', where v is a unit vector. H is orthogonal and symmetric. The product Hx is a reflection of x across a hyperplane perpendicular to v. The way the algorithm works is that for a column k from row k down to the bottom, it builds a vector v_k
that defines a Householder reflector. This reflector is then applied to the rest of the matrix, zeroing out all entries below the diagonal in that column. After that, the matrix is updated. In other words, for the first column of R, you start with column 1, move it towards a multiple of the first basis vector, normalize the result to get
, and then apply the reflection to the matrix. This process repeats for each column until the matrix becomes upper triangular. For example, given the previous matrix
We compute reflectors
that triangularize A:
The first column of R is found by first computing a reflector that maps the first column of A to the basis vector
. We have the first column:
where w is an intermediate vector used to construct the Householder reflector.
Then, we reflect by overwriting A with the product of
and A to get
.
The next two updates for the second and third columns of R are
.
MATLAB does not store the Q-matrix during this process. It only holds the vectors
to save storage.
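If you are curious what this looks like in code, here is a minimal, simplified sketch of Householder QR applied to the same A as above; in practice you should just call MATLAB's qr, which is far more robust:
A = [-1, -1, 1; 1, 3, 3; -1, -1, 5; 1, 3, 7];
[m, n] = size(A);
R = A;
Q = eye(m);
for k = 1:n
    x = R(k:m, k);
    v = x;
    s = 1; if x(1) < 0, s = -1; end
    v(1) = v(1) + s*norm(x); % Send column k to a multiple of the first basis vector
    v = v / norm(v);         % Unit vector defining the reflector H = I - 2vv'
    R(k:m, k:n) = R(k:m, k:n) - 2*v*(v'*R(k:m, k:n)); % Apply the reflector to the rest of R
    Q(:, k:m) = Q(:, k:m) - 2*(Q(:, k:m)*v)*v';       % Accumulate Q as a product of reflectors
end
% Q*R reproduces A up to roundoff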
Before we move on, here are some useful tables. If you want to work with these more thoroughly, check out MATLAB Introductory Guide which is linked in Introduction.
Operator/Function | Purpose |
3*x | Multiply each element of x by 3 |
x+2 | Add 2 to each element of x |
x+y | Element-wise addition of vectors x and y |
A*y | Product of a matrix A and a vector y |
A*B | Product of a matrix A and a matrix B |
A.*B | Element-wise product of two matrices A and B |
A^B | Square matrix A raised to the power B |
A.^B | Each element of A raised to the corresponding power in B |
A' | Transpose of A |
inv(A) | Inverse of A |
det(A) | Determinant of A |
eig(A) | Eigenvalues of A |
[V,D] = eig(A) | Returns a diagonal matrix D of eigenvalues and a matrix V whose columns are the corresponding eigenvectors |
size(A) | Size of A |
Traversal | Gets |
x(2:10) | The 2nd to the 10th elements of x |
x(2:end) | The 2nd to the last element of x |
x(1:3:end) | Every third element of x from the 1st to the last |
x(end) | The last element of x |
A(5,:) | The 5th row of A |
A(:,5) | The 5th column of A |
A(5, 1:3) | The 1st to 3rd elements in the 5th row |
A(end,:) | The last row of A |
A(:,end) | The last column of A |
Up until now, we've focused on very direct methods that stem from the strategies you may have learned in your first linear algebra class. However, when in the domain of numerical linear algebra, iterative methods tend to reign supreme. The most commonly used iterative method is known as the Gauss-Seidel method.
We have a 3x3 linear system
that looks like
Assuming the diagonal elements are all nonzero, we have the following at each iteration:
where j and
are the current and previous iterations. Similarly to the process for root-finding, initial guesses must be made for each x. To make things simple, we can assume that they are all zero. We can then substitute those zeros into our equation for
to calculate a new value for
. We then substitute
and
into
to yield a new value for
. After repeating that for
, this loop is done. We would then repeat the process until we converge to a solution that is close enough to the true value. The criterion is that for each
,
Let's work through an example where we use the Gauss-Seidel method to solve the following system:
The true solution to this system is
,
, and
. First, let's solve each equation for its unknown on the diagonal:
Assuming that
and
are equal to zero, we can now find that
.
Using
and
, we can find
.
Finally,
.
We can repeat for the second iteration to get the following:
We can now find the error estimates:
Recall that our equation for error estimate outputs a conservative standard for convergence. Nevertheless, additional iterations could be computed to improve our answers:
A = [3 -0.1 -0.2; 0.1 7 -0.3; 0.3 -0.2 10];
b = [7.85; -19.3; 71.4]; % Right-hand side (A times the true solution [3; -2.5; 7])
x = zeros(3,1);          % Initial guesses of zero
% tol = 1.0e-6; set this to whatever suits your needs
for k = 1:5
    x_old = x;
    for i = 1:3
        sum1 = A(i,1:i-1) * x(1:i-1);         % Uses values already updated this iteration
        sum2 = A(i,i+1:end) * x_old(i+1:end); % Uses values from the previous iteration
        x(i) = (b(i) - sum1 - sum2) / A(i,i);
    end
    fprintf('After iteration %d: (x_1, x_2, x_3) = \n', k)
    disp(x')
    % if you wanted to add some sort of convergence check...
    % if norm(x - x_old, inf) < tol
    %     fprintf('Converged after %d iterations.\n', k);
    %     break
    % end
end
After iteration 1: (x_1, x_2, x_3) =
2.616666666666667 -2.794523809523810 7.005609523809524
After iteration 2: (x_1, x_2, x_3) =
2.990556507936508 -2.499624684807256 7.000290811065760
After iteration 3: (x_1, x_2, x_3) =
3.000031897910809 -2.499987992353051 6.999999283215615
After iteration 4: (x_1, x_2, x_3) =
3.000000352469273 -2.500000035754606 6.999999988710830
After iteration 5: (x_1, x_2, x_3) =
2.999999998055568 -2.500000000456044 7.000000000049212
With that, let's work towards a general form of the Gauss-Seidel method. We have a general set of equations:
Assuming the diagonal elements are all nonzero, we have the following at each iteration:
In summation form,
,
where
.
There is another iterative technique called the Jacobi method that does not use updated values within the same iteration. For each iteration j, we have:
where
. We do not overwrite
with
because it requires all of the
values to complete the calculation. Thus, the Jacobi method requires more storage. It also converges slower than Gauss-Seidel.
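For comparison, here is a minimal Jacobi sketch on the same system as above; notice that every update uses only the previous iteration's values, which is why the whole vector x_old must be kept around:
A = [3 -0.1 -0.2; 0.1 7 -0.3; 0.3 -0.2 10];
b = [7.85; -19.3; 71.4];
x = zeros(3,1);
for k = 1:5
    x_old = x; % The entire previous iterate must be stored
    for i = 1:3
        x(i) = (b(i) - A(i,:)*x_old + A(i,i)*x_old(i)) / A(i,i);
    end
end
disp(x') % Converges to the same solution, just more slowly than Gauss-Seidel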
Note: Eigenvalues will be covered in a future project.
2.5 Systems of Nonlinear Equations
In Section 2.3, we worked through root-finding as a means of introducing nonlinear equations. Now that we have more practice with iterative methods, we will use them to solve nonlinear systems. Recall that nonlinear systems can have no solution, an infinite number of solutions, or a finite number of solutions. And to solve them, we cannot use Gaussian elimination or any other methods designed specifically for linear systems of equations. This section will focus on fixed-point iteration (which has nothing to do with fixed-point representation) and Newton's method.
In fixed-point iteration, the first step is to rewrite your equations to isolate variables. Let's say we have the following system:
This can be rewritten as such:
or 
You must pick a starting value for X. At each iteration n, you label the update as
or 
This can also be seen as
.
Then, you iterate through until
or whatever is deemed sufficient for your purposes. For an example, we will work with the following system:
Here is the solution out:
Our first step is to rewrite these equations. There are many ways to rewrite them. For now, we will settle for the following:
Equation (1)
Equation (2)
We will start iterating with
. Then,
Zooming in to Quadrant I of the system's graph, we can plot where our iterations land:
Notice how the steps are getting smaller each iteration. We can compute the distance between two points using a triangle:
where c is of course equivalent to 
We can now compute the distance between each iteration:
n | 1 | 2 | 3 | 4 | ... | 12 | 13 |
| | | | | ... | | |
distance  |
| | | | ... | | |
Unfortunately, this reasoning isn't technically correct because these aren't just Cartesian points. Mathematically, the iterates should be treated as vectors living in a Banach space. A Banach space is simply a vector space where you can measure the length or size of vectors and where limits are well behaved, meaning that it is a complete* normed vector space. So instead, we would have to compute the norms of the differences between successive vectors.
We can think of our iteration points as vectors:
Then we have the equation for the
-norm
Let's now compute the distances between the vectors correctly:
* Another condition for a Banach space X is as follows:
For a sequence of vectors
, suppose the series converges:
Then, the vector series
must converge in X. This is known as completeness. Every finite-dimensional normed vector space like
or
is guaranteed to be complete and is thus a Banach space.
The condition for stopping the iteration looks something like
or we reach some
. If we wanted to find the other roots, we would have to rewrite the original system in a different way. For instance, rewriting the system as
and
would result in convergence at
!
However, rewriting the system as
and
would result in divergence. When our norms are getting smaller like this
or 
We're seeing that the distance between the steps is shrinking. This is known as a Cauchy sequence. By the definition of a Banach space, any Cauchy sequence in the space must converge to a point within the space. This is what makes fixed-point iteration work.
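As a concrete sketch of the loop itself, here is one way to code fixed-point iteration for a 2x2 system. The update functions g1 and g2 below are placeholders standing in for whatever rearranged equations your own system produces:
g1 = @(x, y) sqrt(10 - x*y);        % Placeholder rearrangement for x
g2 = @(x, y) sqrt((57 - y)/(3*x));  % Placeholder rearrangement for y
x = 1.5; y = 3.5;                   % Initial guesses
tol = 1e-8;
for n = 1:100
    x_new = g1(x, y);
    y_new = g2(x_new, y);           % Use the freshly updated x
    step = norm([x_new - x; y_new - y], inf);
    x = x_new; y = y_new;
    if step < tol, break, end
end
fprintf('Converged to (%.6f, %.6f) after %d iterations\n', x, y, n)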
When fixed-point iteration fails or converges too slowly, we can use Newton's method for systems of nonlinear equations. Let's begin with an example using the same system as before. We will denote it as
Let's take the point
and evaluate it:
We can think of F as a transformation that acts on an input to yield an output. We want to find the inputs that make F yield an output of
. How do we approach these values? Well, first, let's take a look at how changing our inputs can change our output:
These are simply moving our inputs to the right in the x-direction and upwards in the y-direction respectively. Now, the way we can generalize this for the entire function is using a Jacobian. We have to take the partial derivatives of every function in F for every variable in F:
In this case, our Jacobian is
We can try plugging values in
Let's try multiplying our Jacobian by the small changes in input:
The following is true:
Note that the larger the change in input, the less accurate the Jacobian is.
We can use the following equation for Newton's method:
Now, let's get a system we can solve:
This is now a linear system, so we can use our previous methods.
We have
. This is telling us what change in the input results in a change to the output. Let's try using this to change our input:
The instance where we subtract
is much more useful. From this we can get the general equation for updates:
We can use this to iterate:
Now, let's see what the transformation yields:
That's not quite close enough. Let's now acquire a new value through the Jacobian:
And let's iterate through again with it:
Let's see what the transformation yields:
We stop the process once
or
or of course if the max number of iterations occurs. At this point, we are pretty close to our root:
Let's look at another example using the same system but with a different initial guess and a set ε:
Here is a table of the results:
We can stop at around
iterations. To do this in MATLAB, you can of course use the backslash operator as I did above:
Then we can get
. You may also see it in the form of
using the inverse Jacobian. Then, using the fact that
where I is an identity matrix, we can say that
. Finally, we have a new form of Newton's method:
This would need to be solved symbolically and is generally not the best method for working in MATLAB.
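Here is a minimal Newton sketch for a 2x2 system; the function F and its Jacobian J below are placeholders, so substitute your own system:
F = @(v) [v(1)^2 + v(1)*v(2) - 10;      % Placeholder system F(v) = 0
          v(2) + 3*v(1)*v(2)^2 - 57];
J = @(v) [2*v(1) + v(2),  v(1);         % Its Jacobian, computed by hand
          3*v(2)^2,       1 + 6*v(1)*v(2)];
v = [1.5; 3.5];                         % Initial guess
for k = 1:20
    dv = -J(v) \ F(v);                  % Solve J*dv = -F with the backslash operator
    v = v + dv;
    if norm(dv, inf) < 1e-10, break, end
end
fprintf('Root near (%.6f, %.6f) after %d iterations\n', v(1), v(2), k)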
If needed for specific applications, you can find the Jacobian using the Symbolic Math Toolbox if you have it:
J = jacobian(F, [x, y])
J =

2.6 Curve Fitting and Polynomials
Polynomials are thrown around in many textbooks, but they are far from an afterthought. They are one of the oldest and most fundamental objects in mathematics. A polynomial is simply a mathematical expression consisting of variables and coefficients where the only operations are addition, subtraction, multiplication, and exponentiation with variables raised to nonnegative integer powers. A general nth-degree polynomial looks like this:
where x is the variable, a terms are coefficients, and n is the degree of the polynomial.
Polynomials are present in advanced mathematics, particularly algebraic geometry, where they are used to construct objects in the realm of abstract algebra, and their solutions define complex geometric structures.
For our purposes, polynomials are used for the likes of root-finding, interpolation, and curve fitting. MATLAB represents polynomials as row vectors where each entry is a coefficient listed from the highest degree term to the constant
. For instance,
is represented in MATLAB as a row vector of its coefficients. There are several functions you can use to handle polynomials, such as:
- polyval to evaluate polynomials
- roots to find polynomial roots. Let's say we want to find roots for the equation
:
- poly which can build polynomials from known roots. Let's say we have the roots 2, -1, and 4. We can build a polynomial from this:
r = [2 -1 4]; % Roots of (x-2)(x+1)(x-4)
p = poly(r) % Expanding this out gives us x^3 - 5x^2 + 2x + 8, which matches the coefficients
- We can also multiply and perform polynomial long division with conv and deconv, as shown in the sketch below.
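Here is a quick sketch tying these functions together for the polynomial x^3 - 5x^2 + 2x + 8 from above:
p = [1 -5 2 8];                 % Coefficients of x^3 - 5x^2 + 2x + 8
polyval(p, 3)                   % Evaluate at x = 3 (returns -4)
roots(p)                        % Returns the roots 4, 2, and -1
q = conv([1 -2], [1 1])         % Multiply (x-2)(x+1) to get x^2 - x - 2
[quotient, remainder] = deconv(p, [1 -2]) % Divide by (x-2): quotient x^2 - 3x - 4, remainder 0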
You already know what the mean, median, and mode are. Measures of spread are slightly more complex. The range of a data set is easy to evaluate, but simple measures like it are generally not reliable on their own. The most common measure of spread is the standard deviation s, which describes how spread out data are around the mean:
where
is the sum of squares, the sum of the squared differences between data points and the mean. Thus the formula is
For a set of data with very spread out individual measurements,
and thus s will be large. If they are grouped tightly,
and thus s will be small. There is also a term known as variance which is a measure of just how spread out the data is:

The equations for standard deviation and variance both involve a quantity called the degrees of freedom, which here is n - 1. This is another term that comes up a lot, and you have to adjust your definition of it slightly depending on context, whether it's chemistry, robotics, or statistics. In statistics, it refers to the number of independent data points available to estimate a parameter, that is, the number of values free to vary once constraints such as the mean are fixed. Dividing by the degrees of freedom helps make sure that our calculations of variance and standard deviation do not underestimate the variability of a data set. In MATLAB, we have a variety of functions to help with basic statistics such as mean, median, mode, std (standard deviation), var (variance), and rms (root mean square value).
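Here is a quick sketch of those functions on a small made-up data set (the values are arbitrary):
data = [2.3 4.1 3.9 5.6 4.4 3.2]; % Arbitrary sample data
mean(data)
median(data)
mode(data)
std(data)  % Sample standard deviation (divides by n - 1)
var(data)  % Sample variance
rms(data)  % Root mean square value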
There are some numerical methods like Monte Carlo simulations or stochastic modeling that rely on random numbers that follow specific probability distributions. MATLAB makes it easy to generate random numbers for simulations or testing through
rand which generates numbers with a uniform distribution,
rand(1,5)
0.8143 0.2435 0.9293 0.3500 0.1966
randn which generates numbers that have a normal distribution,
randn(1,5)
-1.4224 0.4882 -0.1774 -0.1961 1.4193
and randperm which returns a row vector containing a random permutation of integers
p = randperm(6) % Generates a random permutation of unique integers from 1 to 6
p = randperm(8,4) % Generates a random permutation of 4 unique integers from 1 to 8
MATLAB also allows you to control the random number stream of these functions through rng. The general syntax for this is rng(seed). This is a complex function, but it lets you specify what you want to initialize the random number generator with. You can also control the algorithm through rng(seed,generator). The main purpose of this is to let you introduce some reproducibility to your random number testing.
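For example, seeding the generator makes a "random" experiment repeatable (the seed value 42 below is an arbitrary choice):
rng(42)          % Seed the generator
a = rand(1,3);
rng(42)          % Reset to the same seed
b = rand(1,3);
isequal(a, b)    % Returns logical 1: the two streams are identical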
Curve fitting is the process of finding a function that best matches a series of data points. We don't often have the exact equation that governs a system, especially when working with experimental data. Thus, you have to analyze the situation and determine parameters. Curve fitting involves regression, which is all about the statistics used to model the relationship between a dependent variable and one or more independent variables. The whole purpose of regression is to find a function that minimizes error between the predicted and actual values. Linear least squares is one of the most simple methods, where you can fit a line to data. It's called "least squares" because it is all about minimizing the sum of squared differences between the points predicted by your model and their true values. Given a set of data points
, we want to find the line

where m is slope, b is the intercept, and e is error. This is also known as the residual between the model and the observations, which we can solve for specifically by rearranging the equation as
. Thus, the residual is the difference between the true value y and the approximate value
. One way to find the most accurate line is to minimize the sum of the residual errors. That is,
where n is the total number of points.
To determine the values of b and m, we have to differentiate with respect to each unknown coefficient:
We can set these derivatives equal to zero to find the minimum
:
Noticing that
, we can rewrite the equations as normal equations:
We can solve these to acquire
We can combine everything to then acquire
This uses the means of y and x.
Let's fit a straight line to the following data:
i | | | | |
1 | 10 | 25 | 100 | 250 |
2 | 20 | 70 | 400 | 1400 |
3 | 30 | 380 | 900 | 11400 |
4 | 40 | 550 | 1600 | 22000 |
5 | 50 | 610 | 2500 | 30500 |
6 | 60 | 1220 | 3600 | 73200 |
7 | 70 | 830 | 4900 | 58100 |
8 | 80 | 1450 | 6400 | 116000 |
∑ | 360 | 5135 | 20400 | 312850 |
First, let's compute the means:
and 
Now, let's calculate the slope and intercept:
We now have the equation
The data in the table can be interpreted as the relationship between force F and velocity v, so we now have
. We can now visualize this through a force-velocity graph: v = [10 20 30 40 50 60 70 80];
F = [25 70 380 550 610 1220 830 1450];
p = polyfit(v, F, 1); % First-order (straight line) least squares fit
v_fit = linspace(min(v), max(v), 100);
F_fit = polyval(p, v_fit);
plot(v, F, 'o', v_fit, F_fit, '-')
title('Force vs. Velocity')
legend({'Data Points', 'Least Squares Fit'}, 'Location', 'northwest', 'FontSize', 8)
While the line fit the data well, there is still error. We can calculate the standard deviation as such:
The standard deviation for the regression line
, also known as the standard error of the estimate, can be calculated as follows:
We know that our estimate is reasonable because
. To measure how accurate our model is, we can find the coefficient of determination
as follows:
This essentially means that
of the uncertainty has been explained by the model. Keep in mind that a
value close to
is not the be-all and end-all by any means. While it suggests a good fit, it still only measures how well the model handles variability. It does not prove that the relationship is truly linear. It's simply a handy way to measure correlation.
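As a quick numeric check, these quantities can be computed directly from a polyfit fit of the same data; the results may differ slightly from the hand calculations above in the last digits:
v = [10 20 30 40 50 60 70 80];
F = [25 70 380 550 610 1220 830 1450];
p = polyfit(v, F, 1);          % Linear least squares fit
F_pred = polyval(p, v);
n = length(v);
Sr = sum((F - F_pred).^2);     % Sum of squared residuals
St = sum((F - mean(F)).^2);    % Total sum of squares about the mean
s_yx = sqrt(Sr / (n - 2))      % Standard error of the estimate
r2 = 1 - Sr/St                 % Coefficient of determination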
You of course know that most of the real world cannot be modeled by a straight line. Common nonlinear models such as the exponential model
or the power law relationship
can be linearized which allows you to analyze them using linear regression. We can easily linearize them by manipulating them algebraically:
We can then analyze the same numbers as last time but using linear regression:
x = [10 20 30 40 50 60 70 80];
y = [25 70 380 550 610 1220 830 1450];
% Log-transform the data for linearization
ln_x = log(x);
ln_y = log(y);
% Linear least squares fit on the log-transformed data
p = polyfit(ln_x, ln_y, 1);
b = p(1);      % Slope of the line = exponent of the power law
ln_a = p(2);   % Intercept of the line = ln(a)
a = exp(ln_a);
% Display linearized form
fprintf('Least squares line: ln(y) = %.4f + %.4f * ln(x)\n', ln_a, b);
Least squares line: ln(y) = -1.2941 + 1.9842 * ln(x)
fprintf('Power law form: y = %.4f * x^%.4f\n', a, b);
Power law form: y = 0.2741 * x^1.9842
% Plot original data and fitted curve
x_fit = linspace(min(x), max(x), 100);
y_fit = a * x_fit.^b; % Evaluate the fitted power law
plot(x, y, 'o', x_fit, y_fit, '-')
legend('Data Points', 'Power Law Fit', 'Location', 'northwest')
title('Power Law Fit to Data')
We can also plot the linearized data:
fit_y = polyval(p, ln_x);
plot(ln_x, ln_y, 'o', ln_x, fit_y, '-')
legend('Transformed Data', 'Linear Fit', 'Location', 'northwest')
title('Linearized Fit (ln(y) vs ln(x))')
We began with curve fitting with linear models. Now, let's move on to curve fitting for polynomials. Let's say we have a quadratic:
The sum of the squares of the residuals is
We will differentiate just as we did with the linear model:
We can then set these equations equal to zero and rearrange to find the normal equations:
You would then solve for the coefficients of the unknowns
,
, and
.
We can extend this case of a quadratic to a general mth degree polynomial
The standard error here is given by
We have lost
degrees of freedom which equals the number of coefficients estimated. That means that more terms in a polynomial will result in losing degrees of freedom. This is all accounted for in our standard error. Let's work through an example:
From this, we can compute the following:
Using this, we can get a linear system:
Using the backslash operator, we can solve for the coefficients:
A = [6, 15, 55; 15, 55, 225; 55, 225, 979];
b = [152.6; 585.6; 2488.8];
a = A\b % Solve the normal equations for [a0; a1; a2]
We have
,
, and
. Therefore, the least squares quadratic equation is
The standard error of the estimate is
And the coefficient of determination is
Thus,
. This means that
of the uncertainty has been explained by the model. We can now graph:
% Original data from the example
x = [0 1 2 3 4 5];
y = [2.1 7.7 13.6 27.2 40.9 61.1];
% Normal equations matrix (from your previous system)
A = [6, 15, 55; 15, 55, 225; 55, 225, 979];
b = [152.6; 585.6; 2488.8];
% Solve for coefficients [a0; a1; a2]
a = A\b;
% Reorder coefficients for polyval: [a2, a1, a0]
p = flipud(a)';
% Generate smooth x-values for plotting the parabola
x_fit = linspace(min(x), max(x), 100);
y_fit = polyval(p, x_fit);
% Plot original data and fitted curve
plot(x, y, 'o', x_fit, y_fit, '-')
title('Quadratic Least Squares Fit')
legend({'Data Points', 'Quadratic Fit'}, 'Location', 'northwest', 'FontSize', 8)
If we had more than one independent variable such that we had a function in the form of
, we'd have to use multiple linear regression. We still make this fit a linear model through the same process as before. It just means we would have to use partial differentiation to acquire the derivatives of
. Setting them equal to zero would then allow us to obtain the linear model:
All of these models are different versions of what is known as a general linear least squares model.
Note: Because of time and length constraints, I was unable to include an example of multiple linear regression. I was also unable to cover fourier analysis, which is where you fit data using sinusoidal functions. However, fourier analysis will be covered in a future project.
We used interpolation before for root-finding, and now we are going to introduce polynomial interpolation. The whole purpose of this is that, given a number of data points, you can find one polynomial that passes through all the points. We will be going over the Lagrange interpolating polynomial.
Let's say that we have a linear interpolating polynomial which is the weighted average of two values
and
connected by a straight line:
where
and
are the weighting coefficients. That is,
If we substitute these into the linear interpolating polynomial, we get
This is a first-order Lagrange interpolating polynomial. We can also find the general form of a second-order Lagrange interpolating polynomial for three values
,
, and
. The second-order polynomial looks like
. Then, we need to find each L term. This is done the same way as before, where each term passes through only one point and is equal to zero at the other two. They would be as follows:
Substituting these results in the second-order Lagrange interpolating polynomial:
The first term is equal to
at
and equal to zero at
and
. The second term is equal to
at
and equal to zero at
and
. The third term is equal to
at
and equal to zero at
and
.
Both of these cases, as well as higher-order Lagrange polynomials, can be represented by a general form:
where
.
Let's work through an example. We have the following data, which represent the density of unused motor oil at
.
We will obtain the estimate at
using the first-order polynomial:
We can also use the second-order polynomial to yield a more accurate guess:
We can now use the following function Lagrange to find the density:
function f_est = Lagrange(x_data, y_data, x_target)
% Lagrange - Estimates value at x_target using Lagrange interpolation.
% x_data - Vector of known x-values (e.g., temperatures)
% y_data - Vector of known y-values (e.g., densities)
% x_target - Target x-value to interpolate at
% f_est - Interpolated y-value at x_target
n = length(x_data);
f_est = 0;
for j = 1:n
    L_j = 1;
    for i = [1:j-1, j+1:n]   % Skip i == j
        L_j = L_j * (x_target - x_data(i)) / (x_data(j) - x_data(i));
    end
    f_est = f_est + y_data(j) * L_j;
end
end

x_data = [0 20 40]; % Temperatures in degrees C
y_data = [3.85 0.800 0.212];
x_target = 15;
density_at_15 = Lagrange(x_data, y_data, x_target);
fprintf('Estimated density at %.1f°C is %.4f g/cm^3\n', x_target, density_at_15);
Estimated density at 15.0°C is 1.3317 g/cm^3
Higher-order polynomials can be generally accurate for some points but can be way off on others. Furthermore, they tend to be very sensitive to roundoff error. These deviations are known as oscillations. Carl Runge, a German mathematician (also known for Runge-Kutta methods which we will discuss in Section 2.9), came up with a function now known as Runge's function. As you increase the number of interpolation points, the polynomial can oscillate wildly at the edges of the interval. This is known as Runge's phenomenon:
f = @(x) 1 ./ (1 + 25 * x.^2);
% Define evaluation x-values for smooth curve
x_fine = linspace(-1, 1, 500);
y_true = f(x_fine);
% 5-point Lagrange interpolation
x_5 = linspace(-1, 1, 5);
y_5 = f(x_5);
y_5_interp = zeros(size(x_fine));
for k = 1:length(x_fine)
    y_5_interp(k) = Lagrange(x_5, y_5, x_fine(k));
end
% 10-point Lagrange interpolation
x_10 = linspace(-1, 1, 10);
y_10 = f(x_10);
y_10_interp = zeros(size(x_fine));
for k = 1:length(x_fine)
    y_10_interp(k) = Lagrange(x_10, y_10, x_fine(k));
end
plot(x_fine, y_true, 'k-', 'LineWidth', 2); hold on;
plot(x_fine, y_5_interp, 'r--', 'LineWidth', 1.5);
plot(x_fine, y_10_interp, 'b-.', 'LineWidth', 1.5);
plot(x_5, y_5, 'ro', x_10, y_10, 'bs');
xlabel('x'); ylabel('f(x)');
legend({'True Function', '5-Point Lagrange', '10-Point Lagrange', '5-pt Nodes', '10-pt Nodes'}, ...
'Location', 'eastoutside', 'FontSize', 8);
title('Runge''s Phenomenon');
A better alternative is to use splines. A spline is a piecewise-defined polynomial: instead of fitting one polynomial over the entire interval, you stitch together multiple low-degree polynomials that each cover a subset of the data points. This approach avoids the large oscillations we saw with high-order polynomials. The simplest spline is a linear spline, which connects the data points with straight lines. The main problem with this approach is that the result lacks smoothness. Quadratic splines improve on this by using second-order polynomials for each segment. The most widely used spline in numerical methods is the cubic spline, which offers the best compromise between smoothness and a lack of oscillations. If you go higher, the oscillations inherent in higher-order polynomials can begin to appear.
Cubic splines are continuous at the points where they are stitched together, which are known as knots. Not only are the function values continuous, but the first and second derivatives are also continuous at each knot, ensuring a smooth curve. A special case is known as a natural cubic spline, where the second derivative is equal to zero at the end points.
The general form of a linear spline is
,
where each interval i has its own spline function
,
is the intercept, and
is the slope of the straight line connecting the points. From this, we can get
Skipping ahead to cubic splines, we have
where for n data points, there are
intervals and
unknown coefficients to evaluate. Also keep in mind that
is the distance from the left endpoint of the interval. This is what localizes the polynomial to the interval. MATLAB has the very convenient function spline which is a cubic spline. The syntax looks like this:
s = spline(x, y, x_query)
where x is a vector of known x-values, y is a vector of known y-values, and x_query is a vector of x-values where you want to evaluate the spline.
Here is a simple example:
x = [0 1 2.5 3.6 5 7 8.1 10];
y = sin(x); % Hypothetical sample values (the original data are not shown)
x_query = linspace(0, 10, 200);
y_interp = spline(x, y, x_query);
plot(x, y, 'o', x_query, y_interp, '-')
title('Cubic Spline Interpolation')
legend('Data Points', 'Spline Fit', 'Location', 'southoutside', 'Orientation', 'horizontal')
On the other hand, if we were to use interp1 to get a linear spline, we would have a graph that lacks smoothness:
y_linear = interp1(x, y, x_query, 'linear');
plot(x, y, 'o', x_query, y_linear, '-')
title('Linear Spline Interpolation')
legend('Data Points', 'Linear Spline Fit', 'Location', 'southoutside', 'Orientation', 'horizontal')
There is also a function called pchip, which stands for Piecewise Cubic Hermite Interpolating Polynomial. This function also uses cubic polynomials, but the second derivative is not required to be continuous. The spline function produces a true cubic spline with continuous second derivatives, whereas pchip does not. The main advantage of pchip is that it is what is known as a shape-preserving fit: it avoids overshooting. On the other hand, spline prioritizes a smooth curve but can overshoot. There is also a fourth method called makima, which stands for modified Akima piecewise cubic Hermite interpolation. It is a good compromise between the lack of oscillations of pchip and the smoothness of spline.
p = pchip(x, y, x_query);
s = spline(x, y, x_query);
m = makima(x, y, x_query);
plot(x, y, 'o', x_query, p, '-', x_query, s, '-.', x_query, m, '--')
title('Comparison of Interpolation Methods')
legend('Sample Points', 'pchip', 'spline', 'makima', 'Location', 'southeast')
2.7 Numerical Differentiation
Numerical differentiation is used to approximate the derivative of a function that does not have an easy analytical derivative, if you only have discrete data points from an experiment, or if you have no equation at all. Recall the idea of finite differences. There are three approaches:
Forward difference is where you use the point ahead of x:
Backward difference is where you use the point behind x:
Central difference is where you use points on both sides of x:
Forward and backward difference are
accurate and central difference is
accurate, meaning that it is a much more accurate choice. Keep in mind that including more terms from the Taylor expansion will mean more accuracy. Truncating the Taylor series introduces error. For a central difference we have
where
. By expanding the Taylor series to include more points
, we can acquire a fourth-order central difference formula of order
:
where
. Let's try this out for
. We want to approximate
. The true value of this is
. Let's begin with step size
using the second-order central difference formula:
Using the fourth-order central difference formula, we get
We will repeat these calculations for different step sizes and compare the error:
Step Size h | Second-Order | Error | Fourth-Order | Error |
| | | | |
| | | | |
| | | | |
| | | | |
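To make the comparison concrete in code, here is a small sketch on a test function of my own choosing (f(x) = e^x at x0 = 1, not the function from the example above):
f = @(x) exp(x);               % Test function whose true derivative is also exp(x)
x0 = 1; true_val = exp(1);
for h = [0.5 0.25 0.125]
    d2 = (f(x0+h) - f(x0-h)) / (2*h);                               % O(h^2) central difference
    d4 = (-f(x0+2*h) + 8*f(x0+h) - 8*f(x0-h) + f(x0-2*h)) / (12*h); % O(h^4) central difference
    fprintf('h = %.3f: 2nd-order error %.2e, 4th-order error %.2e\n', ...
        h, abs(d2 - true_val), abs(d4 - true_val))
end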
The fourth-order central difference method is generally more accurate. Let's look at the formulas for higher-order derivatives.
This same idea holds true for the forward difference method too:
Lower-Order | Error | Higher-Order (more Taylor series terms) | Error |
| | | |
| | | |
| | | |
| | | |
And it holds for the backward difference method:
Lower-Order | Error | Higher-Order (more Taylor series terms) | Error |
| | | |
| | | |
| | | |
| | | |
And lastly, here's the rest of the central difference method:
Lower-Order | Error | Higher-Order (more Taylor series terms) | Error |
| | | |
| | | |
| | | |
| | | |
We can use the following data to test out these three finite difference methods. We are trying to find the derivative of
where
and
. The true value is
. We need to first run some calculations:
and 
and 
and 
and 
and 
Let's now compute the results:
Richardson extrapolation uses two derivative estimates to compute a more accurate approximation. Let's say we have derivative approximations
and
with step sizes
and
. Then, we have the formula for Richardson extrapolation:
For an approximation using central difference methods that is
, the application of Richardson extrapolation will yield a new derivative estimate that is
. Let's apply this using the same function as we did previously. We have
where
,
, and
. Let's now calculate the first derivative estimates:
We can then combine these to yield
which also gives the true answer. In the case with the higher-order central difference method, recall that we got
. This answer is exact because the function being analyzed was a fourth-order polynomial. And we fitted a fourth-order polynomial through the data points to find the formula we used to evaluate it. The Richardson extrapolation yielded
, which is also exact, and this is a result of how the method works; it fitted a higher-order polynomial through the data and then evaluated the derivatives through central divided differences. Both of these cases were lucky because in reality, both of these will yield estimates with error far more often than not. Keep in mind that both of these examples worked because the data were evenly spaced, and we also knew the functions. In a lot of cases, neither of these will be the case. One way to deal with this would be to fit a Lagrange interpolating polynomial to a set of adjacent points that are around the point you want to differentiate at. Another problem can be the presence or creation of errors. Experiments can yield data with errors, and roundoff errors can occur during differentiation. The best way to approach this is to use least squares regression to fit a smooth, differentiable function to your data. If you don't know where to start, using lower-order polynomial regression is the safest first choice.
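Here is a minimal sketch of Richardson extrapolation, again on a placeholder test function, assuming the two estimates are O(h^2) central differences with h2 = h1/2:
f = @(x) exp(x);                         % Placeholder test function
x0 = 1; h1 = 0.5; h2 = h1/2;
D1 = (f(x0 + h1) - f(x0 - h1)) / (2*h1); % O(h1^2) estimate
D2 = (f(x0 + h2) - f(x0 - h2)) / (2*h2); % O(h2^2) estimate
D = (4*D2 - D1) / 3                      % Extrapolated estimate, O(h^4)
fprintf('Errors: %.2e, %.2e, %.2e\n', abs(D1-exp(1)), abs(D2-exp(1)), abs(D-exp(1)))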
MATLAB has two functions that can help you find the derivatives of data (both also have many other purposes outside of numerical derivatives). First, we have diff, which calculates the difference between adjacent elements in a vector of length n. It will return a vector of length n - 1. For instance, let's try differentiating the following function from x = 0 to x = 0.8:
f = @(x) 0.2 + 25*x - 200*x.^2 + 675*x.^3 - 900*x.^4 + 400*x.^5; % Express f(x) as a function
x = 0:0.1:0.8; % Generate evenly spaced values from 0 to 0.8
y = f(x);      % Evaluate the function at each point
n = length(x);
% These compute differences between the pairs of elements in x and y.
% Take a look at what they output.
diff(x)
0.1000 0.1000 0.1000 0.1000 0.1000 0.1000 0.1000 0.1000
diff(y)
1.0890 -0.0010 0.3190 0.8490 0.8690 0.1390 -1.1010 -2.1310
% Let's calculate the difference approximations using vector division
d = diff(y) ./ diff(x);
% The vector d now contains derivative estimates corresponding to the
% midpoint between adjacent elements, almost like slope
% We need to store the x-values corresponding to the midpoints so we can
% plot the estimates against the true derivative
xm = (x(1:n-1) + x(2:n)) ./ 2;
% Let's use the actual derivative for a comparison
xa = 0:0.01:0.8; % Finer grid for the analytical derivative
ya = 25 - 400*xa + 3*675*xa.^2 - 4*900*xa.^3 + 5*400*xa.^4;
% Plot the numerical derivative vs the true derivative
plot(xm, d, 'o', xa, ya, '-')
title('diff Derivative Comparison')
legend('Estimates', 'True Derivative')
On the other hand, the gradient function estimates derivatives at the sample points themselves: gradient(y,h) returns numerical derivative estimates for the vector y with spacing h, while the symbolic form gradient(f,v) returns the gradient vector of a symbolic scalar field f with respect to the vector v in Cartesian coordinates.
The gradient vector of
with respect to
is the vector of the first partial derivatives of f:
In other words, it calculates an approximate derivative at every value in your vector rather than at midpoints. I will do this using the same function as before:
dy = gradient(y,0.1)
10.8900 5.4400 1.5900 5.8400 8.5900 5.0400 -4.8100 -16.1600 -21.3100
plot(x, dy, 'o', xa, ya, '-')
title('gradient Derivative Comparison')
legend('Estimates', 'True Derivative')
In this case, gradient is less accurate than diff. This is because gradient uses wider intervals. That being said, there's no clear answer on which is more accurate. It depends on your function and needs. I think gradient is just so much more versatile and has more features, so I would generally use it more.
You can also use it to symbolically calculate a vector of partial derivatives for a multivariable function:
syms x y z % Define symbolic variables
f = 2*y*z*sin(x) + 3*x*sin(z)*cos(y);
v = [x, y, z]; % Vector of variables to differentiate with respect to
gradient(f, v)
ans =

We can also use it to create vector fields:
f = 2*y*z*sin(x) + 3*x*sin(z)*cos(y);
grad_f = gradient(f, [x, y, z]); % Symbolic gradient vector
% Substitute z = 1 to reduce to a 2D vector field in x-y space
grad_f_2D = subs(grad_f, z, 1); % Fix z=1 for visualization
% Convert the two components to MATLAB functions for numeric evaluation
fx = matlabFunction(grad_f_2D(1), 'Vars', [x y]);
fy = matlabFunction(grad_f_2D(2), 'Vars', [x y]);
% Create grid points for x and y
[X, Y] = meshgrid(-2:0.2:2, -2:0.2:2);
% Evaluate the gradient components over the grid
U = fx(X, Y);
V = fy(X, Y);
quiver(X, Y, U, V)
title('Gradient Vector Field')
That is really awesome, isn't it?
2.8 Numerical Integration
Numerical integration is used to approximate definite integrals that cannot be solved analytically.
From your first calculus class, you may remember the trapezoidal rule. Please see Section 2.1 if you do not. The trapezoidal rule belongs to a larger family of formulas known as the Newton-Cotes formulas. These are the most common methods of numerical integration. The Newton-Cotes strategy is very simple; you replace a more complicated function or data set with a polynomial that is easy to integrate:
where
is an nth-degree polynomial in the form
. For the trapezoidal method, we have a first-order polynomial:
As I'm sure you remember, there is error with the trapezoidal rule. We can make an estimate for truncation error with
Let's numerically integrate
from
to
. We will set the true value as
. We simply plug in our values:
We can calculate
. This is absolutely horrible, but keep in mind that this is exact error because we have the true value. In most cases, we won't because there's basically no point in numerically integrating if we have the true value. We need the second derivative which is
. Then, the average value of the second derivative can be computed as:
We can now find the approximate error:
We used the formula for
, but keep in mind that the average second derivative is not necessarily a sufficient approximation of
. This is why we denote it with
instead. Even with approximate error, this method is clearly not very accurate. We can improve the trapezoidal rule by dividing the bounds of integration into segments and applying the method to each individual section. Let
and
:
Then, we can substitute in the formula for the trapezoidal rule to get
where
. By summing the individual errors for each segment, we can get error:
where i is an individual segment and
. Alternatively,
Finally, we can get the approximate error
This method is known as the composite trapezoidal rule. Let's try it out for the same function as last time but with
and
:
This is an improvement, but it's still not quite accurate enough. We would need to increase the number of segments. For
and
, we have
and
. This is a much better estimate, but it takes too many steps to get to.
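If you want to experiment with the composite trapezoidal rule yourself, here is a minimal sketch; the integrand e^x, the bounds, and the number of segments are placeholders of my own choosing rather than the example above:
f = @(x) exp(x);      % Placeholder integrand
a = 0; b = 1; n = 8;  % Bounds and number of segments
x = linspace(a, b, n+1);
h = (b - a) / n;
I = h * (f(x(1)) + 2*sum(f(x(2:end-1))) + f(x(end))) / 2
I_check = integral(f, a, b) % MATLAB's adaptive quadrature for comparison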
Another method is known as Simpson's rule. There are different versions of Simpson's rule: one for second-order polynomials called Simpson's 1/3 rule and one for third-order polynomials called Simpson's 3/8 rule. The formula for Simpson's 1/3 rule is
with truncation error of
. Trying this out for the same function that we have been using yields
. This comes with an error of
. This comes with an estimated error of
. Our estimated error matches our error because this deals with fifth-order polynomials. The trapezoidal rule fits straight lines between points, but Simpson's rule fits parabolas between points. This makes it generally more accurate. Thus, our results make sense. That being said, it is more restrictive; it requires evenly spaced points and an even number of intervals.
There is a composite version of Simpson's 1/3 rule which is given by
with
. We can apply this to the same function to yield
. This comes with an error of
. This comes with an estimated error of
which matches our error. This is even more accurate. For Simpson's 3/8 rule, we have
with
. This rule is more accurate than Simpson's 1/3 rule because it uses a higher-order polynomial. Keep in mind that you can combine both rules by using either for a certain number of segments and then adding them together. Higher-order formulas do exist, but they are generally unnecessary for applications in science and engineering. It's funny that your AP calculus teacher (the blame is actually more on College Board than them) likely spent more time on symbolic integration than numerical integration. They likely spent a week breezing through methods for approximating integrals despite the fact that these are what are used in engineering and scientific computing!
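Here is a minimal sketch of the composite Simpson's 1/3 rule on a placeholder integrand of my own choosing (note that n must be even):
f = @(x) exp(x);       % Placeholder integrand
a = 0; b = 1; n = 8;   % n must be even for the composite 1/3 rule
x = linspace(a, b, n+1);
h = (b - a) / n;
I = h/3 * (f(x(1)) + 4*sum(f(x(2:2:end-1))) + 2*sum(f(x(3:2:end-2))) + f(x(end)))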
All of the methods we've used so far have been for evenly spaced points. There are a lot of situations where this isn't the case. We can use a similar form of the composite trapezoidal rule but with nonconstant values of h to represent the width of each segment i:
MATLAB has a function called trapz that uses the trapezoidal method. For I = trapz(x,y), it integrates y with scaling specified by x. If
is concave up, trapz will overestimate the value of the integral and vice versa. Here's an example with uniform spacing (the integrand here is an assumed example, since the original is not shown):
x = 0:pi/100:pi; % Domain vector
y = sin(x);      % Integrand values
I = trapz(x,y)   % Approximates the integral of sin(x) from 0 to pi (true value 2)
Here's an example with nonuniform spacing where the rows of y represent velocity data taken at times in x:
y = [5.2, 7.7, 9.6, 13.2;
I = trapz(x,y,2) % Each entry in vector I represents the integral for that corresponding row of y
Here, we use a third parameter dim to specify the dimension to integrate along. In this case, we have dim = 2 because the data is in the rows of y.
If you only have numeric data and don't know the function, you can also use trapz to evaluate double or triple integrals. Let's say we want to approximate
We would simply nest function calls to trapz:
I = trapz(y, trapz(x, F, 2))
For a triple integral
: [X, Y, Z] = meshgrid(x, y, z);
I = trapz(z, trapz(y, trapz(x, F, 2), 1), 3)
We also have the poorly named function cumtrapz that computes cumulative integrals. It essentially is the same thing as trapz, but it tells you the total accumulated area up to a specific point.
cumulative_area = cumtrapz(x, y);
plot(x, cumulative_area, 'o-')
title('Cumulative Integral of {e^x} Using cumtrapz')
We can also use cumtrapz to evaluate double integrals:
I_cum = cumtrapz(y, cumtrapz(x, F, 2)) % This is the cumulative area at each of the points
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0.3321 0.6584 0.9791 1.2944 1.6045 1.9096 2.2099 2.5056 2.7969 3.0840 3.3671 3.6464 3.9221 4.1944 4.4635 4.7296 4.9929 5.2536 5.5119 5.7680 6.0221 6.2744 6.5251 6.7744 7.0225 7.2696 7.5159 7.7616 8.0069 8.2520 8.4971 8.7424 8.9881 9.2344 9.4815 9.7296 9.9789 10.2296 10.4819 10.7360 10.9921 11.2504 11.5111 11.7744 12.0405 12.3096 12.5819 12.8576 13.1369
0 0.6544 1.2972 1.9288 2.5496 3.1600 3.7604 4.3512 4.9328 5.5056 6.0700 6.6264 7.1752 7.7168 8.2516 8.7800 9.3024 9.8192 10.3308 10.8376 11.3400 11.8384 12.3332 12.8248 13.3136 13.8000 14.2844 14.7672 15.2488 15.7296 16.2100 16.6904 17.1712 17.6528 18.1356 18.6200 19.1064 19.5952 20.0868 20.5816 21.0800 21.5824 22.0892 22.6008 23.1176 23.6400 24.1684 24.7032 25.2448 25.7936
0 0.9671 1.9168 2.8497 3.7664 4.6675 5.5536 6.4253 7.2832 8.1279 8.9600 9.7801 10.5888 11.3867 12.1744 12.9525 13.7216 14.4823 15.2352 15.9809 16.7200 17.4531 18.1808 18.9037 19.6224 20.3375 21.0496 21.7593 22.4672 23.1739 23.8800 24.5861 25.2928 26.0007 26.7104 27.4225 28.1376 28.8563 29.5792 30.3069 31.0400 31.7791 32.5248 33.2777 34.0384 34.8075 35.5856 36.3733 37.1712 37.9799
0 1.2704 2.5176 3.7424 4.9456 6.1280 7.2904 8.4336 9.5584 10.6656 11.7560 12.8304 13.8896 14.9344 15.9656 16.9840 17.9904 18.9856 19.9704 20.9456 21.9120 22.8704 23.8216 24.7664 25.7056 26.6400 27.5704 28.4976 29.4224 30.3456 31.2680 32.1904 33.1136 34.0384 34.9656 35.8960 36.8304 37.7696 38.7144 39.6656 40.6240 41.5904 42.5656 43.5504 44.5456 45.5520 46.5704 47.6016 48.6464 49.7056
0 1.5645 3.1000 4.6075 6.0880 7.5425 8.9720 10.3775 11.7600 13.1205 14.4600 15.7795 17.0800 18.3625 19.6280 20.8775 22.1120 23.3325 24.5400 25.7355 26.9200 28.0945 29.2600 30.4175 31.5680 32.7125 33.8520 34.9875 36.1200 37.2505 38.3800 39.5095 40.6400 41.7725 42.9080 44.0475 45.1920 46.3425 47.5000 48.6655 49.8400 51.0245 52.2200 53.4275 54.6480 55.8825 57.1320 58.3975 59.6800 60.9805
0 1.8496 3.6644 5.4456 7.1944 8.9120 10.5996 12.2584 13.8896 15.4944 17.0740 18.6296 20.1624 21.6736 23.1644 24.6360 26.0896 27.5264 28.9476 30.3544 31.7480 33.1296 34.5004 35.8616 37.2144 38.5600 39.8996 41.2344 42.5656 43.8944 45.2220 46.5496 47.8784 49.2096 50.5444 51.8840 53.2296 54.5824 55.9436 57.3144 58.6960 60.0896 61.4964 62.9176 64.3544 65.8080 67.2796 68.7704 70.2816 71.8144
0 2.1259 4.2112 6.2573 8.2656 10.2375 12.1744 14.0777 15.9488 17.7891 19.6000 21.3829 23.1392 24.8703 26.5776 28.2625 29.9264 31.5707 33.1968 34.8061 36.4000 37.9799 39.5472 41.1033 42.6496 44.1875 45.7184 47.2437 48.7648 50.2831 51.8000 53.3169 54.8352 56.3563 57.8816 59.4125 60.9504 62.4967 64.0528 65.6201 67.2000 68.7939 70.4032 72.0293 73.6736 75.3375 77.0224 78.7297 80.4608 82.2171
0 2.3936 4.7408 7.0432 9.3024 11.5200 13.6976 15.8368 17.9392 20.0064 22.0400 24.0416 26.0128 27.9552 29.8704 31.7600 33.6256 35.4688 37.2912 39.0944 40.8800 42.6496 44.4048 46.1472 47.8784 49.6000 51.3136 53.0208 54.7232 56.4224 58.1200 59.8176 61.5168 63.2192 64.9264 66.6400 68.3616 70.0928 71.8352 73.5904 75.3600 77.1456 78.9488 80.7712 82.6144 84.4800 86.3696 88.2848 90.2272 92.1984
0 2.6529 5.2536 7.8039 10.3056 12.7605 15.1704 17.5371 19.8624 22.1481 24.3960 26.6079 28.7856 30.9309 33.0456 35.1315 37.1904 39.2241 41.2344 43.2231 45.1920 47.1429 49.0776 50.9979 52.9056 54.8025 56.6904 58.5711 60.4464 62.3181 64.1880 66.0579 67.9296 69.8049 71.6856 73.5735 75.4704 77.3781 79.2984 81.2331 83.1840 85.1529 87.1416 89.1519 91.1856 93.2445 95.3304 97.4451 99.5904 101.7681
I = I(end) % Final cumulative value (total integral)
We can create a beautiful visualization for the function
as a 3D surface and the surface of cumulative integration where a red star marks the final area: I_cum = cumtrapz(y,cumtrapz(x,F,2));
surf(X,Y,F,'EdgeColor','none')
surf(X,Y,I_cum,'FaceAlpha',0.5,'EdgeColor','none')
plot3(X(end),Y(end),I_cum(end),'r*')
title('cumtrapz Visualization')
MATLAB used to have dblquad and triplequad for double and triple integrals of known functions. These have been replaced with integral, integral2, and integral3.
The general syntax for each is as follows:
q = integral(fun,xmin,xmax,Name,Value)
q = integral2(fun,xmin,xmax,ymin,ymax,Name,Value)
q = integral3(fun,xmin,xmax,ymin,ymax,zmin,zmax,Name,Value)
This may look like a lot of parameters, but keep in mind that you only really need up to xmax, ymax, and zmax respectively.
Here's an example for integral:
Here's an example for integral2:
I2 = integral2(f2, -3, 3, -5, 5)
Here's an example for integral3:
f3 = @(x,y,z) x.*y + z.^2;
I3 = integral3(f3, 0,1, 0,2, -1,1)
2.9 Ordinary Differential Equations
Numerically solving ordinary differential equations (ODEs) is one of the most important parts of scientific computing because real-world systems are rarely modeled by differential equations that have easy closed-form solutions. First, we will look at ODEs in their most basic form of
.
We will be working specifically with initial value problems (IVPs). That is,
. We can say that on some interval
, there exists a differentiable function
such that
and
for all
. In other words, the solution curve
must pass through the initial point
and we know the slope there, which is given by
. From this, we can visualize the solutions of ODEs through slope fields across a region of interest. A slope field essentially defines slopes over the region. The process is to start at an initial point
, move a small horizontal step of size h in the direction of the slope
, and then move vertically by
such that the desired slope is being matched. After that, you will be at the next point on the solution curve
, and thus the process repeats. The slope field for
over the rectangle (domain)
with two initial values
and
can be coded up as such:
f = @(t, y) (t - y) / 2;               % Slope function
[T, Y] = meshgrid(0:0.25:5, 0:0.25:5); % Grid over the region of interest (assumed 0 <= t, y <= 5)
dy = f(T, Y);                          % Slope at each grid point
dt = ones(size(T));                    % Unit horizontal step
% Normalize vectors for quiver
mag = sqrt(dt.^2 + dy.^2);
dt = dt ./ mag;
dy = dy ./ mag;
quiver(T, Y, dt, dy, 0.5)
hold on
title('Slope Field for dy/dt = (t - y)/2')
% Plot the two exact solution curves from the example
t_fine = linspace(0, 5, 100);
y1 = 3*exp(-t_fine/2) - 2 + t_fine;
y2 = 6*exp(-t_fine/2) - 2 + t_fine;
plot(t_fine, y1, 'r', 'LineWidth', 2)
plot(t_fine, y2, 'b', 'LineWidth', 2)
legend('Slope Field', 'Solution: y(0)=1', 'Solution: y(0)=4', 'Location', 'southoutside')
The Lipschitz condition is a way of ensuring that small changes in y don't cause a continuous function to change too fast. A function
that is continuous on a rectangle R satisfies the Lipschitz condition provided that a Lipschitz constant
exists such that
,
where
. In other words, if
with respect to y is bounded by a constant L, the function satisfies the Lipschitz constant. And additionally, if
is continuous, then the IVP has a locally defined solution. This is known as the Picard–Lindelöf theorem, the Cauchy-Lipschitz theorem, or most commonly the existence and uniqueness theorem. Like many other numerical methods, the general approach is to construct an algorithm consisting of multiple steps that converge to a solution. The most famous of these is Euler's method. The first derivative provides an estimate of the slope at
:
From this, we can get the formula for each update in Euler's method:
The goal is to step forward from
to
by following the slope at each step. Euler's method is a first-order method, which means that the local error at each step is proportional to the square of the step size h, while the total (global) error is proportional to h itself. Reducing the step size will improve accuracy but increase computation time because more iterations will be needed. Let's work through an IVP problem using Euler's method. We have
from
to 4 with a step size of 1. We are given the condition
. The exact y can be determined analytically as
. Then, the true solution is
. Let's now implement Euler's method:
Thus, the relative percent error is
.
We would then continue on for the next three steps. Results are tabulated:
Let's look at the graph:
t = 0:1:4;
y_true = [2.00000, 6.19463, 14.84392, 33.67717, 75.33896];
y_euler = [2.00000, 5.00000, 11.40216, 25.51321, 56.84931];
plot(t, y_true, 'r-', 'LineWidth', 2)
hold on
plot(t, y_euler, 'bo-', 'LineWidth', 2, 'MarkerSize', 6)
title('Euler’s Method Approximation vs. True Solution')
legend('True Solution', 'Euler Approximation', 'Location', 'southoutside')
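If you want to reproduce the y_euler values above programmatically, here is a minimal Euler sketch; the slope function dy/dt = 4e^(0.8t) - 0.5y with y(0) = 2 is inferred from the tabulated values rather than quoted from the original problem statement:
f = @(t, y) 4*exp(0.8*t) - 0.5*y; % Slope function consistent with the values above
h = 1; t = 0:h:4;
y = zeros(size(t)); y(1) = 2;
for i = 1:length(t)-1
    y(i+1) = y(i) + h * f(t(i), y(i)); % Step forward along the current slope
end
disp(y) % 2  5.0000  11.4022  25.5132  56.8493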
As you can see, the error here is non-negligible. Numerical solutions of ODEs are subject to two types of error: truncation and roundoff. Error of Euler's method is synonymous with error of the expansion of a Taylor series. The solution to an ODE with continuous derivatives can be represented as a Taylor series about a starting value
:
where
,
is the remainder term, and
. We can combine these equations to yield an alternate form that uses big O notation:
This says the truncation error is proportional to
. We can also cut out some of the true solution by subtracting
:
For sufficiently small h, the higher-order terms are usually negligible, and thus approximate local truncation error is
.
There is also a metric known as stability. Stability refers to the rate at which errors grow. If error grows exponentially for a problem with a bounded solution, the numerical solution is unstable. For a differential equation
with
, we can use Euler's method to solve:
Combining, we can get the update formula:
where
is known as the amplification factor. If
, the solution is considered stable. This is technically a very simple version of von Neumann stability analysis applied to ODEs, which is essentially the analysis of amplification factors in the context of partial differential equations (PDEs). PDEs will be covered in a future project.
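Here is a small sketch of this stability behavior using Euler's method on the test problem dy/dt = -a*y; the decay constant a and the step sizes are placeholders of my own choosing:
a = 10; y0 = 1; tEnd = 5;
for h = [0.05 0.15 0.25]              % The last step size violates |1 - a*h| <= 1
    t = 0:h:tEnd;
    y = zeros(size(t)); y(1) = y0;
    for k = 1:length(t)-1
        y(k+1) = (1 - a*h) * y(k);    % Amplification factor (1 - a*h) at each step
    end
    fprintf('h = %.2f: |1 - a*h| = %.2f, final y = %.3g\n', h, abs(1 - a*h), y(end))
end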
One way to improve on Euler's method is to factor in a second evaluation of slope. In other words, we determine two derivatives for the interval: one at the beginning and another at the end. Then, you average the two derivatives. This is known as Heun's method. In Euler's method, the slope at the beginning of an interval is given by
, which is then extrapolated linearly to
to yield
. In Euler's method, we would stop here. In Heun's method, we continue. This is now known as the predictor equation. We can now get an equation that gives slope at the end of the interval:
where
denotes that it is a prediction. We can combine the equation for slope at the beginning of an interval and slope at the end of an interval to get an average:
Using Euler's method, we can then extrapolate linearly from
to
:
This is known as the corrector equation. Keep in mind that if the ODE is solely a function of the independent variable t, you don't really need the predictor equation.
The corrector equation in this case is
.
Error of the Heun method is similar to the trapezoidal rule. Recall that the trapezoidal rule is given by
.
If we solve the ODE
using integration like this:
We can then substitute this result into the equation for the trapezoidal rule to get
.
This is identical to the corrector equation in Heun's method. Thus, local truncation error is given by the same equation:
We will use Heun's method to work through part of the same function that was used in the Euler's method example. We will add a stopping criterion of
so that corrector iterations do not go on forever. First, we calculate the slope at
as
. Then, we use the predictor to compute a value at 1:
This is the result that would be obtained by Euler's method. This comes with a percent relative error of
. To improve the estimate for
, we will use the value
to predict the slope at the end of the interval:
Combining this with the initial slope, we can calculate average slope over the interval from
to 1:
We will then substitute this into the corrector equation to get the prediction at
:
,
which comes with a true percent relative error of
and 
We can then improve our value by substituting the new result into the corrector:
,
which comes with a true percent relative error of
and
. The next iteration gives
where
and
. After 12 iterations, the result at
is
which has
. This is the point at which the approximate error is below the stopping criterion. I've added both results for the Heun method to the previous plot:
t = 0:1:4;
y_true = [2.00000, 6.19463, 14.84392, 33.67717, 75.33896];
y_euler = [2.00000, 5.00000, 11.40216, 25.51321, 56.84931];
y_heun = [2.00000, 6.70108, 16.31978, 37.19925, 83.33777];
y_heun_iter = [2.00000, 6.36087, 15.30224, 34.74328, 77.73510];
plot(t, y_true, 'r-', 'LineWidth', 2, 'DisplayName', 'True Solution')
hold on
plot(t, y_euler, 'bo-', 'LineWidth', 2, 'MarkerSize', 6, 'DisplayName', 'Euler Approximation')
plot(t, y_heun, 'gs-', 'LineWidth', 2, 'MarkerSize', 6, 'DisplayName', 'Heun (No Iteration)')
plot(t, y_heun_iter, 'yd-', 'LineWidth', 2, 'MarkerSize', 6, 'DisplayName', 'Heun (With Iteration)')
title('Comparison of All Methods')
legend('Location', 'southoutside')
Runge-Kutta methods (RK) are the workhorse of numerical ODE solvers. They are just as accurate as the Taylor series methods but do not require you to calculate higher-order derivatives. There are many variations of Runge-Kutta, but they can all be generalized as
.
where ϕ is called an increment function. It is essentially a weighted sum of the slope estimates over the interval. It can be written generally as
.
where the a terms are constants and the k terms are
where the p and q terms are constants.
The second-order version of Runge-Kutta (RK2) is
where
and
. If we assume that
,
, and
, then RK2 becomes
where
and
. This is also known as the midpoint method.
The most popular version of Runge-Kutta is by far the fourth-order version (RK4):
where
,
,
, and
.
For ODEs that are solely a function of t, RK4 reduces to Simpson's 1/3 rule, just as Heun's method reduces to the trapezoidal rule. The idea behind all of them is to evaluate either the function or its derivative (slope) at multiple points and then combine those estimates using a weighted average.
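Here is a minimal sketch of classical RK4 as a reusable MATLAB function; the function name rk4 and its interface are my own choices for illustration. It accepts vector-valued y so it can also be reused for the systems of ODEs discussed later.
function [t, y] = rk4(f, tspan, y0, h)
% RK4: classical fourth-order Runge-Kutta for dy/dt = f(t, y)
% f must return a column vector of derivatives (a scalar works too)
t = (tspan(1):h:tspan(2))';
y = zeros(length(t), numel(y0));
y(1, :) = y0(:)';
for i = 1:length(t)-1
    k1 = f(t(i),       y(i, :)');            % slope at the beginning of the interval
    k2 = f(t(i) + h/2, y(i, :)' + h/2*k1);   % slope at the midpoint using k1
    k3 = f(t(i) + h/2, y(i, :)' + h/2*k2);   % slope at the midpoint using k2
    k4 = f(t(i) + h,   y(i, :)' + h*k3);     % slope at the end of the interval
    y(i+1, :) = y(i, :) + h/6*(k1 + 2*k2 + 2*k3 + k4)';
end
end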
Let's use RK4 to solve
by integrating from
to 1 with a step size of 1. The initial condition is
. First, let's calculate the slope at the beginning of the interval:
We can then use this value to find a value of y and then slope at the midpoint:
We can then use this value to find another value of y and another slope at the midpoint:
We can then use this value to find a value of y and then slope at the end of the interval:
Now, we will combine these four slope estimates to yield an average slope. Then, we will use it to make the final prediction at the end of the interval:
This is a satisfactory estimate of the true value
. We have
. It's possible to develop higher-order RK methods. For instance, the fifth-order RK method is written as
.
There is a standard way of organizing the coefficients of Runge-Kutta methods known as the Butcher tableau:
This is a table that organizes the coefficients of an RK method. The c-values tell you where in the step each slope is evaluated, the a-values tell you how previous k-values feed into each new slope, and the b-values are the final weights used to combine all the k-values into the next y. Here is the one for RK4:
Going from row 1 to row 4, here is what the c-values tell you:
- 0 means that you evaluate k1 at the start of the interval, t_i.
- 1/2 means that you evaluate k2 halfway through the step, at t_i + h/2.
- 1/2 means that you evaluate k3 halfway through the step as well, but using the new y-value predicted with k2.
- 1 means that you evaluate k4 at the end of the interval, t_i + h.
Going from row 1 to row 4, here is what the a-values tell you:
- All zeros means that k1 doesn't depend on any previous slopes. This makes sense because it is the first slope value.
- 1/2 means that k2 depends only on k1 and is scaled by 1/2.
- 1/2 means that k3 depends only on k2 and is scaled by 1/2.
- 1 means that k4 depends only on k3 and is scaled by 1.
Going from left to right along the bottom row, here is what the b-values tell you:
- 1/6 is the weight for k1.
- 1/3 is the weight for k2.
- 1/3 is the weight for k3.
- 1/6 is the weight for k4.
This of course is represented in
. Can you figure out which methods these two are?
The first is RK3 and the second is Heun's third-order method.
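One way to see how the tableau fully specifies a method is to write a single explicit RK step driven directly by the arrays (A, b, c). This is my own generic sketch for a scalar ODE, not something from the original text, shown here with the RK4 coefficients.
function ynew = rk_step(f, t, y, h, A, b, c)
% RK_STEP: one explicit Runge-Kutta step defined by its Butcher tableau (A, b, c)
% Works for explicit methods only (A is strictly lower triangular)
s = length(b);                  % number of stages
k = zeros(s, 1);                % slope estimates
for i = 1:s
    k(i) = f(t + c(i)*h, y + h*(A(i, 1:i-1)*k(1:i-1)));
end
ynew = y + h*(b*k);             % weighted average of the slopes
end
Feeding it the RK4 coefficients reproduces one step of the classical method:
c = [0 1/2 1/2 1];
A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0];
b = [1/6 1/3 1/3 1/6];
ynew = rk_step(@(t, y) -0.5*y, 0, 2, 1, A, b, c);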
All of the problems we've solved so far are single equations. In real-world applications, you will often have a system of ODEs. Such a system can be represented generally as
All of the methods we have discussed so far can be applied to systems of ODEs. The only difference is that y becomes a vector and your slope function returns a vector of derivatives.
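For instance, here is a hypothetical two-equation system (a simple harmonic oscillator, not one from the text) solved with the rk4 sketch from earlier, just to show that the slope function now returns a column vector:
f = @(t, y) [y(2); -y(1)];              % dy1/dt = y2, dy2/dt = -y1
[t, y] = rk4(f, [0 10], [1 0], 0.1);    % y0 is now a vector of initial conditions
plot(t, y(:,1), t, y(:,2))
legend('y_1', 'y_2')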
Let's work through an example. These equations for velocity and displacement, respectively, come from the bungee jumper system:
We can express these as ODEs:
Let's first solve for the slopes at the beginning of the interval:
where
is the ith value of k for the jth dependent variable. Next, we will calculate the first values of x and v at the midpoint of the first step:
We will then use these values to calculate the first set of midpoint slopes:
We will then use these values to find the second set of predicted midpoint values and slopes:
These are then used to find the predicted values and slopes at the end of the interval:
We can now plug these in to RK4:
The true values are
and
. From our approximations, we have
and
.
Let's say that the solution to a numerical ODE changes very slowly. For the first half of the interval, over many steps, you aren't getting much change, so a large step size would be sufficient. But what happens if the results then start to change rapidly? You would need a small step size to track the fast change. Using that small step size everywhere, however, means wasting effort on the slowly changing first half of the interval. A dilemma like this reflects one key drawback of solving a numerical ODE with a constant step size. Thankfully, there are adaptive methods that can automatically adjust the step size. MATLAB's ode45 (Runge-Kutta 4,5) function is the standard for solving ODEs and handles all of this for us. Keep in mind that ode45 only works well for non-stiff ODEs. Non-stiff ODEs are well behaved, whereas stiff ODEs contain components that change on very different time scales and force explicit solvers like ode45 to take extremely small steps to remain stable. For stiff ODEs, it's best to use a function like ode15s.
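As a quick illustration of the stiff case, here is a sketch (my own addition, using the van der Pol oscillator with a large damping parameter, a standard stiff test problem). ode15s handles it comfortably, while ode45 would be forced to take an enormous number of tiny steps over the same time span:
mu = 1000;                                              % large mu makes the problem stiff
vdp = @(t, y) [y(2); mu*(1 - y(1)^2)*y(2) - y(1)];      % van der Pol equations as a system
[t, y] = ode15s(vdp, [0 3000], [2; 0]);                 % stiff solver
plot(t, y(:,1))
title('van der Pol Oscillator (stiff), solved with ode15s')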
Let's start from the top. While MATLAB is mostly for numeric work, it can also perform symbolic work, as we've seen previously. It can solve simple differential equations symbolically through the function dsolve. Let's say we wanted to solve the first-order differential equation
: Using older syntax, we would do this:
s = dsolve('Dy = y*x', 'x')
s =

I'm including the older syntax because you will still run into it in older code and references. Note that dsolve in either form is part of the Symbolic Math Toolbox, which you might not have.
Let's instead do it using the modern syntax, where we first declare y as a symbolic function of x:
syms y(x)
eqn = diff(y, x) == x*y;
sol = dsolve(eqn)
sol =

The modern syntax is easier to read, easier to work with, and is more in line with other languages.
We could also turn this into an IVP with
. Using older syntax, we would do this:
eqn1 = 'Dy = y*x';
y = dsolve(eqn1, 'y(1)=1', 'x')
y =

Let's do it using the modern syntax:
ode = diff(y, x) == x*y;
cond = y(1) == 1;
sol = dsolve(ode, cond)
sol =

Let's solve the second-order equation
where
and
. Using older syntax, we would do this:
eqn2 = 'D2y + 8*Dy + 2*y = cos(x)';
inits2 = 'y(0) = 0, Dy(0) = 1';
y = dsolve(eqn2, inits2, 'x')
y =

title('Solution to Second-Order ODE')
Using the modern syntax we would have this:
syms y(x)
ode = diff(y, x, 2) + 8*diff(y, x) + 2*y == cos(x);
cond1 = y(0) == 0;
cond2 = subs(diff(y, x), x, 0) == 1;
sol = dsolve(ode, [cond1, cond2]);
x_vals = linspace(0, 1, 200);
f_numeric = matlabFunction(sol);
y_vals = f_numeric(x_vals);
plot(x_vals, y_vals)
title('Solution to Second-Order ODE using syms')
One application of differential equations is population dynamics in biological systems. One of the most famous models, which you likely already know, is the Lotka-Volterra predator-prey model:
, 
, 
where prey birth rate
, predation rate coefficient
, predator death rate
, predator reproduction efficiency
,
, and
. These values are based on real data collected by the Hudson's Bay Company between 1900 and 1920. Here is the function:
function yprime = lv(t, y)
% LV: Lotka-Volterra predator-prey equations
% y(1) = prey population, y(2) = predator population
% a, b, r, and c should be set to the parameter values given above
yprime = [a*y(1) - b*y(1)*y(2); -r*y(2) + c*y(1)*y(2)];
To solve this, we finally get to use ode45:
[t, y] = ode45(@lv, [0 20], [30 4])
t =
0
0.1156
0.2311
0.3467
0.4623
0.8430
1.2236
1.6043
1.9850
2.3258
y =
30.0000 4.0000
31.5464 3.9882
33.1727 3.9959
34.8809 4.0242
36.6723 4.0745
43.1440 4.4378
50.5185 5.1629
58.5240 6.4846
66.4499 8.8629
72.6721 12.3253
plot(t, y(:,1), t, y(:,2), 'LineWidth', 2)   % column 1 = prey, column 2 = predators
title('Lotka-Volterra Predator-Prey Model')
legend('Prey', 'Predator', 'Location', 'southoutside')
The vector t holds the time values where the system was evaluated. The matrix y holds the corresponding prey and predator populations. Because this is from the Canadian ecosystem, we can assume that the prey and predators are snowshoe hares and Canadian lynxes respectively. For instance, at t = 0.3467 years, we have y(1) = 34.8809 and y(2) = 4.0242. This means that the prey population has grown from its initial value of 30 thousand hares to approximately 34.8809 thousand hares. The predator population has increased from 4 thousand lynxes to approximately 4.0242 thousand lynxes. We can match this up with the real data:
(Figure: hare and lynx population data, from Fundamentals of Ecology by Odum)
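If you want the populations at a time that ode45 did not happen to return, one option (my own addition, not something done in the original example) is to call ode45 with a single output so it returns a solution structure, and then evaluate that structure anywhere with deval:
sol = ode45(@lv, [0 20], [30 4]);   % same problem, but returned as a solution structure
pops = deval(sol, 10)               % prey and predator populations (in thousands) at t = 10 years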
Let's work through another example, this time for a system of Lorenz equations:
where Prandtl number
, Rayleigh number
,
. We will set the initial conditions to
,
, and
. The Lorenz system is a model for atmospheric convection that is famous for exhibiting chaotic behavior even though it is completely deterministic. You may know the analogy of "a butterfly flapping its wings in Brazil causing a tornado in Texas." This is the phenomenon Lorenz was studying, and you may have heard of it as the butterfly effect.
Here is the function:
function xprime = lorenz(t, x)
% LORENZ: Computes the derivatives for the Lorenz system
sigma = 10; rho = 28; beta = 8/3;   % classic chaotic parameter values
xprime = [-sigma * x(1) + sigma * x(2);
          rho * x(1) - x(2) - x(1) * x(3);
          -beta * x(3) + x(1) * x(2)];
And now let's use ode45 to solve:
tspan = [0 20];    % Arbitrary time span
x0 = [-8 8 27];    % Initial conditions
[t, x] = ode45(@lorenz, tspan, x0)
t =
0
0.0025
0.0050
0.0075
0.0100
0.0226
0.0352
0.0477
0.0603
0.0716
x =
-8.0000 8.0000 27.0000
-7.6036 7.9571 26.6639
-7.2182 7.9093 26.3387
-6.8436 7.8574 26.0242
-6.4796 7.8021 25.7199
-4.8119 7.4981 24.3412
-3.3774 7.1889 23.1694
-2.1473 6.9160 22.1667
-1.0914 6.7046 21.3016
-0.2618 6.5763 20.6165
t is a vector of time points and x is a matrix with 3 columns for each variable.
Let's plot it:
title('Lorenz Strange Attractor')
This is Lorenz's strange attractor, the attractor associated with the butterfly effect.
We can also plot each component of the solution as a function of t:
sgtitle('Lorenz System Components vs Time')
Here's a 3D model of Lorenz's strange attractor. Please try running it in MATLAB yourself!
figure('Color', 'k')   % dark background so the yellow curve and white labels show up
plot3(x(:,1), x(:,2), x(:,3), 'y', 'LineWidth', 1.5)
xlabel('x', 'Color', 'w')
ylabel('y', 'Color', 'w')
zlabel('z', 'Color', 'w')
title('Lorenz Strange Attractor', 'Color', 'w')
set(gca, 'Color', 'k', 'XColor', 'w', 'YColor', 'w', 'ZColor', 'w')
Here's the code for a fun animation of a point moving along Lorenz's strange attractor over time. Please try running it in MATLAB yourself!
lorenz_animated = @(t, x) [10 * (x(2) - x(1));              % dx/dt (sigma = 10, as in the lorenz function above)
                           28 * x(1) - x(2) - x(1) * x(3);  % dy/dt
                           x(1) * x(2) - (8/3) * x(3)];     % dz/dt, matching -beta*z + x*y from before
[t, x] = ode45(lorenz_animated, tspan, x0);
figure('Color', 'k')                                        % dark background for the animation
hold on
plot(x(:,1), x(:,3), 'Color', [0.3 0.3 0.3])                % faint trace of the full attractor
xlabel('x', 'Color', 'w')
ylabel('z', 'Color', 'w')
title('Lorenz Strange Attractor Animation', 'Color', 'w')
set(gca, 'Color', 'k', 'XColor', 'w', 'YColor', 'w')
h = plot(NaN, NaN, 'yo', 'MarkerFaceColor', 'y', 'MarkerSize', 6);
for k = 1:length(t)                                         % move the point along the trajectory
    set(h, 'XData', x(k,1), 'YData', x(k,3));
    drawnow
end
Boundary Value Problems (BVPs) are not included in too many introductory courses on ODEs. BVPs are differential equations that are subject to constraints at their extremes. That is, they have conditions at both the beginning and end of their intervals. This is in contrast with IVPs, which only have conditions at the beginning of their intervals. In MATLAB, we use the function bvp4c, which is a fourth-order method, or bvp5c, which is a fifth-order method.
Let's test this out in MATLAB. First, here is our system:
We have boundary conditions
and
. The exact analytical solution to this system is
Let's define the system, its boundary conditions, and the analytical Jacobian as functions:
function dydx = bvpfcn(x,y) % System: y(1)' = y(2), y(2)' = -2*y(2)/x - y(1)/x^4
dydx = [y(2); -2*y(2)/x - y(1)/x^4];
function res = bcfcn(ya,yb) % Boundary conditions: y = 0 at x = 1/(3*pi) and y = sin(1) at x = 1
res = [ya(1); yb(1) - sin(1)];
function dfdy = jac(x,y) % Analytical Jacobian for f
dfdy = [0, 1; -1/x^4, -2/x];
Next, let's supply the analytical Jacobian and set options with crude error tolerances and solver statistics turned on:
opts = bvpset('FJacobian',@jac,'RelTol',0.1,'AbsTol',0.1,'Stats','on');
Let's now create an initial guess. We will specify a constant function with an initial mesh of 10 points in the interval from 1/(3*pi) to 1:
xmesh = linspace(1/(3*pi), 1, 10);
solinit = bvpinit(xmesh, [1; 1]);
Let's now solve the equation using both functions:
sol4c = bvp4c(@bvpfcn, @bcfcn, solinit, opts);
The solution was obtained on a mesh of 9 points.
The maximum residual is 9.794e-02.
There were 157 calls to the ODE function.
There were 28 calls to the BC function.
sol5c = bvp5c(@bvpfcn, @bcfcn, solinit, opts);
The solution was obtained on a mesh of 11 points.
The maximum error is 6.742e-02.
There were 244 calls to the ODE function.
There were 29 calls to the BC function.
Let's plot:
xplot = linspace(1/(3*pi),1,200);
yplot = [sin(1./xplot); -cos(1./xplot)./xplot.^2];
plot(xplot,yplot(1,:),'k',sol4c.x,sol4c.y(1,:),'r*',sol5c.x,sol5c.y(1,:),'bo')
title('Comparison of BVP Solvers with Crude Error Tolerance')
legend('True','BVP4C','BVP5C')
As is often the trend with higher-order methods, bvp5c is more accurate but requires more computational effort. bvp5c controls the true error in the calculation directly, whereas bvp4c controls it only indirectly.
A strange attractor is the shape that a chaotic system tends to trace out over time, though it does so in an extremely sensitive and unpredictable manner. Strange attractors are characterized by their fractal structure and showcase the beauty of nonlinear dynamical systems. The butterfly effect is really interesting because it shows how tiny differences can lead to massive events. It belongs to a larger area of study known as chaos theory, which is all about finding hidden structure in what may seem like randomness. This might seem really out there, and in some ways it is, but remember that a majority of the real world is governed by nonlinear systems.
Indeed, the real world rarely hands us systems with easy, closed-form solutions. To understand it, we have to linearize, approximate, and iterate step by step. And that is why numerical methods are the heart of science and engineering, and thus of nearly everything around you. Your phone, the internet, medicine, the food you buy at the grocery store, bridges, planes, and the very ground you walk on would all be infeasible without years and years of refining algorithms. We do all this to solve the unsolvable, build the unbuildable, and make our chaotic world just a little more understandable.
Bibliography
MATLAB Applications Part 1 uses and was inspired by the following sources. Sources are listed in the order in which they first appear. Some may reappear later.
Chapter 1:
Chapter 2: