We have that $A = LU$, where $U$ is the upper triangular matrix that we've been computing all this time and $L$ is a lower triangular matrix that is the inverse of the product of the elimination matrices that we applied to $A$ to obtain $U$. How does this help us solve the equation $A \mathbf x = \mathbf b$?
Notice that if $A = LU$, then we have the equation \[A \mathbf x = LU \mathbf x = \mathbf b.\] One way to read this product is $L (U \mathbf x) = \mathbf b$. Since $U \mathbf x$ is a matrix-vector product, the outcome is some vector, say $U \mathbf x = \mathbf c$. This means that we can first solve the equation \[L \mathbf c = \mathbf b.\] This is easy to do: since $L$ is lower triangular, we can solve it by forward substitution, working from the top row down. Then once we have $\mathbf c$, we can solve $U \mathbf x = \mathbf c$ to obtain $\mathbf x$ by back substitution, working from the bottom row up.
Recall our original goal of solving $A \mathbf x = \mathbf b$. Let $A = \begin{bmatrix} 2 & -1 & 5 \\ 4 & 1 & 19 \\ 2 & 8 & 34\end{bmatrix}$. The LU factorization of $A$ is $U = \begin{bmatrix} 2 & -1 & 5 \\ 0 & 3 & 9 \\ 0 & 0 & 2 \end{bmatrix}$ and $L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 1 & 3 & 1 \end{bmatrix}$. We can use this to solve $A \mathbf x = \mathbf b = \begin{bmatrix}5 \\ 31 \\ 72 \end{bmatrix}$.
Verifying that this works:
>>> import numpy as np
>>> A = np.array([[2, -1, 5], [4, 1, 19], [2, 8, 34]])
>>> U = np.array([[2, -1, 5], [0, 3, 9], [0, 0, 2]])
>>> L = np.array([[1, 0, 0], [2, 1, 0], [1, 3, 1]])
>>> b = np.array([5, 31, 72])
>>> c = np.linalg.solve(L, b)
>>> c
array([ 5., 21., 4.])
>>> np.linalg.solve(U, c)
array([-2., 1., 2.])
>>> np.linalg.solve(A, b)
array([-2., 1., 2.])
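The two triangular solves can also be written out by hand to see exactly what the substitution steps are doing. A minimal sketch, using the same $L$, $U$, and $\mathbf b$ as above:

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L c = b for lower triangular L, working from the top row down."""
    n = len(b)
    c = np.zeros(n)
    for i in range(n):
        # Subtract the contributions of the entries of c we already know.
        c[i] = (b[i] - L[i, :i] @ c[:i]) / L[i, i]
    return c

def back_substitution(U, c):
    """Solve U x = c for upper triangular U, working from the bottom row up."""
    n = len(c)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

L = np.array([[1., 0, 0], [2, 1, 0], [1, 3, 1]])
U = np.array([[2., -1, 5], [0, 3, 9], [0, 0, 2]])
b = np.array([5., 31, 72])

c = forward_substitution(L, b)   # [5, 21, 4], as in the REPL session
x = back_substitution(U, c)      # [-2, 1, 2]
```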
Now, let's try putting everything together. The big result we want to use is that every invertible matrix $A$ can be decomposed into $LU$. This is not automatically the case: we saw that there are situations where we need to change the order of the rows. How do we deal with this? We make a slightly weaker claim.
Let $A$ be an invertible matrix. Then there exists a permutation matrix $P$, lower triangular matrix $L$, and upper triangular matrix $U$ such that $PA = LU$.
In other words, if $A$ is invertible, there's a way to permute the rows and that permuted matrix can be decomposed into $LU$.
So suppose we have a decomposition $PA = LU$. Applying $P$ to our linear equation gives $PA\mathbf x = P \mathbf b$, which leads us to the equation $LU \mathbf x = P \mathbf b$. As above, we first solve the equation $L\mathbf c = P \mathbf b$ for $\mathbf c$ and then solve the equation $U \mathbf x = \mathbf c$ to obtain $\mathbf x$.
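Here is a tiny numerical sketch of this procedure. The matrix and right-hand side are chosen for illustration: the first pivot of $A$ is zero, so elimination cannot start without a row swap, and after swapping, the permuted matrix happens to be already upper triangular, so $L = I$.

```python
import numpy as np

# A matrix whose first pivot is zero, so plain elimination fails immediately.
A = np.array([[0., 1.], [1., 1.]])
b = np.array([3., 5.])

# Swapping the two rows (the permutation P) fixes this; here PA is already
# upper triangular, so the factorization PA = LU has L = I and U = PA.
P = np.array([[0., 1.], [1., 0.]])
L = np.eye(2)
U = P @ A                      # [[1, 1], [0, 1]]

c = np.linalg.solve(L, P @ b)  # first solve L c = P b
x = np.linalg.solve(U, c)      # then solve U x = c
```

The answer agrees with solving $A \mathbf x = \mathbf b$ directly, since permuting the rows of $A$ and $\mathbf b$ together just reorders the equations.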
There is an intriguing question of why we get such a nice result when we consider the inverse of $E$. Recall that elimination proceeds from the top row and moves progressively down. So which rows are we taking linear combinations of when we perform elimination on, say, the third row?
The answer is that we are taking combinations of rows of $E_{21}A$ (not $A$!). But these rows are actually rows of $U$—they don't get changed after we move on to the third row (assuming we are working systematically). We have \[\text{Row 3 of $U$} = \text{Row 3 of $A$} - \ell_{31} \times \text{Row 1 of $U$} - \ell_{32} \times \text{Row 2 of $U$}.\] Notice that we can take this and describe the corresponding row of $A$ by rearranging into \[\text{Row 3 of $A$} = \ell_{31} \times \text{Row 1 of $U$} + \ell_{32} \times \text{Row 2 of $U$} + \text{Row 3 of $U$}.\] This is what the third row of $L$ is doing in the product $A = LU$: it is describing how to combine rows of $U$ to create rows of $A$. This same argument applies for larger matrices: when we're performing elimination on row $k$, we are taking linear combinations of rows 1 through $k$ of $U$.
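We can check this row-combination identity numerically with the $3 \times 3$ example from earlier in the section, where row 3 of $L$ is $(1, 3, 1)$:

```python
import numpy as np

# L and U from the worked example above.
L = np.array([[1., 0, 0], [2, 1, 0], [1, 3, 1]])
U = np.array([[2., -1, 5], [0, 3, 9], [0, 0, 2]])
A = L @ U

# Row 3 of A as a combination of rows of U, weighted by row 3 of L:
row3 = 1 * U[0] + 3 * U[1] + 1 * U[2]   # here l31 = 1 and l32 = 3
```

The result matches row 3 of $A$, which is $(2, 8, 34)$.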
But there's more—recall that we can view matrix multiplication as a sum of rank-one matrix products. That is, for $A = LU$, we take the $i$th column of $L$ and multiply it by the $i$th row of $U$. What does this give us?
If we consider the first row of $U$, it is exactly the first row of $A$, since elimination never changes that row. But what is the first column of $L$? It describes how we eliminated the entries underneath the first pivot. So multiplying the first column of $L$ by the first row of $U$ reproduces the first row and first column of $A$ exactly, with the multipliers filling in the remaining entries: \[\begin{bmatrix} 1 \\ \ell_{21} \\ \ell_{31} \\ \ell_{41} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & \ell_{21} a_{12} & \ell_{21} a_{13} & \ell_{21} a_{14} \\ a_{31} & \ell_{31} a_{12} & \ell_{31} a_{13} & \ell_{31} a_{14} \\ a_{41} & \ell_{41} a_{12} & \ell_{41} a_{13} & \ell_{41} a_{14} \end{bmatrix}. \]
However, notice what we do when we're doing elimination. We're actually removing this matrix from $A$. While we've framed elimination as "zeroing" out particular entries, this gives us another view, of removing a rank 1 matrix from $A$. Let's call this matrix $A_1$. Then we have \[A - A_1 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & * & * & * \\0 & * & * & * \\0 & * & * & * \end{bmatrix}.\]
Then we repeat this process for the remaining rows of $U$ and columns of $L$. For instance, for $i = 2$, we have \[\begin{bmatrix} 0 \\ 1 \\ \ell_{32} \\ \ell_{42} \end{bmatrix} \begin{bmatrix} 0 & u_{22} & u_{23} & u_{24} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & u_{22} & u_{23} & u_{24} \\ 0 & \ell_{32} u_{22} & \ell_{32} u_{23} & \ell_{32} u_{24} \\ 0 & \ell_{42} u_{22} & \ell_{42} u_{23} & \ell_{42} u_{24} \end{bmatrix}. \] If we call this $A_2$, then removing this matrix from $A - A_1$ results in a matrix \[A - A_1 - A_2 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\0 & 0 & * & * \\0 & 0 & * & * \end{bmatrix}.\]
Finally, we observe that the LU factorization describes exactly how to combine these pieces together to get $A$: \[A = LU = \begin{bmatrix} \vert & & \vert \\ \boldsymbol{\ell}_1 & \cdots & \boldsymbol{\ell}_n \\ \vert & & \vert \end{bmatrix} \begin{bmatrix} \text{---} & \mathbf u_1 & \text{---} \\ & \vdots & \\ \text{---} & \mathbf u_n & \text{---} \end{bmatrix} = \boldsymbol{\ell}_1 \mathbf u_1 + \cdots + \boldsymbol{\ell}_n \mathbf u_n.\]
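This sum of rank-one pieces can be verified numerically with the $3 \times 3$ example from earlier in the section:

```python
import numpy as np

L = np.array([[1., 0, 0], [2, 1, 0], [1, 3, 1]])
U = np.array([[2., -1, 5], [0, 3, 9], [0, 0, 2]])

# Build A as a sum of rank-one pieces: column i of L times row i of U.
pieces = [np.outer(L[:, i], U[i, :]) for i in range(3)]
A = sum(pieces)
```

Each `np.outer` term is a rank-one matrix, and the three of them sum to exactly $LU$.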
Classical treatments of linear algebra make a big deal of computing the inverse of a matrix to solve $A \mathbf x = \mathbf b$. Practically speaking, this is not often done just to solve a linear system, for two main reasons: computing $A^{-1}$ takes more work than computing an LU factorization, and it is less numerically stable.
Let's discuss the second reason briefly first. Recall that computers cannot represent real numbers completely accurately. The best they can do is approximate them; this is what floating point numbers are. This means that every computation carries with it a small amount of error. Certain operations amplify this error more than others. This is the notion of numerical stability and is the subject of numerical methods courses.
Now, let's discuss how much work, computationally speaking, it is to compute an inverse.
Recall that we can stick matrices together and treat them as one unit. For example, when computing LU to solve $A \mathbf x = \mathbf b$, we need to remember that we want to perform elimination on each equation and therefore, we need to perform the same row operations on $\mathbf b$. An easy way to do this is to make one big matrix that contains $\mathbf b$ as an extra column. So if we have $A = \begin{bmatrix} 2 & 1 & 5 \\ -1 & 3 & 2 \\ 4 & 1 & 6 \end{bmatrix}$ and $\mathbf b = \begin{bmatrix} 6 \\ 4 \\ 7 \end{bmatrix}$, we would perform elimination on a matrix $\begin{bmatrix} 2 & 1 & 5 & 6 \\ -1 & 3 & 2 & 4 \\ 4 & 1 & 6 & 7 \end{bmatrix}$. We can write this symbolically as $\begin{bmatrix} A & \mathbf b \end{bmatrix}$.
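A quick sketch of this idea, using the $A$ and $\mathbf b$ just above: we form the augmented matrix $[A \mid \mathbf b]$ and run forward elimination on it, so the row operations automatically hit $\mathbf b$ as well.

```python
import numpy as np

A = np.array([[2., 1, 5], [-1, 3, 2], [4, 1, 6]])
b = np.array([6., 4, 7])

# Stack b onto A as an extra column: the augmented matrix [A | b].
M = np.hstack([A, b.reshape(-1, 1)])

# Forward elimination on the augmented matrix; each row operation
# updates the b column for free.
for j in range(2):
    for i in range(j + 1, 3):
        M[i] -= (M[i, j] / M[j, j]) * M[j]
```

After the loop, the first three columns of `M` are upper triangular, and the last column is the correspondingly transformed right-hand side, ready for back substitution.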
Taking this idea, we can ask ourselves what computing the inverse of a matrix $A$ would look like. Recall that by definition, if $A^{-1}$ exists, we have $AA^{-1} = I$. How do we solve such an equation?
Again, it helps to view matrix multiplication as a series of matrix-vector products. Recall that the $i$th column of $I$ is a unit vector with the $i$th component set to 1 and all others 0. What produces this column? It should be $A \mathbf a^{-1}_i$, the product of $A$ with the $i$th column of $A^{-1}$.
So what we're really doing is solving a bunch of linear equations: one for each column vector of $I$. Practically speaking, the LU decomposition gives us a way to do this, so we end up solving $n$ equations. This suggests why computing $A^{-1}$ is costly compared to computing $LU$: we have to compute the factorization anyway, and then solve $n$ systems on top of it instead of just one.
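The column-by-column view translates directly into code. A minimal sketch, reusing the $3 \times 3$ matrix from the REPL example: solve $A \mathbf a^{-1}_i = \mathbf e_i$ for each standard basis vector $\mathbf e_i$ and stack the solutions as columns.

```python
import numpy as np

A = np.array([[2., -1, 5], [4, 1, 19], [2, 8, 34]])
n = A.shape[0]

# Column i of the identity is the standard basis vector e_i; solving
# A x = e_i for each i gives the columns of the inverse.
I = np.eye(n)
Ainv = np.column_stack([np.linalg.solve(A, I[:, i]) for i in range(n)])
```

(In practice `np.linalg.solve` refactors $A$ on every call here; a real implementation would factor once and reuse the factors for all $n$ solves, which is exactly what the LU decomposition buys us.)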
Our discussion of linear systems is helpful for establishing some key ideas: getting used to thinking in terms of linear combinations, thinking of matrix products as operations that combine rows and columns, and so on. However, this assumes that all we want is a single solution to a system of linear equations and that we will always have systems for which we have $n$ equations and $n$ unknowns. In other words, our discussion thus far assumes that we are working with square matrices.
However, we will rarely work with such matrices. Matrices that represent data are often long and skinny: many, many data points and relatively few attributes or properties that are measured. So we need a way to extend the ideas we explored above to matrices of varying dimensions. Furthermore, we need a way to discuss the various sets of vectors we get and their properties.
All of this ties back to one of the initial questions of the course: what is a vector? Answering it is what lets us talk about sets of vectors with special characteristics; we will call such sets vector spaces. What follows is a very formal definition of a vector space.
A vector space $V$ is a set of elements, called vectors, equipped with two operations, vector addition and scalar multiplication, which satisfy the following properties:
That is all just a very formal way to say that vector spaces are sets of things that behave like vectors. Specifically, vector addition and scalar multiplication on these things work how you'd expect vector addition and scalar multiplication to work.
| Formal definition | Name | Practical consequence |
|---|---|---|
| $\mathbf u + \mathbf v = \mathbf v + \mathbf u$ for all vectors $\mathbf u$ and $\mathbf v$ in $V$ | commutativity of vector addition | It doesn't matter which order you add two vectors together |
| $\mathbf u + (\mathbf v + \mathbf w) = (\mathbf u + \mathbf v) + \mathbf w$ for all vectors $\mathbf u, \mathbf v, \mathbf w$ in $V$ | associativity of vector addition | If you're adding three vectors, it doesn't matter which of the two you add together first |
| there is a unique zero vector $\mathbf 0$ such that $\mathbf v + \mathbf 0 = \mathbf v$ for all vectors $\mathbf v$ in $V$ | identity for vector addition | A zero vector exists and adding it to a vector gives you the same vector |
| for each vector $\mathbf v$ in $V$, there is a unique vector $-\mathbf v$ such that $\mathbf v + (- \mathbf v) = \mathbf 0$ | inverse for vector addition | Every vector has an inverse vector that you can add to it to get the zero vector |
| $1 \mathbf v = \mathbf v$ for all vectors $\mathbf v$ in $V$ | identity for scalar multiplication | 1 exists as a scalar and scalar multiplication by 1 should give you the same vector |
| $(c_1 c_2) \mathbf v = c_1 (c_2 \mathbf v)$ for all vectors $\mathbf v$ in $V$ and all scalars $c_1$ and $c_2$ | associativity of scalar multiplication | It shouldn't matter whether you combine two scalars together first before scaling a vector with it or if you do them both separately |
| $c (\mathbf u + \mathbf v) = c\mathbf u + c\mathbf v$ for all vectors $\mathbf u, \mathbf v$ in $V$ and all scalars $c$ | distributivity over vector addition | It doesn't matter whether you add two vectors and then scale the result, or scale each vector and then add |
| $(c_1+c_2) \mathbf v = c_1\mathbf v + c_2 \mathbf v$ for all vectors $\mathbf v$ in $V$ and all scalars $c_1$ and $c_2$ | distributivity over scalar addition | It doesn't matter whether you add two scalars and then scale the vector, or scale the vector by each scalar and add the results |
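For vectors in $\mathbb R^n$ represented as NumPy arrays, these axioms are just facts about componentwise arithmetic. A quick spot-check of a few of them (the specific vectors and scalars are arbitrary choices for illustration):

```python
import numpy as np

u, v = np.array([1., 2, 3]), np.array([4., 5, 6])
c1, c2 = 2.0, -3.0

# Spot-check a few of the axioms for vectors in R^3:
assert np.allclose(u + v, v + u)                    # commutativity of addition
assert np.allclose((c1 * c2) * v, c1 * (c2 * v))    # assoc. of scalar mult.
assert np.allclose(c1 * (u + v), c1 * u + c1 * v)   # distributivity over vectors
assert np.allclose((c1 + c2) * v, c1 * v + c2 * v)  # distributivity over scalars
```

Of course, checking specific vectors is not a proof; the point of the formal definition is that these identities hold for all vectors and scalars.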
As an aside, there is an important condition left out of this definition: the scalars have to come from a structure called a field. A field is a structure where addition and multiplication work like we expect and every nonzero element has a multiplicative inverse. In our setting, vectors in $\mathbb R^n$ with scalars from $\mathbb R$, this always works because the real numbers form a field, so the condition is rarely brought up in introductory texts. Where it does matter is in more abstract linear algebra with stranger structures. It is worth noting because taking, say, the integers as your scalars will actually satisfy all eight conditions above, yet the integers cannot serve as scalars because they are not a field: most nonzero integers do not have a multiplicative inverse.
One consequence of these rules is that linear combinations can never produce something that isn't a vector: any linear combination of vectors in $V$ is itself a vector in $V$.