There is an intriguing question of why we get such a nice result when we consider the inverse of $E$. Recall that elimination proceeds from the top row and moves progressively down. So which rows are we taking linear combinations of when we perform elimination on, say, the third row?
The answer is that we are taking combinations of rows of $E_{21}A$ (not $A$!). But these rows are actually rows of $U$—they don't get changed after we move on to the third row (assuming we are working systematically). We have \[\text{Row 3 of $U$} = \text{Row 3 of $A$} - \ell_{31} \times \text{Row 1 of $U$} - \ell_{32} \times \text{Row 2 of $U$}.\] Notice that we can rearrange this to describe the corresponding row of $A$: \[\text{Row 3 of $A$} = \ell_{31} \times \text{Row 1 of $U$} + \ell_{32} \times \text{Row 2 of $U$} + \text{Row 3 of $U$}.\] This is what the third row of $L$ is doing in the product $A = LU$: it describes how to combine rows of $U$ to create rows of $A$. The same argument applies for larger matrices: when we perform elimination on row $k$, we are taking linear combinations of rows 1 through $k$ of $U$.
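We can check this numerically. Here is a quick NumPy sketch using a small illustrative $3 \times 3$ factorization (the $U$ and the multipliers $\ell_{31}, \ell_{32}$ are just example values):

```python
import numpy as np

# An illustrative upper triangular U and the multipliers used when
# eliminating row 3; row 3 of A is recovered as a combination of rows of U.
U = np.array([[2., -1., 5.],
              [0.,  3., 9.],
              [0.,  0., 2.]])
l31, l32 = 1.0, 3.0

row3_of_A = l31 * U[0] + l32 * U[1] + 1.0 * U[2]
print(row3_of_A)  # [ 2.  8. 34.]
```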
But there's more—recall that we can view matrix multiplication as a sum of rank-one matrix products. That is, for $A = LU$, we take the $i$th column of $L$ and multiply it by the $i$th row of $U$. What does this give us?
If we consider the first row of $U$, this is exactly the first row of $A$—we never do anything to this row. And what is the first column of $L$? It describes how we eliminated the entries underneath. So multiplying the first column of $L$ by the first row of $U$ tells us how to reconstruct the first row and first column of $A$! \[\begin{bmatrix} 1 \\ \ell_{21} \\ \ell_{31} \\ \ell_{41} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & \ell_{21} a_{12} & \ell_{21} a_{13} & \ell_{21} a_{14} \\ a_{31} & \ell_{31} a_{12} & \ell_{31} a_{13} & \ell_{31} a_{14} \\ a_{41} & \ell_{41} a_{12} & \ell_{41} a_{13} & \ell_{41} a_{14} \end{bmatrix}. \] (The first-column entries come out exactly right because $\ell_{i1} = a_{i1}/a_{11}$, so $\ell_{i1} a_{11} = a_{i1}$.)
However, notice what we do when we're doing elimination: we're actually removing this matrix from $A$. While we've framed elimination as "zeroing out" particular entries, this gives us another view: removing a rank-one matrix from $A$. Let's call this matrix $A_1$. Then we have \[A - A_1 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & * & * & * \\0 & * & * & * \\0 & * & * & * \end{bmatrix}.\]
Then we repeat this process for the remaining rows of $U$ and columns of $L$. For instance, for $i = 2$, we have \[\begin{bmatrix} 0 \\ 1 \\ \ell_{32} \\ \ell_{42} \end{bmatrix} \begin{bmatrix} 0 & u_{22} & u_{23} & u_{24} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & u_{22} & u_{23} & u_{24} \\ 0 & \ell_{32} u_{22} & \ell_{32} u_{23} & \ell_{32} u_{24} \\ 0 & \ell_{42} u_{22} & \ell_{42} u_{23} & \ell_{42} u_{24} \end{bmatrix}. \] If we call this $A_2$, then subtracting it from $A - A_1$ results in a matrix \[A - A_1 - A_2 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\0 & 0 & * & * \\0 & 0 & * & * \end{bmatrix}.\]
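We can watch this peeling happen numerically. Here is a NumPy sketch with an illustrative $3 \times 3$ factorization: after removing $A_1$ the first row and column vanish, and after removing $A_2$ the second row and column vanish too.

```python
import numpy as np

# Illustrative example matrices with A = L @ U.
A = np.array([[2., -1., 5.],
              [4.,  1., 19.],
              [2.,  8., 34.]])
L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [1., 3., 1.]])
U = np.array([[2., -1., 5.],
              [0.,  3., 9.],
              [0.,  0., 2.]])

# Rank-one pieces: column i of L times row i of U.
A1 = np.outer(L[:, 0], U[0, :])
A2 = np.outer(L[:, 1], U[1, :])
print(A - A1)        # first row and column are zeroed
print(A - A1 - A2)   # first two rows and columns are zeroed
```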
Finally, we observe that the LU factorization describes exactly how to combine these pieces together to get $A$: writing $\mathbf \ell_i$ for the $i$th column of $L$ and $\mathbf u_i$ for the $i$th row of $U$, \[A = LU = \begin{bmatrix} \mathbf \ell_1 & \cdots & \mathbf \ell_n \end{bmatrix} \begin{bmatrix} \mathbf u_1 \\ \vdots \\ \mathbf u_n \end{bmatrix} = \mathbf \ell_1 \mathbf u_1 + \cdots + \mathbf \ell_n \mathbf u_n.\]
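The whole reconstruction can be checked in a few lines of NumPy (the $L$ and $U$ here are illustrative $3 \times 3$ matrices):

```python
import numpy as np

L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [1., 3., 1.]])
U = np.array([[2., -1., 5.],
              [0.,  3., 9.],
              [0.,  0., 2.]])

# Rebuild A as a sum of rank-one pieces: column i of L times row i of U.
A = sum(np.outer(L[:, i], U[i, :]) for i in range(3))
print(np.allclose(A, L @ U))  # True
```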
Recall our original goal of solving $A \mathbf x = \mathbf b$. Let $A = \begin{bmatrix} 2 & -1 & 5 \\ 4 & 1 & 19 \\ 2 & 8 & 34\end{bmatrix}$. The LU factorization of $A$ is $U = \begin{bmatrix} 2 & -1 & 5 \\ 0 & 3 & 9 \\ 0 & 0 & 2 \end{bmatrix}$ and $L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 1 & 3 & 1 \end{bmatrix}$. We can use this to solve $A \mathbf x = \mathbf b = \begin{bmatrix}5 \\ 31 \\ 72 \end{bmatrix}$. To do so, we need to transform $A \mathbf x = \mathbf b$ into a system $U \mathbf x = \mathbf c$. But we know that $\mathbf c = E \mathbf b = L^{-1} \mathbf b = \begin{bmatrix}5 \\ 21 \\ 4 \end{bmatrix}$. In other words, we first solve $L\mathbf c = \mathbf b$ for $\mathbf c$ by forward substitution, then solve $U \mathbf x = \mathbf c$ by back substitution to get the solution $\mathbf x$ of $A \mathbf x = \mathbf b$.
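Using the $L$, $U$, and $\mathbf b$ above, the two substitution passes can be sketched in NumPy:

```python
import numpy as np

L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [1., 3., 1.]])
U = np.array([[2., -1., 5.],
              [0.,  3., 9.],
              [0.,  0., 2.]])
b = np.array([5., 31., 72.])

# Forward substitution: solve L c = b from the top down
# (the diagonal of L is all 1s, so no division is needed).
c = np.zeros(3)
for i in range(3):
    c[i] = b[i] - L[i, :i] @ c[:i]

# Back substitution: solve U x = c from the bottom up.
x = np.zeros(3)
for i in reversed(range(3)):
    x[i] = (c[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]

print(c)  # [ 5. 21.  4.]
print(x)
```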
Now, let's try putting everything together. The big result we would like is that every invertible matrix $A$ has a decomposition into $LU$. This is not automatically true, since we saw that there are situations where we need to change the order of the rows. How do we deal with this? We make a slightly weaker claim.
Let $A$ be an invertible matrix. Then there exists a permutation matrix $P$, lower triangular matrix $L$, and upper triangular matrix $U$ such that $PA = LU$.
In other words, if $A$ is invertible, there's a way to permute the rows and that permuted matrix can be decomposed into $LU$.
So suppose we have a decomposition $PA = LU$. Applying $P$ to our linear equation gives $PA\mathbf x = P \mathbf b$, which leads us to the equation $LU \mathbf x = P \mathbf b$. Applying $E = L^{-1}$ brings us to the equation $U \mathbf x = \mathbf c$ where $\mathbf c = EP\mathbf b$. We can easily solve this for $\mathbf x$ by back substitution, and this $\mathbf x$ is already the answer we want: permuting the rows of a system reorders the equations, not the unknowns, so the solution is unchanged.
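Here is a NumPy sketch of this pipeline, using a small matrix that genuinely needs a row swap ($a_{11} = 0$); the $P$, $L$, and $U$ shown were worked out by hand for this illustrative example:

```python
import numpy as np

# A needs a row swap before elimination can start (a11 = 0),
# so we factor PA = LU instead of A = LU.
A = np.array([[0., 1., 2.],
              [3., 4., 5.],
              [6., 7., 9.]])
P = np.array([[0., 1., 0.],   # swaps rows 1 and 2
              [1., 0., 0.],
              [0., 0., 1.]])
L = np.array([[1.,  0., 0.],
              [0.,  1., 0.],
              [2., -1., 1.]])
U = np.array([[3., 4., 5.],
              [0., 1., 2.],
              [0., 0., 1.]])
assert np.allclose(P @ A, L @ U)

b = np.array([1., 2., 3.])
c = np.linalg.solve(L, P @ b)   # forward-substitution step
x = np.linalg.solve(U, c)       # back-substitution step
print(np.allclose(A @ x, b))    # True -- x solves the original system
```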
Classical treatments of linear algebra make a big deal of computing the inverse of a matrix to solve $A \mathbf x = \mathbf b$. In practice, this is rarely done just to solve a linear system, since it is computationally expensive compared to computing the LU factorization. Strang discusses this at the beginning of (2.3).
Recall that we can stick matrices together and treat them as one unit. For example, when computing LU to solve $A \mathbf x = \mathbf b$, we need to remember that we want to perform elimination on each equation and therefore, we need to perform the same row operations on $\mathbf b$. An easy way to do this is to make one big matrix that contains $\mathbf b$ as an extra column. So if we have $A = \begin{bmatrix} 2 & 1 & 5 \\ -1 & 3 & 2 \\ 4 & 1 & 6 \end{bmatrix}$ and $\mathbf b = \begin{bmatrix} 6 \\ 4 \\ 7 \end{bmatrix}$, we would perform elimination on a matrix $\begin{bmatrix} 2 & 1 & 5 & 6 \\ -1 & 3 & 2 & 4 \\ 4 & 1 & 6 & 7 \end{bmatrix}$. We can write this symbolically as $\begin{bmatrix} A & \mathbf b \end{bmatrix}$.
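A minimal NumPy sketch of eliminating the augmented matrix $\begin{bmatrix} A & \mathbf b \end{bmatrix}$, using the $A$ and $\mathbf b$ above: every row operation automatically hits the $\mathbf b$ column too.

```python
import numpy as np

A = np.array([[ 2., 1., 5.],
              [-1., 3., 2.],
              [ 4., 1., 6.]])
b = np.array([6., 4., 7.])

# Stack b onto A as an extra column, then eliminate the big matrix.
M = np.column_stack([A, b])
for j in range(2):                 # for each pivot column
    for i in range(j + 1, 3):      # eliminate the entries below the pivot
        M[i] -= (M[i, j] / M[j, j]) * M[j]
print(M)                           # A's part is now upper triangular
```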
Taking this idea, we can ask ourselves what computing the inverse of a matrix $A$ would look like. Recall that by definition, if $A^{-1}$ exists, we have $AA^{-1} = I$. How do we solve such an equation?
Again, it helps to view matrix multiplication as a series of matrix-vector products. Recall that the $i$th column of $I$ is a unit vector with the $i$th component set to 1 and all others 0. What produces this column? It should be $A \mathbf a^{-1}_i$, the product of $A$ with the $i$th column of $A^{-1}$.
So what we're really doing is solving a bunch of linear equations: one for each column of $I$. Practically speaking, the LU decomposition gives us a way to do this, so we end up solving $n$ equations. This suggests why computing $A^{-1}$ is costly compared to computing $LU$: we end up computing the factorization anyway, and then still have to solve $n$ systems with it.
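A sketch of computing the inverse column by column; here NumPy's generic solver stands in for the two substitution passes, and the matrix is an illustrative invertible $3 \times 3$ example:

```python
import numpy as np

A = np.array([[2., -1., 5.],
              [4.,  1., 19.],
              [2.,  8., 34.]])

# Solve A x = e_i for each column e_i of the identity; the solutions,
# stacked as columns, form A^{-1}.
n = A.shape[0]
cols = [np.linalg.solve(A, np.eye(n)[:, i]) for i in range(n)]
A_inv = np.column_stack(cols)
print(np.allclose(A @ A_inv, np.eye(n)))  # True
```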
Our discussion of linear systems is helpful for establishing some key ideas, but there's a problem: it relies on the idea that what we want is a single solution to a system of linear equations and that we will always have systems with $n$ equations and $n$ unknowns. In other words, our discussion thus far assumes that we are working with square matrices.
However, we will rarely work with such matrices. Matrices that represent data are often long and skinny: many, many data points and relatively few attributes or properties that are measured. So we need a way to extend the ideas we explored above to matrices of varying dimensions. Furthermore, we need a way to discuss the various sets of vectors we get and their properties.
All of this ties back to one of the initial questions of the course: what is a vector? Answering it is necessary before we can talk about sets of vectors having special characteristics. We'll consider such sets as vector spaces. What follows is a very formal definition of a vector space.
A vector space $V$ is a set of elements, called vectors, equipped with two operations, vector addition and scalar multiplication, which satisfy the following properties:
That is all just a very formal way to say that vector spaces are sets of things that behave like vectors. Specifically, vector addition and scalar multiplication on these things work how you'd expect vector addition and scalar multiplication to work.
| Formal definition | Name | Practical consequence |
|---|---|---|
| $\mathbf u + \mathbf v = \mathbf v + \mathbf u$ for all vectors $\mathbf u$ and $\mathbf v$ in $V$ | commutativity of vector addition | It doesn't matter which order you add two vectors together |
| $\mathbf u + (\mathbf v + \mathbf w) = (\mathbf u + \mathbf v) + \mathbf w$ for all vectors $\mathbf u, \mathbf v, \mathbf w$ in $V$ | associativity of vector addition | If you're adding three vectors, it doesn't matter which of the two you add together first |
| there is a unique zero vector $\mathbf 0$ such that $\mathbf v + \mathbf 0 = \mathbf v$ for all vectors $\mathbf v$ in $V$ | identity for vector addition | A zero vector exists and adding it to a vector gives you the same vector |
| for each vector $\mathbf v$ in $V$, there is a unique vector $-\mathbf v$ such that $\mathbf v + (- \mathbf v) = \mathbf 0$ | inverse for vector addition | Every vector has an inverse vector that you can add to it to get the zero vector |
| $1 \mathbf v = \mathbf v$ for all vectors $\mathbf v$ in $V$ | identity for scalar multiplication | 1 exists as a scalar and scalar multiplication by 1 should give you the same vector |
| $(c_1 c_2) \mathbf v = c_1 (c_2 \mathbf v)$ for all vectors $\mathbf v$ in $V$ and all scalars $c_1$ and $c_2$ | associativity of scalar multiplication | It shouldn't matter whether you combine two scalars together first before scaling a vector with it or if you do them both separately |
| $c (\mathbf u + \mathbf v) = c\mathbf u + c\mathbf v$ for all vectors $\mathbf u, \mathbf v$ in $V$ and all scalars $c$ | distributivity over vector addition | It doesn't matter whether you add two vectors together first before scaling them |
| $(c_1+c_2) \mathbf v = c_1\mathbf v + c_2 \mathbf v$ for all vectors $\mathbf v$ in $V$ and all scalars $c_1$ and $c_2$ | distributivity over scalar addition | It doesn't matter whether you add two scalars together first before scaling a vector |
As an aside, there is an important condition left out of this definition: the scalars have to come from a structure called a field. A field is a structure where addition and multiplication work like we expect and where every nonzero element has a multiplicative inverse. In our setting, vectors in $\mathbb R^n$ with real scalars, this always works because the real numbers $\mathbb R$ are a field, so it is never brought up in any of the texts. Where this does matter is in more abstract linear algebra with weirder structures. It is worth noting because taking, say, the integers as your scalars will actually satisfy all eight conditions above, but the integers can't be scalars because they're not a field: most integers do not have a multiplicative inverse.
One consequence of these rules is that they make it impossible for us to create things that aren't vectors by taking linear combinations. In other words, linear combinations of vectors only ever produce vectors.
As mentioned earlier, we're mostly interested in vectors over $\mathbb R$, so most of these rules are no-brainers. But it helps to see some examples of "weirder" vector spaces to drive home the point that there is a wider world of structures out there and that the rules that we have are not necessarily "obvious".
For any $n$, the set that consists only of the zero vector $\mathbf 0 = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$ is a vector space. The rules on our operations work the same way, and any combination of $\mathbf 0$ produces $\mathbf 0$ itself.
We can view polynomials of degree at most $n$ with real coefficients as a vector space. Think about how polynomial addition works: we gather the terms for each $x^k$ and add the coefficients together. In this way, we can view a polynomial like $-3x^2 + x - 5$ as the vector $\begin{bmatrix} -3 \\ 1 \\ -5 \end{bmatrix}$ in $\mathbb R^3$. And scaling a polynomial works the way you'd expect.
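A small NumPy sketch of treating quadratic polynomials as coefficient vectors (the polynomials here are illustrative examples):

```python
import numpy as np

# Polynomials of degree at most 2, stored as coefficient vectors
# [a, b, c] representing a*x^2 + b*x + c.
p = np.array([-3., 1., -5.])   # -3x^2 + x - 5
q = np.array([ 1., 0.,  2.])   #   x^2     + 2

# Polynomial addition and scaling are just vector addition and scaling.
print(p + q)   # [-2.  1. -3.]   i.e. -2x^2 + x - 3
print(2 * p)   # [-6.  2. -10.]  i.e. -6x^2 + 2x - 10
```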
But what isn't a vector space?
Here's a simple example: suppose we consider only vectors with non-negative real number scalars. This can be a reasonable restriction: for instance, if we know our data has only nonnegative measurements. But this isn't a vector space—if I have a vector $\mathbf v$, there's no corresponding vector $-\mathbf v$ that I can combine with it to get $\mathbf 0$.
Since we're remaining firmly planted in $\mathbb R^n$, one of the things we gain from talking about vector spaces is the fact that vector spaces can contain vector spaces. Such spaces are called subspaces.
A subspace is a vector space contained inside another vector space: for any vectors $\mathbf u$ and $\mathbf v$ in the subspace and any scalar $c$, the vectors $\mathbf u + \mathbf v$ and $c \mathbf u$ must also be in the subspace.
In other words, subspaces are vector spaces inside of another vector space where any linear combination of the vectors stays inside the subspace.
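For a concrete check, here is a NumPy sketch with a simple subspace of $\mathbb R^3$ (the vectors whose third component is zero): any linear combination of such vectors keeps a zero third component, so we stay inside the subspace.

```python
import numpy as np

# Two vectors in the subspace { (x, y, 0) } of R^3.
u = np.array([1., 2., 0.])
v = np.array([-4., 7., 0.])

# A linear combination of them still has third component 0,
# so it remains in the subspace.
w = 3 * u + (-2) * v
print(w)
print(w[2] == 0)  # True
```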
We said earlier that matrices define some sort of structure. It is the subspaces defined by each matrix that are the structures that we're interested in. The most obvious structure is one we've already been working with: the column space of a matrix.