There is one more piece of the puzzle that we alluded to earlier: our rows might not be in the right order! We addressed this briefly: simply swap the rows by multiplying by a permutation matrix. We saw already that each row of the result is a (very simple) linear combination of the original rows. Here, we'll briefly mention a few more properties of permutations and permutation matrices that are nice to know.
First, we can give a definition.
A permutation matrix is a square matrix that has exactly one 1 in each row and each column, with all other entries 0.
Recall that a permutation matrix $P$ can describe how to rearrange either the rows or the columns of a matrix $A$; which one depends on which side of $A$ the matrix $P$ sits. The matrix $PA$ should be thought of as $A$ with its rows rearranged according to $P$, while the matrix $AP$ should be thought of as $A$ with its columns rearranged according to $P$.
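We can see both effects at once with a small numerical sketch. The $3 \times 3$ matrices below are examples of our own choosing, written in NumPy:

```python
import numpy as np

# A permutation that sends row 3 to position 1, row 1 to position 2,
# and row 2 to position 3 (each row of P is a unit vector).
P = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(P @ A)   # rows of A rearranged: row 3, then row 1, then row 2
print(A @ P)   # columns of A rearranged instead
```

The same $P$ shuffles rows when it multiplies from the left and columns when it multiplies from the right.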
This leads to a few interesting properties. Let $P$ be a permutation matrix.
First, let's be more explicit about what the transpose is, since we'll be using it a lot more.
The transpose of a matrix $A$, denoted $A^T$, is the matrix with entries $a^{T}_{ij} = a_{ji}$.
Let $A = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \end{bmatrix}$. Then $A^T = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix}$.
We've often wanted to write column vectors in running text, where tall vertical arrays are awkward. The transpose gives us an easier way to write them: \[\begin{bmatrix} 3 & 1 & 5 \end{bmatrix}^T = \begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix}.\]
One way to think about the transpose is to notice that it's "flipped" along the diagonal of the matrix. This gives us another kind of matrix.
A matrix $A$ is symmetric if $A = A^T$.
The explicit definition of the transpose tells us why this makes sense: we must have $a_{ij} = a_{ji}$ for every entry.
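Both ideas are easy to check numerically. The sketch below reuses the matrix from the example above, and builds a symmetric matrix in a standard way (as $A + A^T$, which is symmetric because its $(i,j)$ entry $a_{ij} + a_{ji}$ equals its $(j,i)$ entry):

```python
import numpy as np

A = np.array([[1, 1, 1],
              [2, 2, 2],
              [3, 3, 3]])

print(A.T)                      # the transpose: rows and columns swapped

S = A + A.T                     # s_ij = a_ij + a_ji = s_ji, so S is symmetric
print(np.array_equal(S, S.T))   # True
```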
The entrywise definition of the transpose also makes it a bit clearer why the transpose of a permutation matrix is its inverse: if $P$ moves row $j$ into position $i$, then $P^T$ moves row $i$ back into position $j$.
Let $P = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}$. From the row perspective, this gives us a matrix $\begin{bmatrix} \text{Row 3} \\ \text{Row 1} \\ \text{Row 4} \\ \text{Row 2} \end{bmatrix}$. Then the inverse of $P$ is $P^{-1} = P^T = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$. This permutation matrix produces a matrix $\begin{bmatrix} \text{Row 2} \\ \text{Row 4} \\ \text{Row 1} \\ \text{Row 3} \end{bmatrix}$. But if we examine the combined action of applying both permutations in turn, we get the following: \[\begin{bmatrix} \text{Row 1} \\ \text{Row 2} \\ \text{Row 3} \\ \text{Row 4} \end{bmatrix} \xrightarrow{P} \begin{bmatrix} \text{Row 3} \\ \text{Row 1} \\ \text{Row 4} \\ \text{Row 2} \end{bmatrix} \xrightarrow{P^{-1}} \begin{bmatrix} \text{Row 1} \\ \text{Row 2} \\ \text{Row 3} \\ \text{Row 4} \end{bmatrix}\]
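We can also verify numerically that this particular $P$ satisfies $P^T P = P P^T = I$; here is a quick check in NumPy:

```python
import numpy as np

# The 4x4 permutation matrix from the example above
P = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])

# Applying P and then its transpose (in either order) is the identity
print(np.array_equal(P.T @ P, np.eye(4, dtype=int)))  # True
print(np.array_equal(P @ P.T, np.eye(4, dtype=int)))  # True
```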
The big result we would like is that every invertible matrix $A$ can be factored as $LU$. Unfortunately, this is not automatically true: we saw that there are situations where we first need to change the order of the rows. How do we deal with this?
Instead, we make a slightly weaker claim: if $A$ is invertible, there exists a permutation matrix $P$ such that $PA = LU$. In other words, if $A$ is invertible, there is a way to permute its rows so that the permuted matrix can be decomposed into $LU$.
So suppose we have a decomposition $PA = LU$. Applying $P$ to both sides of our linear equation gives $PA\mathbf x = P \mathbf b$, which leads us to the equation $LU \mathbf x = P \mathbf b$. Applying $E = L^{-1}$ brings us to the equation $U \mathbf x = \mathbf c$, where $\mathbf c = EP\mathbf b$. Back substitution then gives $\mathbf x$ directly; note that the permutation was applied to $\mathbf b$, not to $\mathbf x$, so the solution we obtain is already the answer to the original system.
Classical treatments of linear algebra make a big deal of computing the inverse of a matrix to solve $A \mathbf x = \mathbf b$. Practically speaking, this is not often done just to solve a linear system, since this is computationally expensive compared to computing the LU factorization. Strang discusses this at the beginning of (2.3).
Recall that we can stick matrices together and treat them as one unit. For example, when computing LU to solve $A \mathbf x = \mathbf b$, we need to remember that we want to perform elimination on each equation and therefore, we need to perform the same row operations on $\mathbf b$. An easy way to do this is to make one big matrix that contains $\mathbf b$ as an extra column. So if we have $A = \begin{bmatrix} 2 & 1 & 5 \\ -1 & 3 & 2 \\ 4 & 1 & 6 \end{bmatrix}$ and $\mathbf b = \begin{bmatrix} 6 \\ 4 \\ 7 \end{bmatrix}$, we would perform elimination on a matrix $\begin{bmatrix} 2 & 1 & 5 & 6 \\ -1 & 3 & 2 & 4 \\ 4 & 1 & 6 & 7 \end{bmatrix}$. We can write this symbolically as $\begin{bmatrix} A & \mathbf b \end{bmatrix}$.
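The augmented-matrix bookkeeping above translates almost line for line into code. The sketch below runs elimination on $\begin{bmatrix} A & \mathbf b \end{bmatrix}$ using the example matrix from this paragraph; it assumes no pivot happens to be zero (true here), so no row swaps are needed:

```python
import numpy as np

A = np.array([[ 2., 1., 5.],
              [-1., 3., 2.],
              [ 4., 1., 6.]])
b = np.array([6., 4., 7.])

M = np.hstack([A, b.reshape(-1, 1)])   # the augmented matrix [A | b]
n = len(b)

for k in range(n - 1):                 # forward elimination acts on every
    for i in range(k + 1, n):          # column, including the appended b
        M[i] -= (M[i, k] / M[k, k]) * M[k]

x = np.zeros(n)                        # back substitution on [U | c]
for i in range(n - 1, -1, -1):
    x[i] = (M[i, n] - M[i, i + 1:n] @ x[i + 1:]) / M[i, i]

print(np.allclose(A @ x, b))           # True
```

Because $\mathbf b$ rides along as an extra column, every row operation applied to $A$ is automatically applied to $\mathbf b$ as well.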
Taking this idea, we can ask ourselves what computing the inverse of a matrix $A$ would look like. Recall that by definition, if $A^{-1}$ exists, we have $AA^{-1} = I$. How do we solve such an equation?
Again, it helps to view matrix multiplication as a series of matrix-vector products. Recall that the $i$th column of $I$ is a unit vector with the $i$th component set to 1 and all others 0. What produces this column? It should be $A \mathbf a^{-1}_i$, the product of $A$ with the $i$th column of $A^{-1}$.
So what we're really doing is solving a bunch of linear equations simultaneously: one for each column vector of $I$. The strategy then is to perform elimination on all of these equations at once. We do this by sticking $A$ and $I$ together and performing elimination, going a bit further than before: we also eliminate the entries above each pivot and scale every pivot to 1 (this is called Gauss–Jordan elimination). The effect of this is that we transform $\begin{bmatrix} A & I \end{bmatrix}$ into the matrix $\begin{bmatrix}I & A^{-1} \end{bmatrix}$.
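Here is a minimal sketch of Gauss–Jordan elimination on $\begin{bmatrix} A & I \end{bmatrix}$, reusing the earlier example matrix and again assuming no pivot happens to be zero (otherwise a row swap, i.e. a permutation, would be needed):

```python
import numpy as np

A = np.array([[ 2., 1., 5.],
              [-1., 3., 2.],
              [ 4., 1., 6.]])
n = A.shape[0]

M = np.hstack([A, np.eye(n)])       # the block matrix [A | I]

for k in range(n):
    M[k] /= M[k, k]                 # scale the pivot row so the pivot is 1
    for i in range(n):
        if i != k:
            M[i] -= M[i, k] * M[k]  # eliminate above AND below the pivot

A_inv = M[:, n:]                    # M is now [I | A^{-1}]
print(np.allclose(A @ A_inv, np.eye(n)))  # True
```

Each column of the right-hand block is exactly the solution of $A \mathbf a^{-1}_i = \mathbf e_i$, so the $n$ systems really are solved simultaneously.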
Notice that we were able to solve $A \mathbf x = \mathbf b$ without computing the inverse. Both solving for $\mathbf x$ and computing the inverse require carrying out elimination. The difference is that using elimination to compute $\mathbf x$ or to factor into LU requires less work than computing $A^{-1}$, which really involves solving for $n$ vectors.
We will soon discuss how to extend elimination to do this, but for a related but slightly more interesting purpose.