What we've accomplished so far works for projections onto subspaces. That is, if all we want to do is take a subspace whose basis we know, we can transform that basis into an orthogonal basis and then go to town with the result. However, things get trickier when we want to solve or approximate a specific linear equation $A \mathbf x = \mathbf b$—we can't just transform $A$ into $Q$ and call it a day.
We have to remember that, as with any factorization of our matrix $A$, whether it is $A = CR$ or $A = LU$, transforming $A$ into $Q$ means that there will be another matrix involved, carrying the information that we "lose" when going from $A$ to $Q$. This matrix is $R$ (confusingly, not the same $R$ that we've been using).
The QR factorization of an $m \times n$ matrix $A$ is given by $A = QR$, where $Q$ has orthonormal columns and $R$ is an upper triangular matrix, defined by \[A = \begin{bmatrix} \\ \mathbf a_1 & \mathbf a_2 & \cdots & \mathbf a_n \\ \\ \end{bmatrix} = \begin{bmatrix} \\ \mathbf q_1 & \mathbf q_2 & \cdots & \mathbf q_n \\ \\ \end{bmatrix} \begin{bmatrix} \mathbf q_1^T \mathbf a_1 & \mathbf q_1^T \mathbf a_2 & \mathbf q_1^T \mathbf a_3 & \cdots & \mathbf q_1^T \mathbf a_n \\ 0 & \mathbf q_2^T \mathbf a_2 & \mathbf q_2^T \mathbf a_3 & \cdots & \mathbf q_2^T \mathbf a_n \\ 0 & 0 & \mathbf q_3^T \mathbf a_3 & \cdots & \mathbf q_3^T \mathbf a_n \\ \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \mathbf q_n^T \mathbf a_n \end{bmatrix} = QR.\]
It is clear that the columns of $Q$ form an orthonormal basis of the column space of $A$. But what about $R$? If we look closely, we see that each entry of $R$ is defined as follows: \[r_{ij} = \begin{cases} \mathbf q_i^T \mathbf a_j & \text{if $i \leq j$,} \\ 0 & \text{otherwise.} \end{cases}\]
The result is that each $\mathbf a_i$ can be written as a linear combination of $\mathbf q_1, \dots, \mathbf q_i$. The matrix $R$ tells us exactly how: \[\mathbf a_i = \mathbf q_1 (\mathbf q_1^T \mathbf a_i) + \cdots + \mathbf q_i (\mathbf q_i^T \mathbf a_i).\]
What does this mean? Recall that in order to get our orthogonal basis, we had to break up every one of our basis vectors into orthogonal pieces. When computing $\mathbf q_i$, we first acquired $\mathbf b_i$, the part of $\mathbf a_i$ orthogonal to $\mathbf q_1, \dots, \mathbf q_{i-1}$, obtained by subtracting the projections of $\mathbf a_i$ onto them. That is, \[\mathbf b_i = \mathbf a_i - (\mathbf q_1 \mathbf q_1^T \mathbf a_i + \cdots + \mathbf q_{i-1} \mathbf q_{i-1}^T \mathbf a_i).\] Substituting our expression for $\mathbf a_i$ into this equation, we have \[\mathbf b_i = \mathbf q_i (\mathbf q_i^T \mathbf a_i).\] This tells us that $\mathbf b_i$, the $i$th basis vector that we computed from $\mathbf a_i$ by removing its projections onto the previous vectors, is really the projection of $\mathbf a_i$ onto $\mathbf q_i$. This makes sense, because $\mathbf q_i$ is nothing but $\mathbf b_i$ scaled down to unit length.
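The entrywise recipe above translates directly into code. As a rough sketch (the function name `gram_schmidt_qr` and the test matrix are our own, not from the text), classical Gram–Schmidt QR in NumPy looks like this:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR: Q has orthonormal columns and
    R is upper triangular with r_ij = q_i^T a_j for i <= j."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        b = A[:, j].astype(float)          # start from a_j
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]    # r_ij = q_i^T a_j
            b -= R[i, j] * Q[:, i]         # remove the projection onto q_i
        R[j, j] = np.linalg.norm(b)        # r_jj = ||b_j|| = q_j^T a_j
        Q[:, j] = b / R[j, j]              # normalize b_j to get q_j
    return Q, R

A = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))               # True
```

In practice one would call `np.linalg.qr`, which uses Householder reflections and is numerically more stable than Gram–Schmidt, but the sketch above mirrors the construction in the text.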
We can also get $R$ in another way. Since the columns of $Q$ are orthonormal, $Q^TQ = I$, so multiplying both sides of $A = QR$ on the left by $Q^T$ gives \begin{align*} Q^TA &= Q^TQR \\ Q^TA &= R. \end{align*}
This is a nice decomposition for a few reasons, as we'll see when we put it to work on least squares.
Let's return to $A = \begin{bmatrix}-1 & -1 & 0\\-1 & 0 & 1\\0 & -2 & -1\end{bmatrix}$. We got $Q = \begin{bmatrix}- \frac{\sqrt{2}}{2} & - \frac{\sqrt{2}}{6} & - \frac{2}{3}\\- \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{6} & \frac{2}{3}\\0 & - \frac{2 \sqrt{2}}{3} & \frac{1}{3}\end{bmatrix}$ from the Gram–Schmidt process. But what is the associated $R$? We simply go through each entry and take the appropriate dot products.
This gives us $R = \begin{bmatrix}\sqrt{2} & \frac{\sqrt{2}}{2} & - \frac{\sqrt{2}}{2}\\0 & \frac{3 \sqrt{2}}{2} & \frac{5 \sqrt{2}}{6}\\0 & 0 & \frac{1}{3}\end{bmatrix}$.
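We can sanity-check this numerically using the identity $R = Q^TA$; a quick sketch in NumPy:

```python
import numpy as np

s2 = np.sqrt(2)
A = np.array([[-1., -1., 0.],
              [-1., 0., 1.],
              [0., -2., -1.]])
Q = np.array([[-s2/2, -s2/6, -2/3],
              [-s2/2,  s2/6,  2/3],
              [ 0.0, -2*s2/3, 1/3]])

R = Q.T @ A                    # R = Q^T A, since Q^T Q = I
print(np.allclose(Q @ R, A))   # True
print(np.allclose(R, [[s2, s2/2, -s2/2],
                      [0., 3*s2/2, 5*s2/6],
                      [0., 0., 1/3]]))   # True: matches the R above
```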
We saw that using $Q$ improved our least squares computation, but we don't always have a perfectly orthonormal basis $Q$. We do always have $A = QR$, though. The question is whether $A = QR$ improves the situation at all. First, let's consider $A^TA$: \[A^TA = (QR)^T QR = R^TQ^TQR = R^TR.\] If we substitute this into our least squares equation and cancel $R^T$ (which is invertible as long as the columns of $A$ are independent), we get \begin{align*} R^TR\mathbf{\hat x} &= R^TQ^T\mathbf b \\ R\mathbf{\hat x} &= Q^T\mathbf b \end{align*} At this point, it looks like we haven't gained anything—we still need to move $R$ to the other side, so we compute $R^{-1}$. However, that overlooks a very crucial property of $R$. First, remember that we can compute $\mathbf d = Q^T \mathbf b$ directly, so let's rewrite our equation: \[R \mathbf{\hat x} = \mathbf d.\] We are back to solving a linear equation. But recall that $R$ is upper triangular! This means that we can just use back substitution to solve this equation and avoid the expensive computation of the inverse.
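To make this concrete, here is a sketch of least squares via QR in NumPy; the `back_substitute` routine and the small line-fitting system are our own illustration, not from the text:

```python
import numpy as np

def back_substitute(R, d):
    """Solve R x = d for upper-triangular R, working from the bottom row up."""
    n = len(d)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (d[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
    return x

# Fit a line c + m t to three points: an overdetermined system A x = b.
A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([1., 2., 2.])

Q, R = np.linalg.qr(A)               # reduced QR: Q is 3x2, R is 2x2
x_hat = back_substitute(R, Q.T @ b)  # solve R x-hat = Q^T b by substitution
print(x_hat)                         # same as np.linalg.lstsq(A, b, rcond=None)[0]
```

Back substitution costs $O(n^2)$ operations, versus the $O(n^3)$ needed to form an inverse, which is exactly the savings the triangular structure buys us.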
We'll now step back into the world of $n \times n$ matrices to discuss eigenvalues and eigenvectors. As we mentioned before, it's unlikely that we'll encounter data matrices that are square. However, square matrices still play an important role in representing certain kinds of data (for example, graphs) and are particularly important from the point of view of transformations.
Recall that any matrix can be viewed as a transformation of a vector to another vector. In the case of a square matrix, we are taking an $n$-dimensional vector and mapping it to some other $n$-dimensional vector. This amounts to moving it around, squishing and stretching it into some other vector in the same space. Intuitively, this suggests that there may be some vectors whose direction stays the same under such a transformation—these vectors are the eigenvectors of the matrix.
Let $A$ be an $n \times n$ matrix. A scalar $\lambda$ is an eigenvalue of $A$ if $A \mathbf x = \lambda \mathbf x$ for some nonzero vector $\mathbf x$. The vector $\mathbf x$ is the eigenvector that corresponds to $\lambda$.
This definition tells us that eigenvectors are those vectors that stay the same, but are scaled by a transformation—in other words, the direction of the vector doesn't change. In some sense, we can think of these vectors as remaining stable under the transformation. This suggests that they capture some information about the structure of the transformation.
Let $A = \begin{bmatrix} 7 & -9 \\ 2 & -2 \end{bmatrix}$. The vector $\begin{bmatrix} 3 \\ 1 \end{bmatrix}$ is an eigenvector of $A$. Its associated eigenvalue is 4. To see this, we have: \[\begin{bmatrix} 7 & -9 \\ 2 & -2 \end{bmatrix} \begin{bmatrix} 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \end{bmatrix} = 4 \begin{bmatrix} 3 \\ 1 \end{bmatrix}.\]
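It's worth checking this with NumPy (a quick sketch; note that `np.linalg.eig` returns unit-length eigenvectors, so directions match only up to scaling):

```python
import numpy as np

A = np.array([[7., -9.], [2., -2.]])
x = np.array([3., 1.])

print(np.allclose(A @ x, 4 * x))   # True: x is an eigenvector with lambda = 4

# np.linalg.eig recovers both eigenvalues of A
vals, vecs = np.linalg.eig(A)
print(np.sort(vals))               # approximately [1, 4]
```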
Traditionally, the motivation for eigenvectors is in the evolution of linear systems: if we view a matrix $A$ as a process applied repeatedly to some input vector $\mathbf x$, then we can view $A^t \mathbf x$ as the evolution of the system through time step $t$. Obviously, taking powers of a matrix is expensive.
Eigenvectors give us a way around this. For an eigenvector $\mathbf x$ with eigenvalue $\lambda$, we have \[A^k \mathbf x = A^{k-1} \lambda \mathbf x = \lambda A^{k-1} \mathbf x = \lambda^2 A^{k-2} \mathbf x = \cdots = \lambda^k \mathbf x.\] That is, applying $A$ repeatedly to an eigenvector amounts to repeated scalar multiplication, which is far cheaper than computing matrix powers.
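A quick numeric sketch of this, using the matrix from the earlier example, where $\mathbf x = (3, 1)$ has eigenvalue $4$:

```python
import numpy as np

A = np.array([[7., -9.], [2., -2.]])
x = np.array([3., 1.])               # eigenvector of A with eigenvalue 4

# Applying A ten times is the same as scaling x by 4**10.
direct = np.linalg.matrix_power(A, 10) @ x
print(np.allclose(direct, 4**10 * x))   # True
```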
The set of eigenvalues of $A$ is called the eigenspectrum or spectrum of $A$. From this, we get spectral analysis—studying matrices (and other mathematical objects, such as graphs, that can be represented as matrices) based on their eigenvalues and eigenvectors and the structures that they admit.
But we'll see that despite these being useful tools in their own right, what we would really like is an analogous tool for any matrix, not just square ones. Eigenvalues and eigenvectors are only the first step towards this goal.