CAPP 30271 — Lecture 3

Linear Independence

When thinking about the linear combinations of the columns of a matrix, something we might wonder is whether we really need all of them.

Let $A = \begin{bmatrix} 1 & 2 & 4 \\ 3 & -2 & 4 \\ 2 & 3 & 7 \end{bmatrix}$. If we label the columns of $A$ by $\mathbf a_1, \mathbf a_2, \mathbf a_3$, then we observe that $\mathbf a_3 = 2\mathbf a_1 + \mathbf a_2$.

In other words, we have a column in our matrix that is a linear combination of the others. If we view $A$ as a function that produces vectors via linear combinations of its columns, this tells us that we could remove that column and still be able to produce all the vectors we could before. How?

Suppose we have $\mathbf v = \begin{bmatrix} 3 \\ -1 \\ 5 \end{bmatrix}$. Then by definition, $A \mathbf v = 3 \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} + -1 \begin{bmatrix} 2 \\ -2 \\ 3 \end{bmatrix} + 5 \begin{bmatrix} 4 \\ 4 \\ 7 \end{bmatrix}$. But remember that we can write the final column of $A$ in terms of the other two, so we have \[A \mathbf v = 3 \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} + -1 \begin{bmatrix} 2 \\ -2 \\ 3 \end{bmatrix} + 5 \left(2\begin{bmatrix}1 \\ 3 \\ 2\end{bmatrix} + \begin{bmatrix} 2 \\ -2 \\ 3 \end{bmatrix} \right) = 13\begin{bmatrix}1 \\ 3 \\ 2\end{bmatrix} + 4 \begin{bmatrix} 2 \\ -2 \\ 3 \end{bmatrix} \]

This is the idea of linear independence. While we thought about this in terms of the columns of a matrix, independence is really something about a set of vectors.

A set of vectors $\mathbf v_1, \mathbf v_2, \dots, \mathbf v_n$ is linearly independent if no $\mathbf v_i$ can be written as a linear combination of the other vectors in the set. Otherwise, the set is linearly dependent, and any such $\mathbf v_i$ is called a dependent vector.

Above, we saw the math: we can write a vector as a linear combination of the others. But what does this say intuitively about all of the vectors we get from linear combinations of our columns (i.e. the column space)?

If we have two linearly dependent vectors, then they're multiples of each other. What are all the linear combinations? If we add an independent vector, what happens to the space of vectors we can describe?

In other words, adding a vector that is independent from the ones we already have gives us a new direction or dimension and a whole bunch of other vectors we can describe using linear combinations.

There are some fairly obvious consequences of this. The first is observing that many matrices contain more columns than we need, in the sense that we can describe the same set of vectors using fewer columns. In a way, we can think of linearly dependent columns as superfluous.

From a data science perspective, this tells us something very interesting as well. It's easy to imagine a data table of, say, thousands of columns. How likely is it that all of these are linearly independent? This suggests that one of the things we might want to do is to find a way to describe the same set of vectors using far fewer columns. In doing so, we can reason that we pick out the most important vectors that say something interesting about our data.

Let's say we want a list or matrix $C$ of linearly independent columns from $A$. There's a very simple algorithm to do this:

  1. If the first column of $A$ is not all zero, put it into $C$.
  2. If the second column of $A$ is not a multiple of the first, put it into $C$.
  3. If the third column of $A$ is not a linear combination of the first two, put it into $C$.
  4. If the fourth column of $A$ is not a linear combination of the first three, put it into $C$.
\[\vdots\]
  $n$. If the $n$th column of $A$ is not a linear combination of the first $n-1$, put it into $C$.

Basically, just go through each column of $A$ and check if it's a linear combination of the vectors we've chosen to be in $C$ already. Then if we view $C$ as sticking all of the columns we chose together, we have a matrix that is made up entirely of independent columns. The trick here of course is that checking whether a column is a linear combination of some set of vectors is not quite as easy as we've made it seem.
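The greedy procedure above can be sketched in plain Python. This is a hypothetical helper, not course code; it performs the "is this column a linear combination of the ones we kept?" check by comparing ranks, computed by Gaussian elimination over exact rationals so small integer examples don't suffer floating-point issues.

```python
from fractions import Fraction

def rank(vectors):
    """Rank of a list of vectors, via Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    for c in range(len(m[0]) if m else 0):
        piv = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue                      # no pivot in this column
        m[r], m[piv] = m[piv], m[r]       # move pivot row into place
        for i in range(len(m)):
            if i != r and m[i][c] != 0:   # eliminate this column elsewhere
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def independent_columns(A):
    """Greedily keep each column of A that is not a linear
    combination of the columns already kept."""
    cols = [list(col) for col in zip(*A)]   # columns of A
    chosen = []
    for col in cols:
        # col is independent of `chosen` exactly when adding it raises the rank
        if rank(chosen + [col]) > len(chosen):
            chosen.append(col)
    return [list(row) for row in zip(*chosen)]   # reassemble as the matrix C
```

On the example matrix from these notes, this keeps the first two columns and rejects the third, since $\mathbf a_3 = 2\mathbf a_1 + \mathbf a_2$.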

Of course, while we've talked about identifying columns that are dependent, there's also the question of how many columns we "need". Can we construct a matrix with arbitrarily many independent columns? Or is there a point where any column we add is going to be dependent, no matter what?

This leads to a suite of questions we can ask about any particular matrix.

On a final note, if we observe our example matrix from before, we recall that we could get rid of that last column. We have \[A \mathbf v = \begin{bmatrix} 1 & 2 & 4 \\ 3 & -2 & 4 \\ 2 & 3 & 7 \end{bmatrix} \begin{bmatrix} 3 \\ -1 \\ 5 \end{bmatrix} = \begin{bmatrix}1 & 2 \\ 3 & -2 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 13 \\ 4 \end{bmatrix} = C \mathbf x.\] Something to note here is that although we don't need that extra column, dropping it changes the vector of coefficients needed to produce the same result, since (naturally) the exact nature of the linear combination changes. This suggests that there's some information somewhere that the extra column represents that we need to account for somehow.

Observe that we didn't really talk about what sizes our matrices and vectors need to be when we multiply them together. Instead, we can probably piece together that our vector needs to have as many components as there are columns of the matrix, since this is supposed to represent a particular linear combination. We can generalize this idea.

Matrix multiplication

After seeing this, a very natural question you may have is whether we can multiply a matrix with another matrix. The answer is yes and our discussion gives some hints at how to do it and when we're allowed to.

We can start with something very simple: a column matrix multiplied by a row matrix.

Let $A = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$ and $B = \begin{bmatrix} 2 & -1 & 4 \end{bmatrix}$. Then $AB = \begin{bmatrix} 2 & -1 & 4 \\ 6 & -3 & 12 \\ 4 & -2 & 8 \end{bmatrix}$.

What is going on here? If we view the row matrix as three columns, each consisting of a single entry, then we compute the matrix-vector product for each column and put the results side by side.

This particular matrix has an interesting property: all of its columns are scalar multiples of each other. That is, there's only one independent column. Of course, this is not surprising given how we computed it.

But there's an interesting side effect of this: the rows are also scalar multiples of each other. This is not so surprising if you make the observation that it's basically the same observation as for columns if we just flipped the matrix over.

From this, we can see where the generalization takes us: we compute the matrix-vector products and stick them together to form the desired matrix product.

Let $A$ and $B$ be matrices. If $B = \begin{bmatrix}\mathbf u & \mathbf v & \mathbf w\end{bmatrix}$ with column vectors $\mathbf u, \mathbf v, \mathbf w$, then \[AB = \begin{bmatrix} A\mathbf u & A \mathbf v & A \mathbf w\end{bmatrix}.\]

In other words, multiplying two matrices is really just multiplying each column vector by a matrix and sticking them all together.

Let $A = \begin{bmatrix} 2 & 4 \\ -3 & 4 \\ 1 & -2 \end{bmatrix}$ and $B = \begin{bmatrix} 3 & -1 & 5 \\ 9 & 3 & -1\end{bmatrix}$. Then \begin{align*} AB &= \begin{bmatrix} A\begin{bmatrix} 3 \\ 9 \end{bmatrix} & A \begin{bmatrix} -1 \\ 3 \end{bmatrix} & A \begin{bmatrix} 5 \\ -1 \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} 3 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} + 9 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} & -1 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} + 3 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} & 5 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} + -1 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} 42 & 10 & 6 \\ 27 & 15 & -19 \\ -15 & -7 & 7 \end{bmatrix} \end{align*}

Of course, this is the way to view matrix multiplication when viewed via columns. We could just as well adapt the row/dot product view of matrix multiplication: \[ AB = \begin{bmatrix} 3 \cdot 2 + 9 \cdot 4 & -1 \cdot 2 + 3 \cdot 4 & 5 \cdot 2 + -1 \cdot 4 \\ 3 \cdot -3 + 9 \cdot 4 & -1 \cdot -3 + 3 \cdot 4 & 5 \cdot -3 + -1 \cdot 4 \\ 3 \cdot 1 + 9 \cdot -2 & -1 \cdot 1 + 3 \cdot -2 & 5 \cdot 1 + -1 \cdot -2 \end{bmatrix} = \begin{bmatrix} 42 & 10 & 6 \\ 27 & 15 & -19 \\ -15 & -7 & 7 \end{bmatrix}. \]
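Both views can be checked with a few lines of plain Python (a minimal sketch, not course code): the column view builds $AB$ one matrix-vector product at a time, while the entry view fills in each entry as a dot product.

```python
def matvec(A, v):
    # A v: the linear combination of A's columns with weights from v
    return [sum(row[j] * v[j] for j in range(len(v))) for row in A]

def matmul_cols(A, B):
    # Column view: the k-th column of AB is A times the k-th column of B.
    cols = [matvec(A, [B[i][k] for i in range(len(B))]) for k in range(len(B[0]))]
    return [[col[i] for col in cols] for i in range(len(A))]

def matmul_dots(A, B):
    # Entry view: (AB)[i][k] is the dot product of row i of A and column k of B.
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]
```

Running both on the $A$ and $B$ above produces the same matrix, as the two derivations promise.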

It's important to note here that not every pair of matrices can be multiplied together. There's a very big restriction: $A$ must have the same number of columns as $B$ has rows. This follows pretty naturally from our prior discussions and the properties we've seen.

This actually tells us another important thing about matrix multiplication: it's not commutative. In other words, $AB$ is not necessarily the same as $BA$, if such a product even exists at all.

An interesting question that arises from this definition is how many multiplications we need to perform in order to compute a matrix multiplication. That is, suppose that I have two matrices $A$ and $B$, where $A$ is an $m \times p$ matrix and $B$ is a $p \times n$ matrix. The resulting matrix $AB$ is then an $m \times n$ matrix.

So it's clear that we need to compute $m \times n$ entries, but how long does it take to compute each entry? The above formulation of matrix multiplication tells us that each entry is really a dot product of vectors with $p$ components, which takes exactly $p$ multiplications. If we put that all together, we get $m \times n \times p$ multiplications in total.

Now for the sake of comparison, let's simplify things a bit and consider the case where we have $n \times n$ square matrices. Then it's clear that we need something like $n^3$ multiplication operations. Is this the best we can do?
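The counting argument can be made concrete with a naive triple loop that tallies every scalar multiplication it performs (a hypothetical sketch, not course code):

```python
def matmul_counting(A, B):
    """Naive matrix multiplication that also counts scalar multiplications."""
    m, p, n = len(A), len(B), len(B[0])
    count = 0
    C = [[0] * n for _ in range(m)]
    for i in range(m):          # each of the m rows of A
        for k in range(n):      # each of the n columns of B
            for j in range(p):  # each term of the length-p dot product
                C[i][k] += A[i][j] * B[j][k]
                count += 1
    return C, count
```

For the $3 \times 2$ and $2 \times 3$ example above this performs $3 \cdot 2 \cdot 3 = 18$ multiplications; for two $n \times n$ matrices it performs exactly $n^3$.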

One of the earliest surprises of computer science was showing that we can get away with fewer multiplications! By slicing your matrices into blocks and making a few clever observations, we can construct our desired matrix with only about $n^{2.81}$ multiplications. This was a result from 1969, due to Strassen, which followed a similar breakthrough about multiplying ordinary numbers. Subsequent work, notably by Coppersmith and Winograd in the late 1980s, brought the exponent down to roughly $2.376$, but the question of whether we can get close to $n^2$ remains open. Progress on this question was stalled until the 2010s and we've seen a burst of activity around it since, inching the exponent down to about $2.37$.

There is one more observation we can make about matrix multiplication. Let's look back at our column definition of matrix multiplication again. We have \begin{align*} AB &= \begin{bmatrix} 3 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} + 9 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} & -1 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} + 3 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} & 5 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} + -1 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} 3 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} & -1 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} & 5 \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} \end{bmatrix} + \begin{bmatrix} 9 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} & 3 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} & -1 \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} \begin{bmatrix} 3 & -1 & 5 \end{bmatrix} + \begin{bmatrix} 4 \\ 4 \\ -2 \end{bmatrix} \begin{bmatrix} 9 & 3 & -1 \end{bmatrix} \end{align*}

This view of matrix multiplication tells us that we can view a matrix product as a sum of simpler, rank-one matrices. We'll see what consequences this has soon.
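The sum-of-outer-products view is easy to check directly. Here's a minimal sketch in plain Python (hypothetical helper names): `outer` builds a column-times-row rank-one matrix, and `matmul_outer` sums one such piece per column of $A$.

```python
def outer(u, w):
    # Rank-one matrix u w^T: a column vector times a row vector.
    return [[ui * wj for wj in w] for ui in u]

def matmul_outer(A, B):
    # AB as a sum of rank-one pieces: (column j of A) times (row j of B).
    total = [[0] * len(B[0]) for _ in range(len(A))]
    for j in range(len(B)):
        piece = outer([row[j] for row in A], B[j])
        total = [[a + b for a, b in zip(r, s)] for r, s in zip(total, piece)]
    return total
```

On the running example, summing the two rank-one pieces reproduces the same product $AB$ computed earlier, matching the derivation above term for term.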

For now, let's return to the idea of independent columns of $A$. Our procedure from before gives us a matrix $C$ of linearly independent columns. However, this is not necessarily the same as $A$. Even though the set of vectors that are described are essentially the same, we've still lost some information along the way. What happened to it? Where is it? What is it?

Let's return to this example from above. We have \[A \mathbf v = \begin{bmatrix} 1 & 2 & 4 \\ 3 & -2 & 4 \\ 2 & 3 & 7 \end{bmatrix} \begin{bmatrix} 3 \\ -1 \\ 5 \end{bmatrix} = \begin{bmatrix}1 & 2 \\ 3 & -2 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 13 \\ 4 \end{bmatrix} = C \mathbf x.\] Clearly, the vectors $\mathbf v$ and $\mathbf x$ are different. How do we account for this difference, especially if we're interested in reconstructing $A$ based on $C$?

One thing we just learned is that we can use a matrix to do this: describe how to combine columns of $C$ in a new matrix! If we want to add that last column, we just need a vector telling us how to do this: \[A = \begin{bmatrix} 1 & 2 & 4 \\ 3 & -2 & 4 \\ 2 & 3 & 7 \end{bmatrix} = \begin{bmatrix}1 & 2 \\ 3 & -2 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 0 & a_1 \\ 0 & 1 & a_2 \end{bmatrix}.\] Since we know that the first two columns are in place, we can "select" them in the first and second columns of our matrix by the very simple linear combination 1 of the column we want and 0 of the column we don't want.

Then to construct the third column, we need $a_1$ and $a_2$ such that $a_1 \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} + a_2 \begin{bmatrix} 2 \\ -2 \\ 3 \end{bmatrix} = \begin{bmatrix} 4 \\ 4 \\ 7 \end{bmatrix}$. We know already that $a_1 = 2$ and $a_2 = 1$, so we have \[A = \begin{bmatrix}1 & 2 \\ 3 & -2 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 0 & 2 \\ 0 & 1 & 1 \end{bmatrix} = CR.\]

Notice that this allows us to combine columns in any order we wish. For instance, suppose instead we had $A = \begin{bmatrix} 1 & 4 & 2 \\ 3 & 4 & -2 \\ 2 & 7 & 3 \end{bmatrix}$. In this case, we can keep $C$ the same and change $R$ so we have \[A = \begin{bmatrix}1 & 2 \\ 3 & -2 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 2 & 0 \\ 0 & 1 & 1 \end{bmatrix} = CR.\]
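Both factorizations can be verified by multiplying out $CR$ (a quick plain-Python check; `matmul` here is an illustrative helper, not course code):

```python
def matmul(A, B):
    # (AB)[i][k] is the dot product of row i of A and column k of B.
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

C = [[1, 2], [3, -2], [2, 3]]

# Original A: R selects the two kept columns, then builds the third as 2*c1 + 1*c2.
R1 = [[1, 0, 2], [0, 1, 1]]
# Reordered A: same C, but the dependent column now sits in the middle.
R2 = [[1, 2, 0], [0, 1, 1]]
```

Multiplying gives back the two versions of $A$ from the notes exactly.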

So $C$ and $R$ together give us the complete picture of $A$. But there's more—there's a sort of symmetry that we can infer here. If we view things from the perspective of rows we can argue very much the same thing: that $R$ consists of, in some form, the independent rows of $A$ and $C$ describes how to construct these rows.

This leads to a very interesting fact: the number of independent columns of a matrix $A$ is exactly the same as the number of its independent rows. This number is called the rank of the matrix. The rank of a matrix tells us a lot about the structure of the matrix.
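The row-rank-equals-column-rank fact can be sanity-checked on the example matrix by running the same elimination on $A$ and on its transpose (a self-contained sketch using exact rationals; not course code):

```python
from fractions import Fraction

def rank(rows):
    """Rank via Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        piv = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

A = [[1, 2, 4], [3, -2, 4], [2, 3, 7]]
At = [list(col) for col in zip(*A)]   # transpose: rows of At are columns of A
```

Both come out to 2, matching the two independent columns (and rows) we found by hand.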

Consider a matrix $A$ that represents users as column vectors, where each row represents an item of interest—a movie, a TV show, a book, etc. Each entry is a rating that the user gives for that item.

If we have such a matrix, what can the decomposition of $A = CR$ tell us?

Already, this leads to some interesting ideas to play around with. For instance, in practice we don't start with $C$ and $R$. Rather, we have some much larger set of data, and with it, very many dependent (or nearly dependent) columns and rows. Instead of an exact factorization, we can look for a low-rank approximation, and how good an approximation we want depends on a choice of rank $r$.

Or what may be much more likely is that we have an incomplete matrix. For example, we only get a complete matrix if everyone watches and rates every movie. This is both very unlikely and really the point of this exercise—if we have a complete matrix, there's nothing left to recommend! So we can use this idea to come up with a model $C$ and $R$ for our incomplete matrix and use these components to compute the missing entries. These missing entries then form the basis for our recommendations.

But this is getting ahead of ourselves—factorization is not an easy problem. So far, the problems you've been asked to solve in homework are small examples that yield to a bit of observation and trial and error.

We'll now begin our dive into linear algebra, starting with the classical problem of how to solve linear systems. Classically, this is about finding solutions to the system, but for our purposes, this process gives us the first steps for what we're looking for: a way to compute $C$ and $R$.