CAPP 30271 — Lecture 24

Low rank approximation

The fact that we can decompose any matrix $A$ in a way that extracts the most important basis directions is a powerful tool. The idea that the most important directions are ones with the largest singular values suggests that the pieces associated with the smaller singular values only add small numerical details or noise.

The reduced SVD actually immediately gives us an application: compression. Consider an $m \times n$ matrix. Such a matrix obviously has $mn$ values. But do we need to store all of those values? If the rank of the matrix is $r$, then the reduced SVD gives us a way to store those values using only $mr + nr + r = r(m+n+1)$ values, which is much smaller than $mn$ when the rank $r$ is much smaller than $m$ and $n$.

For example, suppose we have a $12000 \times 1500$ matrix. That's 18000000 entries. If we know that the rank of this matrix is really 230, then we only need to store a $12000 \times 230$ matrix $U$ of left singular vectors, the list of 230 singular values, and a $1500 \times 230$ matrix $V$ of right singular vectors. In total, the number of floats is \[12000 \times 230 + 230 + 1500 \times 230 = 230 \times 13501 = 3105230.\] And this is without losing any information at all—we'll see soon that if we're okay with losing some information, SVD gives us a principled way to do lossy compression.
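The storage arithmetic is easy to check in NumPy. The sketch below uses smaller, made-up dimensions than the example above (the sizes and the seed are ours, chosen for illustration): it builds a matrix of known rank as a product of random factors, then compares dense storage against the reduced-SVD factors.

```python
import numpy as np

# A synthetic low-rank matrix: a product of random factors has rank <= r.
# (These dimensions are made up for illustration.)
rng = np.random.default_rng(0)
m, n, r = 1200, 150, 23
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r nonzero singular values and their singular vectors.
Ur, Sr, Vtr = U[:, :r], S[:r], Vt[:r]

dense_storage = m * n           # floats needed to store A directly: 180000
svd_storage = r * (m + n + 1)   # floats for Ur, Sr, Vtr together: 31073

# The factors reconstruct A (up to floating-point error).
A_rebuilt = Ur * Sr @ Vtr
print(dense_storage, svd_storage)   # 180000 31073
print(np.allclose(A, A_rebuilt))    # True
```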

The main idea behind data science and machine learning applications is that we often have data that is too large and too complicated to work with. So far, we have invested a lot of time into understanding how to break down matrices so that we can work with simpler pieces: linearly independent vectors and bases, subspaces, orthogonal and orthonormal bases, least squares approximations, diagonalization.

What we're really looking for is a low-dimensional approximation of our data. In other words, we want a simpler version of our data that still retains the essential characteristics and structures in our original dataset, but in a more tractable form.

One of the main complicating factors of this is the fact that there are too many dimensions to the data. This is an issue for a number of reasons. One obvious one is that it makes the data computationally expensive to process. Another reason is that high-dimensional data is often more sparse—there are more dimensions/ways for data to be "different" from each other, making identifying commonalities like clusters more difficult. And yet, it's observed that real data tends to lie along low-dimensional subspaces.

The SVD gives us a way to extract these low dimensional subspaces. The idea is simple: we can construct an approximation for $A$ of rank $k$ by using the top $k$ singular vectors.

Let $A = \sigma_1 \mathbf u_1 \mathbf v_1^T + \cdots + \sigma_r \mathbf u_r \mathbf v_r^T$ with $\sigma_1 \geq \cdots \geq \sigma_r \gt 0$. We define the rank $k$ approximation of $A$ to be the matrix \[A_k = \sigma_1 \mathbf u_1 \mathbf v_1^T + \cdots + \sigma_k \mathbf u_k \mathbf v_k^T.\]

The idea is simple: if we view the SVD as a weighted sum of rank 1 matrices weighted by the singular values, then to get a rank $k$ approximation, we take the rank 1 matrices with the largest weight—those associated with the first $k$ singular values.
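As a sketch, this recipe can be wrapped in a small helper. (The function name `rank_k_approx` is ours; it is not a NumPy routine.)

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k approximation of A: keep the top k singular values/vectors."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # U[:, :k] * S[:k] scales each of the first k left singular vectors by
    # its singular value; @ Vt[:k] then sums the k rank-1 outer products.
    return U[:, :k] * S[:k] @ Vt[:k]

A = np.array([[1, 3, 1, 0, 9],
              [0, 0, 0, 1, 3],
              [1, 3, 1, 1, 12]])

A1 = rank_k_approx(A, 1)
print(np.linalg.matrix_rank(A1))              # 1
print(np.allclose(rank_k_approx(A, 2), A))    # True: A itself has rank 2
```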

The matrix we've been working with has rank 2, so we can get a rank 1 approximation of it by keeping only the top singular value and its singular vectors.


    >>> A = np.array([[ 1,  3,  1,  0,  9],
    ...               [ 0,  0,  0,  1,  3],
    ...               [ 1,  3,  1,  1, 12]])
    >>> U, S, Vt = np.linalg.svd(A)
    >>> A1 = U[:,:1] * S[:1] @ Vt[:1]
    >>> A1
    array([[ 0.82344686,  2.47034059,  0.82344686,  0.5758932 ,  9.13870137],
           [ 0.25244633,  0.757339  ,  0.25244633,  0.17655314,  2.80167641],
           [ 1.0758932 ,  3.22767959,  1.0758932 ,  0.75244633, 11.94037778]])
    >>> A - A1
    array([[ 0.17655314,  0.52965941,  0.17655314, -0.5758932 , -0.13870137],
           [-0.25244633, -0.757339  , -0.25244633,  0.82344686,  0.19832359],
           [-0.0758932 , -0.22767959, -0.0758932 ,  0.24755367,  0.05962222]])
    

If we examine this closely, we notice that the entries of $A_1$ are reasonably close to those of $A$, and that the difference $A - A_1$ contains only small values close to 0.

So at the very least, the SVD gives us an approximation. The obvious question to ask is how good of an approximation this really is—is there some measure that we can use on $A - A_k$ to quantify the "closeness" of two matrices like we do with vectors? We consider the following result due to Eckart and Young in 1936.

Let $A$ be an $m \times n$ matrix and let $A_k$ be the rank $k$ approximation obtained via the SVD. If $B$ is an $m \times n$ matrix of rank at most $k$, then $\|A - A_k\| \leq \|A-B\|$.

Eckart and Young were both here, at the University of Chicago (Eckart as a professor in Physics and Young as a grad student), when they published their proof of this result.

What this says is that the rank $k$ matrix obtained by summing the $k$ rank 1 matrices with the largest singular values is a best rank $k$ approximation of $A$: among all matrices of rank at most $k$, the matrix $A_k$ defined above is closest to $A$. In other words, the SVD gives us the best possible rank $k$ approximation of $A$.

There's just one piece missing from this: what does it mean for two matrices to be "close"? The theorem tells us: two matrices are close when the "distance" between them, as measured by $\|A-B\|$, is small. Notationally, this looks like the "length" of a matrix, but what does the length of a matrix even mean?

Matrix norms

The technical term for what we've been calling the "length" of a vector is the norm of the vector. Norms are functions that obey certain rules to give us a notion of "closeness" or "similarity" between two objects. Here's a technical definition of the rules involved.

A norm on a vector space $V$ is a function $\|\cdot\|: V \to \mathbb R$, which assigns to a vector $\mathbf x$ a scalar $\|\mathbf x\|$ such that for all scalars $c$ and vectors $\mathbf x, \mathbf y$,

- $\|\mathbf x\| \geq 0$: the norm is nonnegative.
- $\|\mathbf x\| = 0$ if and only if $\mathbf x = \mathbf 0$: a vector has norm 0 if and only if it is the zero vector.
- $\|c\mathbf x\| = |c| \|\mathbf x\|$: norms are scaled by scalars.
- $\|\mathbf x + \mathbf y\| \leq \|\mathbf x\| + \|\mathbf y\|$: the norm obeys the triangle inequality; the length of one side ($\mathbf x + \mathbf y$) can't be longer than the sum of the other two sides.
- $|\mathbf x^T \mathbf y| \leq \|\mathbf x\| \|\mathbf y\|$: the norm obeys the Schwarz inequality; the size of the inner product of two vectors can't exceed the product of their norms.
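We can spot-check these rules numerically for the Euclidean norm on a couple of random vectors. (This is a sanity check on one sample, not a proof; the dimension and seed are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)
c = -3.7

assert np.linalg.norm(x) >= 0                     # nonnegativity
assert np.linalg.norm(np.zeros(5)) == 0           # zero vector has norm 0
assert np.isclose(np.linalg.norm(c * x),
                  abs(c) * np.linalg.norm(x))     # scaling by |c|
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)  # triangle
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y)             # Schwarz

print("all norm rules hold for this sample")
```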

Notice that this defines a norm over a vector space—recall that matrices are also "vectors" with respect to some vector space. So any matrix norms we define will use the same rules, which is very convenient. Just like the vector space rules, you don't really need to memorize these, since the idea behind these rules is that a norm should in some sense look and feel like what you'd expect a norm to do: give some notion of "length" or "size" to the object.

This means that objects can have more than one norm defined for them. We've already seen an example of this. Recall that our usual notion of vector length is really the Euclidean norm, also called the $\ell_2$ norm. But we have also seen, informally, the taxicab norm:

The $\ell_1$ norm, informally called the taxicab norm or Manhattan norm, is defined for a vector $\mathbf v = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}$ in $\mathbb R^n$ by $\|\mathbf v\|_1 = |v_1| + \cdots + |v_n|$.

Recall that stochastic matrices are those with nonnegative entries whose columns each add to 1. This is really just the condition that the $\ell_1$ norm of each column is 1. Such vectors would not necessarily have unit length under the Euclidean norm.

Why should we want different notions of length? Though these notions are defined mathematically, they appear in the real world as well. Suppose you're at State and Madison ($(0,0)$ on Chicago's grid) and you need to head over to the Thompson Center (Clark and Randolph) to renew your driver's license. How long does it take for you to get there?

The answer that you would expect is that it's two blocks north then two blocks west. Since a block is an eighth of a mile (or 200 m), this is half a mile (or 800 m). This answer is an example of the $\ell_1$ norm. The $\ell_2$ norm, the Euclidean distance, would draw a straight line from State and Madison to Clark and Randolph and give us a distance of $\frac{\sqrt 2}{4}$ miles (or $400\sqrt 2$ m). Which is "correct"? It depends on your application! In this case, since you probably can't clip through walls, it would make more sense to use the $\ell_1$ norm.
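We can reproduce this little computation with NumPy, treating the displacement as a vector measured in miles. (The coordinates below are just an illustrative encoding of "two blocks north, two blocks west" at an eighth of a mile per block.)

```python
import numpy as np

# Displacement from State & Madison to Clark & Randolph, in miles:
# a quarter mile west and a quarter mile north.
step = np.array([-0.25, 0.25])

l1 = np.linalg.norm(step, ord=1)   # walking distance along the grid
l2 = np.linalg.norm(step, ord=2)   # straight-line distance

print(l1)   # 0.5
print(l2)   # 0.35355... (= sqrt(2)/4)
```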

We give three norms for a matrix $A$:

- The spectral norm (or 2-norm) $\|A\|_2$ is the maximum of the ratio $\frac{\|A\mathbf x\|}{\|\mathbf x\|}$ over all nonzero vectors $\mathbf x$; that is, the longest a vector can get, relative to its original length, when multiplied by $A$.
- The Frobenius norm $\|A\|_F$ is the square root of the sum of the squares of the entries of $A$.
- The nuclear norm $\|A\|_N$ is the sum of the diagonal entries (i.e. the trace) of the matrix square root of $A^TA$.

Luckily, we don't need to remember these definitions to compute them, since we can just use NumPy to do it the same way we do with vectors. Recall that numpy.linalg.norm has a parameter ord which specifies the desired norm.


    >>> A = np.array([[ 1,  3,  1,  0,  9],
    ...               [ 0,  0,  0,  1,  3],
    ...               [ 1,  3,  1,  1, 12]])
    >>> np.linalg.norm(A, ord=2)        # spectral/2-norm
    15.992486205087896
    >>> np.linalg.norm(A, ord='fro')    # Frobenius norm
    16.06237840420901
    >>> np.sum(A ** 2) ** 0.5           # def. of Frobenius
    16.06237840420901
    >>> np.linalg.norm(A, ord='nuc')    # nuclear norm
    17.489277767087565
    

These are the formal definitions of the norms. However, we will see that each one of them is connected to the singular values of $A$.

Let $A$ be a matrix with nonzero singular values $\sigma_1 \geq \dots \geq \sigma_r \gt 0$. Then

- $\|A\|_2 = \sigma_1$,
- $\|A\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_r^2}$,
- $\|A\|_N = \sigma_1 + \cdots + \sigma_r$.

Since we have access to the singular values, we can use them to compute the norms directly and verify.


    >>> U, S, Vt = np.linalg.svd(A)
    >>> S
    array([1.59924862e+01, 1.49679156e+00, 7.52107468e-16])
    >>> S[0]                        # spectral/2-norm
    15.992486205087893
    >>> np.sum(S[:2] ** 2) ** 0.5   # Frobenius norm
    16.06237840420901
    >>> np.sum(S[:2])               # nuclear norm
    17.48927776708756
    

These are probably the definitions of the norms you want to keep in mind. One interesting consequence of this is that it tells us that singular values themselves obey the rules of norms.

Consider two matrices $A$ and $B$ with first singular values $\alpha_1$ and $\beta_1$, respectively. Then the first singular value for $A+B$ cannot be any larger than $\alpha_1 + \beta_1$. Why? If this were the case, then $\|A+B\|_2 \gt \|A\|_2 + \|B\|_2$, which would violate the triangle inequality.
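This is easy to test numerically on random matrices. (The sizes and seed below are arbitrary; this checks one sample, not the general claim.)

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((4, 6))

# compute_uv=False returns just the singular values, largest first.
alpha1 = np.linalg.svd(A, compute_uv=False)[0]
beta1  = np.linalg.svd(B, compute_uv=False)[0]
gamma1 = np.linalg.svd(A + B, compute_uv=False)[0]

# sigma_1 is the spectral norm, so it must obey the triangle inequality.
print(gamma1 <= alpha1 + beta1)   # True
```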

We've actually already seen much of the argument that $\|A\|_2 = \max \frac{\|A \mathbf x \|}{\|\mathbf x\|} = \sigma_1$. One way to think about this is that the first right singular vector $\mathbf v_1$ points in the direction in which a vector gets "stretched" the most by the transformation $A$. We know this has to be the case because the geometric interpretation of SVD "stretches" a vector in each direction of the singular vectors that have a nonzero singular value. Because each of these vectors is orthogonal, none of them interferes with the others.
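We can see both claims numerically: multiplying the first right singular vector by $A$ stretches it by exactly $\sigma_1$, and no random vector gets stretched by more. (The number of trials and the seed are arbitrary.)

```python
import numpy as np

A = np.array([[1, 3, 1, 0, 9],
              [0, 0, 0, 1, 3],
              [1, 3, 1, 1, 12]])
U, S, Vt = np.linalg.svd(A)

# The first right singular vector is a unit vector stretched by exactly sigma_1.
v1 = Vt[0]
print(np.linalg.norm(A @ v1))   # 15.9924... = sigma_1

# No vector's stretch ratio exceeds sigma_1.
rng = np.random.default_rng(3)
ratios = [np.linalg.norm(A @ x) / np.linalg.norm(x)
          for x in rng.standard_normal((1000, 5))]
print(max(ratios) <= S[0] + 1e-9)   # True
```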

In summary,

- spectral norm, $\|A\|_2$ (NumPy: ord=2): the maximum of the ratio $\frac{\|A\mathbf x\|}{\|\mathbf x\|}$ over all nonzero vectors $\mathbf x$ (i.e. the longest a vector can get when multiplied by $A$); equal to $\sigma_1$, the largest singular value of $A$.
- Frobenius norm, $\|A\|_F$ (NumPy: ord='fro'): the square root of the sum of the squares of the entries of $A$; equal to the square root of the sum of the squares of the singular values of $A$.
- nuclear norm, $\|A\|_N$ (NumPy: ord='nuc'): the sum of the diagonal entries (i.e. the trace) of the matrix square root of $A^TA$; equal to the sum of the singular values of $A$.
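One way to check these definitions against the singular-value formulas without computing a matrix square root explicitly: the eigenvalues of $A^TA$ are the squared singular values, so taking their square roots recovers the singular values. A sketch:

```python
import numpy as np

A = np.array([[1, 3, 1, 0, 9],
              [0, 0, 0, 1, 3],
              [1, 3, 1, 1, 12]])

# Eigenvalues of A^T A are the squared singular values; clip any tiny
# negatives introduced by round-off before taking square roots.
sing = np.sqrt(np.clip(np.linalg.eigvalsh(A.T @ A), 0, None))

print(np.isclose(sing.max(), np.linalg.norm(A, ord=2)))                  # True
print(np.isclose(np.sqrt(np.sum(sing**2)), np.linalg.norm(A, ord='fro')))  # True
print(np.isclose(sing.sum(), np.linalg.norm(A, ord='nuc')))              # True
```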

Why do we have three different matrix norms? In fact, there are many more matrix (and vector) norms out there, each with their own meanings and applications. Each norm captures something different about the matrix $A$ as a transformation.

However, we mention these ones in particular because the Eckart–Young theorem holds for all of them. This extension is due to Mirsky in 1960: Eckart–Young holds for any norm that depends only on the singular values of the matrix, which includes all three norms above. So for any of the matrix norms we mentioned, it will always be the case that $\|A - A_k\| \leq \|A - B\|$ for all rank $k$ matrices $B$.
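We can stress-test this empirically by pitting $A_1$ against random rank 1 competitors in all three norms. (The number of trials and the seed are arbitrary; this is evidence, not a proof.)

```python
import numpy as np

A = np.array([[1, 3, 1, 0, 9],
              [0, 0, 0, 1, 3],
              [1, 3, 1, 1, 12]])
U, S, Vt = np.linalg.svd(A)
A1 = U[:, :1] * S[:1] @ Vt[:1]

rng = np.random.default_rng(4)
for _ in range(200):
    # A random rank 1 competitor B = u v^T.
    B = np.outer(rng.standard_normal(3), rng.standard_normal(5))
    for ord_ in (2, 'fro', 'nuc'):
        # A1 should never lose to B in any of the three norms.
        assert np.linalg.norm(A - A1, ord=ord_) <= np.linalg.norm(A - B, ord=ord_)

print("A1 beat every random rank-1 competitor in all three norms")
```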

We can get some interesting results from considering $\|A - A_k\|$ for each norm.


    >>> U, S, Vt = np.linalg.svd(A)
    >>> A1 = U[:,:1] * S[:1] @ Vt[:1]
    >>> A1
    array([[ 0.82344686,  2.47034059,  0.82344686,  0.5758932 ,  9.13870137],
           [ 0.25244633,  0.757339  ,  0.25244633,  0.17655314,  2.80167641],
           [ 1.0758932 ,  3.22767959,  1.0758932 ,  0.75244633, 11.94037778]])
    >>> np.linalg.norm(A - A1, ord=2)
    1.4967915619996677
    >>> S[1]
    1.4967915619996677
    >>> np.linalg.norm(A - A1, ord='fro')
    1.4967915619996677
    >>> np.linalg.norm(A - A1, ord='nuc')
    1.4967915619996677
    

We appear to get the same value for every norm. Why does this happen? Recall that our matrix $A$ has rank 2, so $A - A_1 = \sigma_2 \mathbf u_2 \mathbf v_2^T$ is a rank 1 matrix whose only nonzero singular value is $\sigma_2$. Then we can see that:

- $\|A - A_1\|_2 = \sigma_2$,
- $\|A - A_1\|_F = \sqrt{\sigma_2^2} = \sigma_2$,
- $\|A - A_1\|_N = \sigma_2$.

So this is just a quirk of this example. If our matrix had rank $r$ and we took an approximation with $k \lt r-1$, then $A - A_k$ would have more than one nonzero singular value, and the values of the three norms would generally differ.
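To see this, here is a made-up rank 3 example (random factors, arbitrary seed), where the three norms of $A - A_1$ come out different:

```python
import numpy as np

# A synthetic rank 3 matrix with three nonzero singular values.
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3)) @ np.diag([9.0, 4.0, 1.0]) @ rng.standard_normal((3, 5))

U, S, Vt = np.linalg.svd(A)
A1 = U[:, :1] * S[:1] @ Vt[:1]   # rank 1 approximation, so k < r - 1

# A - A1 has two nonzero singular values, sigma_2 and sigma_3,
# so the three norms no longer coincide.
print(np.linalg.norm(A - A1, ord=2))      # sigma_2
print(np.linalg.norm(A - A1, ord='fro'))  # sqrt(sigma_2^2 + sigma_3^2)
print(np.linalg.norm(A - A1, ord='nuc'))  # sigma_2 + sigma_3
```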