# Notes can be found as interactive webpage at

12-13: Least Squares & ML


# Least Squares #

## Motivation #

• We want to find an approximate solution – one that satisfies all the given equations/information as closely as possible to…
• Minimize the impact of noise/errors
• Solve overdetermined systems – we’ll often be collecting abundant information, thus we have more equations than unknowns
• Applications: Least squares is the fundamental idea behind data fitting and machine learning

## What #

The goal of least squares is to minimize the error vector’s magnitude (norm) – the square of each of it’s components:

$$\Vert \vec e \Vert = \sqrt{\sum_{i=1}^n e_i^2 }$$

• With $m$ measurements (equations) and $n$ unknowns; $m > n$
• $\text{col}(A)$ is an $n$-dimensional subspace within the larger $m$-dimensional space that $\vec b$ lies in
• That is, $\vec b$ cannot be exactly reached by $A\vec x$
• $\vec x$: Best-guess that minimizes $\Vert \vec e \Vert$
• $\vec x \in \mathbb{R}^{n}$
• $\vec e = \vec b - A\vec x = \vec b - \hat x$: Error vector
• $\vec e \in \mathbb{R}^{n}$
• We want to find $\vec x$ where $\vec e \perp \text{col}(A)$
• $\vec b$: Actual, observed values
• $\vec b \in \mathbb{R}^{m}$
• $\hat x = A \vec x = A (A^T A)^{-1} A^T \vec b$: Predicted values
• $A \in \mathbb{R}^{m \times n}; A\vec x, \hat x \in \mathbb{R}^{m}$
• Can be thought of a linear combination of the columns of $A$, with $\vec x$ acting as weights

## Proof #

• We know from the last section that the projection of $\vec a$ onto $\vec b$ results in a vector within the span of $\vec b$ that is closest and orthogonal to $\vec a$
• So we’ll project $\vec b$ onto $\text{col}(A)$ to find the smallest possible $\vec e$
• That is, we know that $\vec e \perp \text{col}(A)$ $\Longrightarrow \vec e \perp \vec a_i$ for each column of $A$ so we can write: $$\langle \vec a_i, \vec e \rangle = 0$$ $$\langle \vec a_i, \vec b - A\vec x \rangle = 0$$ $$\vec a_i^T (\vec b - A\vec x) = 0$$ $$\therefore A^T (\vec b - A \vec x) = \vec 0$$
• Then solving for $\vec x$ in terms of the known $A$ and $\vec b$: $$A^T (\vec b - A \vec x) = \vec 0$$ $$A^T \vec b - A^T A \vec x = \vec 0$$ $$A^T \vec b = A^T A \vec x$$ $$\therefore \vec x = (A^T A )^{-1} A^T \vec b$$

# Theorems #

## Theorem 23.1 #

Vector $\vec e$ is orthogonal to every vector in the column space of $A$ iff it is orthogonal to each of the columns $\vec a_i$ that form the basis of its column space

• If $\vec e$ is orthogonal to every vector in the column space of $A$, then it is orthogonal to each of the $\vec a_i$, as each of the $\vec a_i$ are in the column space of $A$ too
• Now, we will try to prove the converse: consider an arbitrary vector $\vec v \in \text{span}(A)$– by definition, we know that there exist coefficients $\alpha_i$ such that we can express $\vec v = \sum_{i=1}^m \alpha_i \vec a_i$
• We also know that $\vec e$ is orthogonal to each $\vec a_i$; that is $\langle \vec e, \vec a_i \rangle = 0$
• We wish to show that $\vec{e}$ is orthogonal to $\vec{v}$ too: $$\langle\vec{e},\vec{v}_i \rangle = \left\langle \vec{e}, \sum_{i=1}^m \alpha_i \vec a_i \right\rangle$$ $$\dots = \sum \left\langle \vec{e}, \alpha_i \vec a_i \right\rangle$$ $$\dots = \sum \alpha_i \left\langle \vec{e}, \vec a_i \right\rangle$$ $$\dots = 0$$
• That is, if $\vec e$ is orthogonal to all the basis vectors of a subspace, it is orthogonal to every vector in that subspace as well!

## Theorem 23.2 #

$\text{Null}(A^TA) = \text{Null}(A)$ even when $A$ has a nontrivial nullspace

• Consider an arbitrary $\vec v \in \text{Null}(A^T A)$– by definition, we have $$A^T A \vec v = \vec 0$$ $$\vec v^T (A^T A \vec v) = \vec v^T (\vec 0)$$ $$(A \vec v)^T (A \vec v) = 0$$ $$\langle A \vec v, A \vec v \rangle = \Vert A\vec v \Vert ^2 = 0$$
• Thus, it is clear that $A \vec v = 0$, so $\vec v \in \text{Null}(A)$
• That is, $\text{Null}(A^T A) \subseteq \text{Null}(A)$
• Consider an arbitrary vector $\vec v’ \in \text{Null}(A)$– pre-multiplying by $A^T$ we have that… $$A \vec v’ = \vec 0$$ $$A^T A \vec v’ = \vec 0$$
• Therefore $\vec v’ \in \text{Null}(A^T A)$
• That is, $\text{Null}(A) \subseteq \text{Null}(A^T A)$
• Combining this with the prior statement, we find our desired $\text{Null}(A) = \text{Null}(A^T A)$