Notes can be found as interactive webpage at

12-13: Least Squares & ML

Least Squares #

Motivation #

  • We want to find an approximate solution – one that satisfies all the given equations/information as closely as possible to…
    • Minimize the impact of noise/errors
    • Solve overdetermined systems – we’ll often be collecting abundant information, thus we have more equations than unknowns
  • Applications: Least squares is the fundamental idea behind data fitting and machine learning

What #

The goal of least squares is to minimize the error vector’s magnitude (norm) – the square of each of it’s components:

$$\Vert \vec e \Vert = \sqrt{\sum_{i=1}^n e_i^2 }$$

  • With $m$ measurements (equations) and $n$ unknowns; $m > n$
    • $\text{col}(A)$ is an $n$-dimensional subspace within the larger $m$-dimensional space that $\vec b$ lies in
    • That is, $\vec b$ cannot be exactly reached by $A\vec x$
  • $\vec x$: Best-guess that minimizes $\Vert \vec e \Vert$
    • $\vec x \in \mathbb{R}^{n}$
  • $\vec e = \vec b - A\vec x = \vec b - \hat x$: Error vector
    • $\vec e \in \mathbb{R}^{n}$
    • We want to find $\vec x$ where $\vec e \perp \text{col}(A)$
  • $\vec b$: Actual, observed values
    • $\vec b \in \mathbb{R}^{m}$
  • $\hat x = A \vec x = A (A^T A)^{-1} A^T \vec b$: Predicted values
    • $A \in \mathbb{R}^{m \times n}; A\vec x, \hat x \in \mathbb{R}^{m}$
    • Can be thought of a linear combination of the columns of $A$, with $\vec x$ acting as weights

Proof #

  • We know from the last section that the projection of $\vec a$ onto $\vec b$ results in a vector within the span of $\vec b$ that is closest and orthogonal to $\vec a$
    • So we’ll project $\vec b$ onto $\text{col}(A)$ to find the smallest possible $\vec e$
    • That is, we know that $\vec e \perp \text{col}(A)$ $\Longrightarrow \vec e \perp \vec a_i $ for each column of $A$ so we can write: $$\langle \vec a_i, \vec e \rangle = 0$$ $$\langle \vec a_i, \vec b - A\vec x \rangle = 0$$ $$\vec a_i^T (\vec b - A\vec x) = 0$$ $$\therefore A^T (\vec b - A \vec x) = \vec 0$$
    • Then solving for $\vec x$ in terms of the known $A$ and $\vec b$: $$A^T (\vec b - A \vec x) = \vec 0$$ $$A^T \vec b - A^T A \vec x = \vec 0$$ $$A^T \vec b = A^T A \vec x$$ $$ \therefore \vec x = (A^T A )^{-1} A^T \vec b$$

Theorems #

Theorem 23.1 #

Vector $\vec e$ is orthogonal to every vector in the column space of $A$ iff it is orthogonal to each of the columns $\vec a_i$ that form the basis of its column space

  • If $\vec e$ is orthogonal to every vector in the column space of $A$, then it is orthogonal to each of the $\vec a_i$, as each of the $\vec a_i$ are in the column space of $A$ too
  • Now, we will try to prove the converse: consider an arbitrary vector $\vec v \in \text{span}(A)$– by definition, we know that there exist coefficients $\alpha_i$ such that we can express $\vec v = \sum_{i=1}^m \alpha_i \vec a_i$
  • We also know that $\vec e$ is orthogonal to each $\vec a_i$; that is $\langle \vec e, \vec a_i \rangle = 0$
  • We wish to show that $\vec{e}$ is orthogonal to $\vec{v}$ too: $$\langle\vec{e},\vec{v}_i \rangle = \left\langle \vec{e}, \sum_{i=1}^m \alpha_i \vec a_i \right\rangle$$ $$\dots = \sum \left\langle \vec{e}, \alpha_i \vec a_i \right\rangle$$ $$\dots = \sum \alpha_i \left\langle \vec{e}, \vec a_i \right\rangle$$ $$\dots = 0$$
  • That is, if $\vec e$ is orthogonal to all the basis vectors of a subspace, it is orthogonal to every vector in that subspace as well!

Theorem 23.2 #

$\text{Null}(A^TA) = \text{Null}(A)$ even when $A$ has a nontrivial nullspace

  • Consider an arbitrary $\vec v \in \text{Null}(A^T A)$– by definition, we have $$A^T A \vec v = \vec 0$$ $$\vec v^T (A^T A \vec v) = \vec v^T (\vec 0)$$ $$(A \vec v)^T (A \vec v) = 0$$ $$\langle A \vec v, A \vec v \rangle = \Vert A\vec v \Vert ^2 = 0$$
  • Thus, it is clear that $A \vec v = 0$, so $\vec v \in \text{Null}(A)$
    • That is, $\text{Null}(A^T A) \subseteq \text{Null}(A)$
  • Consider an arbitrary vector $\vec v’ \in \text{Null}(A)$– pre-multiplying by $A^T$ we have that… $$A \vec v’ = \vec 0$$ $$A^T A \vec v’ = \vec 0$$
  • Therefore $\vec v’ \in \text{Null}(A^T A)$
    • That is, $\text{Null}(A) \subseteq \text{Null}(A^T A)$
    • Combining this with the prior statement, we find our desired $\text{Null}(A) = \text{Null}(A^T A)$