Linear Regression

Least Mean Squares

We define a hypothesis function $h_{\theta}(x)$ that models the label as a linear combination of the $d$ features of an input $x$:

h_{\theta}(x) = \sum_{j=1}^{d} \theta_j x_j = \theta^T x

For $n$ training examples, we also define a cost function that we want to minimize:

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(h_{\theta}(x^{(i)}) - y^{(i)} \right)^2
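The two definitions above can be sketched in NumPy as follows. The function names `hypothesis` and `cost` and the toy data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def hypothesis(theta, X):
    """Predictions h_theta(x^(i)) = theta^T x^(i) for every row x^(i) of X."""
    return X @ theta

def cost(theta, X, y):
    """Least-squares cost J(theta): half the sum of squared residuals."""
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# Hypothetical toy data: n = 3 examples, d = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([5.0, 2.0, 5.0])
theta = np.array([1.0, 1.0])

# Predictions are [3, 2, 4], residuals are [-2, 0, -1], so J = 0.5 * 5.
print(cost(theta, X, y))  # 2.5
```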

Figure: Linear regression fit with vertical residuals from each data point to the fitted line. The cost function measures the squared vertical residuals between each training label and the fitted line.

Taking the derivative with respect to any $\theta_j$ we get:

\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{n} \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x_j^{(i)}
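One way to sanity-check this derivative is to compare it against a central finite-difference approximation of $J(\theta)$. The sketch below assumes random synthetic data; the helper names `gradient` and `numerical_gradient` are hypothetical.

```python
import numpy as np

def cost(theta, X, y):
    return 0.5 * np.sum((X @ theta - y) ** 2)

def gradient(theta, X, y):
    # Component j is sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    # which for all j at once is X^T (X theta - y).
    return X.T @ (X @ theta - y)

def numerical_gradient(theta, X, y, eps=1e-6):
    # Central difference in each coordinate direction.
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # synthetic data, for illustration only
y = rng.normal(size=20)
theta = rng.normal(size=3)

analytic = gradient(theta, X, y)
numeric = numerical_gradient(theta, X, y)
print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```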


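The derivative above sums over all $n$ training examples. Using the contribution of a single example $i$ at a time, we can use gradient descent to take small steps of size $\alpha$ (the learning rate) towards the optimal $\theta$ with the following update rule: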

\theta_j \leftarrow \theta_j + \alpha \left( y^{(i)} - h_\theta (x^{(i)}) \right) x_j^{(i)}

Figure: Gradient descent path over contour lines of the linear regression cost function. Gradient descent updates $\theta$ by moving downhill along the contours of $J(\theta)$ until it reaches the minimum.

\text{Repeat until convergence \{}

\quad \text{For } i = 1 \text{ to } n, \text{\{}

\quad \quad \text{For } j = 1 \text{ to } d, \text{\{}

\quad \quad \quad \theta_j \leftarrow \theta_j + \alpha \left( y^{(i)} - h_\theta (x^{(i)}) \right) x_j^{(i)}

\quad \quad \text{\}}

\quad \text{\}}

\text{\}}
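The loop above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `lms_sgd`, the fixed epoch count standing in for "until convergence", the learning rate, and the noiseless synthetic data are all assumptions.

```python
import numpy as np

def lms_sgd(X, y, alpha=0.01, epochs=200):
    """Per-example LMS updates, sweeping over the training set repeatedly."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):                 # stand-in for "repeat until convergence"
        for i in range(n):
            # The error is computed once per example so that every theta_j
            # is updated from the same prediction (a simultaneous update).
            error = y[i] - X[i] @ theta
            theta = theta + alpha * error * X[i]
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta                          # noiseless labels, so an exact fit exists

theta = lms_sgd(X, y)
print(theta)                                # close to [2, -3]
```

With noisy labels and a constant $\alpha$, per-example updates typically keep fluctuating around the minimum rather than settling exactly, which is why a decaying learning rate is often used in practice.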

Closed Form Solution

Let $X$ be a matrix that contains each $\left(x^{(i)} \right)^T$ in its rows and has a size of $n$ by $d+1$ where $d$ is the number of features in $x^{(i)}$ and $+ 1$ is for the intercept term.

Also, $\vec{y}$ is an $n$-dimensional vector that contains the labels for each training example, and $\theta$ is a $(d+1)$-dimensional vector that contains the intercept and one weight per feature.

Now, since $h_{\theta}(x^{(i)}) = \left(x^{(i)} \right)^T\theta$ for each example, the vector of all $n$ predictions can be written in matrix-vector form as $X \theta$.

With these, we can now rewrite $J(\theta)$ using the fact that $z^T z = \sum_i z_i^2$

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(h_{\theta}(x^{(i)}) - y^{(i)} \right)^2

= \frac{1}{2} \left(X\theta - \vec{y} \right)^T \left(X\theta - \vec{y} \right)
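As a quick check that the two forms of $J(\theta)$ agree, the sketch below evaluates both on random data. The data and variable names are illustrative assumptions; the column of ones plays the role of the intercept term in $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 2
# Design matrix of size n by (d + 1): a column of ones for the intercept, then the features.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)

# Sum form: one squared residual per training example.
cost_sum = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(n))

# Matrix form: z^T z with z = X theta - y.
z = X @ theta - y
cost_matrix = 0.5 * z @ z

print(np.isclose(cost_sum, cost_matrix))
```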

Finally, to minimize $J(\theta)$, we take its gradient with respect to $\theta$, set it equal to $0$, and simplify. Assuming $X^TX$ is invertible, $J(\theta)$ is minimized when

\theta = \left(X^TX\right)^{-1}X^T \vec{y}
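A brief sketch of the intermediate steps, using the standard matrix-calculus identities $\nabla_\theta \left(\theta^T A \theta\right) = 2A\theta$ for symmetric $A$ and $\nabla_\theta \left(b^T \theta\right) = b$:

J(\theta) = \frac{1}{2} \left( \theta^T X^T X \theta - 2 \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)

\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y} = 0

X^T X \theta = X^T \vec{y}

Multiplying both sides by $\left(X^TX\right)^{-1}$, which assumes $X^TX$ is invertible, gives the expression for $\theta$ above.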

Figure: Orthogonal projection of the target vector onto the column space of the design matrix. The closed-form solution chooses the prediction vector $\hat{y} = X\theta$ that is closest to $\vec{y}$ inside the column space of $X$.
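A minimal NumPy sketch of the closed-form solution, assuming synthetic data and a hypothetical helper name `normal_equation`. It solves the linear system $X^TX\theta = X^T\vec{y}$ rather than forming the inverse explicitly, which is usually the more numerically stable choice, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

def normal_equation(X, y):
    # Solves (X^T X) theta = X^T y; assumes X^T X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
n, d = 50, 3
features = rng.normal(size=(n, d))
X = np.hstack([np.ones((n, 1)), features])      # n by (d + 1), intercept column first
true_theta = np.array([0.5, 1.0, -2.0, 3.0])    # illustrative ground truth
y = X @ true_theta + 0.1 * rng.normal(size=n)

theta = normal_equation(X, y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, theta_lstsq))

# The residual is orthogonal to every column of X, which is the
# projection property of the least-squares fit.
residual = y - X @ theta
print(np.max(np.abs(X.T @ residual)))           # close to zero
```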


Probabilistic Interpretation

We assume that each label $y^{(i)}$ is a linear function of $x^{(i)}$ plus a noise term $\epsilon^{(i)}$, where the noise terms are independent and $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$:

y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}

This can be rewritten as $\epsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$. Since $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, it follows that $\left(y^{(i)} - \theta^T x^{(i)} \right) \sim \mathcal{N}(0, \sigma^2)$.

Finally, if we are given $x^{(i)}$, then $\theta^T x^{(i)}$ is a constant, so $y^{(i)} \mid x^{(i)}$ follows $\mathcal{N} (0, \sigma^2)$ shifted by $\theta^T x^{(i)}$:

y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N} \left(\theta^T x^{(i)}, \sigma^2 \right)
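To illustrate the conditional distribution, the sketch below fixes one input $x$, draws many labels from the assumed noise model, and checks that their sample mean and standard deviation are near $\theta^T x$ and $\sigma$. All values here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = np.array([1.0, -2.0])   # illustrative parameters
sigma = 0.5                     # illustrative noise standard deviation
x = np.array([3.0, 1.0])        # a single fixed input, so theta^T x = 1.0

# Draw many labels y = theta^T x + epsilon with epsilon ~ N(0, sigma^2).
num_samples = 100_000
epsilon = rng.normal(loc=0.0, scale=sigma, size=num_samples)
samples = theta @ x + epsilon

print(samples.mean())  # near theta^T x = 1.0
print(samples.std())   # near sigma = 0.5
```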

To find the maximum likelihood estimate of $\theta$ , we need to maximize:

L(\theta) = L(\theta; X, \vec{y}) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)

Since $\log$ is a monotonically increasing function, we can maximize the log-likelihood instead:

\ell(\theta) = \log \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)
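As a sketch of the next step, substituting the Gaussian density $p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$ turns the log of the product into a sum:

\ell(\theta) = \sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) \right]

= n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)} \right)^2

The first term and the factor $\frac{1}{\sigma^2}$ do not depend on $\theta$, so only the sum of squared residuals matters when choosing $\theta$.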

Expanding the Gaussian density, we see that maximizing $\ell(\theta)$ is equivalent to minimizing:

\frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)} \right)^2 = J(\theta)

Notice that the function that we need to minimize does not depend on $\sigma$ , which means that we don't need to know the variance of our noise to be able to maximize our likelihood.
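A small numerical illustration of this point, under assumed synthetic data: the least-squares estimate gives a higher Gaussian log-likelihood than randomly perturbed parameters, whichever value of $\sigma$ is plugged in. The helper `log_likelihood` and all constants are hypothetical.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """Gaussian log-likelihood of the labels under y = theta^T x + noise."""
    residuals = y - X @ theta
    n = y.size
    return -n * np.log(np.sqrt(2 * np.pi) * sigma) - np.sum(residuals ** 2) / (2 * sigma ** 2)

rng = np.random.default_rng(3)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=200)

# Least-squares estimate, computed without any knowledge of sigma.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

results = []
for sigma in (0.1, 0.5, 2.0):
    best = log_likelihood(theta_ls, X, y, sigma)
    perturbed = [
        log_likelihood(theta_ls + rng.normal(scale=0.1, size=3), X, y, sigma)
        for _ in range(100)
    ]
    # True when no perturbed theta beats the least-squares theta for this sigma.
    results.append(best >= max(perturbed))

print(results)
```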

