Logistic Regression
The classification problem differs from the regression problem in that $y$ takes a discrete value (a category label) rather than a continuous value.
Therefore, for logistic regression, we will choose our $h_{\theta}(x)$ to be a sigmoid function that squishes any real number to a value between 0 and 1:
$$h_{\theta}(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
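As a minimal sketch of how this hypothesis could be computed (using NumPy; the `sigmoid` and `h` names are illustrative, not from these notes):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)
```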
Let us assume that:
$$p(y = 1 \mid x; \theta) = h_{\theta}(x)$$
$$p(y = 0 \mid x; \theta) = 1 - h_{\theta}(x)$$
Now this can be rewritten as:
$$p(y \mid x; \theta) = \left(h_{\theta}(x)\right)^{y} \left(1 - h_{\theta}(x)\right)^{1 - y}$$
Now, the log-likelihood can be written as:
$$\ell(\theta) = \log \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$$
$$= \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$$
$$= \sum_{i=1}^{n} \log \left[ \left(h_{\theta}(x^{(i)})\right)^{y^{(i)}} \left(1 - h_{\theta}(x^{(i)})\right)^{1 - y^{(i)}} \right]$$
$$= \sum_{i=1}^{n} \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + \left(1 - y^{(i)}\right) \log \left(1 - h_{\theta}(x^{(i)})\right) \right]$$
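As a quick sketch, the last line above can be evaluated directly (illustrative NumPy code; `X` is assumed to be the $n \times d$ design matrix and `y` the 0/1 label vector):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x^(i)) for every example
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```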
Taking its derivative with respect to $\theta_j$, and using the sigmoid identity $g'(z) = g(z)\left(1 - g(z)\right)$, we get:
$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \sum_{i=1}^{n} \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x_j^{(i)}$$
And so our stochastic gradient ascent update rule to maximize the log-likelihood, applied one example $(x^{(i)}, y^{(i)})$ at a time, becomes:
$$\theta_j \leftarrow \theta_j + \alpha \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x_j^{(i)}$$
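Putting the update rule into code, a minimal stochastic gradient ascent loop might look like the following (illustrative names; `alpha` and `epochs` are assumed hyperparameters, not values from these notes):

```python
import numpy as np

def fit_logistic(X, y, alpha=0.1, epochs=100):
    """Stochastic gradient ascent on the log-likelihood.

    X: (n, d) design matrix; y: (n,) labels in {0, 1}.
    Applies theta <- theta + alpha * (y^(i) - h_theta(x^(i))) * x^(i),
    one example at a time (vectorized over the coordinates j).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            h_i = 1.0 / (1.0 + np.exp(-(theta @ X[i])))  # h_theta(x^(i))
            theta += alpha * (y[i] - h_i) * X[i]          # update all theta_j
    return theta
```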
Also, note that maximizing the log-likelihood is equivalent to minimizing the logistic loss $\ell_{\text{logistic}}(t, y) = y \log\left(1 + e^{-t}\right) + (1 - y) \log\left(1 + e^{t}\right)$, where $t = \theta^T x$:
$$\arg \min_{\theta} \ell_{\text{logistic}}(t, y) = \arg \max_{\theta} \ell(\theta)$$
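As a quick numerical sanity check of this equivalence (illustrative code; it only verifies that the per-example negative log-likelihood equals $\ell_{\text{logistic}}(t, y)$ at random points):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    t = rng.normal()                      # t = theta^T x for some theta, x
    y = rng.integers(0, 2)                # label in {0, 1}
    h = 1.0 / (1.0 + np.exp(-t))          # h_theta(x)
    nll = -(y * np.log(h) + (1 - y) * np.log(1 - h))
    logistic = y * np.log1p(np.exp(-t)) + (1 - y) * np.log1p(np.exp(t))
    assert np.isclose(nll, logistic)
```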
Multiclass Classification
For multiclass classification with $k$ classes, we will have $k$ parameter vectors $\theta_1, \dots, \theta_k$ and will use the softmax formulation, which generalizes logistic regression to $k$ classes:
$$p(y = i \mid x; \theta) = \phi_i = \frac{\exp\left(\theta_i^T x\right)}{\sum_{j=1}^{k} \exp\left(\theta_j^T x\right)}$$
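A minimal sketch of computing these probabilities (illustrative; `Theta` stacks $\theta_1, \dots, \theta_k$ as rows, and subtracting the max logit is a standard numerical-stability trick, not part of the derivation above):

```python
import numpy as np

def softmax_probs(Theta, x):
    """phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x).

    Theta: (k, d) matrix whose i-th row is theta_i.
    """
    logits = Theta @ x            # theta_i^T x for each class i
    logits -= logits.max()        # stability: probabilities are unchanged
    e = np.exp(logits)
    return e / e.sum()
```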
Our cross-entropy loss (which is the negative log-likelihood) can then be written as:
$$\ell_{\text{ce}}(\theta) = \sum_{i=1}^{m} -\log \left( \frac{\exp\left(\theta_{y^{(i)}}^T x^{(i)}\right)}{\sum_{j=1}^{k} \exp\left(\theta_j^T x^{(i)}\right)} \right)$$
Taking the derivative of the cross-entropy loss with respect to $\theta_j$, we get:
$$\frac{\partial}{\partial \theta_j} \ell_{\text{ce}}(\theta) = \sum_{i=1}^{m} \left( \phi_j^{(i)} - 1\left\{y^{(i)} = j\right\} \right) x^{(i)}$$
Note that $\phi_j^{(i)} = p\left(y^{(i)} = j \mid x^{(i)}; \theta\right)$.
Therefore, since $\phi_j^{(i)}$ is a probability between 0 and 1: if $y^{(i)} = j$, the coefficient $\phi_j^{(i)} - 1$ is negative, so we add a negative multiple of $x^{(i)}$ to our gradient; and if $y^{(i)} \neq j$, the coefficient is the positive $\phi_j^{(i)}$, so we add a positive multiple of $x^{(i)}$ to our gradient.
Our gradient descent update rule is then:
$$\theta_j \leftarrow \theta_j - \alpha \sum_{i=1}^{m} \left( \phi_j^{(i)} - 1\left\{y^{(i)} = j\right\} \right) x^{(i)}$$
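A minimal batch gradient descent sketch under the same assumptions (illustrative names; labels are assumed to be encoded as integers 0..k-1):

```python
import numpy as np

def fit_softmax(X, y, k, alpha=0.1, iters=500):
    """Batch gradient descent on the cross-entropy loss.

    X: (m, d) design matrix; y: (m,) integer labels in {0, ..., k-1}.
    Gradient w.r.t. theta_j: sum_i (phi_j^(i) - 1{y^(i) = j}) x^(i).
    """
    m, d = X.shape
    Theta = np.zeros((k, d))                      # row j is theta_j
    for _ in range(iters):
        logits = X @ Theta.T                      # (m, k): theta_j^T x^(i)
        logits -= logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        phi = e / e.sum(axis=1, keepdims=True)    # (m, k): phi_j^(i)
        onehot = np.eye(k)[y]                     # (m, k): 1{y^(i) = j}
        Theta -= alpha * (phi - onehot).T @ X     # (k, d) gradient step
    return Theta
```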