Generative Learning Algorithms
Gaussian Discriminant Analysis
If we want to classify between cats and dogs, discriminative learning algorithms try to learn a hyperplane that separates the two classes directly. Generative learning algorithms instead try to learn a model of what cats look like and a separate model of what dogs look like.
After modelling p(x∣y) for each class and the class prior p(y), we can classify a new example by computing the posterior p(y∣x) for each class and picking the class with the highest posterior probability.
$$\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)}$$
$$= \arg\max_y p(x \mid y)\, p(y)$$
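As a quick numerical sanity check (the likelihoods and prior below are made-up placeholder values), dividing by p(x) rescales every class's score by the same amount, so dropping it does not change which class attains the maximum:

```python
import numpy as np

# Hypothetical values of p(x | y) and p(y) for a single test point x and classes y in {0, 1}.
likelihood = np.array([0.02, 0.05])   # p(x | y=0), p(x | y=1)
prior = np.array([0.7, 0.3])          # p(y=0),     p(y=1)

joint = likelihood * prior            # p(x | y) p(y)
posterior = joint / joint.sum()       # divide by p(x) = sum_y p(x | y) p(y)

# Both scores are maximized by the same class.
assert np.argmax(joint) == np.argmax(posterior)
print(np.argmax(posterior), posterior)
```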
In Gaussian Discriminant Analysis (GDA), we assume that each class-conditional density $p(x \mid y)$ follows a multivariate Gaussian distribution with its own mean but a shared covariance matrix.
$$p(y) = \phi^{y}(1-\phi)^{1-y}$$
$$p(x \mid y=0) = \frac{1}{(2\pi)^{d/2}\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_0)^T \Sigma^{-1} (x-\mu_0)\right)$$
$$p(x \mid y=1) = \frac{1}{(2\pi)^{d/2}\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1)\right)$$
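These densities can be evaluated directly; the snippet below is a small sketch with placeholder values for mu0, mu1, Sigma, and phi (chosen only for illustration):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density: exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    d = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def bernoulli_prior(y, phi):
    """p(y) = phi^y (1 - phi)^(1 - y) for y in {0, 1}."""
    return phi ** y * (1 - phi) ** (1 - y)

# Placeholder parameters for a 2-dimensional problem.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])        # shared covariance matrix
phi = 0.5

x = np.array([0.8, 0.9])
print(gaussian_density(x, mu0, Sigma) * bernoulli_prior(0, phi),
      gaussian_density(x, mu1, Sigma) * bernoulli_prior(1, phi))
```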
The log-likelihood of the data is then given by:
$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)$$
Maximizing $\ell$ with respect to the parameters gives the maximum likelihood estimates:
$$\phi = \frac{1}{n}\sum_{i=1}^{n} 1\{y^{(i)} = 1\}$$
$$\mu_0 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}}$$
$$\mu_1 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}$$
$$\Sigma = \frac{1}{n}\sum_{i=1}^{n} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$
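Putting the estimates together, here is a minimal numpy sketch of GDA, assuming X is an (n, d) array of real-valued features and y an (n,) array of 0/1 labels; the toy blobs at the end are synthetic and purely illustrative:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum likelihood estimates for GDA: phi, mu0, mu1, and the shared Sigma."""
    n = X.shape[0]
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Center each example by the mean of its own class, then average the outer products.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / n
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Pick the class with the larger log p(x | y) + log p(y).

    The Gaussian normalization constant is shared by both classes, so it cancels in the comparison.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    def log_joint(mu, prior):
        diff = X - mu
        return -0.5 * np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff) + np.log(prior)
    return (log_joint(mu1, phi) > log_joint(mu0, 1 - phi)).astype(int)

# Toy data: two Gaussian blobs, generated only to exercise the code.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

phi, mu0, mu1, Sigma = fit_gda(X, y)
print("training accuracy:", np.mean(predict_gda(X, phi, mu0, mu1, Sigma) == y))
```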
Naive Bayes
When our feature vectors x take on discrete values, we can use the Naive Bayes model.
$$x \in \{0,1\}^d$$
We also assume that the features $x_j$ are conditionally independent given $y$.
$$p(x_1, \ldots, x_d \mid y) = \prod_{j=1}^{d} p(x_j \mid y)$$
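Concretely, if each binary feature $x_j$ is modelled as a Bernoulli with parameter $\phi_{j|y}$, the product above is just a product of d Bernoulli probabilities. A tiny sketch with placeholder parameter values:

```python
import numpy as np

# Placeholder per-feature Bernoulli parameters phi_{j|y=1} for d = 4 features.
phi_j_given_y1 = np.array([0.9, 0.2, 0.7, 0.1])   # p(x_j = 1 | y = 1)

x = np.array([1, 0, 1, 1])

# p(x_1, ..., x_d | y=1) = prod_j p(x_j | y=1) under the Naive Bayes assumption.
p_x_given_y1 = np.prod(phi_j_given_y1 ** x * (1 - phi_j_given_y1) ** (1 - x))
print(p_x_given_y1)   # 0.9 * 0.8 * 0.7 * 0.1
```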
We also assume that $y$ takes on two values, 0 and 1. Our log-likelihood then becomes:
$$\ell(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \log \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}; \phi_y, \phi_{j|y=0}, \phi_{j|y=1})$$
$$= \log \prod_{i=1}^{n} \left( \prod_{j=1}^{d} p(x_j^{(i)} \mid y^{(i)}; \phi_{j|y=0}, \phi_{j|y=1}) \right) p(y^{(i)}; \phi_y)$$
Maximizing our log likelihood function with respect to our parameters, we get:
$$\phi_{j|y=1} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}$$
$$\phi_{j|y=0} = \frac{\sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}}$$
$$\phi_y = \frac{1}{n}\sum_{i=1}^{n} 1\{y^{(i)} = 1\}$$
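These estimates are simple per-class frequencies. A minimal sketch, assuming X is an (n, d) binary matrix and y an (n,) vector of 0/1 labels (the tiny dataset at the end is made up for illustration):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates for Bernoulli Naive Bayes."""
    phi_y = np.mean(y == 1)
    # phi_{j|y=k} is the fraction of class-k examples whose j-th feature equals 1.
    phi_j_y1 = X[y == 1].mean(axis=0)
    phi_j_y0 = X[y == 0].mean(axis=0)
    return phi_y, phi_j_y0, phi_j_y1

# Tiny synthetic example: 4 examples, 3 binary features.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])
print(fit_naive_bayes(X, y))
```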
Having fit all these parameters, to make a prediction on a new example, we simply calculate:
$$p(y=1 \mid x) = \frac{p(x \mid y=1)\, p(y=1)}{p(x)}$$
$$= \frac{\left(\prod_{j=1}^{d} p(x_j \mid y=1)\right) p(y=1)}{\left(\prod_{j=1}^{d} p(x_j \mid y=1)\right) p(y=1) + \left(\prod_{j=1}^{d} p(x_j \mid y=0)\right) p(y=0)}$$
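A small sketch of this posterior computation, reusing the parameter names from the fitting sketch above:

```python
import numpy as np

def predict_proba_nb(x, phi_y, phi_j_y0, phi_j_y1):
    """Posterior p(y = 1 | x) for one binary feature vector x, via the formula above."""
    # p(x | y = k) = prod_j phi_{j|y=k}^{x_j} (1 - phi_{j|y=k})^{1 - x_j}
    p_x_y1 = np.prod(phi_j_y1 ** x * (1 - phi_j_y1) ** (1 - x)) * phi_y
    p_x_y0 = np.prod(phi_j_y0 ** x * (1 - phi_j_y0) ** (1 - x)) * (1 - phi_y)
    # Note: if both terms are 0 (a feature value never seen for either class),
    # this is 0/0 and returns nan; Laplace smoothing below avoids that.
    return p_x_y1 / (p_x_y1 + p_x_y0)
```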
However, there is a problem with this approach. If a particular $x_j$ is never 1 in the training set, then both $\phi_{j|y=1}$ and $\phi_{j|y=0}$ are estimated as 0, so the products in the numerator and denominator both vanish and the posterior becomes the indeterminate 0/0.
As a workaround, we apply Laplace smoothing. Our parameter estimates then become:
$$\phi_{j|y=1} = \frac{1 + \sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{2 + \sum_{i=1}^{n} 1\{y^{(i)} = 1\}}$$
$$\phi_{j|y=0} = \frac{1 + \sum_{i=1}^{n} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{2 + \sum_{i=1}^{n} 1\{y^{(i)} = 0\}}$$
$$\phi_y = \frac{1}{n}\sum_{i=1}^{n} 1\{y^{(i)} = 1\}$$
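The smoothed estimates only change the counting step; a minimal sketch under the same assumptions as the earlier fitting code:

```python
import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Bernoulli Naive Bayes with Laplace smoothing: +1 in the numerator, +2 in the denominator."""
    n1 = np.sum(y == 1)
    n0 = np.sum(y == 0)
    phi_j_y1 = (1 + X[y == 1].sum(axis=0)) / (2 + n1)
    phi_j_y0 = (1 + X[y == 0].sum(axis=0)) / (2 + n0)
    phi_y = np.mean(y == 1)
    return phi_y, phi_j_y0, phi_j_y1

# Even for a feature never seen as 1 in a class, the estimate is now 1 / (2 + n_k) > 0,
# so the posterior in the previous snippet never hits 0/0.
```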