Expectation-Maximization (EM) Algorithm
General Formulation
Suppose we have a latent variable model p(x,z;θ), where z is the latent variable. The density of x can be obtained by marginalizing the joint over z:
$$p(x;\theta) = \sum_z p(x,z;\theta)$$
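To make this concrete, here is a small numeric sketch (not part of the original derivation) using a two-component 1-D Gaussian mixture with made-up parameters; the marginal density of x is just the joint summed over the latent component z:

import numpy as np

# Toy mixture (illustrative parameters only): p(x, z=j; θ) = φ_j · N(x; μ_j, σ_j²)
phi = np.array([0.3, 0.7])    # p(z = j)
mu = np.array([-1.0, 2.0])    # component means
sigma = np.array([0.5, 1.0])  # component standard deviations

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def joint(x, j):
    """p(x, z=j; θ)"""
    return phi[j] * gaussian_pdf(x, mu[j], sigma[j])

def marginal(x):
    """p(x; θ) = Σ_z p(x, z; θ)"""
    return sum(joint(x, j) for j in range(len(phi)))

print(marginal(0.5))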
Now to maximize likelihood, we need to maximize the following function:
$$\ell(\theta) = \sum_{i=1}^{n} \log p(x^{(i)};\theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)};\theta)$$
However, this objective is not concave in θ, so gradient ascent is not guaranteed to find the maximum likelihood estimate.
Moreover, even if p(x,z;θ) is an exponential family distribution, the sum inside the logarithm means that setting the derivative with respect to θ to zero typically has no closed-form solution.
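For the toy mixture above, the log-likelihood can be written directly (a sketch; note how the sum over z sits inside the log, which is what makes direct maximization awkward):

def log_likelihood(X):
    """ℓ(θ) = Σ_i log p(x^(i); θ), using the marginal(x) defined in the sketch above."""
    return sum(np.log(marginal(x)) for x in X)

print(log_likelihood([0.5, -1.2, 2.3]))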
To get around this, we introduce a distribution Q(z) over z, so that ∑_z Q(z) = 1 and Q(z) ≥ 0.
Jensen's Inequality
Jensen's inequality states that for a convex function f (i.e. f″(x) ≥ 0), the following inequality holds:
$$E[f(X)] \ge f(E[X])$$
For a concave function, the inequality is reversed:
$$E[f(X)] \le f(E[X])$$
Also, if X is a constant, then X = E[X] and the inequality holds with equality:
$$E[f(X)] = f(E[X])$$
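As a quick numerical sanity check (illustrative only, not from the original notes), we can verify the concave case with f = log:

import numpy as np

# Jensen's inequality for the concave function log: E[log X] ≤ log(E[X]).
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100_000)  # samples of a positive random variable

print(np.mean(np.log(X)))  # E[log X]
print(np.log(np.mean(X)))  # log E[X]  -- the larger of the two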
Since log is a concave function, applying Jensen's inequality to the expectation over z ∼ Q(z) gives:
$$\log p(x;\theta) = \log \sum_z Q(z)\,\frac{p(x,z;\theta)}{Q(z)} \ge \sum_z Q(z)\,\log \frac{p(x,z;\theta)}{Q(z)}$$
From Jensen's inequality, we also know that the inequality becomes an equality (the bound becomes tight) when the quantity inside the expectation is a constant:
$$\frac{p(x,z;\theta)}{Q(z)} = c$$
for some constant c that does not depend on z.
Since Q(z) must therefore be proportional to p(x,z;θ) while summing to one, the bound is tight when:
$$Q(z) = \frac{p(x,z;\theta)}{\sum_{z'} p(x,z';\theta)} = p(z \mid x;\theta)$$
For convenience, let's define the Evidence Lower Bound (ELBO) as:
$$\mathrm{ELBO}(x;Q,\theta) = E_{z\sim Q(z)}\left[\log \frac{p(x,z;\theta)}{Q(z)}\right] = \sum_z Q(z)\,\log \frac{p(x,z;\theta)}{Q(z)}$$
Therefore,
$$\log p(x;\theta) \ge \mathrm{ELBO}(x;Q,\theta)$$
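Continuing the toy mixture sketch from above (again purely illustrative), we can compute the ELBO for an arbitrary Q and check that it lower-bounds log p(x;θ), with equality when Q(z) = p(z∣x;θ):

def elbo(x, Q):
    """ELBO(x; Q, θ) = Σ_z Q(z) · log( p(x, z; θ) / Q(z) )"""
    return sum(Q[j] * np.log(joint(x, j) / Q[j]) for j in range(len(Q)) if Q[j] > 0)

x = 0.5
Q_arbitrary = np.array([0.5, 0.5])
Q_posterior = np.array([joint(x, j) for j in range(2)]) / marginal(x)  # p(z | x; θ)

print(np.log(marginal(x)))   # log p(x; θ)
print(elbo(x, Q_arbitrary))  # strictly smaller in general
print(elbo(x, Q_posterior))  # equal to log p(x; θ): the bound is tight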
The EM algorithm alternates between two steps: an E step, which sets each Q_i to the posterior over z^(i) under the current θ, and an M step, which maximizes the summed ELBO over θ. In pseudocode (posterior_distribution and maximize_elbo stand for the model-specific computations):

def e_step(X, theta):
    """E-step: compute Q_i(z^(i)) = p(z^(i) | x^(i); θ) for every example."""
    Q = []
    for i in range(len(X)):
        q_i = posterior_distribution(X[i], theta)  # p(z^(i) | x^(i); θ)
        Q.append(q_i)
    return Q

def m_step(X, Q):
    """M-step: θ ← argmax_θ Σ_i ELBO(x^(i); Q_i, θ)."""
    theta_new = maximize_elbo(X, Q)
    return theta_new

for iteration in range(100):  # or loop until the log-likelihood stops improving
    # E-step: for each i, set Q_i(z^(i)) ← p(z^(i) | x^(i); θ)
    Q = e_step(X, theta)
    # M-step: set θ ← argmax_θ Σ_i ELBO(x^(i); Q_i, θ)
    theta = m_step(X, Q)
In the E step, we compute the posterior distribution over each z^(i) using the current value of θ (and hence the current model of x^(i) given z^(i)). We do so using Bayes' rule:
$$p(z^{(i)}=j \mid x^{(i)};\theta) = \frac{p(x^{(i)} \mid z^{(i)}=j;\theta)\,p(z^{(i)}=j;\theta)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)}=l;\theta)\,p(z^{(i)}=l;\theta)}$$
In the M step, we update θ to maximize the ELBO, which, for the Q chosen in the E step, equals log p(x;θ) at the current value of θ.
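For a 1-D Gaussian mixture model, for example, both steps have simple closed forms. Here is a sketch under the toy setup above (the update formulas below are the standard GMM ones, filled in here for illustration):

def gmm_e_step(X, phi, mu, sigma):
    """E-step: responsibilities W[i, j] = p(z^(i) = j | x^(i); θ) via Bayes' rule."""
    W = np.array([[phi[j] * gaussian_pdf(x, mu[j], sigma[j]) for j in range(len(phi))]
                  for x in X])
    return W / W.sum(axis=1, keepdims=True)

def gmm_m_step(X, W):
    """M-step: the θ maximizing the ELBO, which for a GMM has a closed form."""
    X = np.asarray(X, dtype=float)
    Nj = W.sum(axis=0)                 # effective number of points per component
    phi = Nj / len(X)                  # new mixing weights
    mu = (W * X[:, None]).sum(axis=0) / Nj                          # weighted means
    sigma = np.sqrt((W * (X[:, None] - mu) ** 2).sum(axis=0) / Nj)  # weighted std devs
    return phi, mu, sigma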
Proof of Convergence
We now show that each iteration of EM cannot decrease the log-likelihood, which guarantees that ℓ(θ^(t)) converges (in general to a local maximum or stationary point of ℓ, not necessarily the global maximum likelihood estimate).
We saw that for any value of θ and any distributions Q_i over the z^(i), the log-likelihood ℓ(θ) = Σ_i log p(x^(i);θ) is always greater than or equal to Σ_i ELBO(x^(i); Q_i, θ).
Hence,
$$\ell(\theta^{(t+1)}) \ge \sum_{i=1}^{n} \mathrm{ELBO}(x^{(i)};Q_i^{(t)},\theta^{(t+1)})$$
Moreover, since in the M step θ^(t+1) is chosen to maximize Σ_i ELBO(x^(i); Q_i^(t), θ) over θ, we have:
$$\sum_{i=1}^{n} \mathrm{ELBO}(x^{(i)};Q_i^{(t)},\theta^{(t+1)}) \ge \sum_{i=1}^{n} \mathrm{ELBO}(x^{(i)};Q_i^{(t)},\theta^{(t)})$$
Finally, in the E step, each Q_i^(t) is chosen so that, for the current value of θ, the log-likelihood and the evidence lower bound are equal:
$$\sum_{i=1}^{n} \mathrm{ELBO}(x^{(i)};Q_i^{(t)},\theta^{(t)}) = \ell(\theta^{(t)})$$
Putting it all together, we have:
$$\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$$
Note that the reason we wanted a tight bound was to guarantee this monotone improvement. Otherwise, after the E step, Σ_i ELBO(x^(i); Q_i^(t), θ^(t)) could have been strictly lower than the log-likelihood ℓ(θ^(t)), and in that case ℓ(θ^(t+1)) could have been lower than ℓ(θ^(t)).
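We can also observe the monotone improvement empirically (an illustrative run on synthetic data, reusing the GMM step sketches above):

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-1.0, 0.5, 200), rng.normal(2.0, 1.0, 300)])

phi, mu, sigma = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
for t in range(10):
    ll = sum(np.log(sum(phi[j] * gaussian_pdf(x, mu[j], sigma[j]) for j in range(2)))
             for x in X)
    print(f"iteration {t}: log-likelihood = {ll:.4f}")  # should never decrease
    W = gmm_e_step(X, phi, mu, sigma)   # E-step
    phi, mu, sigma = gmm_m_step(X, W)   # M-step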
Other Interpretations
The evidence lower bound can be rewritten as:
$$\mathrm{ELBO}(x;Q,\theta) = E_{z\sim Q}[\log p(x \mid z;\theta)] - D_{\mathrm{KL}}(Q \,\|\, p_z)$$
where p_z denotes the prior distribution over z. When this prior is fixed (does not depend on θ), the KL-divergence term does not depend on θ, so maximizing the ELBO over θ is equivalent to maximizing the expected conditional log-likelihood E_{z∼Q}[log p(x∣z;θ)].
Similarly, the evidence lower bound can also be rewritten as:
$$\mathrm{ELBO}(x;Q,\theta) = \log p(x;\theta) - D_{\mathrm{KL}}(Q \,\|\, p_{z\mid x})$$
This shows that, for a fixed θ, the ELBO is maximized over Q when the KL divergence between Q(z) and p(z∣x;θ) is minimized, i.e. when they are equal. This is exactly the choice of Q we made while deriving the algorithm above.
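Finally, this second decomposition is easy to verify numerically with the same toy mixture (illustrative; joint, marginal, and elbo are the helpers sketched earlier):

def kl(Q, P):
    """KL divergence D_KL(Q ‖ P) for discrete distributions."""
    return sum(Q[j] * np.log(Q[j] / P[j]) for j in range(len(Q)) if Q[j] > 0)

x = 0.5
Q = np.array([0.8, 0.2])                                            # an arbitrary Q(z)
posterior = np.array([joint(x, j) for j in range(2)]) / marginal(x)

print(elbo(x, Q))
print(np.log(marginal(x)) - kl(Q, posterior))  # should match the line above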