Gaussian Mixture Models
Suppose that we are given a training set x^{(1)}, …, x^{(n)} and we wish to model the data by specifying a joint distribution p(x^{(i)}, z^{(i)}), where z^{(i)} ∼ Multinomial(ϕ) is a latent variable with p(z^{(i)} = j) = ϕ_j and ∑_{j=1}^{k} ϕ_j = 1, and where k is the number of values that z^{(i)} can take.
Moreover, we assume that x^{(i)} | z^{(i)} = j ∼ N(μ_j, Σ_j). Gaussian mixture models are similar to the K-means algorithm, except that we allow for overlapping clusters and each cluster follows a Gaussian distribution.
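To make the generative story concrete, here is a minimal sketch of sampling from such a model with NumPy; the parameter values below (phi, mu, sigma) are made-up illustrative numbers, not anything specified in the text.

import numpy as np

# Hypothetical 2-component GMM in 2 dimensions (illustrative values only).
phi = np.array([0.3, 0.7])                      # p(z^(i) = j) = phi_j, sums to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # mu_j: one mean per component
sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # Sigma_j: one covariance per component

rng = np.random.default_rng(0)

def sample_gmm(n):
    # Draw z^(i) ~ Multinomial(phi), then x^(i) | z^(i) = j ~ N(mu_j, Sigma_j).
    z = rng.choice(len(phi), size=n, p=phi)
    x = np.array([rng.multivariate_normal(mu[j], sigma[j]) for j in z])
    return x, z

X, Z = sample_gmm(500)   # the learning algorithm only gets to see X; Z stays latent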
To fit the parameters, we need to maximize the log likelihood:
ℓ(ϕ, μ, Σ) = ∑_{i=1}^{n} log p(x^{(i)}; ϕ, μ, Σ)
= ∑_{i=1}^{n} log ∑_{z^{(i)}=1}^{k} p(x^{(i)} | z^{(i)}; μ, Σ) p(z^{(i)}; ϕ)
The random variables z^{(i)} indicate which of the k Gaussians each x^{(i)} came from. Note that if we knew what the z^{(i)}'s were, the maximum likelihood problem would be easy and almost identical to that of Gaussian Discriminant Analysis.
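To make that remark concrete (a step not written out here), if the z^{(i)}'s were observed, the log likelihood would decompose as

ℓ(ϕ, μ, Σ) = ∑_{i=1}^{n} [ log p(x^{(i)} | z^{(i)}; μ, Σ) + log p(z^{(i)}; ϕ) ],

which is maximized in closed form: ϕ_j is the fraction of points with z^{(i)} = j, μ_j is the sample mean of those points, and Σ_j is their sample covariance, just as in Gaussian Discriminant Analysis (except that here each class gets its own Σ_j).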
Since there is no way for us to take the derivative of the above equation, set it to zero, and find a closed-form solution, we need to use the Expectation-Maximization (EM) algorithm instead.
In the E step, for all values of i and j, we find p(z^{(i)} = j | x^{(i)}; ϕ, μ, Σ) using the current parameters and Bayes' rule:
p(z^{(i)} = j | x^{(i)}; ϕ, μ, Σ) = p(x^{(i)} | z^{(i)} = j; μ, Σ) p(z^{(i)} = j; ϕ) / ∑_{l=1}^{k} p(x^{(i)} | z^{(i)} = l; μ, Σ) p(z^{(i)} = l; ϕ)
When we plug in the density of the Gaussian distribution, this takes a form very similar to the softmax function. So in the E step, we find our estimated probability distribution over the value z^{(i)} should take for each x^{(i)}.
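As a rough illustration of that softmax connection (a sketch, assuming SciPy is available and that X, phi, mu, sigma are NumPy arrays shaped as in the sampling example above), the responsibilities can be computed in log space and normalized with a softmax. This is the same computation the E-step code further below performs with densities instead of log densities.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import softmax

def responsibilities(X, phi, mu, sigma):
    # log score_j(x) = log p(x | z = j; mu_j, sigma_j) + log p(z = j; phi)
    log_scores = np.column_stack([
        multivariate_normal.logpdf(X, mean=mu[j], cov=sigma[j]) + np.log(phi[j])
        for j in range(len(phi))
    ])
    # Normalizing over j by Bayes' rule is exactly a softmax of the log scores.
    return softmax(log_scores, axis=1)   # shape (n, k): entry (i, j) is w_j^{(i)}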
Now, for the M step, we hold the distributions Q_i fixed and find the values of our parameters that maximize our new ELBO, where for each i and j:
w_j^{(i)} = Q_i(z^{(i)} = j) = p(z^{(i)} = j | x^{(i)}; ϕ, μ, Σ)
Therefore we need to maximize the following:
∑_{i=1}^{n} ∑_{j=1}^{k} Q_i(z^{(i)} = j) log [ p(x^{(i)} | z^{(i)} = j; μ, Σ) p(z^{(i)} = j; ϕ) / Q_i(z^{(i)} = j) ]
= ∑_{i=1}^{n} ∑_{j=1}^{k} w_j^{(i)} log [ p(x^{(i)} | z^{(i)} = j; μ, Σ) p(z^{(i)} = j; ϕ) / w_j^{(i)} ]
When we take the derivative of the above expression with respect to each of our parameters ϕ, μ, Σ and set it equal to 0, we get our update equations for each of the parameters (a sketch of the derivation for ϕ_l is given right after the updates):
ϕ_l = (1/n) ∑_{i=1}^{n} w_l^{(i)}
μ_l = ∑_{i=1}^{n} w_l^{(i)} x^{(i)} / ∑_{i=1}^{n} w_l^{(i)}
Σ_l = ∑_{i=1}^{n} w_l^{(i)} (x^{(i)} − μ_l)(x^{(i)} − μ_l)^T / ∑_{i=1}^{n} w_l^{(i)}
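For instance, to see where the ϕ_l update comes from, keep only the terms of the ELBO that involve ϕ and add a Lagrange multiplier β for the constraint ∑_{j=1}^{k} ϕ_j = 1:

∂/∂ϕ_l [ ∑_{i=1}^{n} ∑_{j=1}^{k} w_j^{(i)} log ϕ_j + β (∑_{j=1}^{k} ϕ_j − 1) ] = (∑_{i=1}^{n} w_l^{(i)}) / ϕ_l + β = 0,

so ϕ_l ∝ ∑_{i=1}^{n} w_l^{(i)}; summing over l and using ∑_{j=1}^{k} w_j^{(i)} = 1 gives −β = n, which yields the ϕ_l update above. The μ_l and Σ_l updates follow similarly by setting the corresponding (unconstrained) gradients to zero.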
Repeat until convergence {

    (E-step) For each i and j, set

        w_j^{(i)} ← p(z^{(i)} = j | x^{(i)}; ϕ, μ, Σ).

    (M-step) For each j, set

        ϕ_j ← (1/n) ∑_{i=1}^{n} w_j^{(i)}
        μ_j ← ∑_{i=1}^{n} w_j^{(i)} x^{(i)} / ∑_{i=1}^{n} w_j^{(i)}
        Σ_j ← ∑_{i=1}^{n} w_j^{(i)} (x^{(i)} − μ_j)(x^{(i)} − μ_j)^T / ∑_{i=1}^{n} w_j^{(i)}

}
In Python (using NumPy and SciPy for the Gaussian density), the E step can be written as:

import numpy as np
from scipy.stats import multivariate_normal

def e_step_gmm(X, phi, mu, sigma):
    """E-step: compute w_j^(i) = p(z^(i) = j | x^(i); φ, μ, Σ) for all i, j."""
    n, k = len(X), len(phi)
    w = np.zeros((n, k))
    for j in range(k):
        # Numerator of Bayes' rule: p(x^(i) | z^(i) = j; μ, Σ) p(z^(i) = j; φ)
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
    # Normalize over j so each row sums to 1 (the denominator of Bayes' rule).
    w /= w.sum(axis=1, keepdims=True)
    return w
def m_step_gmm(X, w):
    """M-step: update φ, μ, Σ using the responsibilities w (shape (n, k))."""
    n, d = X.shape
    k = w.shape[1]
    # Effective number of points assigned to each component: N_j = Σ_i w_j^(i)
    N = w.sum(axis=0)
    # Mixing coefficients: φ_j ← (1/n) Σ_i w_j^(i)
    phi_new = N / n
    # Means: μ_j ← Σ_i w_j^(i) x^(i) / Σ_i w_j^(i)
    mu_new = (w.T @ X) / N[:, None]
    # Covariances: Σ_j ← Σ_i w_j^(i) (x^(i) - μ_j)(x^(i) - μ_j)^T / Σ_i w_j^(i)
    sigma_new = np.zeros((k, d, d))
    for j in range(k):
        diff = X - mu_new[j]                               # shape (n, d)
        sigma_new[j] = (w[:, j, None] * diff).T @ diff / N[j]
    return phi_new, mu_new, sigma_new
for iteration in range(100):
    # E-step: for each i and j, set w_j^(i) ← p(z^(i) = j | x^(i); φ, μ, Σ)
    w = e_step_gmm(X, phi, mu, sigma)
    # M-step: update φ, μ, Σ using the current responsibilities
    phi, mu, sigma = m_step_gmm(X, w)
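For completeness, here is one hypothetical way to initialize the parameters and to monitor the log likelihood for convergence, reusing e_step_gmm, m_step_gmm, and the imports above; the initialization scheme and tolerance are illustrative choices, not something prescribed by the text.

rng = np.random.default_rng(0)
n, d = X.shape
k = 2                                        # number of components, chosen for illustration

# Illustrative initialization: uniform mixing weights, k random data points as means,
# identity covariances.
phi = np.full(k, 1.0 / k)
mu = X[rng.choice(n, size=k, replace=False)]
sigma = np.array([np.eye(d) for _ in range(k)])

prev_ll = -np.inf
for iteration in range(100):
    w = e_step_gmm(X, phi, mu, sigma)
    phi, mu, sigma = m_step_gmm(X, w)
    # Log likelihood ℓ(φ, μ, Σ); EM guarantees this never decreases.
    ll = np.sum(np.log(sum(phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
                           for j in range(k))))
    if ll - prev_ll < 1e-6:                  # stop once the improvement is negligible
        break
    prev_ll = ll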
In the E step, we estimate the probability of each z^{(i)} given x^{(i)} using the current parameters ϕ, μ, Σ.
In the M step, we update the values of our parameters to maximize the ELBO, which, for the Q_i chosen in the E step, is equal to the log likelihood log p(x; ϕ, μ, Σ) at the current values of ϕ, μ, Σ.
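To spell out that last claim (a standard step not shown above): by Jensen's inequality, for any distributions Q_i over z^{(i)},

ℓ(ϕ, μ, Σ) = ∑_{i=1}^{n} log ∑_{j=1}^{k} Q_i(z^{(i)} = j) [ p(x^{(i)}, z^{(i)} = j; ϕ, μ, Σ) / Q_i(z^{(i)} = j) ]
≥ ∑_{i=1}^{n} ∑_{j=1}^{k} Q_i(z^{(i)} = j) log [ p(x^{(i)}, z^{(i)} = j; ϕ, μ, Σ) / Q_i(z^{(i)} = j) ],

and the inequality is tight exactly when Q_i(z^{(i)} = j) = p(z^{(i)} = j | x^{(i)}; ϕ, μ, Σ), which is what the E step sets w_j^{(i)} to.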