Variational Autoencoders
Let $z$ be a latent variable with $z \sim \mathcal{N}(0, I)$, and let $\theta$ be the collection of weights of a neural network $g(z;\theta)$ that maps $z \in \mathbb{R}^k$ to $\mathbb{R}^d$.
Also, let $x$ given $z$ follow a normal distribution, $x \mid z \sim \mathcal{N}(g(z;\theta), \sigma^2 I)$. We can fit this model using an Expectation-Maximization-style algorithm.
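As a concrete sketch (not part of the original setup): the decoder $g(z;\theta)$ could be a small fully connected network, and sampling $x \mid z$ just adds Gaussian noise to its output. The dimensions, layer sizes, and fixed $\sigma$ below are illustrative assumptions.

```python
import torch
import torch.nn as nn

k, d = 8, 784          # assumed latent and data dimensions (illustrative)
sigma = 1.0            # assumed fixed noise scale for p(x | z)

# g(z; theta): maps z in R^k to the mean of x | z in R^d
g = nn.Sequential(
    nn.Linear(k, 256), nn.ReLU(),
    nn.Linear(256, d),
)

z = torch.randn(1, k)                      # z ~ N(0, I)
x = g(z) + sigma * torch.randn(1, d)       # x | z ~ N(g(z; theta), sigma^2 I)
```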
Note that for Gaussian Mixture Models, the optimal choice was $Q(z) = p(z \mid x; \theta)$, and we found it using Bayes' rule. We were able to do so because $z$ was discrete and could take only $k$ values.
However, for more complex models like the Variational Autoencoder, $z$ is continuous. Therefore, it is intractable to compute $p(z \mid x; \theta)$ explicitly. Instead, we will try to find an approximation of $p(z \mid x; \theta)$.
Also note that we wanted to find $Q(z) = p(z \mid x; \theta)$ so that our ELBO would be tight with $\log p(x;\theta)$.
Since
$$\mathrm{ELBO}(x; Q, \theta) = \log p(x;\theta) - D_{\mathrm{KL}}\!\left(Q \,\|\, p_{z \mid x}\right),$$
we have
$$\log p(x;\theta) = \mathrm{ELBO}(x; Q, \theta) + D_{\mathrm{KL}}\!\left(Q \,\|\, p_{z \mid x}\right).$$
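For reference, this identity follows from the definition of the ELBO as $\mathbb{E}_{z \sim Q}\!\left[\log \frac{p(x, z; \theta)}{Q(z)}\right]$ (the form used further below), together with $p(x, z; \theta) = p(z \mid x; \theta)\, p(x; \theta)$:

$$
\begin{aligned}
\mathrm{ELBO}(x; Q, \theta)
&= \mathbb{E}_{z \sim Q}\!\left[\log \frac{p(x, z; \theta)}{Q(z)}\right]
 = \mathbb{E}_{z \sim Q}\!\left[\log \frac{p(z \mid x; \theta)\, p(x; \theta)}{Q(z)}\right] \\
&= \log p(x; \theta) - \mathbb{E}_{z \sim Q}\!\left[\log \frac{Q(z)}{p(z \mid x; \theta)}\right]
 = \log p(x; \theta) - D_{\mathrm{KL}}\!\left(Q \,\|\, p_{z \mid x}\right).
\end{aligned}
$$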
From this we can see that if we choose a $Q(z)$ that maximizes our ELBO for a given value of $\theta$, then the KL divergence is minimized.
From this, we can find an approximation of $p(z \mid x; \theta)$ by choosing a $Q$ from our family of distributions $\mathcal{Q}$:
$$p(z \mid x; \theta) \approx \underset{Q \in \mathcal{Q}}{\arg\max}\left(\max_{\theta} \mathrm{ELBO}(x; Q, \theta)\right)$$
We make a mean field assumption: $Q(z)$ is a distribution with independent coordinates, which means we can decompose it as $Q(z) = Q_1(z_1) \cdots Q_k(z_k)$.
Finally, we assume that $Q$ is a normal distribution whose mean and variance come from the neural networks $q(x;\phi)$ and $v(x;\psi)$ respectively. We make the covariance matrix diagonal to enforce our mean field assumption. We can write this as:
$$Q_i(z^{(i)}) = \mathcal{N}\!\left(q(x^{(i)}; \phi),\ \operatorname{diag}\!\left(v(x^{(i)}; \psi)\right)^2\right)$$
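A matching sketch of the encoder side, continuing the illustrative code above (same assumed dimensions $k$ and $d$): two separate networks produce the mean $q(x;\phi)$ and the per-coordinate standard deviation $v(x;\psi)$; the Softplus head is just one way to keep the standard deviation positive.

```python
# q(x; phi): mean of Q(z), an R^d -> R^k network
q_net = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))

# v(x; psi): per-coordinate standard deviation of Q(z); Softplus keeps it positive
v_net = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k), nn.Softplus())

x = torch.randn(1, d)                 # a stand-in data point
mean, std = q_net(x), v_net(x)        # Q(z) = N(mean, diag(std)^2): mean-field Gaussian
```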
We use this $Q_i(z^{(i)})$ to find our ELBO for the M step:
$$\mathrm{ELBO}(x; \phi, \psi, \theta) = \sum_{i=1}^{n} \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]$$
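Since $p(x^{(i)}, z^{(i)}; \theta) = p(x^{(i)} \mid z^{(i)}; \theta)\, p(z^{(i)})$, with prior $p_{z^{(i)}} = \mathcal{N}(0, I)$, each term of this ELBO can be rewritten as a reconstruction term minus a KL term, which is the form the gradients below rely on:
$$
\mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]
= \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\log p(x^{(i)} \mid z^{(i)}; \theta)\right]
- D_{\mathrm{KL}}\!\left(Q_i \,\|\, p_{z^{(i)}}\right)
$$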
Note that we no longer need the E step because we are directly using our $Q_i(z^{(i)})$. Therefore, instead of alternating maximization, we can use gradient ascent.
$$\theta \leftarrow \theta + \eta \nabla_{\theta} \mathrm{ELBO}(x^{(i)}; \phi, \psi, \theta)$$
$$\phi \leftarrow \phi + \eta \nabla_{\phi} \mathrm{ELBO}(x^{(i)}; \phi, \psi, \theta)$$
$$\psi \leftarrow \psi + \eta \nabla_{\psi} \mathrm{ELBO}(x^{(i)}; \phi, \psi, \theta)$$
Taking the derivative of our ELBO with respect to $\theta$, we get:
$$\nabla_{\theta} \mathrm{ELBO}(x; \phi, \psi, \theta) = \sum_{i=1}^{n} \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\nabla_{\theta} \log p(x^{(i)} \mid z^{(i)}; \theta)\right]$$
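In code, this expectation can be estimated by sampling $z^{(i)} \sim Q_i$ and backpropagating $\log p(x^{(i)} \mid z^{(i)}; \theta)$ through the decoder only. A single-sample sketch, reusing the illustrative networks defined above (the `.detach()` mirrors the fact that this particular gradient ignores $\phi$ and $\psi$):

```python
mean, std = q_net(x), v_net(x)
z = torch.normal(mean, std).detach()   # sample z ~ Q_i; detached, so no gradient reaches phi or psi
log_p_x_given_z = -((x - g(z)) ** 2).sum() / (2 * sigma ** 2)   # Gaussian log-likelihood, up to a constant
log_p_x_given_z.backward()             # fills .grad for theta (the decoder's parameters)
```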
Similarly, taking the derivative of our ELBO with respect to $\phi$ and $\psi$, we get:
$$\nabla_{\phi} \mathrm{ELBO}(x; \phi, \psi, \theta) = \sum_{i=1}^{n} \left[ \mathbb{E}_{\xi^{(i)} \sim \mathcal{N}(0, 1)}\!\left[\nabla_{\phi} \log p(x^{(i)} \mid \hat{z}^{(i)}; \theta)\right] - \nabla_{\phi} D_{\mathrm{KL}}\!\left(Q_i \,\|\, p_{z^{(i)}}\right) \right]$$
$$\nabla_{\psi} \mathrm{ELBO}(x; \phi, \psi, \theta) = \sum_{i=1}^{n} \left[ \mathbb{E}_{\xi^{(i)} \sim \mathcal{N}(0, 1)}\!\left[\nabla_{\psi} \log p(x^{(i)} \mid \hat{z}^{(i)}; \theta)\right] - \nabla_{\psi} D_{\mathrm{KL}}\!\left(Q_i \,\|\, p_{z^{(i)}}\right) \right]$$
where $\hat{z}^{(i)}$ is just the reparameterized version of $z^{(i)}$:
$$\hat{z}^{(i)} = q(x^{(i)}; \phi) + v(x^{(i)}; \psi) \odot \xi^{(i)}, \qquad \xi^{(i)} \sim \mathcal{N}(0, I)$$
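In code, the reparameterization is a single line; because $\xi^{(i)}$ is sampled outside the computation graph, $\hat{z}^{(i)}$ becomes a differentiable function of $\phi$ and $\psi$ (names reused from the sketches above):

```python
xi = torch.randn(x.shape[0], k)        # xi ~ N(0, I), sampled independently of phi and psi
z_hat = q_net(x) + v_net(x) * xi       # z_hat = q(x; phi) + v(x; psi) ⊙ xi
# Any objective built from z_hat now backpropagates into both encoder networks.
```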
We can find all three derivatives by backpropagating through their respective neural networks: $p(x^{(i)} \mid z^{(i)}; \theta)$ depends on the decoder network $g(z^{(i)}; \theta)$, while $Q_i(z^{(i)})$ depends on the encoder networks $q(x^{(i)}; \phi)$ and $v(x^{(i)}; \psi)$.
Of course, since we still have an expectation, we have to take multiple samples of $z^{(i)}$ (for $\theta$) and of $\xi^{(i)}$ (for $\phi$ and $\psi$) and then average the resulting derivatives.
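Putting the pieces together, one gradient-ascent step could look like the sketch below, which averages the reconstruction term over a few samples of $\xi^{(i)}$ and uses the closed-form KL between $\mathcal{N}(\text{mean}, \operatorname{diag}(\text{std})^2)$ and the prior $\mathcal{N}(0, I)$. The sample count and learning rate are arbitrary, and all network names come from the earlier sketches; this is an illustration of the updates above, not a reference implementation.

```python
opt = torch.optim.SGD(
    list(g.parameters()) + list(q_net.parameters()) + list(v_net.parameters()),
    lr=1e-3,
)
num_samples = 5                        # number of xi samples to average over (arbitrary)

mean, std = q_net(x), v_net(x)         # parameters of Q_i for this data point
# Closed-form KL( N(mean, diag(std)^2) || N(0, I) ), summed over coordinates
kl = 0.5 * (std ** 2 + mean ** 2 - 1.0 - 2.0 * std.log()).sum()

recon = 0.0
for _ in range(num_samples):           # Monte Carlo average over xi ~ N(0, I)
    xi = torch.randn_like(mean)
    z_hat = mean + std * xi            # reparameterized sample of z
    recon = recon - ((x - g(z_hat)) ** 2).sum() / (2 * sigma ** 2)
recon = recon / num_samples            # estimate of E[log p(x | z; theta)] up to a constant

elbo = recon - kl                      # Monte Carlo estimate of the ELBO for this example
opt.zero_grad()
(-elbo).backward()                     # ascent on the ELBO = descent on -ELBO
opt.step()                             # updates theta, phi, and psi together
```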