The EM algorithm is used to find (local) maximum likelihood estimates of the parameters of a statistical model in cases where the likelihood equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. In this post, we will describe how the algorithm works, prove its correctness, and finally walk through a concrete example that is also its most common use case.
The typical setting of the EM algorithm is a parameter estimation problem in which we have
- A training set $\mathbf{X} = (x^{(1)}, \dots, x^{(n)})$ of $n$ independent observed data points. Each may be discrete or continuous, and each data point may itself be a vector of observations.
- A set of unobserved latent data or missing values $\mathbf{Z} = (z^{(1)}, \dots, z^{(n)})$ associated with the data points. They are discrete, drawn from a fixed number of values, with one latent variable $z^{(i)}$ per observed unit.
- A vector of unknown parameters $\theta$, which are continuous, and are of two kinds:
  - Parameters that are associated with all data points
  - Parameters associated with a specific value of a latent variable (i.e., shared by all data points whose corresponding latent variable has that value)
We wish to fit the parameters $\theta$ of a model of the data (which may or may not involve $\mathbf{Z}$ explicitly), and the log marginal likelihood function of all data points to maximize is given by

$$L(\theta; \mathbf{X}) = \log p(\mathbf{X} \mid \theta) = \log \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \theta)$$
However, maximizing $L(\theta; \mathbf{X})$ explicitly might be difficult because the $\mathbf{Z}$ are latent random variables. So let's refine it by Bayes' theorem, using $p(\mathbf{X}, \mathbf{Z} \mid \theta) = p(\mathbf{Z} \mid \mathbf{X}, \theta)\, p(\mathbf{X} \mid \theta)$:

$$\log p(\mathbf{X} \mid \theta) = \log p(\mathbf{X}, \mathbf{Z} \mid \theta) - \log p(\mathbf{Z} \mid \mathbf{X}, \theta)$$
This equation will be used to prove the correctness. Let's denote it as equation $(1)$.
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:
**Expectation step (E step):** Define $Q(\theta \mid \theta^{(t)})$ as the expected value of the log likelihood function of $\theta$, with respect to the current conditional distribution of $\mathbf{Z}$, given $\mathbf{X}$ and the current estimates of the parameters $\theta^{(t)}$:

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X}, \theta^{(t)})}\left[\log p(\mathbf{X}, \mathbf{Z} \mid \theta)\right]$$
**Maximization step (M step):** Find the parameters that maximize this quantity:

$$\theta^{(t+1)} = \underset{\theta}{\operatorname{arg\,max}}\ Q(\theta \mid \theta^{(t)})$$
The trick of the EM algorithm is to improve $Q(\theta \mid \theta^{(t)})$ rather than directly improving $\log p(\mathbf{X} \mid \theta)$. Here we show that improvements to the former imply improvements to the latter.
We take the expectation over possible values of the unknown data $\mathbf{Z}$ under the current parameter estimate $\theta^{(t)}$ for both sides of equation $(1)$, by multiplying by $p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})$ and summing (or integrating) over $\mathbf{Z}$. The left-hand side is the expectation of a constant since it is independent of $\mathbf{Z}$, so we get:

$$\log p(\mathbf{X} \mid \theta) = Q(\theta \mid \theta^{(t)}) + H(\theta \mid \theta^{(t)})$$

where

$$H(\theta \mid \theta^{(t)}) = -\sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}) \log p(\mathbf{Z} \mid \mathbf{X}, \theta)$$

is defined by the negated sum it is replacing.
This last equation holds for every value of $\theta$ including $\theta = \theta^{(t)}$,

$$\log p(\mathbf{X} \mid \theta^{(t)}) = Q(\theta^{(t)} \mid \theta^{(t)}) + H(\theta^{(t)} \mid \theta^{(t)})$$
and subtracting this last equation from the previous equation gives

$$\log p(\mathbf{X} \mid \theta) - \log p(\mathbf{X} \mid \theta^{(t)}) = Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + H(\theta \mid \theta^{(t)}) - H(\theta^{(t)} \mid \theta^{(t)})$$
Based on Gibbs' inequality — the information entropy of a distribution $P$ is less than or equal to its cross entropy with any other distribution $Q$ — we have

$$H(\theta \mid \theta^{(t)}) \ge H(\theta^{(t)} \mid \theta^{(t)})$$
Hence, we can conclude that

$$\log p(\mathbf{X} \mid \theta) - \log p(\mathbf{X} \mid \theta^{(t)}) \ge Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)})$$
In words, choosing $\theta$ to improve $Q(\theta \mid \theta^{(t)})$ causes $\log p(\mathbf{X} \mid \theta)$ to improve at least as much.
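We can check this guarantee numerically. The sketch below runs EM on a one-dimensional mixture of two Gaussians and verifies that the observed-data log-likelihood never decreases between iterations; the function names (`log_likelihood`, `em_step`) and the synthetic data are my own for illustration, and the M-step formulas used here are the closed-form updates derived for the general case in the GMM section of this post.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a mixture of two 1-D Gaussians (200 points near -3, 300 near +3).
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 300)])

def log_likelihood(x, phi, mu, var):
    # log p(X | theta) = sum_i log sum_j phi_j * N(x_i; mu_j, var_j)
    dens = phi * np.exp(-((x[:, None] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

def em_step(x, phi, mu, var):
    # E-step: responsibilities w[i, j] = p(z_i = j | x_i; theta)
    dens = phi * np.exp(-((x[:, None] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    w = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates of the mixture weights, means, and variances
    n_j = w.sum(axis=0)
    phi = n_j / len(x)
    mu = (w * x[:, None]).sum(axis=0) / n_j
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / n_j
    return phi, mu, var

phi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
lls = []
for _ in range(50):
    lls.append(log_likelihood(x, phi, mu, var))
    phi, mu, var = em_step(x, phi, mu, var)

# The sequence of log-likelihoods is non-decreasing, exactly as the proof guarantees,
# and the estimated means recover the true component centers -3 and +3.
```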
A typical application of the EM algorithm is the Gaussian mixture model (GMM), which is a clustering problem set up as follows:
- A training set $\{x^{(1)}, \dots, x^{(n)}\}$, which is a sample of $n$ independent observations from a mixture of $k$ multivariate normal distributions of dimension $d$. (Since we are in the unsupervised learning setting, these points do not come with any labels.)
- Associated with each $x^{(i)}$, a latent variable $z^{(i)}$ that determines the component from which the observation originates, where $z^{(i)} \in \{1, \dots, k\}$ and the parameter $\phi_j$ gives $p(z^{(i)} = j)$.
- The model posits that each $x^{(i)}$ was generated by randomly choosing $z^{(i)}$ from $\{1, \dots, k\}$, and then $x^{(i)}$ was drawn from one of $k$ Gaussians depending on $z^{(i)}$. Then,

$$x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$$
- The parameters to estimate are $\phi$, $\mu$, and $\Sigma$. For simplicity, write $\phi = (\phi_1, \dots, \phi_k)$, $\mu = (\mu_1, \dots, \mu_k)$, and $\Sigma = (\Sigma_1, \dots, \Sigma_k)$.
And we wish to model the data by specifying the joint distribution $p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)})$. Hence, the log likelihood is

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)$$
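This generative story is easy to simulate. In the sketch below (the parameter values are made up for illustration), each $z^{(i)}$ is drawn from the multinomial given by $\phi$, and then $x^{(i)}$ is sampled from the Gaussian selected by $z^{(i)}$:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = np.array([0.3, 0.7])                    # mixing weights, sum to 1
mus = [np.zeros(2), np.array([5.0, 5.0])]     # component means (d = 2)
sigmas = [np.eye(2), 0.5 * np.eye(2)]         # component covariances

n = 1000
z = rng.choice(len(phi), size=n, p=phi)       # latent labels z^(i) ~ Multinomial(phi)
x = np.stack([rng.multivariate_normal(mus[j], sigmas[j]) for j in z])
```

Only `x` would be visible to the clustering algorithm; the labels `z` are exactly the latent variables EM has to reason about.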
However, if we set to zero the derivatives of this formula with respect to the parameters and try to solve, we’ll find that it is not possible to find the maximum likelihood estimates of the parameters in closed form. (Try this yourself at home.) This is where the EM algorithm comes in.
Given our current estimate of the parameters, the conditional distribution of the $z^{(i)}$ is determined by Bayes' theorem to be the normalized Gaussian density weighted by $\phi_j$:

$$w_j^{(i)} := p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{\phi_j\, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \phi_l\, \mathcal{N}(x^{(i)}; \mu_l, \Sigma_l)}$$
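This E-step can be sketched in a few lines of NumPy; the helper names (`gaussian_pdf`, `e_step`) are my own, not from a library:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density N(x; mu, sigma), evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mu
    # quad[i] = (x_i - mu)^T Sigma^{-1} (x_i - mu)
    quad = np.einsum("id,de,ie->i", diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

def e_step(x, phi, mus, sigmas):
    """Responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, Sigma)."""
    dens = np.stack([phi[j] * gaussian_pdf(x, mus[j], sigmas[j])
                     for j in range(len(phi))], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

# With two well-separated components, each point's responsibility concentrates
# on the nearer component, and every row of w sums to one.
x = np.array([[0.0, 0.0], [5.0, 5.0]])
w = e_step(x, np.array([0.5, 0.5]),
           [np.zeros(2), np.full(2, 5.0)],
           [np.eye(2), np.eye(2)])
```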
We need to maximize, with respect to our parameters $\phi, \mu, \Sigma$, the quantity

$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log\left[ p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi) \right]$$
Grouping together only the terms that depend on $\phi_j$, we find that we need to maximize

$$\sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j$$

subject to

$$\sum_{j=1}^{k} \phi_j = 1$$
So we construct the Lagrangian

$$\mathcal{L}(\phi) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j + \beta \left( \sum_{j=1}^{k} \phi_j - 1 \right)$$

where $\beta$ is the Lagrange multiplier. Taking the derivative, we find

$$\frac{\partial \mathcal{L}}{\partial \phi_j} = \sum_{i=1}^{n} \frac{w_j^{(i)}}{\phi_j} + \beta$$
Setting this to zero and solving, we get

$$\phi_j = \frac{\sum_{i=1}^{n} w_j^{(i)}}{-\beta}$$
Using the constraint that $\sum_{j=1}^{k} \phi_j = 1$ and the fact that $\sum_{j=1}^{k} w_j^{(i)} = 1$ (probabilities sum to 1), we easily find:

$$-\beta = \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} = n$$
We therefore have updates for the parameters $\phi_j$:

$$\phi_j = \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}$$
Turning to $\mu_j$, the gradient of $Q$ with respect to $\mu_j$ is

$$\nabla_{\mu_j} Q = \sum_{i=1}^{n} w_j^{(i)}\, \Sigma_j^{-1} \left( x^{(i)} - \mu_j \right)$$

Setting this to zero and solving for $\mu_j$ therefore yields the update rule

$$\mu_j = \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}$$
Setting the partial derivative of $Q$ with respect to $\Sigma_j$ to zero and solving therefore yields the update rule

$$\Sigma_j = \frac{\sum_{i=1}^{n} w_j^{(i)} \left( x^{(i)} - \mu_j \right) \left( x^{(i)} - \mu_j \right)^{\top}}{\sum_{i=1}^{n} w_j^{(i)}}$$
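The three update rules together form the M-step. Here is a NumPy sketch (the function name `m_step` is my own), taking the data `x` (shape $n \times d$) and the responsibility matrix `w` (shape $n \times k$) from the E-step:

```python
import numpy as np

def m_step(x, w):
    """One M-step: closed-form updates for phi_j, mu_j, Sigma_j."""
    n = x.shape[0]
    n_j = w.sum(axis=0)                  # effective number of points per component
    phi = n_j / n                        # phi_j = (1/n) * sum_i w_j^(i)
    mus = (w.T @ x) / n_j[:, None]       # mu_j = responsibility-weighted mean
    sigmas = np.stack([
        (w[:, j, None] * (x - mus[j])).T @ (x - mus[j]) / n_j[j]
        for j in range(w.shape[1])
    ])                                   # Sigma_j = weighted sum of outer products
    return phi, mus, sigmas

# Sanity check: with hard (one-hot) responsibilities, the updates reduce to
# per-cluster empirical proportions, means, and covariances.
x = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
w = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
phi, mus, sigmas = m_step(x, w)
```

The hard-assignment sanity check also previews the comparison below: K-Means is essentially this M-step with one-hot `w`.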
More details on the derivative calculus
Note that $\Sigma$ is an invertible, symmetric, square, positive-definite matrix, so the following hold:
- The inverse of a symmetric matrix is symmetric, so $\left(\Sigma^{-1}\right)^{\top} = \Sigma^{-1}$
- $\Sigma$ is positive-definite, so $\Sigma^{-1}$ is positive-definite
- A positive-definite matrix is invertible, so $\Sigma^{-1}$ exists
So we get

$$\frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-1}, \qquad \frac{\partial}{\partial \Sigma} \left( x^{\top} \Sigma^{-1} x \right) = -\Sigma^{-1} x x^{\top} \Sigma^{-1}$$
Compared to K-Means, which uses hard assignments, GMM uses soft assignments, which are a good way to represent uncertainty. But GMM is limited in its number of features, since it requires storing a covariance matrix per component whose size is quadratic in the number of features. Even when the number of features is manageable, this algorithm may perform poorly on high-dimensional data.
This is because high-dimensional data:
- makes it difficult to cluster at all (based on statistical/theoretical arguments)
- causes numerical issues with Gaussian distributions
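The second point is easy to demonstrate concretely. In high dimensions the Gaussian density itself underflows double precision: for a standard normal in $d$ dimensions, the density at the mean is $(2\pi)^{-d/2}$, which is exactly $0.0$ in float64 once $d$ is large, so the responsibilities from the E-step become $0/0$ unless they are computed in log space (e.g., with the log-sum-exp trick). A minimal sketch:

```python
import numpy as np

d = 2000
# Peak density of a standard d-dimensional Gaussian: N(mu; mu, I) = (2*pi)^(-d/2)
peak = (2 * np.pi) ** (-d / 2)            # underflows to exactly 0.0 in float64
log_peak = -(d / 2) * np.log(2 * np.pi)   # the same quantity, safe in log space
```

This is why practical GMM implementations work with log-densities throughout.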