Expectation-Maximization Algorithm Explained

The EM algorithm is used to find (local) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. In this post, we will talk about how the algorithm works and then prove its correctness, finally we will show a concrete yet the most common use case where the algorithm is applied.


The typical setting of the EM algorithm is a parameter estimation problem in which we have

  • A training set X={x1,,xn}\mathbf{X} = \{ x_1, \ldots, x_n \} consisting of nn independent observed data points. Each may be discrete or continuous. Associated with each data point may be a vector of observations Y={y1,,yn}\mathbf{Y} = \{ y_1, \ldots, y_n \}.
  • A set of unobserved latent data or missing values Z={z1,,zn}\mathbf{Z} = \{ z_1, \ldots, z_n \} associated with each data point. They are discrete, drawn from a fixed number of values, and with one latent variable per observed unit.
  • A vector of unknown parameters θ\boldsymbol{\theta} which are continuous, and are of two kinds:
    • Parameters that are associated with all data points
    • Those associated with a specific value of a latent variable (i.e., associated with all data points which corresponding latent variable has that value).

We wish to fit the parameters of a model p(x;θ)p(x ; \boldsymbol{\theta}) (may or may not include yy) and the log marginal likelihood function of all data points to maximize is given by

L(θ;X)=  i=1nlogp(xi;θ)=  i=1nlogzip(xi,zi;θ)\begin{aligned} & L(\boldsymbol{\theta};\mathbf{X}) \\ = \ \ &\sum_{i=1}^{n} \log p(x_i ; \boldsymbol{\theta}) \\ = \ \ &\sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i ; \boldsymbol{\theta}) \end{aligned}

However, maximizing L(θ;X)L(\boldsymbol{\theta};\mathbf{X}) explicityly might be difficult because Z\mathbf{Z} are the latent random variables. So let’s refine it by Bayes’ theorem:

i=1nlogp(xi;θ)=  i=1n(logp(xi,zi;θ)logp(zixi;θ))=  i=1nlogp(xi,zi;θ)i=1nlogp(zixi;θ)\begin{aligned} & \sum_{i=1}^{n} \log p(x_i ; \boldsymbol{\theta}) \\ = \ \ & \sum_{i=1}^{n} (\log p(x_i, z_i ; \boldsymbol{\theta}) - \log p(z_i | x_i; \boldsymbol{\theta})) \\ = \ \ & \sum_{i=1}^{n} \log p(x_i, z_i ; \boldsymbol{\theta}) - \sum_{i=1}^{n} \log p(z_i | x_i; \boldsymbol{\theta}) \end{aligned}


L(θ;X,Z)=i=1nlogp(xi,zi;θ)L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) = \sum_{i=1}^{n} \log p(x_i, z_i ; \boldsymbol{\theta})


L(θ;X)=L(θ;X,Z)i=1nlogp(zixi;θ)L(\boldsymbol{\theta};\mathbf{X}) = L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) - \sum_{i=1}^{n} \log p(z_i | x_i; \boldsymbol{\theta})

This equation will be used to prove the correctness. Let’s denote it as equation (eq.)(eq.*)

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:

Expectation step (E step)

Define Q(θθ(t))Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) as the expected value of the log likelihood function of θ\boldsymbol{\theta}, with respect to the current conditional distribution of Z\mathbf{Z}, given X\mathbf{X} and the current estimates of the parameters θ(t)\boldsymbol{\theta}^{(t)}

Q(θθ(t))=  Ezx;θ(t)[L(θ;X,Z)]=  Ezx;θ(t)[i=1nlogp(xi,zi;θ)]=  i=1nEzixi;θ(t)[logp(xi,zi;θ)]=  i=1nzip(zixi;θ(t))logp(xi,zi;θ)\begin{aligned} & Q\left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) \\ = \ \ & \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left [ L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) \right ] \\ = \ \ & \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left [ \sum_{i=1}^{n} \log p(x_i, z_i ; \boldsymbol{\theta}) \right ] \\ = \ \ & \sum_{i=1}^{n} \mathrm{E}_{z_i \mid x_i; \boldsymbol{\theta}^{(t)}} \left [ \log p(x_i, z_i ; \boldsymbol{\theta}) \right ] \\ = \ \ & \sum_{i=1}^{n} \sum_{z_i} p(z_i \mid x_i; \boldsymbol{\theta}^{(t)}) \log p(x_i, z_i ; \boldsymbol{\theta}) \end{aligned}

Maximization step (M step)

Find the parameters that maximize this quantity:

θ(t+1)=argmaxθQ(θθ(t))\boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta}}{\arg \max} \, Q\left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}\right)

Proof of Correctness

The trick of EM algorithm is to improve Q(θθ(t))Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) rather than directly improving L(θ;X)L(\boldsymbol{\theta};\mathbf{X}). Here is shown that improvements to the former imply improvements to the latter.

We take the expectation over possible values of the unknown data ziz_i under the current parameter estimate θ(t)\boldsymbol{\theta}^{(t)} for both sides of (eq.)(eq.*) equation by multiplying p(zixi;θ(t))p(z_i \mid x_i; \boldsymbol{\theta}^{(t)}) and summing (or integrating) over zz . The left-hand side is the expectation of a constant since it is independent of ziz_i, so we get:

L(θ;X)=  Ezx;θ(t)[L(θ;X,Z)]Ezx;θ(t)[i=1nlogp(zixi;θ)]=  Q(θθ(t))i=1nzip(zixi;θ(t))logp(zixi;θ)=  Q(θθ(t))+H(θθ(t))\begin{aligned} & L(\boldsymbol{\theta};\mathbf{X}) \\ = \ \ & \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left [ L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) \right ] - \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left[\sum_{i=1}^{n} \log p(z_i | x_i; \boldsymbol{\theta}) \right] \\ = \ \ & Q \left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}\right) - \sum_{i=1}^{n}\sum_{z_i} p \left( z_i \mid x_i; \boldsymbol{\theta}^{(t)} \right) \log p(z_i \mid x_i ; \boldsymbol{\theta}) \\ = \ \ & Q \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) + H \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) \end{aligned}

where Hi(θθ(t))H_i \left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}\right) is defined by the negated sum it is replacing.

This last equation holds for every value of θ\boldsymbol{\theta} including θ=θ(t)\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)},

L(θ(t);X)=Q(θ(t)θ(t))+H(θ(t)θ(t))L\left( \boldsymbol{\theta}^{(t)};\mathbf{X} \right) = Q \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right) + H \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)

and subtracting this last equation from the previous equation gives

L(θ;X)L(θ(t);X)=Q(θθ(t))Q(θ(t)θ(t))+H(θθ(t))H(θ(t)θ(t))L(\boldsymbol{\theta};\mathbf{X}) - L\left( \boldsymbol{\theta}^{(t)};\mathbf{X} \right) = Q \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - Q \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right) + H \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - H \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)

Note that

H(θθ(t))H(θ(t)θ(t))=  i=1nzip(zixi;θ(t))logp(zixi;θ)crossentropy(zixi;θ(t),zixi;θ)[zip(zixi;θ(t))logp(zixi;θ(t))]entropy(zixi;θ(t))\begin{aligned} & H \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - H \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right) \\ = \ \ & \sum_{i=1}^{n} \underbrace{- \sum_{z_i} p \left( z_i \mid x_i; \boldsymbol{\theta}^{(t)} \right) \log p(z_i \mid x_i ; \boldsymbol{\theta})}_{cross-entropy(z_i \mid x_i; \boldsymbol{\theta}^{(t)}, z_i \mid x_i; \boldsymbol{\theta})} - \underbrace{\left[ - \sum_{z_i} p \left( z_i \mid x_i; \boldsymbol{\theta}^{(t)} \right) \log p(z_i \mid x_i ; \boldsymbol{\theta}^{(t)}) \right]}_{entropy(z_i \mid x_i; \boldsymbol{\theta}^{(t)})} \end{aligned}

Based on Gibbs’ inequality: the information entropy of a distribution P is less than or equal to its cross entropy with any other distribution Q, which means

H(θθ(t))H(θ(t)θ(t))H\left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) \geq H\left(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)

Hence, we can conclude that

L(θ;X)L(θ(t);X)Q(θθ(t))Q(θ(t)θ(t))L(\boldsymbol{\theta};\mathbf{X}) - L\left( \boldsymbol{\theta}^{(t)};\mathbf{X} \right) \geq Q \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - Q \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)

In words, choosing θ\boldsymbol{\theta} to improve Q(θθ(t))Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) causes L(θ;X)L(\boldsymbol{\theta};\mathbf{X}) to improve at least as much.

Alternatively, you can also prove the correctness by using Jensen’s equality. More details in this note.

Gaussian Mixture Model (GMM)

A typical application of EM algorithm is the gaussian mixture model (GMM) which is a clustering problem as the following:

  • A training set X={x1,,xn}\mathbf{X} = \{ x_1, \ldots, x_n \} which is a sample of nn independent observations from a mixture of kk multivariate normal distributions of dimension dd. (since we are in the unsupervised learning setting, these points do not come with any labels.)
  • Associated with the latent variables Z={z1,z2,,zn}\mathbf{Z} = \{ z_1,z_2,\ldots,z_n \} that determine the component from which the observation originates. zi Multinomial (ϕ)z_i \sim \text { Multinomial }(\phi) where ϕj0,j=1kϕj=1\phi_j \geq 0, \sum_{j=1}^{k} \phi_j=1 and the parameter ϕj\phi_j gives p(zi=j)p(z_i=j).
  • The model posits that each xix_i was generated by randomly choosing ziz_i from 1,,k{1, \ldots, k}, and then xix_i was drawn from one of kk Gaussians depending on ziz_i. Then, xizi=jN(μj,Σj)x_i | z_i=j \sim \mathcal{N}\left(\mu_{j}, \Sigma_{j}\right)
  • The parameters to estimate θ={ϕj,μj,Σjj1,,k}\boldsymbol{\theta} = \{\phi_j, \mu_j, \Sigma_j | j \in 1,\ldots,k\}. For simplicity, θj={ϕj,μj,Σj}\theta_j = \{\phi_j, \mu_j, \Sigma_j\}

And we wish to model the data by specifying the joint distribution p(xi,zi)p(x_i, z_i). Hence, the log likelihood is

L(θ;X)=  i=1nlogj=1kp(xi,zi=j;θj)=  i=1nlogj=1kp(xizi=j;μj,Σj)p(zi=j;ϕj)=  i=1nlogj=1kf(xi;μj,Σj)ϕj\begin{aligned} & L(\boldsymbol{\theta} ; \mathbf{X}) \\ = \ \ & \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x_i, z_i = j; \theta_j) \\ = \ \ & \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x_i \mid z_i = j; \mu_j, \Sigma_j) \cdot p(z_i = j; \phi_j) \\ = \ \ & \sum_{i=1}^{n} \log \sum_{j=1}^{k} f\left(x_i ; \mu_j, \Sigma_j\right) \cdot \phi_j \end{aligned}

However, if we set to zero the derivatives of this formula with respect to the parameters and try to solve, we’ll find that it is not possible to find the maximum likelihood estimates of the parameters in closed form. (Try this yourself at home.) This is where the EM algorithm comes in.


  • Define Q(θθ(t))Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})

    L(θ;X,Z)=i=1nlogf(xi;μj,Σj)ϕjL(\theta ; \mathbf{X}, \mathbf{Z}) = \sum_{i=1}^{n} \log f\left(x_i ; \mu_j, \Sigma_j\right) \cdot \phi_{j}


    f(xi;μj,Σj)=1(2π)d/2Σj1/2exp(12(xiμj)TΣj1(xiμj))f\left(x_i ; \mu_j, \Sigma_j\right) = \frac{1}{(2 \pi)^{d / 2}\left|\Sigma_{j}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x_i-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x_i-\mu_{j}\right)\right)


    Q(θθ(t))=i=1nj=1kp(zi=jxi;θ(t))logf(xi;μj,Σj)ϕjQ(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} p(z_i = j \mid x_i; \boldsymbol{\theta}^{(t)}) \cdot \log f\left(x_i ; \mu_j, \Sigma_j\right) \cdot \phi_j

  • Compute wij(t)w_{ij}^{(t)}
    Given our current estimate of the parameters θ(t)\theta^{(t)}, the conditional distribution of the ziz_i is determined by Bayes theorem to be normalized Gaussian density weighted by ϕj\phi_j:

    wij(t)=  p(zi=jxi;θ(t))=  p(xi,zi=j;θ(t))p(xi;θ(t))=  p(xi,zi=j;θ(t))l=1kp(xi,zi=l;θ(t))=  f(xi;μj(t),Σj(t))ϕj(t)l=1kf(xi;μl(t),Σl(t))ϕl(t)\begin{aligned} & w_{ij}^{(t)} \\ = \ \ & p(z_i = j \mid x_i; \boldsymbol{\theta}^{(t)}) \\ = \ \ & \frac{p(x_i, z_i = j ; \boldsymbol{\theta}^{(t)})}{p(x_i; \boldsymbol{\theta}^{(t)})} \\ = \ \ & \frac{p(x_i, z_i = j ; \boldsymbol{\theta}^{(t)})}{\sum_{l = 1}^{k} p(x_i, z_i = l; \boldsymbol{\theta}^{(t)})} \\ = \ \ & \frac{f\left(x_i ; \mu_j^{(t)}, \Sigma_j^{(t)}\right) \cdot \phi_{j}^{(t)}}{\sum_{l = 1}^{k} f\left(x_i ; \mu_l^{(t)}, \Sigma_l^{(t)}\right) \cdot \phi_{l}^{(t)}} \end{aligned}


We need to maximize, with respect to our parameters θ(t)\boldsymbol{\theta}^{(t)}, the quantity

Q(θθ(t))=  i=1nj=1kwij(t)log1(2π)d/2Σj1/2exp(12(xiμj)TΣj1(xiμj))ϕj=  i=1nj=1kwij(t)[logϕj12logΣj12(xiμj)TΣj1(xiμj)d2log(2π)]\begin{aligned} & Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \\ = \ \ & \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \log \frac{1}{(2 \pi)^{d / 2}\left|\Sigma_{j}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x_i-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x_i-\mu_{j}\right)\right) \cdot \phi_{j} \\ = \ \ & \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \left[\log \phi_j - \frac{1}{2} \log \left|\Sigma_{j}\right| - \frac{1}{2}(x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j) - \frac{d}{2} \log(2\pi) \right] \end{aligned}

  • Update ϕj\phi_j
    Grouping together only the terms that depend on ϕj\phi_j, we find that we need to maximize

    i=1nj=1kwij(t)logϕj\sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \log \phi_j

    with subject to

    j=1kϕj=1\sum_{j=1}^{k} \phi_j = 1

    So we construct the Lagrangian

    L(ϕ)=i=1nj=1kwij(t)logϕj+λ(j=1kϕj1)\mathcal{L}(\phi) = \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \log \phi_j + \lambda \left( \sum_{j=1}^{k} \phi_j - 1 \right)

    where λ\lambda is the Lagrange multiplier. Taking derivative, we find

    ϕjL(ϕ)=i=1nwij(t)ϕj+λ\frac{\partial}{\partial \phi_{j}} \mathcal{L}(\phi)=\sum_{i=1}^{n} \frac{w_{ij}^{(t)}}{\phi_{j}}+\lambda

    Setting this to zero and solving, we get

    ϕj=i=1nwij(t)λ\phi_{j}=\frac{\sum_{i=1}^{n} w_{ij}^{(t)}}{-\lambda}

    Using the constraint that j=1kϕj=1\sum_{j=1}^{k} \phi_j = 1 and knowing the fact that j=1kwij(t)=1\sum_{j=1}^{k} w_{ij}^{(t)} = 1 (probabilities sum to 1), we easily find:

    λ=i=1nj=1kwij(t)=i=1n1=n-\lambda=\sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{(t)} = \sum_{i=1}^{n} 1= n

    We therefore have updates for the parameters ϕj\phi_j:

    ϕj:=1ni=1nwij(t)\phi_{j} :=\frac{1}{n} \sum_{i=1}^{n} w_{ij}^{(t)}

  • Update μj\mu_j

    μjQ(θθ(t))=  μji=1n12wij(t)(xiμj)TΣj1(xiμj)=  12i=1nwij(t)μj(2μjTΣj1xiμjTΣj1μj)=  i=1nwij(t)(Σj1xiΣj1μj)\begin{aligned} & \frac{\partial}{\partial \mu_{j}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \\ = \ \ & \frac{\partial}{\partial \mu_{j}} \sum_{i=1}^{n} -\frac{1}{2} w_{ij}^{(t)} (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j) \\ = \ \ & \frac{1}{2} \sum_{i=1}^{n} w_{ij}^{(t)} \dfrac{\partial}{\partial \mu_{j}} \left( 2 \mu_j^T\Sigma_j^{-1}x_i - \mu_j^T\Sigma_j^{-1}\mu_j \right) \\ = \ \ & \sum_{i=1}^{n} w_{ij}^{(t)} \left( \Sigma_j^{-1} x_i - \Sigma_j^{-1}\mu_j \right) \end{aligned}

    Setting this to zero and solving for μj\mu_j therefore yields the update rule

    μj:=i=1nwij(t)xii=1nwij(t)\mu_{j} :=\frac{\sum_{i=1}^{n} w_{ij}^{(t)} x_i}{\sum_{i=1}^{n} w_{ij}^{(t)}}

  • Update Σj\Sigma_j

    ΣjQ(θθ(t))=  Σji=1n12wij(t)[logΣj+(xiμj)TΣj1(xiμj)]=  12i=1nwij(t)Σj[logΣj+(xiμj)TΣj1(xiμj)]=  12i=1nwij(t)(Σj1(xiμj)(xiμj)TΣj2)\begin{aligned} & \frac{\partial}{\partial \Sigma_{j}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \\ = \ \ & \frac{\partial}{\partial \Sigma_{j}} \sum_{i=1}^{n} -\frac{1}{2} w_{ij}^{(t)} \left[ \log\left|\Sigma_{j}\right| + (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j)\right] \\ = \ \ & -\frac{1}{2} \sum_{i=1}^{n} w_{ij}^{(t)} \dfrac{\partial}{\partial \Sigma_{j}} \left[ \log\left|\Sigma_{j}\right| + (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j)\right] \\ = \ \ & -\frac{1}{2} \sum_{i=1}^{n} w_{ij}^{(t)} \left( \Sigma_{j}^{-1} - (x_i-\mu_j)(x_i - \mu_j)^T \Sigma_{j}^{-2} \right) \end{aligned}

    Setting the partial derivative to zero and solving for Σj\Sigma_j therefore yields the update rule

    Σj:=i=1nwij(t)(xiμj)(xiμj)Ti=1nwij(t)\Sigma_{j} :=\frac{\sum_{i=1}^{n} w_{ij}^{(t)} (x_i-\mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{n} w_{ij}^{(t)}}

    More details on the derivative calculus
    Note that Σj\Sigma_{j} is invertible, symmetric, square, positive-definite matrix, then the following holds

    • The inverse of a symmetric matrix is symmetric, then (Σj1)T=Σj1(\Sigma_{j}^{-1})^T = \Sigma_{j}^{-1}
    • Σj\Sigma_{j} is positive-definite, then Σj1\Sigma_{j}^{-1} is positive-definite
    • Positive-definite matrix is invertible, then Σj2\Sigma_{j}^{-2} exists

    So we get

    ΣjlogΣj=  1ΣjΣjΣj=  1ΣjΣj(Σj1)T=  Σj1\begin{aligned} & \dfrac{\partial}{\partial \Sigma_{j}} \log\left|\Sigma_{j}\right| \\ = \ \ & \frac{1}{\left|\Sigma_{j}\right|} \frac{\partial}{\partial \Sigma_{j}} \left|\Sigma_{j}\right| \\ = \ \ & \frac{1}{\left|\Sigma_{j}\right|} \left|\Sigma_{j}\right| \cdot (\Sigma_{j}^{-1})^T \\ = \ \ & \Sigma_{j}^{-1} \end{aligned}


    Σj(xiμj)TΣj1(xiμj)=  (xiμj)(xiμj)TΣjΣj1=  (xiμj)(xiμj)TΣj2\begin{aligned} & \frac{\partial}{\partial \Sigma_{j}} (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j) \\ = \ \ & (x_i-\mu_j)(x_i - \mu_j)^T \frac{\partial}{\partial \Sigma_{j}} \Sigma_{j}^{-1} \\ = \ \ & - (x_i-\mu_j)(x_i - \mu_j)^T \Sigma_{j}^{-2} \\ \end{aligned}

More about GMM

Compared to K-Means which uses hard assignment, GMM uses soft assignment which a good way to represent uncertainty. But GMM is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data.
This is due to high-dimensional data:

  • making it difficult to cluster at all (based on statistical/theoretical arguments)
  • numerical issues with Gaussian distributions