11.6. Gaussian Mixture Models (GMM)

GMMs are more flexible than K-Means clustering [122]. However, a GMM starts from the assumption that the data points are Gaussian distributed: every cluster follows a Gaussian distribution. Thus, two parameters are needed to describe the shape of each cluster, the mean and the standard deviation. Taking a 2D problem as an example, the distribution of points in any cluster can take an elliptical shape represented by a 2D Gaussian distribution.
We use an optimization algorithm known as Expectation Maximization (EM) to find the parameters of the Gaussian for each cluster. In a typical clustering task, we need to cluster $I$ points. Each point, or sample, has $J$ features. We can imagine that the $I$ points are distributed in a $J$-dimensional space. We assume the points can be clustered into several groups. In GMM, each group of points is assumed to follow a Gaussian distribution, $\mathcal{N}$. The goal of GMM is to search for the Gaussian distributions that best describe the distribution of the points. Instead of a deterministic label stating that one point belongs to a specific group, as in KNN and K-Means, any point in GMM has different probabilities of belonging to different groups. Therefore, the input $X$ has a shape of $I \times J$. The probability that a point $x_i$ is generated is
$$
p_{i}=\sum_{k=1}^{K}\left[\pi_{k} \mathcal{N}_{k}\left(x_{i} \mid \mu, \Sigma\right)\right] \tag{11.8}
$$
where $\pi_k$ is the weight of the $k$th normal distribution (or group). To obtain the Gaussian distributions, we need to determine both the parameters for $\mathcal{N}$ and the weights $\pi$, which are related. For such a problem, the EM iterations are usually adopted. Accordingly, we first calculate some parameters such as those for $\mathcal{N}$ (i.e., $\mu$ and $\Sigma$) and then, based on the results, calculate the weights. The calculated weights are then used to update $\mathcal{N}$. This process continues until the improvement is less than a small tolerance.
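Before stepping through the EM procedure in detail, the short sketch below shows the same idea with scikit-learn's GaussianMixture, which fits a GMM by EM internally; the two-cluster toy data and all parameter values here are assumptions made purely for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy data: two elliptical 2D clusters (assumed for illustration)
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=200),
        rng.multivariate_normal([6, 3], [[1.0, -0.4], [-0.4, 0.6]], size=200),
    ])                                   # shape (I, J) = (400, 2)

    gmm = GaussianMixture(n_components=2, random_state=0)   # K = 2 groups
    gmm.fit(X)                           # EM iterations run internally

    print(gmm.means_)                    # estimated mu_k, shape (K, J)
    print(gmm.weights_)                  # estimated pi_k
    print(gmm.predict_proba(X[:3]))      # soft memberships per group

Each row of predict_proba sums to one, which is exactly the "different probabilities of belonging to different groups" described above.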
The detailed EM procedure for implementing the GMM is as follows.
1. Assume the data can be clustered into $K$ groups, e.g., 3. Generate $\mu$ and $\sigma$ for all the groups. When there is more than one variable (dimension), we use the covariance matrix $\Sigma$ (e.g., in NumPy, np.cov(Sample, rowvar=False))$^{2}$ instead of the standard deviation $\sigma$ (the square root of the variance). Meanwhile, $\mu$ becomes an array with $J$ elements for each normal distribution (group).
2. Calculate the probabilities of each sample in different groups. Thus, there are $\mathcal{N}_1(\mu_1, \Sigma_1), \cdots, \mathcal{N}_K(\mu_K, \Sigma_K)$. For any sample $x_i$, we will have $\mathcal{N}_1(x_i \mid \mu_1, \Sigma_1), \cdots, \mathcal{N}_K(x_i \mid \mu_K, \Sigma_K)$. Calculate $\mathcal{N}_k(x_i \mid \mu_k, \Sigma_k)$ using the following equation
$$
\mathcal{N}_{k}\left(x_{i} \mid \mu, \Sigma\right)=\sqrt{\frac{1}{(2 \pi)^{J} \operatorname{det}(\Sigma)}} \exp \left(-\frac{1}{2}\left(x_{i}-\mu\right)^{T} \Sigma^{-1}\left(x_{i}-\mu\right)\right) \tag{11.9}
$$
Therefore, there will be $I \times K$ probabilities in total.
3. Calculate the contribution of each $\mathcal{N}_k$ in the generation of any data point $x_i$:
$$
\gamma_{i, k}=\frac{\pi_{k} \mathcal{N}_{k}\left(x_{i} \mid \mu, \Sigma\right)}{\sum_{k=1}^{K}\left[\pi_{k} \mathcal{N}_{k}\left(x_{i} \mid \mu, \Sigma\right)\right]} \tag{11.10}
$$
where $\gamma_{i,k}$ can also be viewed as the weight of the point in the Gaussian distribution $\mathcal{N}_k$.
4. Update the Gaussian distributions, that is, obtain the new $\mu$ and $\Sigma$ for each Gaussian distribution. This is similar to updating the centroids of the groups in K-Means; the difference is that we use a (multivariate) distribution characterized by $\mu$ and $\Sigma$ instead of the coordinates of the centroids and circles marking the group boundary. The new $\mu$ and $\Sigma$ will be obtained with the weights of the points obtained in Step 3. In NumPy, this can be done as
    Mu[k] = np.average(X, axis=0, weights=R[:, k])      # new mean of group k
    E[k] = np.cov(X, rowvar=False, aweights=R[:, k])    # new covariance of group k
where R[:, k] holds one weight per sample, i.e., it has as many elements as X has rows.
5. Next, we update the weights of the Gaussian distributions as
$$
\pi_{k}=\frac{1}{I} \sum_{i=1}^{I} \gamma_{i, k} \tag{11.11}
$$
6. Repeat Steps 2-5 until the difference between the $\mu$ and/or $\Sigma$ values in two consecutive iterations, e.g., the root of the sum of squared differences, is less than a tolerance, e.g., $1 \times 10^{-3}$. This procedure generates the results in Fig. 11.6; a minimal NumPy sketch of the full loop is given after this list.
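Putting Steps 1-6 together, the following is a minimal NumPy sketch of the EM loop described above. The names gaussian_pdf and gmm_em, the random initialization, and the convergence check on $\mu$ are illustrative assumptions rather than the only possible implementation.

    import numpy as np

    def gaussian_pdf(X, mu, Sigma):
        """Eq. (11.9): density of each row of X under N(mu, Sigma)."""
        J = X.shape[1]
        diff = X - mu                                        # (I, J)
        inv = np.linalg.inv(Sigma)
        norm = np.sqrt(1.0 / ((2.0 * np.pi) ** J * np.linalg.det(Sigma)))
        quad = np.einsum("ij,jk,ik->i", diff, inv, diff)     # (x-mu)^T Sigma^-1 (x-mu)
        return norm * np.exp(-0.5 * quad)                    # (I,)

    def gmm_em(X, K, tol=1e-3, max_iter=200, seed=0):
        """A sketch of the EM loop in Steps 1-6 (illustrative, not the book's exact code)."""
        I, J = X.shape
        rng = np.random.default_rng(seed)
        # Step 1: initialize mu_k at random samples and Sigma_k as the data covariance
        Mu = X[rng.choice(I, size=K, replace=False)]
        Sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])
        Pi = np.full(K, 1.0 / K)                             # equal mixture weights
        for _ in range(max_iter):
            # Step 2: I x K probabilities N_k(x_i | mu_k, Sigma_k), Eq. (11.9)
            dens = np.column_stack([gaussian_pdf(X, Mu[k], Sigma[k]) for k in range(K)])
            # Step 3: responsibilities gamma_{i,k}, Eq. (11.10)
            R = Pi * dens
            R /= R.sum(axis=1, keepdims=True)
            # Step 4: update mu_k and Sigma_k with the responsibilities as weights
            Mu_new = np.array([np.average(X, axis=0, weights=R[:, k]) for k in range(K)])
            Sigma = np.array([np.cov(X, rowvar=False, aweights=R[:, k]) for k in range(K)])
            # Step 5: update the mixture weights pi_k, Eq. (11.11)
            Pi = R.mean(axis=0)
            # Step 6: stop when the means barely change between iterations
            converged = np.sqrt(np.sum((Mu_new - Mu) ** 2)) < tol
            Mu = Mu_new
            if converged:
                break
        return Mu, Sigma, Pi, R

Calling gmm_em(X, K=3) on an $I \times J$ data matrix returns the fitted means, covariances, mixture weights, and the $I \times K$ responsibility matrix; production implementations usually work with log-probabilities instead to avoid numerical underflow.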

11.6.1. Pros and Cons

GMM maintains a higher level of flexibility in cluster covariance than K-Means clustering because it uses the standard deviation (covariance) rather than only a distance to the centroid. Also, GMM does not assume spherical clusters, unlike algorithms that rely on Euclidean distances. The introduction of probabilities enables us to report the probability that a point belongs to each cluster in addition to the clustering result. However, we need to be aware that GMM assumes Gaussian distributions, needs sufficient data for each cluster, and requires us to specify the number of clusters. In addition, the algorithm may be sensitive to outliers and initialization conditions and may require more computing time.
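As a small, assumed illustration of this covariance flexibility, scikit-learn's GaussianMixture exposes a covariance_type option: 'full' gives each cluster its own elliptical shape, while 'spherical' approximates the round clusters that K-Means implicitly assumes; the toy data below are not from the original text.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Assumed toy data: one elongated (elliptical) cluster and one round cluster
    rng = np.random.default_rng(1)
    X = np.vstack([
        rng.multivariate_normal([0, 0], [[4.0, 1.8], [1.8, 1.0]], size=300),
        rng.multivariate_normal([8, 0], [[0.5, 0.0], [0.0, 0.5]], size=300),
    ])

    for cov_type in ("full", "spherical"):
        gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                              random_state=0).fit(X)
        # score() reports the average log-likelihood; higher means a better fit
        print(cov_type, gmm.score(X))

With elongated clusters, the 'full' setting can usually fit better because its covariance can stretch along the cluster.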

$^{2}$ rowvar=False specifies that a row is a sample instead of a variable. Thus, it generates a $J \times J$ covariance matrix.
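A quick check of this behavior, using assumed random data:

    import numpy as np

    # 100 samples with J = 3 features (assumed random data)
    Sample = np.random.default_rng(0).normal(size=(100, 3))
    print(np.cov(Sample, rowvar=False).shape)   # -> (3, 3), i.e., J x J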