GMMs are more flexible than K-Means clustering [122]. However, a GMM starts from the assumption that the data points are Gaussian distributed: every cluster follows a Gaussian distribution. Thus, two parameters are needed to describe the shape of each cluster: the mean and the standard deviation. Taking a 2D problem as an example, the points in any cluster can form an elliptical shape represented by a 2D Gaussian distribution.
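To make this concrete, the following short snippet (an illustration added here, with made-up values for the mean and covariance) samples points from a 2D Gaussian whose off-diagonal covariance terms tilt the cluster into an elliptical shape:

```python
import numpy as np

# Mean and covariance of a hypothetical 2D cluster (illustrative values only)
mu = np.array([2.0, -1.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])   # off-diagonal terms tilt the ellipse

rng = np.random.default_rng(0)
points = rng.multivariate_normal(mu, Sigma, size=500)   # shape (500, 2)

# The sample mean and covariance recover mu and Sigma approximately
print(points.mean(axis=0))
print(np.cov(points, rowvar=False))
```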
We use an optimization algorithm known as Expectation Maximization (EM) to find the parameters of the Gaussian for each cluster. In a typical clustering task, we need to cluster $I$ points. Each point, or sample, has $J$ features. We can imagine that the $I$ points are distributed in a $J$-dimensional space. We assume the points can be clustered into several groups. In GMM, each group of points is assumed to follow a Gaussian distribution, $\mathcal{N}$. The goal of GMM is to search for the Gaussian distributions that best describe the distribution of the points. Instead of a deterministic label stating that one point belongs to a specific group, as in KNN and K-Means, any point in GMM has different probabilities of belonging to different groups. Therefore, the input $X$ has a shape of $I \times J$. The probability that a point $x_i$ is generated is
$$p(x_i) = \sum_{k=1}^{K} \pi_k \mathcal{N}_k\left(x_i \mid \mu_k, \Sigma_k\right),$$
where $\pi_k$ is the weight of the $k$th normal distribution (or group) and $K$ is the number of groups. To obtain the Gaussian distributions, we need to determine both the parameters of $\mathcal{N}$ and the weights $\pi$, which are related. For such a problem, EM iterations are usually adopted. Accordingly, we first calculate the parameters of $\mathcal{N}$ (i.e., $\mu$ and $\Sigma$) and then, based on the results, calculate the weights. The calculated weights are then used to update $\mathcal{N}$. This process continues until the improvement is less than a small tolerance.
The detailed EM procedure for implementing the GMM is as follows (a complete NumPy sketch of the full loop is given after the steps).
1. Assume the data can be clustered into $K$ groups, e.g., 3. Generate $\mu$ and $\sigma$ for all the groups. When there is more than one variable (dimension), we use the covariance matrix $\Sigma$ (e.g., in NumPy, we use np.cov(Sample, rowvar=False))$^{2}$ instead of the standard deviation $\sigma$ (square root of variance). Meanwhile, $\mu$ will become an array with $J$ elements for each normal distribution (group).
2. Calculate the probabilities of each sample in different groups. Thus, there are $\mathcal{N}_1\left(\mu_1, \Sigma_1\right), \cdots, \mathcal{N}_K\left(\mu_K, \Sigma_K\right)$. For any sample $x_i$, we will have $\mathcal{N}_1\left(x_i \mid \mu_1, \Sigma_1\right), \cdots, \mathcal{N}_K\left(x_i \mid \mu_K, \Sigma_K\right)$. Calculate $\mathcal{N}_k\left(x_i \mid \mu_k, \Sigma_k\right)$ using the following equation
$$\mathcal{N}_k\left(x_i \mid \mu_k, \Sigma_k\right) = \frac{1}{(2\pi)^{J/2}\left|\Sigma_k\right|^{1/2}} \exp\left[-\frac{1}{2}\left(x_i - \mu_k\right)^{\mathsf{T}} \Sigma_k^{-1}\left(x_i - \mu_k\right)\right].$$
Therefore, there will be $I \times K$ probabilities in total.
3. Calculate the contribution of each $\mathcal{N}_k$ in the generation of any data point $x_i$
$$\gamma_{i,k} = \frac{\pi_k \mathcal{N}_k\left(x_i \mid \mu_k, \Sigma_k\right)}{\sum_{k'=1}^{K} \pi_{k'} \mathcal{N}_{k'}\left(x_i \mid \mu_{k'}, \Sigma_{k'}\right)},$$
where $\gamma_{i,k}$ can also be viewed as the weight of the point in the Gaussian distribution $\mathcal{N}_k$.
4. Update the Gaussian distributions, that is, obtain the new $\mu$ and $\Sigma$ for each Gaussian distribution. This is similar to updating the centroids of the groups in K-Means; the difference is that we use a (multivariate) distribution characterized by $\mu$ and $\Sigma$ instead of the coordinates of the centroids and circles marking the group boundaries. The new $\mu$ and $\Sigma$ are obtained with the weights of the points computed in Step 3. In NumPy, this update can be written along the following lines (a minimal sketch; the array names X and gamma are assumptions of this sketch).
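```python
import numpy as np

# X:     (I, J) array of samples; gamma: (I, K) weights from Step 3 (assumed to exist)
I, J = X.shape
K = gamma.shape[1]

Nk = gamma.sum(axis=0)                 # (K,)  effective number of points in each group
mu = (gamma.T @ X) / Nk[:, None]       # (K, J)  weighted means
Sigma = np.empty((K, J, J))
for k in range(K):
    diff = X - mu[k]                   # deviations from the new mean of group k
    Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # (J, J) weighted covariance
pi = Nk / I                            # (K,)  group weights, refreshed as described above
```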
5. Repeat Steps 2-4 until the difference between the $\mu$ and/or $\Sigma$ values in two consecutive iterations, e.g., the root of the sum of squared differences, is less than a tolerance, e.g., $1 \times 10^{-3}$. This procedure generated the results in Fig. 11.6.
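For completeness, here is a minimal end-to-end sketch of Steps 1-5 in NumPy. It is an illustration under simplifying assumptions (random initialization from the samples, no covariance regularization), not the exact code used to produce Fig. 11.6; the function name gmm_em and its arguments are invented for this sketch.

```python
import numpy as np

def gmm_em(X, K, tol=1e-3, max_iter=200, seed=0):
    """Cluster the rows of X (shape I x J) into K Gaussian groups with EM."""
    I, J = X.shape
    rng = np.random.default_rng(seed)

    # Step 1: initialize means at random samples, covariances from the data,
    # and equal group weights.
    mu = X[rng.choice(I, size=K, replace=False)]          # (K, J)
    Sigma = np.array([np.cov(X, rowvar=False)] * K)       # (K, J, J)
    pi = np.full(K, 1.0 / K)                              # (K,)

    for _ in range(max_iter):
        # Step 2: evaluate N_k(x_i | mu_k, Sigma_k) for every sample and group.
        pdf = np.empty((I, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(Sigma[k])
            norm = np.sqrt((2 * np.pi) ** J * np.linalg.det(Sigma[k]))
            pdf[:, k] = np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

        # Step 3: responsibilities gamma_{i,k} (contribution of each Gaussian).
        weighted = pi * pdf                                # (I, K)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # Step 4: update mu, Sigma, and the group weights pi.
        Nk = gamma.sum(axis=0)
        mu_new = (gamma.T @ X) / Nk[:, None]
        Sigma_new = np.empty_like(Sigma)
        for k in range(K):
            diff = X - mu_new[k]
            Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / I

        # Step 5: stop when the parameters barely change.
        shift = np.sqrt(np.sum((mu_new - mu) ** 2) + np.sum((Sigma_new - Sigma) ** 2))
        mu, Sigma = mu_new, Sigma_new
        if shift < tol:
            break

    return mu, Sigma, pi, gamma
```

Calling, e.g., gmm_em(points, K=3) returns the fitted means, covariances, and group weights, along with the per-point probabilities gamma of belonging to each group.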
11.6.1. Pros and Cons
GMM offers more flexibility in cluster covariance than K-Means clustering owing to its use of the standard deviation (or covariance matrix). Also, GMM does not assume spherical clusters, as algorithms based on Euclidean distances do. The introduction of probabilities allows us to report, in addition to the clustering result, the probabilities that a point belongs to different clusters. However, we need to be aware that GMM assumes Gaussian distributions, needs sufficient data for each cluster, and requires us to specify the number of clusters. In addition, the algorithm may be sensitive to outliers and initialization conditions and may require more computing time.
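As a quick illustration of this probabilistic output, the following snippet (a usage sketch added here, with made-up data) fits scikit-learn's GaussianMixture, which implements the same EM procedure, and reports the per-cluster membership probabilities of each point:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two illustrative elliptical clusters
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=200),
    rng.multivariate_normal([5, 5], [[1.0, -0.4], [-0.4, 0.8]], size=200),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
proba = gmm.predict_proba(X)   # probability of each point belonging to each cluster
print(proba[:3].round(3))
```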
$^{2}$ rowvar=False specifies that a row is a sample instead of a variable. Thus, it generates a $J \times J$ covariance matrix.