
Chapter 12

Dimension Reduction

12.1. Overview

This chapter discusses dimensionality reduction. First, we will try to understand the concepts and possible uses of this unsupervised learning technique. Next, we will review the classification of dimensionality reduction methods and popular algorithms. Then, two categories of dimensionality reduction methods, i.e., feature selection and feature extraction, will be introduced. In particular, two feature extraction algorithms, i.e., Principal Component Analysis and Linear Discriminant Analysis, will be explained in detail, considering their significance and predominance in dimensionality reduction studies. For each of them, we will first try to understand the idea, then formulate the idea using mathematical equations, and finally show how to implement the method with those equations.

12.2. Basics of Dimension Reduction

12.2.1. Concepts and Needs

Dimensionality reduction, also called dimension reduction, is the transformation of data from a high-dimensional space, e.g., one with a large number of features (also regarded as random variables or attributes), into a relatively low-dimensional space, e.g., one with fewer features than the original data, while retaining essential properties of the original data. Dimension reduction can be useful and/or necessary for a variety of reasons, as follows.
  • The size of data can be significantly reduced as the dimensionality decreases. Thus, dimensionality reduction can help compress data. Such compression saves effort in handling data, such as storage space and transmission time.
  • Low-dimensional data can, in general, reduce the time for computing and training.
  • Many algorithms do not show good performance when the data dimensionality is too high, and dimension reduction can improve their performance by transforming the data to a low-dimensional space.
  • Dimension reduction can help address the curse of dimensionality: various problems arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings. A common theme of these problems is that, as the dimensionality increases, the volume of the space grows so fast that the available data becomes sparse.
  • Dimension reduction can help reduce the redundancy in data, such as features that depend on other features. For example, if a feature is a linear combination of the other features, this feature can be removed without hurting the value of the data. One application is denoising in signal processing.
  • Dimension reduction can facilitate data visualization, which becomes difficult as data dimensionality grows. For example, it is relatively easy to plot 2D or 3D data for visualization, but this job becomes much more difficult as the dimensionality gets above 3.
Therefore, dimensionality reduction helps remove the redundant or less significant variables. In this way, it can help us find the major variables, or functions of such variables (e.g., a linear combination of the variables), that make a major contribution to the task of interest, such as classification. In practice, dimensionality reduction is widely used in fields that deal with large numbers of observations (number of instances) and/or large numbers of variables (size/dimension of individual instances), such as signal processing, speech recognition, bioinformatics, and geoinformatics.
Many methods have been proposed for reducing the dimensionality of data [124]. These methods can be classified according to different criteria. The most common classification divides the methods into two major categories: feature selection and feature extraction (also called feature projection). Feature selection seeks to find a subset of the input variables (or features). Therefore, we select features directly according to some criteria, such as the filter strategy (e.g., information gain) and the wrapper strategy (e.g., search guided by accuracy), and drop less desired features to reduce the data dimensionality. By contrast, feature extraction projects data in a high-dimensional space to a space of fewer dimensions, in which combinations of the more desired features are extracted. Accordingly, instead of selecting features directly, we try to extract those feature functions (e.g., combinations) that better reflect the data characteristics (e.g., unsupervised learning) or better serve the data analysis tasks (e.g., supervised learning) by finding a low-dimensional space to map the data to; the sketch below contrasts the two ideas. Feature extraction can be further divided into two sub-categories based on whether the projection is made directly to a lower-dimensional space or to a manifold in a lower-dimensional space.
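The following is a minimal NumPy sketch (not from the text) contrasting the two categories: feature selection keeps a subset of the original columns, while feature extraction maps the data to a new, lower-dimensional space. The projection matrix W here is arbitrary and only illustrates the shape of the operation; real methods such as PCA or LDA learn the projection from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, 5 features

# Feature selection: keep a subset of the original columns directly.
X_selected = X[:, [0, 2]]            # e.g., keep features 0 and 2

# Feature extraction (projection): map the data to a lower-dimensional
# space via a linear transformation. W is arbitrary here; PCA/LDA would
# choose the projection according to their own criteria.
W = rng.normal(size=(5, 2))
X_extracted = X @ W

print(X_selected.shape, X_extracted.shape)   # (100, 2) (100, 2)
```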
Feature selection methods [125] primarily refer to data reduction techniques studied in statistics for variable selection and are now mostly discussed in high-dimensional regression analysis. Classical methods include the missing value ratio, low variance filter, high correlation filter, random forest, backward feature elimination, and forward feature selection; a minimal low variance filter is sketched below. Feature projection methods are more popular in the dimensionality reduction literature and are sometimes even taken to represent dimensionality reduction in a narrow sense [126]. This category includes the most popular dimensionality reduction methods such as Principal Component Analysis (PCA) [127], Linear Discriminant Analysis (LDA) [128], Independent Component Analysis (ICA) [129], Isomap [130], and Multifactor Dimensionality Reduction (MDR) [131].
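As a quick illustration of one of these classical criteria, the sketch below applies a low variance filter with scikit-learn's VarianceThreshold; the toy data and the threshold value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the first column is constant and carries no information.
X = np.array([[0.0, 1.0, 2.0],
              [0.0, 3.0, 2.5],
              [0.0, 5.0, 1.5]])

# Low variance filter: drop features whose variance does not exceed the
# threshold (0.1 is an arbitrary choice for this toy example).
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # (3, 2): the constant column is removed
```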
In addition to the above classification, dimensionality reduction methods can also be classified based on whether they process labeled or unlabeled data. A typical example is that PCA was developed for unlabeled data; thus, PCA is mostly used for clustering in unsupervised learning. In contrast, LDA was proposed for processing labeled data; for example, it is usually used to condense labeled data for classification tasks. This contrast is illustrated in the short sketch below. Another way of classifying dimensionality reduction methods depends on whether they are linear or nonlinear in nature. For example, PCA, LDA, and ICA are linear, while Locally Linear Embedding (LLE) [132], Isomap, MDR, and kernel PCA are nonlinear. Table 12.2.2 lists popular dimension reduction methods categorized according to these criteria.
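To make the labeled/unlabeled distinction concrete, here is a small sketch of typical usage (an illustrative assumption, not code from the text) that reduces the Iris data to two dimensions with both methods in scikit-learn; PCA ignores the labels, while LDA requires them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)    # 150 samples, 4 features, 3 classes

# PCA is unsupervised: it finds directions of maximum variance and never sees y.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it finds directions that best separate the labeled classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)      # (150, 2) (150, 2)
```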
It is important to point out that dimensionality reduction differs from other machine learning topics in that it contains significantly different methods for the same purpose rather than one method and its variants. Therefore, popular dimensionality reduction methods share little common theoretical basis and may differ greatly from each other. This is unlike other machine learning topics, such as Bayesian classifiers, in which most methods are built on Bayes' theorem. For this reason, this chapter on dimensionality reduction is organized slightly differently from the other chapters. We will first briefly introduce the common feature selection methods, which are relatively straightforward and easy to understand. Then, two of the most popular/classic dimensionality reduction methods, i.e., PCA and LDA, will be introduced to cover both their theories and implementations.

12.3. Common Feature Selection Methods

The following are common feature selection methods.

Missing Value Ratio

The missing value ratio is an intuitive criterion for removing attributes. Missing attribute values for some samples are very common in real-world data. In some cases, we choose to fill a missing value, which may be represented by "nan" or "null", with a value such as "0". However, when too many samples have missing values for a specific attribute, we may consider deleting that attribute. Deleting an attribute is equivalent to removing a variable containing little information or eliminating an insignificant direction/dimension.
For this purpose, we need to define a threshold to determine whether an attribute is insignificant enough to remove. This threshold, or critical missing value ratio, serves as the cutoff: when an attribute's missing value ratio is higher than the threshold, we delete the corresponding attribute or dimension. Thus, a lower threshold corresponds to a more aggressive dimension reduction strategy. The sketch below illustrates this procedure.
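A minimal pandas sketch of the procedure, assuming a toy data frame and an arbitrarily chosen critical ratio of 0.5 (both are illustrative assumptions, not values from the text):

```python
import numpy as np
import pandas as pd

# Toy data frame with missing entries (hypothetical values for illustration).
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

threshold = 0.5                          # critical missing value ratio (assumed)
missing_ratio = df.isna().mean()         # fraction of missing values per column
keep = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[keep]                    # drop columns whose ratio exceeds the threshold

print(missing_ratio.to_dict())           # {'a': 0.25, 'b': 0.75, 'c': 0.0}
print(list(df_reduced.columns))          # ['a', 'c']
```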

  1. Isomap extends MDS (multidimensional scaling) by incorporating the geodesic distances imposed by a weighted graph.