Previous
AI Basics
Contents
Table of Contents
Next
AI Issues

2.3.2. Common Algorithms

Overview and Machine Learning Tasks

This subsection presents a very quick overview of common machine learning algorithms in the order of major categories: supervised, unsupervised, semi-supervised, and reinforcement learning. However, prior to the introduction to detailed algorithms, it is necessary to differentiate the types of algorithms from the types of tasks these algorithms can be applied to. This is because, when selecting algorithms, both pros and cons of algorithms and the jobs that they can do are what we need to first consider. Unfortunately, the algorithms and the tasks that they can handle have complicated relationships. For example, some algorithms may handle a task that most other algorithms in the same category cannot. Moreover, such relationships are not fixed, as researchers may extend an algorithm to perform tasks that it could not. This is possibly one reason why such relationships are usually not explained adequately in typical machine learning literature.
Here, we try to lay down the relationships in a simple and general way and leave out the exceptions. The major categories of machine learning are shown in Fig. 2.8. As can be seen, regression tasks and classification tasks are usually discussed in supervised learning. This is because these two tasks both require labeled data: regression requires a continuous number or an array of continuous numbers as the label for each sample, while classification needs a discrete number (or another symbol) or an array of discrete numbers (other symbols) as the label. Unsupervised learning does not contain labels, so it is usually used to process data by analyzing the relationships between data points, such as distances between points. Typical tasks for unsupervised machine machine include clustering, dimension reduction, anomaly detection, and association rule learning. Reinforcement learning is more for planning and decision-making. It helps the learning agent improve its decision-making ability by letting the agent interact with an environment. Therefore, reinforcement learning is also adopted for "learning": helping models created by other algorithms gain a higher learning ability.
Figure 2.8: Major categories and tasks of machine learning
In a more abstract way, it can be interpreted as that machine learning can be used for predictive, descriptive, and prescriptive tasks. The predictive function helps predict what will happen with data. The descriptive function helps explain what has happened with data and what the data conveys. The prescriptive function helps make suggestions about what action to take based on data. These three functions correspond to the above three categories to some extent but not completely. An algorithm from any of the three categories may provide multiple functions depending on how the algorithm is applied.
Fig. 2.9 lists the machine learning topics (or types of algorithms) and the popular algorithms in each topic within the three major machine learning categories as well as the background knowledge. The listed topics and algorithms are selected based on a pragmatic criterion: what are the most needed and valuable contents for a typical semester-long course in the college. Accordingly, such topics correspond to sixteen weeks, in which some topics are marked with * to indicate that they can be optional if the semester is shorter. In the following, common algorithm types and representative algorithms for each type will be explained using language with as little technical detail as possible to help everybody get a quick understanding of the ideas behind the algorithms.
Week Topics: Algorithms Categories
1 Al Basics: Definition, History, Terms, Types & Math Background Knowledge
2 Al Implementations: Tasks, Metrics & Tools
3 Linear Models (Basic, Ridge, Lasso, Elastic Net) Supervised Learning
4 Decision Trees (ID3, C4.5, CART)
5 Support Vector Machines (linear, nonlinear (kernel))
6 Bayesian Methods* (Naïve, Semi, Bayesian Network, Gaussian Process)
7 Artificial Neural Network (Perceptron, Feed Forward NN, RBF)
8 Deep Learning (DNN, CNN, RNN) Unsupervised Learning
9 Ensemble Learning (Bagging (Random Forest), Boosting (Adaboost), Stacking)
10 Clustering (K-Means, Mean-Shift, DBScan, OPTICS, GMM, HAC)
11 Dimension Reduction (PCA, LDA, ICA, Isomap, MDR) Reinforcement Learning
12 Anomaly Detection (3sigma, boxplot, KNN, LOF, DBSCAN, iForest, AutoEncoder)
13 Association Rule Learning* (Apriori, FP Growth, Eclat)
14 Value-Based RL (Q-learning, Sarsa, Monte Carlo)
15 Policy-Based RL (Policy Gradient)
16 Advanced RL* (DQN, DDPG, Actor-Critic, A3C)
Week Topics: Algorithms Categories 1 Al Basics: Definition, History, Terms, Types & Math Background Knowledge 2 Al Implementations: Tasks, Metrics & Tools 3 Linear Models (Basic, Ridge, Lasso, Elastic Net) Supervised Learning 4 Decision Trees (ID3, C4.5, CART) 5 Support Vector Machines (linear, nonlinear (kernel)) 6 Bayesian Methods* (Naïve, Semi, Bayesian Network, Gaussian Process) 7 Artificial Neural Network (Perceptron, Feed Forward NN, RBF) 8 Deep Learning (DNN, CNN, RNN) Unsupervised Learning 9 Ensemble Learning (Bagging (Random Forest), Boosting (Adaboost), Stacking) 10 Clustering (K-Means, Mean-Shift, DBScan, OPTICS, GMM, HAC) 11 Dimension Reduction (PCA, LDA, ICA, Isomap, MDR) Reinforcement Learning 12 Anomaly Detection (3sigma, boxplot, KNN, LOF, DBSCAN, iForest, AutoEncoder) 13 Association Rule Learning* (Apriori, FP Growth, Eclat) 14 Value-Based RL (Q-learning, Sarsa, Monte Carlo) 15 Policy-Based RL (Policy Gradient) 16 Advanced RL* (DQN, DDPG, Actor-Critic, A3C) | Week | Topics: Algorithms | Categories | | :---: | :--- | :--- | | 1 | Al Basics: Definition, History, Terms, Types & Math | Background Knowledge | | 2 | Al Implementations: Tasks, Metrics & Tools | | | 3 | Linear Models (Basic, Ridge, Lasso, Elastic Net) | Supervised Learning | | 4 | Decision Trees (ID3, C4.5, CART) | | | 5 | Support Vector Machines (linear, nonlinear (kernel)) | | | 6 | Bayesian Methods* (Naïve, Semi, Bayesian Network, Gaussian Process) | | | 7 | Artificial Neural Network (Perceptron, Feed Forward NN, RBF) | | | 8 | Deep Learning (DNN, CNN, RNN) | Unsupervised Learning | | 9 | Ensemble Learning (Bagging (Random Forest), Boosting (Adaboost), Stacking) | | | 10 | Clustering (K-Means, Mean-Shift, DBScan, OPTICS, GMM, HAC) | | | 11 | Dimension Reduction (PCA, LDA, ICA, Isomap, MDR) | Reinforcement Learning | | 12 | Anomaly Detection (3sigma, boxplot, KNN, LOF, DBSCAN, iForest, AutoEncoder) | | | 13 | Association Rule Learning* (Apriori, FP Growth, Eclat) | | | 14 | Value-Based RL (Q-learning, Sarsa, Monte Carlo) | | | 15 | Policy-Based RL (Policy Gradient) | | | 16 | Advanced RL* (DQN, DDPG, Actor-Critic, A3C) | |
Figure 2.9: Machine learning topics and popular algorithms

Supervised Learning

Linear Models refer to a category of algorithms that can be used for regression analysis or curve fitting, which engineers are familiar with. Thus, we can simply understand this as the use of a simple mathematical function to fit the measurement data, in which attributes are the independent variables ( x x xxx 's) and the label (if only one label) is the dependent variables ( y ) ( y ) (y)(y)(y). In a narrow sense, such a model adopts a linear function: y y yyy is a linear function of x x xxx 's (or linear combination of x x xxx 's). In the simplest case, a linear function parameterized by the weights for different dependent variables and the bias, which can be directly obtained via an analytical solution, forms the basic linear model. To overcome overfitting issues, L2, L1, and the combination of L2 and L1 norms of the model parameters can be considered when searching for the best curve fitting function, leading to Ridge, Lasso, and Elastic Net algorithm, respectively. These linear models are only suitable for linear regression problems. In a broader sense, these linear models can be extended with kernel functions to deal with nonlinear problems. Linear models are initially proposed for regression or curve fitting problems, so they are of a numeric nature both attribute and label values are numeric. Notwithstanding, we can make further changes like adding a logistic function or a Softmax function to convert the output of linear models to a probability and finally to a discrete label. In this way, linear models can also be applied to classification problems.
Decision Trees are a family of algorithms that use a tree-like structure to guide/mimic humans' decision-making process. Starting from a node representing the root of the tree, each node, including the root node and any node on the following layers except the leaf nodes, is associated with an attribute. The possible values of the attribute will lead to the generation of "son" nodes on the next layer (or splitting). Thus, in a decision-making process like classifying a sample, the sample will start from the root node. Then, depending on the attribute values, the sample will move from one node to another one on the next layer based on the value of the attribute associated with the node, and eventually to a leaf node. Each leaf node is associated with a class. This is how a classification decision is used for prediction. For training, all labeled samples will start from the root node as well. Then, an attribute is selected at each non-leaf node according to a general criterion of reducing the chaos (uncertainty) in the data the best. The node-by-node selection of attributes and concurrent splitting of the dataset into subsets belonging to nodes on the next level continue until we reach a node where further splitting is not possible, e.g., reaching a leaf node. This process will generate a decision tree according to this dataset. Different criteria for selecting the attributes lead to different decision tree algorithms: ID3, C4.5, and CART. Preand post-pruning techniques are very important in controlling the tree size, e.g., depth, to deal with overfitting. These techniques are also frequently included as part of the modern decision tree algorithms.
Support Vector Machines, or more commonly referred to as SVMs, are a machine learning topic that once held a predominant role before deep learning rose. Such machine learning algorithms stem from the idea of performing
binary classification by finding the widest margin to separate the samples or the corresponding data points from the two classes. This conceptualization leads to a typical constrained optimization problem: maximize the width of the margin by searching for the margin boundaries controlled by both the margin direction and the locations of samples on the boundaries, which are the "support vectors". Accordingly, no samples can get into the margin or the region belonging to the other class. This strict constraint called "hard margin" can be loosened as "soft margin" to allow for noisy data points. This unconstrained optimization problem, though can be solved using an optimization package, is more commonly converted to a dual problem to facilitate a more convenient computer solution to the corresponding quadratic convex optimization problem. Besides, the basic linear SVM can be extended with kernels to address nonlinear problems. The basic version of SVM was proposed for binary classification tasks. However, it is not difficult to extend it to multiclass classification and regression tasks.
Bayesian Methods are also called Bayesian algorithms, Bayesian machine learning, and probabilistic machine learning in the literature, though the exact meanings of these terms may slightly differ depending on the context and research areas. Bayesian methods were intrinsically constructed for classification tasks. Possibly due to this reason, the use of the 'Bayesian classifier' is very common or even predominant in some technical publications. Such machine learning algorithms stem from Bayesianism, which uses probabilities to quantify the level of belief and consequently updates such beliefs in the evidence of new data. Accordingly, such methods correlate the probability of a sample with a certain combination of attribute values and target values with the probabilities that the different classes and attribute values appear conditionally or simultaneously. Depending on how to calculate such probabilities, we have different types of Bayesian algorithms: from Naive Bayes, which assumes strong (naive) independence between attributes, to Bayesian network, which can formulate complicated interdependence between different attributes with a network-like probabilistic graphical model. It is worthwhile to mention that such Bayesian classifiers, which belong to parametric machine learning methods, can also be extended to handle regression tasks. Other nonparametric Bayesian methods like Gaussian process are proposed more for regression and may appear much different from traditional Bayesian classifiers.
Artificial Neural Networks (ANN or simply called NN in machine learning) use mathematical operations like inner products, element-wise products, and activation functions that can work on arrays to mimic the working mechanisms of biological neural networks. In particular, most modern ANNs adopt the M-P neuron model: the input to a neuron is multiplied by the neuron's weights and then the difference between this product and a threshold is fed into an activation function to generate the neuron output. A network that typically consists of multiple layers of such neurons treats and processes the data according to the architecture of the network, e.g., direction and inter-neuron/inter-layer connections, and finally outputs the predicted label. One key piece of knowledge in ANN studies is how to train an ANN. So far, backpropagation has been accepted as the most common method for training ANNs, in both shallow ANNs and deep learning. A typical backpropagation process includes a forward pass for predicting the label and a backward pass for passing the "gradients" of loss to different model parameters, e.g., network weights, to update these parameters so that the ANN can make better predictions in the next forward pass. Multi-layer feed-forward neural network, especially 3-layer, is the most widely known ANN architecture, though other types, such as RBF, are also available.
Deep Learning is about the development and use of deep NNs. The "deep" in this definition, in general, denotes the depth of layers in an ANN. A neural network that consists of more than three layers - which would be inclusive of the input and the output layers - can be considered a deep learning model. Thus, in theory, it can be viewed as a subset of or an extension to traditional ANNs. However, the breakthroughs and widespread applications of deep neural network have gained knowledge that makes deep learning much different from traditional (shallow) ANN studies. Some of the typical factors that contribute to the development of ANNs especially the transition from shallow NNs to deep learning, including breakthroughs specific to deep learning and a few advances from the general field of machine learning such as pre-training, transfer learning, solvers, and regularizers are listed as follows:
  • better data: more data, preprocessing, normalization;
  • better weights: initialization, pre-training, transfer learning;
  • better network structure: activation, batch normalization, CNN, LSTM, NIN, Residual Network, transformer;
  • better solvers;
  • better regularizers;
  • better computing resources: GPU, parallel computing.
Deep learning in the third wave of AI exhibits its success via Convolutional Neural Network (CNN) in computer vision, Recurrent Neural Network (like Long-Short Term Memory) in natural language processing, integration with reinforcement learning for learning and control, and more recently in Large Language Models (LLM).
Ensemble Learning refers to algorithms that employ models generated by other algorithms as constituent models to obtain an ensemble model with performance that is better than what can be obtained with individual constituent models. These constituent models, which are called base models, can be generated by one of those basic machine learning algorithms, like linear model, SVM, decision tree, KNN and NN, or a combination of them. The former is called homogeneous, while the latter is heterogeneous. Currently, homogeneous base learners are more frequently used, and among them, CART decision tree and neural networks are the most common algorithms for generating homogeneous base learners. The idea behind ensemble learning can be described as "Union is Strength" or "Many hands provide great strength." Therefore, ensemble learning is viewed as an optimization method that generates a strong learner from several weak learners. There are three common categories of ensemble learning methods: bagging (e.g., meta-estimator (basic bagging) and random forest), boosting (e.g., AdaBoost and Gradient Boosting), and stacking. Bagging is focused on the utilization of democracy to reduce variance. Boosting, which features elitism, generates better or elite models by focusing on samples associated with wrong predictions and gives elite models more weights in decision to improve bias. Stacking replaces simple combination rules with a machine learning model as a second layer learner to process the results from the first layer learners for better predictions.

Unsupervised Learning

Clustering is a machine learning topic aiming to divide unlabeled data (data points or samples) into different groups. A general goal is to generate groups so that samples within the same group are similar to others while samples from different groups are different from each other. These groups or groupings are referred to as 'clusters'. Both the ways of generating clusters and evaluating the generated groups (or clustering process) are the content of this unsupervised learning topic. For the former, many different types of clustering algorithms have been proposed based on the patterns that the data points need to be arranged in. Centroid models refer to clustering algorithms wherein the clusters are formed by the proximity of the data points to the cluster center or centroid. Data points are clustered based on multiple centroids in the data. K-Means clustering and Mean-Shift clustering are the most popular algorithms in this category. Density models are generated by clustering algorithms that group data by areas with a high concentration of data points surrounded by areas with a low concentration of data points. DBScan and OPTICS are two popular density-based algorithms. In distribution models, data points are grouped together based on the probability that they may belong to the same distribution. In a particular distribution, the distance of a data point from a center point is determined to infer the probability of being in that cluster. Gaussian Mixture Theory (GMM) is the most common option in this category. Hierarchical (connectivity) models involve top-to-bottom or bottom-up hierarchies. Agglomerative Hierarchical Algorithm is the most popular example in this category.
Dimensionality Reduction, or called dimension reduction, is the transformation of data from a high-dimensional space, e.g., with a large number of attributes, into a relatively low-dimensional space, e.g., with fewer attributes than the original data, while retaining essential properties of the original data. Therefore, dimensionality reduction helps remove redundant or less significant variables. These methods can be classified into two major categories: feature selection and feature projection. Feature selection seeks to find a subset of the input variables (or features, attributes, dimensions). Therefore, we select features directly according to some criteria, such as filter strategy (e.g., information gain) and wrapper strategy (e.g., search guided by accuracy), and drop the less desired features to reduce data dimensionality. By contrast, feature extraction tends to project the data in a high-dimensional space to a space of fewer dimensions, in which the more desired features or their combinations are extracted. Feature selection methods primarily refer to those data reduction techniques studied in statistics for variable selection and now mostly in high-dimensional regression analysis. Classical methods are missing value ratio, low variance filter, high correlation filter, random forest, backward feature extraction, and forward feature selection. Feature projection methods are more popular in the dimensionality reduction literature and are even used to represent dimensionality reduction in a narrow sense. This category includes the most popular dimensionality reduction methods such as Principal Component(s) Analysis (PCA), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), Isomap, and MDR. Dimensionality reduction methods can also be classified based on whether they are for labeled or unlabeled data. A typical example is that PCA was developed for unlabeled data. Thus, PCA is mostly used for clustering in unsupervised learning. On the contrary, LDA was proposed for processing labeled data. Another way of classifying dimensionality reduction methods is based on whether these methods are linear or nonlinear in nature. For example, PCA, ICA, and LDA are linear, while LLE, Isomap, MDR, and kernel PCA are nonlinear.
Anomaly Detection, also called novelty detection, outlier detection, forgery detection, or out-of-distribution detection in different areas, is intended to identify rare items, events, or observations that significantly deviate from the majority of the data and do not conform to a well-defined notion of normal behavior. Anomaly detection has been applied to
a variety of areas such as fraud detection, web hack detection, medical (disease) detection, sensor network anomaly detection, IoT bid data anomaly detection, log anomaly detection, and industrial hazard detection. In a broad sense, the available methods for anomaly detection can be roughly grouped into rule-based methods, statistics-based methods, and machine learning-based methods. Among them, the anomaly detection methods based on machine learning algorithms are anomaly detection in a narrow sense and represent the state of the art. The machine learning-based methods can be further categorized into supervised, unsupervised, and semi-supervised methods. In unsupervised learning, common methods can be divided into five groups: statistics-based, distance-based, density-based, clustering-based, and tree-based. In semisupervised learning, popular methods include one-class SVM, AutoEncoder (or autoencoder), and GMM. In supervised meaning, we usually need to pay attention to data labeling and imbalanced data for possible issues, and such methods are suitable for considering data with new classes. Common methods in this category include linear models, SVM, and ANN.
Association Rule Learning, which is also called association rule analysis and association rule mining in many publications, is a rule-based type of machine learning for discovering interesting relations between variables. This definition is not that straightforward to understand, especially considering the vague meanings of terms like "relations between variables", "rule", and "interesting". Also, association rule learning and algorithms (or methods) for it contain various new parameters/concepts, such as those embedded in the definition. This fact makes it difficult for people without some expertise in data mining to understand the topic and implement association rule learning algorithms. In addition to these parameters and concepts, association rule learning treats data that are usually formulated in a format slightly different from data dealt with in other machine learning areas due to historical and practical reasons. Therefore, association rule learning may appear much different from other supervised and unsupervised machine learning topics in many aspects, which can further confuse learners. Popular association rule learning algorithms include Apriori, FP Growth, and Eclat.

Reinforcement Learning

Reinforcement Learning is the third category of machine learning, in which no raw data is given as input. Instead, the reinforcement learning algorithm needs to design a way to generate and label data in the training process. Reinforcement learning is frequently used for robotics, gaming, and navigation. With reinforcement learning, the algorithm discovers through trial and error to identify actions yielding the most significant rewards. Thus, actions and rewards are similar to the attributes and labels in supervised machine learning in some way. As illustrated in 2.10, this type of training has three main components: an agent that can be described as the learner or decision maker, an environment that the agent lives in, and the actions that the agent takes for rewards. The objective is to let the agent take actions that maximize the expected reward over a given measure of time. The agent will reach the goal much quicker by following a good policy. So the purpose of reinforcement learning is to learn the best policy. Reinforcement learning can be roughly categorized into value-based, policy-based, and hybrid (involving both value and policy) algorithms.
Figure 2.10: Workflow of reinforcement Learning
Typical examples of value-based reinforcement learning algorithms are Q -learning and Sarsa. In Q-learning, the agent learns by updating its own Q table, which quantifies the values of different actions in specific states. The agent interacts with the environment and generates a sequence of state-action-reward values called a trajectory. The yielded rewards will help the agent update its own Q table and consequently improve the decision-making ability. Q-learning is an offpolicy method because it learns an optimal policy no matter which strategy it carries out. Sarsa follows a very similar procedure for learning, but it adopts the current policy for decision-making during the learning process (or call episode) and thus is an on-policy algorithm. By contrast, policy-based reinforcement learning does not use action and state values to determine the optimal action. But instead, it adopts a probability function for selecting an action. The selected action will be evaluated via the reward function to update the policy parameterized by this distribution. A typical example of policy-based algorithms is policy gradient. More complicated reinforcement learning integrates both the values and policy such as critic-actor and later variations like C3A. Such hybrid algorithms use one type of reinforcement learning to
generate an action and use the other type of algorithm to assess the algorithm. In addition, reinforcement learning has been integrated with deep learning. For example, deep neural networks can be used to replace the Q table as the mapping from actions to values, leading to algorithms like deep Q-learning. Many others are being proposed following this direction like Deep Deterministic Policy Gradient (DDPG).

Semi-Supervised Learning

As the fourth type of machine learning, Semi-Supervised Learning uses both labeled data and unlabeled data. Though semi-supervised learning can be viewed as a hybrid of supervised and unsupervised machine learning, it is mostly used for the same purposes as supervised learning, e.g., classification, regression, and prediction. Semi-supervised learning usually employs a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning brought obvious benefits such as the save in the effort of labeling massive amounts of data and the reduction in the bias caused by labeling. Such algorithms still need to pose strict requirements on the data, such as the accuracy of the labels for labeled data and class balance for the unlabeled data. Semi-supervised learning has sub-categories like simple self-training, co-training, label propagation algorithm, semi-supervised SVM, and more recently, semi-supervised deep learning. The workflows of different semi-supervised learning algorithms may be considerably different. Fig. 2.11 illustrates a typical workflow in self-training.
Figure 2.11: Typical workflow of semi-supervised machine learning

Summary

AI is a highly dynamic field. In the third wave of AI represented by machine learning algorithms, the popularity of the algorithms is changing every day. Fig. 2.12 presents is a list of common algorithms (top-10 ranking) according to a survey conducted by Kaggle with 18996 respondents. This presents a snapshot of this field for a condition around 2020.

 

 

 

 

 

 

Enjoy and Build the AI World

Sample Code from AI Engineering

Cite the code in your publications

Linear Models