
2.3. Basics of AI

Representation, optimization, and evaluation, together with the underlying math, constitute the basics of AI. In this section, the basic concepts will be introduced first, followed by a quick overview of the common machine learning algorithms. The challenges and issues in machine learning will be discussed next. The math knowledge needed for machine learning, e.g., more details about optimization and evaluation, will be presented in the appendices.

2.3.1. Basic Concepts

This subsection covers the basic concepts including key machine learning elements, data format, and typical machine learning workflows.

Key Machine Learning Elements

Machine learning, or most machine learning algorithms, can be conceptually divided into three main elements:
  1. Representation: What does the model look like? ↔ How is knowledge represented?
  2. Solution (Optimization): How to generate (optimal) models? ↔ How to extract the knowledge?
  3. Evaluation: How to evaluate the performance of models? ↔ How to measure the obtained knowledge?
There are eight key concepts in the above three main elements.

Representation

Concept 1 - Data: Data is the food or nutrients for machine learning and the generation of machine learning models. Thus, data is where the experience or knowledge is embedded. In supervised learning, data is divided into input and output to represent the unlabeled data and labels, respectively. In unsupervised machine learning, there is only unlabeled data, which can be processed to find specific patterns in such data. In reinforcement learning, data is generated as the learning agent interacts with the environment and is 'labeled' using a reward function. More explanation about the format/structure of data will be provided in a later subsection.
Concept 1.1 - Input: Input is what we feed into machine learning models for learning. Such input data represents the observations or measurements excluding the labels or target values. Each observation is a sample or an instance composed of values for different aspects of the observation, which are called attributes, features, (random) variables, independent variables, and predictors (predictor variables) in different literature.
Concept 1.2 - Output: This is what we want the models to predict or estimate. Output is also called labels, targets, dependent variables, and response variables in different places.
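For illustration, below is a minimal sketch of how input and output are commonly held in code (a Python/NumPy example added here; the numbers are made up, and the names X and y are only a widespread convention, not a requirement).

```python
import numpy as np

# Input: each row is one sample (instance); each column is one
# attribute (feature). The values below are made up for illustration.
X = np.array([[5.0, 1.2],
              [3.1, 0.8],
              [6.4, 2.0]])

# Output: one label (target) per sample, aligned with the rows of X.
y = np.array([0, 0, 1])
```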
Concept 2 - Algorithms: Algorithms, or learning algorithms, are different ways of constructing machine learning models based on data. We can understand an algorithm as a procedure that may contain a way of conceptualizing the model, mathematical equations, and logic statements. Algorithms usually can be outlined using pseudo-code to illustrate how to implement the procedure. We can also understand algorithms as methods in a more general sense. Different algorithms can be used to address different machine learning tasks, e.g., regression, classification, and clustering.
Concept 3 - Model: A model or a machine learning model is a math function or a more complicated entity that can be mathematically formulated. As the product of running an algorithm on data, a model can be a fitting function in the simplest case or a neural network with fixed weights. We may encounter untrained models, e.g., an artificial neural network with randomly generated weights, which have not gone through the training process and thus contain model parameters that are irrelevant to the data. By contrast, trained or pre-trained models contain parameters that have been determined with data and thus represent some knowledge from such data.
Concept 4 - Parameters: Parameters are also called coefficients and weights, depending on the algorithms and contexts. Models can be understood as being formed by two parts: a general template (or architecture) and detailed parameters that fix the template into a specific object. The number of parameters can range from a few, e.g., 2 in a simple linear model, to millions, e.g., 138 million in the VGG16 deep neural network.
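As a sketch of this template-plus-parameters view (using scikit-learn's LinearRegression as one convenient choice; the toy data is made up), a linear model with one feature has exactly two parameters, a slope and an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # generated from y = 2x + 1

model = LinearRegression()  # the general template (architecture)
model.fit(X, y)             # training fixes the template into a specific model

# The two parameters that make the template specific:
print(model.coef_, model.intercept_)  # approximately [2.0] and 1.0
```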
Concept 5 - Hyperparameters: Hyperparameters are distinct from the (model) parameters in that they are not part of the model. Instead, hyperparameters are the numbers we set in the initial configuration before training machine learning models. Such numbers determine how the learning will be performed and are mostly dictated by the selected algorithm, including the solver. Typical hyperparameters include the coefficients determining the loss function, solver settings, visualization options, and the way that the data and model are processed. In addition to the selected algorithm, model architecture, and initial model parameters (if any), hyperparameters are critical to the success of learning tasks.
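To make the distinction concrete, here is a small sketch (scikit-learn's SGDRegressor is our example choice here): the constructor arguments are hyperparameters set before training, while the model parameters only come into existence after fitting.

```python
from sklearn.linear_model import SGDRegressor

# Hyperparameters: chosen before training, not part of the trained model.
model = SGDRegressor(
    alpha=1e-4,     # regularization coefficient (shapes the loss function)
    eta0=0.01,      # initial learning rate (controls the solver)
    max_iter=1000,  # optimization budget
)
# Model parameters (model.coef_, model.intercept_) exist only after
# calling model.fit(X, y) on some data.
```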

Solution

Solution is the search for the model that best extracts the knowledge from the data with the selected machine learning algorithm. We can obtain the solution either by deriving an analytical solution, i.e., equation(s) for calculating the model parameters directly, or by using an optimizer (solver) to help us find a model. The former can be done within one or a few steps, while the latter involves an iterative optimization process. Also, the former usually secures the exact or best model, while the latter, in most cases, can only help us find a local optimum as a relatively good (approximate) model instead of the best model. While both approaches can be adopted for many machine learning models, the former is mostly adopted for algorithms with simple models like linear models, whereas the latter is usually adopted for algorithms with complicated models like deep neural networks.
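As a minimal sketch of the analytical route (least-squares linear regression, where the well-known normal equation gives the parameters in one step; the toy data is made up for illustration):

```python
import numpy as np

# Design matrix with a column of ones for the intercept, plus targets.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Normal equation: w = (X^T X)^(-1) X^T y, solved directly, no iteration.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [1.0, 2.0]: intercept 1, slope 2
```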
Concept 6 - Loss/Cost/Objective Function: Loss functions, cost functions, and objective functions are usually used interchangeably, though an objective function can be minimized or maximized while loss/cost functions need to be minimized. Such a function is an essential part of the algorithm, especially when approximate optimization is needed for the solution, because it measures the performance of machine learning models and thus determines the optimization direction during the solution process. The idea is to minimize the loss function so that the most appropriate parameters for the machine learning model can be obtained.
Concept 7 - Optimization Methods (Solvers and Optimizers): Optimization methods, also called solvers or optimizers, dictate how the solution process is performed to optimize the loss function. The optimization method is usually not fixed when developing an algorithm, and it entails knowledge that is largely different from and independent of the machine learning algorithms and models. Therefore, optimization methods can be treated separately from the algorithms. Common solvers are borrowed from the optimization realm. Such methods can work as long as a loss function, the parameters to be optimized (e.g., model parameters in machine learning), and constraints (if needed) are given.
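The interplay of Concepts 6 and 7 can be sketched with plain gradient descent minimizing a mean-squared-error loss (a toy example added here; real solvers are more sophisticated, but the loop below carries the essential idea):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # generated from y = 2x + 1

w, b = 0.0, 0.0  # initial model parameters
lr = 0.05        # learning rate: a hyperparameter, not a model parameter

for _ in range(2000):  # iterative solution process
    pred = w * X + b
    # Loss function: mean squared error between predictions and targets.
    # The solver uses its gradient to pick the optimization direction.
    grad_w = 2.0 * np.mean((pred - y) * X)
    grad_b = 2.0 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches the best parameters w = 2, b = 1
```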

Evaluation

Evaluation is usually implemented during testing or cross-validation to assess the performance of the model. Evaluation needs to be discussed with respect to different machine learning tasks (or problems), because different evaluation metrics have been proposed for different types of tasks, e.g., classification, regression, and clustering. A key step in machine learning model evaluation is the selection of evaluation metrics.
Concept 8 - Evaluation Metrics: Evaluation metrics are the different methods for measuring the performance of a model, including, but extending well beyond, accuracy and error. A complete list and description of the common metrics will be provided in the appendices.
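As a small sketch (scikit-learn's metrics module is one common source of such implementations; the numbers are made up), different task types call for different metrics:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification metric: fraction of correctly predicted labels.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy_score(y_true, y_pred))  # 0.75

# Regression metric: average squared deviation from the targets.
print(mean_squared_error([1.0, 2.0], [1.5, 2.0]))  # 0.125
```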

Data Format

This subsection discusses how to organize data, or in other words, the structure of data. However, the term "data structure" has a connotation of the way of organizing and storing data in a computer's memory or storage. Thus, we here use data format in the title to avoid confusion. Despite this arrangement, data format and data structure are interchangeable in this book, and both refer to the structure or organization of data on a higher level that human beings can easily understand or visualize.
Data is a collection of numbers and symbols that we use to describe things. In machine learning, data is usually used to quantitatively describe measurements (also called observations) and their assessments. Therefore, an intuitive structure of data is to organize different measurements as different data points, which are the elements in the data structure. For the same reason, a data point can also be called a sample, an instance, or an observation (occasionally a record). As shown in Fig. 2.6, every observation may contain information for multiple aspects; thus every data point may also have different values, termed attributes or features (less frequently, properties). Data points are usually grouped together as a dataset for a purpose, e.g., training and testing a model, leading to training and testing datasets, respectively. Usually, all the data points in the same dataset have the same number of attribute values. The assessments of the data are stored as labels (also called targets) in the dataset. In a labeled dataset, each data point has one or more label values. Every data point together with its label(s) is called a labeled sample, a labeled instance, or an example.
The realizations of data structure for different machine learning purposes share a lot in common in spite of minor differences between such realizations in different packages. Fig. 2.6 presents a representative way of organizing data. One item that was not mentioned above is the sample ID. In some AI tools, the sample ID is explicitly used for referring to different data points, while in some other tools, ID is not explicitly defined, and parameters such as the location, row number, and order are implicitly used as the ID. As can be seen, it is common to use different rows for different data points and different columns for different attributes. Label values are usually stored either as a separate array or column(s) to the right of the unlabeled data.
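A sketch of this layout in code (using pandas as one common realization; the values are the first two rows of the dataset shown in Fig. 2.6):

```python
import pandas as pd

# Rows are samples; columns are attributes; the last column holds the
# labels, mirroring the layout of Fig. 2.6 (a subset of its columns).
df = pd.DataFrame({
    "ID":       [1000025, 1002945],
    "Clump":    [5, 5],
    "UnifSize": [1, 4],
    "Class":    ["benign", "benign"],
})

X = df[["Clump", "UnifSize"]]  # unlabeled data (attribute values)
y = df["Class"]                # labels
```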
| ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | benign |
| 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | malignant |
| 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | benign |
| 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | benign |
| 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | malignant |
| 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | benign |
| 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | benign |
| 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | benign |

Figure 2.6: Structure of data for (supervised) machine learning. In the original figure, annotations mark the sample ID column, a single sample (row), the attribute (feature) names in the header row, the attribute values in the cells, the labels in the last column ("Class"), and one labeled example (labeled instance).

As can be seen in the above introduction, many terms are used interchangeably, though there may be some negligible differences. In this book, though these terms are used interchangeably most of the time, we try to follow the convention of their use in individual machine learning topics. In addition to the convention, though not always true, the following rules will be followed. "Data point" is preferred when talking about plotting, space, and distributions; "sample" is used when an experimental context is emphasized; "instance" is used in a pure machine learning context. "Attribute" is preferred for samples whose individual aspects can be described with simple math entities such as a real number; "feature" is used for samples with more complicated characteristics, such as those that need to be represented by a collection of numbers. "Label" is used when emphasizing the application or experimental flavor; "target" is used in contexts with heavy math content.
Finally, we need to point out that data has been a core component in both data mining and machine learning. However, the role of data can still be slightly different due to the different objectives of these two areas: data mining aims to uncover patterns in the data, while machine learning is intended for reproducing known patterns and making predictions based on them. That is, data is explored in data mining to gain knowledge; by contrast, data is used in machine learning for creating models that can handle future prediction tasks.

Machine Learning Workflow

The general workflow of machine learning can be summarized as a process of preparing data and feeding it into a model so that the model parameters can be optimized to reach a model with satisfactory performance in the analysis of new data. Despite the common traits, the detailed workflow of performing machine learning varies across different categories of machine learning algorithms: supervised, unsupervised, semi-supervised, and reinforcement machine learning. Typical differences are listed below.
  • How to prepare data? E.g., collected before training (supervised, unsupervised) vs. during training (reinforcement learning)
  • What does the data look like? E.g., labeled (supervised) vs. unlabeled (unsupervised)
  • How to define the best performance? E.g., labels (supervised) vs. data characteristics (unsupervised) vs. rewards (reinforcement)
In fact, the process can be slightly different even between algorithms in the same category.
Despite the differences, some terms referring to specific stages of the workflow are widely used everywhere. In particular, the process and some terms associated with supervised learning are more familiar to us due to the predominant role of supervised learning in traditional machine learning. Here, we will introduce the machine learning process in common supervised learning tasks. The workflows of other categories of machine learning algorithms can be understood with deviations from this one. Those workflows, together with the deviations, can be seen in later chapters for other categories of machine learning algorithms.
Let us first take a look at how (supervised) machine learning is adopted for addressing typical engineering problems. As shown in Fig. 2.7, the following steps, illustrated with a short code sketch after the list, are what we typically adopt in supervised machine learning.
Figure 2.7: Machine learning (supervised) workflow
  1. Collect and prepare data (including data augmentation, labeling the data, and dividing the data into training and testing datasets).
  2. Choose or develop a machine learning algorithm.
  3. Train the model with the algorithm using the training data.
  4. Evaluate the model using the testing data.
  5. Fine-tune hyperparameters to improve the model.
  6. Make predictions for new data samples.
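The steps above can be sketched end to end as follows (a minimal scikit-learn example added for illustration; it uses the bundled breast cancer dataset and logistic regression as stand-ins for a real application, and it skips step 5 for brevity):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1: prepare data and divide it into training and testing datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-3: choose an algorithm and train the model on the training data.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Step 4: evaluate the model on the held-out testing data.
print(model.score(X_test, y_test))  # mean accuracy

# Step 6: make predictions for new data samples.
print(model.predict(X_test[:3]))
```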
A few things in the above procedure need to be clarified and emphasized.
First, data has an essential role in the above process. We can even understand this process as a sequence of data operations.
Second, training and testing appear to be the core stages, whose significance usually needs no emphasis. However, in real engineering practice, it is common to find that data preparation takes a major portion or even the majority of the effort. In particular, the availability of out-of-the-box AI tools eases the selection and use of algorithms/models, while data in specific engineering applications usually needs to be collected, cleaned, labeled, and pre-processed before use. The work needed for generating, cleaning, and augmenting data varies case by case, depending on the data quality, algorithms, expectations for performance, etc. In many cases, data labeling needs to be done manually, which can be tedious and labor-intensive. Sometimes, model evaluation and fine-tuning can also take a lot of time and care.
Third, the training and testing stages should use two datasets that are independent and identically distributed (IID), which is critical to the success of the above process. Independence and distribution are two major characteristics of datasets. Independence implies that the two datasets share no common samples; otherwise, testing will be redundant with training to some degree, depending on the overlap. Identical distribution denotes that the random variables (attributes of the data samples) should have the same distributions in both datasets; otherwise, the two datasets will have different natures, leading to two different problems for training and testing. In particular, a model trained to work well on one problem will very likely work poorly on another problem with a totally different nature. In some cases, we also have a sub-stage called cross-validation in training. Cross-validation is different from testing in that testing uses data strictly held out from training, while cross-validation shares data with training. Cross-validation typically involves shuffling and sampling steps and runs alternately with the training process.
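A sketch of cross-validation (scikit-learn's cross_val_score is one convenient realization; 5 folds is an arbitrary choice here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The data is repeatedly re-split into training and validation folds,
# so validation shares data with training, unlike a held-out test set.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(scores.mean())
```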
Finally, the above process may be simpler than many actual machine learning jobs. For example, in more advanced data analytics work, feature engineering and feature extraction may need to be separated from data preparation as a significant, separate step. Also, for people who are more focused on studying rather than applying algorithms, algorithm development and model evaluation will be emphasized. In more complex industrial AI applications, we may also need to involve more steps for model testing and deployment, for example, deploying the model in a hypercare mode before it goes live, during which model evaluation may happen multiple times.
