
2.3.3. Challenges and Issues in Machine Learning

Data Issues

As the "food" for machine learning, data, and especially its quality, can easily affect or even determine the development, validation, and application of machine learning models. Unfortunately, real-world problems usually do not provide datasets that are as well assessed, structured, and annotated as the well-documented datasets shipped with machine learning packages, such as the Iris dataset from Scikit-learn. In engineering, we usually need to use data collected in different environments and in different ways, which may be highly heterogeneous, highly unstructured, incomplete, erroneous, or of unknown quality. Such data may consume a lot of, or even most of, the project time and threaten the validity of the machine learning models built on it. In short, common data issues can be summarized as inadequate data, immature data, incorrect data, noisy data, and biased data.
Figure 2.12: 2020 Kaggle survey on most commonly used machine learning algorithms (18,996 respondents)
Inadequate data is a very common issue in traditional machine learning, especially before the advent of big data. In many cases, a major cause of the poor performance of machine learning models is an inadequate amount of data. That is why much effort in traditional machine learning was devoted to making better use of data, e.g., via resampling techniques such as bagging and cross-validation. Even after we entered the era of big data, inadequate data still haunts many machine learning tasks. This is because, as the complexity/capacity of a model increases, the amount of data that is needed for the model to reach the same performance also increases. Besides, the inadequacy of data sometimes does not mean that the number of samples is too small, but rather that the number of usable samples is not adequate. For example, it is usually not difficult to obtain images for computer vision; however, labeled images, which may require labor- and expertise-intensive work to generate, can be hard to obtain. Data augmentation, innovative labeling techniques, and semi-supervised machine learning are common solutions to data inadequacy.
Immature data refers to data that needs a significant amount of pre-processing work before it can be used for training or testing a model. This is very common, as data may be incomplete, heterogeneous, and structured in ways that are not compatible with a model. For example, many deep learning models require a specific shape for the input arrays and a certain way of labeling the data (e.g., one-hot labeling). This usually requires us to spend a lot of time converting every sample and label into the required format. The workload can be astonishing when we need to deal with a great amount of such data. Under this condition, it can be even harder to tell what data immaturity issues the data may have, because it is impractical to check the samples one by one when it may take minutes or hours just to load the dataset. It is helpful to develop code that can automatically assess and preprocess the data, such as format checking, trimming, resizing, and removal of malformed samples, though the development of such code may also be time-consuming.
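As a minimal sketch of such automated pre-processing, the code below (Python with NumPy; the array names and target shape are hypothetical) drops malformed samples, pads inconsistent image arrays to a common shape, and converts integer labels into one-hot vectors, as many deep learning models expect.

```python
import numpy as np

def preprocess(samples, labels, target_shape=(64, 64), n_classes=3):
    """Check shapes, drop malformed samples, and one-hot encode labels."""
    clean_x, clean_y = [], []
    for x, y in zip(samples, labels):
        if x is None or x.ndim != 2:          # drop malformed samples
            continue
        if x.shape != target_shape:           # crude fix: crop/pad to target shape
            padded = np.zeros(target_shape, dtype=x.dtype)
            h = min(x.shape[0], target_shape[0])
            w = min(x.shape[1], target_shape[1])
            padded[:h, :w] = x[:h, :w]
            x = padded
        clean_x.append(x)
        clean_y.append(y)
    x_arr = np.stack(clean_x)[..., np.newaxis]     # add channel axis: (N, H, W, 1)
    y_arr = np.eye(n_classes)[np.array(clean_y)]   # one-hot labels: (N, n_classes)
    return x_arr, y_arr

# Hypothetical usage with a mix of well-formed and malformed samples
samples = [np.random.rand(64, 64), np.random.rand(50, 60), None]
labels = [0, 2, 1]
x, y = preprocess(samples, labels)
print(x.shape, y.shape)   # (2, 64, 64, 1) (2, 3)
```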
Incorrect data is another type of data issue. Compared with other issues, it is hard to detect and, if overlooked, can cause serious outcomes such as incorrect models. One typical example is mislabeled data: some samples may be assigned wrong labels for a variety of reasons. Unfortunately, when such data is used for training, the misinformation will also be learned by the model. In particular, for instance-based algorithms such as KNN, the data is included as part of the model, so errors are also integrated into the model, leading to problematic predictions for future data. In regression tasks, wrong label values due to systematic errors such as sensor drift also belong to this type. Such issues can be addressed by manual data assessment by human experts and by algorithms that can detect them; for example, anomaly detection algorithms can be used to identify suspicious samples.
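One possible way to flag suspicious samples is an off-the-shelf anomaly detector. The sketch below applies scikit-learn's IsolationForest to synthetic one-dimensional "sensor" readings with a few fabricated drifted values; real datasets would of course require more careful feature choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "sensor" readings: mostly normal, plus a few drifted/incorrect values
normal = rng.normal(loc=10.0, scale=0.5, size=(95, 1))
drifted = rng.normal(loc=14.0, scale=0.5, size=(5, 1))   # e.g., sensor drift
readings = np.vstack([normal, drifted])

detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(readings)   # -1 marks suspected anomalies

suspect_idx = np.where(flags == -1)[0]
print("Suspected incorrect samples:", suspect_idx)
```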
Noisy data is very similar to incorrect data but differs in some ways. It is characterized by the existence of a small amount of data that exhibits trends different from the rest. It can be caused by random errors, such as mislabels due to accidental operations, which affect specific data points, rather than systematic errors, which lead to an offset in all the data. Furthermore, it is also possible that the apparent noise is not caused by an error at all, but instead stems from the distributions associated with the data. Such issues can be handled both by processing the data to screen out the noisy samples and by adopting more tolerant algorithms. For example, the soft margin in the SVM helps accommodate samples that do not meet the basic assumptions about the data distribution.
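To illustrate how a more tolerant algorithm can absorb a few noisy samples, the sketch below (synthetic data with a handful of deliberately flipped labels) compares soft-margin SVMs in scikit-learn with different values of the penalty parameter C; a smaller C allows more margin violations.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated clusters with a few intentionally flipped (noisy) labels
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(y), size=5, replace=False)
y[noisy_idx] = 1 - y[noisy_idx]

for C in (100.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A smaller C gives a softer margin that tolerates the mislabeled points
    print(f"C={C}: support vectors={len(clf.support_)}, "
          f"training accuracy={clf.score(X, y):.2f}")
```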
Biased data is produced when certain samples are over-represented or given more importance than others. Such data causes a typical issue: it cannot represent the real problem or the new cases that we need to generalize to. For example, a training dataset may not cover all cases that have already occurred or/and are occurring. Biased data may lead to inaccurate predictions, skewed outcomes, and other analytical errors. In other words, the model may learn from data that only represents a part or an aspect of the problem and then extend that knowledge to the whole problem. Such issues can be addressed by determining where the dataset is actually biased and proposing countermeasures to rectify the bias.
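One common and easily corrected form of bias is class imbalance. The sketch below, using a synthetic imbalanced dataset, shows how scikit-learn's class_weight option can serve as a simple countermeasure; other forms of bias may require re-collecting or re-sampling data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic dataset in which one class is heavily under-represented
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Balanced accuracy reveals how much the minority class is neglected
print("plain   :", balanced_accuracy_score(y_te, plain.predict(X_te)))
print("weighted:", balanced_accuracy_score(y_te, weighted.predict(X_te)))
```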

Inductive Bias

Strictly speaking, inductive bias is not an issue. However, it can cause issues if we do not understand it and treat it properly. It is among the most difficult concepts to understand in machine learning. Meanwhile, it is an essential element of machine learning, even though it may not be noticed by many machine learning practitioners. A better understanding of it can help us search for more suitable models and avoid issues due to inappropriate selection or treatment of the inductive bias.
Inductive bias can be formally defined as the assumption(s) that a machine learning algorithm adopts to generalize a limited set of observations (training samples) into a general model. As introduced in the section on symbolic versus numerical AI, machine learning as a numerical AI approach boils down to induction: the process of moving from specific observations to general rules or models.
Inductive bias is needed because we need to provide information describing what is "general" in induction. Take the regression problem in Fig. 2.13 as an example. Two models, i.e., Model 1 (linear, green) and Model 2 (nonlinear, blue), can be obtained from the same training dataset (black circles). These two models exhibit the same performance if we use typical regression metrics like mean absolute error, because both curves pass through all the training data points. In this case, how can we tell which model is better or reflects more general rules?
Figure 2.13: Example of need for inductive bias
We can rephrase the above problem using machine learning terms to obtain a strict description. First, supervised learning can be viewed as a process of searching in a set of all possible mappings or, more broadly, hypotheses. This set is called the hypothesis space. The learning goal is to find a hypothesis that can match or provide the best description of the training data. However, in many cases, there is more than one hypothesis in the hypothesis space that is compatible with the training data. These compatible hypotheses constitute the version space. Just like in the above example, both Model 1 and Model 2 are best-fitting models. In this case, if no information about future data is provided, we need a bias to indicate our inclination or preference for selecting a model. The "Occam's Razor" principle is a common inductive bias. This principle states that we should choose the simpler model when two models exhibit comparable performance.
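The situation in Fig. 2.13 can be mimicked numerically. In the sketch below (with made-up training points), a linear model and a nonlinear model both fit the training data exactly yet disagree away from it, so a preference such as Occam's razor is needed to choose between them.

```python
import numpy as np

# Made-up training points that lie exactly on a straight line
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = 2.0 * x_train + 1.0

# Model 1: the straight line itself
model1 = lambda x: 2.0 * x + 1.0
# Model 2: the same line plus a sine term that vanishes at the training points
model2 = lambda x: 2.0 * x + 1.0 + 3.0 * np.sin(np.pi * x)

for name, model in [("Model 1 (linear)", model1), ("Model 2 (nonlinear)", model2)]:
    mae = np.mean(np.abs(model(x_train) - y_train))
    print(f"{name}: training MAE = {mae:.2e}")   # both are (numerically) zero

# The two models agree on the training data but disagree elsewhere,
# so a preference (e.g., Occam's razor) is needed to pick one.
print("predictions at x = 2.5:", model1(2.5), model2(2.5))
```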
The word "bias" already indicates that inductive biases represent a priori and subjective information. So, they are not always correct. Getting back to the above example, we can see that, if the data that will appear in future applications of the model is more like the blue dots, then the nonlinear model is better. On the contrary, if the future data is more like the green squares, the linear model is better. This leads to an extremely important idea in machine learning: it makes little sense to talk about models without mentioning the data. Thus, instead of saying a model is good or not, we may need to say whether it is suitable for an application or for the data associated with that application. When the target application changes or, more essentially, the probability distribution of the possible data associated with the application changes, we may also need to adjust the model so that it can maintain its performance.
Figure 2.14: Need for hypotheses on different layers
We can see from the above example that a better understanding of possible future data helps avoid issues that may be caused by inductive bias. That is also the reason for introducing testing data. However, even if we do that, the use of inductive bias is still inevitable in many cases. An extreme example is illustrated in Fig. 2.14. In this example, the two models have the same level of complexity and performance, so we need another hypothesis on top of Occam's razor. Though this example is simplistic and contrived, it shows that we may need to use multiple hypotheses on different levels to help determine which model is better without knowing anything about the data that we will encounter.
Inductive bias is hard to understand also because it may appear in different forms in different algorithms. In the above regression problem, Occam's razor is widely accepted as an inductive bias in regression and has been incorporated into many regression algorithms via regularization terms. However, it is neither the only option nor a must-have inductive bias. For example, a distinct inductive bias, selecting the decision boundary with the widest margin, is explicitly specified in most SVM algorithms. Further examples of inductive biases include the maximum conditional independence in the Naive Bayes classifier, the minimum number of features in feature selection, and the nearest neighbor assumption in KNN.
It is worthwhile to mention that inductive bias not only determines which solution will be selected, but also affects whether we can efficiently find a solution. From this perspective, we can also understand it as constraints that we place on the search, which may affect both the resulting solution and the solution process. Let us take deep learning as an example. The inductive biases of the CNN are locality (elements in space show higher correlation as they get closer) and spatial invariance (the kernel weights are shared across locations). The inductive biases of the RNN are sequentiality (points that are close to each other in time are related) and time invariance (RNNs share weights across time steps). Thus, these two types of deep NNs can be viewed as constrained special cases of fully-connected deep NNs, which assume all elements can be related. The extra inductive biases help CNNs and RNNs search for solutions along the biased directions for computer vision and natural language processing problems, respectively. They produce faster and more accurate results than general deep NNs without such biases, because computer vision and natural language processing problems exhibit locality and sequentiality, respectively.
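As a rough numerical illustration, the sketch below (assuming a tf.keras environment) compares the number of trainable parameters of a fully-connected layer and a convolutional layer applied to the same small image input; locality and weight sharing shrink the hypothesis space dramatically.

```python
import tensorflow as tf

# Fully-connected layer: every pixel may relate to every output unit
dense_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(32),
])

# Convolutional layer: locality + shared kernel weights (inductive bias)
conv_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, input_shape=(28, 28, 1)),
])

print("Dense parameters :", dense_model.count_params())   # 784*32 + 32 = 25,120
print("Conv2D parameters:", conv_model.count_params())    # 3*3*1*32 + 32 = 320
```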
Thus, the selection of inductive bias not only affects the usefulness of the model, but also determines how a model can be constructed and identified. One opinion in recent computer vision research is that traditional CNNs involve too much inductive bias. Thus, deep learning mechanisms like self-attention in the ViT (Vision Transformer) can provide better performance by loosening the constraints placed by the inductive bias. More recently, multi-layer perceptron architectures, which carry even less inductive bias, have been used to approach the accuracy of SOTA models on ImageNet. This leads to a controversial question: is the inductive bias in CNNs unnecessary? In fact, a better way to understand this is that inductive bias helps strike a balance between fast, accurate solutions and flexibility. When we do not have strong resources for obtaining a solution, e.g., better data, higher computing power, or more efficient algorithms, it is better to use inductive bias to help us stay focused so that we can find an acceptable solution or reach it more quickly. But when our resources are satisfactory, we can remove some inductive bias or loosen the constraints so that we can find better solutions or better ways of reaching them.

Underfitting and Overfitting

Generalization

As explained, machine learning represents a process of learning general rules from specific observations. From this perspective, the goal of machine learning is to generalize from the training data to any data from the problem domain. A good model allows us to make predictions for data that will appear in future applications, which the model has not seen in the training stage. Thus, we use the concept of "generalization" to describe how well a model trained with specific observations can perform on the data that it will be applied to.

Overfitting and Underfitting

Overfitting and underfitting are two issues, or outcomes of poor generalization, of machine learning models. In fact, they are also the two major causes of poor performance of machine learning algorithms. As shown in the previous section, without any knowledge about the future data, we can only rely on inductive bias to assess/select models. To improve the model selection, we usually split the available data in supervised learning into a training dataset and a testing dataset. In this way, the testing data serves as a representation of future data, and the trained model can be assessed using it. Let us assume both the training and testing data can perfectly represent all the possible data. Then, a model that is trained with the training dataset and can achieve comparable performance on the testing dataset is believed to have good generalization.
Fig. 2.15 gives an illustration of underfitting, good fitting, and overfitting as well as training and testing data. If a model's performance on testing data is poorer than on training data, e.g., lower accuracy and higher loss, we can infer that the low generalization may be caused by overfitting. If poor model performance is observed on both datasets, then underfitting may be the reason. High bias (large overall offset) and low variance (little scatter), which will be introduced in detail in the chapter on ensemble learning, are two common indicators of underfitting. Good fitting is associated with good results, e.g., high accuracy and low loss, on both the training and testing datasets.
Figure 2.15: Underfitting and overfitting
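A minimal sketch of this diagnosis with scikit-learn (synthetic data and a decision tree whose depth is varied): a large gap between training and testing accuracy suggests overfitting, while low accuracy on both suggests underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that an unconstrained tree can overfit
X, y = make_classification(n_samples=500, n_informative=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, depth in [("shallow (may underfit)", 1), ("unlimited (may overfit)", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"{name:25s} train acc={model.score(X_tr, y_tr):.2f} "
          f"test acc={model.score(X_te, y_te):.2f}")
```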
It is worthwhile to mention that, usually, training data and testing data cannot perfectly represent all the possible data. However, to ensure the above process for assessing and improving generalization is valid, we need to make sure the training data and testing data are Independent and Identically Distributed (written as i.i.d. or IID). The independence requirement ensures the two datasets are not essentially the same; otherwise, it would not make sense to separate the data into two sets. The identical-distribution requirement enforces that the training data and testing data have the same nature or come from the same data domain. If we consider that the random variables corresponding to different attributes follow certain distributions for the problem(s) that we want to address, then we want to make sure such distributions in the training data, the testing data, and all the possible data are the same.
Underfitting is easier to address than overfitting. A straightforward way is to increase the model complexity or switch to a new model with higher capacity (complexity). The detailed changes differ between algorithms; taking NNs as an example, we can add more layers and neurons. Sometimes, the capacity of the model may already be enough, and what is truly needed is more thorough training. For example, some algorithms use iterative optimization to search for the best model. In such a case, we need to wait until enough iterations or epochs are finished so that the loss can gradually decrease to an acceptable value. If any technique for addressing overfitting is used, we may also need to reduce the effect of that technique to alleviate underfitting, e.g., by reducing the weight of the regularization term in the loss function.
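The sketch below illustrates these remedies in tf.keras under placeholder data and settings: a network with more layers and neurons, a lighter L2 penalty, and more training epochs.

```python
import numpy as np
import tensorflow as tf

# Placeholder data; in practice use the actual training set
x_train = np.random.rand(256, 10).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

# Higher-capacity model: more layers and neurons, lighter regularization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,),
                          kernel_regularizer=tf.keras.regularizers.l2(1e-5)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Train for more epochs so the loss has a chance to decrease sufficiently
history = model.fit(x_train, y_train, epochs=200, verbose=0)
print("final training loss:", history.history["loss"][-1])
```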

Techniques to Prevent Overfitting

Overfitting can be addressed or controlled from three different angles: 1) controlling/reducing model complexity, 2) better monitoring and controlling the training process to avoid over-training, and 3) making data better represent the whole data domain (or sample space).
For model complexity control, different types of machine learning algorithms may use different techniques, but regularization and model trimming are very common in most supervised learning algorithms.
Regularization constrains the model complexity by including functions of model parameters as a penalty term in the loss function. In this way, the complexity of the model needs to be considered when searching for the optimal model during training.
\begin{equation*}
\ell(\vec{w}, b, \alpha)=\alpha\|\vec{w}\|+f(err) \tag{2.3}
\end{equation*}
where $f(err)$ represents the original loss as a function of error, $\|\vec{w}\|$ is the norm of the vector (or higher-order tensor) containing the model parameters, and $\alpha$ is a parameter that we use to adjust the significance of the complexity term in the optimization. The higher the $\alpha$ value, the more likely that simpler models will be sought. A norm is a measure of the magnitude of a vector; in this case, it tells the size of the array composed of all the model parameters. Usually, we use L1 or L2 norms, leading to L1 and L2 regularization, respectively.
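In scikit-learn, for instance, L2 and L1 regularization of linear regression correspond to Ridge and Lasso, whose alpha argument plays the role of $\alpha$ in Eq. (2.3). A quick sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=50)   # only one truly useful feature

for name, model in [("no regularization", LinearRegression()),
                    ("L2 (Ridge)", Ridge(alpha=1.0)),
                    ("L1 (Lasso)", Lasso(alpha=0.1))]:
    model.fit(X, y)
    norm = np.linalg.norm(model.coef_)
    print(f"{name:18s} ||w|| = {norm:.3f}")
# A larger alpha shrinks ||w|| further, favoring simpler models.
```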
Model trimming can be implemented in different ways depending on the algorithm. For example, in decision trees, this can be done by limiting the maximum depth (or maximum number of layers) of the tree during construction or by removing parts of a fully grown tree, in the so-called pre-pruning and post-pruning processes, respectively. In deep learning, we can add dropout and pooling layers to intentionally leave out and combine information, respectively.
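The sketch below (tf.keras assumed; the layer sizes are illustrative) shows the deep learning side: a pooling layer combines neighboring information and a dropout layer randomly leaves out activations during training, both limiting the effective model complexity.

```python
import tensorflow as tf

# A small CNN with pooling (combines information) and dropout (randomly leaves
# out a fraction of activations during training) to limit overfitting
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),          # drop 50% of units at training time
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```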
Overfitting can also be addressed by better understanding and controlling the training process. The following are some widely used techniques.
Hold-out is very useful for controlling overfitting by better informing us of the extent of overfitting. This is actually what we did when splitting the data into training and testing sets, e.g., 60%/40% or 80%/20%. As introduced above, the difference between the performance indicators on the two datasets serves as a measure of overfitting. If overfitting occurs, we should stop training and take other actions. This overfitting prevention technique has been widely accepted as an essential step of training; thus, sometimes, it is not even considered a special technique for dealing with overfitting.
Cross-validation is very similar to testing, but it is counted as part of the training process. We split our dataset into k groups, i.e., k-fold cross-validation. Then we use one group for validation and the others for training during a training time unit, e.g., a certain number of iterations or epochs (one epoch means looping over all the samples once). Next, another group is used for validation, while the remaining groups are used for training during the next training time unit. This process continues until the training is finished. We can see that this allows us to check the extent of overfitting even before testing. Compared with hold-out, cross-validation allows all data to be eventually used for training, while being more computationally expensive.
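A minimal k-fold cross-validation sketch with scikit-learn (5 folds, the Iris dataset, and a KNN classifier chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold serves once as the validation set while the rest are used for training
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))
```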
Early stopping is an action we usually take when we find that overfitting occurs. In the typical loss versus iterations/epochs plot during cross-validation, once the validation loss stops decreasing and instead begins to increase well beyond the training loss, we stop the training and save the current model. An early stopping trigger can be set to stop the training automatically.
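In tf.keras, for instance, such a trigger can be set with the EarlyStopping callback, which monitors the validation loss and can restore the best weights; the data and settings below are placeholders.

```python
import numpy as np
import tensorflow as tf

# Placeholder regression data; in practice use the real training set
x = np.random.rand(500, 8).astype("float32")
y = np.random.rand(500, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss has not improved for 10 epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
history = model.fit(x, y, validation_split=0.2, epochs=500, verbose=0,
                    callbacks=[early_stop])
print("stopped after", len(history.history["loss"]), "epochs")
```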
We can also help prevent or eliminate overfitting by better preparing the data.
One goal of data improvement is to make the data better represent the sample space. This can be done directly by increasing the amount of data. Data sampling techniques that generate data better reflecting the distributions of the real data also help. Data augmentation is another popular technique: it adds variations, such as noise or simple transformations, to the existing data. In deep learning, this can be performed by flipping, rotating, cropping, rescaling, or shifting image data.
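A NumPy-only sketch of simple image augmentation (the input image is a placeholder array): flips, rotations, shifts, and slight noise generate extra training samples from each original.

```python
import numpy as np

def augment(image):
    """Return a few simple augmented copies of a 2-D image array."""
    return [
        np.fliplr(image),                                 # horizontal flip
        np.flipud(image),                                 # vertical flip
        np.rot90(image),                                  # 90-degree rotation
        np.roll(image, shift=3, axis=1),                  # small horizontal shift
        image + np.random.normal(0, 0.01, image.shape),   # add slight noise
    ]

image = np.random.rand(32, 32)        # placeholder image
augmented = augment(image)
print("original shape:", image.shape, "| augmented copies:", len(augmented))
```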
Feature selection is a good option when we have only a limited number of training samples while each sample contains many features. In this case, we can select the most important features for training, so that the model does not need to learn from as many features, which could more easily lead to overfitting. For this purpose, different combinations of features can be tried to train models with the best generalization. Alternatively, we can resort to a feature selection method and use the features it selects for training.
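One common recipe is scikit-learn's SelectKBest, which scores each feature against the labels and keeps only the top k. A quick sketch with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Many features, few of them informative, and relatively few samples
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_selected = selector.transform(X)

print("original features :", X.shape[1])
print("selected features :", X_selected.shape[1])
print("kept feature index:", selector.get_support(indices=True))
```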