
Chapter 7

Bayesian Algorithms

7.1. Overview

This chapter introduces Bayesian methods, which are also called Bayesian algorithms, Bayesian machine learning, and probabilistic machine learning in the literature. Because Bayesian methods were initially proposed for, and are by nature closer to, classification tasks, the term 'Bayesian classifier' is also very common or even predominant in some technical publications. However, it is worth mentioning that such Bayesian classifiers, which belong to the parametric machine learning methods, can also be extended to regression tasks. Other nonparametric Bayesian methods, such as the Gaussian process, are proposed more for regression and may appear quite different from traditional Bayesian classifiers.
In this chapter, we will first provide a general background for statistics-based machine learning, which covers the statistical inference adopted by both frequentists and Bayesians. The frequentists' inference method, i.e., maximum likelihood estimation, is used by many other machine learning methods such as artificial neural networks, while the Bayesians' method, i.e., Bayesian estimation, serves as the basis of the Bayesian methods in this chapter. Then, major parametric Bayesian methods, e.g., the Naive Bayes classifier, Bayesian networks, and Markov processes, will be discussed. Next, one nonparametric Bayesian method, i.e., the Gaussian process, will be introduced.
This chapter differs from the other chapters in a few ways, which require close attention to avoid possible confusion. First, the diverse, incomplete, and sometimes conflicting statistics content in the machine learning literature, especially on Bayesianism-related topics, can cause severe learning difficulties. We therefore start the chapter with an introduction to extra statistics knowledge, considering the overlap between machine learning and applied statistics. You can skip this section if you just want to know how specific Bayesian methods work. Second, a few other criteria for classifying machine learning methods, such as discriminative vs. generative and parametric vs. nonparametric, can also be hard to understand. These will be explained when appropriate as we move through the technical content.

7.2. Statistics Background for Machine Learning

7.2.1. Statistics and Machine Learning

As discussed in the Appendix for statistics, the role and significance of statistics in the general context of machine learning can vary depending on the background, goal, and focus of machine learning practitioners. Nevertheless, for some machine learning methods, especially the Bayesian methods in this chapter, a strong background in statistics is required to better understand and utilize them. For this reason, additional statistics knowledge is presented in this chapter beyond what is provided in the appendices. This section thus offers a big picture of the role of statistics in machine learning, which is needed in this chapter and is also helpful for other machine learning topics.

7.2.2. Frequentists and Bayesians

In statistics, there are two mainstreams: frequentism and Bayesianism. Frequentists build the probability from past events. In this view, the probability does not depend on our beliefs but on objective measurements of what has occurred. For example, if we toss a coin 100 times and heads shows up 38 times, then the frequency/probability of heads is $38/100 = 0.38$. By contrast, Bayesians view probability as a measure of belief; the probability is thus subjective and refers more to the future. For this reason, the Bayesian workflow starts with a belief, called a prior, and then obtains some data to update that belief, producing an outcome called a posterior. As more data is obtained, the posterior becomes a new prior, and this process can be repeated to improve the model with more data.
When using statistics to address problems, such as classification and regression problems in machine learning, we essentially try to understand how different samples are distributed in the sample space, which determines their discrete (classification) or continuous (regression) labels/targets. We can understand the learning (or narrowly, training) process as an effort to search for the best statistical model to describe the data, so that such a model can be used for future predictions. Such a model can be defined with parameters represented by $\vec{\theta}=[\theta_{1}, \theta_{2}, \cdots]$. So machine learning for both classification and regression essentially aims to infer the parameters $\vec{\theta}$ for defining the model.
The difference between these two treatments of probability also explains why frequentists mostly use Maximum Likelihood Estimation, while Bayesians adopt Bayesian Estimation. In the following, these two statistical inference methods will be introduced first. The former may be brought up again when introducing other machine learning methods. The introduction to the latter will recall and further elaborate on some statistical concepts introduced in the Appendix for statistics as well as in other basic statistics literature.
Here, let us first recall Bayes' theorem, which will be used in different ways in this chapter.
\begin{equation*}
P(A \mid B)=\frac{P(B \mid A) P(A)}{P(B)}, \text{ if } P(B) \neq 0 \tag{7.1}
\end{equation*}
where,
$A$ and $B$ are events and $P(B) \neq 0$.
$P(A \mid B)$ is a conditional probability: the probability of event $A$ occurring given that $B$ is true. It is also called the posterior probability of $A$ given $B$.
$P(B \mid A)$ is also a conditional probability: the probability of event $B$ occurring given that $A$ is true. It can also be interpreted as the likelihood of $A$ given a fixed $B$.
$P(A)$ and $P(B)$ are the prior probabilities (also called marginal probabilities) of observing $A$ and $B$, respectively.
If we substitute the law of total probability into the denominator of Bayes' theorem, we can obtain the form of Bayes' theorem that we typically use in Bayesian classifiers.
\begin{equation*}
P(B_{n} \mid A)=\frac{P(A \mid B_{n}) P(B_{n})}{\sum_{m=1}^{N} P(A \mid B_{m}) P(B_{m})} \tag{7.2}
\end{equation*}
where $\{B_{n}: n=1,2,3, \ldots, N\}$ is a finite or countably infinite partition of the sample space (in other words, a set of pairwise disjoint events whose union is the entire sample space), in which each event $B_{n}$ is measurable, and $A$ is an event in the same probability space.
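To make this form concrete, the short sketch below evaluates the posterior for a two-event partition with hypothetical values; the priors and likelihoods are assumptions chosen purely for illustration.

# A minimal numeric illustration of Eq. 7.2 with hypothetical values.
# Partition {B_1, B_2}: e.g., "spam" vs. "not_spam"; A: "message contains a trigger word".
priors = {"spam": 0.3, "not_spam": 0.7}        # P(B_n), assumed values
likelihoods = {"spam": 0.6, "not_spam": 0.05}  # P(A | B_n), assumed values

# Law of total probability for the denominator: P(A) = sum_n P(A | B_n) P(B_n)
p_a = sum(likelihoods[b] * priors[b] for b in priors)

# Posterior P(B_n | A) for each event in the partition
posteriors = {b: likelihoods[b] * priors[b] / p_a for b in priors}
print(posteriors)  # e.g., P(spam | A) is approximately 0.84 with these assumed numbers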

7.2.3. Overview of Statistical Inference

Statistical inference aims to learn characteristics of a population, i.e., the underlying probability distribution of all the possible data, from observed data sampled from the population [59]. The goal of statistical inference or the development of statistical models in machine learning, no matter with Maximum Likelihood Estimation (MLE), Maximum A Posteriori Estimation (MAP), or Bayesian inference, is to infer the optimal model parameterized by $\vec{\theta}$ from data $\bar{X}$.
The Bayes' theorem can be reformulated as follows for statistical inference.
\begin{equation*}
P(\vec{\theta} \mid \bar{X})=\frac{P(\bar{X} \mid \vec{\theta}) \cdot P(\vec{\theta})}{P(\bar{X})} \tag{7.3}
\end{equation*}
$P(\bar{X})$ is something we generally cannot compute. However, the value of $P(\bar{X})$ does not matter that much because it is just a normalization constant. When comparing models, we are mainly interested in the expressions containing $\vec{\theta}$, because $P(\bar{X})$ stays the same for each model.
To explain MLE, MAP, and Bayesian inference using the above equation, it is better to first recall that an algorithm is usually selected prior to estimation. This algorithm defines a hypothesis space as explained in previous chapters. In other words, the algorithm provides a collection (a space) of possible models (hypotheses). Next, let us use the following analogy to understand what we are trying to do. What we have is some data ($\bar{X}$) generated by one or multiple machines (analogous to models). Our aim is to look for such machine(s), related to or defined by $\vec{\theta}$, so that we can best reproduce the same data. In this analogy, selecting an algorithm is analogous to finding a manufacturer who has a collection of such machines.
MLE aims to maximize the likelihood as follows:
\begin{equation*}
\vec{\theta}^{*}=\arg \max_{\vec{\theta}} P(\bar{X} \mid \vec{\theta}) \tag{7.4}
\end{equation*}
Thus, MLE tries to identify one model, parameterized by $\vec{\theta}$, that most likely generates $\bar{X}$. That is, the probability of generating $\bar{X}$ under the condition of $\vec{\theta}$ is maximized in MLE.
MAP differs from MLE in that MAP includes prior knowledge. In the same analogy, we can understand this as different machines having different probabilities of delivering the claimed performance. $P(\vec{\theta})$ can be viewed as prior knowledge provided by the manufacturer about such probabilities. If we include this term in the estimation, we obtain the following equation based on the MLE equation (Eq. 7.4) and Bayes' equation above (Eq. 7.3):
\begin{equation*}
\vec{\theta}^{*}=\arg \max_{\vec{\theta}} P(\bar{X} \mid \vec{\theta}) P(\vec{\theta})=\arg \max_{\vec{\theta}} P(\vec{\theta} \mid \bar{X}) \tag{7.5}
\end{equation*}
As can be seen, the major difference between MLE and MAP is whether prior knowledge $P(\vec{\theta})$ is included. More strictly, $P(\vec{\theta})$ is (prior knowledge about) the probability of the model parameterized by $\vec{\theta}$ in the hypothesis space. This difference also explains why MLE is more susceptible than MAP to overfitting on datasets with outliers: without prior knowledge to constrain the model, MLE is more inclined to fit the outliers. Two simple algorithms that represent MLE and MAP are logistic regression without and with L2 regularization, respectively. Without regularization, logistic regression uses MLE and tries to fit all data points, including outliers. With regularization, which acts as a prior on the weights, logistic regression can limit the influence of outliers that conflict with the regularization term.
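As an illustrative sketch of this contrast (not part of the original derivation), the code below fits scikit-learn's logistic regression on synthetic data with a few injected label outliers, once with a very weak L2 penalty (approximating MLE) and once with a strong L2 penalty (an MAP-like fit, since the L2 term corresponds to a Gaussian prior on the weights). The dataset and the chosen C values are arbitrary assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
x, y = make_classification(n_samples=300, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.3)

# Inject a few label outliers into the training set
rng = np.random.default_rng(0)
flip = rng.choice(len(y_train), size=10, replace=False)
y_train[flip] = 1 - y_train[flip]

# MLE-like fit: very weak L2 penalty (C is the inverse regularization strength)
mle_like = LogisticRegression(C=1e6, max_iter=5000).fit(x_train, y_train)

# MAP-like fit: strong L2 penalty, i.e., a tight Gaussian prior on the weights
map_like = LogisticRegression(C=0.1, max_iter=5000).fit(x_train, y_train)

print("MLE-like (C=1e6) test accuracy:", mle_like.score(x_test, y_test))
print("MAP-like (C=0.1) test accuracy:", map_like.score(x_test, y_test))
print("Weight norms:", np.linalg.norm(mle_like.coef_), np.linalg.norm(map_like.coef_))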
Bayesian inference is quite different because it searches for a combination of different models, or an ensemble, instead of a single model. Therefore, instead of the optimal model $\vec{\theta}^{*}$, Bayesian methods aim to obtain the probabilities of different models $P(\vec{\theta} \mid \bar{X})$, or more accurately, $P(\vec{\theta}_{t} \mid \bar{X})$. The number of models can be large or even infinite. In the latter case, we can draw a certain number of samples, e.g., $T$ ($t \in\{1, \cdots, T\}$). Then, the prediction made by the overall model (or ensemble) is the average weighted by $P(\vec{\theta}_{t} \mid \bar{X})$; a small numerical sketch of this weighted averaging follows the list below.
Compared with MLE and MAP, Bayesian methods have the following advantages.
  • Bayesian methods perform better on small datasets. This is because, as mentioned, they are by nature ensemble learning. Due to the same reason, they are less susceptible to overfitting.
  • It is more convenient to incorporate prior knowledge into the Bayesian estimate. Therefore, extra constraints or assumptions about data or model distributions can be easily integrated.
  • Bayesian methods can better handle uncertainty. MLE and MAP only obtain a point estimate of $\vec{\theta}$ via learning, e.g., the weights of linear models and ANNs. Such deterministic predictions may not be adequate or reasonable in many conditions. For example, a classification model of this type outputs the label corresponding to the class with the highest probability. However, no information is available about the confidence level of such a prediction. Bayesian methods can provide such confidence information, which can prompt us to intervene when the confidence level is too low.
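Below is a minimal sketch of the Bayesian weighted averaging mentioned above, assuming a one-parameter coin-flip model, a uniform prior, and a simple grid of candidate parameters; all of these choices are made only for illustration.

import numpy as np

# Observed data: 7 heads out of 10 flips (assumed for illustration)
heads, flips = 7, 10

# Candidate models theta_t: a grid over the heads probability
thetas = np.linspace(0.01, 0.99, 99)

# Uniform prior P(theta_t) and likelihood P(X | theta_t)
prior = np.ones_like(thetas) / len(thetas)
likelihood = thetas**heads * (1 - thetas)**(flips - heads)

# Posterior weights P(theta_t | X) via Bayes' theorem (normalized over the grid)
posterior = likelihood * prior
posterior /= posterior.sum()

# Ensemble prediction: probability that the next flip is heads,
# averaged over all candidate models weighted by their posteriors
p_next_heads = np.sum(posterior * thetas)
print("Posterior-averaged P(next flip is heads):", p_next_heads)

# Compare with the single-model MLE prediction
print("MLE prediction:", thetas[np.argmax(likelihood)])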
In the following, we will first provide more information about the use of MLE. After that, the general Bayesian estimation will be introduced, followed by detailed parametric and nonparametric Bayesian Methods.

7.2.4. Maximum Likelihood Estimation (MLE)

Suppose we have a training dataset $\bar{X}$, which has $I$ independent and identically distributed (i.i.d.) samples. Let us assume the distribution of the samples can be described using a statistical model characterized by $\vec{\theta}$. The basic idea of MLE in frequentism is that $\vec{\theta}$ is unknown but exists objectively, and $\vec{\theta}$, as a constant (or constants), can be estimated by ensuring that the given (training) dataset has the maximum likelihood to appear (or be observed). Following this understanding, we can obtain the mathematical formulation of MLE:
\begin{equation*}
\theta_{MLE}=\arg \max_{\vec{\theta}} O(\bar{X} \mid \vec{\theta})=\arg \min_{\vec{\theta}} \ell(\bar{X} \mid \vec{\theta}) \tag{7.6}
\end{equation*}
where $P\left(\vec{x}_{i} \mid \vec{\theta}\right)$ is the probability that the occurrence of sample $\vec{x}_{i}$ is predicted by the model parameterized by $\vec{\theta}$.
The MLE objective function $O(\bar{X} \mid \vec{\theta})$ is constructed as the total probability of all the samples, and the loss function $\ell$ is its opposite. The objective function can be constructed as follows if we assume the samples are independent of each other:
\begin{equation*}
O(\bar{X} \mid \vec{\theta})=P(\bar{X} \mid \vec{\theta})=P\left(\vec{x}_{1}, \cdots, \vec{x}_{I} \mid \vec{\theta}\right)=P\left(\vec{x}_{1} \mid \vec{\theta}\right) \cdots P\left(\vec{x}_{i} \mid \vec{\theta}\right) \cdots P\left(\vec{x}_{I} \mid \vec{\theta}\right)=\prod_{i=1}^{I} P\left(\vec{x}_{i} \mid \vec{\theta}\right) \tag{7.7}
\end{equation*}
In a general multiclass classification problem, what we try to maximize is the expectation of the total probability. When different samples and different attributes are not equally weighted, the objective is calculated using $O(\vec{x} \mid \vec{\theta})=\sum_{j=1}^{J} \lambda_{i, j} P\left(x_{i j} \mid \vec{\theta}\right)$, in which $\lambda_{i, j}$ represents the weights. Accordingly, we obtain
\begin{equation*}
\theta_{MLE}=\arg \max_{\vec{\theta}} O(\bar{X} \mid \vec{\theta})=\arg \max_{\vec{\theta}}\left[\sum_{j=1}^{J} \lambda_{i, j} P\left(x_{i j} \mid \vec{\theta}\right)\right] \tag{7.8}
\end{equation*}
In practice, the value of $P\left(\vec{x}_{i} \mid \vec{\theta}\right)$ can be very small, and the product of many such small values over a large dataset may cause numerical underflow. To avoid such problems, we usually work with the logarithm of the likelihood and use the negative log-likelihood as the loss function for optimization. Then, the mathematical formulation of MLE, excluding the weights $\lambda_{i, j}$, becomes
\begin{gather*}
\ell=-\sum_{i=1}^{I} \ln P\left(\vec{x}_{i} \mid \vec{\theta}\right) \tag{7.9}\\
\theta_{MLE}=\arg \max_{\vec{\theta}} \ln [P(\bar{X} \mid \vec{\theta})]=\arg \min_{\vec{\theta}} \ell \tag{7.10}
\end{gather*}
Let us employ a simple example to understand MLE. Here, we have a bag containing black and white balls, for which we do not know the ratio of black balls to white balls. We draw a ball from the bag, record its color, and put it back. After performing 100 such ball-taking operations, we got 75 white balls in total. What is the proportion of white balls?
MLE can be used to estimate the above parameter $\vec{\theta}$, that is, the proportion of white balls $P$ in this example.
\begin{equation*}
O(\bar{X} \mid \vec{\theta})=P\left(\vec{x}_{1} \mid \vec{\theta}\right) \cdot P\left(\vec{x}_{2} \mid \vec{\theta}\right) \cdots P\left(\vec{x}_{100} \mid \vec{\theta}\right)=P^{75} \cdot(1-P)^{25} \tag{7.11}
\end{equation*}
Then the natural log of the likelihood function is
\begin{equation*}
\ln P(\bar{X} \mid \vec{\theta})=75 \ln P+25 \ln (1-P) \tag{7.12}
\end{equation*}
The maximum value appears when the derivative of the above function is 0:
\begin{equation*}
\frac{d(\ln P(\bar{X} \mid \vec{\theta}))}{d P}=\frac{75}{P}-\frac{25}{1-P}=0 \tag{7.13}
\end{equation*}
Hence, we can get $P=0.75$.
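As a quick numerical check of this result, the short sketch below evaluates the log-likelihood of Eq. 7.12 on a grid of candidate proportions and picks the maximizer; the grid resolution is an arbitrary choice for illustration.

import numpy as np

# Candidate proportions of white balls (avoid 0 and 1, where the log is undefined)
p_grid = np.linspace(0.001, 0.999, 999)

# Log-likelihood of observing 75 white and 25 black balls (Eq. 7.12)
log_likelihood = 75 * np.log(p_grid) + 25 * np.log(1 - p_grid)

# The MLE is the proportion that maximizes the log-likelihood
p_mle = p_grid[np.argmax(log_likelihood)]
print("MLE of the proportion of white balls:", p_mle)  # approximately 0.75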
This method is relatively simple. However, its validity lies in a strict assumption that the dataset can represent the sample space. In other words, the distribution of the training samples needs to be the same or close to the distribution in the sample space.

7.2.5. Bayesian Estimation

As mentioned, Bayesian estimation differs from MLE in that it assumes the models or model parameters, i.e., $\vec{\theta}$, have their own probability distribution, which needs to be estimated for the given data. Therefore, the estimate of the model parameters can differ for different training data. For the same reason, the model estimate can also be updated as more data is obtained.
In the following, we can use classification tasks as an example to illustrate how Bayesian estimation models for classification, or Bayesian classifiers, are built up.
Let us consider a multiclass classification problem, in which we need to classify a dataset $\bar{X}$ consisting of $I$ samples $\vec{x}_{1}, \vec{x}_{2}, \cdots, \vec{x}_{i}, \cdots, \vec{x}_{I}$, and each sample has $J$ attributes $x_{i 1}, x_{i 2}, \cdots, x_{i j}, \cdots, x_{i J}$. We will need to classify the samples into $K$ categories: $y_{i} \in\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$. The goal is to predict the probability of any sample $\vec{x}_{i}$ belonging to the $k$th category: $P\left(c_{k} \mid \vec{x}_{i}\right)$.
Next, we slightly modify the above Bayes' theorem to illustrate its use in Bayesian estimation for classification purposes. We replace $\vec{\theta}$ with $c_{k}$, which represents "belonging to class $c_{k}$", and replace $\bar{X}$ with an individual sample $\vec{x}_{i}$. Thus, we are now considering individual models and samples instead of all the models (ensemble) and all the samples. Also, to maintain a consistent understanding with the previous subsection, we can understand $c_{k}$ as a model that solely generates or predicts Class $k$ samples.
\begin{equation*}
P\left(c_{k} \mid \vec{x}_{i}\right)=\frac{P\left(\vec{x}_{i}, c_{k}\right)}{P\left(\vec{x}_{i}\right)}=\frac{P\left(\vec{x}_{i} \mid c_{k}\right) P\left(c_{k}\right)}{P\left(\vec{x}_{i}\right)}=\frac{P\left(x_{i 1}, x_{i 2}, \cdots, x_{i J} \mid c_{k}\right) P\left(c_{k}\right)}{P\left(\vec{x}_{i}\right)} \tag{7.14}
\end{equation*}
where the joint probability $P\left(x_{i 1}, x_{i 2}, \cdots, x_{i J} \mid c_{k}\right)$ is the probability that all the attribute values of sample $\vec{x}_{i}$ appear simultaneously. That is, each attribute can take multiple possible values, and the specific value taken by Sample $\vec{x}_{i}$ has a probability. Such probabilities, as well as the joint probability, can be estimated using the training data.
The above formulation of Bayes' theorem lays a theoretical basis for all the Bayes classifiers. In fact, in a more general sense, it presents a framework for understanding all the classification methods or even all the supervised learning methods. Based on the way the above equation is used, algorithms can be classified into discriminative models and generative models.
Discriminative models assume a function (implicit or explicit) for $P\left(c_{k} \mid \vec{x}_{i}\right)$ directly. A majority of the classic machine learning algorithms such as linear models, support vector machines, traditional ANNs, and KNNs are discriminative models. By contrast, generative models assume some functional forms for $P\left(\vec{x}_{i}, c_{k}\right)$ and $P\left(\vec{x}_{i}\right)$, estimating their parameters directly from the training data, to calculate $P\left(c_{k} \mid \vec{x}_{i}\right)$ indirectly. Bayesian methods like Naive Bayes and Bayesian networks, as well as less common approaches like Markov random fields and Hidden Markov Models (HMMs), produce generative models.
Intuitively, discriminative models are more straightforward and thus can be simpler than generative models. However, generative models may provide better insights into the data. Using classification tasks as an example, we can simply understand that discriminative models identify boundaries in the data space, whereas generative models attempt to model how the data is distributed throughout the space.

7.3. Parametric Bayesian Methods

7.3.1. Naive Bayes Classifier

One major difficulty in applying the above framework (Eq. 7.14) for tasks like classification is that the joint probability, i.e., $P\left(x_{i 1}, x_{i 2}, \cdots, x_{i J} \mid c_{k}\right)$, is not easy to deal with. This is because it is hard to tell the relationship between attributes or their corresponding random variables. In particular, some of these variables may be mutually dependent. In that case, the calculation of $P\left(x_{i 1}, x_{i 2}, \cdots, x_{i J} \mid c_{k}\right)$ will be really challenging because it requires us to identify such inter-dependencies first.
The Naive Bayes classifier simplifies the problem by assuming that all the attributes (the corresponding random variables) are independent of each other. As a result, the joint probability becomes the product of the probabilities of all the attributes: $P\left(x_{i 1}, x_{i 2}, \cdots, x_{i J} \mid c_{k}\right)=\prod_{j=1}^{J} P\left(x_{i j} \mid c_{k}\right)$. Substituting this into Eq. 7.14, we can obtain the major equation for the Naive Bayes classifier:
\begin{equation*}
P\left(c_{k} \mid \vec{x}_{i}\right)=\frac{P\left(\vec{x}_{i}, c_{k}\right)}{P\left(\vec{x}_{i}\right)}=\frac{P\left(c_{k}\right) \prod_{j=1}^{J} P\left(x_{i j} \mid c_{k}\right)}{P\left(\vec{x}_{i}\right)} \tag{7.15}
\end{equation*}
where $P\left(x_{i j} \mid c_{k}\right)$ is the probability of the attribute value of the $j$th attribute of Sample $\vec{x}_{i}$ among all the samples in Class $c_{k}$. Thus, $x_{i j}$ only provides a specific attribute value out of all the possible $V$ values for this attribute, i.e., $a_{v}$, in which $v \in\{1,2, \cdots, V\}$.
It is noted that, in some literature, the subscripts $i$ and $k$ are omitted to simplify the use of Bayes' theorem in Bayesian estimation for multiclass classification problems. Accordingly, $\vec{x}$ is a vector representing any sample, and $c$ is any class. This form is presented in the following but will not be adopted, to ensure a more accurate description.
\begin{equation*}
P(c \mid \vec{x})=\frac{\prod_{j=1}^{J} P\left(x_{j} \mid c\right) P(c)}{P(\vec{x})}=\frac{P(c)}{P(\vec{x})} \prod_{j=1}^{J} P\left(x_{j} \mid c\right) \tag{7.16}
\end{equation*}
The use of Eq. 7.15 requires us to find a way of calculating $P\left(x_{i j} \mid c_{k}\right)$ and $P\left(c_{k}\right)$. $P\left(c_{k}\right)$ can be easily computed as the percentage of Class $c_{k}$ samples among all the samples. Thus, the following equation can be used:
\begin{equation*}
P\left(c_{k}\right)=\frac{\left|D_{c_{k}}\right|}{|D|} \tag{7.17}
\end{equation*}
where $\left|D_{c_{k}}\right|$ and $|D|$ are the numbers of samples in Class $c_{k}$ and in the whole dataset, respectively.
The calculation of $P\left(x_{i j} \mid c_{k}\right)$ is quite different for discrete (non-numeric) and continuous (numeric) attribute values. When we have a finite number of discrete values for an attribute, $P\left(x_{i j} \mid c_{k}\right)$ can be calculated as the ratio of the number of Class $c_{k}$ samples with the attribute value $x_{i j}$, i.e., $\left|D_{c_{k}, x_{i j}}\right|$, to the number of all the samples in this class, $\left|D_{c_{k}}\right|$:
\begin{equation*}
P\left(x_{i j} \mid c_{k}\right)=\frac{\left|D_{c_{k}, x_{i j}}\right|}{\left|D_{c_{k}}\right|} \tag{7.18}
\end{equation*}
where $\left|D_{c_{k}, x_{i j}}\right|$ is the number of samples in Class $c_{k}$ whose $j$th attribute takes the value $x_{i j}$. Thus, it depends only on the attribute value and not on the particular sample $\vec{x}_{i}$.
When we have continuous attribute values such as float numbers, the probability $P\left(x_{i j} \mid c_{k}\right)$ can be computed by assuming that the attribute (random variable) follows a probability distribution. A predominant assumption is the Gaussian distribution: $P\left(x_{i j} \mid c_{k}\right) \sim \mathcal{N}\left(\mu_{k, j}, \sigma_{k, j}^{2}\right)$. Based on this assumption, $P\left(x_{i j} \mid c_{k}\right)$ can be computed as
\begin{equation*}
P\left(x_{i j} \mid c_{k}\right)=\frac{1}{\sqrt{2 \pi} \sigma_{k, j}} \exp \left[-\frac{\left(x_{i j}-\mu_{k, j}\right)^{2}}{2 \sigma_{k, j}^{2}}\right] \tag{7.19}
\end{equation*}
The above equation can be used to calculate the probability of any attribute of any sample. The parameters of the variable's distribution, i.e., $\mu_{k, j}$ and $\sigma_{k, j}$, can be obtained from the training data. For example, we can find all the samples in Class $c_{k}$ and assess the distribution of the values of attribute $j$ to compute the mean $\mu_{k, j}$ and the standard deviation $\sigma_{k, j}$ for the probability density function.
When all the attributes have continuous values, we will need to use a Gaussian distribution for each of them. The Naive Bayes classifier, in this case, is also called the Gaussian Naive Bayes classifier in some places.
The implementation of the Naive Bayes classifier is very straightforward. What can be confusing is that the training and testing processes are lumped together in some way. Let us assume that we want to predict the class of a sample $\vec{x}_{i}$ based on a given labeled training dataset $\bar{X}$. Then we just need to calculate $P\left(c_{k}\right)$ and $P\left(x_{i j} \mid c_{k}\right)$ from the training data using Eq. 7.17, Eq. 7.18, and Eq. 7.19. In particular, $P\left(x_{i j} \mid c_{k}\right)$ can be calculated directly for discrete attribute values, while for continuous attribute values, the computation of $\mu$ and $\sigma$ needs to be carried out with the training data before the computation of the probability. Next, we just need to use Eq. 7.15 to predict the labels of new samples.
As can be seen, Naive Bayes classifiers require us to assess the distribution of the data. Thus, to ensure a smooth implementation, we need enough data to compute the probabilities. For the probabilities calculated as percentages, we need samples in both the numerators and the denominators of the ratios. Usually, we modify the equations for calculating $P\left(c_{k}\right)$ and $P\left(x_{i j} \mid c_{k}\right)$ for discrete attribute values. The following Laplace correction is commonly adopted to address this issue, which helps smooth the predicted probability distributions.
\begin{align*}
P\left(c_{k}\right) & =\frac{\left|D_{c_{k}}\right|+1}{|D|+K} \tag{7.20}\\
P\left(x_{i j} \mid c_{k}\right) & =\frac{\left|D_{c_{k}, x_{i j}}\right|+1}{\left|D_{c_{k}}\right|+N_{j}} \tag{7.21}
\end{align*}
where $K$ is the number of classes in the training dataset and $N_{j}$ is the total number of attribute values in Attribute $j$.
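The following minimal sketch applies Eqs. 7.20 and 7.21 to a tiny, made-up categorical dataset; the attribute values and labels are purely hypothetical.

import numpy as np

# Hypothetical categorical training data: one attribute ("outlook") and a binary label
outlook = np.array(["sunny", "sunny", "rain", "overcast", "rain", "sunny"])
label = np.array(["no", "no", "yes", "yes", "yes", "no"])

classes = np.unique(label)    # K classes
values = np.unique(outlook)   # N_j possible values of this attribute
K, N_j = len(classes), len(values)

for c in classes:
    in_class = label == c
    # Laplace-corrected prior P(c_k), Eq. 7.20
    p_c = (np.sum(in_class) + 1) / (len(label) + K)
    for v in values:
        # Laplace-corrected conditional P(x_ij | c_k), Eq. 7.21
        p_v_c = (np.sum(in_class & (outlook == v)) + 1) / (np.sum(in_class) + N_j)
        print(f"P({v!r} | {c!r}) = {p_v_c:.3f}   (P({c!r}) = {p_c:.3f})")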
For the probabilities calculated using distribution functions, i.e., $P\left(x_{i j} \mid c_{k}\right)$ for continuous attribute values, we need to make sure that we have enough training samples to fit the distribution function. This helps avoid the unintended loss of information about attributes due to inadequate or missing attribute values.
The following code shows how to train a classification model using the Gaussian Naive Bayes algorithm explained in this subsection. The Iris dataset from Scikit-learn was adopted for training. The performance of the trained model was compared with that obtained using the Gaussian Naive Bayes implementation from Scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.naive_bayes import GaussianNB

# ------------------- Load data ---------------------- #
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=20190308, test_size=0.3)

# Calculate the probabilities of every attribute value P(x_j|c) for P(x|c)
def gaussian_pdf(x_test, x):
    return np.exp(-(x_test - x.mean(0))**2 / (2 * x.std(0)**2)) / np.sqrt(2 * np.pi * x.std(0)**2)

# -------------- Define Naive Bayes Model --------------- #
classes = np.unique(np.concatenate([y_train, y_test], 0))
pred_probs = []
# Loop over all classes to obtain the probabilities of all testing samples belonging to different classes
for c in classes:
    mask_c = y_train == c  # Boolean mask of the training samples in Class c
    p_c = np.sum(mask_c) / len(y_train)  # Calculate P(c)
    p_x_c = np.prod(gaussian_pdf(x_test, x_train[mask_c]), 1)  # Use Gaussian to calculate P(x|c) for continuous attribute values
    prob_c = p_c * p_x_c  # prob_c is proportional to the probability of each test sample belonging to Class c
    pred_probs.append(prob_c)
# Array for probabilities of samples belonging to different classes: N_samples x N_classes
pred_probs = np.vstack(pred_probs).T
# Index of the class with the highest probability as the classification for each sample
label_idx = pred_probs.argmax(1)
y_pred = classes[label_idx]  # Select the class based on the index
# Accuracy: correct when the predicted class is the same as the actual label
score = np.count_nonzero(y_test == y_pred) / len(y_test)
print('The accuracy of the self-developed Gaussian Naive Bayes is: ', score)

# ------------------ Apply Gaussian Naive Bayes from Scikit-learn --------------------- #
model = GaussianNB()
model.fit(x_train, y_train)
print('The accuracy of Gaussian Naive Bayes from Scikit-learn is: ', model.score(x_test, y_test))

The execution of the above code produced the following results:
The accuracy of self-developed Gaussian Naive Bayes is: 0.9333333333333333
The accuracy of Gaussian Naive Bayes from Scikit-learn is: 0.9333333333333333

7.3.2. Semi-Naive Bayesian Classifier

Naive Bayesian classifiers are fairly simple to understand and implement. However, the independence assumption on attributes adopted by Naive Bayesian classifiers may not hold in many cases. To address this issue, semi-naive Bayesian classifiers were proposed by relaxing the assumption of the conditional independence of attributes to some extent. This "semi-relaxation" of the independence requirement explains the origin of the name "semi-naive".
Semi-naive Bayesian classifiers still consider the relatively strong attribute dependencies but overlook weak interdependencies to reach a compromise between accuracy and convenience. Different ways of selecting dependencies lead to different semi-naive Bayesian classifiers. In the following, common semi-naive Bayesian classifiers, i.e., the One-Dependent Estimator (ODE), Tree Augmented Naive Bayes (TAN) [60], and Averaged ODE (AODE) [61], are introduced.

One-Dependent Estimator (ODE)

ODE is one of the most popular categories of semi-naive Bayesian classifiers. ODE is distinct in that it assumes each attribute depends on at most one other attribute:
\begin{equation*}
P\left(\vec{x}_{i}, c_{k}\right)=P\left(c_{k}\right) \cdot \prod_{j=1}^{J} P\left(x_{i j} \mid c_{k}, p a_{j}\right) \tag{7.22}
\end{equation*}
where $pa_{j}$ is the attribute that the $j$th attribute of Sample $\vec{x}_{i}$, i.e., $x_{i j}$, depends on. The above equation is formulated for Sample $\vec{x}_{i}$, so $i$ is carried in the corresponding terms. However, such dependency is not related to the sample being considered.