
Chapter 8

Artificial Neural Networks

8.1. Overview

This chapter presents essential knowledge about the Artificial Neural Network (ANN), or more specifically, shallow ANNs. We will first examine the structure of biological neurons to understand how an ANN can mathematically imitate a biological neural network's behavior. Next, the mathematical formulation of the well-known M-P neuron model and other significant concepts, such as the activation function, will be discussed. Then, we will see how the perceptron, as one of the simplest and most typical NNs, works from a mathematical perspective. In particular, the training of such a network via backpropagation will be detailed. The resulting equations will be generalized for NNs with more than three layers to guide implementations. Based on that, extra skills for NN implementations, an illustration of the implementation procedure with an example, and the transformation of data as it moves through a network will be discussed in detail. Finally, a few other shallow ANN issues will be briefly mentioned.

8.2. Basics of Artificial Neural Networks

8.2.1. From Biological Neural Network to ANN

The Artificial Neural Network (ANN) was inspired by the way that biological neural networks work [66]. It conceptualizes such working mechanisms and then mimics the functions with mathematical formulations [67]. Thus, prior to an introduction to ANN, let us first take a look at the structure of a biological neural network, especially how different units, i.e., neurons and their components, work with each other so that information can be processed.
As one of the two basic types of cells in the nervous system, neurons serve the role of transmitting information using electrical impulses and chemical signals. As illustrated in Fig. 8.1, a typical neuron consists of a cell body and two types of extensions: an axon and several (e.g., 5-7) dendrites. Dendrites have tree-branch shapes and receive information from other cells. An axon has a long-tail shape and transmits information out of the cell. Information is transmitted and processed within the cell as electrical impulses. Neighboring neurons communicate with each other via chemicals across a small contact region called a synapse, which lies between an axon and a dendrite of the next neuron.
In biological neural networks, every neuron is connected to other neurons. When the electrical potential of a neuron exceeds a threshold value, the neuron is activated. Once activated, it transmits chemicals to the connected neurons to change their electrical potentials. In 1943, McCulloch and Pitts conceptualized this mechanism [9], yielding the well-known M-P model, which is widely adopted for building ANNs in machine learning. Mathematically, the M-P model for a neuron can be described as
\begin{equation*}
y=f\left(\sum_{j=1}^{J} w_{j} x_{j}-\theta\right) \tag{8.1}
\end{equation*}
where the output $y$ is formulated as a function of the inputs $x_{j}$, in which $j$ indexes the $J$ input variables (or attributes), $w_{j}$ is the corresponding weight, and $\theta$ is the threshold for the neuron.
Thus, as shown in Fig. 8.2, a sample or the input (column) array for Neuron $k$, which is marked as $\vec{x}_{i}$, will first be added up according to the corresponding weights: $\vec{w}_{k}^{T} \cdot \vec{x}_{i}=\sum_{j=1}^{J} w_{k j} x_{i j}$. The weighted sum, as the result of the transfer function marked as "$\sum$" in the figure, will then be treated by the activation function $f(\cdot)$ to produce the output $y_{k}$.
Figure 8.1: Architecture and working mechanisms of neurons
Figure 8.2: M-P formulation of the working mechanism of neurons
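Before moving on, a minimal NumPy sketch of Eq. 8.1 may help make the M-P neuron concrete; the input, weights, threshold, and the ideal step activation below are illustrative values, not taken from the text.

import numpy as np

def mp_neuron(x, w, theta, f):
    # M-P neuron (Eq. 8.1): weighted sum of inputs minus threshold, then activation
    return f(np.dot(w, x) - theta)

step = lambda z: 1.0 if z >= 0 else 0.0   # the ideal (step) activation

x = np.array([1.0, 0.0, 1.0])    # illustrative input attributes
w = np.array([0.5, -0.2, 0.8])   # illustrative weights
print(mp_neuron(x, w, theta=1.0, f=step))  # 1.0: weighted sum 1.3 exceeds the threshold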

8.2.2. Activation Function

The function $f$ in Eq. 8.1 is the activation function. The ideal activation function is the step function, but we usually adopt continuous and smooth functions, such as the Sigmoid and tanh functions, instead. The major purpose of using an activation function is to squash the input, which can span a wide range of values, into a small output range, e.g., 0 to 1. As a result, an activation function is also called a squashing function. A neural network, as a network of connected neurons, can thus be viewed as a combination of many functions of the form $y=f\left(\sum_{j=1}^{J} w_{j} x_{j}-\theta\right)$.
The Sigmoid function in the following is a classic activation function.
\begin{equation*}
y=\frac{1}{1+e^{-x}} \tag{8.2}
\end{equation*}
This function converts any value into the range between 0 and 1. In addition to squeezing the input into the designated output range, this function has another significant property that promotes its widespread adoption in backpropagation for training neural networks: the derivative of the Sigmoid function can be conveniently expressed using the function itself.
\begin{equation*}
y^{\prime}=y \cdot(1-y) \tag{8.3}
\end{equation*}
Another popular activation function is the tanh function.
\begin{equation*}
y=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \tag{8.4}
\end{equation*}
This function squeezes the input into the range between -1 and 1. Similar to the Sigmoid function, the derivative of $\tanh$ can also be expressed using the function itself as
\begin{equation*}
y^{\prime}=1-y^{2} \tag{8.5}
\end{equation*}
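As a quick check of Eqs. 8.3 and 8.5, the short NumPy sketch below evaluates both activation functions and verifies the two derivative identities against central finite differences; the sample points are arbitrary.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # Eq. 8.2: output in (0, 1)

x = np.linspace(-4.0, 4.0, 9)           # arbitrary sample points
y_sig, y_tanh = sigmoid(x), np.tanh(x)  # Eq. 8.4: output in (-1, 1)
dy_sig = y_sig * (1.0 - y_sig)          # Eq. 8.3: y' = y(1 - y)
dy_tanh = 1.0 - y_tanh**2               # Eq. 8.5: y' = 1 - y^2

h = 1e-6   # central differences approximate the true derivatives
assert np.allclose((sigmoid(x + h) - sigmoid(x - h)) / (2*h), dy_sig, atol=1e-6)
assert np.allclose((np.tanh(x + h) - np.tanh(x - h)) / (2*h), dy_tanh, atol=1e-6)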

8.2.3. Perceptron

The perceptron is one of the simplest and most typical neural networks. It consists of two layers of neurons: one layer is for input, while the other consists of functional neurons for output. Due to its simple architecture, the perceptron can only learn linearly separable problems; when a problem is not linearly separable, oscillation can occur in the learning process. To handle nonlinear problems, multiple layers of functional neurons need to be adopted, as shown in the sketch below and discussed in the next subsection.
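The chapter does not spell out the perceptron's training procedure, but a minimal sketch of the classic perceptron learning rule (update each weight by $\eta(y-\tilde{y})x$ after each sample) may make the linear-separability limitation concrete; the AND data below are illustrative, and XOR labels would exhibit the oscillation mentioned above.

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    # Fold the threshold into the weights as a bias term (one extra w)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            y_hat = 1.0 if np.dot(w, xi) >= 0 else 0.0  # step activation
            w += eta * (yi - y_hat) * xi                # classic perceptron rule
    return w

# AND is linearly separable, so the weights converge;
# XOR labels [0, 1, 1, 0] would never converge with a single functional layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w = train_perceptron(X, np.array([0.0, 0.0, 0.0, 1.0]))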

8.2.4. Multiple Layer Feedforward Neural Network

A feedforward neural network is an ANN wherein the connections between nodes do not form a cycle. As such, it is different from its descendant, the recurrent neural network. The most common ANNs are multiple-layer feedforward neural networks, which can be obtained by stacking multiple layers of perceptrons. In such a model, every layer of neurons is fully connected with the neurons in the next layer, and neurons in the same layer do not connect with each other. In a typical setting, we have an input layer, one or more hidden layers, and an output layer. While the input layer only takes the input (no other math operations), the hidden layers and the output layer are composed of functional neurons like the M-P neuron above.

8.3. Training with Backpropagation

8.3.1. Concepts

Backpropagation is the most common learning algorithm for training ANNs. Put simply, the weights, which determine the behavior of the network, are updated based on the difference between the predictions and the true solutions (labels). The algorithm involves two steps: forward propagation (the forward pass) and backward propagation (the backward pass). Forward propagation refers to the calculation and storage of the outputs of individual neuron layers, layer by layer from the input layer to the output layer.
Backward propagation, in the strict sense, refers only to the calculation of the gradients and the updates of the weights using the obtained gradients. In each backward pass, the gradients are first obtained from the difference between the predicted and true solutions in the output layer and then propagated backward, layer by layer, to the first hidden layer. Such gradients or corrections, as the derivatives of the difference between the predicted and actual labels with respect to the different weights, can be viewed as the contributions of these weights to the errors. Therefore, the weights are updated according to their contributions to the errors. For example, weights or their corresponding neurons that cause an overestimate of the output will be modified so that they lead to lower predictions in the next forward pass.
The term "backpropagation" is often used loosely to refer to the entire training process or algorithm, including both of the two passes and how the gradient is used, such as by stochastic gradient descent. In this book, to avoid confusion, "forward pass" and "backward pass" will be used for the two specific processes with different directions, while "backpropagation" will be used to refer to the algorithm including both passes and the optimization method (solver).
We can view the threshold $\theta$ as the weight of a connection to a dummy node. More frequently, $\theta$ is simply counted as part of the weights, i.e., one extra $w$. Then, in each step of the backward pass, the weights $w$ and thresholds $\theta$ will be updated as
\begin{equation*}
w_{j}+\Delta w_{j} \longrightarrow w_{j} \tag{8.6}
\end{equation*}
As mentioned above, the increment or correction, $\Delta w_{j}$, is determined by the difference between the predicted and actual values, which is encoded in the loss function $\ell$:
\begin{equation*}
\Delta w_{j}=-\eta \frac{\partial \ell}{\partial w_{j}} \tag{8.7}
\end{equation*}
where $\eta$ is the learning rate. One backward pass consists of updating all the weights associated with connections between any two neurons in the network. It usually starts from the last layer of neurons connecting to the output, then propagates backward through the previous layers, and ultimately reaches the layer connecting to the input. That is why it is called backpropagation. A learning process consists of multiple rounds of alternating forward and backward passes. The forward pass obtains the predictions, and the backward pass calculates the error as the difference between the predictions and labels, based on which the weights are corrected. The whole learning process is, therefore, an optimization process whose goal is to minimize the difference between the predictions and the labels.
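As a tiny, self-contained illustration of the update rule in Eqs. 8.6 and 8.7, the sketch below drives a single weight of a made-up quadratic loss toward its minimizer; the loss function and learning rate are illustrative, not from the text.

# Gradient descent on a made-up loss: l(w) = (w - 3)^2, so dl/dw = 2(w - 3)
eta, w = 0.1, 0.0              # illustrative learning rate and initial weight
for _ in range(50):
    grad = 2.0 * (w - 3.0)     # partial derivative of the loss w.r.t. w
    w = w + (-eta * grad)      # Eq. 8.6 with the increment from Eq. 8.7
print(w)                       # approaches 3, the minimizer of the loss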

8.3.2. Backpropagation in a 3-Layer Network

In the following, we will use a three-layer feedforward network as an example to show the detailed process of backpropagation. Equations for calculating the increments in the weights will be derived, which are essential to coding a neural network. Illustrated in Fig. 8.3 is a representative neural network consisting of three layers of neurons, in which the bottom is the input layer containing $J$ neurons. Therefore, the forward direction in the figure is the upward direction, which converts the input $\vec{x}$ (or $x_{j}$) into the output $\vec{y}$ (or $y_{k}$). Except for the input layer, all the layers have input and output together with weights and their associated math operations. Taking the output layer as an example, its input is $\beta$ and its output is $y$; the weights include the $w$'s and $\theta$'s between the hidden and output layers, and the associated math operations are the linear combination and the activation. In this network, any neuron in the hidden layer, $l \in\{1, \cdots, L\}$, is connected to all the neurons in the previous layer ($J$), including Neuron $j$, and to all the neurons in the next layer ($K$), including Neuron $k$. A weight $w_{k l}$ connects Neuron $l$ in the hidden layer to Neuron $k$ in the next (output) layer.
As shown in Fig. 8.3, four types of operations are involved in converting the input of the network into its output. Among them, two generate $\alpha$ and $\beta$ as weighted linear combinations:
\begin{align*}
\alpha_{l} &= \sum_{j=1}^{J} v_{l j} x_{j} \tag{8.8}\\
\beta_{k} &= \sum_{l=1}^{L} w_{k l} b_{l} \tag{8.9}
\end{align*}
Thus, $\alpha$ and $\beta$ can be viewed as intermediate variables; in fact, they are the inputs to the hidden and output layers, respectively. The other two types of operations convert the input of each layer into the output of that layer. For the hidden layer, that is
\begin{equation*}
b_{l}=f\left(\alpha_{l}-\gamma_{l}\right) \tag{8.10}
\end{equation*}
and for the output, we have
\begin{equation*}
y_{k}=f\left(\beta_{k}-\theta_{k}\right) \tag{8.11}
\end{equation*}
Figure 8.3: Forward propagation in a 3-layer ANN
Next, let us distinguish the predictions of the network, $\tilde{\vec{y}}$, from the labels, $\vec{y}$. There are $K$ output nodes, so both $\tilde{\vec{y}}$ and $\vec{y}$ have $K$ elements; for example, $\tilde{\vec{y}}=\left[\tilde{y}_{1}, \tilde{y}_{2}, \cdots, \tilde{y}_{k}, \cdots, \tilde{y}_{K}\right]$. Each of the elements can itself be an array with one or more dimensions. After each forward pass, $\tilde{\vec{y}}$ is calculated, and the error can then be calculated against the given labels. Most commonly, the squared error is used for constructing the loss:
\begin{equation*}
\ell=\frac{1}{2} \sum_{k=1}^{K}\left(\tilde{y}_{k}-y_{k}\right)^{2} \tag{8.12}
\end{equation*}
Then, the backward pass can start from the above loss/error function. As introduced above, the purpose is to minimize this error using gradient descent. There are four types of weights in the above neural network, so we need to calculate four types of increments in each iteration: $\Delta w$, $\Delta \theta$, $\Delta v$, and $\Delta \gamma$. Let us see how to derive equations for calculating each of them, primarily using the chain rule. First, $\Delta w$ is obtained from the derivative of the error function $\ell$ with respect to $w$:
\begin{equation*}
\Delta w_{k l}=-\eta \frac{\partial \ell}{\partial w_{k l}} \tag{8.13}
\end{equation*}
As seen in Fig. 8.3, multiple operations and variables are involved in connecting $w$ to $\tilde{y}$. Thus, we can apply the chain rule:
\begin{align*}
\Delta w_{k l} &= -\eta \frac{\partial \ell}{\partial \tilde{y}_{k}} \frac{\partial \tilde{y}_{k}}{\partial \beta_{k}} \frac{\partial \beta_{k}}{\partial w_{k l}} \tag{8.14}\\
&= -\eta\left(\tilde{y}_{k}-y_{k}\right)\left[\tilde{y}_{k}\left(1-\tilde{y}_{k}\right)\right] b_{l}
\end{align*}
Please note that the above deduction assumed the use of the Sigmoid activation function, whose derivative is $f^{\prime}=f \cdot(1-f)$ (Eq. 8.3).
We can use a new symbol as follows to simplify the formulation:
\begin{equation*}
g_{k}=\frac{\partial \ell}{\partial \beta_{k}}=\left(\tilde{y}_{k}-y_{k}\right)\left[\tilde{y}_{k}\left(1-\tilde{y}_{k}\right)\right] \tag{8.15}
\end{equation*}
Then, $\Delta w_{k l}$ can be simply written as
\begin{equation*}
\Delta w_{k l}=-\eta g_{k} b_{l} \tag{8.16}
\end{equation*}
Likewise, the increment for $\theta$ is derived from the partial derivative of the error function with respect to $\theta$ as
\begin{align*}
\Delta \theta_{k} &= -\eta \frac{\partial \ell}{\partial \tilde{y}_{k}} \frac{\partial \tilde{y}_{k}}{\partial \theta_{k}} \\
&= -\eta\left(\tilde{y}_{k}-y_{k}\right)\left[-\tilde{y}_{k}\left(1-\tilde{y}_{k}\right)\right]=\eta\left(\tilde{y}_{k}-y_{k}\right)\left[\tilde{y}_{k}\left(1-\tilde{y}_{k}\right)\right] \tag{8.17}\\
&= \eta g_{k}
\end{align*}
Next, let us move to the previous layer for $\Delta v$ and $\Delta \gamma$.
\begin{align*}
\Delta v_{l j} &= -\eta \frac{\partial \ell}{\partial v_{l j}} \\
&= -\eta \sum_{k=1}^{K} \frac{\partial \ell}{\partial \beta_{k}} \frac{\partial \beta_{k}}{\partial b_{l}} \frac{\partial b_{l}}{\partial \alpha_{l}} \frac{\partial \alpha_{l}}{\partial v_{l j}} \tag{8.18}\\
&= -\eta \sum_{k=1}^{K} g_{k} w_{k l}\left[b_{l}\left(1-b_{l}\right)\right] x_{j}
\end{align*}
Note that the sum over $k$ appears because $b_{l}$ affects every output neuron. Similarly, let us define
\begin{equation*}
e_{l}=\frac{\partial \ell}{\partial \alpha_{l}}=\sum_{k=1}^{K} \frac{\partial \ell}{\partial \beta_{k}} \frac{\partial \beta_{k}}{\partial b_{l}} \frac{\partial b_{l}}{\partial \alpha_{l}}=\sum_{k=1}^{K} g_{k} w_{k l}\left[b_{l}\left(1-b_{l}\right)\right] \tag{8.19}
\end{equation*}
Then, $\Delta v_{l j}$ can be simply written as
\begin{equation*}
\Delta v_{l j}=-\eta \sum_{k=1}^{K} g_{k} w_{k l}\left[b_{l}\left(1-b_{l}\right)\right] x_{j}=-\eta e_{l} x_{j} \tag{8.20}
\end{equation*}
Accordingly, $\Delta \gamma$ can be obtained as
\begin{equation*}
\Delta \gamma_{l}=\eta e_{l}=\eta \sum_{k=1}^{K} g_{k} w_{k l}\left[b_{l}\left(1-b_{l}\right)\right] \tag{8.21}
\end{equation*}
When switching to a different activation function, e.g., tanh, we just need to replace every $\left[\tilde{y}_{k}\left(1-\tilde{y}_{k}\right)\right]$ with $\left(1-\tilde{y}_{k}^{2}\right)$ and every $\left[b_{l}\left(1-b_{l}\right)\right]$ with $\left(1-b_{l}^{2}\right)$.
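The derivation above is easy to verify numerically. The sketch below implements Eqs. 8.8-8.21 for a single sample and checks one analytical gradient against a finite difference; the layer sizes ($J=3$, $L=4$, $K=2$), the random data, and the learning rate are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
J, L, K = 3, 4, 2                       # illustrative layer sizes
x, y = rng.random(J), rng.random(K)     # one made-up sample and its label
v = rng.standard_normal((L, J)); gamma = rng.standard_normal(L)
w = rng.standard_normal((K, L)); theta = rng.standard_normal(K)
f = lambda z: 1/(1 + np.exp(-z))        # Sigmoid activation

def forward(v, gamma, w, theta):
    b = f(v @ x - gamma)                # Eqs. 8.8 and 8.10
    y_hat = f(w @ b - theta)            # Eqs. 8.9 and 8.11
    return b, y_hat

b, y_hat = forward(v, gamma, w, theta)
g = (y_hat - y) * y_hat * (1 - y_hat)   # Eq. 8.15
e = (w.T @ g) * b * (1 - b)             # Eq. 8.19 (the sum over k is w.T @ g)
eta = 1.0
dw = -eta * np.outer(g, b)              # Eq. 8.16
dtheta = eta * g                        # Eq. 8.17
dv = -eta * np.outer(e, x)              # Eq. 8.20
dgamma = eta * e                        # Eq. 8.21

# Finite-difference check: d(loss)/dw[0, 0] should equal g[0] * b[0]
loss = lambda yh: 0.5 * np.sum((yh - y)**2)   # Eq. 8.12
h = 1e-6
w2 = w.copy(); w2[0, 0] += h
numeric = (loss(forward(v, gamma, w2, theta)[1]) - loss(y_hat)) / h
assert np.isclose(numeric, g[0] * b[0], atol=1e-4)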

8.3.3. Backpropagation in Neural Networks with 3+ Layers

The expressions for the weights in the above 3-layer network can easily be extended to obtain expressions for the gradients of the weights in a neural network consisting of more layers, e.g., $N>3$. To simplify the deduction, we will adopt tensor notation rather than index notation. We will also need to start from the weights connecting to the last layer of neurons. As shown in Fig. 8.4, we get the following equation, similar to the deduction for the 3-layer network:
\begin{equation*}
\Delta \bar{w}^{N}=-\eta \vec{g}^{N} \otimes \vec{O}^{N-1} \tag{8.22}
\end{equation*}
and
\begin{equation*}
\Delta \vec{\theta}^{N}=\eta \vec{g}^{N} \tag{8.23}
\end{equation*}
where $\otimes$ is the outer (or tensor) product. In NumPy, the outer product can be computed with "numpy.tensordot" without any contraction (of axes), while '@' or "numpy.dot" performs the dot/inner product, which contracts the neighboring axes of the two tensors. Note that we now use superscripts to represent the number of the layer; this differs from the subscripts used in the equations above for node numbers. The subscripts can be dropped here because of the tensor notation. $\vec{O}^{N-1}$ is the tensor for the output of the second-to-last layer (i.e., Layer $N-1$), which corresponds to $\vec{b}$ in the above 3-layer model. $\vec{g}$ is written as follows when Sigmoid is used as the activation function.
\begin{equation*}
\vec{g}^{N}=\frac{\partial \ell}{\partial \vec{I}^{N}}=(\tilde{\vec{y}}-\vec{y}) \odot[\tilde{\vec{y}} \odot(1-\tilde{\vec{y}})] \tag{8.24}
\end{equation*}
where $\odot$ is the element-wise product, and $\vec{I}$ is the tensor of the input of the last layer. In NumPy, we simply use the operator '*' for the element-wise product.
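The three products used in this section map directly onto NumPy operations; the small arrays below are illustrative.

import numpy as np

g = np.array([1.0, 2.0])            # e.g., a gradient vector (K = 2)
O = np.array([3.0, 4.0, 5.0])       # e.g., a layer output (L = 3)
W = np.ones((2, 3))                 # e.g., a weight tensor (K x L)

outer = np.tensordot(g, O, axes=0)  # outer/tensor product, shape (2, 3)
outer2 = np.outer(g, O)             # equivalent for 1-D arrays
inner = g @ W                       # dot product: contracts neighboring axes, shape (3,)
elem = O * O                        # element-wise product, shape (3,)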
Then, we can move to the second-to-last layer, where the increment of the weight tensor $\bar{w}^{N-1}$ is expressed as
\begin{equation*}
\Delta \bar{w}^{N-1}=-\eta \vec{g}^{N} \cdot \bar{w}^{N} \odot\left[\vec{O}^{N-1} \odot\left(1-\vec{O}^{N-1}\right)\right] \otimes \vec{O}^{N-2}=-\eta \vec{g}^{N-1} \otimes \vec{O}^{N-2} \tag{8.25}
\end{equation*}
where $\vec{g}^{N-1}$ replaces the $e_{l}$ in the deduction for the 3-layer network. This is because $g_{k}$ and $e_{l}$ in that deduction have the same meaning: the derivative of the loss function with respect to the input of the corresponding layer. Therefore, $\vec{g}^{N-1}$ is formulated as
\begin{equation*}
\vec{g}^{N-1}=\frac{\partial \ell}{\partial \vec{I}^{N-1}}=\vec{g}^{N} \cdot \bar{w}^{N} \odot\left[\vec{O}^{N-1} \odot\left(1-\vec{O}^{N-1}\right)\right] \tag{8.26}
\end{equation*}
and the increment of the threshold $\vec{\theta}^{N-1}$ is
\begin{equation*}
\Delta \vec{\theta}^{N-1}=\eta \vec{g}^{N-1} \tag{8.27}
\end{equation*}
Figure 8.4: Workflow of a multilayer NN
If we continue the deduction to the next layer, we will see that the equations for Layer $N-2$ are the same as those for Layer $N-1$; the only difference is the substitution of $N$ with $N-1$ (and, as a result, $N-1$ becomes $N-2$). That is to say, any hidden layer $n$, i.e., a layer between Layer 2 and Layer $N-1$ (including these two layers), has the following equations for its weight updates.
\begin{gather*}
\vec{g}^{n}=\frac{\partial \ell}{\partial \vec{I}^{n}}=\vec{g}^{n+1} \cdot \bar{w}^{n+1} \odot\left[\vec{O}^{n} \odot\left(1-\vec{O}^{n}\right)\right] \tag{8.28}\\
\Delta \bar{w}^{n}=-\eta \vec{g}^{n} \otimes \vec{O}^{n-1} \tag{8.29}\\
\Delta \vec{\theta}^{n}=\eta \vec{g}^{n} \tag{8.30}
\end{gather*}
The above three equations can be used in a loop to update the weights in all the hidden layers.
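Eqs. 8.28-8.30 translate almost line by line into such a loop. The following is a minimal single-sample sketch under the chapter's conventions (Sigmoid activation, thresholds subtracted from each layer's input); the layer sizes, data, learning rate, and iteration count are illustrative.

import numpy as np

rng = np.random.default_rng(1)
sizes = [13, 8, 4, 1]        # illustrative: input size plus 3 functional layers
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
Ts = [np.zeros(m) for m in sizes[1:]]
f = lambda z: 1.0 / (1.0 + np.exp(-z))   # Sigmoid activation
eta = 0.5
x, y = rng.random(sizes[0]), rng.random(sizes[-1])  # one made-up sample

for it in range(1000):
    # Forward pass: store every layer output O^n (O[0] is the input itself)
    O = [x]
    for Wn, Tn in zip(Ws, Ts):
        O.append(f(Wn @ O[-1] - Tn))
    # Backward pass: g for the output layer (Eq. 8.24), then loop with Eq. 8.28
    g = (O[-1] - y) * O[-1] * (1 - O[-1])
    for n in reversed(range(len(Ws))):
        # Propagate g with the pre-update weights before touching Ws[n]
        g_prev = (g @ Ws[n]) * O[n] * (1 - O[n]) if n > 0 else None  # Eq. 8.28
        Ws[n] += -eta * np.outer(g, O[n])   # Eq. 8.29
        Ts[n] += eta * g                    # Eq. 8.30
        g = g_prev

print(0.5 * np.sum((O[-1] - y) ** 2))   # loss; should become small after training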

8.4. Implementation

The theory of ANNs is not difficult to understand. However, the implementation of a shallow neural network, especially the implementation of the forward and backward passes, can be tedious and susceptible to mistakes even with the details offered in the previous section. What is particularly troublesome is the treatment of the dimensions or sizes of the data, in the form of arrays or tensors, as the data moves through the network layer by layer. This can be viewed as a flow of tensors, as coined in "TensorFlow".
This section will provide enough information for you to quickly put together your own code for implementing ANNs. For this purpose, we will first get familiar with some skills and concepts that have been widely used for training ANNs. Next, a typical procedure illustrated by pseudo-code will be provided to show the detailed steps for ANN training. Finally, based on one example, more details will be presented about the tensor flow, especially how data changes in shape as it moves through a typical network.

8.4.1. Practical Skills

We will need to understand how data moves through the neural network, as well as several critical skills and concepts, before we can implement the training of ANNs and understand the relevant literature well. The input, $\vec{x}$, is the data without labels, while the output, $\vec{y}$, is the corresponding label. Each time, one or multiple samples are fed into the network in the format of an array. This data array moves forward through the network to produce a prediction that can be compared with the corresponding label. The difference between the predicted and actual labels is propagated backward in terms of gradients to adjust the weights. The new weights are used in the forward pass of the next iteration. Therefore, the model, represented by the weights of the network, can be updated in every forward-backward cycle. Each such cycle is called an iteration of the training process.
The real implementation can be more complicated due to the use of many extra skills and terms. The first is the use of batches, epochs, and the related optimizers. If one sample is used for each (gradient descent) iteration, then the training method (optimizer) is called stochastic gradient descent. Here, "stochastic" reflects the fact that updating the model after processing each sample usually leads to high randomness/stochasticity due to the variability in the data points (samples). In this case, each update takes a relatively short time, but many iterations may be needed to identify a good direction of optimization. The other extreme is to process the whole training set in each iteration, making the updates on the model in the backward pass with the gradients of all the samples. This is called batch gradient descent.
More frequently, it is not possible or efficient to process the whole dataset within one iteration. Thus, we frequently break the training dataset into multiple "mini batches" of samples. Then, updates on the model are made in each iteration for one mini batch. This is called mini-batch gradient descent, which is also loosely called batch gradient descent in many places. We can calculate the loss or gradient for the samples within one mini batch one by one before adding the losses or gradients up to update the model. Alternatively, we can treat all the samples together as one array to obtain the gradients for the model update. The use of mini batches also prompts us to use another concept, the "epoch". One epoch means that the whole dataset, or a number of samples equal to the dataset size, has been processed once, depending on the sampling method.
The selection of the batch size and the epoch number (or iteration number) needs to consider the available resources and goals, and may require trial runs in some cases. As the batch size increases, the use of memory and the consequent calculation time for each batch increase dramatically. Compared with the stochastic and mini-batch methods, the batch gradient descent method is more susceptible to the traps of local optima. This is because the randomness associated with small batch sizes adds noise to the training process, which helps the training get out of local optima. But such randomness may also make it hard to find a good direction within an affordable amount of time.
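To make these terms concrete, the sketch below shuffles a training set once per epoch and yields mini batches of a chosen size; the array shapes and batch size are illustrative.

import numpy as np

def minibatches(X, y, batch_size, rng):
    # One full pass through these batches is one epoch
    idx = rng.permutation(len(X))   # reshuffle so batches differ across epochs
    for start in range(0, len(X), batch_size):
        take = idx[start:start + batch_size]
        yield X[take], y[take]

rng = np.random.default_rng(0)
X, y = rng.random((506, 13)), rng.random((506, 1))  # illustrative data shapes
for epoch in range(3):
    for Xb, yb in minibatches(X, y, batch_size=32, rng=rng):
        pass  # one iteration: forward pass, backward pass, weight update
# batch_size=1 corresponds to stochastic gradient descent;
# batch_size=len(X) corresponds to (full) batch gradient descent.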

8.4.2. Procedure for an Example

To illustrate the implementation of a neural network simply, we will start with a shallow neural network. Let us use the Boston House Prices (BHP) dataset from Scikit-learn as the example data. The dataset consists of 506 samples; each sample has 13 numbers (attribute values) as the input data and 1 number as the label/output. The 13 numbers of each sample will be fed as input to the network, processed layer by layer, and finally output as the prediction for that sample. The difference between the prediction and the label is the error, which needs to be minimized. To avoid confusion and facilitate coding, we will not count the input layer as one layer, because its input and output are identical. Accordingly, the first hidden layer will be viewed as the first layer. The following is the pseudo-code for implementing the neural network. The BHP dataset and a 3-layer (excluding the input layer) network are used to mark the sizes of the arrays in the training process to better present the implementation details.

Simple Neural Network Training

Prepare data, e.g., get $\bar{X}$ and $\vec{y}$ from the BHP dataset.
Define the network architecture, e.g., use a list $[4, 8, 1]$ to represent a net with 3 layers, which have 4, 8, and 1 neurons, respectively. The number of neurons in the last layer is determined by the size of the label for each sample, i.e., 1 for the BHP data.
Set up the hyperparameters, e.g., $\eta$, and the variables to be stored during training, e.g., $\bar{w}$, $\vec{\theta}$, $\vec{O}$, the loss, etc.
Initialize the network, e.g., by filling the weight arrays $\bar{w}$ and $\vec{\theta}$ with random numbers.
Repeat for a certain number of iterations or until a certain rule is satisfied (e.g., loss improvements are smaller than a tolerance):
Repeat for every sample in a batch.
Repeat for every layer in the network in the forward pass using the equation $\vec{O}^{n}=f\left(\bar{w}^{n} @ \vec{O}^{n-1}-\vec{\theta}^{n}\right)$ (Eqs. 8.10 and 8.11).
Repeat for every layer in the network in the backward pass using the equations $\vec{g}^{n}=\vec{g}^{n+1} @ \bar{w}^{n+1} \odot\left[\vec{O}^{n} \odot\left(1-\vec{O}^{n}\right)\right]$, $\Delta \bar{w}^{n}=-\eta \vec{g}^{n} \otimes \vec{O}^{n-1}$, and $\Delta \vec{\theta}^{n}=\eta \vec{g}^{n}$. Here, the weight increments are accumulated over the whole batch, and the updated weights will not be used until the next batch.
Calculate the loss.
The following code shows the training of a 3-layer feedforward NN following the above procedure. Please note that only the key steps in the forward and backward passes are given.
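The loop below assumes that the data and parameters have already been prepared. A minimal setup sketch is given first; note that `load_boston` has been removed from recent Scikit-learn releases, so the random stand-in arrays below (matching the BHP shapes), the train/test split, the hidden size of 8, and the hyperparameter values are all illustrative assumptions rather than the chapter's prescriptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((506, 13))        # stand-in for the 13 BHP attributes, scaled to [0, 1]
y = rng.random((506, 1))         # stand-in for the scaled house prices
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

J, L, K = 13, 8, 1               # input, hidden, and output layer sizes
V = rng.standard_normal((L, J))  # weights: input -> hidden
Gamma = np.zeros((L, 1))         # thresholds of the hidden layer
W = rng.standard_normal((K, L))  # weights: hidden -> output
Theta = np.zeros((K, 1))         # thresholds of the output layer
eta, tol, max_iter, Iteration = 0.5, 1e-2, 10000, 0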
# Main loop for training
while True:
    # Forward pass (Eqs. 8.8-8.11) for all training samples at once
    Alpha = V @ X_train.T                       # (L, n): inputs of the hidden layer
    B = 1/(1 + np.exp(-(Alpha - Gamma)))        # (L, n): outputs of the hidden layer
    Beta = W @ B                                # (K, n): inputs of the output layer
    Y_hat = 1/(1 + np.exp(-(Beta - Theta)))     # (K, n): predictions
    y_hat = Y_hat.T
    Ek = 0.5*np.sum((y_hat - y_train)**2, axis=1)   # per-sample loss (Eq. 8.12)
    # Backward pass; note G is the negative of g in Eq. 8.15
    G = (y_train.T - Y_hat)*Y_hat*(1 - Y_hat)
    Delta_W = eta * G @ B.T / y_train.shape[0]               # Eq. 8.16, batch-averaged
    Delta_Theta = -np.sum(eta*G, axis=1) / y_train.shape[0]  # Eq. 8.17
    Delta_Theta = Delta_Theta.reshape(Delta_Theta.shape[0], 1)
    E = (W.T @ G) * B * (1 - B)                 # the negative of e in Eq. 8.19
    Delta_V = eta * E @ X_train / y_train.shape[0]              # Eq. 8.20
    Delta_Gamma = -eta * np.sum(E, axis=1) / y_train.shape[0]   # Eq. 8.21
    Delta_Gamma = Delta_Gamma.reshape(Delta_Gamma.shape[0], 1)  # (L,) -> (L, 1)
    # Improve with the learning rate and the overall descent direction
    W = W + Delta_W
    Theta = Theta + Delta_Theta
    V = V + Delta_V
    Gamma = Gamma + Delta_Gamma
    error = np.sum(Ek)/np.sum(y_train**2)
    Iteration = Iteration + 1
    print("Iteration:", Iteration, "Error is", error)
    if error < tol or Iteration >= max_iter:
        break

# Prediction on the test set
Alpha = V @ X_test.T
B = 1/(1 + np.exp(-(Alpha - Gamma)))
Beta = W @ B
Y_hat_test = 1/(1 + np.exp(-(Beta - Theta)))
y_hat_test = Y_hat_test.T
Ek_test = 0.5*np.sum((y_hat_test - y_test)**2)
y_pred = y_hat_test
print('The loss (score) of testing is', Ek_test)

8.4.3. *Shape and Arrangement of Arrays for Data

To better understand what the data, i.e., the different arrays for the input, weights, thresholds, and output, look like, we first write out these arrays in the way used in the other chapters of this book.
\begin{align*}
\bar{X} &= \left[\vec{x}_{1}, \vec{x}_{2}, \cdots, \vec{x}_{I}\right]^{T}=\left[\begin{array}{cccc}
x_{11} & x_{12} & \cdots & x_{1 J} \\
\vdots & & \ddots & \vdots \\
x_{I 1} & x_{I 2} & \cdots & x_{I J}
\end{array}\right]_{I \times J} \tag{8.31}\\
\bar{V} &= \left[\vec{v}_{1}, \vec{v}_{2}, \cdots, \vec{v}_{L}\right]=\left[\begin{array}{cccc}
v_{11} & v_{12} & \cdots & v_{1 L} \\
\vdots & & \ddots & \vdots \\
v_{J 1} & v_{J 2} & \cdots & v_{J L}
\end{array}\right]_{J \times L} \tag{8.32}\\
\vec{\gamma} &= \left[\gamma_{1}, \gamma_{2}, \cdots, \gamma_{L}\right]^{T} \tag{8.33}
\end{align*}