
9.2. Convolutional Neural Network

A convolutional neural network (CNN or ConvNet) is a type of ANN that is usually discussed in the context of deep learning for analyzing visual imagery. It is particularly good at identifying patterns between neighboring points in a sample, such as the spatial patterns formed by image pixels close to each other. A modern CNN typically contains one or more of a few basic elements: convolution layers, activation functions like ReLU or its variants, pooling layers, and fully connected layers. In addition, Softmax or Sigmoid is needed for classification tasks.
Understanding how a CNN works requires us to figure out how these different elements are handled in backpropagation (including the forward and backward passes as defined in the chapter on NNs) in both training and testing. Considering that training involves both passes while testing only involves the forward pass, we will show how the two passes are made in a typical training process for convolution layers, ReLU, and pooling layers in the following subsections. The treatment of fully connected layers is the same as that introduced for shallow NNs. This introduction thus presents explanations of the working mechanisms of these elements as well as the equations and procedures needed to implement the training of a CNN consisting of them.

9.2.1. Convolution

The implementation of backpropagation, including both forward and backward passes, in a deep NN with exclusively convolution layers is outlined in Fig. 9.2. As illustrated, the forward pass starts from the first layer, where the input data for training, $I$, enters and moves along the blue arrows via the blue equations until the predicted output, $O$, is generated as the output of the last layer (i.e., Layer $N$). The difference between the prediction, $O$, and the labeled output, $O_{\text{true}}$, i.e., the loss $L$, is used to calculate the gradients $\frac{\partial L}{\partial O^{N}}$. Then, the gradients propagate backward through the whole network until reaching the first layer.
The main operation in the forward pass, i.e., convolution, is $I * F = O$. As a result, in the backward pass, the key steps are the calculation of $\frac{\partial L}{\partial F}$ and $\frac{\partial L}{\partial I}$ and the update of the convolution kernel $F$. To move backward from one layer to the previous layer, the calculated value of $\frac{\partial L}{\partial I}$ for the current layer, i.e., $\frac{\partial L}{\partial I^{n}}$, is used as $\frac{\partial L}{\partial O}$ of the previous layer, i.e., $\frac{\partial L}{\partial O^{n-1}}$, because the input of the current layer, $I^{n}$, is the output of the previous layer, $O^{n-1}$.
The key to implementing the above two-way process is the convolution operation, i.e., the equation with the $*$ operator, in the forward pass and the two equations for calculating $\frac{\partial L}{\partial F}$ and $\frac{\partial L}{\partial I}$ in the backward pass, which were given in Fig. 9.2 without explanation. In the following subsections, the derivation of these equations is presented as we explain how convolution and backpropagation work through the convolution layers. Note that this figure only contains convolution layers; more operations or other modifications are needed if other CNN elements are included. In addition, the operations in this set of equations for using and updating $F$ need to be performed for each convolution kernel. Usually, there is more than one kernel in each convolution layer, so such operations need to be carried out using the same equations but with a different $F$ for each kernel.
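To make the workflow in Fig. 9.2 concrete, the following is a minimal NumPy sketch (not a reference implementation) of one training iteration through a tiny network with only convolution layers. It assumes two layers with a single 2 × 2 kernel each, a stride of 1, no padding, and a squared-error loss; the helper names `conv_valid` and `conv_layer_backward` are illustrative, and the two gradient formulas they use are derived in the following subsections.

```python
import numpy as np

def conv_valid(A, B):
    """Sliding dot product of kernel B over array A (stride 1, no padding)."""
    out = np.zeros((A.shape[0] - B.shape[0] + 1, A.shape[1] - B.shape[1] + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(A[r:r + B.shape[0], c:c + B.shape[1]] * B)
    return out

def conv_layer_backward(I, F, dO):
    """Gradients of one convolution layer (derived later in this section):
    dL/dF = conv(I, dL/dO); dL/dI = full conv of rot180(F) with dL/dO."""
    dF = conv_valid(I, dO)
    pad = F.shape[0] - 1
    dI = conv_valid(np.pad(dO, pad), np.rot90(F, 2))
    return dF, dI

# One backpropagation iteration through a 2-layer, all-convolution network,
# mirroring Fig. 9.2 (single kernel per layer, stride 1, no padding, L2 loss).
rng = np.random.default_rng(0)
I1 = rng.normal(size=(5, 5))                      # training input
filters = [rng.normal(size=(2, 2)), rng.normal(size=(2, 2))]
O_true = np.zeros((3, 3))                         # labeled output (illustrative)
lr = 0.01

inputs, X = [], I1
for F in filters:                                 # forward pass, layer by layer
    inputs.append(X)
    X = conv_valid(X, F)                          # O^n = I^n * F^n
dO = X - O_true                                   # dL/dO^N for L = 0.5*||O - O_true||^2
for n in reversed(range(len(filters))):           # backward pass
    dF, dI = conv_layer_backward(inputs[n], filters[n], dO)
    filters[n] -= lr * dF                         # update the kernel
    dO = dI                                       # dL/dI^n becomes dL/dO^(n-1)
print(dO.shape)                                   # same size as the input I1
```

Note how $\frac{\partial L}{\partial I}$ of each layer is handed backward as $\frac{\partial L}{\partial O}$ of the previous layer, exactly as described above.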
Figure 9.2: Backpropagation through convolution layers

Forward Pass

Before encountering "convolution" in CNNs, you have possibly heard about convolution in the context of general mathematics. It usually appears as a mathematical operation on two functions, e.g., $i$ and $f$, generating a third function $o$:
\begin{equation*} o=(i * f)(t)=\int_{-\infty}^{\infty} i(\tau) f(t-\tau) d \tau \tag{9.1} \end{equation*}
or equivalently formulated as,
\begin{equation*} o=(i * f)(t)=\int_{-\infty}^{\infty} i(t-\tau) f(\tau) d \tau \tag{9.2} \end{equation*}
where $t$ is the axis (or independent variable) along which the three functions are defined. The operation can be understood as the area under the function $i(\tau)$ weighted by the function $f(-\tau)$ shifted by $t$.
The convolution in CNNs is applied to data with a finite size; thus, the above equation can be slightly modified into
\begin{equation*} o=(i * f)(t)=\int_{0}^{t} i(\tau) f(t-\tau) d \tau \tag{9.3} \end{equation*}
where $t$ is still the axis along which the three functions are defined, and the integration bounds reflect the finite size of the data.
In addition, image data, such as a photo, is usually represented as an array of numbers, in which each element or entry corresponds to a pixel in the photo. Thus, the convolution operation needs to take a discrete form so that it can be applied to discrete image data.
\begin{equation*} O[n]=(I * F)[n]=\sum_{m=1}^{n} I[m] \cdot F[n-m] \tag{9.4} \end{equation*}
The above equation states that element $n$ of the generated convolution result, $O$, is obtained as a weighted sum of the (image) data, $I$, with the convolution kernel, $F$, providing the weights. The operation needs to be repeated $N$ times, where $N$ is the number of elements of the generated result along the direction of the operation (the shifting direction).
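As a quick check of Eq. 9.4, the sketch below evaluates the discrete convolution of two short 1D arrays directly from the weighted-sum definition (written with 0-based indexing) and compares the result with NumPy's built-in `np.convolve`; the signal and kernel values are made up purely for illustration.

```python
import numpy as np

# Hypothetical 1D signal and kernel, for illustration only.
i = np.array([1.0, 2.0, 3.0, 4.0])   # input I
f = np.array([0.5, 1.0, 0.25])       # kernel F

# Direct implementation of the weighted sum in Eq. 9.4 (0-based indexing):
# O[n] = sum_m I[m] * F[n - m], keeping only valid kernel indices.
n_out = len(i) + len(f) - 1
o = np.zeros(n_out)
for n in range(n_out):
    for m in range(len(i)):
        if 0 <= n - m < len(f):
            o[n] += i[m] * f[n - m]

# NumPy's built-in discrete convolution gives the same result.
assert np.allclose(o, np.convolve(i, f))
print(o)
```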
The schematic in Fig. 9.3 shows what happens when a convolution kernel is applied to an image. As illustrated, each convolution kernel from a convolution layer, also called a learnable filter or a neuron, is slid over the image. This starts with the red box in the upper left corner, which has the same number of elements as the kernel and can be called a receptive field. The application of the kernel to this receptive field mathematically corresponds to the computation of a dot product between the entries/elements of the filter and the array of the receptive field. The sum from this regional operation generates the $(1,1)$ element of the output data, leading to a new element $(1,1)$ ("4" in the red cell) in the result matrix. A stride, which acts like a step size, is prescribed to move to the next receptive field. If a stride of 1 is used, as illustrated in the figure, then the receptive field to the right of the red box region is the green box region. The same regional operation leads to the $(1,2)$ element ("3" in the green cell) of the output. The same process is repeated as the kernel moves over all the receptive fields.
To better show how different elements in the input and kernel are processed in the convolution operation, let us look at the example in Fig. 9.4. This example has a very simple $3 \times 3$ array $I$ as input, i.e., the data to which convolution is applied, and a $2 \times 2$ kernel. This example will also be employed in the following subsections to derive the other equations for backpropagation.
Figure 9.3: Schematic of the application of a kernel to a 2D image (2D array)
Figure 9.4: An example for convolution operation
In the above example, the following equations can be used to get the output.
\begin{align*} & O_{11}=I_{11} F_{11}+I_{12} F_{12}+I_{21} F_{21}+I_{22} F_{22} \tag{9.5a}\\ & O_{12}=I_{12} F_{11}+I_{13} F_{12}+I_{22} F_{21}+I_{23} F_{22} \tag{9.5b}\\ & O_{21}=I_{21} F_{11}+I_{22} F_{12}+I_{31} F_{21}+I_{32} F_{22} \tag{9.5c}\\ & O_{22}=I_{22} F_{11}+I_{23} F_{12}+I_{32} F_{21}+I_{33} F_{22} \tag{9.5d} \end{align*}
The above equations can be extended for images and kernels of any size to implement the forward pass.
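For example, a minimal NumPy sketch of this forward pass, assuming a single kernel, a stride of 1, and no padding, could look as follows; the input and kernel values are illustrative and are not the numbers shown in Fig. 9.4. Note that, as in Eq. 9.5, the kernel is applied as a sliding dot product (what the signal-processing literature calls cross-correlation), which is what CNN libraries implement as "convolution".

```python
import numpy as np

def conv2d_forward(I, F, stride=1):
    """Sliding dot product of kernel F over input I (no padding),
    i.e., the CNN 'convolution' written out in Eq. 9.5."""
    n_h, n_w = I.shape
    k_h, k_w = F.shape
    o_h = (n_h - k_h) // stride + 1
    o_w = (n_w - k_w) // stride + 1
    O = np.zeros((o_h, o_w))
    for r in range(o_h):
        for c in range(o_w):
            patch = I[r * stride:r * stride + k_h, c * stride:c * stride + k_w]
            O[r, c] = np.sum(patch * F)   # dot product with the receptive field
    return O

# Illustrative values (not the actual numbers of Fig. 9.4).
I = np.arange(1.0, 10.0).reshape(3, 3)   # 3x3 input
F = np.array([[1.0, 0.0], [0.0, -1.0]])  # 2x2 kernel
print(conv2d_forward(I, F))              # 2x2 output, as in Eq. 9.5
```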

Backward Pass

As illustrated in Fig. 9.2, both $\frac{\partial L}{\partial F}$ and $\frac{\partial L}{\partial I}$ need to be calculated for each layer, while $\frac{\partial L}{\partial O}$ comes from the next layer. That is, $\frac{\partial L}{\partial O}$ is obtained as the $\frac{\partial L}{\partial I}$ of the following layer. $\frac{\partial L}{\partial F}$ is obtained to update the filter, while $\frac{\partial L}{\partial I}$ of the current layer is obtained to serve as the $\frac{\partial L}{\partial O}$ of the previous layer.
The following chain rule needs to be recalled first to start deriving $\frac{\partial L}{\partial F}$ and $\frac{\partial L}{\partial I}$:
\begin{align*} \frac{\partial L}{\partial F} & =\frac{\partial L}{\partial O} \cdot \frac{\partial O}{\partial F} \tag{9.6a}\\ \frac{\partial L}{\partial I} & =\frac{\partial L}{\partial O} \cdot \frac{\partial O}{\partial I} \tag{9.6b} \end{align*}
where $\frac{\partial L}{\partial O}$ can be obtained by passing the gradients layer by layer all the way back from the last layer. Thus, the above two types of gradients can be obtained as long as $\frac{\partial O}{\partial F}$ and $\frac{\partial O}{\partial I}$ can be obtained. Next, let us take a look at these two types of gradients.
As for Eq. 9.6a, in order to see what $\frac{\partial O}{\partial F}$ looks like, let us use $O_{11}$ first:
\begin{equation*} O_{11}=I_{11} F_{11}+I_{12} F_{12}+I_{21} F_{21}+I_{22} F_{22} \tag{9.7} \end{equation*}
Then we will have $\frac{\partial O_{11}}{\partial F_{11}}=I_{11}$, $\frac{\partial O_{11}}{\partial F_{12}}=I_{12}$, $\frac{\partial O_{11}}{\partial F_{21}}=I_{21}$, and $\frac{\partial O_{11}}{\partial F_{22}}=I_{22}$. Similarly, we can get the local gradients for $O_{12}$, $O_{21}$, and $O_{22}$. Using these gradients, we can get the detailed equations for each element of $\frac{\partial L}{\partial F}$. Let us check two typical elements:
\begin{align*} \frac{\partial L}{\partial F_{11}} & =\frac{\partial L}{\partial O_{11}} \cdot I_{11}+\frac{\partial L}{\partial O_{12}} \cdot I_{12}+\frac{\partial L}{\partial O_{21}} \cdot I_{21}+\frac{\partial L}{\partial O_{22}} \cdot I_{22} \tag{9.8a}\\ \frac{\partial L}{\partial F_{12}} & =\frac{\partial L}{\partial O_{11}} \cdot I_{12}+\frac{\partial L}{\partial O_{12}} \cdot I_{13}+\frac{\partial L}{\partial O_{21}} \cdot I_{22}+\frac{\partial L}{\partial O_{22}} \cdot I_{23} \tag{9.8b} \end{align*}
We can get the equations for all the elements in a similar way. It is not difficult to see that the above equations can be summarized as the operation in Fig. 9.5:
\begin{equation*} \frac{\partial L}{\partial F}=\begin{bmatrix} \frac{\partial L}{\partial F_{11}} & \frac{\partial L}{\partial F_{12}} \\ \frac{\partial L}{\partial F_{21}} & \frac{\partial L}{\partial F_{22}} \end{bmatrix}=\text{Convolution}\left(\begin{bmatrix} I_{11} & I_{12} & I_{13} \\ I_{21} & I_{22} & I_{23} \\ I_{31} & I_{32} & I_{33} \end{bmatrix}, \begin{bmatrix} \frac{\partial L}{\partial O_{11}} & \frac{\partial L}{\partial O_{12}} \\ \frac{\partial L}{\partial O_{21}} & \frac{\partial L}{\partial O_{22}} \end{bmatrix}\right)=I * \frac{\partial L}{\partial O} \end{equation*}
Figure 9.5: Operation for obtaining $\frac{\partial L}{\partial F}$
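A hedged sketch of this operation, assuming a 3 × 3 input, a 2 × 2 kernel, a stride of 1, and no padding, is given below: the gradient with respect to the kernel is obtained by convolving the stored input $I$ with the upstream gradient $\frac{\partial L}{\partial O}$, as in Eq. 9.8. The numerical values are illustrative only.

```python
import numpy as np

def conv2d_valid(A, B):
    """Valid convolution (sliding dot product) of kernel B over array A."""
    a_h, a_w = A.shape
    b_h, b_w = B.shape
    out = np.zeros((a_h - b_h + 1, a_w - b_w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(A[r:r + b_h, c:c + b_w] * B)
    return out

# dL/dF = Convolution(I, dL/dO), as in Eq. 9.8 / Fig. 9.5 (stride 1, no padding).
I  = np.arange(1.0, 10.0).reshape(3, 3)       # 3x3 input, illustrative values
dO = np.array([[0.1, -0.2], [0.3, 0.4]])      # upstream gradient dL/dO, 2x2
dF = conv2d_valid(I, dO)                      # 2x2 gradient w.r.t. the kernel
print(dF)
```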
As for Eq. 9.6b, to get $\frac{\partial L}{\partial I}$, let us see what $\frac{\partial O}{\partial I}$ looks like: $\frac{\partial O_{11}}{\partial I_{11}}=F_{11}$, $\frac{\partial O_{11}}{\partial I_{12}}=F_{12}$, $\frac{\partial O_{11}}{\partial I_{21}}=F_{21}$, and $\frac{\partial O_{11}}{\partial I_{22}}=F_{22}$. Similarly, we can get the local gradients for $O_{12}$, $O_{21}$, and $O_{22}$. Then, let us write out the detailed formulations of the elements of $\frac{\partial L}{\partial I}$ based on the above equations for the elements of $\frac{\partial O}{\partial I}$:
\begin{align*} & \frac{\partial L}{\partial I_{11}}=\frac{\partial L}{\partial O_{11}} \cdot F_{11} \tag{9.9a}\\ & \frac{\partial L}{\partial I_{12}}=\frac{\partial L}{\partial O_{11}} \cdot F_{12}+\frac{\partial L}{\partial O_{12}} \cdot F_{11} \tag{9.9b}\\ & \frac{\partial L}{\partial I_{13}}=\frac{\partial L}{\partial O_{12}} \cdot F_{12} \tag{9.9c}\\ & \frac{\partial L}{\partial I_{21}}=\frac{\partial L}{\partial O_{11}} \cdot F_{21}+\frac{\partial L}{\partial O_{21}} \cdot F_{11} \tag{9.9d}\\ & \frac{\partial L}{\partial I_{22}}=\frac{\partial L}{\partial O_{11}} \cdot F_{22}+\frac{\partial L}{\partial O_{12}} \cdot F_{21}+\frac{\partial L}{\partial O_{21}} \cdot F_{12}+\frac{\partial L}{\partial O_{22}} \cdot F_{11} \tag{9.9e}\\ & \frac{\partial L}{\partial I_{23}}=\frac{\partial L}{\partial O_{12}} \cdot F_{22}+\frac{\partial L}{\partial O_{22}} \cdot F_{12} \tag{9.9f}\\ & \frac{\partial L}{\partial I_{31}}=\frac{\partial L}{\partial O_{21}} \cdot F_{21} \tag{9.9g}\\ & \frac{\partial L}{\partial I_{32}}=\frac{\partial L}{\partial O_{21}} \cdot F_{22}+\frac{\partial L}{\partial O_{22}} \cdot F_{21} \tag{9.9h}\\ & \frac{\partial L}{\partial I_{33}}=\frac{\partial L}{\partial O_{22}} \cdot F_{22} \tag{9.9i} \end{align*}
The mathematical operation performed via the above equations can be summarized into the algebraic operation in Fig. 9.6:
In the above equation, $F^{180^{\circ}}$ is obtained by rotating the array $F$ by 180 degrees. Taking a 2D array as an example, if there are $k_{h}$ rows (height) and $k_{w}$ columns (width), then the element $F_{ij}$ in the new array, $F^{180^{\circ}}$, is the element $F_{(k_{h}+1-i),(k_{w}+1-j)}$ of the original array, $F$. The "full convolution" operator, represented by $\star$ in this book, is different from the regular convolution, which is also called "valid convolution" in many places.
\begin{equation*} \frac{\partial L}{\partial I}=\begin{bmatrix} \frac{\partial L}{\partial I_{11}} & \frac{\partial L}{\partial I_{12}} & \frac{\partial L}{\partial I_{13}} \\ \frac{\partial L}{\partial I_{21}} & \frac{\partial L}{\partial I_{22}} & \frac{\partial L}{\partial I_{23}} \\ \frac{\partial L}{\partial I_{31}} & \frac{\partial L}{\partial I_{32}} & \frac{\partial L}{\partial I_{33}} \end{bmatrix}=\text{Full Convolution}\left(\begin{bmatrix} F_{22} & F_{21} \\ F_{12} & F_{11} \end{bmatrix}, \begin{bmatrix} \frac{\partial L}{\partial O_{11}} & \frac{\partial L}{\partial O_{12}} \\ \frac{\partial L}{\partial O_{21}} & \frac{\partial L}{\partial O_{22}} \end{bmatrix}\right)=F^{180^{\circ}} \star \frac{\partial L}{\partial O} \end{equation*}
Figure 9.6: Operation for obtaining $\frac{\partial L}{\partial I}$
The full convolution operation can be understood using the example given in Fig. 9.7. As can be seen, the array behind the operator (in yellow) is moved over the array before the operator (blue), starting from the upper left corner and proceeding from left to right and from top to bottom. The weighted sum over the overlapping area is where the local convolution occurs and generates the corresponding element of the operation result. For example, the bottom-right element of the yellow array works as the weight for the top-left element (or a patch or region if a stride greater than 1 is used) of the blue array to generate the $(1,1)$ element of the generated array. Then, the yellow array is moved one element (or patch) to the right, and the weighted sum becomes the $(1,2)$ element. The above process is repeated until all the elements (patches) of the new array are generated.
Figure 9.7: Example of full convolution operation
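One common way to implement the full convolution of Fig. 9.6 is to zero-pad $\frac{\partial L}{\partial O}$ by $k-1$ on every side and then slide the 180°-rotated kernel over it as a regular (valid) convolution. The sketch below follows that route for the 2 × 2 kernel of the running example, assuming a stride of 1 and no padding in the forward pass; the values are illustrative, and the result can be checked element by element against Eq. 9.9.

```python
import numpy as np

def full_conv_rot180(F, dO):
    """dL/dI = rot180(F) 'full-convolved' with dL/dO (Fig. 9.6), implemented by
    zero-padding dL/dO by (k-1) on every side and sliding the rotated kernel."""
    k_h, k_w = F.shape
    F_rot = np.rot90(F, 2)                       # rotate kernel by 180 degrees
    dO_pad = np.pad(dO, ((k_h - 1, k_h - 1), (k_w - 1, k_w - 1)))
    i_h = dO.shape[0] + k_h - 1
    i_w = dO.shape[1] + k_w - 1
    dI = np.zeros((i_h, i_w))
    for r in range(i_h):
        for c in range(i_w):
            dI[r, c] = np.sum(dO_pad[r:r + k_h, c:c + k_w] * F_rot)
    return dI

F  = np.array([[1.0, 2.0], [3.0, 4.0]])      # 2x2 kernel, illustrative values
dO = np.array([[0.1, -0.2], [0.3, 0.4]])     # upstream gradient dL/dO, 2x2
print(full_conv_rot180(F, dO))               # 3x3 gradient dL/dI, as in Eq. 9.9
```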

Padding and Stride

The above operations in backpropagation can be much more complicated in implementation. This is because the above illustrations were made for a very simple case: a padding of 0 and a stride of 1. However, in modern CNN practice, the use of positive padding values and strides greater than 1 is very common. When working on the implementation, such as coding a CNN ourselves, we need to pay close attention to the sizes of the different matrices and how these sizes change as we apply different operations, including convolution with a padding and a stride. Padding is used to add extra space (elements filled with 0's) so that an operation such as convolution can generate an array of the desired size. Usually, positive paddings, either identical or different, can be added along different directions, i.e., height and width, to increase the size of the generated images. In common situations, the same number of padding elements is added to both ends in each direction. A stride greater than 1 is used to reduce the overlap between neighboring receptive fields.
Let us assume $I$ has a size of $n_{h} \times n_{w}$ and $F$ has a size of $k_{h} \times k_{w}$. In the forward pass, if we apply a convolution operation between these two arrays with paddings of $p_{h} \times p_{w}$ and strides of $s_{h} \times s_{w}$, then the size of the generated $O$ will be $\left(n_{h}-k_{h}+p_{h}+s_{h}\right) / s_{h} \times\left(n_{w}-k_{w}+p_{w}+s_{w}\right) / s_{w}$. Accordingly, in the backward pass, we will apply a full convolution between $F$ and $\frac{\partial L}{\partial O}$. In this case, $\frac{\partial L}{\partial O}$ has a size of $\left(n_{h}-k_{h}+p_{h}+s_{h}\right) / s_{h} \times\left(n_{w}-k_{w}+p_{w}+s_{w}\right) / s_{w}$, $F$ has the same size as before, i.e., $k_{h} \times k_{w}$, and we will obtain $\frac{\partial L}{\partial I}$ with a size of $n_{h} \times n_{w}$, which is the same as that of $I$. Therefore, from the perspective of sizes, the above processes need to be "reversible". That is, the gradients returned in the backward pass should have the same size as the arrays to which they correspond (or are applied for the update). For this purpose,
a general rule needs to be maintained: the same padding and stride values need to be used in both passes. Usually, we do not change the padding and stride values between backpropagation iterations, so these values stay the same throughout the training process. Even in testing, the trained network will still stick to the same paddings and strides if it is applied to data of the same size.
The use of paddings does not change the operations introduced in the previous subsection too much. We just need to make sure the same padding(s) is adopted and pay attention to the sizes of the different arrays according to the equations given above. The treatment of strides requires more care. This is because, in the forward pass, the input array, $I$, is "downsampled" when a stride greater than 1 is used. That is, the output array $O$ is smaller than that obtained with a stride of 1. For this reason, we need to upsample the gradients with respect to $O$, i.e., $\frac{\partial L}{\partial O}$, before using them to calculate $\frac{\partial L}{\partial F}$ and $\frac{\partial L}{\partial I}$. This upsampling in the backward pass needs to be done with the same stride values as adopted in the forward pass for the same layer. For example, for a stride of $s$, the upsampling needs to insert $s-1$ zeros between neighboring elements of the array to be upsampled.
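The sketch below illustrates the size bookkeeping and the stride-$s$ upsampling described above. Here $p$ is taken as the total number of padded elements along an axis, following the formula in the text, and the example values are arbitrary.

```python
import numpy as np

def conv_output_size(n, k, p, s):
    """Output length along one axis: floor((n - k + p + s) / s),
    with p the total padding added along that axis (as in the text)."""
    return (n - k + p + s) // s

def upsample_grad(dO, s_h, s_w):
    """Insert (s - 1) zeros between neighboring gradient elements so that the
    strided forward pass can be 'reversed' in the backward pass."""
    g_h, g_w = dO.shape
    up = np.zeros(((g_h - 1) * s_h + 1, (g_w - 1) * s_w + 1))
    up[::s_h, ::s_w] = dO
    return up

print(conv_output_size(n=32, k=3, p=2, s=1))   # 32: size preserved by the padding
dO = np.array([[1.0, 2.0], [3.0, 4.0]])
print(upsample_grad(dO, 2, 2))                 # 3x3 array with zeros inserted
```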

9.2.2. ReLU

In traditional ANNs, most elements are differentiable. That is why, in the ANN chapter, continuous functions could be obtained for the derivatives (gradients) in backpropagation. By contrast, CNNs add special elements such as ReLU, which is not differentiable at some points, and pooling, which changes the size of the data. The special treatment needed for these two CNN elements is introduced in this and the following subsection.
The mathematical function of ReLU can be written as
\begin{equation*} \operatorname{ReLU}(x)= \begin{cases}x, & x>0 \\ 0, & x \leqslant 0\end{cases} \tag{9.10} \end{equation*}
or equivalently as,
\begin{equation*} f(x)=\max (0, x) \tag{9.11} \end{equation*}
This function is not differentiable at $x=0$. To deal with this issue, we can set the derivative at $x=0$ to either 0 or 1. The choice will not affect the gradient used for updating the network. For example, the update of the weight in a typical example is $w=w+\alpha \cdot(\hat{y}-y) \cdot \delta_{\operatorname{ReLU}} \cdot \frac{\partial(w \cdot x-\theta)}{\partial w}$, where $\frac{\partial(w \cdot x-\theta)}{\partial w}=x$. As can be seen, at $x=0$, the increment (the second term on the right-hand side of the equation) equals 0 regardless of whether the derivative of ReLU, $\delta_{\operatorname{ReLU}}$, is set to 0 or 1.
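A minimal sketch of the forward and backward behavior of ReLU, with the derivative at $x=0$ set to 0 as discussed above, might look like this:

```python
import numpy as np

def relu_forward(x):
    """ReLU(x) = max(0, x), Eq. 9.11."""
    return np.maximum(0.0, x)

def relu_backward(x, dL_dout):
    """Pass the upstream gradient through where x > 0; use 0 at x <= 0
    (the choice of 0 or 1 exactly at x = 0 does not change the update,
    as discussed above)."""
    return dL_dout * (x > 0).astype(x.dtype)

x = np.array([-1.5, 0.0, 2.0])
print(relu_forward(x))                 # [0. 0. 2.]
print(relu_backward(x, np.ones(3)))    # [0. 0. 1.]
```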
Other activation functions such as Linear, Linear threshold, Maxout, and Softmax will not be discussed here. More information can be found in more specific deep learning literature.

9.2.3. Pooling

As mentioned above, pooling changes the size of the feature map. As a result, the gradients cannot be passed one-to-one from the elements after pooling to their corresponding elements in the feature map before pooling. A general solution is to make sure the gradient of every element (or pixel) is passed backward to the elements that generated this element in the preceding forward pass. In this way, the total amount of gradient stays the same, which is needed to avoid gradient explosion or vanishing. Different pooling methods, e.g., average pooling and max pooling, require different treatments.
For average pooling, as illustrated in Fig. 9.8, a straightforward way is to evenly divide the gradient of an element among all the elements that were pooled to generate that element. For example, the yellow element in the feature map after pooling is generated by the four elements in the feature map before pooling in the forward pass. Therefore, in the backward pass, the element's gradient of 4 is divided by 4. The obtained quotient, 1, is then assigned to those four elements.
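A hedged sketch of this backward rule, assuming non-overlapping pooling windows (window size equal to the stride), is given below; the gradient values are illustrative.

```python
import numpy as np

def avg_pool_backward(dL_dout, pool=2):
    """Distribute each output gradient evenly over the pool x pool input
    elements that produced it in the forward pass (Fig. 9.8)."""
    g_h, g_w = dL_dout.shape
    dL_din = np.zeros((g_h * pool, g_w * pool))
    for r in range(g_h):
        for c in range(g_w):
            dL_din[r * pool:(r + 1) * pool,
                   c * pool:(c + 1) * pool] = dL_dout[r, c] / (pool * pool)
    return dL_din

dL_dout = np.array([[4.0, 8.0], [12.0, 16.0]])
print(avg_pool_backward(dL_dout))   # each gradient split into four equal parts
```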
For max pooling, a common way is to give the gradient to the element with the maximum value in the feature map before pooling. As shown in Fig. 9.9, the element with a gradient of 8 (orange) passes this gradient to the lower left element in the orange area, which had the maximum value in the forward pass. Therefore, in the implementation, the location of the element with the highest value in each small pooling region needs to be recorded in the forward pass, so that the gradient can be passed to it in the following backward pass.
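Similarly, a sketch of max pooling that records the argmax location of each window in the forward pass and routes the gradient only to that location in the backward pass could look as follows (non-overlapping windows and illustrative values are assumed):

```python
import numpy as np

def max_pool_forward_backward(x, dL_dout, pool=2):
    """Max pooling: record the argmax location of each pooling window in the
    forward pass, then route the corresponding upstream gradient only to that
    location in the backward pass (Fig. 9.9)."""
    g_h, g_w = dL_dout.shape
    out = np.zeros((g_h, g_w))
    dL_din = np.zeros_like(x)
    for r in range(g_h):
        for c in range(g_w):
            window = x[r * pool:(r + 1) * pool, c * pool:(c + 1) * pool]
            out[r, c] = window.max()
            idx = np.unravel_index(np.argmax(window), window.shape)
            dL_din[r * pool + idx[0], c * pool + idx[1]] = dL_dout[r, c]
    return out, dL_din

x = np.array([[1.0, 3.0, 2.0, 1.0],
              [5.0, 2.0, 0.0, 4.0],
              [1.0, 1.0, 6.0, 2.0],
              [2.0, 0.0, 3.0, 1.0]])
dL_dout = np.array([[8.0, 4.0], [2.0, 6.0]])
out, dL_din = max_pool_forward_backward(x, dL_dout)
print(out)      # pooled feature map
print(dL_din)   # gradients placed at the recorded argmax positions
```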
It is worthwhile to mention that, in the above figures, the same feature map after pooling is used for the two passes (i.e., forward and backward) to help readers better understand the operations. In reality, the feature map after pooling and the gradients that need to be passed backward are totally different, though they have the same size.
Figure 9.8: Forward and backward passes in backpropagation of average pooling layer
Figure 9.9: Forward and backward passes in backpropagation of max pooling layer

9.3. Recurrent Neural Network

RNNs have many variations with distinct structures. Shown in Fig. 9.10 is one of the most common architectures. On the left is the schematic of a unit. The one in the middle shows a more detailed flow of the data through the unit. The subplot on the right illustrates how the unit unfolds over time.
Figure 9.10: Structure and workflow of RNN
The symbols in Fig. 9.10 have the following meanings:
$x^{t}$ is the input at step $t$ in the time series;
$h^{t}$ is the state of the hidden layer at $t$, which is determined by both $x^{t}$ and $h^{t-1}$ according to the above network architecture;
$o^{t}$ is the output at $t$, which is determined by the hidden state at the current step $t$;
$L^{t}$ is the loss function at $t$;
$y^{t}$ is the actual output (or label) at $t$;
$\tilde{y}^{t}$ is the predicted output at $t$;
$\sigma_{1}$ and $\sigma_{2}$ are the hidden layer activation function and the output activation function, respectively;
$U$, $W$, and $V$ are the arrays that contain the network weights to be learned.
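Based on these symbols, a minimal sketch of one unrolled step of this RNN, assuming $\sigma_{1}=\tanh$, $\sigma_{2}=\operatorname{softmax}$, and no bias terms (the exact choices depend on the variant shown in Fig. 9.10), could be written as:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V):
    """One unrolled RNN step, following the symbols in Fig. 9.10:
    h^t = sigma_1(U x^t + W h^{t-1}),  o^t = V h^t,  y_tilde^t = sigma_2(o^t).
    Here sigma_1 = tanh and sigma_2 = softmax are assumed; biases are omitted."""
    h_t = np.tanh(U @ x_t + W @ h_prev)        # hidden state at step t
    o_t = V @ h_t                              # output at step t
    e = np.exp(o_t - o_t.max())                # softmax as sigma_2
    y_tilde_t = e / e.sum()
    return h_t, y_tilde_t

# Illustrative sizes: 4-dim input, 3-dim hidden state, 2 output classes.
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4))
W = rng.normal(size=(3, 3))
V = rng.normal(size=(2, 3))
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):            # unfold over 5 time steps
    h, y_tilde = rnn_step(x_t, h, U, W, V)
print(y_tilde)
```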