
Chapter 9

Deep Learning

9.1. From Artificial Neural Networks to Deep Learning

9.1.1. Overview

Deep learning is a subset of machine learning that uses deep Artificial Neural Networks (ANNs) - algorithms inspired by the human brain - to perform human-like tasks such as speech recognition, image identification, and decision-making. The "deep" in this definition generally refers to the depth of a Neural Network (NN): a network with more than three layers, inclusive of the input and output layers, can be considered a deep learning model. Therefore, an in-depth understanding of deep learning cannot begin without NNs, especially the earlier work on "shallow" NNs.
In this chapter, the history of deep learning, including its roots in and the work preceding it on shallow NNs, will first be introduced in terms of three waves. Critical insights will be gained into some common questions about deep learning: "Is deep learning just a rebranding of neural networks?", "What are the major breakthroughs that drove the development of deep learning from its origins in NNs to where it is today?", and "What can deep learning do, and where is it heading?" After that, two basic elements of modern deep learning that help address vanishing gradients, i.e., activation and initialization, will be introduced. Next, the implementations of the two most widely used types of deep NNs, i.e., CNN and RNN, especially backpropagation through these networks, will be described. For CNNs, detailed treatments of backpropagation for convolution, padding and stride, ReLU, and pooling will be explained with examples. For RNNs, the mathematical formulation of a typical RNN architecture will be provided. Based on that, practical deep learning skills will be shared, first for various widely accepted initialization and batch normalization methods. Also, gradient descent optimizers, an essential part of deep learning, will be investigated in detail for common solvers including SGD, Momentum, NAG, Adagrad, Adadelta, RMSProp, AdaMax, and Nadam. More information about data preprocessing and augmentation will be provided at the end.

9.1.2. The First Wave

As shown in Fig. 9.1, the first wave of NNs, or deep learning, was mostly about shallow NNs, though deep NNs did appear. The history of ANNs usually traces back to the birth of the mathematical model for the neurons of human brains, which was proposed by Walter Pitts and Warren McCulloch in 1943 [9]. This model, which was introduced in the chapter on ANNs, is usually called the McCulloch-Pitts (M-P) model and laid the foundation for ANNs and the succeeding work in deep learning. Later work based on the M-P model was among the earliest AI efforts that used networks or circuits of connected units to simulate intelligent behavior, an approach called "connectionism" [68]. Most of these approaches were abandoned in the late 1950s as symbolic reasoning became the essence of AI, following the success of programs like the Logic Theorist and the General Problem Solver [13].
The next breakthrough was the advent of a new avatar of the M-P neuron in 1958, i.e., the perceptron, which was shown to have the learning capability of performing binary classification [69]. This inspired revolutions in the research of shallow neural networks and kept the field alive. In 1960, the first backpropagation model was proposed by Henry Kelley [70] in the context of control theory, which laid the foundation for training ANNs. Another step was made when Stuart Dreyfus presented a backpropagation model in 1962 that used the simple derivative chain rule [71], which is used by most ANNs these days, instead of the dynamic programming used in earlier backpropagation models. Another work that needs to be mentioned is the multilayer NN pioneered by Alexey Grigoryevich Ivakhnenko along with Valentin Grigorevich Lapa in 1965, which created a hierarchical representation of the neural network using a polynomial activation function [72].

Figure 9.1: History of deep learning development
Mainstream perceptron studies came to an abrupt end in 1969, when Marvin Minsky and Seymour Papert published the book Perceptrons [17], which showed that Rosenblatt's perceptron cannot solve functions that are not linearly separable, such as XOR. This fall of the perceptron triggered a winter of NN research. Even so, the field still advanced with leaps in backpropagation and deep NNs. In particular, backpropagation was implemented in computer code via a general method for automatic differentiation by Seppo Linnainmaa in 1970 [73], though the implementation of backpropagation in NNs came into play a decade later. In 1971, the effort towards deep NNs continued, such as an 8-layer NN using the Group Method of Data Handling (GMDH) by Alexey Grigoryevich Ivakhnenko [74].
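The XOR limitation is easy to reproduce: no single line in the plane separates XOR's positive and negative cases, so the perceptron learning rule never converges on it. Below is a minimal, hypothetical NumPy sketch (illustrative only, not from the original text):

```python
import numpy as np

# XOR truth table: no single line can separate the 0s from the 1s.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        pred = int(np.dot(w, xi) + b > 0)   # step activation
        if pred != yi:                      # classic perceptron update rule
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
            errors += 1
    if errors == 0:
        break

# errors never reaches 0: XOR is not linearly separable
print(f"errors in final epoch: {errors}")
```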
After the major AI winter of 1974-1980, two major architectures for deep learning, i.e., the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), appeared. In 1980, the first CNN was proposed by Kunihiko Fukushima as the Neocognitron for recognizing visual patterns such as handwritten characters [75]. In 1982, the Hopfield Network, which was essentially an RNN, was created by John Hopfield to serve as a content-addressable memory system [76]. Besides architectures, the use of backpropagation for propagating errors during the training of neural networks was proposed by Paul Werbos in 1982 [77]. Another influential study was the development of the Boltzmann Machine by Ackley et al. (1985), a stochastic recurrent neural network [78]. In summary, this first wave started with the M-P model as a single neuron, rose with the perceptron, and fell as single-layer NNs were constrained by their limited learning ability.

9.1.3. The Second Wave

The second wave of ANNs can be viewed as the time when deep learning started growing and dominating NN work. In 1986, the successful implementation of backpropagation by Hinton et al. [79] opened the gates for training complex deep neural networks, which had been the main obstruction in earlier days of research in this area, and propelled the rise of the second wave of deep learning. In the same year, a variation of the Boltzmann Machine with no intra-layer connections in the visible and hidden layers was proposed as the Restricted Boltzmann Machine (RBM), which later became popular, especially for building recommender systems. Later, in 1989, backpropagation was successfully used to train a CNN to recognize handwritten digits, which laid the foundation of modern computer vision using deep learning [80]. Besides implementations, a breakthrough in theory was achieved as George Cybenko published the earliest version of the Universal Approximation Theorem: a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy [81]. This theorem added credibility to deep learning.
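As a quick illustration of the theorem's content (a hypothetical NumPy sketch, not from the original text), a network with a single tanh hidden layer can fit a continuous target such as sin(x) closely; here the hidden weights are fixed at random and only the output layer is fit by least squares, which is already enough to show the approximation power of one hidden layer:

```python
import numpy as np

# Single-hidden-layer network y = tanh(x @ W1 + b1) @ W2 approximating sin(x).
rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

n_hidden = 50
W1 = rng.normal(scale=2.0, size=(1, n_hidden))   # random hidden weights
b1 = rng.normal(scale=2.0, size=n_hidden)        # random hidden biases

H = np.tanh(x @ W1 + b1)                          # hidden activations (200 x 50)
W2, *_ = np.linalg.lstsq(H, y, rcond=None)        # fit output weights

err = np.max(np.abs(H @ W2 - y))
print(f"max approximation error: {err:.4f}")      # small: one hidden layer suffices
```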
The second wave descended as Sepp Hochreiter reported the problem of vanishing gradients in 1991, which explained why the learning of deep neural networks could be extremely slow and almost impractical [82]. After another AI winter in 1987-1993, the development of deep learning slowed down significantly due to the fast development of other machine learning methods, such as those based on statistics in the 1990s and support vector machines in the early 2000s. This dark time also prompted researchers in the area to rebrand the frowned-upon NN research with the moniker "deep learning". Despite this, a milestone was still passed as an effective RNN architecture, i.e., Long Short-Term Memory (LSTM), was proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997, which helped revolutionize deep learning in the next wave [83]. As can be seen, the second wave started with the successful training of deep NNs with backpropagation and fell as the vanishing gradient issue, which prevents efficient training of deep NNs, was identified.
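The mechanism behind vanishing gradients can be shown in a few lines (a hypothetical NumPy sketch, not from the original text): backpropagation multiplies the gradient by the local derivative of each layer's activation, and the sigmoid's derivative never exceeds 0.25, so the product reaching early layers decays roughly geometrically with depth:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each sigmoid layer multiplies the backpropagated gradient by
# sigma'(z) <= 0.25, so the factor shrinks geometrically with depth.
rng = np.random.default_rng(0)
grad = 1.0
for layer in range(1, 31):
    s = sigmoid(rng.normal())    # activation at a random pre-activation
    grad *= s * (1.0 - s)        # local sigmoid derivative, at most 0.25
    if layer % 10 == 0:
        print(f"after {layer:2d} layers: gradient factor ~ {grad:.2e}")
```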

9.1.4. The Third Wave

The third wave marked the explosive development of deep learning, which might be attributed to three factors: improvements in architectures, especially for addressing the vanishing gradient issue; an increase in computational power, represented by the use of GPUs; and the growth of data, especially image data, in a "big data" era. The wave is usually believed to have started with an architectural breakthrough: the publication of Deep Belief Networks in Science in 2006 [22]. This deep learning architecture, a stack of multiple RBMs, showed the possibility of and a feasible way for training deep NNs, though this way is no longer widely used due to the advent of other techniques for addressing vanishing gradients. As for computing power, Andrew Ng's group at Stanford started advocating in 2008 for the use of GPUs to speed up the training of deep neural networks by many folds [23]. This brought practicality to deep learning, i.e., training deep NNs on huge volumes of data efficiently. As for data, in 2009, Fei-Fei Li's team launched ImageNet, a database of 14 million labeled images, which served as a deep learning benchmark for the annual ImageNet competitions (ILSVRC) [24]. Another architectural improvement was made in 2011 as Glorot et al. [25] proposed to use ReLU to replace the traditional Sigmoid and tanh activation functions for addressing vanishing gradient problems. This presented another tool, after the GPU, to deal with the long and impractical training times of deep neural networks, and it was later widely used and improved in prevalent deep NNs.
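The advantage of ReLU here can be seen by comparing local derivatives: the sigmoid's derivative is at most 0.25 everywhere, while ReLU's derivative is exactly 1 for any positive input, so long products of ReLU derivatives need not vanish. A small illustrative NumPy sketch (not from the original text):

```python
import numpy as np

# Local gradients of the two activations over a range of pre-activations.
z = np.linspace(-4, 4, 9)
sig = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sig * (1 - sig)          # peaks at 0.25 when z = 0
d_relu = (z > 0).astype(float)       # exactly 1 for all positive z

for zi, ds, dr in zip(z, d_sigmoid, d_relu):
    print(f"z = {zi:+.1f}   sigmoid' = {ds:.3f}   relu' = {dr:.0f}")
```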
A victory of deep learning came as AlexNet, a GPU-implemented CNN model designed by Alex Krizhevsky, won ImageNet's image classification contest in 2012 [26]. This victory triggered a new deep learning boom globally and attracted industry giants' attention. Apart from ReLU, some notable technical merits contributed to the success of AlexNet. Dropout, an effective approach to regularization in neural networks (applied only during training), helped address the overfitting issues of deep learning by reducing interdependent learning amongst the neurons [84], while pooling as part of the model's architecture (used in both training and testing) helped the model become less sensitive to small translations (i.e., improved translation invariance). Deep learning started gaining more momentum, making impacts in or even sweeping many disciplines such as computer vision and natural language processing.
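To illustrate how dropout works mechanically, the following NumPy sketch (hypothetical, using the now-common "inverted dropout" formulation rather than AlexNet's original test-time scaling) zeroes a random fraction of activations during training and rescales the survivors, so the layer becomes an identity at test time:

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero a random fraction p_drop of units during
    training and rescale survivors so the expected activation is unchanged;
    at test time the layer passes activations through untouched."""
    if not training or p_drop == 0.0:
        return activations
    keep = (rng.random(activations.shape) >= p_drop).astype(float)
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones((2, 8))
print(dropout(a, p_drop=0.5, rng=rng))                   # ~half zeroed, rest scaled to 2.0
print(dropout(a, p_drop=0.5, rng=rng, training=False))   # unchanged at test time
```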
In the fast-rising phase of this third wave, several milestones, either technical improvements or important events, need to be mentioned. The first was the development of the Generative Adversarial Network (GAN) created by Ian Goodfellow in 2014 [27], which provided a way to synthesize realistic (high-fidelity) data and consequently opened a new door for the application of deep learning in fashion, art, and science. A major victory of deep learning was announced as DeepMind's deep reinforcement learning model beat human champions in the complex game of Go in 2016 [85]. Improvements in initialization techniques further helped address the vanishing gradient issue. In fact, as early as 2010, Xavier Glorot and Yoshua Bengio proposed an initialization technique [86], which quickly became the default. This technique, generally referred to as "Xavier initialization" (weights on the order of $\pm 1/\sqrt{n}$, where $n$ is the number of nodes in the prior layer) when using Sigmoid and tanh activation functions, was later slightly modified to suit the use of ReLU and is referred to as "He initialization" ($\pm\sqrt{2/n}$).
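Both rules are easy to state in code. The following NumPy sketch is an illustration, not the book's code; it assumes the uniform variant of Xavier initialization and the normal-distribution variant of He initialization, which are common in practice:

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Xavier/Glorot: uniform weights in +-1/sqrt(n_in), suited to sigmoid/tanh."""
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out, rng):
    """He: normal weights with standard deviation sqrt(2/n_in), suited to ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W_tanh = xavier_init(256, 128, rng)
W_relu = he_init(256, 128, rng)
# He weights have a larger spread to offset the units ReLU zeroes out.
print(f"xavier std: {W_tanh.std():.4f}   he std: {W_relu.std():.4f}")
```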
What followed was the fast development of different deep learning architectures. In order to improve the learning ability of deep NNs, deeper networks were explored first by directly increasing the number of layers, such as the VGG network in 2014 [87], an approach found to be constrained by the limits of computational power and vanishing gradients. Later attempts were made in two other directions and gained success: networks such as Inception (GoogLeNet as V1, 2014) explored "wider" deep NNs with the idea from "Network in Network" [88], while residual networks, e.g., ResNet (2015), attempted deeper NNs by introducing connections between layers that are far from each other in the same network (sketched below) - an idea evolving from the "Highway Network" [89]. Later efforts merged the two variants, such as Xception and ResNeXt. Besides efforts at obtaining higher accuracy, such as Inception and ResNet, another direction was the advancement of deep learning on mobile devices, represented by architectures proposed for such purposes like SqueezeNet, MobileNet, and ShuffleNet [90]. While the above networks were primarily proposed for classification and regression tasks, advancements in NN architectures were also made for object detection tasks in computer vision, represented by Regions with CNN features (R-CNN) and later Fast R-CNN [91], Faster R-CNN [92], YOLO [93], and GCN [94]. Graph Neural Networks (GNNs) also gained popularity at this time [95, 96]. More recently, Large Language Models for generative AI like ChatGPT and DeepSeek [28], which were enabled by the earlier Transformer architecture [97], triggered another deep learning rush in both industry and academia.
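To make the skip-connection idea concrete, here is a minimal, hypothetical NumPy sketch of a fully connected residual block computing F(x) + x; real ResNet blocks use convolutions and batch normalization, which are omitted for brevity:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Minimal residual block: the input is added back to the transformed
    signal, giving gradients a short path around the block's layers."""
    return relu(x @ W1) @ W2 + x      # F(x) + x, the ResNet skip connection

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
W1 = rng.normal(scale=np.sqrt(2.0 / d), size=(d, d))   # He-style scaling
W2 = rng.normal(scale=np.sqrt(2.0 / d), size=(d, d))
print(residual_block(x, W1, W2).shape)   # (4, 16): same shape, so blocks stack
```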