
Chapter 4

Linear Models

4.1. Overview

This chapter introduces one of the simplest yet most useful types of machine learning algorithms: linear models. First, the theory behind the basic linear model algorithm will be laid out. Then, the use of regularization to generate other common types of linear model algorithms, e.g., Ridge and Lasso, will be explained. Linear models are, by nature, proposed for regression. The extension of such algorithms to classification, including both binary classification and multiclass classification, will be introduced next. Finally, we will show how kernel functions can turn linear models into nonlinear methods for regression and classification.
It is worthwhile to clarify a few commonly confused points before we start the journey.
First, "linear model" or "linear models" here refers to the algorithm(s) that helps us obtain the actual models. As explained in previous chapters, algorithms are methods to obtain models, while models are algorithms that have been instantiated or fixed with model parameters, which are determined by specific datasets during training. In a simple way, an algorithm can be analogous to the use of a polynomial function to fit a dataset, while a model is the polynomial with parameters fixed in this curve fitting process. But it is kind of a convention to call such algorithms "linear model" or "linear models" in machine learning. This convention is still kept here to stay aligned with most machine learning literature.
Second, in a narrow sense, the linear model topic in machine learning contains only those algorithms in which the independent variables (representing the data attributes), e.g., $x_1, x_2, \cdots, x_J$ in $\vec{x}$, enter the model linearly. That is, the dependent variable (representing the label) $y$ is a linear combination of these independent variables $x_j$. However, it is quite common to adopt kernel functions to give linear models nonlinear capacities. Thus, linear models that use kernel functions are still introduced as part of the linear model topic, in a broad sense, in this book.
Third, there are large overlaps between linear models, linear regression, and curve fitting with linear functions, and these terms may even refer to the same thing in some cases. This confusion arises because the terms were proposed and are used by people from different disciplines, e.g., computer science, math, and engineering. In this book, the use of linear models is mostly inherited from machine learning and data analysis; thus, the focus is on the use of linear models for addressing regression and classification problems. In this context, an analytical solution can be obtained directly due to the intentional confinement to simple linear models. By contrast, related topics like curve fitting (especially nonlinear curve fitting), which may require complicated math functions and optimization techniques to iteratively obtain the optimal solution, will not be discussed here.

4.2. Basics of Linear Models

4.2.1. Simple Explanation of Linear Models

The idea of linear models is to use a linear mathematical function of one or more variables to approach some given function value(s) of these variable(s). In machine learning, such a function value corresponds to the label of a sample, while such variables correspond to different attributes of the sample. Frequently, in the context of math functions, the terms independent variables and dependent variables are used to refer to the sample attributes and labels, respectively. In some cases, especially when the context of statistics and probability is adopted, it is also common to see the terms predictors and response variables.
Let us first outline a typical problem for linear models using the language of math. In the simplest case, let us consider only one variable $x$. Then, the function $f(x)$ can be formulated as
\begin{equation*}
f(x) = w \cdot x + b \tag{4.1}
\end{equation*}
where $[x, y]$ represents one data point in a space consisting of 2 axes (or dimensions), and $\cdot$ is the (arithmetic) product between two scalars.
In a regression problem, there are usually multiple data points in this space, e.g., $I$ points $\left[(x_1, y_1), (x_2, y_2), \cdots, (x_I, y_I)\right]$, which are the "measurements" that need to be approximated with Eq. 4.1. The goal of using a linear model for regression is to find the function $f(x) = w \cdot x + b$ closest to all the measurements. The closest curve can be associated with the smallest total distance between the measured values $y_i$ and predicted values $f(x_i)$. Usually, the Euclidean distance is selected as the measure of distance in regression tasks. Under this condition, a regression task using a linear model can be mathematically described as
\begin{equation*}
\left(w^{*}, b^{*}\right) = \arg\min_{(w, b)} \sum_{i=1}^{I}\left(f(x_i) - y_i\right)^{2} = \arg\min_{(w, b)} \sum_{i=1}^{I}\left(w \cdot x_i + b - y_i\right)^{2} \tag{4.2}
\end{equation*}
where $w^{*}$ and $b^{*}$ are the optimal values of the model parameters $w$ and $b$, and $\arg\min_{(w, b)}$ means finding the values of $w$ and $b$ that minimize the function they parameterize. This equation presents a formulation of linear regression with ordinary least squares.
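As a quick illustration of Eq. 4.2 (the data values below are made up for demonstration and are not from the book), the optimal $w$ and $b$ for a handful of 1D points can be computed with the well-known closed-form result of simple linear regression and cross-checked against NumPy's polyfit:

import numpy as np

# A few made-up 1D measurements (x_i, y_i), roughly following y = 2x + 3 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.2, 6.8, 9.1, 10.9])

# Closed-form solution of Eq. 4.2 for a single variable:
# w = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2), b = y_mean - w * x_mean
x_mean, y_mean = x.mean(), y.mean()
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - w * x_mean

# Cross-check with NumPy's built-in least-squares polynomial fit (degree 1)
w_ref, b_ref = np.polyfit(x, y, 1)
print(w, b)          # close to 2 and 3
print(w_ref, b_ref)  # matches w and b above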

4.2.2. General Formulation of Basic Linear Model

The data in general regression applications can be much more complicated than that in the above simple case. In particular, each data sample may contain more than one attribute, which can be viewed as multiple random variables from a statistics perspective. We can understand the typical data that we will deal with in linear model applications using a data structure that is compatible with those adopted in tools like NumPy and Pandas.
Figure 4.1: Illustration of general data structure
As shown in Fig. 4.1, the unlabeled data $\bar{X}$ has $I$ samples (rows), and each sample $\vec{x}_i^{\,T}$ consists of $J$ attributes (columns). In fact, any sample in its original shape, $\vec{x}_i$, is essentially a column vector as follows,
\begin{equation*}
\vec{x}_{i} = \left[\begin{array}{c}
x_{i1} \\
x_{i2} \\
\vdots \\
x_{iJ}
\end{array}\right] \tag{4.3}
\end{equation*}
and all the unlabeled data form $\bar{X} = \left[\vec{x}_1^{\,T}, \vec{x}_2^{\,T}, \cdots, \vec{x}_I^{\,T}\right]^{T}$ (or written as $\bar{X} = \left[\vec{x}_1^{\,T}; \vec{x}_2^{\,T}; \cdots; \vec{x}_I^{\,T}\right]$). Note that each sample $i$ in Fig. 4.1 is $\vec{x}_i^{\,T}$ (a row) instead of $\vec{x}_i$ (a column). This data structure and tensor notation (a hybrid notation, explained in the book Appendices) are adopted throughout this book.
For any sample $\vec{x}_i$, the linear function of the variables $x_{ij}\ (j \in \{1, \cdots, J\})$ can be formulated as the following linear combination:
\begin{align*}
f\left(\vec{x}_{i}\right) & = \vec{w}^{T} \cdot \vec{x}_{i} + b = \vec{x}_{i}^{T} \cdot \vec{w} + b = w_{1} \cdot x_{i1} + w_{2} \cdot x_{i2} + \cdots + w_{J} \cdot x_{iJ} + b \\
& = \left[\begin{array}{cccc}
w_{1} & w_{2} & \cdots & w_{J}
\end{array}\right]\left[\begin{array}{c}
x_{i1} \\
x_{i2} \\
\vdots \\
x_{iJ}
\end{array}\right] + b \tag{4.4}
\end{align*}
where $\left[x_{i1}, x_{i2}, \cdots, x_{iJ}, y_{i}\right]^{T}$ represents one labeled data point in a space consisting of $J+1$ axes, in which $J$ axes represent the $J$ variables while the remaining axis represents the label. In some places, people consider the label separately and thus state that the data has $J$ dimensions. $f\left(\vec{x}_{i}\right)$ is still used to map the input $\bar{X}$ to the output $\vec{y}$. The meaning of the '$\cdot$' sign changes from multiplication in arithmetic to the dot product in tensor algebra as the variables change from scalars $x_{i}$ to vectors $\vec{x}_{i}$.
In a regression problem, usually, multiple points, e.g., $I$ points in this case, are available as "measurements". The goal of using a linear model for regression is to find a function (e.g., a curve in 2D), $f(\vec{x}) = \vec{w}^{T} \cdot \vec{x} + b$, that is the closest to all the measurements. Again, the closest curve is usually associated with the smallest total distance between the measurement points and the curve, which is usually formulated using the Euclidean distance. Under this condition, the regression task using a linear model can be mathematically described as
\begin{equation*}
\left(\vec{w}^{*}, b^{*}\right) = \arg\min_{(\vec{w}, b)} \sum_{i=1}^{I}\left(f\left(\vec{x}_{i}\right) - y_{i}\right)^{2} = \arg\min_{(\vec{w}, b)} \sum_{i=1}^{I}\left(\vec{w}^{T} \cdot \vec{x}_{i} + b - y_{i}\right)^{2} \tag{4.5}
\end{equation*}
The linear model can be written using arrays as
\begin{align*}
f(\bar{X}) & = \bar{X} \cdot \vec{w} + b \\
& = \left[\begin{array}{cccc}
x_{11} & x_{12} & \cdots & x_{1J} \\
\vdots & \vdots & \ddots & \vdots \\
x_{I1} & x_{I2} & \cdots & x_{IJ}
\end{array}\right]_{I \times J}\left[\begin{array}{c}
w_{1} \\
w_{2} \\
\vdots \\
w_{J}
\end{array}\right]_{J \times 1} + b \tag{4.6}
\end{align*}
The above equation can be reformulated as
\begin{align*}
f(\hat{X}) & = \hat{X} \cdot \hat{w} \\
& = \left[\begin{array}{ccccc}
x_{11} & x_{12} & \cdots & x_{1J} & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{I1} & x_{I2} & \cdots & x_{IJ} & 1
\end{array}\right]_{I \times (J+1)}\left[\begin{array}{c}
w_{1} \\
w_{2} \\
\vdots \\
w_{J} \\
b
\end{array}\right]_{(J+1) \times 1} \tag{4.7}
\end{align*}
where,
\begin{gathered}
\hat{X}_{I \times (J+1)} = \left[\bar{X}, [1; 1; \cdots; 1]\right] \\
\hat{w}_{(J+1) \times 1} = [\vec{w}; b]
\end{gathered}
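As a brief illustration of this augmentation (using arbitrary example arrays, not data from the book), $\hat{X}$ and $\hat{w}$ can be assembled in NumPy as follows, and they reproduce the result of Eq. 4.6:

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # I = 3 samples, J = 2 attributes
w = np.array([0.5, -1.0])                           # weight vector w, shape (J,)
b = 2.0                                             # intercept

X_hat = np.hstack((X, np.ones((X.shape[0], 1))))    # append a column of ones: I by (J+1)
w_hat = np.append(w, b)                             # stack w and b: shape (J+1,)

print(np.allclose(X @ w + b, X_hat @ w_hat))        # True: Eq. 4.6 and Eq. 4.7 agree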
The loss function (cost function or objective function) can be written as
\begin{equation*}
\ell(\hat{w}) = \|f(\hat{X}) - \vec{y}\|^{2} = (\hat{X} \cdot \hat{w} - \vec{y})^{T} \cdot (\hat{X} \cdot \hat{w} - \vec{y}) \tag{4.8}
\end{equation*}
The use of the sign '$\cdot$' can be confusing: this sign can refer to either the scalar product (regular multiplication between scalars) or the dot product. As introduced in the previous chapters, concepts like vectors and tensors are widely used in engineering, and sometimes the tensor notation can be more convenient than the index notation, which is widely used in traditional machine learning. For example, algorithms can be expressed more compactly in the tensor notation without worrying about element-wise operations, provided the operators between tensors are clearly defined. Also, tensors can be related more straightforwardly to the arrays used in coding and computing. Thus, the tensor notation is adopted, with minor modifications, in most places of this book, including the majority of this chapter. More information about notation and operations can be found in Chapter 1 and the appendices at the end of the book.
To facilitate the use of tensors and the tensor notation, it is clarified that scalars are viewed as 0th-order tensors, vectors are 1st-order tensors, while 2nd- and higher-order tensors, which are tensors in a narrow sense, are termed tensors or high-order tensors. Also, as for operators or operation signs in this book, when '$\cdot$' is used between a scalar and any other tensor (scalar, vector, or high-order tensor), it denotes the scalar product: the product between the scalar and the other scalar or the elements of the tensor; when it is used between a vector and another 1st- or higher-order tensor, it refers to a dot (vector) product or, equivalently, a tensor product with a single contraction. In Python coding, '*' is used for the former type of product (scalar product), while '@' is used for the latter (dot product). Following this rule will make the implementation of the theories in this book much simpler than what can be seen in other literature.
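This convention can be checked directly in NumPy, as the brief snippet below illustrates (the arrays here are arbitrary examples): '*' performs the scalar (element-wise) product, while '@' performs the dot product, i.e., a tensor product with a single contraction.

import numpy as np

a = 2.0                                 # scalar (0th-order tensor)
v = np.array([1.0, 2.0, 3.0])           # vector (1st-order tensor)
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])         # 2nd-order tensor (matrix), 2 by 3

print(a * v)        # scalar product: [2. 4. 6.]
print(v @ v)        # dot product of two vectors: 14.0
print(M @ v)        # matrix-vector product (single contraction): [7. 5.]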
Then, the mathematical description of the regression problem using a linear model is
\begin{equation*}
\hat{w}^{*} = \arg\min_{\hat{w}} (\hat{X} \cdot \hat{w} - \vec{y})^{T} \cdot (\hat{X} \cdot \hat{w} - \vec{y}) \tag{4.9}
\end{equation*}
The following closed-form solution to the above equation can be obtained:
\begin{equation*}
\hat{w}^{*}_{(J+1) \times 1} = \left(\hat{X}^{T}_{(J+1) \times I} \cdot \hat{X}_{I \times (J+1)}\right)^{-1} \cdot \hat{X}^{T}_{(J+1) \times I} \cdot \vec{y}_{I \times 1} \tag{4.10}
\end{equation*}
The following Python code presents the construction of a linear model based on the above theory. This model was created with a small dataset consisting of eight data points, each of which has two attributes. Random noise was added to the dataset.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

# Create data
X_list = [[1.,1.],[1.,2.],[2.,2.],[2.,3.],[1.5,2.5],[2.,4.],[1.,3.],[3.,1.5]] # Create unlabeled data points/instances using a Python list
X = np.array(X_list) # Turn instances into a NumPy array: I = 8 by J = 2
Proj = np.array([1,2]) # True weights used to create perfect labels
y = np.dot(X,Proj) + 3 # Add intercept
y = y + np.random.rand(y.size) # Add noise to mimic experimental measurements

# Define a function to compute the analytical solution to linear model problems (Eq. 4.10)
def LinearRegression(X,y):
    X_hat = np.hstack((X,np.ones([X.shape[0],1]))) # Augmented matrix X_hat: I = 8 by J+1 = 3
    W_hat = np.linalg.inv(X_hat.T @ X_hat) @ X_hat.T @ y # W_hat = [w; b]: J+1 = 3 by 1
    return W_hat, X_hat

# Symbols are the same as those in the textbook
W_hat, X_hat = LinearRegression(X,y)
W = W_hat[:-1] # Weight vector w (first J entries)
b = W_hat[-1]  # Intercept b (last entry)

# Predictions of the fitted model
Y_fit = np.dot(X_hat,W_hat)

# Plot and compare results
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter(X[:,0],X[:,1],y,c='b',marker='*')     # original (noisy) data points
ax.scatter(X[:,0],X[:,1],Y_fit,c='r',marker='d') # predictions of the trained model
plt.show()
              
The original data points and the data points predicted with the trained model are plotted in Fig. 4.2.
Figure 4.2: Training result of a linear model
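As an optional sanity check (assuming scikit-learn is available in the environment; this comparison is not part of the original example), the closed-form parameters obtained above can be compared with those from scikit-learn's LinearRegression, which solves the same ordinary least squares problem. The results should agree within numerical precision.

# Reuses X, y, W, and b from the code above; imported under an alias to avoid
# shadowing the LinearRegression function defined earlier
from sklearn.linear_model import LinearRegression as SKLinearRegression

sk_model = SKLinearRegression()  # fits an intercept by default
sk_model.fit(X, y)

print(W, b)                                 # closed-form solution from Eq. 4.10
print(sk_model.coef_, sk_model.intercept_)  # should match the values above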

4.3. Other Linear Regression Algorithms

When the number of samples ($I$) is relatively small and the number of features ($J$) is relatively large, basic linear models such as the above one with ordinary least squares are susceptible to overfitting. To avoid or mitigate overfitting, we can resort to several methods. One typical method is to reduce the number of features. For this purpose, significant features are retained while insignificant ones are dropped. Another popular method for avoiding overfitting in linear models is regularization. The goal of regularization is to obtain models with both high accuracy and low complexity. The addition of a regularization term to the above basic linear model generates other popular linear models such as Ridge, Lasso, and elastic net regression models. In the following, the Ridge and Lasso linear models will be presented.

4.3.1. Ridge

Ridge regression is a linear model with $L_{2}$ norm regularization [43]. With a regularization term, the loss function of Ridge is usually constructed as follows,
\begin{equation*}
\ell(\hat{w}) = (\hat{X} \cdot \hat{w} - \vec{y})^{T} \cdot (\hat{X} \cdot \hat{w} - \vec{y}) + \lambda \hat{w}^{T} \cdot \hat{w} \tag{4.11}
\end{equation*}
Then, the Ridge regression problem can be formulated as
\begin{equation*}
\hat{w}^{*} = \arg\min_{\hat{w}}\left[(\hat{X} \cdot \hat{w} - \vec{y})^{T} \cdot (\hat{X} \cdot \hat{w} - \vec{y}) + \lambda \sum_{i=1}^{J} \hat{w}_{i}^{2}\right] \tag{4.12}
\end{equation*}
where $\lambda \hat{w}^{T} \cdot \hat{w} = \lambda \sum_{i=1}^{J} \hat{w}_{i}^{2}$.
The closed-form solution to the above Ridge regression problem can be derived as
\begin{equation*}
\hat{w}^{*} = \left(\hat{X}^{T} \cdot \hat{X} + \lambda \bar{I}\right)^{-1} \cdot \hat{X}^{T} \cdot \vec{y} \tag{4.13}
\end{equation*}
where $\bar{I}$ is an identity matrix with dimensions of $(J+1)$ by $(J+1)$.
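A minimal sketch of Eq. 4.13 in code is given below (an illustrative implementation rather than the book's official code; the function name RidgeRegression and the argument lam, standing for $\lambda$, are chosen here for illustration). Note that, following Eq. 4.13 literally, the intercept entry of $\hat{w}$ is also penalized through the last diagonal entry of $\bar{I}$.

import numpy as np

def RidgeRegression(X, y, lam=1.0):
    # Append a column of ones so the intercept is absorbed into w_hat (Eq. 4.7)
    X_hat = np.hstack((X, np.ones([X.shape[0], 1])))  # I by (J+1)
    I_mat = np.eye(X_hat.shape[1])                    # (J+1) by (J+1) identity matrix
    # Closed-form Ridge solution, Eq. 4.13
    W_hat = np.linalg.inv(X_hat.T @ X_hat + lam * I_mat) @ X_hat.T @ y
    return W_hat, X_hat

# Example usage with the X and y created earlier in this chapter:
# W_hat_ridge, _ = RidgeRegression(X, y, lam=0.5)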
In addition to the closed-form solution, we can also use the gradient descent method to solve the above optimization problem associated with Ridge regression. The gradient of the loss function, which is a key ingredient in such iterative solution processes, is given below,
\begin{equation*}
\nabla_{\hat{w}} \ell(\hat{w}) = 2 \hat{X}^{T} \cdot (\hat{X} \cdot \hat{w} - \vec{y}) + 2 \lambda \hat{w}
\end{equation*}
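Using this gradient, a plain gradient descent loop for Ridge regression can be sketched as follows (an illustrative example; the learning rate lr and the iteration count n_iter are assumed values and would need tuning for other datasets).

import numpy as np

def RidgeGradientDescent(X, y, lam=1.0, lr=1e-3, n_iter=5000):
    # Augmented data matrix and initial parameter vector (Eq. 4.7)
    X_hat = np.hstack((X, np.ones([X.shape[0], 1])))     # I by (J+1)
    w_hat = np.zeros(X_hat.shape[1])                     # (J+1,)
    for _ in range(n_iter):
        residual = X_hat @ w_hat - y                     # X_hat . w_hat - y
        grad = 2 * X_hat.T @ residual + 2 * lam * w_hat  # gradient of the Ridge loss
        w_hat = w_hat - lr * grad                        # gradient descent update
    return w_hat

# Example usage with the X and y created earlier:
# w_hat_gd = RidgeGradientDescent(X, y, lam=0.5)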