# Lagrange Duality and Karush-Kuhn-Tucker review for SVM

1. Lagrange Duality
Solve the optimization problem:
$\mathbf{\min f_{0}(x)}\\ \mathbf{\text{such that } f_{i}(x)\leq 0,\ \forall i\in 1,\dots,m}\\ \text{Consider the equivalent problem:}\\ J(x)=\left\{\begin{matrix} f_{0}(x), & \text{if }f_{i}(x)\leq 0\ \forall i\\ \infty, & \text{otherwise} \end{matrix}\right.=f_{0}(x)+\sum_{i}I[f_{i}(x)]\\ \text{where:}\\ I[u]=\left\{\begin{matrix} 0, & \text{if }u\leq 0\\ \infty, & \text{otherwise} \end{matrix}\right.\\ \text{Here }J(x)=\infty\text{ punishes any violated constraint.}\\ \text{We replace }I[u]\text{ with }\lambda u\text{ for }\lambda\geq 0;\\ \lambda u\text{ is a lower bound of }I[u]$
Figure: I[u] and $\lambda u$
$\text{The new function is called the Lagrangian:}\\ L(x,\lambda )=f_{0}(x)+\sum_{i}\lambda _{i}f_{i}(x)\\ \text{We have:}\\ \max_{\lambda \geq 0}L(x, \lambda )=J(x)\\ \text{because:}\\ \text{if }f_{i}(x)\leq 0\ \forall i\text{ we can choose }\lambda _{i}=0\ \forall i\\ \text{if }f_{i}(x)> 0\text{ for some } i \text{ we can let }\lambda _{i}\rightarrow\infty\\ \text{Our problem becomes:}\\ \min_{x}\max_{\lambda }L(x,\lambda )\\ \text{Consider the reversed order of optimization:}\\ \max_{\lambda }\min_{x}L(x,\lambda )=\max_{\lambda }g(\lambda )\\ \text{where }g(\lambda ) = \min_{x}L(x,\lambda )\\ g(\lambda )\text{ is called the dual function}\\ \mathbf{\text{Maximizing }g(\lambda):\ \max_{\lambda }\min_{x}L(x,\lambda )\text{ is called the dual problem}}\\ \mathbf{\text{while }\min_{x}\max_{\lambda }L(x,\lambda )\text{ is called the primal problem}}\\ \lambda u\text{ is a lower bound of }I[u]\text{ for all }\lambda \geq 0\\ \Rightarrow L(x,\lambda )\leq J(x)\\ \Rightarrow \min_{x}L(x,\lambda )=g(\lambda )\leq \min_{x}J(x)=p^{*}\\ \Rightarrow d^{*}=\max_{\lambda }g(\lambda ) \leq p^{*}\\ p^{*}, d^{*} \text{ are the optima of the primal and dual problems}\\ \Rightarrow g(\lambda ) \text{ gives a lower bound on the primal optimum}\\ \mathbf{\Rightarrow \max_{\lambda }\min_{x}L(x,\lambda )\leq \min_{x}\max_{\lambda }L(x,\lambda )}\\ \mathbf{\text{This is called weak duality}}\\ \mathbf{p^{*}-d^{*} \text{ is called the optimal duality gap}}\\ \mathbf{p^{*}=d^{*} \text{ is called strong duality}}\\ \text{and then the solutions of the primal and dual problems coincide}$
2. Karush-Kuhn-Tucker (KKT) conditions
$\text{If strong duality holds and }(x^{*},\lambda ^{*}) \text{ is optimal, then }x^{*}\text{ minimizes }L(x,\lambda ^{*})\\ \text{We have the first KKT condition:}\\ \mathbf{\bigtriangledown _{x}L(x^{*}, \lambda^{*})=\bigtriangledown _{x}f_{0}(x^{*})+\sum_{i}\lambda _{i}^{*}\bigtriangledown _{x}f_{i}(x^{*})=0}\\ \Rightarrow \text{the gradients must be parallel}\\ \text{Moreover, we have:}\\ f_{0}(x^{*})=g(\lambda ^{*})=\min_{x}L(x, \lambda ^{*})\leq f_{0}(x^{*})+\sum_{i}\lambda _{i}^{*}f_{i}(x^{*})\leq f_{0}(x^{*})\\ \text{because:}\\ g(\lambda ^{*})=\max_{\lambda }\min_{x}L(x,\lambda )=\min_{x}\max_{\lambda }L(x,\lambda )=f_{0}(x^{*})\\ f_{0}(x^{*}) = f_{0}(x^{*})+\sum_{i}I[f_{i}(x^{*})]\\ \Rightarrow \sum_{i}\lambda _{i}^{*}f_{i}(x^{*})=0\\ \text{Since }\lambda _{i}^{*}\geq 0 \text{ and }f_{i}(x^{*})\leq 0,\ \forall i,\\ \text{we have the second KKT condition:}\\ \mathbf{\lambda _{i}^{*}f_{i}(x^{*})=0}\\ \text{This condition is called complementary slackness}\\ \text{If }\lambda _{i}^{*}=0 \text{ then }f_{i}(x^{*})\leq 0:\text{ the constraint is not active}\\ \text{If }\lambda _{i}^{*}> 0 \text{ then }f_{i}(x^{*})=0:\text{ the constraint is active}\\ \text{In both cases the lower bound }\lambda u \text{ of }I[u]\text{ is tight}\\ \text{The last conditions are the primal and dual constraints:}\\ \mathbf{f_{i}(x^{*})\leq 0,\quad \lambda _{i}^{*} \geq 0}$
3. Example
Example 1:
$\text{Minimize } f(x_{1},x_{2})=x_{1}^{2}+x_{2}^{2}-14x_{1}-6x_{2}\\ \text{Subject to: }\\ g_{1}(x_{1}, x_{2})=x_{1}+x_{2}-2\leq 0\\ g_{2}(x_{1}, x_{2})=x_{1}+2x_{2}-3\leq 0$
Solution:
The Lagrangian:
$L(x_{1}, x_{2},\lambda _{1},\lambda _{2})=x_{1}^{2}+x_{2}^{2}-14x_{1}-6x_{2}+\lambda _{1}(x_{1}+x_{2}-2)+\lambda _{2}(x_{1}+2x_{2}-3)$
Apply KKT conditions:
$\text{KKT 1}\\ 2x_{1}-14+\lambda _{1}+\lambda _{2}=0\\ 2x_{2}-6+\lambda _{1}+2\lambda _{2}=0\\ \text{KKT 2}\\ \lambda _{1}(x_{1}+x_{2}-2)=0\\ \lambda _{2}(x_{1}+2x_{2}-3)=0\\ \text{KKT 3}\\ \lambda _{1}\geq 0,\lambda _{2}\geq 0$
There are 4 cases to check:
$\mathbf{\lambda _{1}=0\text{ and }\lambda _{2}=0}\\ \Rightarrow x_{1}=7,\ x_{2}=3,\ g_{1}=8,\ g_{2}=10\ (g_{1},g_{2}\text{ are not satisfied})\\ \mathbf{\lambda _{1}=0\text{ and }\lambda _{2}>0}\\ \Rightarrow x_{1}=5,\ x_{2}=-1,\ \lambda _{2}=4,\ g_{1}=2,\ g_{2}=0\ (g_{1}\text{ is not satisfied})\\ \mathbf{\lambda _{1}>0\text{ and }\lambda _{2}=0}\\ \Rightarrow x_{1}=3,\ x_{2}=-1,\ \lambda _{1}=8,\ g_{1}=0,\ g_{2}=-2,\ f=-26\ (\text{optimal})\\ \mathbf{\lambda _{1}>0\text{ and }\lambda _{2}>0}\\ \Rightarrow x_{1}=1,\ x_{2}=1,\ \lambda _{1}=20,\ \lambda _{2}=-8\ (\lambda_{2}\text{ is not satisfied})$
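As a quick numerical sanity check (a sketch, not part of the original solution), we can plug the winning case back into all the KKT conditions:

```python
# candidate from the case lambda1 > 0, lambda2 = 0: x* = (3, -1), lambda* = (8, 0)
x1, x2 = 3.0, -1.0
l1, l2 = 8.0, 0.0

# KKT 1: stationarity of the Lagrangian
assert 2*x1 - 14 + l1 + l2 == 0
assert 2*x2 - 6 + l1 + 2*l2 == 0

# KKT 2: complementary slackness
g1 = x1 + x2 - 2
g2 = x1 + 2*x2 - 3
assert l1*g1 == 0 and l2*g2 == 0

# primal and dual feasibility
assert g1 <= 0 and g2 <= 0 and l1 >= 0 and l2 >= 0

print(x1**2 + x2**2 - 14*x1 - 6*x2)  # -26.0
```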
Example 2:
$\text{Maximize }f(x)=x^{3}-3x\\ \text{Subject to: }g(x)=x-2\leq 0$
The Lagrangian:
$L=-(x^{3}-3x)+\lambda (x-2)\\ \text{(we convert the maximization into a minimization by negating }f\text{)}$
Apply KKT conditions:
$-(3x^{2}-3)+\lambda =0\\ \lambda (x-2)=0\\ \lambda \geq 0$
There are 2 cases to check:
$\mathbf{\lambda = 0}\\ \Rightarrow x=1 \text{ or }x=-1,\text{ with }f(1)=-2\text{ and }f(-1)=2\text{ (both satisfy the conditions)}\\ \mathbf{\lambda > 0}\\ \Rightarrow x=2,\ \lambda =9,\text{ and } f(2)=2\text{ (satisfied)}\\ \text{So we have two maximizers: }x=-1\text{ and }x=2$
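A tiny script (a sketch, not in the original) can enumerate the same candidates and compare their objective values:

```python
def f(x):
    return x**3 - 3*x

# interior stationary points (lambda = 0): f'(x) = 3x^2 - 3 = 0 gives x = -1, 1
# boundary point (constraint active): x = 2 with lambda = 3*2**2 - 3 = 9 >= 0
candidates = [-1.0, 1.0, 2.0]
values = [f(x) for x in candidates]
print(values)  # [2.0, -2.0, 2.0]

best = max(values)
print([x for x, v in zip(candidates, values) if v == best])  # [-1.0, 2.0]
```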

# Note 3: Classification and logistic regression - Python demos

1. Introduction
2. Classification and logistic regression
3. Optimization methods
3.2 Newton method
4. Regularization

1. Introduction
This note focuses on supervised learning for classification, using the logistic regression model with gradient descent and Newton's method to build the classifier.
Note: Python, NumPy, basic linear algebra and matplotlib are required. Please refer to this post and this post.
2. Classification and logistic regression
In this topic the prediction value y takes on only a small number of discrete values, and we focus on the binary classification problem in which $y \in \left \{ 0, 1 \right \}$. To predict y given x, we need a hypothesis whose output lies in [0, 1]. We define:
$h(x) = g\left ( \theta ^{T}x\right ) = \frac{1}{1 + e^{-\theta ^{T}x}}\\ \text{where: }\\ \theta ^{T}x = \sum_{j=0}^{n}\theta _{j}x_{j} \text{ and } x_{0} = 1\\ g(z) = \frac{1}{1 + e^{-z}} \text{ is called logistic function or sigmoid function}$
Figure: logistic or sigmoid function
From the plot we can see:
$g(z)\rightarrow 1 \text{ when } z\rightarrow +\infty \\ g(z)\rightarrow 0 \text{ when } z\rightarrow -\infty \\ g(z) \in (0, 1)$
Moreover, the derivative satisfies g'(z) = g(z)(1 - g(z)).
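The identity $g'(z)=g(z)(1-g(z))$ is easy to verify numerically (a quick sketch using a central finite difference):

```python
import numpy as np

def g(z):
    return 1/(1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (g(z + eps) - g(z - eps))/(2*eps)  # finite-difference derivative
analytic = g(z)*(1 - g(z))                   # the claimed closed form
print(np.max(np.abs(numeric - analytic)) < 1e-8)  # True
```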
Now switch to a probabilistic view. We consider:
$P(y=1|x; \theta ) = h(x)\text{: probability that }y=1\text{ given }x, \theta\\ P(y=0|x; \theta ) = 1 - h(x)\text{: probability that }y=0\text{ given }x, \theta\\ \text{or in compact form: } P(y|x; \theta ) = h(x)^{y}(1-h(x))^{1-y}\\ \text{Assuming the }m\text{ training examples are independent, the likelihood is:}\\ L(\theta ) = \prod_{i=1}^{m}P(y^{(i)}|x^{(i)}; \theta ) = \prod_{i=1}^{m}h(x^{(i)})^{y^{(i)}}(1-h(x^{(i)}))^{1-y^{(i)}}\\ \text{Taking the log, we define the log likelihood function:}\\ l(\theta ) = \log L(\theta ) = \sum_{i=1}^{m}\left(y^{(i)}\log(h(x^{(i)})) + (1-y^{(i)})\log(1-h(x^{(i)}))\right)$
When predicting y we want to maximize the likelihood, i.e. maximize the log likelihood function. We can convert this to the equivalent problem of minimizing $-l(\theta )$ (multiply by (-1)). We define the cost function:
$J(\theta ) = -l(\theta) = \sum_{i=1}^{m}(-y^{(i)}\log(h(x^{(i)})) - (1-y^{(i)})\log(1-h(x^{(i)})))$ and try to minimize it. In this post we focus on two methods for finding the θ that minimizes the cost function: stochastic/batch gradient descent and Newton's method.
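As a concrete illustration (a sketch with a made-up four-example data set), at $\theta = 0$ every prediction is $h = 0.5$, so the cost is exactly $m\log 2$:

```python
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def cost(theta, X, y):
    # X: (n+1) x m with x0 = 1, y: 1 x m, theta: (n+1) x 1
    h = sigmoid(theta.T.dot(X))
    return float(np.sum(-y*np.log(h) - (1 - y)*np.log(1 - h)))

X = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0, 3.0]])
y = np.array([[0.0, 0.0, 1.0, 1.0]])
theta = np.zeros((2, 1))

print(abs(cost(theta, X, y) - 4*np.log(2)) < 1e-12)  # True: J = m*log(2) at theta = 0
```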
Why is this cost function better than the 2-norm (squared error)? With the 2-norm, the cost function is not convex in θ and may have local minima, while the new cost function is convex and has one global minimum.
Figure: new cost function is better

3. Optimization methods
3.1. Stochastic gradient descent
We calculate the partial derivative of the cost function (for a single training example):
$\frac{\partial }{\partial \theta _{j}}J(\theta ) = (h(x) - y)x_{j}$
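We can check this derivative against a finite difference of the full cost (a sketch on made-up numbers; the analytic gradient below sums $(h-y)x_{j}$ over the examples):

```python
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def J(theta, X, y):
    h = sigmoid(theta.T.dot(X))
    return float(np.sum(-y*np.log(h) - (1 - y)*np.log(1 - h)))

X = np.array([[1.0, 1.0, 1.0],
              [0.5, 1.5, 2.5]])
y = np.array([[0.0, 1.0, 1.0]])
theta = np.array([[0.1], [-0.2]])

# analytic gradient: sum_i (h(x_i) - y_i) * x_ij
h = sigmoid(theta.T.dot(X))
analytic = ((h - y)*X).sum(axis=1)

# numeric gradient by central differences
eps = 1e-6
numeric = np.zeros(2)
for j in range(2):
    d = np.zeros((2, 1)); d[j] = eps
    numeric[j] = (J(theta + d, X, y) - J(theta - d, X, y))/(2*eps)

print(np.max(np.abs(analytic - numeric)) < 1e-6)  # True
```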
And then we apply the gradient descent rule using the gradient $\bigtriangledown _{\theta }J(\theta )$:
$\theta = \theta - \alpha \bigtriangledown _{\theta }J(\theta )\\ \text{where }\bigtriangledown _{\theta }J(\theta )=\begin{bmatrix} \frac{\partial J(\theta )}{\partial \theta _{0}}\\ \frac{\partial J(\theta )}{\partial \theta _{1}}\\ \vdots\\ \frac{\partial J(\theta )}{\partial \theta _{n}} \end{bmatrix}$
Combining these, for one training example (stochastic gradient descent) the update is:
$\theta_{j} = \theta_{j} - \alpha(h(x^{(i)})-y^{(i)})x_{j}^{(i)}$
and for the whole training set (batch gradient descent):
$\theta_{j} = \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x_{j}^{(i)}$
Let's make a demo using stochastic gradient descent. The input data is here.
Running this algorithm, we see that it takes a long time to reach the minimum (on my machine it took more than 8 hours):
```python
import numpy as np
import matplotlib.pyplot as plt

#we read data from files line by line
init = False
file = open('demo1x.dat', 'rb')
for row in file:
    r = row.decode('utf8').strip().split(' ')
    if init == False:
        x_train = np.array([[1], [float(r[0])], [float(r[len(r)-1])]])
        init = True
    else:
        x_train = np.append(x_train, [[1], [float(r[0])], [float(r[len(r)-1])]], axis=1)

init = False
file = open('demo1y.dat', 'rb')
for row in file:
    if init == False:
        y_train = np.array([[float(row.strip())]])
        init = True
    else:
        y_train = np.append(y_train, [[float(row.strip())]], axis=1)

#number of training examples
m = y_train.shape[1]
#init theta
theta = np.zeros((x_train.shape[0], 1))

#sigmoid function
def sigmoid(z):
    return 1/(1+np.exp(-z))

#we find all indices that make y=1 and y=0
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
#plot data points
plt.plot(x_train[1, pos], x_train[2, pos], 'ro')
plt.plot(x_train[1, neg], x_train[2, neg], 'bo')

x = 0
yT = y_train.T
preJ = 0
while True:
    J = 0
    x = x + 1
    for i in range(0, m):
        #calculate h, error, cost function for 1 training example
        h = sigmoid(theta.T.dot(x_train[:, i].T))
        error = h.T - yT[i]
        tmp = (-1)*yT[i]*np.log(h) - (1-yT[i])*np.log(1-h)
        #accumulate cost function
        J = J + tmp
        nX = np.array([x_train[:, i]]).T
        #update theta
        theta = theta - 0.000003*(error*nX)
    J = J/m
    #print the cost function every 1000 steps
    if x == 1000:
        x = 0
        print(J)
    if preJ == 0:
        preJ = J
    #stop learning when the cost function no longer decreases
    if preJ < J:
        break
    else:
        preJ = J

#we got theta
print(theta)
#plot the line that separates the data points
plot_x = [np.ndarray.min(x_train[1:]), np.ndarray.max(x_train[1:])]
plot_y = np.subtract(np.multiply(-(theta[2][0]/theta[1][0]), plot_x), theta[0][0]/theta[1][0])
plt.plot(plot_x, plot_y, 'b-')
plt.show()
```
3.2. Newton method
In order to find the θ that minimizes the cost function, we need θ such that $\frac{\partial }{\partial \theta _{j}}J(\theta ) = 0$.
The Newton method finds a value of x such that f(x) = 0 by repeating the update:
$x _{t+1} = x _{t} - \frac{f(x_{t})}{f^{'}(x_{t})}$
This algorithm approximates the function via the linear function that is tangent to it at the current guess, solves for where that linear function equals zero, and uses that zero as the next guess.
Figure: How it works (source)
Applying this to our problem (solving $J^{'}(\theta )=0$), we have:
$\theta = \theta - \frac{J^{'}(\theta )}{J^{''}(\theta )}$
Because θ is a vector, using the gradient and the Hessian we have:
$\theta = \theta - H^{-1}\bigtriangledown _{\theta }J(\theta )\\ \text{where } H_{ij} = \frac{\partial^{2} J(\theta )}{\partial \theta _{i}\partial \theta _{j}};\ H\text{ is called the Hessian matrix and }H\in \mathbb{R}^{(n+1)\times (n+1)}\\ \bigtriangledown _{\theta }J(\theta )=\begin{bmatrix} \frac{\partial J(\theta )}{\partial \theta _{0}}\\ \frac{\partial J(\theta )}{\partial \theta _{1}}\\ \vdots\\ \frac{\partial J(\theta )}{\partial \theta _{n}} \end{bmatrix}$
Finally, we have:
$\bigtriangledown _{\theta }J(\theta )=\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x^{(i)}\\ H = \frac{1}{m}\sum_{i=1}^{m}\left [ h(x^{(i)})(1-h(x^{(i)}))x^{(i)}(x^{(i)})^{T} \right ]$
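Putting the two formulas together, a minimal Newton loop looks like this (a sketch on a small made-up, non-separable data set, not the demo's data):

```python
import numpy as np
from numpy.linalg import inv

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# X is (n+1) x m with x0 = 1; y is 1 x m (made-up labels with class overlap)
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [-2.0, -1.0, 1.0, 2.0]])
y = np.array([[0.0, 1.0, 0.0, 1.0]])
m = y.shape[1]
theta = np.zeros((2, 1))

for _ in range(10):
    h = sigmoid(theta.T.dot(X))       # 1 x m predictions
    dJ = X.dot((h - y).T)/m           # gradient, (n+1) x 1
    H = (h*(1 - h)*X).dot(X.T)/m      # Hessian, (n+1) x (n+1)
    theta = theta - inv(H).dot(dJ)    # Newton update

print(np.allclose(H, H.T))         # True: the Hessian is symmetric
print(np.all(np.isfinite(theta)))  # True: the iterates stay finite here
```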
Newton's method converges faster than (batch) gradient descent and requires far fewer iterations to get very close to the minimum. However, one iteration of Newton's method is more expensive than one iteration of gradient descent, since it requires computing and inverting the Hessian. If the number of features is not too large, it is usually much faster overall. When Newton's method is applied to logistic regression, the resulting method is also called Fisher scoring.
Let's make a demo that reuses the data set of the example above.
Running this algorithm, we see that the minimum is reached after only a few iterations.
```python
import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import inv

init = False
file = open('logistic_x.txt', 'rb')
for row in file:
    r = row.decode('utf8').strip().split(' ')
    if init == False:
        x_train = np.array([[1], [float(r[0])], [float(r[len(r)-1])]])
        init = True
    else:
        x_train = np.append(x_train, [[1], [float(r[0])], [float(r[len(r)-1])]], axis=1)

init = False
file = open('logistic_y.txt', 'rb')
for row in file:
    if init == False:
        y_train = np.array([[float(row.strip())]])
        init = True
    else:
        y_train = np.append(y_train, [[float(row.strip())]], axis=1)

m = y_train.shape[1]
theta = np.zeros((x_train.shape[0], 1))

def sigmoid(z):
    return 1/(1+np.exp(-z))

pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
plt.plot(x_train[1, pos], x_train[2, pos], 'ro')
plt.plot(x_train[1, neg], x_train[2, neg], 'bo')

#iterate 10 Newton steps
for x in range(0, 10):
    h = sigmoid(theta.T.dot(x_train))
    error = h - y_train
    tmp = (-1)*y_train*np.log(h) - (1-y_train)*np.log(1-h)
    J = np.sum(tmp)/m
    #calculate H
    H = (h*(1-h)*x_train).dot(x_train.T)/m
    #calculate dJ
    dJ = np.sum(error*x_train, axis=1)/m
    #Newton step: H^-1 . dJ
    grad = inv(H).dot(dJ)
    #update theta
    theta = theta - np.array([grad]).T
    print(J)

print(theta)
plot_x = [np.ndarray.min(x_train[1:]), np.ndarray.max(x_train[1:])]
plot_y = np.subtract(np.multiply(-(theta[2][0]/theta[1][0]), plot_x), theta[0][0]/theta[1][0])
plt.plot(plot_x, plot_y, 'b-')
plt.show()
```
Figure: logistic regression using Newton method
Comparing the two methods:
Gradient descent:
- Need to choose learning rate
- Need more iterations
- Each iteration is cheaper
- Use when n is large
Newton method:
- No learning rate to choose
- Need fewer iterations
- Each iteration is more expensive: we need to compute $H$ and $H^{-1}$, where $H\in \mathbb{R}^{(n+1)\times (n+1)}$
- Use when n is small
If n < 1000: use Newton's method
If 1000 < n < 10,000: try both methods and choose the better one
4. Regularization
In order to avoid overfitting, we use regularization.
Figure: overfitting - the black curve fits the data set perfectly using a complex h(x)
Using regularization we adjust the old cost function and update rules:
$J(\theta ) = \left [ \sum_{i=1}^{m}(-y^{(i)}\log(h(x^{(i)})) - (1-y^{(i)})\log(1-h(x^{(i)}))) \right ] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta _{j}^{2}\\ \text{Stochastic gradient descent rules:}\\ \theta_{0} = \theta_{0} - \alpha(h(x^{(i)})-y^{(i)})x_{0}^{(i)}\\ \theta_{j} = \theta_{j} - \alpha\left((h(x^{(i)})-y^{(i)})x_{j}^{(i)} + \frac{\lambda }{m}\theta _{j}\right)\\ \text{Batch gradient descent rules:}\\ \theta_{0} = \theta_{0} - \frac{\alpha}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x_{0}^{(i)}\\ \theta_{j} = \theta_{j} - \alpha\left(\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x_{j}^{(i)} + \frac{\lambda }{m}\theta _{j}\right)$
Newton method:
$\bigtriangledown _{\theta }J(\theta )=\begin{bmatrix} \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x_{0}^{(i)}\\ \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x_{1}^{(i)} + \frac{\lambda }{m}\theta _{1}\\ \vdots\\ \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x_{n}^{(i)} + \frac{\lambda }{m}\theta _{n} \end{bmatrix} \in \mathbb{R}^{n+1}\\ H = \frac{1}{m}\sum_{i=1}^{m}\left [ h(x^{(i)})(1-h(x^{(i)}))x^{(i)}(x^{(i)})^{T} \right ] + \frac{\lambda }{m}\begin{bmatrix} 0 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{(n+1)\times (n+1)}$
Let's make a demo for the Newton method:

```python
import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import inv

#we read data from files line by line
init = False
file = open('demo2x.dat', 'rb')
for row in file:
    r = row.decode('utf8').strip().split(',')
    if init == False:
        x1 = np.array([float(r[0])])
        x2 = np.array([float(r[1])])
        init = True
    else:
        x1 = np.append(x1, [float(r[0])])
        x2 = np.append(x2, [float(r[1])])

init = False
file = open('demo2y.dat', 'rb')
for row in file:
    if init == False:
        y_train = np.array([[float(row.strip())]])
        init = True
    else:
        y_train = np.append(y_train, [[float(row.strip())]], axis=1)

def sigmoid(z):
    return 1/(1+np.exp(-z))

#map (x1, x2) to all monomial features up to degree 6
def create_feature(x1, x2):
    x_train = np.array([np.ones(len(x1))])
    for total in range(1, 7):
        for p in range(total, -1, -1):
            x_train = np.append(x_train, [np.power(x1, p)*np.power(x2, total-p)], axis=0)
    return x_train

x_train = create_feature(x1, x2)
#number of training examples
m = y_train.shape[1]
#init theta
theta = np.zeros((x_train.shape[0], 1))

pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
plt.plot(x_train[1, pos], x_train[2, pos], 'ro')
plt.plot(x_train[1, neg], x_train[2, neg], 'bo')

#modify lamda to see different results
lamda = 1
one = np.identity(x_train.shape[0])
one[0, 0] = 0  #do not regularize theta0

for x in range(0, 15):
    h = sigmoid(theta.T.dot(x_train))
    error = h - y_train
    tmp = (-1)*y_train*np.log(h) - (1-y_train)*np.log(1-h)
    J = np.sum(tmp)/m + lamda*theta[1:, 0].T.dot(theta[1:, 0])/(2*m)
    #calculate H with regularization (from index 1)
    H = (h*(1-h)*x_train).dot(x_train.T)/m + lamda*one/m
    #calculate dJ with regularization (from index 1)
    dJ = np.sum(error*x_train, axis=1)/m
    dJ[1:] = dJ[1:] + lamda*theta[1:, 0]/m
    #Newton step: H^-1 . dJ
    grad = inv(H).dot(dJ)
    #update theta
    theta = theta - np.array([grad]).T
    print(J)

print(theta)
#plot the decision boundary
u = np.linspace(-1, 1.5, 200)
v = np.linspace(-1, 1.5, 200)
#create input grid
uu, vv = np.meshgrid(u, v, sparse=False)
#output contour
plotz = np.zeros(uu.shape)
#evaluate z over the grid
for i in range(0, len(uu)):
    z = theta.T.dot(create_feature(uu[i], vv[i]))
    for j in range(0, z.shape[1]):
        plotz[i][j] = z[0][j]
#plot contour
plt.contourf(uu, vv, plotz, linestyles='dashed')
plt.show()
```
Figure: no regularization, the hypothesis fits the training set perfectly (overfitting)
Figure: with regularization and lambda=1, overfitting is eliminated

# Note 2: Linear regression - Python demos

1. Introduction
2. Linear regression
3. Optimize linear regression cost function
3.2 Normal Equations
4. Polynomial regression
5. Feature scaling
6. Underfitting - Overfitting
6.1 Model selection
6.2 Regularization

1. Introduction
This note focuses on supervised learning, using the linear regression model and the gradient descent algorithm to build a linear regression model.
Note: Python, NumPy, basic linear algebra and matplotlib are required. Please refer to this post and this post.
Notation:
Supervised learning process
$\{(x^{(i)}, y^{(i)});\ i=1,\dots,m\}\\ x^{(i)}: \text{input variables or features}\\ y^{(i)}: \text{output or target variables}\\ (i): \text{training example } (i) \text{ of } m \text{ training examples}\\ X: \text{space of input variables}\\ Y: \text{space of output variables}\\ h(x): \text{hypothesis that takes input } x \text{ and predicts output } y$
2. Linear regression
In linear regression, we use h(x) as a linear function of x to predict y. So h(x) has the form:
$h(x) = \theta _{0}x_{0} + \theta _{1}x_{1} + \theta _{2}x_{2}\\ \theta _{i}: \text{parameters or weights}\\ \text{choosing } x_{0} = 1 \Rightarrow h(x) = \theta _{0} + \theta _{1}x_{1} + \theta _{2}x_{2}\\ \text{or } h(x) = \sum_{i=0}^{n}\theta _{i}x_{i} = \theta ^{T}x\\ \theta ^{T}: \text{the transpose of the vector }\theta\\ \theta = \begin{bmatrix} \theta _{0}\\ \theta _{1}\\ \vdots\\ \theta _{n} \end{bmatrix};\ x = \begin{bmatrix} x_{0}\\ x_{1}\\ \vdots\\ x_{n} \end{bmatrix}$
The learning process is to find/learn the parameters θi that make h(x) close to y for the training examples. In order to measure how close each h(x(i)) is to the corresponding y(i) for a given θ, we define the cost function:
$J(\theta ) = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^{2}$
So the learning process is to find/learn θi to minimize J(θ).
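For instance (a small made-up example), with data generated by $y = x + 3$ the cost is exactly zero at $\theta = (3, 1)$ and positive elsewhere:

```python
import numpy as np

def J(theta, X, y):
    # X: 2 x m with row 0 all ones; theta: length-2 vector
    error = theta.dot(X) - y
    return float(error.dot(error))/(2*len(y))

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.array([np.ones(len(x)), x])
y = x + 3  # generated with theta0 = 3, theta1 = 1

print(J(np.array([3.0, 1.0]), X, y))  # 0.0 at the true parameters
print(J(np.array([0.0, 0.0]), X, y))  # 10.75, larger away from them
```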
3. Optimize linear regression cost function
3.1 Gradient descent
Gradient descent is one of the algorithms that finds/learns the θi that minimize J(θ). It starts with some initial θ and repeatedly performs the update with the LMS (least mean squares) or Widrow-Hoff learning rule:
$\theta _{j} = \theta _{j} - \frac{\alpha}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})x_{j}^{(i)} \mathbf{\ (1)}\\ \text{Repeat until convergence \{}\\ \text{for every } j:\ \theta _{j} = \theta _{j} - \frac{\alpha}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})x_{j}^{(i)} \mathbf{\ (2)}\\ \text{\}}\\ \text{Repeat until convergence \{}\\ \text{for } i=1 \text{ to } m \text{ \{}\\ \text{for every } j:\ \theta _{j} = \theta _{j} - \alpha (h(x^{(i)}) - y^{(i)})x_{j}^{(i)} \mathbf{\ (3)}\\ \text{\}}\\ \text{\}}$
Formula (1): the partial derivative gives the direction in which to change θ so as to decrease J(θ).
Formula (2): on every step we update θj based on the entire training set, so it is called batch gradient descent.
Formula (3): on every step we update θj based on a single training example. This way we do not need to compute a sum over all training examples, which costs memory and time. It is called stochastic gradient descent or incremental gradient descent. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent. However, it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); in practice, most of the values near the minimum are reasonably good approximations to the true minimum.
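The two update schedules can be sketched side by side (the data, learning rate, and iteration counts here are made up for illustration):

```python
import numpy as np

# toy data: y = x + 3, hypothesis h = theta0 + theta1*x
x = np.arange(0, 10, 0.5)
X = np.array([np.ones(len(x)), x])   # 2 x m, row 0 is x0 = 1
y = x + 3
m = len(x)
alpha = 0.01

# batch gradient descent: one update uses the whole training set
theta_b = np.zeros(2)
for _ in range(5000):
    error = theta_b.dot(X) - y
    theta_b = theta_b - alpha*X.dot(error)/m

# stochastic gradient descent: one update per training example
theta_s = np.zeros(2)
for _ in range(200):
    for i in range(m):
        error = theta_s.dot(X[:, i]) - y[i]
        theta_s = theta_s - alpha*error*X[:, i]

print(theta_b, theta_s)  # both approach [3, 1]
```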
- J is a convex quadratic function, so the optimization problem has only one global minimum and gradient descent always converges.
- The learning rate determines how fast or slow the optimizer will move towards the optimal weights. You can try it here (it is λ).

Try choosing α from the range [..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...] and pick the largest value for which the algorithm still converges, or one slightly smaller.
So it is time to practise. We will make a simple demo for this algorithm. Our input has only 1 feature and 1 output. The input is a list x = [0, 0.5, 1, 1.5, ..., 9.5, 10] and the output is a list y = [3, 3.5, 4, 4.5, ..., 12.5, 13] (the output is generated by the formula y = x + 3). We apply the linear regression model (y = θ0x0 + θ1x1) and gradient descent. Since we generated the training set using y = x + 3, after training the result should approximate θ0 = 3 and θ1 = 1.
Note: if your PC is not powerful, disable the update_line() function, because plotting J during training slows the algorithm down.
Here is the Python source code using batch gradient descent:
```python
import matplotlib.pyplot as plt
import numpy as np

plt.figure(1)
ax = plt.gca()
ax.set_autoscale_on(True)
plt.ylabel('error')
plt.xlabel('step')
px = [0]
py = [0]
g, = plt.plot(px, py)

def update_line(g, x, y):
    global px
    global py
    if len(px) == 60:
        px = []
        py = []
    if len(px) == 30:
        plt.text(x, y, str(y))
    px = np.append(px, x)
    py = np.append(py, y)
    g.set_xdata(px)
    g.set_ydata(py)
    ax.relim()
    ax.autoscale_view(True, True, True)
    plt.draw()
    plt.pause(0.001)

#generate training set
x = np.arange(0, 10, 0.5)  #input
m = len(x)
y = x + 3  #output
#convert to vectors
x_train = np.array([np.ones(len(x)), x])
y_train = y[:, np.newaxis]
#h = theta0*x0 + theta1*x1 (x0=1)
theta = np.zeros((2, 1))
#iterate 500 steps
for x in range(0, 500):
    h = theta.T.dot(x_train)      #h = thetaT*x
    error = h.T - y_train         #error = h_predict - y_target
    J = error.T.dot(error)/(2*m)  #cost function J
    #update theta using batch gradient descent
    theta = theta - 0.06*x_train.dot(error)/m
    #plot J
    update_line(g, x, J)

#finish training and print theta
print(theta)
plt.show()
```
Figure: batch gradient descent θ0 = 2.999 and θ1 = 1.0001
Here is the source code using stochastic gradient descent:
```python
import matplotlib.pyplot as plt
import numpy as np

plt.figure(1)
ax = plt.gca()
ax.set_autoscale_on(True)
plt.ylabel('error')
plt.xlabel('step')
px = [0]
py = [0]
g, = plt.plot(px, py)

def update_line(g, x, y):
    global px
    global py
    if len(px) == 60:
        px = []
        py = []
    if len(px) == 30:
        plt.text(x, y, str(y))
    px = np.append(px, x)
    py = np.append(py, y)
    g.set_xdata(px)
    g.set_ydata(py)
    ax.relim()
    ax.autoscale_view(True, True, True)
    plt.draw()
    plt.pause(0.0001)

#x values in range [0, 10) step 0.5
x = np.arange(0, 10, 0.5)
m = len(x)
#generate y values: y = x + 3
y = x + 3
#convert to vectors
x_train = np.array([np.ones(len(x)), x])
y_train = y[:, np.newaxis]
#h = theta0*x0 + theta1*x1 (x0=1)
theta = np.zeros((2, 1))
x = 0
xT = x_train.T
while True:
    J = 0
    #scan through the training set
    for i in range(0, m):
        #for each training example
        x = x + 1
        #calculate h_predicted
        h = xT[i].dot(theta)
        #calculate error = h_predicted - y_target
        error = h - y_train[i]
        #accumulate error into J
        J = J + error*error
        #update theta for one training example
        theta = theta - 0.0001*(error*xT[i])[:, np.newaxis]
    J = J/m
    #plot J
    update_line(g, x, J)
    print(J)
    if abs(J) < 0.0001:
        break

print(theta)
plt.show()
```
Figure: stochastic gradient descent, θ0 = 2.9996 and θ1 = 1.00005; it converges faster than batch gradient descent
3.2 Normal Equations
Instead of using gradient descent, we can find the optimum of the cost function directly by solving the equation:
$\frac{\partial J(\theta )}{\partial \theta } = 0\\ \text{Solving it, we get the normal equation: }\theta = (X^{T}X)^{-1}X^{T}Y$
$\text{For training example } (i) \text{ with } n \text{ features:}\\ x^{(i)} = \begin{bmatrix} x_{0}\\ x_{1}\\ \vdots\\ x_{n} \end{bmatrix} \text{, then } X = \begin{bmatrix} (x^{(1)})^{T}\\ (x^{(2)})^{T}\\ \vdots\\ (x^{(m)})^{T} \end{bmatrix} \in \mathbb{R}^{m\times (n+1)}\\ \text{and } Y = \begin{bmatrix} y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)} \end{bmatrix}$
In case the matrix $X^{T}X$ is not invertible (singular):
+ check and remove redundant features (linearly dependent features), e.g. if we have feature $X$ and also feature $2X$, we only need one of them;
+ if the training set size m < the number of features n, consider removing features;
+ consider using the (Moore-Penrose) pseudo-inverse of the matrix.
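For the pseudo-inverse option, NumPy's `np.linalg.pinv` handles a singular $X^{T}X$ directly (a sketch with a deliberately redundant feature):

```python
import numpy as np

# X has a redundant column (x and 2x), so X^T X is singular
x = np.arange(0, 10, 0.5)
X = np.array([np.ones(len(x)), x, 2*x]).T
y = (x + 3)[:, np.newaxis]

# inv(X.T.dot(X)) would fail here, but the pseudo-inverse still works
theta = np.linalg.pinv(X).dot(y)
print(np.allclose(X.dot(theta), y))  # True: the predictions still match y
```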
Using the normal equation, we do not need to care about feature scaling (discussed later).
Let's make a demo using the normal equation. We reuse the demo in 3.1; the result should be theta0 = 3 and theta1 = 1.
```python
import numpy as np
#for the inverse matrix
from numpy.linalg import inv

x = np.arange(0, 10, 0.5)  #input
y = x + 3                  #output
#form the matrix X and vector Y
X = np.array([np.ones(len(x)), x]).T
Y = y[:, np.newaxis]
#apply the normal equation
theta = inv(X.T.dot(X)).dot(X.T).dot(Y)
print(theta)  #returns theta0=3 and theta1=1
```
One step to the target
4. Polynomial regression
Suppose that we have training set below:
Figure: training set example
Looking at the graph, we see that this training set cannot be fitted using linear regression. We need a 'curve', i.e. a hypothesis that is a polynomial function. For the plot above, it seems we can fit it using a quadratic function, so the hypothesis has the form $h(x) = \theta _{0} + \theta _{1}x_{1} + \theta _{2}x_{1}^{2}$. In order to reuse the gradient descent algorithm from linear regression, we create a new feature $x_{2} = x_{1}^{2}$. Then h(x) becomes $h(x) = \theta _{0} + \theta _{1}x_{1} + \theta _{2}x_{2}$ and we can apply gradient descent just as we did for linear regression.
Let's make a demo with input x = [0, 0.5, 1, 1.5, ..., 9, 9.5] and output y = [25, 20.25, 16, 12.25, 9, ..., 16, 20.25] (the output is generated by the formula $y = x^{2} - 10x + 25$). Since we generated the training set using $y = x^{2} - 10x + 25$, after training the result should approximate $\theta _{0} = 25$, $\theta _{1} = -10$, $\theta _{2} = 1$.
Here is the Python code using batch gradient descent:
```python
import matplotlib.pyplot as plt
import numpy as np

plt.figure(1)
ax = plt.gca()
ax.set_autoscale_on(True)
plt.ylabel('error')
plt.xlabel('step')
px = [0]
py = [0]
g, = plt.plot(px, py)

def update_line(g, x, y):
    global px
    global py
    if len(px) == 60:
        px = []
        py = []
    if len(px) == 30:
        plt.text(x, y, str(y))
    px = np.append(px, x)
    py = np.append(py, y)
    g.set_xdata(px)
    g.set_ydata(py)
    ax.relim()
    ax.autoscale_view(True, True, True)
    plt.draw()
    #plt.pause(0.001)

#x values in range [0, 10) step 0.5
x = np.arange(0, 10, 0.5)
print(x)
m = len(x)
#generate y values: y = x^2 - 10x + 25
y = np.power(x-5, 2)
print(y)
#convert to vectors
x_train = np.array([np.ones(len(x)), x, np.power(x, 2)])
print(x_train.shape)
y_train = y[:, np.newaxis]
#h = theta0*x0 + theta1*x1 + theta2*x2 (x0=1 and x2=x1^2)
theta = np.zeros((3, 1))
x = 0
while True:
    x = x + 1
    h = theta.T.dot(x_train)
    error = h.T - y_train
    J = error.T.dot(error)/(2*m)
    print(J)
    if J < 0.0001:
        break
    theta = theta - 0.001*x_train.dot(error)/m
    update_line(g, x, J)

print(theta)
plt.show()
```
Figure: polynomial regression with θ0 = 24.9, θ1 = -9.9, θ2 = 0.9
5. Feature scaling
Suppose that we have 2 features:
x1 = area of the house = 2000 m²
x2 = number of bedrooms = 5
When applying gradient descent to the cost function J(θ), it will look like below:
Figure: Before scaling
J(θ) will converge slowly, oscillating back and forth. In order to speed up convergence we need to scale the input variables so that they are all roughly in the same range. Ideally: -1 ≤ xi ≤ 1
After scaling J(θ) will look like this:
Figure: After scaling
We apply feature scaling and mean normalization techniques.
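These two techniques amount to subtracting the per-feature mean and dividing by a per-feature scale, e.g. (a NumPy sketch with made-up house data):

```python
import numpy as np

# two features on very different scales: area (m^2) and number of bedrooms
X = np.array([[2000.0, 1500.0, 2500.0, 1800.0],
              [5.0,    3.0,    4.0,    2.0]])  # one column per example

mu = X.mean(axis=1, keepdims=True)  # per-feature mean (mean normalization)
s = X.std(axis=1, keepdims=True)    # or use max - min as the range
X_scaled = (X - mu)/s

print(np.allclose(X_scaled.mean(axis=1), 0))  # True: zero mean per feature
print(np.allclose(X_scaled.std(axis=1), 1))   # True: unit scale per feature
```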
$x_{j} = \frac{x_{j} - \mu _{j}}{s_{j}}\\ \mu _{j}=\frac{1}{m}\sum_{i=1}^{m}x_{j}^{(i)}: \text{average of all the values for feature } (j)\\ s_{j}: \text{the range of values (max-min) or the standard deviation}$

6. Underfitting - Overfitting

From wiki: Overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. An overfitted model is a statistical model that contains more parameters than can be justified by the data. Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing.

Let's take an example with a given data set and fitting functions like below:

Figure: Overfitting and underfitting

The left figure shows the result of fitting $y = \theta _{0} + \theta _{1}x$ to the data set. The middle figure shows the result of fitting $y = \theta _{0} + \theta _{1}x + \theta_{2}x^{2}$. The right figure shows the result of fitting $y = \theta _{0} + \theta _{1}x + \theta_{2}x^{2} + \theta _{3}x^{3} + \theta _{4}x^{4}$.

Comparing these 3 figures: the left figure is a straight line and not a good fit; it cannot adequately capture the underlying structure of the data because our data set looks like a curve. In the middle figure we added one more feature $x^{2}$ and it looks better than the first. In the right figure we added 2 more features $x^{3}, x^{4}$ and it fits the data perfectly, but it is not really good because it may fail to fit additional data or predict future observations reliably.

In order to solve the overfitting issue, there are 2 ways:
- Reduce the number of features: keep only the necessary features or use a model selection algorithm.
- Use regularization: reduce the magnitude of the parameters θj by penalizing them.
Let's make demos for underfitting, just-right fitting and overfitting. In order to speed things up, I will use scikit-learn. If your machine is strong enough, you can use the (unoptimized) gradient descent above instead of scikit-learn. The training set is generated from half of a parabola, with noise added by a random function.

Underfitting demo

We use a straight line to fit the data set, and the prediction error is quite large.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# x values in range [0, 5) with step 0.5
x = np.arange(0, 5, 0.5)
# generate y values from (x-5)^2 = x^2 - 10x + 25 with noise in {0, 1}
# we just use the left half of the parabola
y = np.power(x - 5, 2) + np.random.randint(2, size=x.size)
# use the polynomial theta0 + theta1*x
x_train = np.array([np.ones(len(x)), x]).T
y_train = y[:, np.newaxis]

# generate test data
xx = np.arange(-3, 0, 0.5)
yy = np.power(xx - 5, 2) + np.random.randint(3, size=xx.size)
x_test = np.array([np.ones(len(xx)), xx]).T
y_test = yy[:, np.newaxis]

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training set
regr.fit(x_train, y_train)

# Plot the training result
y_pred1 = regr.predict(x_train)
plt.subplot(121)
plt.title('training result')
plt.scatter(x, y_train, color='black')
plt.plot(x, y_pred1, color='blue', linewidth=3)

# Make predictions using the test set
y_pred2 = regr.predict(x_test)
plt.subplot(122)
plt.title('predicting result')
plt.scatter(xx, y_test, color='black')
plt.plot(xx, y_pred2, color='blue', linewidth=3)

# The coefficients
print('Coefficients: ', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred2))
plt.show()
```

Figure: Underfitting

Just right fitting demo

We use a parabola to fit the data set.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# x values in range [0, 5) with step 0.5
x = np.arange(0, 5, 0.5)
# generate y values from (x-5)^2 = x^2 - 10x + 25 with noise in {0, 1}
# we just use the left half of the parabola
y = np.power(x - 5, 2) + np.random.randint(2, size=x.size)
# use the polynomial theta0 + theta1*x + theta2*x^2
x_train = np.array([np.ones(len(x)), x, np.power(x, 2)]).T
y_train = y[:, np.newaxis]

# generate test data
xx = np.arange(-3, 0, 0.5)
yy = np.power(xx - 5, 2) + np.random.randint(3, size=xx.size)
x_test = np.array([np.ones(len(xx)), xx, np.power(xx, 2)]).T
y_test = yy[:, np.newaxis]

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training set
regr.fit(x_train, y_train)

# Plot the training result
y_pred1 = regr.predict(x_train)
plt.subplot(121)
plt.title('training result')
plt.scatter(x, y_train, color='black')
plt.plot(x, y_pred1, color='blue', linewidth=3)

# Make predictions using the test set
y_pred2 = regr.predict(x_test)
plt.subplot(122)
plt.title('predicting result')
plt.scatter(xx, y_test, color='black')
plt.plot(xx, y_pred2, color='blue', linewidth=3)

# The coefficients
print('Coefficients: ', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred2))
plt.show()
```

Figure: just right fitting demo

Overfitting demo

We use the polynomial $y = \theta _{0} + \theta _{1}x + \theta _{2}x^{2} + \theta _{3}x^{3} + \theta _{4}x^{4} + \theta_{5}x^{5} + \theta_{6}x^{6} + \theta_{7}x^{7} + \theta_{8}x^{8}$ to fit the data set, and the prediction error is very large.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# x values in range [0, 5) with step 0.5
x = np.arange(0, 5, 0.5)
# generate y values from (x-5)^2 = x^2 - 10x + 25 with noise in {0, 1}
# we just use the left half of the parabola
y = np.power(x - 5, 2) + np.random.randint(2, size=x.size)
# use the polynomial theta0 + theta1*x + ... + theta8*x^8
x_train = np.array([np.ones(len(x)), x, np.power(x, 2), np.power(x, 3),
                    np.power(x, 4), np.power(x, 5), np.power(x, 6),
                    np.power(x, 7), np.power(x, 8)]).T
y_train = y[:, np.newaxis]

# generate test data
xx = np.arange(-3, 0, 0.5)
yy = np.power(xx - 5, 2) + np.random.randint(3, size=xx.size)
x_test = np.array([np.ones(len(xx)), xx, np.power(xx, 2), np.power(xx, 3),
                   np.power(xx, 4), np.power(xx, 5), np.power(xx, 6),
                   np.power(xx, 7), np.power(xx, 8)]).T
y_test = yy[:, np.newaxis]

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training set
regr.fit(x_train, y_train)

# Plot the training result
y_pred1 = regr.predict(x_train)
plt.subplot(121)
plt.title('training result')
plt.scatter(x, y_train, color='black')
plt.plot(x, y_pred1, color='blue', linewidth=3)

# Make predictions using the test set
y_pred2 = regr.predict(x_test)
plt.subplot(122)
plt.title('predicting result')
plt.scatter(xx, y_test, color='black')
plt.plot(xx, y_pred2, color='blue', linewidth=3)

# The coefficients
print('Coefficients: ', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred2))
plt.show()
```
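To see the same effect numerically, here is a sketch of my own (not from the original post; it uses `np.polyfit` instead of scikit-learn, and a fixed random seed is my assumption for reproducibility) that fits degrees 1, 2 and 8 to the same half-parabola data and compares training and test mean squared errors:

```python
import numpy as np

np.random.seed(0)  # fixed seed (an assumption, so the run is reproducible)

# same data generation as the demos: left half of (x-5)^2 plus small integer noise
x = np.arange(0, 5, 0.5)
y = np.power(x - 5, 2) + np.random.randint(2, size=x.size)
xx = np.arange(-3, 0, 0.5)
yy = np.power(xx - 5, 2) + np.random.randint(3, size=xx.size)

mse = lambda a, b: np.mean((a - b) ** 2)
train_err, test_err = {}, {}
for deg in (1, 2, 8):
    coef = np.polyfit(x, y, deg)  # least-squares polynomial fit of the given degree
    train_err[deg] = mse(np.polyval(coef, x), y)
    test_err[deg] = mse(np.polyval(coef, xx), yy)

# degree 1 underfits (large training error); degree 8 drives the training error
# down but extrapolates badly to the test range (overfitting)
print(train_err)
print(test_err)
```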
Figure: Overfitting demo

6.1 Model selection

Let's reuse the example above. How do we find the 'best' polynomial degree for our hypothesis? Follow this method:
- Shuffle and break down the data set into three sets:
+ Training set: 60%
+ Cross validation set: 20%
+ Test set: 20%
- For each polynomial degree, calculate three separate error values for the three different sets using the following method:
+ Optimize the cost function using the training set.
+ Use the cross validation set to choose the polynomial degree d with the least error: $J_{cross}(\theta ) = \frac{1}{2m_{cross}}\sum_{i=1}^{m_{cross}}(h(x_{cross}^{(i)}) - y_{cross}^{(i)})^{2}$
+ For the chosen polynomial degree d, estimate the generalization error using the test set with: $J_{test}(\theta ) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h(x_{test}^{(i)}) - y_{test}^{(i)})^{2}$

This way, the polynomial degree d has not been trained using the test set.

6.2 Regularization

Let's take the overfitting example above and use regularization to reduce the magnitude of the parameters $\theta _{j}$ by penalizing them. This means we will make $\theta _{3} \approx 0, \theta _{4} \approx 0$, so the hypothesis becomes similar to $y = \theta _{0} + \theta _{1}x + \theta_{2}x^{2}$. The new cost function is:

$J(\theta ) = \frac{1}{2m}\left [ \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^{2} + \lambda \sum_{j=1}^{n}\theta _{j}^{2} \right ]\\
\lambda : \text{ regularization parameter}\\
\text{if }\lambda \text{ is very large, }\theta _{j} \approx 0\ (j=\overline{1,n})\text{ and }h(x) \approx \theta _{0}\text{, which results in underfitting}\\
\sum_{j=1}^{n}\theta _{j}^{2} = \theta ^{T}\theta = \left \| \theta  \right \|_{2}^{2}\ (\text{with }\theta _{0}\text{ excluded}):\ l_{2}\text{ or 2-norm regularization}$

When the training set size m increases, the effect of the regularization part $\frac{\lambda}{2m} \sum_{j=1}^{n}\theta _{j}^{2}$ decreases. In order to optimize the regularized cost function we use the gradient descent algorithm with a slight difference:

$\text{The algorithm repeats for every j until convergence \{}\\
\text{ }\theta _{0} = \theta _{0} - \alpha \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})x_{0}^{(i)}\\
\text{ }\theta _{j} = \theta _{j} - \alpha \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})x_{j}^{(i)} - \alpha \frac{\lambda }{m}\theta _{j}\\
\text{\}}$

If we apply regularization, the normal equation becomes:

$\theta = (X^{T}X + \lambda \begin{bmatrix}
0 & 0 & 0 & \cdots & 0\\
0 & 1 & 0 & \cdots & 0\\
0 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{bmatrix})^{-1}X^{T}Y\\
\text{where }\lambda \begin{bmatrix}
0 & 0 & 0 & \cdots & 0\\
0 & 1 & 0 & \cdots & 0\\
0 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{bmatrix} \in \mathbb{R}^{(n+1)\times (n+1)}\\
\text{and }X^{T}X + \lambda \begin{bmatrix}
0 & 0 & 0 & \cdots & 0\\
0 & 1 & 0 & \cdots & 0\\
0 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 1
\end{bmatrix} \text{ is invertible}$

Because this term is invertible, when applying regularization we should use the normal equation for fast calculation.

Let's make a demo using a simple training set. Try adjusting lamda to see how the parameters change. It is not easy to choose the alpha and lamda parameters for gradient descent. You can check the error difference between the 2 methods: if lamda is small the difference is small, but if lamda is large the difference is large, and the overall error is large for both methods.

Using the normal equation:

```python
import numpy as np
# for the (pseudo-)inverse matrix
from numpy.linalg import pinv
import matplotlib.pyplot as plt

x = np.array([-0.99768, -0.69574, -0.40373, -0.10236, 0.22024, 0.47742, 0.82229])
m = len(x)
y = np.array([2.0885, 1.1646, 0.3287, 0.46013, 0.44808, 0.10013, -0.32952])

# form matrices X and Y for a degree-5 polynomial
X = np.array([np.ones(len(x)), x, np.power(x, 2), np.power(x, 3),
              np.power(x, 4), np.power(x, 5)]).T
Y = y[:, np.newaxis]
# identity matrix with the (0, 0) entry zeroed so theta0 is not penalized
one = np.identity(X.shape[1])
one[0, 0] = 0
lamda = 2
# apply the regularized normal equation
theta = pinv(X.T.dot(X) + lamda*one).dot(X.T).dot(Y)
print(theta)

y_pred = theta.T.dot(X.T)
plt.scatter(x, y, color='black')
plt.plot(x, y_pred[0, :], color='blue', linewidth=3)
plt.show()
```

Using gradient descent:

```python
# we use the polynomial h(x) = theta0 + theta1*x + ... + theta5*x^5
import matplotlib.pyplot as plt
import numpy as np

x = np.array([-0.99768, -0.69574, -0.40373, -0.10236, 0.22024, 0.47742, 0.82229])
m = len(x)
y = np.array([2.0885, 1.1646, 0.3287, 0.46013, 0.44808, 0.10013, -0.32952])
plt.scatter(x, y, marker=r'$\clubsuit$')

# convert to matrices: each row of x_train is one polynomial feature
x_train = np.array([np.ones(len(x)), x, np.power(x, 2), np.power(x, 3),
                    np.power(x, 4), np.power(x, 5)])
y_train = y[:, np.newaxis]

theta = np.zeros((x_train.shape[0], 1))
alpha = 0.001
lamda = 1
preJ = 0
while True:
    h = theta.T.dot(x_train)
    error = h.T - y_train
    # regularized cost; theta0 is not penalized
    J = (error.T.dot(error) + lamda*theta[1:, 0].T.dot(theta[1:, 0]))/(2*m)
    print(J)
    if preJ == 0:
        preJ = J
    elif preJ < J:
        # stop once the cost no longer decreases
        break
    else:
        preJ = J
    tmp = alpha*x_train.dot(error)/m
    # the regularization term applies to every theta_j except theta0
    theta[0, 0] = theta[0, 0] - tmp[0, 0]
    theta[1:, 0] = theta[1:, 0] - tmp[1:, 0] - alpha*lamda*theta[1:, 0]/m

print(theta)
y_pred = theta.T.dot(x_train)
plt.plot(x, y_pred.T)
plt.show()
```
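As a sanity check on the regularized normal equation (this snippet is my addition, reusing the same training data and feature matrix as the demos; the two lambda values are arbitrary choices), increasing lamda should shrink the magnitude of the penalized parameters:

```python
import numpy as np
from numpy.linalg import pinv

# same training data as the regularization demos above
x = np.array([-0.99768, -0.69574, -0.40373, -0.10236, 0.22024, 0.47742, 0.82229])
y = np.array([2.0885, 1.1646, 0.3287, 0.46013, 0.44808, 0.10013, -0.32952])

# degree-5 polynomial features, as in the demo
X = np.array([np.ones(len(x)), x, x**2, x**3, x**4, x**5]).T
Y = y[:, np.newaxis]
one = np.identity(X.shape[1])
one[0, 0] = 0  # do not penalize theta0

def ridge_theta(lamda):
    # regularized normal equation: (X^T X + lamda*M)^-1 X^T Y
    return pinv(X.T.dot(X) + lamda*one).dot(X.T).dot(Y)

theta_small = ridge_theta(0.01)  # weak regularization
theta_large = ridge_theta(10.0)  # strong regularization
# the penalized coefficients theta1..theta5 shrink as lamda grows
print(np.linalg.norm(theta_small[1:]), np.linalg.norm(theta_large[1:]))
```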