MLNG: Linear Regression, Gradient Descent, and the Normal Equations

2019-05-26 21:43:50

symbols:

- $m$: number of training examples
- $x$: input variables/features
- $y$: output variable/target
- $(x, y)$: a training example
- $(x^{(i)}, y^{(i)})$: the $i$-th training example
- $x_i$: the $i$-th attribute of the vector $x$

prior knowledge

1

If $\theta \in \mathbb{R}^{n+1}$, then $\nabla_\theta J(\theta) = \left[\ldots, \frac{\partial J(\theta)}{\partial \theta_i}, \ldots\right] \in \mathbb{R}^{n+1}$.

so the batch gradient descent update is:

$$\theta := \theta - \alpha \nabla_\theta J(\theta)$$

and the stochastic gradient descent update is:

$$\theta := \theta - \alpha \left(h_\theta(x^{(j)}) - y^{(j)}\right) x^{(j)}$$

2

$f$ is a function $\mathbb{R}^{m \times n} \to \mathbb{R}$ and $A$ is a matrix, $A \in \mathbb{R}^{m \times n}$; then,

$$\nabla_A f(A) = \left[\ldots, \frac{\partial f}{\partial A_{ij}}, \ldots\right] \in \mathbb{R}^{m \times n}$$

3

trace operator

If $A \in \mathbb{R}^{n \times n}$, then $\operatorname{tr} A = \sum_{i=1}^n A_{ii}$.

4

- $\operatorname{tr} AB = \operatorname{tr} BA$
- $\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$
- $\nabla_A \operatorname{tr} AB = B^T$
- $\operatorname{tr} A = \operatorname{tr} A^T$
- if $a \in \mathbb{R}$, then $\operatorname{tr} a = a$
- $\nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T$
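As a quick sanity check (not part of the original notes), a few of these identities can be verified numerically with NumPy; the matrix shapes below are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# tr(AB) = tr(BA), even though AB is 3x3 and BA is 4x4
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# tr(A) = tr(A^T) for a square matrix
S = rng.standard_normal((3, 3))
assert np.isclose(np.trace(S), np.trace(S.T))

# nabla_A tr(AB) = B^T: compare a finite-difference derivative in A[1, 2]
eps = 1e-6
A_pert = A.copy()
A_pert[1, 2] += eps
num_grad = (np.trace(A_pert @ B) - np.trace(A @ B)) / eps
assert np.isclose(num_grad, B.T[1, 2], atol=1e-4)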

linear regression

formula

For a single data sample, we have:

$$h(x) = \sum_{i=0}^n \theta_i x_i = \theta^T x$$

where $n$ is the number of features and $\theta \in \mathbb{R}^{n+1}$ is the parameter vector (with $x_0 = 1$ by convention, so that $\theta_0$ is the intercept).
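For instance, a minimal NumPy illustration of $h(x) = \theta^T x$ (the values are made up):

import numpy as np

theta = np.array([1.0, 2.0, 3.0])  # parameters theta_0..theta_2
x = np.array([1.0, 0.5, -1.0])     # features, with x_0 = 1 by convention
h = theta @ x                      # h(x) = theta^T x = 1 + 1 - 3 = -1.0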

optimization problem

our goal is

$$\min_\theta \sum_{i=1}^m \frac{1}{2} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

simplify:

$$\min_\theta J(\theta)$$

where

$$J(\theta) = \sum_{i=1}^m \frac{1}{2} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
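In code, $J(\theta)$ is a few lines of NumPy (a sketch; X is the $m \times (n+1)$ matrix of inputs defined in the normal equation section below):

import numpy as np

def cost(theta, X, Y):
    # J(theta) = sum_i 0.5 * (h_theta(x^(i)) - y^(i))^2
    residuals = X @ theta - Y
    return 0.5 * residuals @ residuals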

algorithm

gradient descent

idea (search)

Start with some $\theta$ (e.g. $\vec{0}$), then keep changing $\theta$ to reduce $J(\theta)$.

If I take a small step, what is the direction of steepest descent that takes me downhill as quickly as possible?

property
  1. The algorithm terminates.
  2. It finds a local optimum (for linear regression the cost is convex, so it finds the global optimum).
  3. Different initial values can lead to different optima.
  4. $\theta$ changes more and more slowly, because the gradient gets smaller and smaller.
formula

$$\theta_i := \theta_i - \alpha \frac{\partial}{\partial \theta_i} J(\theta)$$

deduce
when $m = 1$:

$$\frac{\partial}{\partial \theta_i} J(\theta) = \frac{\partial}{\partial \theta_i} \frac{1}{2} \left(h_\theta(x^{(1)}) - y^{(1)}\right)^2 = \left(h_\theta(x^{(1)}) - y^{(1)}\right) \frac{\partial}{\partial \theta_i} h_\theta(x^{(1)}) = \left(h_\theta(x^{(1)}) - y^{(1)}\right) x_i^{(1)}$$

so the update rule is:

$$\theta_i := \theta_i - \alpha \left(h_\theta(x^{(1)}) - y^{(1)}\right) x_i^{(1)}$$

where $\alpha$ is the learning rate.

when $m > 1$:

$$\theta_i := \theta_i - \alpha \sum_{j=1}^m \left(h_\theta(x^{(j)}) - y^{(j)}\right) x_i^{(j)}$$

convergence

How can we tell whether the algorithm has converged? Two common checks:
  1. Look at how much $\theta$ changes between two adjacent iterations; the change is small once it has converged.
  2. Look at the change of $J(\theta)$ between iterations.

batch gradient descent

What we have seen so far is called batch gradient descent, because each iteration looks at all the data (from $1$ to $m$).

$$\theta_i := \theta_i - \alpha \sum_{j=1}^m \left(h_\theta(x^{(j)}) - y^{(j)}\right) x_i^{(j)}$$

However, when your training set is large, say millions of examples, you must compute a sum over millions of terms before taking even one small step.
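A minimal NumPy sketch of batch gradient descent (the learning rate, tolerance, and iteration cap are assumptions, not from the notes):

import numpy as np

def batch_gradient_descent(X, Y, alpha=0.01, tol=1e-9, max_iters=10_000):
    # minimize J(theta) = 0.5 * ||X theta - Y||^2
    theta = np.zeros(X.shape[1])                 # start with theta = 0
    prev_cost = np.inf
    for _ in range(max_iters):
        residuals = X @ theta - Y                # h_theta(x^(j)) - y^(j) for all j
        theta = theta - alpha * X.T @ residuals  # update all coordinates i at once
        cost = 0.5 * residuals @ residuals
        if abs(prev_cost - cost) < tol:          # convergence check on J(theta)
            break
        prev_cost = cost
    return theta

Note that X.T @ residuals is exactly the sum $\sum_{j=1}^m (h_\theta(x^{(j)}) - y^{(j)}) x^{(j)}$, i.e. the expensive pass over all $m$ examples on every iteration.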

stochastic gradient descent

It is sometimes called incremental gradient descent.

update rule

The update rule is (for all $i$):

$$\theta_i := \theta_i - \alpha \left(h_\theta(x^{(j)}) - y^{(j)}\right) x_i^{(j)}$$

In each iteration, it scans just one data sample to update $\theta$, so each step is much cheaper than a batch gradient descent step.

pseudocode

pseudocode is :

while True:  # repeat until convergence (stop when theta or J(theta) stops changing)
    for i in range(m):  # iterate over each data sample (0-indexed)
        theta = theta - alpha * (x[i] @ theta - y[i]) * x[i]  # SGD update for sample i
compare with batch gradient descent
  1. Stochastic gradient descent is much faster than batch gradient descent, especially when we have a big data set.
  2. Stochastic gradient descent may not take the steepest route to the optimum, and may even move uphill sometimes. Overall, however, it heads toward the optimum and winds around it when it is about to converge.
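Here is a runnable version of the pseudocode above (a sketch: the learning rate and epoch count are assumptions, and shuffling each pass is a common refinement not in the original pseudocode):

import numpy as np

def stochastic_gradient_descent(X, Y, alpha=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                    # "repeat until convergence"
        for j in rng.permutation(X.shape[0]):  # scan one sample at a time
            theta = theta - alpha * (X[j] @ theta - Y[j]) * X[j]
    return theta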

normal equation

definition

$X \in \mathbb{R}^{m \times (n+1)}$

$$X\theta = \left[\ldots, {x^{(i)}}^T \theta, \ldots\right]^T = \left[\ldots, h_\theta(x^{(i)}), \ldots\right]^T \in \mathbb{R}^m$$

$Y \in \mathbb{R}^m$

$$X\theta - Y = \left[\ldots, h_\theta(x^{(i)}) - y^{(i)}, \ldots\right]^T \in \mathbb{R}^m$$

$$\frac{1}{2}(X\theta - Y)^T (X\theta - Y) = \sum_{i=1}^m \frac{1}{2} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = J(\theta)$$

deduce

In order to find the minimum of $J(\theta)$, we must solve

$$\nabla_\theta J(\theta) = 0$$

Working this out with the trace identities above, we finally get

$$\theta = (X^T X)^{-1} X^T Y$$
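A quick NumPy check of the closed form (the synthetic data is made up for illustration; np.linalg.solve is preferred to forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])  # x_0 = 1 column
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
Y = X @ true_theta + 0.01 * rng.standard_normal(m)             # targets with small noise

theta = np.linalg.solve(X.T @ X, X.T @ Y)  # solves (X^T X) theta = X^T Y
print(theta)                               # approximately recovers true_theta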
