symbols:
- $m$: number of training examples
- $x$: input variables/features
- $y$: output variable/target
- $(x, y)$: a training example
- $(x^{(i)}, y^{(i)})$: the $i$-th training example
- $x_i$: the $i$-th attribute of vector $x$
prior knowledge
1
if $\theta \in \mathbb{R}^{n+1}$, then
$$\nabla_\theta J(\theta) = \left[\dots, \frac{\partial J(\theta)}{\partial \theta_i}, \dots\right] \in \mathbb{R}^{n+1}$$
so the batch gradient descent update is:
$$\theta := \theta - \alpha \, \nabla_\theta J(\theta)$$
and the stochastic gradient descent update is:
$$\theta := \theta - \alpha \, (h_\theta(x^{(j)}) - y^{(j)}) \, x^{(j)}$$
2
if $f$ is a function $\mathbb{R}^{m \times n} \to \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$ is a matrix, then
$$\nabla_A f(A) = \left[\dots, \frac{\partial f}{\partial A_{ij}}, \dots\right] \in \mathbb{R}^{m \times n}$$
3
trace operator
if $A \in \mathbb{R}^{n \times n}$, then $\operatorname{tr} A = \sum_{i=1}^n A_{ii}$
4
$$\operatorname{tr} AB = \operatorname{tr} BA$$
$$\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$$
$$\nabla_A \operatorname{tr} AB = B^T$$
$$\operatorname{tr} A = \operatorname{tr} A^T$$
if $a \in \mathbb{R}$, then $\operatorname{tr} a = a$
$$\nabla_A \operatorname{tr} ABA^TC = CAB + C^TAB^T$$
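These identities can be checked numerically; below is a minimal sketch using numpy (random test matrices and a finite-difference gradient; all names are illustrative):

```python
import numpy as np

# Minimal numerical sanity check of the trace identities above
# (random test matrices; finite differences approximate the gradients).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

def finite_diff_grad(f, M, eps=1e-6):
    """Entrywise central-difference gradient of scalar f at matrix M."""
    G = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            P, Q = M.copy(), M.copy()
            P[i, j] += eps
            Q[i, j] -= eps
            G[i, j] = (f(P) - f(Q)) / (2 * eps)
    return G

# grad_A tr(AB) = B^T
assert np.allclose(finite_diff_grad(lambda M: np.trace(M @ B), A), B.T, atol=1e-4)

# grad_A tr(A B2 A^T C) = C A B2 + C^T A B2^T
B2 = rng.standard_normal((4, 4))
g = finite_diff_grad(lambda M: np.trace(M @ B2 @ M.T @ C), A)
assert np.allclose(g, C @ A @ B2 + C.T @ A @ B2.T, atol=1e-4)
print("all trace identities check out")
```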
linear regression
formula
For a single data sample, we have:
$$h(x) = \sum_{i=0}^n \theta_i x_i = \theta^T x$$
where $n$ is the number of features, $\theta$ is the vector of parameters, and $x_0 = 1$ by convention
optimization problem
our goal is
$$\min_\theta \sum_{i=1}^m \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2$$
written more compactly as:
$$\min_\theta J(\theta)$$
where
$$J(\theta) = \sum_{i=1}^m \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2$$
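As a concrete sketch, here is how $h_\theta$ and $J(\theta)$ could be computed with numpy (toy data; assumes $x_0 = 1$ has been prepended to each example, and all names are illustrative):

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x (x includes x_0 = 1)."""
    return theta @ x

def J(theta, X, y):
    """Cost J(theta) = sum_i 1/2 * (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y        # vector of h_theta(x^(i)) - y^(i)
    return 0.5 * np.sum(residuals ** 2)

# toy data: m = 3 examples, n = 1 feature plus the x_0 = 1 column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(J(np.array([1.0, 1.0]), X, y))   # 0.0: theta = [1, 1] fits y = 1 + x exactly
```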
algorithm
gradient descent
idea (search)
start with some $\theta$ (e.g., $\vec{0}$), then keep changing $\theta$ to reduce $J(\theta)$:
if I take a small step, what is the direction of steepest descent that would take me downhill as quickly as possible?
properties
- it must terminate
- it can find a local optimum (for linear regression, the cost is convex, so it finds the global optimum)
- different initial values can result in different local optima
- $\theta$ changes more and more slowly over time, because the gradient gets smaller and smaller
formula
$$\theta_i := \theta_i - \alpha \frac{\partial}{\partial \theta_i} J(\theta)$$
derivation
when $m = 1$
$$\frac{\partial}{\partial \theta_i} J(\theta) = \frac{\partial}{\partial \theta_i} \frac{1}{2} (h_\theta(x^{(1)}) - y^{(1)})^2 = (h_\theta(x^{(1)}) - y^{(1)}) \, x_i^{(1)}$$
(by the chain rule, using $\frac{\partial}{\partial \theta_i} h_\theta(x) = x_i$), so the update rule is:
$$\theta_i := \theta_i - \alpha \, (h_\theta(x^{(1)}) - y^{(1)}) \, x_i^{(1)}$$
where $\alpha$ is the learning rate.
when $m > 1$
$$\theta_i := \theta_i - \alpha \sum_{j=1}^m (h_\theta(x^{(j)}) - y^{(j)}) \, x_i^{(j)}$$
convergence
how can we tell whether it has converged? two common checks:
1. look at how much $\theta$ changes between two adjacent iterations; the change is small once it has converged.
2. look at the change in $J(\theta)$ between iterations.
batch gradient descent
what we've seen so far is called batch gradient descent; the name reflects that it looks at all the data (from $1$ to $m$) in each iteration:
$$\theta_i := \theta_i - \alpha \sum_{j=1}^m (h_\theta(x^{(j)}) - y^{(j)}) \, x_i^{(j)}$$
However, when your training set is large, say millions of examples, you must compute a sum over millions of terms before taking even one small step forward.
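A minimal batch gradient descent sketch in numpy, using convergence check 1 from above (stop when $\theta$ barely changes); the step size `alpha` and tolerance are illustrative and would need tuning in practice:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, tol=1e-8, max_iters=10_000):
    m, n = X.shape
    theta = np.zeros(n)                              # start with theta = 0
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y)                 # sums over all m examples
        theta_new = theta - alpha * grad
        if np.linalg.norm(theta_new - theta) < tol:  # convergence check on theta
            return theta_new
        theta = theta_new
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(batch_gradient_descent(X, y))                  # approx [1.0, 1.0]
```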
stochastic gradient descent
it is sometimes called incremental gradient descent.
update rule
the update rule is (for all $i$):
$$\theta_i := \theta_i - \alpha \, (h_\theta(x^{(j)}) - y^{(j)}) \, x_i^{(j)}$$
In each iteration, it scans just one data sample to update $\theta$, so each update is much cheaper than a batch update.
pseudocode
the pseudocode is:
```python
while True:                # repeat until convergence
    for j in range(m):     # iterate over each data sample
        theta = theta - alpha * (h(theta, x[j]) - y[j]) * x[j]   # update theta
```
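A runnable version of this sketch with numpy (a fixed number of epochs stands in for `while True`; in practice one would also shuffle the samples each epoch and decay `alpha` over time):

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=500):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):                 # "repeat until convergence"
        for j in range(m):                  # iterate over each data sample
            err = X[j] @ theta - y[j]       # h_theta(x^(j)) - y^(j)
            theta = theta - alpha * err * X[j]
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(sgd(X, y))                            # approx [1.0, 1.0]
```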
comparison with batch gradient descent
- stochastic gradient descent is much faster than batch gradient descent, especially when we have a big data set.
- stochastic gradient descent may not take the steepest route to the optimum, and may even go uphill sometimes; overall, though, it heads toward the optimum and winds around it as it nears convergence.
normal equation
definition
$$X \in \mathbb{R}^{m \times (n+1)}$$
$$X\theta = [\dots, {x^{(i)}}^T \theta, \dots]^T = [\dots, h_\theta(x^{(i)}), \dots]^T \in \mathbb{R}^m$$
$$Y \in \mathbb{R}^m$$
$$X\theta - Y = [\dots, h_\theta(x^{(i)}) - y^{(i)}, \dots]^T \in \mathbb{R}^m$$
$$\frac{1}{2}(X\theta - Y)^T (X\theta - Y) = \sum_{i=1}^m \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2 = J(\theta)$$
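A quick numerical check that the matrix form of $J(\theta)$ matches the per-sample sum (random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, n))])  # x_0 = 1 column
Y = rng.standard_normal(m)
theta = rng.standard_normal(n + 1)

J_matrix = 0.5 * (X @ theta - Y) @ (X @ theta - Y)             # matrix form
J_sum = sum(0.5 * (X[i] @ theta - Y[i]) ** 2 for i in range(m))  # sum form
assert np.isclose(J_matrix, J_sum)
print(J_matrix, J_sum)
```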
derivation
in order to find the minimum of $J(\theta)$, we solve
$$\nabla_\theta J(\theta) = 0$$
using the trace identities above, $\nabla_\theta J(\theta) = X^T X \theta - X^T Y$; setting this to zero, we get
$$\theta = (X^T X)^{-1} X^T Y$$
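In numpy, a sketch of solving the normal equation; solving the linear system $X^TX\theta = X^TY$ with `np.linalg.solve` is preferred over forming the inverse explicitly, and `np.linalg.lstsq` is the more robust library route:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([2.0, 3.0, 4.0])

# Solve X^T X theta = X^T Y instead of inverting X^T X explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)                                # [1. 1.]

# Same result via least squares (handles rank-deficient X as well):
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_lstsq)
```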