1 Conditions for Normal Equation
Prove the following theorem: the matrix $A^TA$ is invertible if and only if the columns of $A$ are linearly independent.
1. First, we show that if the columns of $A$ are linearly independent, then $A^TA$ is invertible. Since $A^TA$ is square, it is invertible if and only if the system $A^TAx = 0$ has only the trivial solution $x = 0$, so it suffices to show that $A^TAx = 0$ implies $x = 0$.
Suppose $A^TAx = 0$. Left-multiplying by $x^T$ gives $x^TA^TAx = 0$, i.e. $(Ax)^T(Ax) = \|Ax\|^2 = 0$, hence $Ax = 0$. Writing $A = [a_1, a_2, \dots, a_n]$ and $x = [x_1, x_2, \dots, x_n]^T$, this reads
$$a_1x_1 + a_2x_2 + \dots + a_nx_n = 0. \qquad (1)$$
Because the columns of $A$ are linearly independent, (1) forces $x_1 = x_2 = \dots = x_n = 0$, i.e. $x = 0$, which proves this direction.
Conversely, if the columns of $A$ are linearly dependent, then (1) has a nonzero solution $x \neq 0$; for that $x$ we get $Ax = 0$ and hence $A^TAx = 0$, so $A^TA$ is singular. Therefore $A^TA$ is invertible if and only if the columns of $A$ are linearly independent.
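As a numerical illustration (a minimal sketch, not part of the proof; the two matrices and the use of NumPy are my own choices), the snippet below checks that $\det(A^TA) \neq 0$ exactly when $A$ has full column rank:

```python
# Hypothetical example: A^T A is invertible exactly when A has full column rank.
import numpy as np

A_indep = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])   # columns linearly independent
A_dep = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])     # second column = 2 * first column

for A in (A_indep, A_dep):
    G = A.T @ A                          # the Gram matrix A^T A
    rank = np.linalg.matrix_rank(A)
    det = np.linalg.det(G)               # zero iff A^T A is singular
    print(f"rank(A) = {rank}, det(A^T A) = {det:.3f}")
# Expected: the full-rank A gives a nonzero determinant; the rank-deficient
# A gives det(A^T A) = 0, i.e. a singular Gram matrix.
```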
2 Newton's Method for Computing Least Squares
In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to the optimal parameter $\theta^*$.
(a) Find the Hessian of the cost function $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(\theta^Tx^{(i)} - y^{(i)}\right)^2$.
(b) Show that the first iteration of Newton's method gives us $\theta^* = (X^TX)^{-1}X^T\vec{y}$, the solution to our least squares problem. ($\vec{y}$ denotes the vector of the targets $y^{(i)}$.)
2. (a) By the definition of the Hessian,
$$H_{i,j} = \frac{\partial^2 J(\theta)}{\partial\theta_i\,\partial\theta_j} = \frac{\partial}{\partial\theta_j}\left(\frac{\partial J(\theta)}{\partial\theta_i}\right) = \frac{\partial}{\partial\theta_j}\left(\sum_{k=1}^{m}\left(\theta^Tx^{(k)} - y^{(k)}\right)x_i^{(k)}\right) = \sum_{k=1}^{m}x_i^{(k)}x_j^{(k)}.$$
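As a sanity check of this formula (not part of the original solution; the data are made up), one can compare a finite-difference Hessian of $J$ against $X^TX$, where the rows of $X$ are the $x^{(k)}$ as formalized in part (b) below:

```python
# Finite-difference check (hypothetical data): the numerical Hessian of
# J(theta) = 1/2 * sum_k (theta^T x_k - y_k)^2 should equal X^T X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))    # rows are the x^{(k)}
y = rng.normal(size=20)

def J(theta):
    r = X @ theta - y           # residuals theta^T x^{(k)} - y^{(k)}
    return 0.5 * r @ r

eps = 1e-3
n = X.shape[1]
I = np.eye(n)
H_num = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # central second difference around theta = 0; exact for a quadratic J
        H_num[i, j] = (J(eps * (I[i] + I[j])) - J(eps * (I[i] - I[j]))
                       - J(eps * (-I[i] + I[j])) + J(eps * (-I[i] - I[j]))) / (4 * eps**2)

print(np.allclose(H_num, X.T @ X, atol=1e-6))   # True: H = X^T X
```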
(b) Let $X = [x^{(1)}\ x^{(2)}\ \dots\ x^{(m)}]^T$ and $\vec{y} = [y^{(1)}\ y^{(2)}\ \dots\ y^{(m)}]^T$. The result of (a) then reads $H = X^TX$.
For the gradient $\nabla_\theta J(\theta)$, by definition
$$\nabla_\theta J(\theta) = \left[\frac{\partial J(\theta)}{\partial\theta_1}\ \frac{\partial J(\theta)}{\partial\theta_2}\ \dots\ \frac{\partial J(\theta)}{\partial\theta_n}\right]^T,$$
whose $i$-th component is
$$\left(\nabla_\theta J(\theta)\right)_i = \sum_{k=1}^{m}\left(\theta^Tx^{(k)} - y^{(k)}\right)x_i^{(k)} = \sum_{k=1}^{m}\theta^Tx^{(k)}x_i^{(k)} - \sum_{k=1}^{m}y^{(k)}x_i^{(k)} = \left(X^TX\theta - X^T\vec{y}\right)_i,$$
so $\nabla_\theta J(\theta) = X^TX\theta - X^T\vec{y}$.
Newton's method iterates $\theta := \theta - H^{-1}\nabla_\theta J(\theta)$, so starting from any $\theta$, the first iteration already gives
$$\theta^* = \theta - (X^TX)^{-1}\left(X^TX\theta - X^T\vec{y}\right) = (X^TX)^{-1}X^T\vec{y}.$$
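A short demo of this one-step convergence (a sketch with made-up data, assuming NumPy): starting from an arbitrary $\theta$, a single Newton update lands exactly on the normal-equation solution.

```python
# One Newton step on least squares reaches the optimum from any start.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))               # hypothetical design matrix
y = rng.normal(size=50)                    # hypothetical targets

theta0 = rng.normal(size=4)                # arbitrary starting point
grad = X.T @ X @ theta0 - X.T @ y          # gradient of J at theta0
H = X.T @ X                                # Hessian (constant in theta)
theta1 = theta0 - np.linalg.solve(H, grad) # one Newton update

theta_star = np.linalg.solve(X.T @ X, X.T @ y)  # normal-equation solution
print(np.allclose(theta1, theta_star))          # True
```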
3 Prediction using Linear Regression
The sales of a company (in million dollars) for each year are shown in the table below.

x (year):  2005  2006  2007  2008  2009
y (sales):   12    19    29    37    45

(a) Find the least squares regression line $y = ax + b$.
(b) Use the least squares regression line as a model to estimate the sales of the company in 2012.
3. (a) By the theorem above, the unique least squares solution is $\theta = (X^TX)^{-1}X^T\vec{y}$. From the data, set
$$X = \begin{bmatrix} 1 & 2005 \\ 1 & 2006 \\ 1 & 2007 \\ 1 & 2008 \\ 1 & 2009 \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} 12 \\ 19 \\ 29 \\ 37 \\ 45 \end{bmatrix}.$$
Carrying out the computation gives $\theta = [b;\ a] = [-16830.4;\ 8.4]$, so the least squares regression line is $y = 8.4x - 16830.4$.
(b) Plugging in $x = 2012$ gives $y = 8.4 \times 2012 - 16830.4 = 70.4$, so the model estimates the company's 2012 sales at 70.4 million dollars.
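The same numbers can be reproduced with NumPy's built-in least squares solver (a sketch; the variable names are my own):

```python
# Fit y = a*x + b to the sales data and predict 2012.
import numpy as np

years = np.array([2005, 2006, 2007, 2008, 2009], dtype=float)
sales = np.array([12, 19, 29, 37, 45], dtype=float)

X = np.column_stack([np.ones_like(years), years])   # [1, x] design matrix
(b, a), *_ = np.linalg.lstsq(X, sales, rcond=None)  # theta = [intercept, slope]
print(a, b)           # approximately 8.4 and -16830.4
print(a * 2012 + b)   # approximately 70.4 (the 2012 estimate)
```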
4 Logistic Regression
Consider the average empirical loss for logistic regression:
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\log\left(1 + e^{-y^{(i)}\theta^Tx^{(i)}}\right) = -\frac{1}{m}\sum_{i=1}^{m}\log\left(h_\theta\left(y^{(i)}x^{(i)}\right)\right),$$
where $y^{(i)} \in \{-1, 1\}$, $h_\theta(x) = g(\theta^Tx)$ and $g(z) = 1/(1 + e^{-z})$. Find the Hessian $H$ of this function, and show that for any vector $z$, it holds true that $z^THz \geq 0$.
4. Differentiating $J$ once,
$$\frac{\partial J(\theta)}{\partial\theta_i} = -\frac{1}{m}\sum_{k=1}^{m}\left(1 - g\left(y^{(k)}\theta^Tx^{(k)}\right)\right)y^{(k)}x_i^{(k)},$$
and once more, by the definition of the Hessian,
$$H_{i,j} = \frac{\partial^2 J(\theta)}{\partial\theta_i\,\partial\theta_j} = \frac{1}{m}\sum_{k=1}^{m}g\left(y^{(k)}\theta^Tx^{(k)}\right)\left(1 - g\left(y^{(k)}\theta^Tx^{(k)}\right)\right)\left(y^{(k)}\right)^2x_i^{(k)}x_j^{(k)}.$$
Note that $y^{(k)} \in \{-1, 1\}$, so $(y^{(k)})^2 = 1$. Moreover, since $g(-z) = 1 - g(z)$, we have $g(z)(1 - g(z)) = g(-z)(1 - g(-z))$; hence, with $h_\theta(x) = g(\theta^Tx)$, the factor $g(y^{(k)}\theta^Tx^{(k)})(1 - g(y^{(k)}\theta^Tx^{(k)}))$ equals $h_\theta(x^{(k)})(1 - h_\theta(x^{(k)}))$ whether $y^{(k)}$ is $1$ or $-1$. Therefore
$$H_{i,j} = \frac{1}{m}\sum_{k=1}^{m}h_\theta\left(x^{(k)}\right)\left(1 - h_\theta\left(x^{(k)}\right)\right)x_i^{(k)}x_j^{(k)},$$
or in matrix form,
$$H = \frac{1}{m}\sum_{k=1}^{m}h_\theta\left(x^{(k)}\right)\left(1 - h_\theta\left(x^{(k)}\right)\right)x^{(k)}x^{(k)T}.$$
For any vector $z$,
$$z^THz = \frac{1}{m}\sum_{k=1}^{m}h_\theta\left(x^{(k)}\right)\left(1 - h_\theta\left(x^{(k)}\right)\right)z^Tx^{(k)}x^{(k)T}z = \frac{1}{m}\sum_{k=1}^{m}h_\theta\left(x^{(k)}\right)\left(1 - h_\theta\left(x^{(k)}\right)\right)\left(z^Tx^{(k)}\right)^2 \geq 0,$$
since $0 < h_\theta(x^{(k)}) < 1$ makes every weight positive. This completes the proof.
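As a numerical confirmation (a sketch with made-up data, assuming NumPy), assembling $H$ from the formula above and checking its eigenvalues shows it is positive semidefinite:

```python
# Build the logistic-loss Hessian from the closed form and verify PSD-ness.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))      # rows are hypothetical x^{(k)}
theta = rng.normal(size=5)

h = sigmoid(X @ theta)             # h_theta(x^{(k)}) for each example
w = h * (1.0 - h)                  # per-example weights, all in (0, 1/4]
H = (X.T * w) @ X / len(X)         # H = (1/m) * sum_k w_k x^{(k)} x^{(k)T}

eigvals = np.linalg.eigvalsh(H)    # H is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-12)     # True: all eigenvalues are nonnegative
```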