Logistic Regression and Newton's Method
Logistic Regression
Despite its name, logistic regression is a classification model (though, since it outputs probabilities, you could use those outputs for regression-like purposes if you really wanted to).
WARNING: do not use linear regression to solve classification problems.
Logistic regression
sigmoid function: $g(x) = \frac{1}{1+e^{-x}}$
define $h_{\theta}(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}},\ P(y=1|x;\theta) = h_{\theta}(x),\ P(y=0|x;\theta) = 1-h_{\theta}(x)$
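As a quick illustration, here is a minimal NumPy sketch of the sigmoid and the hypothesis (the function names are mine, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x), interpreted as P(y=1 | x; theta)
    return sigmoid(np.dot(theta, x))
```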
combine these two equations: $P(y|x;\theta) = (h_{\theta}(x))^y(1-h_{\theta}(x))^{1-y}$
use maximum likelihood estimation (MLE):
likelihood $L(\theta)=P(\vec{y}|x;\theta)=\prod\limits_{i=1}^m(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}}$
We prefer a sum to a product, so take the log-likelihood: $l(\theta)=\sum\limits_{i=1}^m[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))]$
Up to a sign, this is the cross-entropy function for binary classification; we will talk about it later.
use gradient ascent to maximize $l(\theta)$ (equivalently, gradient descent on $-l(\theta)$):
$$
\theta_j := \theta_j + \alpha\frac{\partial}{\partial\theta_j}l(\theta) \\
\theta_j := \theta_j + \alpha\sum\limits_{i=1}^m(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}
$$
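The step from the first update to the second uses $g'(z) = g(z)(1-g(z))$; for a single example,

$$
\frac{\partial}{\partial\theta_j}\Big[y\log h_{\theta}(x)+(1-y)\log(1-h_{\theta}(x))\Big]
= \Big(\frac{y}{h_{\theta}(x)}-\frac{1-y}{1-h_{\theta}(x)}\Big)h_{\theta}(x)(1-h_{\theta}(x))x_j
= (y-h_{\theta}(x))x_j
$$

and summing over the $m$ examples gives the update above.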
$l(\theta)$ is concave (its negative is convex), so it has no local optima other than the global maximum.
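Putting the update rule together, a minimal batch gradient-ascent sketch might look like this (assuming a design matrix `X` of shape `(m, n)` and a 0/1 label vector `y`; the names and the extra `1/m` scaling are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, iters=1000):
    # X: (m, n) design matrix, y: (m,) vector of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)        # h_theta(x^(i)) for all i
        grad = X.T @ (y - h)          # sum_i (y^(i) - h^(i)) x^(i)
        theta += alpha * grad / m     # gradient ascent on l(theta)
        # dividing by m just rescales alpha; the update in the notes omits it
    return theta
```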
Softmax Regression
You can regard softmax regression as multiclass logistic regression.
define $K$ – the number of classes
$y$ – label, a one-hot vector
$\theta$ – parameter, $\theta = \begin{bmatrix} -\theta_1^T- \\ -\theta_2^T- \\ \vdots \\ -\theta_K^T- \end{bmatrix}$, where $\theta_k$ is the parameter vector of the $k^{th}$ class
$h_\theta(x)$ – hypothesis, $h_\theta(x^{(i)})= \begin{bmatrix} P(y^{(i)}_1=1|x^{(i)};\theta) \\ \vdots \\ P(y^{(i)}_K=1|x^{(i)};\theta) \end{bmatrix} = \frac{1}{\sum\limits_{j=1}^K\exp(\theta_j^Tx^{(i)})}\begin{bmatrix} \exp(\theta_1^Tx^{(i)}) \\ \vdots \\ \exp(\theta_K^Tx^{(i)})\end{bmatrix}$
Each entry of the hypothesis vector is the predicted probability of the corresponding class.
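A small sketch of computing this hypothesis vector; subtracting the maximum score before exponentiating is a standard numerical-stability trick (an addition of mine, valid because softmax is shift-invariant):

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    # Theta: (K, n) matrix with theta_k^T as rows; x: (n,) feature vector
    scores = Theta @ x                    # theta_k^T x for each class k
    scores -= scores.max()                # stabilize: softmax is shift-invariant
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # entry k = P(y_k = 1 | x; Theta)
```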
Cross entropy error function:
$Loss = - \sum\limits_i 1\{a\}\ln p_i$
$1\{a\}$ is the indicator function: $1\{a\} = 1$ if $a$ is true, otherwise $1\{a\} = 0$.
Either $\log$ or $\ln$ is fine, as long as the base is used consistently.
$J(\theta) = -\frac1m\left[\sum\limits_{i=1}^m\sum\limits_{j=1}^K 1\{y^{(i)}_j=1\}\ln\frac{\exp(\theta^T_jx^{(i)})}{\sum\limits_{l=1}^K\exp(\theta_l^Tx^{(i)})}\right]$
Usually we add a weight decay term, since the softmax parameters are over-parameterized and the penalty also keeps any parameter from becoming too large:
$J(\theta) = -\frac1m\left[\sum\limits_{i=1}^m\sum\limits_{j=1}^K 1\{y^{(i)}_j=1\}\ln\frac{\exp(\theta^T_jx^{(i)})}{\sum\limits_{l=1}^K\exp(\theta_l^Tx^{(i)})}\right]+\frac\lambda2\sum\limits_{i=1}^K\sum\limits_{j=1}^n\theta_{ij}^2$
use gradient descent to solve:
$\theta_j := \theta_j -\alpha\nabla_{\theta_j}J(\theta)$
$\theta_j := \theta_j +\alpha\left(\frac1m\sum\limits_{i=1}^m\left[x^{(i)}\left(1\{y^{(i)}_j=1\}- P(y^{(i)}_j=1|x^{(i)};\theta)\right)\right]-\lambda\theta_j\right)$
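As an illustration, a sketch of one cost evaluation and gradient-descent step under the one-hot convention above (`Y` is an `(m, K)` one-hot label matrix; all names and default values are hypothetical):

```python
import numpy as np

def softmax_probs(X, Theta):
    # X: (m, n), Theta: (K, n); returns (m, K) matrix of class probabilities
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cost_and_step(X, Y, Theta, alpha=0.1, lam=1e-4):
    m = X.shape[0]
    P = softmax_probs(X, Theta)
    # J = -(1/m) sum_i sum_j 1{y_j=1} ln p_ij + (lambda/2) * sum of squared params
    J = -np.sum(Y * np.log(P)) / m + 0.5 * lam * np.sum(Theta ** 2)
    # grad_j = -(1/m) sum_i x^(i) (1{y_j^(i)=1} - p_ij) + lambda * theta_j
    grad = -(Y - P).T @ X / m + lam * Theta
    Theta_new = Theta - alpha * grad        # gradient descent step
    return J, Theta_new
```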
Softmax vs. Logistic (one-vs-all)
If the classes are mutually exclusive, use Softmax Regression (it's faster).
If the classes are not mutually exclusive, use Logistic Regression with a one-vs-all strategy.
Newton’s Method
Let $f = l'$; maximizing $l$ means finding $\theta$ where $l'(\theta) = 0$, i.e. a root of $f$.
$\theta^{(t+1)} := \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}$
$\theta^{(t+1)} := \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$
$\theta^{(t+1)} := \theta^{(t)} - H^{-1}\nabla_\theta l$
$H$ is the Hessian matrix of $l(\theta)$.
Use Newton's method when the number of parameters is small, since each iteration requires forming and inverting the Hessian.
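For logistic regression specifically, the gradient and Hessian of $l(\theta)$ have closed forms, so a Newton iteration can be sketched as follows (assuming the Hessian is invertible; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    # X: (m, n) design matrix, y: (m,) vector of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)          # gradient of l(theta)
        S = h * (1.0 - h)             # per-example weights h(1 - h)
        H = -(X.T * S) @ X            # Hessian of l(theta): -sum_i S_i x_i x_i^T
        # Newton update: theta := theta - H^{-1} * grad
        theta -= np.linalg.solve(H, grad)
    return theta
```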
Common loss functions
Classification Error $J(\theta) = \frac{\text{\# error items}}{\text{\# all items}}$
Mean Squared Error (MSE) $J(\theta)=\frac1n\sum\limits_{i}^n(\hat{y}_i-y_i)^2$
Cross Entropy Error Function $Loss = - \frac1N\sum\limits_i 1\{a\}\ln p_i$
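A small sketch of these three losses in NumPy (array names are illustrative; `P` holds predicted class probabilities and `Y` the one-hot labels):

```python
import numpy as np

def classification_error(y_pred, y_true):
    # fraction of misclassified items
    return np.mean(y_pred != y_true)

def mean_squared_error(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def cross_entropy(P, Y):
    # P: (N, K) predicted probabilities, Y: (N, K) one-hot labels
    return -np.mean(np.sum(Y * np.log(P), axis=1))
```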