Naive Bayes

We will use an example to illustrate this algorithm; this example is still used in practice today.

Spam email classifier
$x$ is a 0/1 vector indexed by a dictionary, where each dimension corresponds to one word. If a word appears in an email, the corresponding entry $x_j$ is set to 1.
We assume that the $x_j$ are conditionally independent given $y$ (the naive Bayes assumption), even though this is clearly false given the meaning of an email.
We choose the 10,000 most commonly used words as the dictionary.
By the chain rule (no assumption needed):
$P(x_1…x_{10000}|y) = P(x_1|y)P(x_2|x_1,y)…P(x_{10000}|x_{9999},x_{9998},…,x_1,y)$
Under the naive Bayes assumption, this simplifies to:
$P(x_1…x_{10000}|y) = \prod\limits_{j=1}^{10000}P(x_j|y)$
Parameters:
$\phi_{j|y=1} = P(x_j=1|y=1)$
$\phi_{j|y=0} = P(x_j=1|y=0)$
$\phi_y = P(y=1)$
$y=1$ means the email is spam.
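The feature mapping above can be sketched as follows. The `email_to_vector` helper and the tiny 5-word dictionary are assumptions for illustration (a real dictionary would have 10,000 entries and proper tokenization):

```python
# Map an email to a binary vector x where x[j] = 1 iff dictionary
# word j occurs in the email (hypothetical helper, toy dictionary).
def email_to_vector(email_text, dictionary):
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in dictionary]

dictionary = ["buy", "cheap", "hello", "meeting", "now"]
print(email_to_vector("Buy cheap pills now", dictionary))  # → [1, 1, 0, 0, 1]
```

Note that only presence/absence is recorded; word counts are ignored in this (Bernoulli) event model.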

Joint likelihood:
$L(\phi_y,\phi_{j|y}) = \prod\limits_{i=1}^m P(x^{(i)},y^{(i)};\phi_y,\phi_{j|y})$
MLE:
$\phi_y=\frac{\sum\limits_{i=1}^m 1\{y^{(i)}=1\}}{m}$
$\phi_{j|y=1} = \frac{\sum\limits_{i=1}^m 1\{x_j^{(i)}=1,\,y^{(i)}=1\}}{\sum\limits_{i=1}^m 1\{y^{(i)}=1\}}$
$\phi_{j|y=0} = \frac{\sum\limits_{i=1}^m 1\{x_j^{(i)}=1,\,y^{(i)}=0\}}{\sum\limits_{i=1}^m 1\{y^{(i)}=0\}}$
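The MLE formulas above are just per-class frequency counts. A minimal sketch on an assumed toy dataset (4 emails, 3 dictionary words):

```python
import numpy as np

# X: m x n binary matrix (one row per email), y: labels (1 = spam).
# Toy data for illustration only.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

phi_y = np.mean(y == 1)            # fraction of spam emails
phi_j_y1 = X[y == 1].mean(axis=0)  # P(x_j = 1 | y = 1), per word j
phi_j_y0 = X[y == 0].mean(axis=0)  # P(x_j = 1 | y = 0)
print(phi_y, phi_j_y1, phi_j_y0)   # → 0.5 [1.  0.5 0. ] [0.  0.5 1. ]
```

Notice that `phi_j_y1` already contains a 0 entry (word 3 never appears in a training spam email), which is exactly the problem the next section addresses.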

Laplace smoothing
If some dictionary word $j$ never appears in any training spam email, then $\sum\limits_{i=1}^m 1\{x_j^{(i)}=1,y^{(i)}=1\}=0$, so $\phi_{j|y=1}=0$, and any email containing that word is assigned probability 0 of being spam. This is not robust, so we apply Laplace smoothing: add 1 to each count in the numerator, and add the number of values $x_j$ can take (here 2, since $x_j$ is binary) to the denominator.
$\phi_{j|y=1} = \frac{\sum\limits_{i=1}^m 1\{x_j^{(i)}=1,\,y^{(i)}=1\}+1}{\sum\limits_{i=1}^m 1\{y^{(i)}=1\}+2}$
$\phi_{j|y=0} = \frac{\sum\limits_{i=1}^m 1\{x_j^{(i)}=1,\,y^{(i)}=0\}+1}{\sum\limits_{i=1}^m 1\{y^{(i)}=0\}+2}$
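Putting it together, here is a sketch of Laplace-smoothed estimation plus prediction for the Bernoulli model, on an assumed toy dataset. Prediction compares $\log P(x, y=1)$ against $\log P(x, y=0)$; working in log space avoids numerical underflow when multiplying 10,000 small probabilities:

```python
import numpy as np

# Toy training data (assumption for illustration).
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

phi_y = y.mean()
# Laplace smoothing: +1 in the numerator, +2 in the denominator
# (each x_j is binary), so no estimate is ever exactly 0 or 1.
phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)

def predict(x):
    # log P(x, y=c) = log P(y=c) + sum_j log P(x_j | y=c)
    lp1 = np.log(phi_y) + np.sum(x * np.log(phi_j_y1) + (1 - x) * np.log(1 - phi_j_y1))
    lp0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j_y0) + (1 - x) * np.log(1 - phi_j_y0))
    return int(lp1 > lp0)

print(predict(np.array([1, 1, 0])))  # → 1 (looks like the spam rows)
print(predict(np.array([0, 0, 1])))  # → 0 (looks like the non-spam rows)
```

Thanks to smoothing, `predict` remains well-defined even for a word pattern never seen with one of the classes during training.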