GPT
GPT (TODO)
BERT
BERT (TODO)
Delta Learning
Delta Learning Intro (TODO)
Transformer Code
Transformer (Code). Based on Yu-Hsiang Huang's implementation, graykode's implementation, and DASOU's video.
LSTM
Long Short-Term Memory (LSTM)
In general, LSTM is like an RNN with a "notebook" to record information. The pink signs in the blocks belong to the "notebook", and the yellow ones are gates. Compared to RNN, LSTM extends the memory of short-term information, so it is called long short-term memory.
IDEA: multilayer LSTM: an RNN with multiple "notebooks". The first "notebook" records information from the RNN outputs; the second "notebook" records information from the first "notebook"… This architecture finally fo ...
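As a concrete companion to the description above, here is a minimal sketch of a single LSTM step in PyTorch, where the cell state plays the role of the "notebook" and the sigmoid factors are the gates. The function `lstm_step` and the toy sizes are my own illustration, not taken from any particular implementation.

```python
import torch

# A minimal sketch of one LSTM step (toy sizes). The cell state c is the
# "notebook"; f, i, o are the forget/input/output gates.
def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)
    hidden = h_prev.shape[-1]
    z = x @ W.T + h_prev @ U.T + b          # all gate pre-activations at once
    f, i, o, g = z.split(hidden, dim=-1)    # forget, input, output, candidate
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g                  # update the "notebook" (cell state)
    h = o * torch.tanh(c)                   # short-term output read from it
    return h, c

# toy usage with random parameters
x = torch.randn(1, 8)                       # input at one time step
h0, c0 = torch.zeros(1, 16), torch.zeros(1, 16)
W, U, b = torch.randn(64, 8), torch.randn(64, 16), torch.zeros(64)
h1, c1 = lstm_step(x, h0, c0, W, U, b)
```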
RNN
RNN (Recurrent Neural Network)
1. Structure
RNN feeds its output back as part of the input at the next time step, so it can capture temporal information. The main part $S$ of the model is a simple NN.
The weights $W, U, V$ are shared across all time steps.
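A minimal sketch of this recurrence, assuming the common form $s_t = \tanh(U x_t + W s_{t-1})$ and $o_t = V s_t$; the sizes and random weights are toy values for illustration only.

```python
import numpy as np

# Minimal vanilla-RNN sketch: the same U, W, V are reused at every time step.
rng = np.random.default_rng(0)
U = rng.normal(size=(16, 8))    # input  -> hidden
W = rng.normal(size=(16, 16))   # hidden -> hidden (the recurrent loop)
V = rng.normal(size=(4, 16))    # hidden -> output

def rnn_forward(xs, s0=None):
    s = np.zeros(16) if s0 is None else s0
    outputs = []
    for x in xs:                       # sequential: step t needs the state from t-1
        s = np.tanh(U @ x + W @ s)     # hidden state S
        outputs.append(V @ s)          # output at this time step
    return outputs, s

outs, last_state = rnn_forward([rng.normal(size=8) for _ in range(5)])
```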
2. Problems
Gradient Vanishing
Gradient vanishing in RNN means that it cannot memorize long-term information, because it gets overwritten by recent information. LSTM is a good remedy for this problem.
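A tiny numerical sketch of why this happens: backpropagating through many time steps multiplies the gradient by the recurrent Jacobian again and again, and when those factors are smaller than one the long-range signal decays exponentially. The matrix scale and the 0.9 stand-in for $\tanh'$ are arbitrary toy choices.

```python
import numpy as np

# Toy demonstration of gradient vanishing: the gradient w.r.t. an early hidden
# state is (roughly) the product of per-step Jacobians W^T * diag(tanh').
rng = np.random.default_rng(0)
W = 0.3 * rng.normal(size=(16, 16)) / np.sqrt(16)   # recurrent weights (toy scale)
grad = np.ones(16)                                   # gradient arriving at the last step
for t in range(50):                                  # 50 time steps back
    grad = W.T @ grad * 0.9                          # 0.9 stands in for tanh'
print(np.linalg.norm(grad))                          # nearly zero: long-range signal is lost
```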
Non-parallel computing
RNN cannot be computed in parallel, because the next ...
Prompt
Prompt Learning Intro (TODO)
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
http://pretrain.nlpedia.ai/
1. Bigger model, harder fine-tuning
T5: 11B parameters
GPT-3: 175B parameters
Fine-tuning trains a new model from a pre-trained model on domain-specific data. This can be hard and expensive when the model is huge. How can we avoid directly training the whole model?
2. Intro of prompt learning
Prompt learning bridges t ...
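As a tiny illustration of the idea (reformulate the task as a cloze question that a pre-trained LM can already answer, instead of fine-tuning the whole model), here is a hedged sketch using the Hugging Face `transformers` fill-mask pipeline. The template and the label words are my own toy choices, not taken from the survey.

```python
from transformers import pipeline  # assumes the `transformers` package is installed

# Toy prompt-learning sketch: sentiment classification as a cloze task.
# Instead of fine-tuning, wrap the input in a template and let the
# pre-trained masked LM fill the blank.
fill = pipeline("fill-mask", model="bert-base-uncased")

review = "I love this movie."
template = f"{review} Overall it was a [MASK] movie."

label_words = {"positive": "great", "negative": "terrible"}  # toy verbalizer
scores = {lab: 0.0 for lab in label_words}
for pred in fill(template, top_k=50):            # top_k: recent transformers versions
    for lab, word in label_words.items():
        if pred["token_str"].strip() == word:
            scores[lab] = pred["score"]

print(max(scores, key=scores.get))               # expected: "positive"
```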
Transformer
Transformer
1. Encoder and Decoder
Encoder:
6 identical layers (transformer block)
$\mathrm{output} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
for each layer: multi-head self-attention + position-wise fully connected feed-forward network (MLP); see the code sketch after this list
residual connection for each sublayer
layer normalization
$d_{model} = 512$, $d_k = d_v = d_{model}/h = 64$ (with $h = 8$ heads)
Decoder:
6 identical layers (transformer block)
for each layer: masked multi-head self-attention + multi-head self-attention + position-wise fully connected feed-f ...
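To make the encoder recipe above concrete (multi-head self-attention and a position-wise FFN, each wrapped as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$), here is a minimal post-norm encoder block in PyTorch. It follows the paper's default sizes; the class name and layout are my own sketch rather than the referenced implementations.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: output = LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(            # position-wise feed-forward (MLP)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # sublayer 1: multi-head self-attention + residual + layer norm
        a, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(a))
        # sublayer 2: position-wise FFN + residual + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x

# toy usage: batch of 2 sequences, length 10, d_model = 512
x = torch.randn(2, 10, 512)
y = EncoderBlock()(x)          # shape: (2, 10, 512)
```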
HMM
Hidden Markov Model (TODO)
sentence $\lambda = \lambda_1\lambda_2…\lambda_n$
predicted labels $o = o_1 o_2 … o_n$
output $= \arg\max_o P(o_1 o_2 … o_n \mid \lambda_1 \lambda_2 … \lambda_n)$
if all the words are independent of each other (independence assumption):
$$P(o_1 o_2 … o_n \mid \lambda_1 \lambda_2 … \lambda_n) = P(o_1 \mid \lambda_1) P(o_2 \mid \lambda_2) … P(o_n \mid \lambda_n)$$
$$P(o|\lambda) = \frac{P(\lambda | o)P(o)}{P(\lambda)}$$
$P(\lambda)$ is constant, so:
$$\arg\max_o P(o \mid \lambda) = \arg\max_o P(\lambda \mid o) P(o)$$
fo ...
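As a toy illustration of the decoding objective above, $\arg\max_o P(\lambda \mid o)P(o)$, under the per-word independence assumption, here is a minimal decoder in plain Python. The emission and prior tables are invented numbers for the example, not estimates from data.

```python
# Toy HMM-style decoding sketch under the independence assumption:
# pick, for each word lambda_i, the label o_i maximizing P(lambda_i | o_i) * P(o_i).
# The probability tables below are made-up numbers for illustration only.

prior = {"NOUN": 0.5, "VERB": 0.5}                     # P(o)
emission = {                                           # P(lambda | o)
    ("dogs", "NOUN"): 0.20, ("dogs", "VERB"): 0.01,
    ("bark", "NOUN"): 0.02, ("bark", "VERB"): 0.15,
}

def decode(sentence):
    labels = []
    for word in sentence:
        best = max(prior, key=lambda o: emission.get((word, o), 1e-6) * prior[o])
        labels.append(best)
    return labels

print(decode(["dogs", "bark"]))   # expected: ['NOUN', 'VERB']
```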
hexo
1. How to post a blog
Locate the local blog GitHub repo
$ cd /e/blog
Create a new post
$ hexo new "[name]"
Generate static files
$ hexo generate
Deploy to remote sites
$ hexo deploy
2. Other useful hexo commands
View your blog on a local server
$ hexo server
Clean the cache
$ hexo clean