GPT
GPT (TODO)
BERT
BERT (TODO)
Delta Learning
Delta Learning Intro (TODO)
Transformer Code
Transformer (Code). Based on Yu-Hsiang Huang's implementation, graykode's implementation, and DASOU's video.
LSTM
Long Short-Term Memory (LSTM)
In general, LSTM is like an RNN with a "notebook" to record information. The pink signs in the blocks belong to the "notebook", and the yellow ones are gates. Compared to RNN, LSTM extends the memory of short-term information, so it is called long short-term memory.
IDEA: multilayer LSTM: an RNN with multiple "notebooks". The first "notebook" records information from the RNN outputs; the second "notebook" records information from the first "notebook"… This architecture finally fo ...
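As a concrete companion to the description above, here is a minimal sketch of a single LSTM step in PyTorch, where the cell state plays the role of the "notebook" and the sigmoid factors are the gates. The function `lstm_step` and the toy sizes are my own illustration, not taken from any particular implementation.

```python
import torch

# A minimal sketch of one LSTM step (toy sizes). The cell state c is the
# "notebook"; f, i, o are the forget/input/output gates.
def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)
    hidden = h_prev.shape[-1]
    z = x @ W.T + h_prev @ U.T + b          # all gate pre-activations at once
    f, i, o, g = z.split(hidden, dim=-1)    # forget, input, output, candidate
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g                  # update the "notebook" (cell state)
    h = o * torch.tanh(c)                   # short-term output read from it
    return h, c

# toy usage with random parameters
x = torch.randn(1, 8)                       # input at one time step
h0, c0 = torch.zeros(1, 16), torch.zeros(1, 16)
W, U, b = torch.randn(64, 8), torch.randn(64, 16), torch.zeros(64)
h1, c1 = lstm_step(x, h0, c0, W, U, b)
```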
RNN
RNN (Recurrent Neural Network)
1. Structure
RNN feeds its output back as part of the input at the next time step, so it can capture temporal information. The main part $S$ of the model is a simple NN.
The weights $W, U, V$ are shared across all time steps.
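A minimal sketch of this recurrence, assuming the common form $s_t = \tanh(U x_t + W s_{t-1})$ and $o_t = V s_t$; the sizes and random weights are toy values for illustration only.

```python
import numpy as np

# Minimal vanilla-RNN sketch: the same U, W, V are reused at every time step.
rng = np.random.default_rng(0)
U = rng.normal(size=(16, 8))    # input  -> hidden
W = rng.normal(size=(16, 16))   # hidden -> hidden (the recurrent loop)
V = rng.normal(size=(4, 16))    # hidden -> output

def rnn_forward(xs, s0=None):
    s = np.zeros(16) if s0 is None else s0
    outputs = []
    for x in xs:                       # sequential: step t needs the state from t-1
        s = np.tanh(U @ x + W @ s)     # hidden state S
        outputs.append(V @ s)          # output at this time step
    return outputs, s

outs, last_state = rnn_forward([rng.normal(size=8) for _ in range(5)])
```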
2. Problems
Gradient Vanishing
Gradient vanishing in RNN means that it cannot memorize long-term information, because it gets overwritten by recent information. LSTM is a good remedy for this problem.
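A tiny numerical sketch of why this happens: backpropagating through many time steps multiplies the gradient by the recurrent Jacobian again and again, and when those factors are smaller than one the long-range signal decays exponentially. The matrix scale and the 0.9 stand-in for $\tanh'$ are arbitrary toy choices.

```python
import numpy as np

# Toy demonstration of gradient vanishing: the gradient w.r.t. an early hidden
# state is (roughly) the product of per-step Jacobians W^T * diag(tanh').
rng = np.random.default_rng(0)
W = 0.3 * rng.normal(size=(16, 16)) / np.sqrt(16)   # recurrent weights (toy scale)
grad = np.ones(16)                                   # gradient arriving at the last step
for t in range(50):                                  # 50 time steps back
    grad = W.T @ grad * 0.9                          # 0.9 stands in for tanh'
print(np.linalg.norm(grad))                          # nearly zero: long-range signal is lost
```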
Non-parallel computing
RNN cannot be computed in parallel, because the next ...
Prompt
Prompt Learning Intro (TODO)
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
http://pretrain.nlpedia.ai/
1. Bigger model, harder fine-tuning
T5: 11B parameters
GPT-3: 175B parameters
Fine-tuning trains a new model from a pre-trained model on domain-specific data. This can be hard and expensive when the model is huge. How can we avoid directly training the whole model?
2. Intro of prompt learning
Prompt learning bridges t ...
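As a tiny illustration of the idea (reformulate the task as a cloze question that a pre-trained LM can already answer, instead of fine-tuning the whole model), here is a hedged sketch using the Hugging Face `transformers` fill-mask pipeline. The template and the label words are my own toy choices, not taken from the survey.

```python
from transformers import pipeline  # assumes the `transformers` package is installed

# Toy prompt-learning sketch: sentiment classification as a cloze task.
# Instead of fine-tuning, wrap the input in a template and let the
# pre-trained masked LM fill the blank.
fill = pipeline("fill-mask", model="bert-base-uncased")

review = "I love this movie."
template = f"{review} Overall it was a [MASK] movie."

label_words = {"positive": "great", "negative": "terrible"}  # toy verbalizer
scores = {lab: 0.0 for lab in label_words}
for pred in fill(template, top_k=50):            # top_k: recent transformers versions
    for lab, word in label_words.items():
        if pred["token_str"].strip() == word:
            scores[lab] = pred["score"]

print(max(scores, key=scores.get))               # expected: "positive"
```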
Transformer
Transformer
1. Encoder and Decoder
Encoder:
6 identical layers (transformer block)
$\mathrm{output} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
for each layer: multi-head self-attention + position-wise fully connected feed-forward network (MLP); see the code sketch after this list
residual connection for each sublayer
layer normalization
$d_{model} = 512$, $d_k = d_v = d_{model}/h = 64$ (with $h = 8$ heads)
Decoder:
6 identical layers (transformer block)
for each layer: masked multi-head self-attention + multi-head self-attention + position-wise fully connected feed-f ...
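To make the encoder recipe above concrete (multi-head self-attention and a position-wise FFN, each wrapped as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$), here is a minimal post-norm encoder block in PyTorch. It follows the paper's default sizes; the class name and layout are my own sketch rather than the referenced implementations.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: output = LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(            # position-wise feed-forward (MLP)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # sublayer 1: multi-head self-attention + residual + layer norm
        a, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(a))
        # sublayer 2: position-wise FFN + residual + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x

# toy usage: batch of 2 sequences, length 10, d_model = 512
x = torch.randn(2, 10, 512)
y = EncoderBlock()(x)          # shape: (2, 10, 512)
```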
HMM
Hidden Markov Model (TODO)
sentence $\lambda = \lambda_1\lambda_2…\lambda_n$
predicted labels $o = o_1 o_2 … o_n$
output $= \arg\max_o P(o_1 o_2 … o_n \mid \lambda_1 \lambda_2 … \lambda_n)$
if all the words are independent of each other (independence assumption):
$$P(o_1 o_2 … o_n \mid \lambda_1 \lambda_2 … \lambda_n) = P(o_1 \mid \lambda_1) P(o_2 \mid \lambda_2) … P(o_n \mid \lambda_n)$$
$$P(o|\lambda) = \frac{P(\lambda | o)P(o)}{P(\lambda)}$$
$P(\lambda)$ is constant, so:
$$\arg\max_o P(o \mid \lambda) = \arg\max_o P(\lambda \mid o) P(o)$$
fo ...
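As a toy illustration of the decoding objective above, $\arg\max_o P(\lambda \mid o)P(o)$, under the per-word independence assumption, here is a minimal decoder in plain Python. The emission and prior tables are invented numbers for the example, not estimates from data.

```python
# Toy HMM-style decoding sketch under the independence assumption:
# pick, for each word lambda_i, the label o_i maximizing P(lambda_i | o_i) * P(o_i).
# The probability tables below are made-up numbers for illustration only.

prior = {"NOUN": 0.5, "VERB": 0.5}                     # P(o)
emission = {                                           # P(lambda | o)
    ("dogs", "NOUN"): 0.20, ("dogs", "VERB"): 0.01,
    ("bark", "NOUN"): 0.02, ("bark", "VERB"): 0.15,
}

def decode(sentence):
    labels = []
    for word in sentence:
        best = max(prior, key=lambda o: emission.get((word, o), 1e-6) * prior[o])
        labels.append(best)
    return labels

print(decode(["dogs", "bark"]))   # expected: ['NOUN', 'VERB']
```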
hexo
1. How to post a blog
Locate the local blog GitHub repo
$ cd /e/blog
Create a new post
$ hexo new "[name]"
Generate static files
$ hexo generate
Deploy to remote sites
$ hexo deploy
2. Other useful hexo commands
View your blog on a local server
$ hexo server
Clean the cache
$ hexo clean