

[NLP] GPT-2 : Language Models are Unsupervised Multitask Learners

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

 

0. Abstract

  • zero-shot setting BUT good performance & underfits WebText
  • learn to perform tasks from their naturally occurring demonstrations  -> promising!

1. Introduction

  • current ML systems -> sensitive to the data distribution -> "narrow experts"
  • GPT-2 -> a step toward more general systems which can perform many tasks
  • #Multitask learning
    • current approach : a combination of pre-training and supervised fine-tuning
    • systems that need only minimal or no supervised learning -> promising direction!
    • this paper : zero-shot setting -> can handle a wide range of tasks, ex. commonsense reasoning, reading comprehension...

2. Approach

  • LM (Language Modeling)
  • Task Conditioning
    • a general model should be able to perform many different tasks
    • -> condition on the task : p(output|input) becomes p(output|input, task) (see the prompt sketch after this list)
  • Unsupervised multitask learning
    • DESPITE the messiness of 'language in the wild',
      • -> model with sufficient capacity 
      • CAN infer and perform the tasks demonstrated in natural language sequences
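
A minimal sketch of the task-conditioning idea p(output|input, task), assuming the Hugging Face transformers library and the public "gpt2" checkpoint (neither appears in the paper): the task is just natural-language text prepended to the input, as in the paper's (translate to french, english text, french text) framing, so one pretrained LM can be steered to different tasks with no fine-tuning.

```python
# Task conditioning sketch: the "task" is expressed in natural language
# inside the prompt, so a single pretrained LM handles different tasks
# zero-shot. `transformers` / "gpt2" are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    # (task, input) written as plain text, following the paper's framing
    "translate English to French: The cat sat on the mat. =>",
    "Article: The city council approved the new budget on Tuesday.\nTL;DR:",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=30, do_sample=False)
    print(out[0]["generated_text"])
```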

2.1. Training Dataset

  • FOCUS on building as large & diverse a dataset as possible
  • WebText

2.2. Input Representation

    • LM : should be able to compute the probability of (and also generate) any string
      • Current : word-level LMs (ex. One Billion Word Benchmark)
      • This paper : byte-level LMs on WebText
    • BPE (Byte Pair Encoding)
      • interpolates between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences
      • a Unicode-level BPE would require the full space of Unicode symbols to model all Unicode strings
        • -> base token vocabulary of over 130,000
        • =/= byte-level ver. : base token vocabulary of only 256
      • naive byte-level BPE produces many versions of common words (ex. dog. dog! dog?) -> inefficient allocation of limited vocabulary slots and model capacity
        • SOL : prevent BPE from merging across character categories for any byte sequence, with an exception for spaces (see the byte-level BPE sketch below)
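
Below is a simplified, illustrative sketch of byte-level BPE (a toy version, not OpenAI's actual tokenizer, and it omits the character-category merge restriction): start from the 256 raw byte values and repeatedly merge the most frequent adjacent pair into a new token.

```python
# Toy byte-level BPE: base vocabulary = 256 byte values, so any Unicode
# string can be represented without an <UNK> token; merges then interpolate
# toward word-level tokens for frequent sequences.
from collections import Counter

def byte_pair_merges(text: str, num_merges: int = 10):
    tokens = list(text.encode("utf-8"))      # start from raw bytes (ids 0..255)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                         # nothing worth merging
            break
        new_id = 256 + len(merges)            # new tokens live above the byte range
        merges.append(((a, b), new_id))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = byte_pair_merges("the dog. the dog! the dog?", num_merges=5)
print(tokens, merges)
```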

2.3. Model

  • Transformer-based model
  • similar to GPT-1, with a few modifications
    • scale the weights of residual layers at initialization by a factor of 1/sqrt(N), N = number of residual layers (see the sketch after this list)
    • context size 512 --> 1024 tokens
    • larger batch size of 512
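
A minimal PyTorch sketch of that residual-scaling rule (PyTorch and the specific layer shapes are assumptions; neither the post nor the paper gives code):

```python
# GPT-2 init tweak: weights of projections that write into the residual
# stream are scaled by 1/sqrt(N), N = number of residual layers, so the
# residual-stream variance does not grow with depth.
import math
import torch.nn as nn

n_residual_layers = 24   # e.g. 12 Transformer blocks x 2 residual connections (attn + MLP)
scale = 1.0 / math.sqrt(n_residual_layers)

proj = nn.Linear(768, 768)                         # stand-in for a residual projection
nn.init.normal_(proj.weight, mean=0.0, std=0.02)   # base GPT-style init
nn.init.zeros_(proj.bias)
proj.weight.data.mul_(scale)                       # GPT-2 residual scaling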

3. Experiments

3.1. Language Modeling

  • FOCUS : understanding how WebText LM’s perform at zero-shot domain transfer on the primary task they are trained for – language modeling
  • the model operates on bytes, so no lossy pre-processing or tokenization is needed --> can evaluate it on ANY language model benchmark (see the perplexity sketch below)
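
A sketch of such a zero-shot evaluation: compute the perplexity GPT-2 assigns to an arbitrary piece of text. It assumes the Hugging Face transformers library and the released "gpt2" checkpoint, neither of which appears in the original post.

```python
# Zero-shot LM evaluation sketch: perplexity of arbitrary text under GPT-2.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean next-token cross-entropy.
    out = model(**enc, labels=enc["input_ids"])

print("perplexity:", math.exp(out.loss.item()))
```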

3.2. Children’s Book Test

3.3. LAMBADA

3.4. Winograd Schema Challenge

3.5. Reading Comprehension

3.6. Summarization

3.7. Translation

3.8. Question Answering

4. Generalization vs Memorization

  • train & test set overlap -> over-reporting of generalization performance
  • -> important to analyze how much overlap there is (see the n-gram overlap sketch below)
  • --> WebText : overlap exists BUT small, similar in scale to the near-duplicates already found in common image datasets
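
The paper checks this with Bloom filters over 8-grams of normalized WebText training text. Below is a simplified stand-in (a plain Python set instead of a Bloom filter; the normalization is a rough approximation):

```python
# Simplified 8-gram overlap check between "training" and "test" text.
import re

def ngrams(text: str, n: int = 8):
    words = re.findall(r"[a-z0-9]+", text.lower())   # rough normalization
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_rate(train_text: str, test_text: str, n: int = 8) -> float:
    train_grams = ngrams(train_text, n)
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)

train = "the cat sat on the mat while the dog slept near the warm fire all night long"
test = "the dog slept near the warm fire all night long and dreamed of bones"
print(f"8-gram overlap: {overlap_rate(train, test):.1%}")
```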

5. Related Work

  • this paper
    • measured the performance of larger language models trained on larger datasets
    • prior work : LM pre-training is helpful when followed by fine-tuning on downstream tasks

 

6. Discussion

  • unsupervised task learning :
    • promising area of research to explore
    • learn to perform tasks directly, without the need for supervised adaptation or modification
  • whether GPT-2's additional training data & capacity are sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT -> still unclear!

7. Conclusion

  • large LM + large & diverse dataset
    • perform well ACROSS many domains
    • EVEN in zero-shot setting