0. Abstract
- strong performance in a zero-shot setting, yet the model still underfits WebText
- LMs begin to learn tasks from their naturally occurring demonstrations in text -> promising!
1. Introduction
- current ML systems : sensitive to changes in the data distribution -> narrow experts rather than generalists
- GPT-2 : a step toward more general systems that can perform many tasks
- #Multitask learning
- current approach : utilize a combination of pre-training and supervised finetuning
- performing tasks with only minimal or no supervised learning -> promising direction!
- this paper : zero-shot task performance -> handles a wide range of tasks, ex. commonsense reasoning, reading comprehension, ...
2. Approach
- LM (Language Modeling) : everything framed as next-token prediction over text sequences
- Task Conditioning
- a general model should be able to perform many different tasks
- -> conditional distribution : p(output|input) -> p(output|input, task) (written out after this list)
- Unsupervised multitask learning **
- DESPITE the messiness of 'language in the wild'
- -> a model with sufficient capacity
- CAN infer and perform the tasks demonstrated in natural language sequences
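The two distributions mentioned in the Task Conditioning item above, written out as in the paper (standard autoregressive factorization plus the task-conditioned objective):

```latex
% Language modeling: factorize the joint probability of a sequence
% x = (s_1, s_2, \ldots, s_n) into left-to-right conditionals.
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})

% Single-task learning estimates p(output | input);
% a general (multitask) system should condition on the task as well:
p(\mathrm{output} \mid \mathrm{input}) \;\longrightarrow\; p(\mathrm{output} \mid \mathrm{input}, \mathrm{task})
```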
2.1. Training Dataset
- FOCUS on building as large & diverse a dataset as possible
- WebText : outbound links from Reddit posts with at least 3 karma; slightly over 8 million documents, ~40 GB of text (Wikipedia removed)
2.2. Input Representation
- LM : should be able to compute the probability of (and also generate) any string
- Current : word-level LMs outperform byte-level LMs (ex. on the One Billion Word Benchmark)
- This paper : observed the same gap with standard byte-level LMs on WebText -> use byte-level BPE instead
- BPE (Byte Pair Encoding)
- interpolates between word level inputs / for frequent symbol sequences
- and character level inputs / for infrequent symbol sequences.
- BPE over Unicode code points would require the full space of Unicode symbols to model all Unicode strings
- -> base vocabulary of over 130,000 (before any multi-symbol tokens)
- =/= byte-level version : base vocabulary of only 256
- naive byte-level BPE merges many versions of common words (ex. dog. dog! dog?) -> inefficient allocation of limited vocabulary slots and model capacity
- SOLUTION : prevent BPE from merging across character categories for any byte sequence (with an exception for spaces) ** (minimal BPE sketch below)
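A minimal sketch of the BPE merge-learning loop described above. The function names and toy corpus are mine, and this is character-level for readability; GPT-2's actual tokenizer starts from the 256 byte values and additionally blocks merges across character categories.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {tuple_of_symbols: frequency} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus of word frequencies, split into characters.
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
for _ in range(10):                      # learn 10 merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # merge the most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)
```

Frequent words end up as single tokens after a few merges, while rare words fall back to shorter subword pieces: the word/character interpolation noted above.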
2.3. Model
- Transformer-based model
- similar to GPT-1, with layer normalization moved to the input of each sub-block and an additional layer norm after the final self-attention block
- scale the weights of residual layers at initialization by a factor of 1/sqrt(N), where N is the number of residual layers (see sketch after this list)
- context size : 512 --> 1024 tokens
- larger batch size : 512
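A rough sketch of the two architectural notes above: pre-layer-norm residual blocks and scaling the residual-path output weights at initialization by 1/sqrt(N). PyTorch, with my own module names and my reading of "N = number of residual layers"; not the official GPT-2 code.

```python
import math
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN Transformer block: LayerNorm moved to the input of each sub-block."""
    def __init__(self, d_model, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                                   # LN before attention
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))                     # LN before MLP
        return x

def scale_residual_weights(blocks):
    """Scale weights of layers feeding the residual stream by 1/sqrt(N),
    with N = number of residual layers (2 per block: attn and MLP out-projections)."""
    n_residual = 2 * len(blocks)
    for b in blocks:
        b.attn.out_proj.weight.data.mul_(1.0 / math.sqrt(n_residual))
        b.mlp[2].weight.data.mul_(1.0 / math.sqrt(n_residual))

blocks = nn.ModuleList(Block(768, 12) for _ in range(12))  # GPT-2 small-ish shape
scale_residual_weights(blocks)
```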
3. Experiments
3.1. Language Modeling
- FOCUS : understanding how WebText LM’s perform at zero-shot domain transfer on the primary task they are trained for – language modeling
- operates at the byte level, so no lossy pre-processing or tokenization is needed --> can evaluate it on ANY language model benchmark (perplexity / bits-per-byte sketch below)
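For reference, the two metrics these benchmarks typically report, as a small sketch. Formulas are the standard definitions; the function names and the example numbers are illustrative only.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_byte(token_logprobs, n_bytes):
    """Total negative log2-likelihood divided by byte count, so models with
    different tokenizations remain comparable."""
    return -sum(lp / math.log(2) for lp in token_logprobs) / n_bytes

# hypothetical per-token log-probs for a short string of 16 bytes
logps = [-2.3, -0.7, -1.1, -0.2]
print(perplexity(logps), bits_per_byte(logps, n_bytes=16))
```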
3.2. Children’s Book Test
3.3. LAMBADA
3.4. Winograd Schema Challenge
3.5. Reading Comprehension
3.6. Summarization
3.7. Translation
3.8. Question Answering
4. Generalization vs Memorization
- train & test set overlap can lead to over-reporting of generalization performance
- -> important to analyze how much it overlaps!
- --> WebText : overlap with benchmark test sets exists BUT is small, on par with the overlap standard benchmarks already have between their own train/test splits (overlap-check sketch below)
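A simplified version of the overlap check: the paper builds Bloom filters over 8-grams of normalized WebText training text; the sketch below uses a plain Python set instead, and the function names are mine.

```python
def ngrams(text, n=8):
    """Lower-cased word 8-grams of a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_rate(train_docs, test_doc, n=8):
    """Fraction of the test document's 8-grams that also occur in the training data."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    test_ngrams = ngrams(test_doc, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & train_ngrams) / len(test_ngrams)
```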
5. Related Work
- this paper
- measured the performance of larger language models trained on larger datasets
- LM pre-training : already shown to be helpful when models are fine-tuned for downstream tasks
6. Discussion
- unsupervised task learning :
- promising area of research to explore
- learn to perform tasks directly without the need for supervised adaptation or modification
- unclear whether GPT-2's additional training data & capacity are sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT **
7. Conclusion
- large LM + large & diverse dataset
- perform well ACROSS many domains
- EVEN in a zero-shot setting