

[NLP] GPT-2 : Language Models are Unsupervised Multitask Learners

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

 

0. Abstract

  • zero-shot setting BUT good performance & underfits WebText
  • learn to perform tasks from their naturally occurring demonstrations  -> promising!

1. Introduction

  • current ML systems -> sensitive to the data distribution -> "narrow experts"
  • GPT-2 -> a step toward more general systems which can perform many tasks
  • #Multitask learning
    • current approach : a combination of pre-training and supervised fine-tuning
    • systems that need only minimal or no supervised learning -> promising direction!
    • this paper : zero-shot setting -> can handle a wide range of tasks, ex. commonsense reasoning, reading comprehension...

2. Approach

  • LM (Language Modeling)
  • Task Conditioning
    • a general model should be able to perform many different tasks
    • -> condition on the task : p(output|input) becomes p(output|input, task) (see the prompt sketch after this list)
  • Unsupervised multitask learning
    • DESPITE the messiness of 'language in the wild',
      • -> model with sufficient capacity 
      • CAN infer and perform the tasks demonstrated in natural language sequences
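
A minimal sketch of the task-conditioning idea p(output|input, task), assuming the Hugging Face transformers library and the public "gpt2" checkpoint (neither appears in the paper): the task is just natural-language text prepended to the input, as in the paper's (translate to french, english text, french text) framing, so one pretrained LM can be steered to different tasks with no fine-tuning.

```python
# Task conditioning sketch: the "task" is expressed in natural language
# inside the prompt, so a single pretrained LM handles different tasks
# zero-shot. `transformers` / "gpt2" are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    # (task, input) written as plain text, following the paper's framing
    "translate English to French: The cat sat on the mat. =>",
    "Article: The city council approved the new budget on Tuesday.\nTL;DR:",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=30, do_sample=False)
    print(out[0]["generated_text"])
```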

2.1. Training Dataset

  • FOCUS on building as large & diverse a dataset as possible
  • WebText

2.2. Input Representation

    • LM : should be able to compute the probability of (and also generate) any string
      • Current : word-level LMs (ex. One Billion Word Benchmark)
      • This paper : byte-level LMs on WebText
    • BPE (Byte Pair Encoding)
      • interpolates between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences
      • a Unicode-level BPE would require the full space of Unicode symbols to model all Unicode strings
        • -> base token vocabulary of over 130,000
        • =/= byte-level ver. : base token vocabulary of only 256
      • naive byte-level BPE produces many versions of common words (ex. dog. dog! dog?) -> inefficient allocation of limited vocabulary slots and model capacity
        • SOL : prevent BPE from merging across character categories for any byte sequence, with an exception for spaces (see the byte-level BPE sketch below)
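
Below is a simplified, illustrative sketch of byte-level BPE (a toy version, not OpenAI's actual tokenizer, and it omits the character-category merge restriction): start from the 256 raw byte values and repeatedly merge the most frequent adjacent pair into a new token.

```python
# Toy byte-level BPE: base vocabulary = 256 byte values, so any Unicode
# string can be represented without an <UNK> token; merges then interpolate
# toward word-level tokens for frequent sequences.
from collections import Counter

def byte_pair_merges(text: str, num_merges: int = 10):
    tokens = list(text.encode("utf-8"))      # start from raw bytes (ids 0..255)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                         # nothing worth merging
            break
        new_id = 256 + len(merges)            # new tokens live above the byte range
        merges.append(((a, b), new_id))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = byte_pair_merges("the dog. the dog! the dog?", num_merges=5)
print(tokens, merges)
```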

2.3. Model

  • Transformer-based model
  • similar to GPT-1, with a few modifications
    • scale the weights of residual layers at initialization by a factor of 1/sqrt(N), N = number of residual layers (see the sketch after this list)
    • context size 512 --> 1024 tokens
    • larger batch size of 512
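
A minimal PyTorch sketch of that residual-scaling rule (PyTorch and the specific layer shapes are assumptions; neither the post nor the paper gives code):

```python
# GPT-2 init tweak: weights of projections that write into the residual
# stream are scaled by 1/sqrt(N), N = number of residual layers, so the
# residual-stream variance does not grow with depth.
import math
import torch.nn as nn

n_residual_layers = 24   # e.g. 12 Transformer blocks x 2 residual connections (attn + MLP)
scale = 1.0 / math.sqrt(n_residual_layers)

proj = nn.Linear(768, 768)                         # stand-in for a residual projection
nn.init.normal_(proj.weight, mean=0.0, std=0.02)   # base GPT-style init
nn.init.zeros_(proj.bias)
proj.weight.data.mul_(scale)                       # GPT-2 residual scaling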

3. Experiments

3.1. Language Modeling

  • FOCUS : understanding how WebText LM’s perform at zero-shot domain transfer on the primary task they are trained for – language modeling
  • the model operates on bytes, so no lossy pre-processing or tokenization is needed --> can evaluate it on ANY language model benchmark (see the perplexity sketch below)
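
A sketch of such a zero-shot evaluation: compute the perplexity GPT-2 assigns to an arbitrary piece of text. It assumes the Hugging Face transformers library and the released "gpt2" checkpoint, neither of which appears in the original post.

```python
# Zero-shot LM evaluation sketch: perplexity of arbitrary text under GPT-2.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean next-token cross-entropy.
    out = model(**enc, labels=enc["input_ids"])

print("perplexity:", math.exp(out.loss.item()))
```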

3.2. Children’s Book Test

3.3. LAMBADA

3.4. Winograd Schema Challenge

3.5. Reading Comprehension

3.6. Summarization

3.7. Translation

3.8. Question Answering

4. Generalization vs Memorization

  • train & test set overlap -> over-reporting of generalization performance
  • -> important to analyze how much overlap there is (see the n-gram overlap sketch below)
  • --> WebText : overlap exists BUT small, similar in scale to the near-duplicates already found in common image datasets
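
The paper checks this with Bloom filters over 8-grams of normalized WebText training text. Below is a simplified stand-in (a plain Python set instead of a Bloom filter; the normalization is a rough approximation):

```python
# Simplified 8-gram overlap check between "training" and "test" text.
import re

def ngrams(text: str, n: int = 8):
    words = re.findall(r"[a-z0-9]+", text.lower())   # rough normalization
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_rate(train_text: str, test_text: str, n: int = 8) -> float:
    train_grams = ngrams(train_text, n)
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)

train = "the cat sat on the mat while the dog slept near the warm fire all night long"
test = "the dog slept near the warm fire all night long and dreamed of bones"
print(f"8-gram overlap: {overlap_rate(train, test):.1%}")
```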

5. Related Work

  • this paper
    • measured the performance of larger language models trained on larger datasets
    • prior work : LM pre-training is helpful when followed by fine-tuning on downstream tasks

 

6. Discussion

  • unsupervised task learning :
    • promising area of research to explore
    • learn to perform tasks directly, without the need for supervised adaptation or modification
  • whether GPT-2's additional training data & capacity are sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT -> still unclear!

7. Conclusion

  • large LM + large & diverse dataset
    • perform well ACROSS many domains
    • EVEN in zero-shot setting