
[OCR] TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

 

https://arxiv.org/pdf/2109.10282.pdf

https://github.com/microsoft/unilm/tree/master/trocr

 


 

0. Abstract 

An end-to-end text recognition model built on the Transformer architecture.

1. Introduction

 

[Background]

 

  • This paper focuses on text recognition (text detection is left as future work)
  • Existing text recognition pipelines:
    • Encoder : CNN-based   ----->  Decoder : RNN-based
    • image understanding   ----->  text generation
    • visual signal         ----->  natural language tokens
  • Attempts using the Transformer architecture exist, BUT the encoder still relies on a CNN backbone
    • no large-scale pre-trained model ( ∵ parameters are copied from models trained on synthetic/human-labeled datasets)
    • can a pre-trained image Transformer replace the CNN backbone?

 

[TrOCR]

 

  • The first model to jointly leverage pre-trained image & text Transformers!
  • FOR effective training
    • encoder : initialized from a ViT-style model
    • decoder : initialized from a BERT-style model
  • Advantages
    • can exploit large-scale unlabeled data ( ∵ it reuses pre-trained image & text Transformer models)
    • easy to implement & maintain ( ∵ no complex hand-crafted network structures needed)
    • strong performance on both printed & handwritten text
    • easily extended to multilingual settings (by leveraging a multilingual pre-trained model on the decoder side)

2. TrOCR

2.1 Model Architecture

  • encoder : produces representations of the image patches
  • + attention : attends over the encoder output + previously generated tokens
  • decoder : generates the word-piece sequence (minimal inference sketch below)
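
The released TrOCR checkpoints can be driven end-to-end through the Hugging Face transformers library; a minimal inference sketch, assuming the public microsoft/trocr-base-handwritten checkpoint and a single text-line image file (both assumptions, not stated above):

```python
# Minimal TrOCR inference sketch via Hugging Face transformers.
# The checkpoint name and the image path are assumptions for illustration.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line_image.png").convert("RGB")                       # one text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values  # resize + normalize

generated_ids = model.generate(pixel_values)                              # autoregressive wordpiece decoding
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```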

2.1.1 Encoder

  • patch embeddings (sketch after this list)
    1. resize the image to a fixed (H, W)
    2. split it into N = HW/P² square patches of size P×P
    3. flatten each patch into a vector
    4. linearly project to D-dimensional vectors (D: hidden size of the Transformer)
    5. = patch embeddings
  • [CLS] token (as in ViT, DeiT)
    • aggregates the information of all patch embeddings --> represents the whole image
  • distillation token (when using DeiT)
    • lets the model learn from the teacher model
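
The numbered steps above are the standard ViT-style patch embedding; a minimal PyTorch sketch, assuming a 384×384 input, 16×16 patches, and hidden size 768 (these sizes are illustrative, not taken from the summary):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Maps (B, 3, H, W) -> (B, N, D); resizing to a fixed (H, W) is assumed to happen beforehand."""
    def __init__(self, patch_size=16, in_channels=3, hidden_size=768):
        super().__init__()
        # A conv with kernel = stride = P splits the image into P x P patches and
        # linearly projects each flattened patch to D dimensions in a single step.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values):            # (B, 3, H, W), H and W divisible by P
        x = self.proj(pixel_values)             # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D) with N = HW / P^2
        return x

patches = PatchEmbedding()(torch.randn(1, 3, 384, 384))
print(patches.shape)                            # torch.Size([1, 576, 768])
```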

 

--> these three kinds of tokens are

  • summed with learnable 1D position embeddings based on their absolute positions
  • then passed through a stack of identical encoder layers, each built from (sketch below)
    1. multi-head self-attention
    2. a fully-connected feed-forward network
    3. residual connections
    4. layer normalization
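
A sketch of how the [CLS] / distillation tokens, the patch embeddings, and the learnable 1D position embeddings could be combined before the encoder stack; layer counts and sizes are illustrative assumptions, and PyTorch's TransformerEncoderLayer stands in for the ViT-style block:

```python
import torch
import torch.nn as nn

class TinyImageEncoder(nn.Module):
    def __init__(self, num_patches=576, hidden_size=768, num_layers=2, num_heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))    # whole-image summary
        self.dist_token = nn.Parameter(torch.zeros(1, 1, hidden_size))   # DeiT-style distillation token
        # learnable 1D position embeddings, one per absolute position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, hidden_size))
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                           batch_first=True)  # self-attn + FFN + residual + LayerNorm
        self.layers = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, patch_embeddings):                      # (B, N, D)
        b = patch_embeddings.size(0)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.dist_token.expand(b, -1, -1),
                            patch_embeddings], dim=1)          # (B, N + 2, D)
        return self.layers(tokens + self.pos_embed)            # stack of identical encoder layers

out = TinyImageEncoder()(torch.randn(1, 576, 768))
print(out.shape)                                               # torch.Size([1, 578, 768])
```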

 

+ Multi-head Attention

  • Attention : assigns a different attention weight to each value -> returns their weighted sum
  • Multi-head Attention : jointly gathers information from different representation subspaces (minimal sketch below)
  • Unlike a CNN,
    • no image-specific inductive biases
    • the image is processed as a sequence of patches -> the model can attend differently to the whole image & to individual patches . . . [purpose]
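
A minimal scaled dot-product multi-head attention sketch to make "a weighted sum over values, gathered jointly from several representation subspaces" concrete (generic Transformer code, not TrOCR-specific; sizes are placeholders):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, num_heads=8):
        super().__init__()
        self.h, self.d_k = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        def split(x, proj):                                # (B, L, D) -> (B, h, L, d_k)
            b, l, _ = x.shape
            return proj(x).view(b, l, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(query, self.q_proj), split(key, self.k_proj), split(value, self.v_proj)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # per-head attention scores
        out = scores.softmax(dim=-1) @ v                         # weighted sum of the values
        out = out.transpose(1, 2).reshape(query.size(0), query.size(1), -1)
        return self.out_proj(out)                                # merge the h subspaces

x = torch.randn(1, 578, 768)
print(MultiHeadAttention()(x, x, x).shape)                       # torch.Size([1, 578, 768])
```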

 

2.1.2 Decoder

  • the original Transformer decoder
  • encoder-decoder attention
    • [purpose] distribute different attention over the encoder output
    • inserted between the multi-head self-attention and the fully-connected feed-forward network
    • keys, values : encoder output // queries : decoder input
  • attention masking (sketch below)
    • [purpose] prevent the decoder from using more information during training than it has at prediction time
    • output (position i) : can only attend to positions less than i
    • i.e. the decoder is kept from seeing future outputs
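
A sketch of these two mechanisms with PyTorch's built-in decoder layer: queries come from the (embedded) decoder input, keys and values from the encoder output ("memory"), and a causal mask keeps position i from attending to later positions; all sizes are placeholders:

```python
import torch
import torch.nn as nn

d_model, num_heads = 768, 8
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=num_heads, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

memory = torch.randn(1, 578, d_model)          # encoder output -> keys & values
tgt = torch.randn(1, 10, d_model)              # embedded decoder input -> queries

# Causal (look-ahead) mask: -inf above the diagonal, so position i only sees positions <= i.
causal_mask = torch.triu(torch.full((10, 10), float("-inf")), diagonal=1)

out = decoder(tgt, memory, tgt_mask=causal_mask)
print(out.shape)                               # torch.Size([1, 10, 768])
```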

 

2.2 Model initialization

  • both the encoder & decoder are initialized from public models pre-trained on large-scale labeled and unlabeled datasets

2.2.1 Encoder initialization

  • DeiT 
  • BEiT

2.2.2 Decoder initialization

  • RoBERTa
    • the structures do not match exactly -> layers absent from RoBERTa (e.g. encoder-decoder attention) are randomly initialized (initialization sketch below)
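
With the Hugging Face transformers library, this initialization pattern can be reproduced roughly as below; the checkpoint names are illustrative assumptions, and the encoder-decoder attention layers missing from RoBERTa are exactly the ones the library initializes randomly:

```python
from transformers import VisionEncoderDecoderModel

# Encoder from a pre-trained image Transformer (BEiT here; a DeiT checkpoint would work
# similarly), decoder from RoBERTa. Cross-attention layers that RoBERTa does not have
# are randomly initialized. Checkpoint names are assumptions for illustration.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/beit-base-patch16-384", "roberta-base"
)
print(type(model.encoder).__name__, type(model.decoder).__name__)
```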

2.3 Task Pipeline

  • steps
    1. extract visual features
    2. predict wordpiece tokens <-- image + context (tokens generated so far)
  • [EOS] token (toy sketch below)
    • appended to the ground truth; the sequence is rotated backward by one place
    • training proceeds by comparing predictions against the ground-truth sequence
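
A toy sketch of the teacher-forcing setup implied by the two bullets above; the exact rotation / start-token scheme is an assumption made only to illustrate the one-place shift:

```python
# Toy teacher-forcing example with token strings instead of ids. The [EOS] token is
# appended and the sequence is rotated backward by one place, so the decoder input at
# step i is offset from the target it must predict. (Exact scheme is an assumption.)
ground_truth = ["HE", "LLO", "WORLD", "[EOS]"]          # target wordpiece sequence
decoder_input = ground_truth[-1:] + ground_truth[:-1]   # ["[EOS]", "HE", "LLO", "WORLD"]

for step, (inp, target) in enumerate(zip(decoder_input, ground_truth)):
    # at step i the decoder sees decoder_input[:i+1] and is trained to predict target
    print(f"step {step}: sees ...{inp:7s} -> predict {target}")
```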

2.4 Pre-training

  • two stages
    1. pre-train on a large-scale synthesized dataset
    2. continue pre-training on two relatively small datasets (printed & handwritten)

2.5 Fine-tuning

  • fine-tuned on downstream printed & handwritten text recognition data
  • output : based on BPE (byte-pair encoding) wordpieces (sketch below)
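
A small sketch of what BPE wordpiece output looks like, using RoBERTa's tokenizer as a stand-in (whether this matches TrOCR's exact vocabulary is an assumption):

```python
from transformers import RobertaTokenizer

# RoBERTa's byte-level BPE tokenizer, used here only to illustrate wordpiece output.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
pieces = tokenizer.tokenize("Handwriting recognition")
print(pieces)                              # subword pieces produced by byte-pair encoding
ids = tokenizer.convert_tokens_to_ids(pieces)
print(tokenizer.decode(ids))               # decoding the ids recovers the original text
```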

2.6 Data Augmentation

  • an image transformation is chosen at random for each image
  • augmentations applied (rough sketch below)
    • random rotation (-10 to 10 degrees)
    • Gaussian blurring
    • image dilation
    • image erosion
    • downscaling
    • underlining
    • keeping the original
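
A rough sketch of this augmentation policy with Pillow; the exact parameters (blur radius, kernel sizes, scale factor, underline position) are assumptions, since the summary does not specify them:

```python
import random
from PIL import Image, ImageDraw, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    """Apply one randomly chosen transformation from the list above (parameters assumed)."""
    choice = random.choice(["rotate", "blur", "dilate", "erode",
                            "downscale", "underline", "original"])
    if choice == "rotate":
        return img.rotate(random.uniform(-10, 10), expand=True, fillcolor="white")
    if choice == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=1.5))
    if choice == "dilate":
        return img.filter(ImageFilter.MinFilter(3))   # thickens dark strokes on a light background
    if choice == "erode":
        return img.filter(ImageFilter.MaxFilter(3))   # thins dark strokes on a light background
    if choice == "downscale":
        w, h = img.size
        return img.resize((w // 2, h // 2)).resize((w, h))
    if choice == "underline":
        out = img.copy()
        ImageDraw.Draw(out).line([(0, out.height - 3), (out.width, out.height - 3)],
                                 fill="black", width=2)
        return out
    return img                                         # keep the original
```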