https://arxiv.org/pdf/2109.10282.pdf
https://github.com/microsoft/unilm/tree/master/trocr
0. Abstract
An end-to-end text recognition model built on the Transformer architecture
1. Introduction
[Background]
- This paper focuses on text recognition (text detection is left as future work)
- Conventional text recognition :
- Encoder : CNN-based -----> Decoder : RNN-based
- image understanding -----> text generation
- visual signal -----> natural language tokens
- Attempts using the Transformer architecture exist, BUT the encoder still relies on a CNN backbone
- No large-scale pre-trained models are used ( ∵ parameters are copied over from models trained on synthetic/human-labeled datasets)
- Question: can a pre-trained image Transformer replace the CNN backbone?
[TrOCR]
- The first model to jointly leverage pre-trained image & text Transformers!
- FOR effective training
- encoder : initialized from a ViT-style model
- decoder : initialized from a BERT-style model
- Advantages
- can exploit large-scale unlabeled data ( ∵ leverages pre-trained image & text Transformer models)
- easy to implement & maintain ( ∵ no complex network architecture required)
- strong performance on both printed & handwritten text
- easily extended to multilingual recognition ( by leveraging a multilingual pre-trained model on the decoder side)
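As a concrete picture of the end-to-end pipeline, here is a minimal inference sketch with the Hugging Face transformers port of TrOCR; the checkpoint name and image path are illustrative assumptions, not details from the paper.

```python
# Minimal TrOCR inference sketch (assumes transformers + a local line image).
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("text_line.png").convert("RGB")      # a single text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)            # autoregressive decoding
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```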
2. TrOCR
2.1 Model Architecture
- encoder : produces representations of the image patches
- decoder : generates the wordpiece sequence
- + Attention : over the encoder output + the tokens generated so far
2.1.1 Encoder
- patch embeddings (see the code sketch below)
- resize the image to a fixed (H, W)
- split it into N = HW/P² square patches
- flatten each patch into a vector
- linearly project to D-dimensional vectors (D : hidden size of the Transformer)
- = patch embeddings
- [CLS] token (as in ViT, DeiT)
- aggregates information from all patch embeddings --> represents the whole image
- distillation token (when using DeiT)
- lets the model learn from a teacher model
--> these three kinds of embeddings
- are combined with learnable 1D position embeddings according to their absolute positions
- then pass through a stack of identical encoder layers, each consisting of:
- multi-head self-attention
- fully-connected feed-forward network
- residual connections
- layer normalization
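A minimal PyTorch sketch of the embedding pipeline above, assuming the base-size setup (384×384 input, 16×16 patches, D = 768); the class name is made up, and the distillation token is omitted for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=384, patch_size=16, in_chans=3, hidden_size=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = HW / P^2
        # a PxP convolution with stride P == "split into PxP patches,
        # flatten each, linearly project to D dimensions"
        self.proj = nn.Conv2d(in_chans, hidden_size,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        # learnable 1D position embeddings, indexed by absolute position
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, hidden_size))

    def forward(self, x):                        # x: (B, 3, 384, 384), already resized
        x = self.proj(x)                         # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the [CLS] token
        return x + self.pos_embed                # ready for the encoder layer stack
```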
+ Multi-head Attention
- Attention : assigns a different weight to each value -> returns the weighted sum as the output
- Multi-head Attention : information from different representation subspaces -> gathered jointly
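A compact sketch of the idea in PyTorch; the learned Q/K/V and output projections of a real multi-head attention layer are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # each query gets a weight for every key -> output is the weighted sum of values
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

B, T, D, h = 2, 5, 64, 4                        # batch, sequence, hidden, heads
x = torch.randn(B, T, D)
# multi-head: split D into h subspaces and attend in each one independently
q = k = v = x.view(B, T, h, D // h).transpose(1, 2)         # (B, h, T, D/h)
out = attention(q, k, v).transpose(1, 2).reshape(B, T, D)   # gather heads jointly
```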
- Unlike CNNs,
- no image-specific inductive biases
- the image is processed as a sequence of patches -> the model can attend differently to the whole image & to individual patches . . . [purpose]
2.1.2 Decoder
- the original Transformer decoder
- encoder-decoder attention
- [purpose] attend with different weights to the encoder output
- inserted between the multi-head self-attention and the fully-connected feed-forward network
- keys, values : encoder output // queries : decoder input
- attention masking
- [purpose] prevent the decoder from using more information during training than is available at prediction time
- output (position i) : can only attend to positions less than i
- i.e. the decoder is prevented from seeing future outputs
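A minimal sketch of this causal mask, assuming PyTorch's additive-mask convention where -inf entries vanish under the softmax.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # entry (i, j) = -inf for j > i, so position i never attends to the future
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

print(causal_mask(4))
# row i keeps 0.0 up to column i and -inf afterwards:
# only already-generated positions remain visible
```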
2.2 Model initialization
- both the encoder & decoder are initialized from models pre-trained on large-scale labeled and unlabeled datasets
2.2.1 Encoder initialization
- DeiT
- BEiT
2.2.2 Decoder initialization
- RoBERTa
- the structures do not match exactly -> layers absent from RoBERTa (e.g., encoder-decoder attention) are randomly initialized
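A sketch of this warm-start scheme using Hugging Face's VisionEncoderDecoderModel; the checkpoint names are illustrative stand-ins for the paper's DeiT/BEiT encoder and RoBERTa decoder.

```python
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/deit-base-distilled-patch16-384",  # image Transformer encoder
    "roberta-large",                             # text Transformer decoder
)
# cross-attention layers that RoBERTa does not have are randomly initialized
```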
2.3 Task Pipeline
- steps
- extract visual features
- predict wordpiece tokens <-- from the image + the context generated so far
- [EOS] token marks the end of the sequence
- the ground truth sequence is rotated backwards by one place (so the decoder always predicts the next token)
- training proceeds by comparing predictions against the ground truth sequence
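A toy illustration of this shifted, teacher-forced setup; the token strings are made up for illustration.

```python
target        = ["<s>", "Tr", "OCR", "!", "</s>"]   # ground truth + [EOS]
decoder_input = target[:-1]    # what the decoder consumes at each step
labels        = target[1:]     # what it must predict: the next token
for step, (seen, pred) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: given ...{seen!r} -> predict {pred!r}")
```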
2.4 Pre-training
- two stages
- 1st : synthesize a large-scale dataset of printed textline images
- 2nd : two relatively small, task-specific datasets (printed & handwritten)
2.5 Fine-tuning
- fine-tuned on printed & handwritten text recognition data
- output : based on BPE (subword pieces, no task-specific character vocabulary)
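A small demonstration that the outputs are subword pieces rather than characters, assuming the roberta-base byte-level BPE tokenizer as a stand-in for the decoder vocabulary.

```python
from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("handwritten receipts"))
# rare words decompose into subword pieces, so no task-specific
# character vocabulary is required
```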
2.6 Data Augmentation
- for each image, randomly choose which transformation to apply (sketched in code after this list)
- types of augmentation used :
- random rotation (-10 to 10 degrees)
- Gaussian blurring
- image dilation
- image erosion
- downscaling
- underlining
- keeping the original
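A rough sketch of this policy with PIL; the kernel sizes, blur radius, and the omission of underlining are illustrative choices, not values from the paper.

```python
import random
from PIL import Image, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    op = random.choice([
        lambda im: im.rotate(random.uniform(-10, 10), expand=True,
                             fillcolor="white"),                    # random rotation
        lambda im: im.filter(ImageFilter.GaussianBlur(radius=1)),   # Gaussian blurring
        lambda im: im.filter(ImageFilter.MinFilter(3)),  # thickens dark strokes (dilation)
        lambda im: im.filter(ImageFilter.MaxFilter(3)),  # thins dark strokes (erosion)
        lambda im: im.resize((max(1, im.width // 2),
                              max(1, im.height // 2))),             # downscaling
        lambda im: im,                                              # keep the original
    ])
    return op(img)
```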