https://arxiv.org/pdf/2111.15664v1.pdf
0. Abstract
- End-to-end (E2E) model that does away with the OCR framework
- Synthetic document image generator -> reduces dependence on large-scale data
1. Introduction
- Semi-structured documents
- Conventional VDU (Visual Document Understanding)
- Usually OCR-based
- Composed of three separate modules : text detection, text recognition, parsing
- Problem 1: OCR is expensive and is not always available
- Problem 2: OCR errors negatively influence subsequent processes
- (Especially severe for Korean and Japanese; post-OCR correction is not a fundamental fix either)
- Donut : complete (end-to-end, with no separate OCR module)
- SynthDoG : reduces dependence on large-scale real documents
2. Method
2.1 Preliminary : background
- CV based ---> BERT ---> CV+NLP
2.2 Document Understanding Transformer = Donut
- Aim : simple architecture based on the transformer
- architecture
- visual encoder : document image
- textual decoder : maps the image features to a sequence of tokens
- the token sequence is converted one-to-one
- into the desired structured format
[ Encoder ]
input document image x ∈ R^{H×W×C}
|
|
Visual encoder : CNN-based or Transformer-based (Swin Transformer*)
|
V
a set of embeddings {z_i | z_i ∈ R^d, 1 ≤ i ≤ n}
- n : feature map size or the number of image patches
- d : the dimension of the latent vectors of the encoder
Swin transformer
- splits the input image x into non-overlapping patches
- the patches pass through Swin Transformer blocks
- each block applies local self-attention within shifted windows
- patch merging layers are applied to the patch tokens
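The patch splitting and patch merging steps above can be illustrated with a small pure-Python sketch. It is only a toy under stated assumptions: the function names `split_into_patches` and `merge_patches` and the 8×8×3 image are hypothetical, the image is a nested list rather than a tensor, and the linear projection that a real Swin Transformer applies after merging, as well as the windowed self-attention itself, are left out.

```python
def split_into_patches(image, patch):
    """Split an H x W x C image (nested lists) into non-overlapping
    patch x patch tiles and flatten each tile into one token vector."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    vec.extend(image[r][c])  # append the C channel values
            tokens.append(vec)
    return tokens


def merge_patches(tokens, grid_h, grid_w):
    """Patch merging: concatenate every 2 x 2 neighbourhood of tokens.
    The token count drops by 4x and the vector length grows by 4x.
    (A real layer would follow this with a linear projection; omitted here.)"""
    merged = []
    for i in range(0, grid_h, 2):
        for j in range(0, grid_w, 2):
            vec = []
            for di in (0, 1):
                for dj in (0, 1):
                    vec.extend(tokens[(i + di) * grid_w + (j + dj)])
            merged.append(vec)
    return merged


# Hypothetical toy sizes: H = W = 8, C = 3, patch size 2.
H, W, C, P = 8, 8, 3, 2
image = [[[0.0] * C for _ in range(W)] for _ in range(H)]
tokens = split_into_patches(image, P)          # n = (H/P) * (W/P) tokens
merged = merge_patches(tokens, H // P, W // P)
print(len(tokens), len(tokens[0]), len(merged), len(merged[0]))
```

In this toy setting n is the number of patches, (H/P)·(W/P), and each merging step trades spatial resolution for a longer vector per token.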
[ Decoder ]
- decoder : representations {z} --> generates token sequence
- BART (multilingual, 4 layers)
[ Model Input ]
- training : teacher forcing (the decoder is fed the ground-truth tokens as input)
- test : generates a token sequence given a prompt (like GPT-3)
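The difference between the two modes can be sketched as follows. This is a toy illustration, not the Donut implementation: the decoder is replaced by a hypothetical lookup table (`BIGRAM`), and the prompt token `<classification>` and end token `[EOS]` are made-up names chosen for the example.

```python
def teacher_forcing_pairs(prompt, ground_truth):
    """Training: at each step the decoder input is the ground-truth prefix,
    never the model's own prediction. Returns (input token, target token) pairs."""
    decoder_input = [prompt] + ground_truth[:-1]
    return list(zip(decoder_input, ground_truth))


def generate(next_token, prompt, end_token, max_len=16):
    """Test time: start from the prompt and feed each prediction back in
    until the end token appears or max_len tokens have been produced."""
    sequence = [prompt]
    while len(sequence) <= max_len:
        token = next_token(sequence)
        sequence.append(token)
        if token == end_token:
            break
    return sequence[1:]  # drop the prompt


# Hypothetical stand-in for a trained decoder: next token depends on the last one.
BIGRAM = {
    "<classification>": "[START_class]",
    "[START_class]": "memo",
    "memo": "[END_class]",
    "[END_class]": "[EOS]",
}


def toy_next_token(sequence):
    return BIGRAM[sequence[-1]]


target = ["[START_class]", "memo", "[END_class]", "[EOS]"]
pairs = teacher_forcing_pairs("<classification>", target)
generated = generate(toy_next_token, "<classification>", "[EOS]")
print(pairs[0], generated)
```

With teacher forcing every step is trained against the correct prefix, whereas at test time an early mistake would be carried forward into all later steps.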
[ Output Conversion ]
- JSON format
- two special tokens, [START_∗] and [END_∗], are added per field ∗
- if the output is wrongly structured (e.g. only [START_name] with no [END_name]), the field “name” is treated as lost
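The conversion rule above can be sketched with a regular expression. This is a minimal version assuming flat (non-nested) fields and a decoder output that is already a single string; the function name `tokens_to_json` and the sample field names are hypothetical.

```python
import re

# [START_x] ... [END_x]; the backreference \1 forces both tags to name the same field.
FIELD = re.compile(r"\[START_(\w+)\](.*?)\[END_\1\]", flags=re.DOTALL)


def tokens_to_json(sequence):
    """Convert a generated token string into a dict.
    A field whose [END_*] token is missing never matches, so it is dropped."""
    result = {}
    for match in FIELD.finditer(sequence):
        result[match.group(1)] = match.group(2).strip()
    return result


well_formed = "[START_name]Donut[END_name][START_price]3.5[END_price]"
malformed = "[START_name]Donut[START_price]3.5[END_price]"  # no [END_name]

parsed_ok = tokens_to_json(well_formed)
parsed_bad = tokens_to_json(malformed)
print(parsed_ok)
print(parsed_bad)
```

In the malformed case only the `price` field survives, which matches the note that a field without its closing token is lost.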
2.3 Pre-training
- Synthetic Document Generator : SynthDoG
- SynthDoG
- Aim : reduce dependence on large-scale document data during pre-training
- generated image
- background
- document
- text
- layout
- applies major image rendering techniques -> imitates real photographs
2.4 Application
- pre-training teaches how to read ----> fine-tuning teaches how to understand
3. Experiments and Analysis
3.1 Downstream Tasks and Datasets
3.1.1 Document Classification
- Others : classify with softmax
- Donut : generates a JSON that contains the class label
3.1.2 Document Parsing
- Understand -> complex layouts, formats, and contents
3.1.3 Document VQA
- document image and a natural language question -> predicts the proper answer
- The decoder is trained to generate a JSON that contains both the question (given) and the answer (predicted)
- to keep the uniformity of the method.
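To make the uniform output format concrete, here is a small sketch of how the JSON targets for classification and VQA could be serialized with the [START_∗] / [END_∗] convention from section 2.2. The helper `json_to_tokens`, the field names `class`, `question` and `answer`, and the sample values are assumptions made for illustration.

```python
def json_to_tokens(fields):
    """Serialize a flat dict into a [START_*]value[END_*] token string,
    keeping the fields in insertion order."""
    return "".join(f"[START_{key}]{value}[END_{key}]" for key, value in fields.items())


# Document classification: the class label is just another JSON field.
classification_target = json_to_tokens({"class": "memo"})

# Document VQA: the given question and the predicted answer share one JSON.
vqa_target = json_to_tokens({"question": "what is the total?", "answer": "12.50"})

print(classification_target)
print(vqa_target)
```

Because every task reduces to generating such a string, the same decoder and the same training objective can serve classification, parsing and VQA.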