[OCR] Donut : Document Understanding Transformer without OCR

https://arxiv.org/pdf/2111.15664v1.pdf

0. Abstract

An end-to-end (E2E) model that does away with the OCR framework
synthetic document image generator -> reduces dependence on large-scale data

1. Introduction

Semi-structured documents
Existing VDU (Visual Document Understanding)
- usually OCR-based
- composed of three separate modules : text detection, text recognition, parsing
  - Problem 1: OCR is expensive and is not always available
  - Problem 2: OCR errors negatively influence subsequent processes
  - (especially severe for Korean and Japanese; post-OCR correction is not a fundamental fix either)
Donut : a complete end-to-end model (no separate OCR stage)
SynthDoG : reduces dependence on large-scale real documents

2. Method

2.1 Preliminary : background

CV based ---> BERT ---> CV+NLP

2.2 Document Understanding Transformer = Donut

Aim : simple architecture based on the transformer

architecture
- visual encoder : document image
- textual decoder module : sequence of tokens
  - the token sequence is converted one-to-one
  - into the desired structured format

[ Encoder ]

input document image x ∈ R^{H×W×C}

Visual encoder : CNN-based or transformer-based (Swin Transformer)
|
V

a set of embeddings {z_i | z_i ∈ R^d, 1 ≤ i ≤ n}

n : feature map size or the number of image patches
d : the dimension of the latent vectors of the encoder

Swin transformer

- splits the input image x into non-overlapping patches
- Swin Transformer blocks apply local self-attention within shifted windows to the patches
- patch merging layers are then applied to the patch tokens
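
The first step, splitting the image into non-overlapping patches, can be sketched in plain Python. This is only an illustration and not Swin Transformer itself: the function name `split_into_patches`, the patch size, and the use of nested lists instead of tensors are all assumptions made for the example.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (nested lists) into non-overlapping square patches.

    Returns the patches in row-major order; each patch is a
    patch_size x patch_size nested list. H and W must be divisible
    by patch_size.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Toy 4 x 4 single-channel "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches))  # number of patches: (4 / 2) * (4 / 2)
print(patches[0])    # top-left patch
```

In a real encoder each patch would then be flattened and linearly projected to a d-dimensional vector, giving the n embeddings z_i.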

[ Decoder ]

decoder : representations {z} --> generates token sequence
BART (multilingual, 4 layers)

[ Model Input ]

training : teacher forcing (ground-truth tokens are fed as decoder input)
test : generates a token sequence given a prompt (like GPT-3)
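
The difference between the two modes can be sketched as follows. This is a toy illustration under stated assumptions: `teacher_forcing_pairs`, `generate`, and `toy_next_token` are hypothetical names, the `<s>` / `</s>` markers are placeholder start and end tokens, and the "model" simply replays a fixed script instead of running a trained decoder.

```python
def teacher_forcing_pairs(target_tokens, start_token="<s>"):
    """Build decoder inputs and labels for teacher forcing.

    At step t the decoder sees the ground-truth tokens before t
    (not its own predictions) and must predict token t.
    """
    decoder_input = [start_token] + target_tokens[:-1]
    labels = list(target_tokens)
    return decoder_input, labels


def generate(next_token_fn, prompt, end_token="</s>", max_len=16):
    """Test-time decoding: start from a prompt and feed the model's own
    predictions back in until end_token or max_len is reached."""
    sequence = list(prompt)
    while len(sequence) < max_len:
        token = next_token_fn(sequence)
        sequence.append(token)
        if token == end_token:
            break
    return sequence


target = ["[START_class]", "receipt", "[END_class]", "</s>"]
decoder_input, labels = teacher_forcing_pairs(target)


def toy_next_token(sequence):
    # Stand-in for a trained decoder: replays a fixed continuation.
    script = ["<s>"] + target
    return script[len(sequence)]


output = generate(toy_next_token, prompt=["<s>", "[START_class]"])
print(decoder_input)
print(output)
```

During training the inputs are always the shifted ground truth, so an early mistake cannot propagate; at test time every predicted token becomes part of the input for the next step.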

[ Output Conversion ]

JSON format
- add two special tokens, [START_*] and [END_*], per field *
- wrongly structured output? e.g. only [START_name] with no [END_name] ---> the field "name" is treated as lost
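
A minimal sketch of this conversion, assuming flat (non-nested) fields and the bracketed token spelling used in these notes. The function name `tokens_to_json` is hypothetical and this is not the authors' implementation; it only illustrates the rule that a field without a matching end token is dropped.

```python
import re


def tokens_to_json(sequence):
    """Convert a flat "[START_x] value [END_x]" string into a dict.

    A field whose [START_x] has no matching [END_x] is dropped,
    mirroring the "field is lost" rule.
    """
    result = {}
    # \1 forces the END tag to carry the same field name as the START tag.
    pattern = re.compile(r"\[START_(\w+)\](.*?)\[END_\1\]", re.DOTALL)
    for match in pattern.finditer(sequence):
        field = match.group(1)
        value = match.group(2).strip()
        result[field] = value
    return result


well_formed = "[START_name] Donut [END_name] [START_price] 3.50 [END_price]"
malformed = "[START_name] Donut [START_price] 3.50 [END_price]"

parsed_ok = tokens_to_json(well_formed)
parsed_bad = tokens_to_json(malformed)
print(parsed_ok)
print(parsed_bad)  # "name" has no [END_name], so only "price" survives
```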

2.3 Pre-training

Synthetic Document Generator : SynthDoG

SynthDoG
- Aim : reduce dependence on large-scale document data during pre-training
- generated image
  - background
  - document
  - text
  - layout
- uses major image rendering techniques -> imitates real photographs

2.4 Application

pre-training : how to read ----> fine-tuning (application) : how to understand

3. Experiments and Analysis

3.1 Downstream Tasks and Datasets

3.1.1 Document Classification

Others : classify with softmax
Donut : generates a JSON that contains the class label
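
Because the class label is emitted as JSON instead of a softmax score, the training target for classification is just a short token sequence. The sketch below shows one way such a target could be built, using the [START_*] / [END_*] notation from the Output Conversion part. The helper name `json_to_tokens` and the label "receipt" are assumptions for illustration only.

```python
def json_to_tokens(obj):
    """Serialize a (possibly nested) dict into a [START_*] ... [END_*] string."""
    if isinstance(obj, dict):
        parts = []
        for key, value in obj.items():
            parts.append(f"[START_{key}]{json_to_tokens(value)}[END_{key}]")
        return "".join(parts)
    # Leaf values are written out as plain text.
    return str(obj)


# Classification target: a JSON object holding only the class label.
class_target = json_to_tokens({"class": "receipt"})
print(class_target)

# The same helper also covers nested fields, as in document parsing.
nested_target = json_to_tokens({"menu": {"name": "latte", "price": "4.0"}})
print(nested_target)
```

Using one output format for every task means the same decoder and loss can be reused without a task-specific classification head.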

3.1.2 Document Parsing

Understand -> complex layouts, formats, and contents

3.1.3 Document VQA

document image and a natural language question -> predicts the proper answer
The decoder is trained to generate a JSON that contains both the question (given) and the answer (predicted)
- to keep the uniformity of the method.
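
A rough sketch of how the question can be given as a prompt while the answer is left for the decoder to fill in. The helper names `build_vqa_prompt` and `extract_answer`, the sample question, and the "generated" continuation are all hypothetical; the token spelling follows the [START_*] / [END_*] notation of these notes rather than any specific library.

```python
def build_vqa_prompt(question):
    """Decoder prompt: the question is given, the answer is left open."""
    return f"[START_question]{question}[END_question][START_answer]"


def extract_answer(generated):
    """Return the text between [START_answer] and [END_answer], or None
    if either token is missing."""
    start, end = "[START_answer]", "[END_answer]"
    begin = generated.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = generated.find(end, begin)
    if stop == -1:
        return None
    return generated[begin:stop].strip()


prompt = build_vqa_prompt("What is the total?")
# Hypothetical model continuation appended to the prompt.
generated = prompt + " 12.50 [END_answer]"
answer = extract_answer(generated)
print(answer)
```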
