[OCR] Donut : Document Understanding Transformer without OCR

https://arxiv.org/pdf/2111.15664v1.pdf

0. Abstract

  • An end-to-end (E2E) model that does away with the OCR framework
  • Synthetic document image generator -> reduces dependence on large-scale data

 

1. Introduction

  • Semi-structured documents
  • Conventional VDU (Visual Document Understanding)
    • Usually OCR-based
    • Composed of three separate modules : text detection, text recognition, parsing
      • Problem 1: OCR is expensive and is not always available
      • Problem 2: OCR errors negatively influence subsequent processes
      • (Especially severe for Korean and Japanese; post-OCR correction is not a fundamental fix either)
  • Donut : a complete end-to-end model
  • SynthDoG : reduces dependence on large-scale real documents

2.  Method

2.1 Preliminary : background

  • CV based ---> BERT ---> CV+NLP

2.2 Document Understanding Transformer = Donut

  • Aim : simple architecture based on the transformer

  • architecture
    • visual encoder                          : document image
    • textual decoder modules           :  sequence of tokens
      • the token sequence is converted one-to-one
      • into the desired structured format

[ Encoder ]


input document image x ∈ R^{H×W×C}

         |

         |

Visual encoder : CNN-based or transformer-based (Swin Transformer*)
         |
         V

 a set of embeddings {z_i | z_i ∈ R^d, 1 ≤ i ≤ n}

  • n : feature map size or the number of image patches
  • d : the dimension of the latent vectors of the encoder

Swin transformer

  1. splits the input image x into non-overlapping patches
  2. passes the patches through Swin blocks
    • inside each block, local self-attention is computed within shifted windows
    • patch merging layers are applied to the patch tokens
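
The two Swin steps above determine how many embeddings n the encoder emits and how wide each one is (d). The sketch below only does that bookkeeping; it is not an encoder implementation. The function name `swin_token_shapes` and its defaults (`patch_size=4`, `embed_dim=96`, four stages, each merge combining a 2×2 neighborhood and doubling the channel width) follow the standard Swin design and are assumptions here, not settings taken from these notes.

```python
def swin_token_shapes(height, width, patch_size=4, embed_dim=96, num_stages=4):
    """Return [(n, d), ...]: token count and channel width after each stage.

    Assumes the standard Swin layout: one patch-splitting step, then a
    patch-merging layer between consecutive stages that halves each
    spatial side and doubles the channel width.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    # Step 1: split the image into non-overlapping patches.
    h, w, d = height // patch_size, width // patch_size, embed_dim
    shapes = [(h * w, d)]

    # Step 2: each later stage starts with a patch-merging layer.
    for _ in range(num_stages - 1):
        if h % 2 or w % 2:
            raise ValueError("token grid must have even sides before merging")
        h, w, d = h // 2, w // 2, d * 2
        shapes.append((h * w, d))
    return shapes


# A 224x224 image with the assumed defaults.
print(swin_token_shapes(224, 224))
# [(3136, 96), (784, 192), (196, 384), (49, 768)]
```

The last pair is the (n, d) that the decoder would attend over under these assumptions.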

 

[ Decoder ]

 

  • decoder : given the representations {z} --> generates a token sequence
  • BART (multilingual, 4 layers)

 

[ Model Input ]

 

  • training : teacher forcing (the ground-truth tokens, not the model's own predictions, are fed as decoder input)
  • test : generates a token sequence given a prompt (like GPT-3)
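
The difference between the two modes can be shown without any real model. This is a minimal sketch under assumptions: `teacher_forcing_pairs`, `generate`, and `scripted_next_token` are hypothetical names, and the "model" is a lookup into a fixed script standing in for a trained decoder.

```python
def teacher_forcing_pairs(tokens):
    """Training: decoder input is the ground truth shifted right by one.

    At step t the decoder sees tokens[:t + 1] (all ground truth) and is
    asked to predict tokens[t + 1].
    """
    return tokens[:-1], tokens[1:]


def generate(prompt, next_token, end_token, max_len=32):
    """Test: start from a prompt and feed the model its own predictions."""
    seq = list(prompt)
    while len(seq) < max_len:
        tok = next_token(seq)
        seq.append(tok)
        if tok == end_token:
            break
    return seq


# Toy stand-in for a trained decoder: replays a fixed script.
SCRIPT = ["<s>", "[START_name]", "Latte", "[END_name]", "</s>"]


def scripted_next_token(seq):
    return SCRIPT[len(seq)]


inputs, targets = teacher_forcing_pairs(SCRIPT)
generated = generate(["<s>"], scripted_next_token, end_token="</s>")
```

With teacher forcing an early mistake cannot snowball during training, because the next input is always the ground-truth token; at test time the loop has no such safety net.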

 

[ Output Conversion ]

 

  • JSON format
    • add two special tokens [START_∗] and [END_∗] per field ∗
    • wrongly structured output? e.g. only [START_name] appears, with no [END_name] ---> the field “name” is treated as lost
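
The conversion rule above can be sketched as a small parser. This is an illustration under assumptions, not the authors' code: the function name `tokens_to_json` is hypothetical, field names are assumed to be plain word characters, and a repeated field simply overwrites the earlier value.

```python
import re

START_RE = re.compile(r"\[START_(\w+)\]")


def tokens_to_json(sequence):
    """Convert a '[START_x]...[END_x]' string into a (possibly nested) dict.

    A field whose [END_x] token is missing is dropped, which mirrors the
    "field is lost" rule in the notes.
    """
    result = {}
    pos = 0
    while True:
        match = START_RE.search(sequence, pos)
        if match is None:
            break
        field = match.group(1)
        end_tag = f"[END_{field}]"
        end = sequence.find(end_tag, match.end())
        if end == -1:
            # Wrongly structured: no closing token, so this field is lost.
            pos = match.end()
            continue
        body = sequence[match.end():end]
        # A body that itself contains start tokens is parsed as a sub-object.
        result[field] = tokens_to_json(body) if START_RE.search(body) else body.strip()
        pos = end + len(end_tag)
    return result


well_formed = "[START_menu][START_name]Latte[END_name][START_price]4.50[END_price][END_menu]"
broken = "[START_name]Latte[START_price]4.50[END_price]"

print(tokens_to_json(well_formed))  # {'menu': {'name': 'Latte', 'price': '4.50'}}
print(tokens_to_json(broken))       # {'price': '4.50'}
```

In the second call the unterminated "name" field disappears while "price" survives, so one malformed field does not invalidate the whole output.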

 

2.3 Pre-training

  • Synthetic Document Generator : SynthDoG

The components of SynthDoG.

  • SynthDoG 
    • Aim : to reduce dependence on large-scale documents during pre-training
    • generated image
      • background 
      • document
      • text
      • layout
    • uses major image rendering techniques -> imitates real photographs
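
To make the four components concrete, here is a toy sketch of how one synthetic sample could be assembled. It is not SynthDoG itself: the names `SyntheticSample`, `make_layout`, and `make_sample`, the grid-based layout rule, the page size, and the file names are all assumptions for illustration, and no actual image rendering is performed.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    background: str  # image placed behind the page
    document: str    # paper texture of the page itself
    layout: list     # text regions as (x, y, width, height)
    text: list       # one string per region in `layout`


def make_layout(rng, width, height, rows, cols):
    """Cut the page into a grid and keep a random subset of cells."""
    cell_w, cell_h = width // cols, height // rows
    boxes = [(c * cell_w, r * cell_h, cell_w, cell_h)
             for r in range(rows) for c in range(cols)]
    keep = rng.randint(1, len(boxes))
    # Sort top-to-bottom, then left-to-right, to get a reading order.
    return sorted(rng.sample(boxes, keep), key=lambda b: (b[1], b[0]))


def make_sample(seed, backgrounds, textures, corpus, width=800, height=1000):
    rng = random.Random(seed)  # seeded so a sample can be regenerated
    layout = make_layout(rng, width, height, rows=4, cols=2)
    text = [rng.choice(corpus) for _ in layout]
    return SyntheticSample(rng.choice(backgrounds), rng.choice(textures), layout, text)


sample = make_sample(
    seed=0,
    backgrounds=["desk.jpg"],      # hypothetical file names
    textures=["paper.jpg"],
    corpus=["hello", "world"],
)
```

Because the text and its boxes are generated rather than annotated, every sample comes with exact ground truth for free, which is what lets a generator like this stand in for large labeled document collections.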

 

2.4 Application

  • pre-training teaches how to read ----> fine-tuning teaches how to understand

 

3. Experiments and Analysis

3.1 Downstream Tasks and Datasets

3.1.1 Document Classification

  • Others : classify with a softmax
  • Donut : generates a JSON that contains the class label
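
The contrast between the two approaches can be sketched side by side. Everything here is illustrative: the function names, the field name `class`, and the example labels and logits are assumptions, and the "generated" string stands in for real decoder output.

```python
import math

START, END = "[START_class]", "[END_class]"


def softmax_classify(logits, labels):
    """Conventional head: softmax over fixed classes, pick the most likely."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))]


def classification_target(label):
    """Donut-style: the label is just another field in the token sequence."""
    return f"{START}{label}{END}"


def parse_class(sequence):
    """Recover {'class': ...} from a generated sequence, or None if malformed."""
    begin = sequence.find(START)
    if begin == -1:
        return None
    begin += len(START)
    finish = sequence.find(END, begin)
    if finish == -1:
        return None
    return {"class": sequence[begin:finish].strip()}


labels = ["letter", "memo", "invoice"]
print(softmax_classify([0.1, 2.0, -1.0], labels))   # memo
print(parse_class(classification_target("memo")))   # {'class': 'memo'}
```

Framing the label as generated text means the same decoder and output format can serve classification and the other tasks, at the cost of having to handle malformed generations.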

3.1.2 Document Parsing

  • Understand -> complex layouts, formats, and contents

3.1.3 Document VQA

  • document image and a natural language question -> predicts the proper answer
  • The decoder is trained so that the output JSON contains both the question (given) and the answer (predicted)
    • this keeps the method uniform across tasks
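
A minimal sketch of that setup: the question goes into the prompt, the decoder continues with the answer, and the full sequence converts to a JSON with both fields. The function names and the field names `question` and `answer` are assumptions for illustration.

```python
import re

# Matches "[START_x]...[END_x]" for the two assumed VQA fields.
FIELD_RE = re.compile(r"\[START_(question|answer)\](.*?)\[END_\1\]", re.DOTALL)


def vqa_prompt(question):
    """Test-time prompt: the given question, then an open answer field."""
    return f"[START_question]{question}[END_question][START_answer]"


def vqa_training_target(question, answer):
    """Training target: the prompt followed by the answer and its end token."""
    return f"{vqa_prompt(question)}{answer}[END_answer]"


def vqa_to_json(sequence):
    """Collect the question and answer fields into one dict."""
    return {name: value.strip() for name, value in FIELD_RE.findall(sequence)}


sequence = vqa_training_target("What is the total?", "4.50")
print(vqa_to_json(sequence))
# {'question': 'What is the total?', 'answer': '4.50'}
```

If the model never emits the closing answer token, the regex finds no answer field and the dict contains only the question, consistent with the lost-field rule from the output conversion step.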