[OCR] FOTS: Fast Oriented Text Spotting with a Unified Network




GitHub - jiangxiluning/FOTS.PyTorch: FOTS Pytorch Implementation

0. Abstract

  • FOTS : detection & recognition --> simultaneous & complementary (they share computation & visual information)

1. Introduction

  • Common Method
    • CNN : extract feature map
    • another decoder : decode the regions ---> heavy time cost & ignores the correlation between detection and recognition [Background]
  • FOTS 
    • learns more generic features -> shared between 'detection' & 'recognition'
    • computation shrinks to a single detection network ---> computational cost ↓
  • HOW? RoIRotate
    • takes the appropriate features from the feature map, according to the oriented detection bounding boxes
    • text detection branch : predict the detection bounding boxes
    • text proposal features
      1. RoIRotate : extracts text proposal features according to the detection results
      2. RNN encoder
      3. CTC(Connectionist Temporal Classification) decoder 

2. Related Work

2.1. Text Detection

  • character-based methods (treat text as a composition of characters)
    • localize characters ----> group them into words or text lines
    • Sliding-window-based methods
    • connected-components based methods
  • Recently
    • AIM : directly detect words in images
    • . . . 

2.2. Text Recognition

  • AIM : decode 'a sequence of labels' from regularly cropped but variable-length text images
  • Recently
    • word classification based
    • sequence-to-label decode based
    • sequence-to-sequence model based

2.3. Text Spotting

  • separate detection / recognition model
  • Recently <Towards end-to-end text spotting with convolutional recurrent neural networks>
    • detection : text proposal network inspired by RPN
    • recognition : LSTM with attention mechanism
  • --> FOTS : handles more complex cases beyond horizontal text (e.g. oriented text) + faster

3. Methodology

3.1. Overall Architecture

  • backbone : ResNet-50
  • concatenate -> low-level feature maps & high-level semantic feature maps
  • text detection branch 
  • RoIRotate 
    • follows the output of the text detection branch (= 'oriented text region proposals')
    • converts the corresponding features into fixed-height representations
  • text recognition branch
    • CNN and LSTM : encode text sequence information
    • CTC : decode
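
To make the resolution bookkeeping of the shared features concrete, here is a small sketch that only tracks spatial sizes. It assumes the usual ResNet strides (4, 8, 16, 32) and an input whose sides are multiples of 32. The function name `shared_feature_sizes` is made up for this note, and the actual layers used for merging are not modelled.

```python
def shared_feature_sizes(height, width):
    """Spatial sizes of the backbone stages and of the merged feature map.

    Assumption: ResNet-style strides 4, 8, 16, 32 and sides divisible by 32,
    so that every division below is exact.
    """
    if height % 32 or width % 32:
        raise ValueError("this sketch expects sides divisible by 32")
    strides = (4, 8, 16, 32)
    stages = {s: (height // s, width // s) for s in strides}

    # Top-down merge: start from the 1/32 map, upsample by 2 and concatenate
    # with the next finer stage, until the 1/4 resolution is reached.
    h, w = stages[32]
    for s in (16, 8, 4):
        h, w = h * 2, w * 2
        assert (h, w) == stages[s]  # sizes must agree before concatenation
    return stages, (h, w)


stages, merged = shared_feature_sizes(640, 1280)
print(stages[32], merged)  # (20, 40) (160, 320)
```

Both the detection branch and RoIRotate then read from this single 1/4-resolution map, which is where the saving in computation comes from.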


3.2. Text Detection Branch 

  • text detector : fully convolutional network
  • upscale the feature maps from 1/32 to 1/4 size
  • first channel : the probability of each pixel being a positive sample
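    • for each positive sample :
      • next 4 channels --> predict its distances to the top, bottom, left, right sides of the bounding box containing the pixel
      • last channel --> predicts the orientation of that bounding box
      • final output = the predicted bounding boxes after thresholding and NMS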
  • loss :
    • text classification term
    • bounding box regression term
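
The post-processing described above (threshold the score channel, turn the four distances into a box, then apply NMS) can be sketched in plain Python. This is a simplified illustration, not the repository's code: it ignores orientation and treats boxes as axis-aligned, it uses standard greedy NMS, and the helper names, the `stride=4` mapping back to image coordinates, and both thresholds are assumptions.

```python
def decode_boxes(score_map, dist_map, score_thresh=0.5, stride=4):
    """Turn per-pixel predictions into (x1, y1, x2, y2, score) boxes.

    dist_map[y][x] holds (top, bottom, left, right) distances in image pixels.
    """
    boxes = []
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score < score_thresh:
                continue  # not a positive sample
            top, bottom, left, right = dist_map[y][x]
            px, py = x * stride, y * stride  # pixel position in the image
            boxes.append((px - left, py - top, px + right, py + bottom, score))
    return boxes


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept


score_map = [[0.9, 0.8, 0.1],
             [0.2, 0.1, 0.7]]
dist_map = [[(2, 6, 2, 10), (2, 6, 6, 6), (0, 0, 0, 0)],
            [(0, 0, 0, 0), (0, 0, 0, 0), (1, 20, 1, 30)]]
final_boxes = nms(decode_boxes(score_map, dist_map))
print(final_boxes)  # [(-2, -2, 10, 6, 0.9), (7, 3, 38, 24, 0.7)]
```

The first two pixels vote for the same box, so NMS keeps only the higher-scoring one. The third box barely overlaps it and survives.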

3.3. RoIRotate

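  • oriented feature regions  --->  axis-aligned feature maps
  • operates on the feature map, NOT on the original image!
  • (rotates tilted text so that it becomes upright)
  • Differences from existing methods (RoI pooling, RoIAlign)
    • uses bilinear interpolation -> the output length is variable
    • *bilinear interpolation : linear interpolation applied twice, once along the x axis and once along the y axis
  • Importance
    • text recognition is very sensitive to noise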
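
A minimal sketch of the idea behind RoIRotate, written in plain Python on a single-channel feature map. It is an illustration under assumptions, not the paper's exact affine parameterization: the box is given as centre, width, height and angle (in radians), the output height is fixed, the output width follows the box's aspect ratio, and every output cell is filled by bilinear interpolation. The names `bilinear` and `roi_rotate` are hypothetical.

```python
import math


def bilinear(feat, x, y):
    """Sample a 2-D feature map (list of rows) at a fractional (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the map borders
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Linear interpolation along x (twice), then once along y.
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bottom = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def roi_rotate(feat, cx, cy, box_w, box_h, angle, out_h=8):
    """Resample an oriented box into an axis-aligned out_h x out_w patch."""
    # Fixed height, variable width: keep the aspect ratio of the box.
    out_w = max(1, round(out_h * box_w / box_h))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    patch = []
    for j in range(out_h):
        v = ((j + 0.5) / out_h - 0.5) * box_h  # offset along the box height
        row = []
        for i in range(out_w):
            u = ((i + 0.5) / out_w - 0.5) * box_w  # offset along the box width
            # Rotate the offset by the box angle and move to the box centre.
            x = cx + u * cos_a - v * sin_a
            y = cy + u * sin_a + v * cos_a
            row.append(bilinear(feat, x, y))
        patch.append(row)
    return patch


# Toy feature map whose value encodes its position: value = x + 10 * y.
feat = [[x + 10 * y for x in range(8)] for y in range(8)]
print(roi_rotate(feat, cx=3.5, cy=3.5, box_w=4, box_h=2, angle=0.0, out_h=2))
# [[32.0, 33.0, 34.0, 35.0], [42.0, 43.0, 44.0, 45.0]]
```

With a non-zero `angle` the same loop walks along the tilted box, so the recognition branch always receives upright, fixed-height features regardless of the text orientation.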


3.4. Text Recognition Branch

  • Aim : predict text labels (using the region features obtained from RoIRotate)



  1. VGG-like sequential convolutions
  2. poolings with reduction along height axis only
    • output : higher-level feature maps
  3. encoder : bi-directional LSTM
  4. one fully-connected layer
  5. decoder : CTC ( frame-wise classification scores ---[transform]---> label sequence )
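
Step 5 can be illustrated with greedy (best-path) CTC decoding: take the best class per frame, collapse consecutive repeats, then drop blanks. Greedy decoding is an assumption here, since the notes only say that CTC turns frame-wise scores into a label sequence. The function name and the convention that class 0 is the blank are also chosen for this sketch.

```python
def ctc_greedy_decode(scores, alphabet, blank=0):
    """Decode frame-wise class scores into a string.

    scores   : list of frames, each a list of per-class scores
    alphabet : characters for classes 1..len(alphabet); class `blank` is the blank
    """
    # Best class index for every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in scores]

    decoded = []
    prev = None
    for k in best_path:
        # Emit a character only when the class changes and is not the blank.
        if k != prev and k != blank:
            decoded.append(alphabet[k - 1])
        prev = k
    return "".join(decoded)


alphabet = "ehlo"                  # class 1='e', 2='h', 3='l', 4='o'
path = [2, 2, 1, 3, 3, 0, 3, 4]    # h h e l l - l o
scores = [[1.0 if c == k else 0.0 for c in range(5)] for k in path]
print(ctc_greedy_decode(scores, alphabet))  # hello
```

The blank between the two runs of `l` is what lets the decoder output a double letter. Without it, the repeated frames would collapse into a single `l`.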


  • Loss