A curated list of papers and resources for scene text detection and recognition. This repository tracks the latest advances in text-in-the-wild research from 2010 to 2025.
Note on Paper Dating: We use the year when a paper was first publicly available (including arXiv preprints) rather than the conference publication year. For example, a paper published on arXiv in 2023 but accepted to CVPR 2024 is listed under 2023.
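For contributors, the dating rule above can be sketched in a few lines of Python. This is only an illustration of the convention, not tooling that ships with this repository: the `Paper` class, its field names, and the `listing_year` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paper:
    """Hypothetical record for one list entry (names are illustrative)."""
    title: str
    venue_year: Optional[int] = None     # year of the conference/journal version, if any
    preprint_year: Optional[int] = None  # year first posted publicly (e.g. arXiv), if any


def listing_year(paper: Paper) -> int:
    """Return the year an entry is filed under: the earliest public year known."""
    years = [y for y in (paper.preprint_year, paper.venue_year) if y is not None]
    if not years:
        raise ValueError(f"no year recorded for {paper.title!r}")
    return min(years)


# The example from the note: on arXiv in 2023, accepted to CVPR 2024.
example = Paper("Example paper", venue_year=2024, preprint_year=2023)
print(listing_year(example))  # 2023
```

A paper with no preprint simply falls back to its venue year.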
- Total Papers: 160+
- Timespan: 2010-2025
- Focus Areas: Detection, Recognition, End-to-End Spotting, Video Text, Generation, and more
- Survey Papers
- Tools & Libraries
- Datasets & Benchmarks
- Scene Text Detection
- Scene Text Recognition
- End-to-End Text Spotting
- Video Text Detection & Recognition
- Text Generation with Diffusion Models
- Text Editing & Removal
- Weakly Supervised Methods
- Multilingual & Low-Resource Languages
- Document AI & Layout Analysis
- Other Scene Text Papers
- Challenges and Gaps in Scene Text Detection and Recognition: A Detailed Survey [paper]
- Self-Supervised Learning for Text Recognition: A Critical Survey [IJCV 2025] [paper]
- Handwritten Text Recognition: A Survey [arXiv 2025] [paper]
- A Comprehensive Survey of Transformers in Text Recognition: Techniques, Challenges, and Future Directions [ACM Computing Surveys 2024] [paper]
- Scene text understanding: recapitulating the past decade [Artificial Intelligence Review 2023] [paper]
- Scene text detection and recognition: a survey [Multimedia Tools and Applications 2022] [paper]
- A survey on methods, datasets and implementations for scene text spotting [IET Image Processing 2022] [paper]
- Scene Text Detection and Recognition: The Deep Learning Era [arXiv 2019] [paper]
- Scene text detection and recognition with advances in deep learning: a survey [IJDAR 2019] [paper]
- PaddleOCR - Powerful, lightweight OCR toolkit supporting 100+ languages [code]
- EasyOCR - Ready-to-use, PyTorch-based OCR with support for 80+ languages [code]
- MMOCR - Comprehensive OCR toolbox with 7 detection and 5 recognition algorithms [code]
- OpenOCR - Unified benchmark system for training and evaluating scene text models [code]
For detailed comparisons of these tools, see:
- OCR comparison: Tesseract vs EasyOCR vs PaddleOCR vs MMOCR
- Open-Source OCR Libraries: A Comprehensive Study
Horizontal Text:
- ICDAR 2003 (IC03) - 509 images (258 train, 251 test), 2,266 text instances, English only [download]
- ICDAR 2013 (IC13) - 462 images (229 train, 233 test), 1,944 text instances [download]
- ICDAR 2015 (IC15) - 1,500 images (1,000 train, 500 test), 17,548 text instances, first incidental scene text dataset [download]
Multi-Lingual Text:
- ICDAR 2017 MLT - 18,000 images (7,200 train, 1,800 validation, 9,000 test), 9 languages, word-level annotations [paper] [download]
- ICDAR 2019 MLT - 20,000 images, 10 languages, word-level annotations [paper] [download]
Arbitrary-Shaped Text:
- ICDAR 2019 ArT - 10,166 images (5,603 train, 4,563 test), diverse text shapes [download]
Chinese Text:
- ICDAR 2017 RCTW-17 - 12,514 images (11,514 train, 1,000 test), English/Chinese [download]
- ICDAR 2019 ReCTS - 20,000 images, Chinese street view trademark dataset [download]
Horizontal/Multi-Oriented Text:
- COCO-Text - 63,686 images (43,686 train, 20,000 test), 145,859 text instances, multilingual [paper] [download]
- MSRA-TD500 - 500 images (300 train, 200 test), English/Chinese, text-line level [download]
- SVT (Street View Text) - 350 images, 725 text instances [download]
- USTB-SV1K - 1,000 street view images, 2,955 text instances [download]
- IIIT5K - 5,000 word images (2,000 train, 3,000 test), 50-word and 1,000-word lexicons [download]
Curved/Irregular Text:
- Total-Text - 1,555 images, 11,459 text instances, horizontal/multi-oriented/curved [paper] [download]
- SCUT-CTW1500 - 1,500 images (1,000 train, 500 test), 10,751 text instances, 14-vertex polygon annotations [download]
- CUTE80 - 80 high-resolution images, 288 cropped curved text instances [download]
- LSVT - 450,000 images (430,000 train, 20,000 test), horizontal/multi-oriented/curved text [download]
Chinese Text:
- CTW (Chinese Text in the Wild) - 32,285 images, 1,018,402 character instances, character-level with 6 attributes [download]
Synthetic Datasets:
- SynthText - 800,000 images, 6 million text instances [code]
- Synth80k - 800,000 images, ~8 million synthetic word instances [download]
Recent Real-World Datasets & Benchmarks:
- Union14M (2023) - 4M labeled + 10M unlabeled images for STR [paper] [download]
- HierText (ICDAR 2023) - Hierarchical text with word/line/paragraph annotations, 103.8 words/image [download]
- TextOCR - 900k annotated words on real images [paper] [download]
- OCRBench v2 (2025) - Comprehensive benchmark for LMMs across 8 text-oriented abilities [paper] [download]
- WordArt-V1.5 (ICDAR 2024) - 12,000 artistic text images [paper] [download]
Video Text:
- DSText (ICDAR 2023) - 100 video clips from 12 scenarios [paper] [download]
- BOVText (2021) - Bilingual, open-world video text dataset with 2,000+ videos, 1.75M+ frames [paper] [download]
Visual Question Answering (VQA) with Scene Text:
- TextVQA - 45,336 questions on 28,408 images requiring reading and reasoning about text [paper] [website] [dataset]
- ST-VQA (Scene Text VQA) - 23,038 images, 31,791 QA pairs, all questions require scene text reading [paper] [download]
- OCR-VQA - 200k+ QA pairs on book cover images [paper] [download]
- DocVQA - 50,000 questions on 12,000+ document images [paper] [website] [download]
- InfographicsVQA - 5,000+ infographic images, 30,000 QA pairs [paper] [website]
- TextCaps - 145k captions for 28k images requiring text reading and reasoning [paper] [download]
- ViTextVQA - Vietnamese text comprehension in images [paper] [download]
- ViOCRVQA - 28,000+ images, 120,000+ Vietnamese QA pairs [paper] [download]
Mathematical Expressions:
- MathWriting (NeurIPS 2023) - Handwritten mathematical expressions [paper]
Other Domains:
- Robust Reading Competition - Ongoing challenges since 2013 [website]
- Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding [arXiv 2025] [paper]
- The Devil is in Fine-tuning and Long-tailed Problems: A New Benchmark for Scene Text Detection [IJCAI 2025] [paper] [code]
- Scene Text Detection and Recognition "in light of" Challenging Environmental Conditions using Aria Glasses Egocentric Vision Cameras [arXiv 2025] [paper] [code]
- A Large-scale Dataset for Robust Complex Anime Scene Text Detection [arXiv 2025] [paper] [dataset]
- Masked Text Pre-Training for Scene Text Detection [Transactions on Multimedia 2025] [paper]
- TextMamba: Scene Text Detector with Mamba [arXiv 2025] [paper]
- Occluded scene text detection via context-awareness from sketch-level image representations [Multimedia Systems 2025] [paper]
- ContraText-DETR: Boosting Industrial Scene Text Detection Based on Contrastive Learning and Synthetic Low-Contrast Text [Sensors Journal 2025] [paper]
- Explicit Relational Reasoning Network for Scene Text Detection [AAAI 2025] [paper]
- InstructOCR: Instruction Boosting Scene Text Spotting [AAAI 2025] [paper]
- Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance [AAAI 2025] [paper]
- Revisiting Tampered Scene Text Detection in the Era of Generative AI [AAAI 2025] [paper] [code]
- TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model [arXiv 2024] [paper]
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting [CVPR 2024] [paper] [code] [code]
- Bridging the Gap Between End-to-End and Two-Step Text Spotting [CVPR 2024] [paper] [code]
- LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding [CVPR 2024] [paper]
- Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis [CVPR 2024] [paper]
- ACP-Net: Asymmetric Center Positioning Network for Real-Time Text Detection [Knowledge-Based Systems 2024] [paper]
- GridMask: An Efficient Scheme for Real Time Curved Scene Text Detection [paper]
- TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model [Transactions on Multimedia Computing, Communications and Applications 2025] [paper]
- SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting [IJCV 2025] [paper] [code]
- LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network [AAAI 2024] [paper] [code]
- Bridging Synthetic and Real Worlds for Pre-Training Scene Text Detectors [ECCV 2024] [paper] [code] [code]
- Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning [ACMMM 2023] [paper]
- DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting [arXiv 2023] [paper] [code]
- ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer [ICCV 2023] [paper] [code]
- Towards Robust Tampered Text Detection in Document Image: New dataset and New Solution [CVPR 2023] [paper] [code]
- Self-Supervised Implicit Glyph Attention for Text Recognition [CVPR 2023] [paper] [code]
- Arbitrary-shaped scene text detection with keypoint-based shape representation [IJDAR 2023] [paper]
- Arbitrary-Shaped Text Detection with B-Spline Curve Network [Sensors 2023] [paper]
- SwinTextSpotter: Scene Text Spotting via Better Synergy Between Text Detection and Recognition [CVPR 2022] [paper] [code]
- Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection [CVPR 2022] [paper]
- Vision-Language Pre-Training for Boosting Scene Text Detectors [CVPR 2022] [paper]
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [CVPR 2022] [paper] [code]
- TESTR: Text Spotting Transformers [CVPR 2022] [paper]
- Contextual Text Block Detection towards Scene Text Understanding [ECCV 2022] [paper]
- GLASS: Global to Local Attention for Scene-Text Spotting [ECCV 2022] [paper] [code]
- Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting [ECCV 2022] [paper] [code]
- Arbitrary Shape Text Detection via Boundary Transformer [Transactions on Multimedia 2023] [paper] [code]
- Arbitrary shape scene text detector with accurate text instance generation based on instance-relevant contexts [Multimedia Tools and Applications 2022] [paper]
- Arbitrary Shape Text Detection using Transformers [arXiv 2022] [paper]
- Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [TPAMI 2022] [paper] [code]
- Progressive Contour Regression for Arbitrary-Shape Scene Text Detection [CVPR 2021] [paper]
- Fourier Contour Embedding for Arbitrary-Shaped Text Detection [CVPR 2021] [paper]
- A Straightforward and Efficient Instance-Aware Curved Text Detector [Sensors 2021] [paper]
- FAST: Searching for a Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation [arXiv 2021] [paper] [code]
- Detection and rectification of arbitrary shaped scene texts by using text keypoints and links [Pattern Recognition 2021] [paper]
- Arbitrary-shaped scene text detection by predicting distance map [Applied Intelligence 2021] [paper]
- UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World [CVPR 2020] [paper] [code]
- Curved scene text detection via transverse and longitudinal sequence connection [Pattern Recognition 2020] [paper]
- DBNet: Real-time Scene Text Detection with Differentiable Binarization [AAAI 2020] [paper] [code]
- MSR: Multi-Scale Shape Regression for Scene Text Detection [IJCAI 2019] [paper]
- Scene Text Detection with Inception Text Proposal Generation Module [ICMLC 2019] [paper]
- Towards Robust Curve Text Detection with Conditional Spatial Expansion [CVPR 2019] [paper]
- Detecting Curve Text with Local Segmentation Network and Curve Connection [arXiv 2019] [paper]
- Pyramid Mask Text Detector [arXiv 2019] [paper]
- Tightness-aware Evaluation Protocol for Scene Text Detection [CVPR 2019] [paper] [code]
- Character Region Awareness for Text Detection [CVPR 2019] [paper] [code]
- Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes [CVPR 2019] [paper]
- TextCohesion: Detecting Text for Arbitrary Shapes [arXiv 2019] [paper]
- Arbitrary Shape Scene Text Detection With Adaptive Text Region Representation [CVPR 2019] [paper]
- Learning Shape-Aware Embedding for Scene Text Detection [CVPR 2019] [paper]
- A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning [ACMMM 2019] [paper]
- Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [ICCV 2019] [paper] [code]
- Towards Unconstrained End-to-End Text Spotting [ICCV 2019] [paper]
- TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting [ICCV 2019] [paper]
- Convolutional Character Networks [ICCV 2019] [paper] [code]
- PixelLink: Detecting Scene Text via Instance Segmentation [AAAI 2018] [paper] [code]
- FOTS: Fast Oriented Text Spotting With a Unified Network [CVPR 2018] [paper]
- TextBoxes++: A Single-Shot Oriented Scene Text Detector [TIP 2018] [paper] [code]
- Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation [CVPR 2018] [paper]
- An end-to-end TextSpotter with Explicit Alignment and Attention [CVPR 2018] [paper] [code]
- Rotation-Sensitive Regression for Oriented Scene Text Detection [CVPR 2018] [paper] [code]
- Detecting multi-oriented text with corner-based region proposals [Neurocomputing 2019] [paper] [code]
- An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches [arXiv 2018] [paper]
- IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection [IJCAI 2018] [paper] [code]
- Shape Robust Text Detection with Progressive Scale Expansion Network [CVPR 2019] [paper] [paper] [code]
- TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [ECCV 2018] [paper] [code]
- Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes [ECCV 2018] [paper] [code]
- Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping [ECCV 2018] [paper]
- A New Anchor-Labeling Method For Oriented Text Detection Using Dense Detection Framework [SPL 2018] [paper]
- An Efficient System for Hazy Scene Text Detection using a Deep CNN and Patch-NMS [ICPR 2018] [paper]
- Scene Text Detection with Supervised Pyramid Context Network [AAAI 2019] [paper]
- Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks [arXiv] [paper]
- Mask R-CNN with Pyramid Attention Network for Scene Text Detection [WACV 2019] [paper]
- TextMountain: Accurate Scene Text Detection via Instance Segmentation [arXiv] [paper]
- TextField: Learning A Deep Direction Field for Irregular Scene Text Detection [arXiv 2018] [paper] [code]
- TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network [ACCV 2018] [paper]
- Multi-scale FCN with Cascaded Instance Aware Segmentation for Arbitrary Oriented Word Spotting In The Wild [CVPR 2017] [paper]
- Deep TextSpotter: An End-To-End Trainable Scene Text Localization and Recognition Framework [ICCV 2017] [paper]
- Arbitrary-Oriented Scene Text Detection via Rotation Proposals [TMM 2018] [paper] [code]
- Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection [CVPR 2017] [paper]
- Detecting Oriented Text in Natural Images by Linking Segments [CVPR 2017] [paper] [code]
- Deep Direct Regression for Multi-Oriented Scene Text Detection [ICCV 2017] [paper]
- Cascaded Segmentation-Detection Networks for Word-Level Text Spotting [arXiv 2017] [paper]
- EAST: An Efficient and Accurate Scene Text Detector [CVPR 2017] [paper] [code - TF] [code - Keras]
- WordFence: Text Detection in Natural Images with Border Awareness [ICIP 2017] [paper]
- R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection [arXiv 2017] [paper] [code]
- WordSup: Exploiting Word Annotations for Character based Text Detection [ICCV 2017] [paper]
- Single Shot Text Detector With Regional Attention [ICCV 2017] [paper] [code]
- Fused Text Segmentation Networks for Multi-oriented Scene Text Detection [ICPR 2018] [paper]
- Deep Residual Text Detection Network for Scene Text [ICDAR 2017] [paper]
- Feature Enhancement Network: A Refined Scene Text Detector [AAAI 2018] [paper]
- ArbiText: Arbitrary-Oriented Text Detection in Unconstrained Scene [arXiv 2017] [paper]
- Self-organized Text Detection with Minimal Post-processing via Border Learning [ICCV 2017] [paper] [code]
- Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network [arXiv 2016] [paper]
- Multi-Oriented Text Detection With Fully Convolutional Networks [CVPR 2016] [paper]
- Scene Text Detection Via Holistic, Multi-Channel Prediction [arXiv 2016] [paper]
- Detecting Text in Natural Image with Connectionist Text Proposal Network [ECCV 2016] [paper] [code]
- TextBoxes: A Fast Text Detector with a Single Deep Neural Network [AAAI 2017] [paper] [code]
- Symmetry-based text line detection in natural scenes [CVPR 2015] [paper]
- Object proposals for text extraction in the wild [ICDAR 2015] [paper]
- Text-Attentional Convolutional Neural Network for Scene Text Detection [TIP 2016] [paper]
- Text Flow: A Unified Text Detection System in Natural Scene Images [ICCV 2015] [paper]
- Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees [ECCV 2014] [paper]
- Real-time scene text localization and recognition [CVPR 2012] [paper]
- Detecting text in natural scenes with stroke width transform [CVPR 2010] [paper]
- A Method for Text Localization and Recognition in Real-World Images [ACCV 2010] [paper]
- HunyuanOCR: Commercial-Grade OCR Vision-Language Model [arXiv 2025] [paper]
- A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition [arXiv 2025] [paper]
- SSCD: Self-Supervised Coherence Discrimination Representation Learning for Scene Text Recognition [ICMR 2025] [paper]
- CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR [CVPR 2025] [paper]
- OTE: Exploring Accurate Scene Text Recognition Using One Token [CVPR 2024] [paper]
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [CVPR 2024]
- An Empirical Study of Scaling Law for Scene Text Recognition [CVPR 2024] [paper]
- VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer [arXiv 2024] [paper]
- Decoder Pre-Training with only Text for Scene Text Recognition [arXiv 2024] [paper]
- SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor [arXiv 2024] [paper]
- TextViTCNN: Enhancing Natural Scene Text Recognition with Hybrid Transformer and Convolutional Networks [paper]
- Free Lunch: Frame-level Contrastive Learning with Text Perceiver for Robust Scene Text Recognition in Lightweight Models [ACM MM 2024] [paper]
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition [arXiv 2024] [paper]
- Focus-Enhanced Scene Text Recognition with Deformable Convolutions [arXiv 2024] [paper] [code]
- CDistNet: Perceiving multi-domain character distance for robust text recognition [IJCV 2024]
- Revisiting Scene Text Recognition: A Data Perspective [ICCV 2023] [paper] [code]
- CLIPTER: Looking at the Bigger Picture in Scene Text Recognition [ICCV 2023] [paper]
- PreSTU: Pre-Training for Scene-Text Understanding [ICCV 2023] [paper]
- Self-Supervised Character-to-Character Distillation for Text Recognition [ICCV 2023] [paper]
- MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition [ICCV 2023]
- TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models [AAAI 2023] [paper]
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [arXiv 2023] [paper]
- Relational Contrastive Learning for Scene Text Recognition [arXiv 2023] [paper]
- ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition [paper]
- TPS++: Attention-Enhanced Thin-Plate Spline for Scene Text Recognition [IJCAI 2023]
- PARSeq: Scene Text Recognition with Permuted Autoregressive Sequence Models [ECCV 2022] [paper] [code]
- Optimal Boxes: Boosting End-to-End Scene Text Recognition by Adjusting Annotated Bounding Boxes via Reinforcement Learning [ECCV 2022] [paper]
- Multi-Granularity Prediction for Scene Text Recognition [ECCV 2022] [paper]
- Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition [ECCV 2022] [paper]
- TextAdaIN: Paying Attention to Shortcut Learning in Text Recognizers [ECCV 2022]
- Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features [ECCV 2022]
- SVTR: Scene Text Recognition with a Single Visual Model [IJCAI 2022]
- CarveNet: a channel-wise attention-based network for irregular scene text recognition [IJDAR 2022] [paper]
- Rethinking text rectification for scene text recognition [Expert Systems with Applications 2023] [paper]
- An extended attention mechanism for scene text recognition [Expert Systems with Applications 2022] [paper]
- ViTSTR: Vision Transformer for Fast and Efficient Scene Text Recognition [ICDAR 2021] [paper] [code]
- Sequence-to-Sequence Contrastive Learning for Text Recognition [CVPR 2021] [paper]
- What if We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition with Fewer Labels [CVPR 2021] [paper] [code]
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [CVPR 2021] [paper] [code]
- Dictionary-Guided Scene Text Recognition [CVPR 2021] [paper] [code]
- Primitive Representation Learning for Scene Text Recognition [CVPR 2021] [paper] [code]
- MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition [CVPR 2021] [paper]
- SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition [CVPR 2020] [paper] [code]
- SCATTER: Selective Context Attentional Scene Text Recognizer [CVPR 2020]
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [CVPR 2020] [paper]
- STAN: A sequential transformation attention-based network for scene text recognition [Pattern Recognition 2021] [paper]
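- GTC: Guided Training of CTC Towards Efficient and Accurate Scene Text Recognition [AAAI 2020]
- TextScanner: Reading Characters in Order for Robust Scene Text Recognition [AAAI 2020]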
- A Multi-Object Rectified Attention Network for Scene Text Recognition [Pattern Recognition] [paper]
  - Code (PyTorch): https://github.com/Canjie-Luo/MORAN_v2
- A Simple and Robust Convolutional-Attention Network for Irregular Text Recognition [paper]
- Aggregation Cross-Entropy for Sequence Recognition [CVPR 2019] [paper]
- Sequence-to-Sequence Domain Adaptation Network for Robust Text Image Recognition [CVPR 2019] [paper]
- 2D Attentional Irregular Scene Text Recognizer [arXiv] [paper]
- Deep Neural Network for Semantic-based Text Recognition in Images [arXiv] [paper]
- Symmetry-constrained Rectification Network for Scene Text Recognition [ICCV 2019] [paper]
- Rethinking Irregular Scene Text Recognition [ICDAR 2019-ArT] [paper]
- Adaptive Embedding Gate for Attention-Based Scene Text Recognition [arXiv] [paper]
- SAFL: A Self-Attention Scene Text Recognizer with Focal Loss [arXiv 2022] [paper]
- Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition [AAAI 2018] [paper]
- SqueezedText: A Real-time Scene Text Recognition by Binary Convolutional Encoder-decoder Network [AAAI 2018] [paper]
- Edit Probability for Scene Text Recognition [CVPR 2018] [paper]
- ASTER: An Attentional Scene Text Recognizer with Flexible Rectification [TPAMI 2018] [paper]
- Synthetically Supervised Feature Learning for Scene Text Recognition [ECCV 2018] [paper]
- Scene Text Recognition from Two-Dimensional Perspective [AAAI 2019] [paper]
- ESIR: End-to-end Scene Text Recognition via Iterative Image Rectification [CVPR 2019] [paper]
- STN-OCR: A single Neural Network for Text Detection and Text Recognition [arXiv] [paper]
- Learning to Read Irregular Text with Attention Mechanisms [IJCAI 2017] [paper]
- Scene Text Recognition with Sliding Convolutional Character Models [arXiv] [paper]
- Focusing Attention: Towards Accurate Text Recognition in Natural Images [ICCV 2017] [paper]
- AON: Towards Arbitrarily-Oriented Text Recognition [CVPR 2018] [paper]
- Gated Recurrent Convolution Neural Network for OCR [NIPS 2017] [paper]
- Recursive Recurrent Nets with Attention Modeling for OCR in the Wild [CVPR 2016] [paper]
- Robust scene text recognition with automatic rectification [CVPR 2016] [paper]
  - Code (PyTorch): https://github.com/WarBean/tps_stn_pytorch
  - Code (PyTorch): https://github.com/marvis/ocr_attention
- CNN-N-Gram for Handwriting Word Recognition [CVPR 2016] [paper]
- STAR-Net: A SpaTial Attention Residue Network for Scene Text Recognition [BMVC 2016] [paper]
- Reading Scene Text in Deep Convolutional Sequences [AAAI 2016] [paper]
- An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition [TPAMI 2017] [paper]
- Deep Structured Output Learning for Unconstrained Text Recognition [ICLR 2015] [paper]
- Reading text in the wild with convolutional neural networks [IJCV 2016] [paper]
End-to-end text spotting performs both detection and recognition in a unified framework.
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model [arXiv 2025] [paper]
- FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting [arXiv 2024] [paper]
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [CVPR 2024]
- Bridging the Gap Between End-to-End and Two-Step Text Spotting [CVPR 2024] [paper]
- InstructOCR: Instruction Boosting Scene Text Spotting [arXiv 2024] [paper]
- Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance [arXiv 2024] [paper]
- TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model [arXiv 2024] [paper]
- DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training [arXiv 2024] [paper]
- TransDETR: End-to-End Video Text Spotting with Transformer [IJCV 2024] [paper] [code]
- DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting [CVPR 2023] [paper] [code]
- ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer [ICCV 2023] [paper]
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [arXiv 2023] [paper]
- SText-DETR: End-to-End Arbitrary-Shaped Text Detection with Scalable Query in Transformer [paper]
- SwinTextSpotter: Scene Text Spotting via Better Synergy Between Text Detection and Recognition [CVPR 2022] [paper] [code]
- TESTR: Text Spotting Transformers [arXiv 2022] [paper]
- TransDETR: End-to-End Video Text Spotting with Transformer [arXiv 2022] [paper]
- SPTS: Single-Point Text Spotting [ACM MM 2022]
- SPTS v2: Single-Point Scene Text Spotting [TPAMI 2023]
- Towards Unconstrained End-to-End Text Spotting [ICCV 2019] [paper]
- TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting [ICCV 2019] [paper]
- Convolutional Character Networks [ICCV 2019] [paper]
- FOTS: Fast Oriented Text Spotting With a Unified Network [CVPR 2018] [paper]
- An end-to-end TextSpotter with Explicit Alignment and Attention [CVPR 2018] [paper]
- Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes [ECCV 2018] [paper]
- Deep TextSpotter: An End-To-End Trainable Scene Text Localization and Recognition Framework [ICCV 2017] [paper]
- SEE: Towards Semi-Supervised End-to-End Scene Text Recognition [AAAI 2018] [paper]
  - Code (Chainer): https://github.com/Bartzi/see
- DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text [arXiv 2023] [paper]
- FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation [arXiv 2023] [paper]
- SAMText: Scalable Mask Annotation for Video Text Spotting [arXiv 2023] [paper]
- ICDAR 2023 Competition on Video Text Reading for Dense and Small Text [paper] [competition]
- TransDETR: End-to-End Video Text Spotting with Transformer [arXiv 2022] [paper]
- BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting [arXiv 2021] [paper]
- ICDAR 2021 Competition on Scene Video Text Spotting [paper]
- You Only Recognize Once: Towards Fast Video Text Spotting [arXiv 2019] [paper]
This section covers diffusion models that render text in images with high quality and controllability, along with earlier scene text synthesis methods.
- AnyText2: Visual Text Generation and Editing With Customizable Attributes [arXiv 2025] [paper]
- TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition [ICCV 2025] [paper]
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [ECCV 2024] [paper]
- TextDiffuser: Diffusion Models as Text Painters [arXiv 2023] [paper] [project]
- AnyText: Multilingual Visual Text Generation And Editing [ICLR 2024 Spotlight] [paper] [code]
- Text Image Inpainting via Global Structure-Guided Diffusion Models [arXiv 2024] [paper]
- PSGText: Stroke-Guided Scene Text Editing with PSP Module [arXiv 2023] [paper]
- Weakly supervised scene text generation for low-resource languages [paper]
- Scene Text Synthesis for Efficient and Effective Deep Network Training [arXiv] [paper]
- Synthetic Data for Text Localisation in Natural Images [CVPR 2016] [paper]
- FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing [arXiv 2025] [paper]
- Recognition-Synergistic Scene Text Editing [arXiv 2025] [paper]
- OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation [arXiv 2025] [paper]
- OTR: Synthesizing Overlay Text Dataset for Text Removal [arXiv 2025] [paper]
- DiffSTR: Controlled Diffusion Models for Scene Text Removal [arXiv 2024] [paper]
- Text Image Inpainting via Global Structure-Guided Diffusion Models [arXiv 2024] [paper]
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [CVPR 2024]
- PSGText: Stroke-Guided Scene Text Editing with PSP Module [arXiv 2023] [paper]
- PSSTRNet: Progressive Segmentation-guided Scene Text Removal Network [paper]
- Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter [IEEE TMM 2024] [paper]
- Weakly supervised scene text generation for low-resource languages [paper]
- Weakly Supervised Scene Text Detection using Deep Reinforcement Learning [arXiv 2022] [paper]
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer [paper]
- Mixed-Supervised Scene Text Detection With Expectation-Maximization Algorithm [IEEE TIP 2022] [paper]
- Attention-Based Extraction of Structured Information from Street View Imagery [ICDAR 2017] [paper]
- WeText: Scene Text Detection under Weak Supervision [ICCV 2017] [paper]
- SEE: Towards Semi-Supervised End-to-End Scene Text Recognition [AAAI 2018] [paper]
  - Code (Chainer): https://github.com/Bartzi/see
- Cross-Lingual Learning in Multilingual Scene Text Recognition [ICASSP 2024] [paper] [code]
- Collaborative Encoding Method for Scene Text Recognition in Low Linguistic Resources: The Uyghur Language Case Study [Applied Sciences 2024] [paper]
- Multilingual scene text recognition: A faster R-CNN approach for Bengali and English scripts [paper]
- An End-to-End Scene Text Recognition for Bilingual Text [Eng 2024] [paper]
- OpenOCR: Unified benchmark for Chinese and English OCR [info]
- CDistNet: Perceiving multi-domain character distance for robust text recognition [IJCV 2024]
- MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition [ICCV 2023]
- Chinese Text Recognition with Pre-Trained CLIP-Like Model Through Image-IDS Aligning [arXiv 2023] [paper]
- TPS++: Attention-Enhanced Thin-Plate Spline for Scene Text Recognition [IJCAI 2023]
- ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition (RRC-MLT-2019)
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding [CVPR 2024] [paper]
- DLAFormer: An End-to-End Transformer for Document Layout Analysis [ICDAR 2024]
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [NeurIPS 2024] [paper]
- Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding [NeurIPS 2024] [poster]
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
- DiT: Document Image Transformer
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding [paper]
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
- Generating Handwritten Mathematical Expressions From Symbol Graphs: An End-to-End Pipeline [CVPR 2024] [paper]
- DLAFormer: Document Layout Analysis for Formula Detection [ICDAR 2024]
- Towards Scalable Training for Handwritten Mathematical Expression Recognition [paper]
- Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching [paper]
- MathWriting: A Dataset For Handwritten Mathematical Expression Recognition [NeurIPS 2023] [paper]
- ICDAR 2023 CROHME: Competition on Recognition of Handwritten Mathematical Expressions [paper]
- Mathematical formula detection in document images: A new dataset and a new approach [Pattern Recognition 2023] [paper]
- Handwritten Text Recognition: A Survey [arXiv 2025] [paper]
- Benchmarking Large Language Models for Handwritten Text Recognition [paper]
- GraDeT-HTR: A Resource-Efficient Bengali Handwritten Text Recognition System [paper]
- HTR-VT: Handwritten Text Recognition with Vision Transformer [arXiv 2024] [paper]
- On the Generalization of Handwritten Text Recognition Models [paper]
- Advancing Offline Handwritten Text Recognition: A Systematic Review [paper]
- ViTextVQA: Vietnamese Text Comprehension in Images [paper]
- ViOCRVQA: Novel Benchmark for Vietnamese [paper]
- Beyond OCR + VQA: Towards End-to-End Reading and Reasoning for Robust and Accurate TextVQA [paper]
- Self-Supervised Learning for Text Recognition: A Critical Survey [IJCV 2025] [paper]
- Free Lunch: Frame-level Contrastive Learning with Text Perceiver [ACM MM 2024] [paper]
- Relational Contrastive Learning for Scene Text Recognition [arXiv 2023] [paper]
- Self-Supervised Character-to-Character Distillation for Text Recognition [ICCV 2023] [paper]
- Sequence-to-Sequence Contrastive Learning for Text Recognition [CVPR 2021] [paper]
- Lumos: On-Device Scene Text Recognition for MM LLMs [paper]
- ACP-Net: Asymmetric Center Positioning Network for Real-Time Text Detection [paper]
- YOLOv5ST: A Lightweight and Fast Scene Text Detector [paper]
- A light-weight natural scene text detection and recognition system [paper]
- QEST: Quantized and Efficient Scene Text Detector Using Deep Learning [paper]
- Attention Guided Feature Encoding for Scene Text Recognition [paper]
- STAN: A sequential transformation attention-based network for scene text recognition [paper]
- Rethinking text rectification for scene text recognition [paper]
- MTSTR: Multi-task learning for low-resolution scene text recognition via dual attention mechanism [paper]
Contributions are welcome! Please feel free to submit a Pull Request. When adding papers, please:
- Follow the existing format
- Place papers in reverse chronological order
- Include paper links and code repositories (if available)
- Add [code] badge for papers with available implementations
- Use the year when the paper was first publicly available (including arXiv)
To the extent possible under law, the contributors have waived all copyright and related or neighboring rights to this work.