Awesome Scene Text
IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection
multi-oriented text
Abstract
Incidental scene text detection, especially for multi-oriented text regions, is one of the most challenging tasks in many computer vision applications. Different from the common object detection task, scene text often suffers from a large variance of aspect ratio, scale, and orientation. To solve this problem, we propose a novel end-to-end scene text detector, IncepText, from an instance-aware segmentation perspective. We design a novel Inception-Text module and introduce deformable PSROI pooling to deal with multi-oriented text detection. Extensive experiments on the ICDAR2015, RCTW-17, and MSRA-TD500 datasets demonstrate our method's superiority in terms of both effectiveness and efficiency. Our proposed method achieves the 1st-place result on the ICDAR2015 challenge and state-of-the-art performance on the other datasets. Moreover, we have released our implementation as an OCR product which is available for public access.
Shape Robust Text Detection with Progressive Scale Expansion Network
multi-oriented text
Abstract
The challenges of shape-robust text detection lie in two aspects: 1) most existing quadrangular-bounding-box-based detectors struggle to locate texts with arbitrary shapes, which are hard to enclose perfectly in a rectangle; 2) most pixel-wise segmentation-based detectors may not separate text instances that are very close to each other. To address these problems, we propose a novel Progressive Scale Expansion Network (PSENet), designed as a segmentation-based detector with multiple predictions for each text instance. These predictions correspond to different kernels produced by shrinking the original text instance to various scales. Consequently, the final detection can be conducted through our progressive scale expansion algorithm, which gradually expands the kernels with minimal scales to the text instances with maximal and complete shapes. Because there are large geometrical margins among these minimal kernels, our method is effective in distinguishing adjacent text instances and is robust to arbitrary shapes. The state-of-the-art results on the ICDAR 2015 and ICDAR 2017 MLT benchmarks further confirm the effectiveness of PSENet. Notably, PSENet outperforms the previous best record by an absolute 6.37% on the curve text dataset SCUT-CTW1500.
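The progressive scale expansion step described above is essentially repeated breadth-first growth from the smallest kernels outward. Below is a minimal sketch of that idea, assuming the network has already produced binary kernel maps ordered from smallest to largest; the function and variable names are illustrative, not taken from the official PSENet code.

```python
import numpy as np
import cv2
from collections import deque

def progressive_scale_expansion(kernels):
    """kernels: list of (H, W) uint8 binary masks, smallest kernel first."""
    # Seed instances from the connected components of the smallest kernel.
    _, labels = cv2.connectedComponents(kernels[0])
    h, w = labels.shape
    for kernel in kernels[1:]:
        # Grow every labelled pixel into the next, larger kernel (BFS, 4-connectivity).
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and kernel[nr, nc] and labels[nr, nc] == 0:
                    labels[nr, nc] = labels[r, c]  # conflicts resolved first-come, first-served
                    queue.append((nr, nc))
    return labels  # per-pixel instance labels at the full text scale
```

Because a pixel keeps the first label that reaches it, adjacent instances remain separated even when their full-scale masks touch.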
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
multi-oriented text, arbitrary shapes
Abstract
Driven by deep neural networks and large scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed as TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. Such geometry attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.
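To make the disk-sequence representation concrete, the sketch below shows one plausible way to store a disk (center, radius, orientation) and to rebuild a polygonal outline from an ordered chain of disks; the dataclass and function are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
import math

@dataclass
class Disk:
    cx: float      # center x on the symmetric axis of the text instance
    cy: float      # center y
    radius: float  # roughly half the local text height
    theta: float   # local orientation of the center line, in radians

def disks_to_polygon(disks):
    """Rebuild a closed outline from an ordered sequence of overlapping disks."""
    top, bottom = [], []
    for d in disks:
        # Offset each center perpendicular to the local orientation.
        nx, ny = -math.sin(d.theta), math.cos(d.theta)
        top.append((d.cx + d.radius * nx, d.cy + d.radius * ny))
        bottom.append((d.cx - d.radius * nx, d.cy - d.radius * ny))
    return top + bottom[::-1]  # walk one side, then back along the other
```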
Sliding Line Point Regression for Shape Robust Scene Text Detection
multi-oriented text, arbitrary shapes
Abstract
Traditional text detection methods mostly focus on quadrangle text. In this study we propose a novel method named sliding line point regression (SLPR) to detect arbitrary-shape text in natural scenes. SLPR regresses multiple points on the edge of the text line and then utilizes these points to sketch the outline of the text. The proposed SLPR can be adapted to many object detection architectures such as Faster R-CNN and R-FCN. Specifically, we first generate the smallest rectangular box containing the text with a region proposal network (RPN), then isometrically regress the points on the edge of the text by using vertically and horizontally sliding lines. To make full use of information and reduce redundancy, we calculate the x-coordinate or y-coordinate of a target point from the rectangular box position and regress only the remaining y-coordinate or x-coordinate. Accordingly, we can not only reduce the parameters of the system, but also constrain the points so that they form a more regular polygon. Our approach achieved competitive results on the traditional ICDAR2015 Incidental Scene Text benchmark and the curve text detection dataset CTW1500.
Arbitrary-Oriented Scene Text Detection via Rotation Proposals
multi-oriented text
Abstract
This paper introduces a novel rotation-based framework for arbitrary-oriented text detection in natural scene images. We present the Rotation Region Proposal Networks (RRPN), which are designed to generate inclined proposals with text orientation angle information. The angle information is then adapted for bounding box regression to make the proposals more accurately fit into the text region in terms of the orientation. The Rotation Region-of-Interest (RRoI) pooling layer is proposed to project arbitrary-oriented proposals to a feature map for a text region classifier. The whole framework is built upon a region-proposal-based architecture, which ensures the computational efficiency of the arbitrary-oriented text detection compared with previous text detection systems. We conduct experiments using the rotation-based framework on three real-world scene text detection datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
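The core change relative to an axis-aligned RPN is that every anchor also carries an angle, so proposals are parameterised as (cx, cy, w, h, θ). A hedged sketch of such anchor enumeration follows; the exact scales, ratios, and angle set are assumptions for illustration, not necessarily the values used in the paper.

```python
import itertools
import math

def rotated_anchors(cx, cy,
                    scales=(8, 16, 32),
                    ratios=(2, 5, 8),
                    angles=(-math.pi / 6, 0, math.pi / 6,
                            math.pi / 3, math.pi / 2, 2 * math.pi / 3)):
    """Enumerate rotated anchors (cx, cy, w, h, theta) at one feature-map location."""
    anchors = []
    for scale, ratio, angle in itertools.product(scales, ratios, angles):
        h = float(scale)
        w = float(scale) * ratio  # text boxes are usually much wider than tall
        anchors.append((cx, cy, w, h, angle))
    return anchors
```

Each proposal is then refined by regressing all five parameters, and RRoI pooling warps the inclined region into an axis-aligned feature grid for classification.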
Detecting Curve Text in the Wild: New Dataset and New Solution
multi-oriented text, arbitrary shapes
Abstract
Scene text detection has made great progress in recent years. Detection manners are evolving from axis-aligned rectangles to rotated rectangles and further to quadrangles. However, current datasets contain very little curve text, which can be widely observed in scene images such as signboards, product names and so on. To raise concerns about reading curve text in the wild, in this paper we construct a curve text dataset named CTW1500, which includes over 10k text annotations in 1,500 images (1,000 for training and 500 for testing). Based on this dataset, we propose a pioneering polygon-based curve text detector (CTD) which can directly detect curve text without empirical combination. Moreover, by seamlessly integrating the recurrent transverse and longitudinal offset connection (TLOC), the proposed method can be trained end-to-end to learn the inherent connection among the position offsets. This allows the CTD to explore context information instead of predicting points independently, resulting in smoother and more accurate detection. We also propose two simple but effective post-processing methods named non-polygon suppression (NPS) and polygonal non-maximum suppression (PNMS) to further improve the detection accuracy. Furthermore, the proposed approach is designed in a universal manner and can also be trained with rectangular or quadrilateral bounding boxes without extra effort. Experimental results on CTW-1500 demonstrate that our method with only a light backbone can outperform state-of-the-art methods by a large margin. When evaluating only on the curve or non-curve subset, the CTD + TLOC still achieves the best results.
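Polygonal non-maximum suppression (PNMS) is ordinary score-ordered NMS with the IoU computed between polygons rather than axis-aligned boxes. A small sketch of that idea follows, using shapely for the polygon geometry; function names and the threshold are illustrative.

```python
from shapely.geometry import Polygon

def polygon_iou(pts_a, pts_b):
    """IoU of two polygons given as lists of (x, y) vertices."""
    a, b = Polygon(pts_a), Polygon(pts_b)
    if not a.is_valid or not b.is_valid:
        return 0.0
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

def polygon_nms(polygons, scores, iou_threshold=0.3):
    """Keep the highest-scoring polygons, dropping those that overlap a kept one."""
    order = sorted(range(len(polygons)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if polygon_iou(polygons[best], polygons[i]) <= iou_threshold]
    return keep
```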
FOTS: Fast Oriented Text Spotting with a Unified Network
multi-oriented text
Abstract
Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community. Most existing methods treat text detection and recognition as separate tasks. In this work, we propose a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information between the two complementary tasks. Specifically, RoIRotate is introduced to share convolutional features between detection and recognition. Benefiting from the convolution sharing strategy, our FOTS has little computation overhead compared to a baseline text detection network, and the joint training method learns more generic features, making our method perform better than two-stage methods. Experiments on the ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets demonstrate that the proposed method outperforms state-of-the-art methods significantly, which further allows us to develop the first real-time oriented text spotting system, surpassing all previous state-of-the-art results by more than 5% on the ICDAR 2015 text spotting task while keeping 22.6 fps.
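RoIRotate warps an inclined text region of the shared feature map into an axis-aligned, fixed-height strip that the recognition branch can consume. The sketch below illustrates the idea with a plain affine warp on a single-channel map; it is a simplified stand-in, not the FOTS implementation (which performs the sampling differentiably inside the network).

```python
import cv2
import numpy as np

def roi_rotate(feature_map, center, size, angle_deg, out_height=8):
    """Crop a rotated region into an axis-aligned strip of fixed height.

    feature_map: (H, W) float32 array (single channel, for illustration)
    center:      (cx, cy) of the rotated box, in feature-map pixels
    size:        (box_w, box_h) of the rotated box
    angle_deg:   box rotation in degrees
    """
    box_w, box_h = size
    scale = out_height / box_h
    out_width = max(1, int(round(box_w * scale)))
    # Rotate and scale about the box center so the box becomes axis-aligned
    # with height out_height, then shift its center to the output center.
    m = cv2.getRotationMatrix2D(center, angle_deg, scale)
    m[0, 2] += out_width / 2.0 - center[0]
    m[1, 2] += out_height / 2.0 - center[1]
    return cv2.warpAffine(feature_map, m, (out_width, out_height))
```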
TextBoxes++: A Single-Shot Oriented Scene Text Detector
multi-oriented text
Abstract
Scene text detection is an important step of scene text recognition system and also a challenging problem. Different from general object detection, the main challenges of scene text detection lie on arbitrary orientations, small sizes, and significantly variant aspect ratios of text in natural images. In this paper, we present an end-to-end trainable fast scene text detector, named TextBoxes++, which detects arbitrary-oriented scene text with both high accuracy and efficiency in a single network forward pass. No post-processing other than an efficient non-maximum suppression is involved. We have evaluated the proposed TextBoxes++ on four public datasets. In all experiments, TextBoxes++ outperforms competing methods in terms of text localization accuracy and runtime. More specifically, TextBoxes++ achieves an f-measure of 0.817 at 11.6fps for 1024x1024 ICDAR 2015 Incidental text images, and an f-measure of 0.5591 at 19.8fps for 768x768 COCO-Text images. Furthermore, combined with a text recognizer, TextBoxes++ significantly outperforms the state-of-the-art approaches for word spotting and end-to-end text recognition tasks on popular benchmarks.
R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
multi-oriented text
Abstract
In this paper, we propose a novel method called Rotational Region CNN (R2CNN) for detecting arbitrary-oriented texts in natural scene images. The framework is based on Faster R-CNN [1] architecture. First, we use the Region Proposal Network (RPN) to generate axis-aligned bounding boxes that enclose the texts with different orientations. Second, for each axis-aligned text box proposed by RPN, we extract its pooled features with different pooled sizes and the concatenated features are used to simultaneously predict the text/non-text score, axis-aligned box and inclined minimum area box. At last, we use an inclined non-maximum suppression to get the detection results. Our approach achieves competitive results on text detection benchmarks: ICDAR 2015 and ICDAR 2013.
PixelLink: Detecting Scene Text via Instance Segmentation
multi-oriented text, arbitrary shapes
Abstract
Most state-of-the-art scene text detection algorithms are deep learning based methods that depend on bounding box regression and perform at least two kinds of predictions: text/non-text classification and location regression. Regression plays a key role in the acquisition of bounding boxes in these methods, but it is not indispensable because text/non-text prediction can also be considered as a kind of semantic segmentation that contains full location information in itself. However, text instances in scene images often lie very close to each other, making them very difficult to separate via semantic segmentation. Therefore, instance segmentation is needed to address this problem. In this paper, PixelLink, a novel scene text detection algorithm based on instance segmentation, is proposed. Text instances are first segmented out by linking pixels within the same instance together. Text bounding boxes are then extracted directly from the segmentation result without location regression. Experiments show that, compared with regression-based methods, PixelLink can achieve better or comparable performance on several benchmarks, while requiring many fewer training iterations and less training data.
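The pixel-linking step can be summarised as: threshold the text/non-text map, then merge any two neighbouring text pixels whose connecting link is positive, which is a classic union-find problem. A minimal sketch of that post-processing follows; the array layout and thresholding are assumptions for illustration.

```python
import numpy as np

def link_pixels(text_mask, link_maps):
    """text_mask: (H, W) bool; link_maps: (8, H, W) bool, one map per neighbour."""
    h, w = text_mask.shape
    parent = np.arange(h * w)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    for k, (dr, dc) in enumerate(neighbours):
        for r in range(h):
            for c in range(w):
                nr, nc = r + dr, c + dc
                if (text_mask[r, c] and 0 <= nr < h and 0 <= nc < w
                        and text_mask[nr, nc] and link_maps[k, r, c]):
                    union(r * w + c, nr * w + nc)

    # Relabel roots to compact instance ids; rotated boxes can then be read off
    # from each instance's pixels (e.g. with cv2.minAreaRect).
    labels = np.zeros((h, w), dtype=np.int32)
    roots = {}
    for r in range(h):
        for c in range(w):
            if text_mask[r, c]:
                root = find(r * w + c)
                labels[r, c] = roots.setdefault(root, len(roots) + 1)
    return labels
```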
EAST: An Efficient and Accurate Scene Text Detector
multi-oriented text
Abstract
Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even when equipped with deep neural network models, because the overall performance is determined by the interplay of multiple stages and components in the pipelines. In this work, we propose a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps (e.g., candidate aggregation and word partitioning), with a single neural network. The simplicity of our pipeline allows concentrating efforts on designing loss functions and neural network architecture. Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency. On the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2 fps at 720p resolution.
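For the RBOX variant, each pixel of the geometry map predicts its distances to the four sides of the text rectangle plus a rotation angle, and a box is recovered by rotating those offsets around the pixel. The sketch below decodes a single pixel under that convention; the channel order and rotation sign are assumptions for illustration, not guaranteed to match any particular EAST implementation.

```python
import math

def decode_rbox(x, y, d_top, d_right, d_bottom, d_left, angle):
    """Return the four corners of the rotated box predicted at pixel (x, y)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Corners in the box's local (unrotated) frame, relative to the predicting pixel.
    local = [(-d_left, -d_top), (d_right, -d_top),
             (d_right, d_bottom), (-d_left, d_bottom)]
    corners = []
    for lx, ly in local:
        corners.append((x + lx * cos_a - ly * sin_a,
                        y + lx * sin_a + ly * cos_a))
    return corners
```

Boxes decoded from all positive pixels are then merged with non-maximum suppression to produce the final detections.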
Single Shot Text Detector with Regional Attention
Abstract
We present a novel single-shot text detector that directly outputs word-level bounding boxes in a natural image. We propose an attention mechanism which roughly identifies text regions via an automatically learned attentional map. This substantially suppresses background interference in the convolutional features, which is the key to producing accurate inference of words, particularly at extremely small sizes. This results in a single model that essentially works in a coarse-to-fine manner. It departs from recent FCN-based text detectors which cascade multiple FCN models to achieve an accurate prediction. Furthermore, we develop a hierarchical inception module which efficiently aggregates multi-scale inception features. This enhances local details and also encodes strong context information, allowing the detector to work reliably on multi-scale and multi-orientation text with single-scale images. Our text detector achieves an F-measure of 77% on the ICDAR 2015 benchmark, advancing the state-of-the-art results in [18, 28].
Detecting Multi-Oriented Text with Corner-based Region Proposals
Abstract
Previous approaches for scene text detection usually rely on manually defined sliding windows. In this paper, an intuitive region-based method is presented to detect multi-oriented text without any prior knowledge regarding the textual shape. We first introduce a Corner-based Region Proposal Network (CRPN) that employs corners to estimate the possible locations of text instances instead of shifting a set of default anchors. The proposals generated by CRPN are geometry adaptive, which makes our method robust to various text aspect ratios and orientations. Moreover, we design a simple embedded data augmentation module inside the region-wise subnetwork, which not only ensures the model utilizes training data more efficiently, but also learns to find the most representative instance of the input images for training. Experimental results on public benchmarks confirm that the proposed method is capable of achieving comparable performance with the state-of-the-art methods. On the ICDAR 2013 and 2015 datasets, it obtains F-measure of 0.876 and 0.845 respectively.
An end-to-end TextSpotter with Explicit Alignment and Attention
Abstract
Text detection and recognition in natural images have long been considered as two separate tasks that are processed sequentially. Training the two tasks in a unified framework is non-trivial due to significant differences in optimisation difficulties. In this work, we present a conceptually simple yet efficient framework that simultaneously processes the two tasks in one shot. Our main contributions are three-fold: 1) we propose a novel text-alignment layer that allows it to precisely compute convolutional features of a text instance in arbitrary orientation, which is the key to boosting the performance; 2) a character attention mechanism is introduced by using character spatial information as explicit supervision, leading to large improvements in recognition; 3) the two technologies, together with a new RNN branch for word recognition, are integrated seamlessly into a single model which is end-to-end trainable. This allows the two tasks to work collaboratively by sharing convolutional features, which is critical to identifying challenging text instances. Our model achieves impressive results in end-to-end recognition on the ICDAR2015 dataset, significantly advancing the most recent results, with improvements of F-measure from (0.54, 0.51, 0.47) to (0.82, 0.77, 0.63) by using a strong, weak and generic lexicon respectively. Thanks to joint training, our method can also serve as a good detector by achieving new state-of-the-art detection performance on two datasets.