Paper Title
MANGO: A Mask Attention Guided One-Stage Scene Text Spotter
Paper Authors
Paper Abstract
Recently, end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region-of-interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such a framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, in this paper we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operations. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated to different feature map channels, which are further grouped into a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g., rectangular bounding boxes) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.
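To make the mask attention idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module names, tensor sizes, 1x1-convolution attention heads, and the linear decoder are illustrative assumptions. It only shows how per-instance and per-character attention maps over a shared feature map could allocate each text instance to its own channel, pool a batch of instance features, and decode them into character logits without any RoI operation.

```python
import torch
import torch.nn as nn

class MaskAttentionSpotterSketch(nn.Module):
    """Hypothetical sketch of mask-attention-guided one-stage spotting:
    attention weights per instance slot and per character slot select
    features from a shared map, which a lightweight head decodes."""

    def __init__(self, feat_channels=256, max_instances=16, max_chars=25, num_classes=97):
        super().__init__()
        # One attention channel per candidate text instance, so each
        # instance occupies its own feature-map channel (assumed head).
        self.instance_attn = nn.Conv2d(feat_channels, max_instances, kernel_size=1)
        # Per-character attention within an instance (assumed shared head).
        self.char_attn = nn.Conv2d(feat_channels, max_chars, kernel_size=1)
        # Lightweight sequence decoder: here just a linear classifier.
        self.classifier = nn.Linear(feat_channels, num_classes)

    def forward(self, feats):
        # feats: (B, C, H, W) shared backbone feature map
        inst_w = self.instance_attn(feats).flatten(2).softmax(-1)   # (B, K, HW)
        char_w = self.char_attn(feats).flatten(2).softmax(-1)       # (B, T, HW)
        flat = feats.flatten(2)                                     # (B, C, HW)
        # Joint spatial weight for instance k and character slot t.
        joint = inst_w.unsqueeze(2) * char_w.unsqueeze(1)           # (B, K, T, HW)
        joint = joint / joint.sum(-1, keepdim=True).clamp_min(1e-6)
        # Pool one feature vector per instance and character slot.
        char_feats = torch.einsum('bkth,bch->bktc', joint, flat)    # (B, K, T, C)
        return self.classifier(char_feats)                          # (B, K, T, num_classes)
```

With the default sizes above, a feature map of shape (1, 256, 64, 64) yields logits of shape (1, 16, 25, 97), i.e., one class distribution per instance slot and character slot, which can then be supervised with sequence-level text annotations.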