论文标题
TextDCT:通过离散余弦转换掩码进行任意形状的文本检测
TextDCT: Arbitrary-Shaped Text Detection via Discrete Cosine Transform Mask
论文作者
论文摘要
由于字体,大小,颜色和方向的各种文本变化,任意形状的场景文本检测是一项具有挑战性的任务。大多数基于回归的方法求助于回归文本区域的口罩或轮廓点以建模文本实例。但是,回归完整的口罩需要高训练的复杂性,轮廓点不足以捕获高度弯曲的文本的细节。为了应对上述限制,我们提出了一个名为TextDCT的新颖的无锚文本检测框架,该框架采用离散的余弦变换(DCT)将文本掩码编码为紧凑型向量。此外,考虑到金字塔层中训练样本不平衡的数量,我们仅采用单层头来进行自上而下的预测。为了模拟单层头部中的多尺度文本,我们通过将缩减文本区域视为正面样本,并通过融合丰富的上下文信息并专注于更重要的特征,从而介绍一种新颖的积极抽样策略,并设计一个特征意识模块(FAM),以融合丰富的上下文信息。此外,我们提出了一种分割的非最大抑制(S-NMS)方法,该方法可以过滤低质量的掩模回归。在四个具有挑战性的数据集上进行了广泛的实验,这表明我们的TextDCT在准确性和效率上都获得了竞争性能。具体而言,TextDCT分别以每秒17.2帧(FPS)和F-MEASION的f-MEASIE达到85.1,而CTW1500和Total-Text数据集的F-MEASION分别为15.1 fps。
Arbitrary-shaped scene text detection is a challenging task due to the variety of text changes in font, size, color, and orientation. Most existing regression based methods resort to regress the masks or contour points of text regions to model the text instances. However, regressing the complete masks requires high training complexity, and contour points are not sufficient to capture the details of highly curved texts. To tackle the above limitations, we propose a novel light-weight anchor-free text detection framework called TextDCT, which adopts the discrete cosine transform (DCT) to encode the text masks as compact vectors. Further, considering the imbalanced number of training samples among pyramid layers, we only employ a single-level head for top-down prediction. To model the multi-scale texts in a single-level head, we introduce a novel positive sampling strategy by treating the shrunk text region as positive samples, and design a feature awareness module (FAM) for spatial-awareness and scale-awareness by fusing rich contextual information and focusing on more significant features. Moreover, we propose a segmented non-maximum suppression (S-NMS) method that can filter low-quality mask regressions. Extensive experiments are conducted on four challenging datasets, which demonstrate our TextDCT obtains competitive performance on both accuracy and efficiency. Specifically, TextDCT achieves F-measure of 85.1 at 17.2 frames per second (FPS) and F-measure of 84.9 at 15.1 FPS for CTW1500 and Total-Text datasets, respectively.