Paper Title

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Authors

Weijia Wu, Zhuang Li, Jiahong Li, Chunhua Shen, Hong Zhou, Size Li, Zhongyuan Wang, Ping Luo

Abstract

Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text in video. Existing video text spotting methods typically develop sophisticated pipelines and multiple models, which are not friendly to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time, end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including a GPU-parallel detection post-processing step and a CTC-based recognition head with Masked RoI. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015 video, improving on the previous best method by 10.5% and 32.0 FPS. The code can be found at github.com/weijiawu/CoText.
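The abstract states that contrastive learning is used to model long-range dependencies and learn temporal information across frames. As a rough, illustrative sketch only (not CoText's actual formulation), the snippet below shows an InfoNCE-style contrastive loss over per-instance text embeddings from adjacent frames; all names, shapes, and the temperature value are assumptions for illustration.

```python
# Minimal sketch: InfoNCE-style contrastive loss for associating text instances
# across frames. NOT the paper's exact loss; names and temperature are assumed.
import torch
import torch.nn.functional as F

def frame_contrastive_loss(anchor_emb, positive_emb, negative_emb, temperature=0.07):
    """
    anchor_emb:   (N, D) embeddings of text instances in frame t
    positive_emb: (N, D) embeddings of the same instances in frame t+1
    negative_emb: (M, D) embeddings of other instances / background regions
    """
    anchor = F.normalize(anchor_emb, dim=1)
    positive = F.normalize(positive_emb, dim=1)
    negative = F.normalize(negative_emb, dim=1)

    # Cosine similarities scaled by temperature.
    pos_logits = (anchor * positive).sum(dim=1, keepdim=True) / temperature       # (N, 1)
    neg_logits = anchor @ negative.t() / temperature                              # (N, M)

    # The positive pair sits at index 0 of each row's logits.
    logits = torch.cat([pos_logits, neg_logits], dim=1)                           # (N, 1+M)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

In this kind of setup, positive pairs come from the same text instance observed in neighboring frames and negatives from other instances, so the learned embeddings can double as appearance features for tracking and association, which is consistent with the role the abstract attributes to contrastive learning.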
