Paper Title
Sign Language Video Retrieval with Free-Form Textual Queries
Authors
Abstract
Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written query (e.g., a sentence) and a large collection of sign language videos, the objective is to find the signing video in the collection that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labeled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.