Paper Title
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
Paper Authors
Paper Abstract
Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase performance on unseen test examples by creating new training samples through semantics-preserving techniques, such as color-space or geometric transformations on images. Yet, these techniques are usually applied to raw data, leading to more resource-demanding solutions and also requiring that the raw data can be shared, which may not always be possible, e.g., due to copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, achieving considerable improvements over a baseline method and improved state-of-the-art performance, while also performing multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
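The abstract only describes the augmentation at a high level. As a rough illustration, the snippet below is a minimal sketch of mixing a pair of semantically similar video-caption samples in feature space, assuming a mixup-style linear interpolation with a Beta-distributed coefficient and hypothetical feature dimensions; the actual pairing criterion and interpolation scheme of the paper may differ (see the released code for the authors' implementation).

```python
import torch

def mix_pair(video_a, caption_a, video_b, caption_b, alpha=0.5):
    """Create a new (video, caption) training sample by linearly mixing
    the feature vectors of two semantically similar samples.

    Assumption: the mixing coefficient is drawn from a Beta(alpha, alpha)
    distribution, as in mixup; this is illustrative, not necessarily the
    paper's exact choice.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    new_video = lam * video_a + (1.0 - lam) * video_b
    new_caption = lam * caption_a + (1.0 - lam) * caption_b
    return new_video, new_caption

# Hypothetical feature sizes: 1024-d video features, 768-d caption embeddings.
# "Semantically similar" pairs could, for instance, be nearest neighbours in
# the caption embedding space (an assumption for this sketch).
v1, v2 = torch.randn(1024), torch.randn(1024)
c1, c2 = torch.randn(768), torch.randn(768)
aug_video, aug_caption = mix_pair(v1, c1, v2, c2)
```

Because the mixing happens on precomputed features rather than raw frames or text, new samples can be generated without re-decoding videos and without redistributing the original clips, which is the motivation stated in the abstract.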