Title
Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching
Authors
Abstract
One-shot Video Object Segmentation (VOS) is the task of tracking an object of interest pixel-wise through a video sequence, where the segmentation mask of the first frame is given at inference time. In recent years, Recurrent Neural Networks (RNNs) have been widely used for VOS, but they often suffer from limitations such as drift and error propagation. In this work, we study an RNN-based architecture and address some of these issues by proposing a hybrid sequence-to-sequence architecture named HS2S, which uses a dual mask propagation strategy to incorporate the information obtained from correspondence matching. Our experiments show that augmenting the RNN with correspondence matching is a highly effective way to reduce drift. The additional information helps the model predict more accurate masks and makes it robust against error propagation. We evaluate HS2S on the DAVIS2017 and YouTube-VOS datasets. On YouTube-VOS, we improve the overall segmentation accuracy by 11.2 percentage points (pp) over state-of-the-art RNN-based VOS methods. We analyze the model's behavior in challenging cases such as occlusion and long sequences, and show that our hybrid architecture significantly improves segmentation quality in these difficult scenarios.
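To make the idea of "correspondence matching" more concrete, the following is a minimal sketch of one common form of it, not the HS2S implementation. It assumes that every pixel has a feature vector, and that each pixel of the current frame is scored by its highest cosine similarity to any foreground pixel of the first (reference) frame, whose mask is given. The helper names `match_to_reference` and `matching_mask`, and the 0.5 threshold, are hypothetical and chosen only for illustration. The abstract does not say how HS2S computes or fuses this signal.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_to_reference(
    ref_feats: List[Vector],
    ref_mask: List[int],
    cur_feats: List[Vector],
) -> List[float]:
    """Hypothetical helper: for each current-frame pixel, return the highest
    cosine similarity to any foreground pixel of the reference frame.

    Feature maps are flattened, with one feature vector per pixel.
    """
    # Keep only the reference pixels marked as foreground by the first-frame mask.
    foreground = [f for f, m in zip(ref_feats, ref_mask) if m]
    if not foreground:
        # No object in the reference frame: nothing to match against.
        return [0.0] * len(cur_feats)
    return [max(cosine(c, r) for r in foreground) for c in cur_feats]


def matching_mask(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Binarize matching scores into a mask (the threshold is an assumption)."""
    return [1 if s >= threshold else 0 for s in scores]


if __name__ == "__main__":
    # Three reference pixels with 2-D features; only the first one is foreground.
    ref_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ref_mask = [1, 0, 0]
    cur_feats = [[2.0, 0.0], [0.0, 3.0], [1.0, 0.1]]
    scores = match_to_reference(ref_feats, ref_mask, cur_feats)
    print(matching_mask(scores))  # [1, 0, 1]
```

Matching against the first frame, instead of only the previous prediction, supplies a signal that does not accumulate the RNN's own errors. This is the intuition behind using matching to counter drift. How the matched mask is combined with the recurrent prediction in HS2S's dual mask propagation is not specified here.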