Paper Title
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Paper Authors
Paper Abstract
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset-specific adaptations.
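As a rough illustration of the two-stage paradigm the abstract describes, the sketch below classifies a single class-agnostic mask proposal with a frozen, pre-trained CLIP model (the openai/CLIP package). This is a minimal sketch, not the authors' code: the function name `classify_masked_region`, the prompt template, and the convention of zeroing out background pixels are assumptions for illustration. The zeroed background produced here is the kind of "blank" region that the paper's mask-aware finetuning and mask prompt tuning are designed to handle.

```python
# Minimal sketch (not the paper's implementation): classify one class-agnostic
# mask proposal with a frozen, pre-trained CLIP model.
import numpy as np
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_masked_region(image: Image.Image, mask: np.ndarray, class_names):
    """Zero out pixels outside the mask proposal and let CLIP pick a class.

    `mask` is an H x W binary array from a class-agnostic proposal generator.
    The zeroed ("blank") background is exactly the distribution shift that
    the paper addresses with finetuning on masked regions and mask prompt tuning.
    """
    img = np.array(image)                     # copy of the RGB image, H x W x 3
    img[~mask.astype(bool)] = 0               # blank out everything outside the mask
    masked = preprocess(Image.fromarray(img)).unsqueeze(0).to(device)

    # Illustrative prompt template; the paper may use a different one.
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(masked)
        txt_feat = model.encode_text(prompts)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

    best = probs.argmax().item()
    return class_names[best], probs.squeeze(0).tolist()
```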