Paper Title

ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

Paper Authors

Ziqin Zhou, Bowen Zhang, Yinjie Lei, Lingqiao Liu, Yifan Liu

Paper Abstract

Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts with a straightforward extension as our baseline, which generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm can heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple but effective designs and find that they largely retain the inherent zero-shot capability of CLIP while improving pixel-level generalization. Incorporating these modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage methods, our one-stage ZegCLIP runs about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
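The baseline described above, generating semantic masks by comparing CLIP text embeddings with per-patch image embeddings, can be illustrated with a minimal sketch. This is not the authors' released implementation (see the linked repository); the function name, tensor shapes, and temperature value below are illustrative assumptions.

```python
# Minimal sketch of the one-stage baseline idea: per-patch vs. per-class
# cosine similarity from CLIP features, reshaped into dense mask logits.
# Not the ZegCLIP implementation; shapes and names are assumptions.
import torch
import torch.nn.functional as F

def baseline_zero_shot_masks(patch_tokens, text_embeds, grid_hw, temperature=0.07):
    """
    patch_tokens: (B, N, D) per-patch features from the CLIP image encoder (CLS token removed)
    text_embeds:  (C, D)    CLIP text embeddings, one per class prompt (seen or unseen)
    grid_hw:      (H, W)    patch-grid size, with H * W == N
    Returns per-class mask logits of shape (B, C, H, W).
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (B, N, D) @ (D, C) -> (B, N, C): similarity of every patch to every class.
    logits = patch_tokens @ text_embeds.t() / temperature

    # Fold the patch sequence back into a 2D grid of per-class scores.
    B, N, C = logits.shape
    H, W = grid_hw
    return logits.permute(0, 2, 1).reshape(B, C, H, W)
```

In use, the logits would be upsampled to the image resolution and argmaxed over the class dimension; unseen classes enter only through their text embeddings, which is what makes the prediction zero-shot.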
