Paper Title
Combining Metric Learning and Attention Heads For Accurate and Efficient Multilabel Image Classification
Paper Authors
Paper Abstract
Multilabel image classification allows predicting a set of labels from a given image. Unlike multiclass classification, where only one label per image is assigned, such a setup is applicable to a broader range of applications. In this work, we revisit two popular approaches to multilabel classification: transformer-based heads and label relation graph processing branches. Although transformer-based heads are considered to achieve better results than graph-based branches, we argue that, with the proper training strategy, graph-based methods can demonstrate just a small accuracy drop while spending less computational resources on inference. In our training strategy, instead of Asymmetric Loss (ASL), which is the de facto standard for multilabel classification, we introduce its metric learning modification. In each binary classification sub-problem, it operates on $L_2$-normalized feature vectors coming from the backbone and enforces the angles between the normalized representations of positive and negative samples to be as large as possible. This provides better discrimination ability than binary cross-entropy loss does on unnormalized features. With the proposed loss and training strategy, we obtain SOTA results among single-modality methods on widespread multilabel classification benchmarks such as MS-COCO, PASCAL-VOC, NUS-Wide and Visual Genome 500. The source code of our method is available as part of the OpenVINO Training Extensions: https://github.com/openvinotoolkit/deep-object-reid/tree/multilabel
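The angular reformulation described in the abstract can be illustrated with a short sketch. The following is a minimal PyTorch sketch, not the paper's exact loss: the class name `AngularBinaryLoss`, the `scale` and `margin` values, and the plain binary cross-entropy term are illustrative assumptions (the paper builds an ASL-style asymmetric weighting on top of such angular logits).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularBinaryLoss(nn.Module):
    """Illustrative metric-learning loss for multilabel classification.

    Each label is treated as a binary sub-problem: the L2-normalized image
    feature is compared with an L2-normalized per-class weight vector, so
    each logit is the cosine of the angle between them. Training pushes
    positive cosines up and negative cosines down, i.e. it enforces large
    angles between normalized representations of positive and negative
    samples.
    """

    def __init__(self, in_features: int, num_classes: int,
                 scale: float = 30.0, margin: float = 0.3):
        super().__init__()
        # One learnable "class prototype" per label (assumed parametrization).
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale    # assumed value: rescales cosines into a usable logit range
        self.margin = margin  # assumed additive cosine margin applied to positives

    def forward(self, features: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Cosine similarities between normalized features and normalized
        # prototypes: shape (batch, num_classes), values in [-1, 1].
        cos = F.linear(F.normalize(features, dim=1), F.normalize(self.weight, dim=1))
        # The margin makes positives harder to satisfy, widening the angular
        # separation between positive and negative samples.
        logits = self.scale * torch.where(targets > 0, cos - self.margin, cos)
        # Plain per-label BCE here; the paper applies an asymmetric
        # ASL-style weighting instead of this symmetric term.
        return F.binary_cross_entropy_with_logits(logits, targets.float())

# Usage on dummy data: 16 images, 2048-d backbone features, 80 labels (MS-COCO size).
loss_fn = AngularBinaryLoss(in_features=2048, num_classes=80)
loss = loss_fn(torch.randn(16, 2048), torch.randint(0, 2, (16, 80)))
```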