AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder

Paper title

AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder

Authors

Yunseung Lee, Pilsung Kang

Abstract

Image anomaly detection problems aim to determine whether an image is abnormal, and to detect anomalous areas. These methods are actively used in various fields such as manufacturing, medical care, and intelligent information. Encoder-decoder structures have been widely used in the field of anomaly detection because they can easily learn normal patterns in an unsupervised learning environment and calculate a score to identify abnormalities through a reconstruction error indicating the difference between input and reconstructed images. Therefore, current image anomaly detection methods have commonly used convolutional encoder-decoders to extract normal information through the local features of images. However, they are limited in that only local features of the image can be utilized when constructing a normal representation owing to the characteristics of convolution operations using a filter of fixed size. Therefore, we propose a vision transformer-based encoder-decoder model, named AnoViT, designed to reflect normal information by additionally learning the global relationship between image patches, which is capable of both image anomaly detection and localization. The proposed approach constructs a feature map that maintains the existing location information of individual patches by using the embeddings of all patches passed through multiple self-attention layers. The proposed AnoViT model performed better than the convolution-based model on three benchmark datasets. In MVTecAD, which is a representative benchmark dataset for anomaly localization, it showed improved results on 10 out of 15 classes compared with the baseline. Furthermore, the proposed method showed good performance regardless of the class and type of the anomalous area when localization results were evaluated qualitatively.
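The abstract names two mechanisms that a small example can make concrete: scoring anomalies by reconstruction error, and arranging patch embeddings into a feature map that keeps each patch's position. The following is a minimal sketch in plain Python, not the authors' implementation. The function names are hypothetical, and the squared-error metric, the maximum used as the image-level score, and the row-major patch order are assumptions that the abstract does not specify.

```python
from typing import List, Sequence

Image = List[List[float]]  # H x W grayscale image as nested lists


def error_map(original: Image, reconstructed: Image) -> Image:
    """Pixel-wise squared reconstruction error (assumed metric).

    High values mark pixels the model failed to reconstruct, which is
    the basis for localizing anomalous regions.
    """
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same height")
    result = []
    for row_o, row_r in zip(original, reconstructed):
        if len(row_o) != len(row_r):
            raise ValueError("images must have the same width")
        result.append([(o - r) ** 2 for o, r in zip(row_o, row_r)])
    return result


def image_score(err: Image) -> float:
    """Image-level anomaly score: the largest pixel error (assumed)."""
    return max(max(row) for row in err)


def tokens_to_feature_map(
    tokens: Sequence[Sequence[float]], grid_h: int, grid_w: int
) -> List[List[List[float]]]:
    """Arrange grid_h * grid_w patch embeddings into a spatial grid.

    Tokens are assumed to be in row-major patch order, so the embedding
    at grid position (r, c) belongs to the patch at that location.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    return [
        [list(tokens[r * grid_w + c]) for c in range(grid_w)]
        for r in range(grid_h)
    ]


# Toy data: the reconstruction differs from the input at one pixel only.
original = [[0.0, 0.0], [0.0, 0.0]]
reconstructed = [[0.0, 0.0], [0.0, 0.5]]
err = error_map(original, reconstructed)
score = image_score(err)

# Six one-dimensional patch embeddings laid out on a 2 x 3 patch grid.
tokens = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
feature_map = tokens_to_feature_map(tokens, 2, 3)
```

In this toy case only the bottom-right pixel has a nonzero error, so the error map localizes the anomaly there and the image-level score equals that pixel's error. In a real model the grid of embeddings would be passed to a decoder to produce the reconstruction.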

