Paper title
Fully-attentive and interpretable: vision and video vision transformers for pain detection
Paper authors
Paper abstract
Pain is a serious and costly issue globally, but to be treated, it must first be detected. Vision transformers are a top-performing architecture in computer vision, yet there has been little research on their use for pain detection. In this paper, we propose the first fully-attentive automated pain detection pipeline that achieves state-of-the-art performance on binary pain detection from facial expressions. The model is trained on the UNBC-McMaster dataset, after faces are 3D-registered and rotated to the canonical frontal view. In our experiments, we identify important areas of the hyperparameter space and their interaction with vision and video vision transformers, obtaining three noteworthy models. We analyse the attention maps of one of our models, finding reasonable interpretations for its predictions. We also evaluate Mixup, an augmentation technique, and Sharpness-Aware Minimization, an optimizer, with no success. Our presented models, ViT-1 (F1 score 0.55 ± 0.15), ViViT-1 (F1 score 0.55 ± 0.13), and ViViT-2 (F1 score 0.49 ± 0.04), all outperform earlier works, showing the potential of vision transformers for pain detection. Code is available at https://github.com/IPDTFE/ViT-McMaster