Class-attention Video Transformer for Engagement Intensity Prediction

Paper title

Class-attention Video Transformer for Engagement Intensity Prediction

Authors

Ai, Xusheng; Sheng, Victor S.; Li, Chunhua; Cui, Zhiming

Abstract

To handle variable-length long videos, prior works extract multi-modal features and fuse them to predict students' engagement intensity. In this paper, we present a new end-to-end method, Class Attention in Video Transformer (CavT), which uses a single vector to process the class embedding and performs end-to-end learning uniformly on both variable-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representatives sampling method (BorS), which augments the training set by adding multiple video sequences sampled from each video. BorS+CavT achieves state-of-the-art MSE on both the EmotiW-EP dataset (0.0495) and the DAiSEE dataset (0.0377). The code and models are publicly available at https://github.com/mountainai/cavt.
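The idea of a single class vector summarizing a video of any length can be illustrated with a minimal sketch. This is not the authors' CavT implementation: the function name `class_attention`, the single head, the identity query/key/value projections, and the toy 4-dimensional tokens are all assumptions made for illustration. The sketch only shows that one class vector attending over a token sequence yields a fixed-size output regardless of how many frame tokens are supplied.

```python
import math

def class_attention(cls_vec, frame_tokens):
    """Single-head class attention: one class vector queries a token sequence.

    Assumptions (not taken from the paper): identity query/key/value
    projections, one head, and the class vector included among the keys.
    """
    d = len(cls_vec)
    tokens = [cls_vec] + frame_tokens  # keys/values: class vector plus frame tokens
    # Scaled dot-product score between the class query and every token.
    scores = [sum(q * k for q, k in zip(cls_vec, t)) / math.sqrt(d) for t in tokens]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the tokens is the updated class vector (length d).
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]

cls = [0.1, -0.2, 0.3, 0.0]
short_video = [[0.5, 0.1, -0.3, 0.2]] * 8    # hypothetical 8 frame tokens
long_video = [[0.5, 0.1, -0.3, 0.2]] * 200   # hypothetical 200 frame tokens
print(len(class_attention(cls, short_video)), len(class_attention(cls, long_video)))
# prints: 4 4
```

Because the output size depends only on the embedding dimension, the same regression head could in principle be applied to short and long videos alike, which is the property the abstract highlights.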
