Paper title
Action Quality Assessment with Temporal Parsing Transformer
Authors
Abstract
Action Quality Assessment (AQA) is important for action understanding, and the task poses unique challenges because the visual differences between performances are subtle. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we use a set of learnable queries to represent the atomic temporal patterns of a specific action. Our decoding process converts the frame representations into a fixed number of temporally ordered part representations. To obtain the quality score, we adopt state-of-the-art contrastive regression on top of the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that ensures the learnable queries satisfy the temporal order in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
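The abstract does not give formulas for the two losses, so the following is only a minimal sketch of how a ranking loss and a sparsity loss on cross-attention responses could look. It assumes each query's attention over frames is a probability distribution, summarizes it by an "attention center" (the expected normalized frame position), and uses a hinge margin and an attention-weighted spread. The function names, the `margin` value, and both formulas are illustrative assumptions, not the authors' definitions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def frame_positions(num_frames):
    """Frame indices rescaled to [0, 1]; requires at least two frames."""
    return [t / (num_frames - 1) for t in range(num_frames)]


def attention_centers(attn):
    """Expected normalized time position of each query's attention.

    attn: K rows (one per learnable query), each a distribution over T frames.
    """
    positions = frame_positions(len(attn[0]))
    return [sum(p * a for p, a in zip(positions, row)) for row in attn]


def ranking_loss(attn, margin=0.05):
    """Assumed form: hinge penalty when query k+1 does not attend
    at least `margin` later in time than query k."""
    centers = attention_centers(attn)
    return sum(
        max(0.0, centers[k] - centers[k + 1] + margin)
        for k in range(len(centers) - 1)
    )


def sparsity_loss(attn):
    """Assumed form: attention-weighted distance from each query's center,
    averaged over queries. Peaked attention gives a smaller value."""
    positions = frame_positions(len(attn[0]))
    centers = attention_centers(attn)
    total = 0.0
    for row, center in zip(attn, centers):
        total += sum(a * abs(p - center) for p, a in zip(positions, row))
    return total / len(attn)


# Two queries over four frames: the first attends early, the second late.
ordered = [softmax([5.0, 0.0, 0.0, 0.0]), softmax([0.0, 0.0, 0.0, 5.0])]
reversed_order = [ordered[1], ordered[0]]
uniform = [[0.25, 0.25, 0.25, 0.25]]
```

Under these assumptions, `ranking_loss(ordered)` is zero while `ranking_loss(reversed_order)` is positive, and `sparsity_loss` is smaller for the peaked rows in `ordered` than for `uniform`. This matches the stated intent that queries follow temporal order and attend to compact segments.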