Paper Title

Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention

Authors

Menglong Xu, Shengqiang Li, Xiao-Lei Zhang

Abstract

Recently, several studies reported that dot-product self-attention (SA) may not be indispensable to state-of-the-art Transformer models. Motivated by the fact that dense synthesizer attention (DSA), which dispenses with dot products and pairwise interactions, achieved competitive results in many language processing tasks, in this paper we first propose DSA-based speech recognition as an alternative to SA. To reduce the computational complexity and improve the performance, we further propose local DSA (LDSA), which restricts the attention scope of DSA to a local range around the current central frame for speech recognition. Finally, we combine LDSA with SA to extract local and global information simultaneously. Experimental results on the AISHELL-1 Mandarin speech recognition corpus show that the proposed LDSA-Transformer achieves a character error rate (CER) of 6.49%, which is slightly better than that of the SA-Transformer. Meanwhile, the LDSA-Transformer requires less computation than the SA-Transformer. The proposed combination method not only achieves a CER of 6.18%, which significantly outperforms the SA-Transformer, but also has roughly the same number of parameters and computational complexity as the latter. The implementation of the multi-head LDSA is available at https://github.com/mlxu995/multihead-LDSA.
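
Below is a minimal, single-head PyTorch sketch of how local dense synthesizer attention could work as described in the abstract: attention weights are synthesized from each frame alone (no query-key dot products or pairwise interactions) and are applied only to a local window of value frames around the current frame. The class name `LocalDenseSynthesizerAttention`, the layer sizes, the window size `context`, and the unfold-based window gathering are illustrative assumptions, not the authors' released implementation (see the GitHub link above for that).

```python
# Minimal single-head sketch of local dense synthesizer attention (LDSA).
# Hypothetical, simplified illustration based on the abstract, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalDenseSynthesizerAttention(nn.Module):
    def __init__(self, d_model: int, context: int = 15):
        super().__init__()
        assert context % 2 == 1, "use an odd window size so the frame is centered"
        self.context = context  # number of frames attended around the current frame
        # Attention weights are synthesized from the current frame alone:
        # a two-layer feed-forward maps each frame to `context` logits.
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, context)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        b, t, d = x.size()
        # Synthesized local attention weights: (batch, time, context)
        weights = F.softmax(self.w2(torch.relu(self.w1(x))), dim=-1)
        v = self.value(x)                          # (b, t, d)
        # Gather a sliding window of `context` value frames for every position.
        pad = self.context // 2
        v = F.pad(v, (0, 0, pad, pad))             # pad the time axis
        windows = v.unfold(1, self.context, 1)     # (b, t, d, context)
        windows = windows.transpose(2, 3)          # (b, t, context, d)
        # Weighted sum over the local window only.
        return torch.einsum("btc,btcd->btd", weights, windows)


if __name__ == "__main__":
    x = torch.randn(2, 100, 256)                   # (batch, frames, features)
    ldsa = LocalDenseSynthesizerAttention(d_model=256, context=15)
    print(ldsa(x).shape)                           # torch.Size([2, 100, 256])
```

Because the weights for each frame depend only on that frame and cover only `context` positions, the cost per frame is O(context·d) rather than the O(T·d) of full dot-product self-attention, which matches the abstract's claim that the LDSA-Transformer requires less computation than the SA-Transformer.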
