Paper Title

Generative power of a protein language model trained on multiple sequence alignments

Authors

Damiano Sgarbossa, Umberto Lupo, Anne-Florence Bitbol

Abstract

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences on homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also reproduces the higher-order statistics of natural data, and the distribution of sequences in sequence space, more accurately than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
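
To make the abstract's "iterative method that directly employs the masked language modeling objective" concrete, here is a minimal Python sketch of iterative masked sampling over an MSA. It is illustrative only: `model_logits` is a hypothetical stand-in for a masked-language-modeling head (not a real MSA Transformer API), and the masking fraction, temperature, and update schedule are placeholder assumptions, not the authors' exact settings.

```python
import numpy as np

# `model_logits(msa, row, positions)` is a HYPOTHETICAL stand-in for an
# MLM head: it is assumed to treat `positions` in sequence `row` as masked
# and return an array of shape (len(positions), 21) of token logits
# (20 amino acids + gap). It is not the authors' code or a real library call.

ALPHABET = list("ACDEFGHIKLMNPQRSTVWY-")

def iterative_masked_sampling(msa, model_logits, n_iters=200,
                              mask_frac=0.1, temperature=1.0, rng=None):
    """One plausible reading of the procedure: repeatedly mask a random
    fraction of positions in one sequence of the MSA, then resample those
    positions from the model's conditional (MLM) distribution."""
    rng = rng or np.random.default_rng()
    msa = [list(seq) for seq in msa]        # mutable working copy
    L = len(msa[0])
    for _ in range(n_iters):
        row = int(rng.integers(len(msa)))   # sequence to update this round
        k = max(1, int(mask_frac * L))      # number of positions to mask
        positions = rng.choice(L, size=k, replace=False)
        logits = model_logits(msa, row, positions)
        for p, lg in zip(positions, logits):
            probs = np.exp(np.asarray(lg, dtype=float) / temperature)
            probs /= probs.sum()            # softmax over the 21 tokens
            msa[row][p] = ALPHABET[rng.choice(len(ALPHABET), p=probs)]
    return ["".join(seq) for seq in msa]

# Toy demo with a uniform stub in place of a real model:
if __name__ == "__main__":
    stub = lambda msa, row, pos: np.zeros((len(pos), len(ALPHABET)))
    print(iterative_masked_sampling(["ACDEF-KLMN"] * 4, stub, n_iters=5))
```

In a real application the stub would be replaced by a trained MSA-based language model, and the scheme converges toward sequences the model assigns high conditional probability, which is the property the paper's homology, coevolution, and structure-based scores then evaluate.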
