Paper Title
Training self-supervised peptide sequence models on artificially chopped proteins
Paper Authors
Paper Abstract
Representation learning for proteins has primarily focused on the global understanding of protein sequences regardless of their length. However, shorter proteins (known as peptides) take on distinct structures and functions compared to their longer counterparts. Unfortunately, far fewer naturally occurring peptides are available to be sequenced, and there is therefore less peptide-specific data to train with. In this paper, we propose a new peptide data augmentation scheme in which we train peptide language models on artificially constructed peptides that are small contiguous subsets of longer, wild-type proteins; we refer to these training peptides as "chopped proteins". We evaluate the representation potential of models trained with chopped proteins versus natural peptides and find that training language models with chopped proteins results in more generalized embeddings for short protein sequences. These peptide-specific models also retain information about the original proteins they were derived from better than language models trained on full-length proteins. We compare the masked language model (MLM) training objective to three novel peptide-specific training objectives: next-peptide prediction, contrastive peptide selection, and evolution-weighted MLM. We demonstrate improved zero-shot learning performance on a deep mutational scan peptide benchmark.
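To make the "chopped protein" augmentation concrete, the sketch below samples random contiguous subsequences from a longer wild-type protein to serve as synthetic training peptides. This is a minimal illustration only: the peptide length range, number of chops per protein, and uniform sampling are assumptions made for demonstration, not the paper's exact settings.

```python
# Illustrative sketch (assumptions, not the authors' exact procedure):
# build "chopped protein" training peptides as random contiguous
# subsequences of a longer wild-type protein sequence.
import random


def chop_protein(sequence, min_len=8, max_len=50, n_chops=5, seed=None):
    """Sample contiguous subsequences ("chopped proteins") from a protein.

    Assumes len(sequence) >= min_len; length bounds and chop count are
    illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    chops = []
    for _ in range(n_chops):
        # Peptide length is capped by the parent protein's length.
        length = rng.randint(min_len, min(max_len, len(sequence)))
        start = rng.randint(0, len(sequence) - length)
        chops.append(sequence[start:start + length])
    return chops


# Example usage on a toy wild-type sequence.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
print(chop_protein(wild_type, seed=0))
```

The chopped peptides generated this way would then be tokenized and fed to a peptide language model under the MLM or peptide-specific objectives described above.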