Paper Title

Structure-aware Protein Self-supervised Learning

Paper Authors

Chen, Can; Zhou, Jingbo; Wang, Fan; Liu, Xue; Dou, Dejing

Abstract

Protein representation learning methods have shown great potential to yield useful representations for many downstream tasks, especially protein classification. Moreover, a few recent studies have shown great promise in addressing the insufficient labels of proteins with self-supervised learning methods. However, existing protein language models are usually pretrained on protein sequences without considering the important protein structural information. To this end, we propose a novel structure-aware protein self-supervised learning method to effectively capture the structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective, respectively. Furthermore, we propose to leverage an available protein language model pretrained on protein sequences to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. Experiments on several supervised downstream tasks verify the effectiveness of our proposed method. The code of the proposed method is available at \url{https://github.com/GGchen1997/STEPS_Bioinformatics}.
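To make the two structure-based self-supervised objectives mentioned above more concrete (pairwise residue distance prediction and dihedral angle prediction on top of a GNN encoder), the following is a minimal, hypothetical sketch. It is not the authors' implementation; the toy encoder, module names, and hyperparameters are assumptions made purely for illustration.

```python
# Minimal sketch (assumption, not the paper's code) of two structure-based
# self-supervised heads on a toy GNN encoder:
#   (1) classify binned pairwise residue distances,
#   (2) regress sine/cosine of backbone dihedral angles.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyStructureGNN(nn.Module):
    def __init__(self, num_residue_types=20, dim=64, num_layers=3, num_dist_bins=16):
        super().__init__()
        self.embed = nn.Embedding(num_residue_types, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        # Head 1: classify the (binned) distance of a residue pair.
        self.dist_head = nn.Linear(2 * dim, num_dist_bins)
        # Head 2: regress sin/cos of the backbone dihedral angles (phi, psi).
        self.dihedral_head = nn.Linear(dim, 4)

    def encode(self, residue_types, adj):
        h = self.embed(residue_types)  # [N, dim]
        for layer in self.layers:
            # Simple mean aggregation over structural neighbours (contact graph).
            msg = adj @ h / adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
            h = F.relu(layer(h + msg))
        return h

    def forward(self, residue_types, adj, pair_idx, dist_bins, dihedral_sincos):
        h = self.encode(residue_types, adj)
        # Pairwise residue distance objective (cross-entropy over distance bins).
        pair_feat = torch.cat([h[pair_idx[:, 0]], h[pair_idx[:, 1]]], dim=-1)
        dist_loss = F.cross_entropy(self.dist_head(pair_feat), dist_bins)
        # Dihedral angle objective (MSE on sin/cos targets).
        dihedral_loss = F.mse_loss(self.dihedral_head(h), dihedral_sincos)
        return dist_loss + dihedral_loss


# Toy usage with random data for a 30-residue protein.
N = 30
model = ToyStructureGNN()
residue_types = torch.randint(0, 20, (N,))
adj = (torch.rand(N, N) < 0.2).float()           # toy residue contact graph
pair_idx = torch.randint(0, N, (64, 2))          # sampled residue pairs
dist_bins = torch.randint(0, 16, (64,))          # binned distance labels
dihedral_sincos = torch.rand(N, 4) * 2 - 1       # sin/cos of phi, psi
loss = model(residue_types, adj, pair_idx, dist_bins, dihedral_sincos)
loss.backward()
```

In the paper, these structural objectives are additionally coupled to a pretrained protein language model via the pseudo bi-level optimization scheme; the sketch above only illustrates the structure-side pretraining signals.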
