Paper Title
Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC
Paper Authors
Abstract
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.
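The two ideas named in the abstract, a product of experts over discrete sequences and a gradient-informed MCMC proposal, can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the paper's models or sampler: the random score tables `U` (standing in for an unsupervised expert) and `V` (standing in for a supervised expert), the `tanh` form of the supervised score, the weight `lam`, and all function names are hypothetical. The proposal follows the general pattern of gradient-informed discrete samplers: a first-order estimate of how each single substitution changes the log-probability, followed by a Metropolis-Hastings correction.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
K = len(ALPHABET)

def random_table(length, rng):
    # Hypothetical stand-in for a trained model: one random score per (site, residue).
    return [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(length)]

def log_p(x, U, V, lam):
    # Unnormalized product-of-experts log-probability: expert scores add in log space.
    unsup = sum(U[i][a] for i, a in enumerate(x))
    s = sum(V[i][a] for i, a in enumerate(x)) / len(x)
    return unsup + lam * math.tanh(s)

def grad_log_p(x, U, V, lam):
    # Analytic gradient of log_p with respect to the one-hot encoding of x.
    s = sum(V[i][a] for i, a in enumerate(x)) / len(x)
    slope = lam * (1.0 - math.tanh(s) ** 2) / len(x)
    return [[U[i][a] + slope * V[i][a] for a in range(K)] for i in range(len(x))]

def proposal(x, U, V, lam):
    # First-order estimate of the log-probability change for every single substitution,
    # turned into a distribution over (site, residue) moves with a tempered softmax.
    g = grad_log_p(x, U, V, lam)
    moves = [(i, a) for i in range(len(x)) for a in range(K) if a != x[i]]
    logits = [(g[i][a] - g[i][x[i]]) / 2.0 for i, a in moves]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    z = sum(weights)
    return {m: w / z for m, w in zip(moves, weights)}

def mh_step(x, U, V, lam, rng):
    # One Metropolis-Hastings step using the gradient-informed proposal.
    q_fwd = proposal(x, U, V, lam)
    moves = list(q_fwd)
    i, a = rng.choices(moves, weights=[q_fwd[m] for m in moves])[0]
    y = list(x)
    y[i] = a
    q_rev = proposal(y, U, V, lam)[(i, x[i])]  # probability of mutating back
    log_accept = (log_p(y, U, V, lam) - log_p(x, U, V, lam)
                  + math.log(q_rev) - math.log(q_fwd[(i, a)]))
    return y if rng.random() < math.exp(min(0.0, log_accept)) else x

# Toy run: start from a random "wild type" and track the best sequence seen.
rng = random.Random(0)
length, lam = 12, 5.0
U, V = random_table(length, rng), random_table(length, rng)
wild_type = [rng.randrange(K) for _ in range(length)]
x = list(wild_type)
best = x
for _ in range(200):
    x = mh_step(x, U, V, lam, rng)
    if log_p(x, U, V, lam) > log_p(best, U, V, lam):
        best = x
print("".join(ALPHABET[a] for a in best), round(log_p(best, U, V, lam), 3))
```

In this sketch the experts are combined only by adding their scores, so neither table needs to be refit when the other changes, which mirrors the abstract's point about avoiding fine-tuning. With real models the gradient would come from automatic differentiation rather than a closed form.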