Paper Title

Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

Authors

Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis, Karin Verspoor

Abstract

Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. However, only 4% of PPIs are annotated with post-translational modifications (PTMs) in biological knowledge databases such as IntAct, and this annotation is mainly performed through manual curation, which is neither time- nor cost-effective. We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. Evaluated on the test set, the PPI-BioBERT-x10 model achieved a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high-quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We ran PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546,507 unique PTM-PPI triplets), from which we filtered ~5,700 (4,584 unique) high-confidence predictions. Human evaluation on a small, randomly sampled subset of the 5,700 shows that precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
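
The filtering idea in the abstract (keep a prediction only when the ensemble's average confidence is high and its variation across members is low) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `filter_high_quality`, the thresholds `min_mean_conf` and `max_std`, the use of the population standard deviation as the variation measure, and the toy probabilities are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def filter_high_quality(ensemble_probs, min_mean_conf=0.9, max_std=0.05):
    """Keep predictions with high average confidence and low variation.

    ensemble_probs[m][i][c] is the probability that (hypothetical) ensemble
    member m assigns to class c for example i. Thresholds are illustrative.
    """
    n_examples = len(ensemble_probs[0])
    n_classes = len(ensemble_probs[0][0])
    kept = []
    for i in range(n_examples):
        # Average each class probability across ensemble members.
        avg = [mean([m[i][c] for m in ensemble_probs]) for c in range(n_classes)]
        # The ensemble prediction is the class with the highest average.
        label = max(range(n_classes), key=avg.__getitem__)
        conf = avg[label]
        # Variation: spread of the members' confidence in that class.
        spread = pstdev([m[i][label] for m in ensemble_probs])
        if conf >= min_mean_conf and spread <= max_std:
            kept.append((i, label, conf, spread))
    return kept


# Toy data: 3 ensemble members, 3 examples, 2 classes (made-up numbers).
probs = [
    [[0.02, 0.98], [0.00, 1.00], [0.60, 0.40]],
    [[0.03, 0.97], [0.00, 1.00], [0.55, 0.45]],
    [[0.01, 0.99], [0.28, 0.72], [0.65, 0.35]],
]
kept = filter_high_quality(probs)
for index, label, conf, spread in kept:
    print(f"example {index}: class {label}, mean={conf:.3f}, std={spread:.3f}")
```

In this toy data, the second example has a high average confidence but is rejected because the members disagree, and the third is rejected because its average confidence is low. Only the first example passes both checks.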
