Paper Title
Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast
Paper Authors
Paper Abstract
Deep learning has become prevalent in computational chemistry and is widely applied to molecular property prediction. Recently, self-supervised learning (SSL), especially contrastive learning (CL), has gathered growing attention for its potential to learn molecular representations that generalize to the gigantic chemical space. Unlike supervised learning, SSL can directly leverage large unlabeled datasets, which greatly reduces the effort of acquiring molecular property labels through costly and time-consuming simulations or experiments. However, most molecular SSL methods borrow insights from the machine learning community but neglect the unique cheminformatics representations (e.g., molecular fingerprints) and multi-level graphical structures (e.g., functional groups) of molecules. In this work, we propose iMolCLR, an improvement of Molecular Contrastive Learning of Representations with graph neural networks (GNNs) in two aspects: (1) mitigating faulty negative contrastive instances by considering cheminformatics similarities between molecule pairs; (2) fragment-level contrasting between intra- and inter-molecule substructures decomposed from molecules. Experiments show that the proposed strategies significantly improve the performance of GNN models on various challenging molecular property predictions. Compared with the previous CL framework, iMolCLR achieves an average 1.3% improvement in ROC-AUC on 7 classification benchmarks and an average 4.8% reduction in error on 5 regression benchmarks. On most benchmarks, the generic GNN pre-trained by iMolCLR rivals or even surpasses supervised learning models with sophisticated architecture designs and engineered features. Further investigations demonstrate that representations learned through iMolCLR intrinsically embed scaffolds and functional groups that can reason about molecule similarities.
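To make the first idea concrete, below is a minimal sketch (not the authors' released code) of how a cheminformatics similarity, here pairwise Tanimoto similarity over Morgan fingerprints, could down-weight "faulty" negatives in an NT-Xent-style contrastive loss. The function names (`tanimoto_matrix`, `weighted_nt_xent`) and the specific (1 − similarity) weighting scheme are illustrative assumptions, not iMolCLR's exact formulation.

```python
# Sketch: fingerprint-similarity weighting of negatives in NT-Xent.
# Assumption: the weighting scheme below is illustrative, not the paper's loss.
import torch
import torch.nn.functional as F
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_matrix(smiles_list):
    """Pairwise Tanimoto similarity from Morgan (ECFP-like) fingerprints."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    n = len(fps)
    sim = torch.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            s = DataStructs.TanimotoSimilarity(fps[i], fps[j])
            sim[i, j] = sim[j, i] = s
    return sim

def weighted_nt_xent(z1, z2, sim, temperature=0.1):
    """NT-Xent where each negative pair is down-weighted by (1 - Tanimoto),
    so chemically near-identical molecules are not pushed far apart."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d)
    n, n2 = z1.size(0), 2 * z1.size(0)
    logits = z @ z.t() / temperature
    labels = torch.cat([torch.arange(n, n2), torch.arange(n)])  # positive index
    w = (1.0 - sim).repeat(2, 2)          # down-weight similar negatives
    w[torch.arange(n2), labels] = 1.0     # positives keep full weight
    exp = torch.exp(logits) * w
    exp = exp.masked_fill(torch.eye(n2, dtype=torch.bool), 0.0)  # drop self-pairs
    pos = logits[torch.arange(n2), labels]
    return -(pos - torch.log(exp.sum(dim=1))).mean()

# Usage: z1, z2 would be GNN embeddings of two augmented views of a batch.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
sim = tanimoto_matrix(smiles)
z1, z2 = torch.randn(4, 16), torch.randn(4, 16)
print(weighted_nt_xent(z1, z2, sim))
```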
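For the second idea, the abstract does not name the decomposition scheme here; RDKit's BRICS decomposition is one standard way to break a molecule into chemically meaningful substructures whose embeddings could then be contrasted within and across molecules. The aspirin input below is only an example.

```python
# Sketch: decomposing a molecule into fragments with RDKit's BRICS rules.
# Assumption: BRICS is used as a stand-in for the paper's decomposition step.
from rdkit import Chem
from rdkit.Chem import BRICS

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an example input
mol = Chem.MolFromSmiles(smiles)
fragments = sorted(BRICS.BRICSDecompose(mol))
# Each fragment is a SMILES string with dummy atoms ([k*]) marking
# the BRICS bonds that were broken.
print(fragments)
```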