Paper Title

Infusing Linguistic Knowledge of SMILES into Chemical Language Models

Paper Authors

Ingoo Lee, Hojung Nam

Paper Abstract

The simplified molecular-input line-entry system (SMILES) is the most popular representation of chemical compounds, and many SMILES-based molecular property prediction models have been developed. In particular, transformer-based models show promising performance because they exploit massive chemical datasets for self-supervised learning. However, no transformer-based model has addressed the inherent limitations of SMILES, which arise from its generation process. In this study, we grammatically parsed SMILES to obtain the connectivity between substructures and their types, which we call the grammatical knowledge of SMILES. First, we pretrained transformers on substructural tokens parsed from SMILES. Then, we applied a training strategy called the 'same compound model' to better capture SMILES grammar. In addition, we injected the connectivity and type knowledge into the transformer through knowledge adapters. As a result, our representation model outperformed previous compound representations on molecular property prediction. Finally, we analyzed the attention of the transformer model and the adapters, demonstrating that the proposed model understands the grammar of SMILES.
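The abstract describes two mechanisms: grammatical parsing of SMILES into typed, connected substructure tokens, and knowledge adapters that inject this information into a transformer. The sketch below illustrates the first idea with a common regex-based SMILES tokenizer plus the branch-stack and ring-closure bookkeeping that SMILES grammar implies; the paper's actual substructure parser is not reproduced here, so the token pattern, type labels, and `connectivity` helper are illustrative assumptions.

```python
import re

# Regex-based SMILES tokenization (a widely used pattern, not necessarily
# the paper's parser): bracket atoms, two-letter and one-letter atoms,
# aromatic atoms, bonds, branches, and ring-closure labels.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atoms, e.g. [NH3+]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter atoms
    r"|[bcnops]"       # aromatic atoms
    r"|[=#$/\\\-+]"    # bond symbols
    r"|[()]"           # branch open / close
    r"|%\d{2}|\d"      # ring-closure labels
)

def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)

def token_type(tok: str) -> str:
    """Coarse grammatical type of a token (illustrative labels only)."""
    if tok in {"(", ")"}:
        return "BRANCH"
    if tok.isdigit() or tok.startswith("%"):
        return "RING_CLOSURE"
    if tok in set("=#$/\\-+"):
        return "BOND"
    return "ATOM"

def connectivity(tokens: list[str]) -> list[tuple[int, int]]:
    """Atom-atom bonds implied by SMILES grammar: a stack resolves
    branches and a table pairs matching ring-closure labels.
    Bond order tokens are ignored in this sketch."""
    bonds, stack, rings = [], [], {}
    prev, atom_idx = None, -1
    for tok in tokens:
        kind = token_type(tok)
        if kind == "ATOM":
            atom_idx += 1
            if prev is not None:
                bonds.append((prev, atom_idx))
            prev = atom_idx
        elif tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif kind == "RING_CLOSURE":
            if tok in rings:
                bonds.append((rings.pop(tok), prev))
            else:
                rings[tok] = prev
    return bonds

if __name__ == "__main__":
    toks = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    print([(t, token_type(t)) for t in toks])
    print(connectivity(toks))  # includes bonds across branches and the ring
```

For the second mechanism, a bottleneck adapter is a standard way to inject auxiliary knowledge into a pretrained transformer layer; the hidden and bottleneck dimensions below are assumptions, and the paper's adapter placement and training details are not reproduced here.

```python
import torch
import torch.nn as nn

class KnowledgeAdapter(nn.Module):
    """Minimal bottleneck-adapter sketch: a small residual module that can
    be inserted into a transformer layer to learn injected knowledge
    (connectivity, token type) without retraining the whole model."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the pretrained representation;
        # the adapter only adds a learned correction.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```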
