Paper Title

Accurate identification of bacteriophages from metagenomic data using Transformer

Authors

Jiayu Shang, Xubo Tang, Ruocheng Guo, Yanni Sun

Abstract

Motivation: Bacteriophages (phages) are viruses that infect bacteria. As key players in microbial communities, they can regulate the composition and function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic material from various microbiomes, has become a popular means of discovering new phages. However, accurate and comprehensive detection of phages from metagenomic data remains difficult. High diversity and abundance, together with limited reference genomes, pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or low precision on metagenomic data.

Results: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins' positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predict the label for test contigs. We rigorously tested our tool, named PhaMer, on multiple datasets of increasing difficulty, including high-quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data, and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.
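The idea of a protein-cluster vocabulary can be illustrated with a minimal sketch. It assumes each protein on a contig has already been assigned to a named protein cluster (how that assignment is made is not described in the abstract). The cluster names, the reserved `PAD`/`UNK` IDs, and the helpers `build_vocab` and `encode_contig` are all hypothetical and are not taken from PhaMer itself. The sketch turns a contig into a fixed-length sequence of token IDs plus position indices, which is the kind of input a Transformer encoder would consume.

```python
from typing import Dict, List, Tuple

# Hypothetical reserved token IDs (an assumption, not PhaMer's actual scheme).
PAD, UNK = 0, 1


def build_vocab(cluster_names: List[str]) -> Dict[str, int]:
    """Assign each distinct protein-cluster name an integer token ID."""
    unique = sorted(set(cluster_names))
    # IDs start at 2 because 0 and 1 are reserved for PAD and UNK.
    return {name: idx + 2 for idx, name in enumerate(unique)}


def encode_contig(
    protein_clusters: List[str], vocab: Dict[str, int], max_len: int
) -> Tuple[List[int], List[int]]:
    """Encode a contig's proteins, in genomic order, as tokens and positions.

    Proteins whose cluster is not in the vocabulary map to UNK. Sequences are
    truncated or padded to max_len; padded slots get position 0.
    """
    kept = protein_clusters[:max_len]
    tokens = [vocab.get(name, UNK) for name in kept]
    positions = list(range(1, len(tokens) + 1))
    n_pad = max_len - len(tokens)
    return tokens + [PAD] * n_pad, positions + [0] * n_pad


# Toy vocabulary and contig, purely illustrative.
vocab = build_vocab(["PC_terminase", "PC_capsid", "PC_tail"])
tokens, positions = encode_contig(
    ["PC_capsid", "PC_unknown", "PC_terminase"], vocab, max_len=5
)
print(tokens, positions)
```

Keeping the positions alongside the tokens reflects the abstract's point that both protein composition and protein order are fed to the model; a real pipeline would embed both before applying self-attention.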
