Paper Title

Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification

Authors

Xiangxie Zhang, Ben Beinke, Berlian Al Kindhi, Marco Wiering

Abstract

The classification of DNA sequences is a key research area in bioinformatics, as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification. Furthermore, we introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. We also use an existing feature extraction method based on 3-grams to represent amino acids, and we combine both feature extraction methods with a multitude of machine learning algorithms. Four different datasets concerning viral diseases such as COVID-19, AIDS, influenza, and hepatitis C are used to evaluate the different approaches. The experimental results show that all methods obtain high accuracies on the different DNA datasets. Furthermore, the domain-specific 3-gram feature extraction method leads, in general, to the best results in the experiments, while the newly proposed technique outperforms all other methods on the smallest COVID-19 dataset.
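
The abstract does not spell out how the Levenshtein-based features are computed, so the following is only a rough sketch of one plausible reading: each DNA sequence is mapped to a vector of edit distances to a fixed set of randomly generated sub-sequences. The function names (`levenshtein`, `random_probes`, `distance_features`), the probe length, the number of probes, and the choice to compare each probe against the whole sequence are all assumptions made for illustration, not details taken from the paper.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    # Keep the shorter string as `b` so each row of the table stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def random_probes(n_probes: int, length: int, seed: int = 0) -> list[str]:
    """Generate random DNA sub-sequences (hypothetical helper; sizes are assumptions)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n_probes)]


def distance_features(sequence: str, probes: list[str]) -> list[int]:
    """Represent a sequence by its edit distance to every probe."""
    return [levenshtein(sequence, probe) for probe in probes]


if __name__ == "__main__":
    probes = random_probes(n_probes=4, length=6, seed=42)
    print(distance_features("ACGTACGTAC", probes))
```

The resulting fixed-length vector could then be passed to any standard classifier. Generating the probes once with a fixed seed and reusing them for every sequence keeps the features comparable across the dataset.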
