Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification

Paper title

Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification

Authors

Xiangxie Zhang, Ben Beinke, Berlian Al Kindhi, Marco Wiering

Abstract

The classification of DNA sequences is a key research area in bioinformatics, as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification. Furthermore, we introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. We also use an existing feature extraction method based on 3-grams to represent amino acids and combine both feature extraction methods with a multitude of machine learning algorithms. Four different datasets, each concerning a viral disease (Covid-19, AIDS, Influenza, and Hepatitis C), are used for evaluating the different approaches. The results of the experiments show that all methods obtain high accuracies on the different DNA datasets. Furthermore, the domain-specific 3-gram feature extraction method generally leads to the best results in the experiments, while the newly proposed technique outperforms all other methods on the smallest dataset, the Covid-19 dataset.
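The abstract names the ingredients of the proposed feature extraction (Levenshtein distance and randomly generated DNA sub-sequences) but does not spell out how they are combined. The following is a minimal sketch of one plausible reading, not the authors' implementation: each DNA sequence is mapped to a vector of edit distances to a fixed set of random reference strings. The function names, the reference count and length, the seed, and the choice to compare against the whole sequence are all assumptions made for illustration.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def random_subsequences(count: int, length: int, seed: int = 0) -> list[str]:
    """Hypothetical helper: draw `count` random strings over the alphabet ACGT."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(count)]


def levenshtein_features(sequence: str, references: list[str]) -> list[int]:
    """Assumed feature map: one edit distance per random reference string."""
    return [levenshtein(sequence, ref) for ref in references]


# Usage: the same references must be reused for every sequence in a dataset.
references = random_subsequences(count=5, length=8)
features = levenshtein_features("ACGTTGCAACGT", references)
```

The resulting fixed-length vector could then be passed to any standard classifier, which is in the spirit of the abstract's description of combining feature extraction with multiple machine learning algorithms.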
