Paper Title

Isoform Function Prediction Using a Deep Neural Network

Authors

Sara Ghazanfari, Ali Rasteh, Seyed Abolfazl Motahari, Mahdieh Soleymani Baghshah

Abstract

Isoforms are mRNAs produced from the same gene locus through a process called alternative splicing. Studies have shown that more than 95% of human multi-exon genes undergo alternative splicing. Although the resulting changes in mRNA sequence are small, they may have a systematic effect on cell function and regulation. It is widely reported that isoforms of a gene can have distinct or even contrasting functions. Most studies have shown that alternative splicing plays a significant role in human health and disease. Despite the wide range of gene function studies, there is little information about isoform function. Recently, some computational methods based on Multiple Instance Learning have been proposed to predict isoform function using gene function and gene expression profiles. However, their performance is unsatisfactory due to the lack of labeled training data. In addition, probabilistic models such as Conditional Random Fields (CRFs) have been used to model the relations between isoforms. This project uses all available data and valuable information, such as isoform sequences, expression profiles, and gene ontology graphs, and proposes a comprehensive model based on Deep Neural Networks. The UniProt Gene Ontology (GO) database is used as the standard reference for gene functions. The NCBI RefSeq database is used for extracting gene and isoform sequences, and the NCBI SRA database is used for expression profile data. Metrics such as the Area Under the Receiver Operating Characteristic Curve (ROC AUC) and the Area Under the Precision-Recall Curve (PR AUC) are used to measure prediction accuracy.
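
As a rough illustration of the two evaluation metrics named in the abstract, the following minimal sketch computes ROC AUC and a step-wise estimate of PR AUC (average precision) for a single GO term. It is not the authors' evaluation code: the function names, the toy labels, and the toy scores are all hypothetical, and the average-precision routine assumes there are no tied scores.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive is scored higher; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision values at the rank of each
    positive. A common estimate of PR AUC; assumes no tied scores."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank items from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


if __name__ == "__main__":
    # Hypothetical data: 1 means the isoform is annotated with the GO term.
    toy_labels = [1, 0, 1, 0, 1]
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3]
    print("ROC AUC:", roc_auc(toy_labels, toy_scores))
    print("PR AUC (average precision):", average_precision(toy_labels, toy_scores))
```

PR AUC is usually reported alongside ROC AUC in settings like this because positive annotations for any single GO term tend to be rare, and precision-recall curves are more sensitive to performance on the minority class.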
