Paper Title

Outcome-Guided Disease Subtyping for High-Dimensional Omics Data

Authors

Peng Liu, Yusi Fang, Zhao Ren, Lu Tang, George C. Tseng

Abstract

High-throughput microarray and sequencing technologies have been used to identify disease subtypes that could not otherwise be observed using clinical variables alone. The classical unsupervised clustering strategy is primarily concerned with identifying subpopulations that have similar patterns in gene features. However, because features corresponding to irrelevant confounders (e.g., gender or age) may dominate the clustering process, the resulting clusters may or may not capture clinically meaningful disease subtypes. This gives rise to a fundamental problem: can we find a subtyping procedure guided by a pre-specified disease outcome? Existing methods, such as supervised clustering, apply a two-stage approach and depend on an arbitrary number of selected features associated with the outcome. In this paper, we propose a unified latent generative model to perform outcome-guided disease subtyping constructed from omics data, which improves the resulting subtypes with respect to the disease of interest. Feature selection is embedded in a regularized regression. A modified EM algorithm is applied for numerical computation and parameter estimation. The proposed method performs feature selection, latent subtype characterization, and outcome prediction simultaneously. To account for possible outliers or violations of the Gaussian mixture assumption, we incorporate robust estimation using an adaptive Huber or median-truncated loss function. Extensive simulations and an application to complex lung diseases with transcriptomic and clinical data demonstrate the ability of the proposed method to identify clinically relevant disease subtypes and signature genes suitable for exploring precision medicine.
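To illustrate why a Huber-type loss helps with outliers, here is a minimal sketch of the standard Huber loss next to the squared loss. It is not the authors' implementation: the function names and the fixed threshold `delta` are assumptions made for illustration, and the abstract does not say how the adaptive version chooses its threshold.

```python
def squared_loss(residual: float) -> float:
    """Ordinary squared loss, 0.5 * r^2; grows quadratically with the residual."""
    return 0.5 * residual * residual


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss with a fixed threshold `delta` (hypothetical default).

    Quadratic for |r| <= delta, linear beyond it, so a single large
    residual contributes far less than it would under squared loss.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * abs_r * abs_r
    return delta * (abs_r - 0.5 * delta)


if __name__ == "__main__":
    # Compare the two losses on a small residual and on an outlier-sized one.
    for r in (0.5, 3.0, 10.0):
        print(r, squared_loss(r), huber_loss(r, delta=1.0))
```

For small residuals the two losses coincide; for large ones the Huber loss grows only linearly, which limits the influence of outlying samples on the parameter estimates.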
