Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research

Paper title

Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research

Authors

Birnbaum, Benjamin; Nussbaum, Nathan; Seidl-Rathkopf, Katharina; Agrawal, Monica; Estevez, Melissa; Estola, Evan; Haimson, Joshua; He, Lucy; Larson, Peter; Richardson, Paul

Abstract

Objective: Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients.

Materials and Methods: We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared the cohort generated by MACS to the cohort that would have been generated without MACS as reference standard, first by comparing distributions of an extensive set of clinical and demographic variables and then by comparing the results of two analyses addressing existing example research questions.

Results: Our algorithm had an area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction efficiency gain of 77.9%. During Bias Analysis, we found no large differences in baseline characteristics and no differences in the example analyses.

Conclusion: MACS with bias analysis can significantly improve the efficiency of cohort selection on EHR data while instilling confidence that outcomes research performed on the resulting cohort will not be biased.
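The abstract reports sensitivity and an "abstraction efficiency gain" without giving formulas, so the following is only a minimal sketch under stated assumptions. It assumes that a model score at or above a threshold sends a patient to manual abstraction, that sensitivity is the share of true mBC patients who are sent, and that efficiency gain is the share of patients who are never sent. The function names, the threshold, and the toy data are all hypothetical and are not taken from the paper.

```python
def select_for_abstraction(scores, threshold):
    """Flag patients whose model score meets the (hypothetical) threshold."""
    return [score >= threshold for score in scores]


def sensitivity(labels, selected):
    """Share of true positives that are flagged for manual abstraction."""
    flags_for_positives = [s for y, s in zip(labels, selected) if y == 1]
    return sum(flags_for_positives) / len(flags_for_positives)


def abstraction_efficiency_gain(selected):
    """Share of patients that skip manual abstraction (assumed definition)."""
    return 1 - sum(selected) / len(selected)


def cohort(labels, selected=None):
    """Indices of confirmed positives; restricted to flagged patients if given."""
    if selected is None:
        selected = [True] * len(labels)  # reference standard: abstract everyone
    return [i for i, (y, s) in enumerate(zip(labels, selected)) if y == 1 and s]


# Toy data (invented for illustration): model scores and abstracted labels.
scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

selected = select_for_abstraction(scores, threshold=0.5)
macs_cohort = cohort(labels, selected)   # cohort produced with model assistance
reference_cohort = cohort(labels)        # cohort produced without the model

print("sensitivity:", sensitivity(labels, selected))
print("efficiency gain:", abstraction_efficiency_gain(selected))
print("missed patients:", sorted(set(reference_cohort) - set(macs_cohort)))
```

In this toy run the threshold misses one true positive, which is the kind of non-random omission the bias analysis is meant to surface: the model-assisted cohort is compared against the reference cohort on baseline variables and downstream analyses rather than on sensitivity alone.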
