Paper Title

Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction

Paper Authors

Grace Yang, Ming Cao, Lavender Y. Jiang, Xujin C. Liu, Alexander T. M. Cheung, Hannah Weiss, David Kurland, Kyunghyun Cho, Eric K. Oermann

Paper Abstract

Traditional evaluation metrics for classification in natural language processing such as accuracy and area under the curve fail to differentiate between models with different predictive behaviors despite their similar performance metrics. We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level to provide insights into disparities in their decision-making logic. We assess the sensitivity score on a set of representative words in the test set using two classifiers trained for hospital readmission classification with similar performance statistics. Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores. The results indicate that the language model's sensitivity score aligns better with the professionals than the xgboost classifier on tf-idf embeddings, which suggests that xgboost uses some spurious features. Overall, this metric offers a novel perspective on assessing models' robustness by quantifying their discrepancy with professional opinions. Our code is available on GitHub (https://github.com/nyuolab/Model_Sensitivity).
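
The abstract does not spell out how the sensitivity score is computed, so the sketch below is only an illustration of the general idea under stated assumptions: an occlusion-style word sensitivity (average shift in predicted readmission probability when a word is removed from the notes that contain it) and a Spearman rank correlation against clinician ratings. The names `lm_clf`, `xgb_clf`, `notes`, `vocab`, and `clinician_ratings` are hypothetical; the paper's actual definitions may differ (see the linked repository).

```python
import numpy as np
from scipy.stats import spearmanr


def sensitivity_score(model, documents, word):
    """Average absolute shift in predicted probability when `word` is removed.

    Assumes an occlusion-style definition; the paper's exact formula may differ.
    `model` is any classifier exposing predict_proba over raw text.
    """
    deltas = []
    for doc in documents:
        tokens = doc.split()
        if word not in tokens:
            continue
        original = model.predict_proba([doc])[0, 1]
        ablated_doc = " ".join(t for t in tokens if t != word)
        ablated = model.predict_proba([ablated_doc])[0, 1]
        deltas.append(abs(original - ablated))
    return float(np.mean(deltas)) if deltas else 0.0


def rank_agreement(model_scores, clinician_scores):
    """Spearman rank correlation between model and clinician word sensitivities."""
    rho, _ = spearmanr(model_scores, clinician_scores)
    return rho


# Hypothetical usage: compare two classifiers' decision logic against clinicians.
# lm_scores  = [sensitivity_score(lm_clf,  notes, w) for w in vocab]
# xgb_scores = [sensitivity_score(xgb_clf, notes, w) for w in vocab]
# print(rank_agreement(lm_scores, clinician_ratings),
#       rank_agreement(xgb_scores, clinician_ratings))
```

A higher rank correlation under this kind of comparison would indicate that a model's word-level sensitivities order the vocabulary more like the professionals do, which is the sense in which the abstract reports the language model aligning better than XGBoost.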
