Paper Title
Automated ICD Coding using Extreme Multi-label Long Text Transformer-based Models
Paper Authors
Paper Abstract
Background: Encouraged by the success of pretrained Transformer models in many natural language processing tasks, their use for International Classification of Diseases (ICD) coding tasks is now actively being explored. In this study, we investigate three types of Transformer-based models, aiming to address the extreme label set and long text classification challenges posed by automated ICD coding tasks. Methods: The Transformer-based model PLM-ICD achieved the current state-of-the-art (SOTA) performance on the ICD coding benchmark dataset MIMIC-III. It was chosen as our baseline model to be further optimised. XR-Transformer, the new SOTA model in the general extreme multi-label text classification domain, and XR-LAT, a novel adaptation of the XR-Transformer model, were also trained on the MIMIC-III dataset. XR-LAT is a recursively trained model chain on a predefined hierarchical code tree with label-wise attention, knowledge transfer and dynamic negative sampling mechanisms. Results: Our optimised PLM-ICD model, which was trained with longer total and chunk sequence lengths, significantly outperformed the current SOTA PLM-ICD model and achieved the highest micro-F1 score of 60.8%. The XR-Transformer model, although SOTA in the general domain, did not perform well across all metrics. The best XR-LAT-based model obtained results that were competitive with the current SOTA PLM-ICD model, including improving the macro-AUC by 2.1%. Conclusion: Our optimised PLM-ICD model is the new SOTA model for automated ICD coding on the MIMIC-III dataset, while our novel XR-LAT model performs competitively with the previous SOTA PLM-ICD model.
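The abstract relies on two long-text mechanisms: splitting a long clinical note into fixed-length chunks for a Transformer encoder, and label-wise attention that pools token representations into one vector per ICD code before scoring. The snippet below is a minimal illustrative sketch only, assuming PyTorch and a Hugging Face-style encoder; the class name LabelWiseAttention, the helper encode_long_note, and all parameter names are hypothetical and are not taken from the PLM-ICD or XR-LAT implementations.

```python
# Illustrative sketch (assumptions: PyTorch, Hugging Face-style encoder output;
# names are hypothetical, not the papers' released code). Attention masks and
# padding handling are omitted for brevity.
import torch
import torch.nn as nn


class LabelWiseAttention(nn.Module):
    """Pools per-token representations into one vector per ICD code, then scores each code."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.queries = nn.Parameter(torch.empty(num_labels, hidden_size))
        self.weights = nn.Parameter(torch.empty(num_labels, hidden_size))
        self.bias = nn.Parameter(torch.zeros(num_labels))
        nn.init.xavier_uniform_(self.queries)
        nn.init.xavier_uniform_(self.weights)

    def forward(self, hidden_states):                        # (batch, tokens, hidden)
        scores = hidden_states @ self.queries.T              # (batch, tokens, labels)
        attn = torch.softmax(scores, dim=1)                  # attend over tokens, per label
        label_vecs = attn.transpose(1, 2) @ hidden_states    # (batch, labels, hidden)
        return (label_vecs * self.weights).sum(-1) + self.bias  # (batch, labels) logits


def encode_long_note(encoder, input_ids, chunk_len=512):
    """Encode a long note chunk by chunk, then concatenate the token states
    so label-wise attention can see the whole document at once."""
    chunks = input_ids.split(chunk_len, dim=1)               # tuple of (batch, <=chunk_len)
    states = [encoder(c).last_hidden_state for c in chunks]
    return torch.cat(states, dim=1)                          # (batch, total_tokens, hidden)
```

In this reading, "longer total and chunk sequence lengths" corresponds to raising chunk_len and the number of chunks retained per note, so that more of each discharge summary reaches the label-wise attention layer.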