Paper Title
UM6P-CS at SemEval-2022 Task 11: Enhancing Multilingual and Code-Mixed Complex Named Entity Recognition via Pseudo Labels using Multilingual Transformer
Paper Authors
Paper Abstract
Building real-world complex Named Entity Recognition (NER) systems is a challenging task. This is due to the complexity and ambiguity of named entities that appear in various contexts, such as short input sentences, emerging entities, and complex entities. Besides, real-world queries are mostly malformed, as they can be code-mixed or multilingual, among other scenarios. In this paper, we introduce our submitted system to the Multilingual Complex Named Entity Recognition (MultiCoNER) shared task. We approach complex NER for multilingual and code-mixed queries by relying on the contextualized representations provided by the multilingual Transformer XLM-RoBERTa. In addition to the CRF-based token classification layer, we incorporate a span classification loss to recognize named entity spans. Furthermore, we use a self-training mechanism to generate weakly-annotated data from a large unlabeled dataset. Our proposed system is ranked 6th and 8th in the multilingual and code-mixed tracks of MultiCoNER, respectively.
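The self-training step described above can be sketched as a confidence-filtered pseudo-labeling loop: a tagger trained on the labeled data annotates the unlabeled corpus, and only predictions above a confidence threshold are kept as weakly-annotated training examples. The toy lexicon-based tagger, its confidence values, and the threshold below are illustrative assumptions, not the authors' implementation (which uses XLM-RoBERTa logits).

```python
def predict_with_confidence(tokens, entity_lexicon):
    """Toy tagger standing in for the fine-tuned XLM-RoBERTa model.

    Labels a token as an entity if it appears in a (hypothetical) entity
    lexicon; a real system would derive the confidence from the model's
    softmax probabilities instead of these hard-coded values.
    """
    preds = []
    for tok in tokens:
        if tok.lower() in entity_lexicon:
            preds.append(("B-ENT", 0.97))      # confident entity prediction
        elif tok[:1].isupper():
            preds.append(("B-ENT", 0.55))      # uncertain guess: capitalized but unknown
        else:
            preds.append(("O", 0.93))          # confident non-entity
    return preds


def pseudo_label(unlabeled_sentences, entity_lexicon, threshold=0.9):
    """Keep a sentence as weakly-annotated data only if every token's
    predicted tag meets the confidence threshold."""
    kept = []
    for tokens in unlabeled_sentences:
        preds = predict_with_confidence(tokens, entity_lexicon)
        if all(conf >= threshold for _, conf in preds):
            kept.append((tokens, [tag for tag, _ in preds]))
    return kept
```

The sentence-level "all tokens confident" filter is one simple design choice; token-level filtering or averaging the confidences over the sentence are common alternatives.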