Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection

Paper title

Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection

Author

Arora, Gaurav

Abstract

This paper describes the system submitted to Dravidian-Codemix-HASOC2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English). The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media. We participated in both Sub-task A, which aims to identify offensive content in mixed-script text (a mixture of native and Roman script), and Sub-task B, which aims to identify offensive content in Roman script, for Dravidian languages. To address these tasks, we proposed pre-training ULMFiT on synthetically generated code-mixed data, produced by modelling code-mixed data generation as a Markov process using Markov chains. Our model achieved a weighted F1-score of 0.88 for code-mixed Tamil-English in Sub-task B and ranked 2nd on the leaderboard. Additionally, our model achieved a weighted F1-score of 0.91 (4th rank) for mixed-script Malayalam-English in Sub-task A and a weighted F1-score of 0.74 (5th rank) for code-mixed Malayalam-English in Sub-task B.
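The abstract does not say which order of Markov chain was used, how the text was tokenized, or what data seeded it. As a rough illustration only, the sketch below shows how a first-order, word-level Markov chain could generate synthetic sentences from a seed corpus. The corpus sentences, the `build_chain` and `generate` helpers, and the start/end markers are all assumptions made for this example, not the author's implementation.

```python
import random
from collections import defaultdict

# Hypothetical sentence-boundary markers (not taken from the paper).
START, END = "<s>", "</s>"


def build_chain(sentences):
    """Collect first-order transitions: token -> list of observed next tokens."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for current, nxt in zip(tokens, tokens[1:]):
            chain[current].append(nxt)
    return chain


def generate(chain, rng, max_len=20):
    """Walk the chain from START until END is drawn or max_len tokens are emitted."""
    token, output = START, []
    while len(output) < max_len:
        # Repeated entries in the list make frequent transitions more likely.
        token = rng.choice(chain[token])
        if token == END:
            break
        output.append(token)
    return " ".join(output)


# Invented placeholder sentences in romanized code-mixed style (illustrative only).
corpus = [
    "intha padam super hit aagum",
    "intha trailer vera level bro",
    "padam vera level mass",
]

chain = build_chain(corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(chain, rng) for _ in range(5)]

for sample in samples:
    print(sample)
```

Every generated sentence follows only word-to-word transitions seen in the seed corpus, so it can recombine fragments of different sentences. In a setup like the one the abstract describes, such output would presumably serve as extra text for language-model pre-training before fine-tuning the classifier.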
