Paper title
Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection
Paper authors
Paper abstract
This paper describes the system submitted to Dravidian-Codemix-HASOC2020: Hate Speech and Offensive Content Identification in Dravidian Languages (Tamil-English and Malayalam-English). The task aims to identify offensive language in a code-mixed dataset of comments and posts in Dravidian languages collected from social media. We participated in both Sub-task A, which aims to identify offensive content in mixed-script text (a mixture of native and Roman script), and Sub-task B, which aims to identify offensive content in Roman script, for Dravidian languages. To address these tasks, we propose pre-training ULMFiT on synthetically generated code-mixed data, produced by modelling code-mixed data generation as a Markov process using Markov chains. Our model achieved a weighted F1-score of 0.88 for code-mixed Tamil-English in Sub-task B, ranking 2nd on the leaderboard. Additionally, our model achieved a weighted F1-score of 0.91 (4th rank) for mixed-script Malayalam-English in Sub-task A and a weighted F1-score of 0.74 (5th rank) for code-mixed Malayalam-English in Sub-task B.
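The abstract does not say how the Markov chain is configured, so the following is only a minimal sketch of the general idea, assuming a first-order, word-level chain over whitespace-separated tokens. The function names, the `<s>`/`</s>` boundary markers, and the toy romanized sentences are all hypothetical and are not taken from the shared-task data or the authors' implementation.

```python
import random
from collections import defaultdict

# Assumed sentence-boundary markers (hypothetical, for illustration only).
START, END = "<s>", "</s>"


def build_chain(sentences):
    """Map each token to the list of tokens observed immediately after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            # Repeated entries encode transition frequency.
            chain[current].append(following)
    return chain


def generate(chain, rng, max_len=20):
    """Random walk from START until END is drawn or max_len tokens are emitted."""
    token, output = START, []
    while len(output) < max_len:
        token = rng.choice(chain[token])
        if token == END:
            break
        output.append(token)
    return " ".join(output)


# Hypothetical toy corpus standing in for real code-mixed comments.
corpus = [
    "intha padam super da",
    "padam super hit",
    "intha song semma",
]

chain = build_chain(corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(chain, rng) for _ in range(5)]
print(samples)
```

Each generated sentence only uses word-to-word transitions seen in the seed corpus, so the synthetic text keeps local code-mixing patterns while recombining them into new sequences. Under this reading, such samples could serve as unlabeled text for language-model pre-training before fine-tuning a classifier.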