Paper Title
TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter
Paper Authors
Paper Abstract
Pre-trained language models (PLMs) are fundamental for natural language processing applications. Most existing PLMs are not tailored to the noisy user-generated text on social media, and the pre-training does not factor in the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter, trained on in-domain data from the popular social network. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision, but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation to model short, noisy, user-generated text. We evaluate our model on various multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvement over established pre-trained language models. We open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
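Since the abstract states that TwHIN-BERT is open-sourced, the following is a minimal sketch of how the released model could be used to embed tweets. The Hugging Face Hub identifier and the mean-pooling strategy are illustrative assumptions, not details given in the abstract; consult the released repository for the official usage.

```python
# A minimal sketch (not the authors' code) of embedding tweets with the
# open-sourced TwHIN-BERT checkpoint. The Hub identifier below is an
# assumption; check the released repository for the official name.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Twitter/twhin-bert-base"  # assumed Hugging Face Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

tweets = [
    "Short, noisy, user-generated text #NLProc",
    "Texte court et bruité généré par les utilisateurs",
]

with torch.no_grad():
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding tokens
    embeddings = (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled tweet vectors

# Cosine similarity between the two multilingual tweet representations.
print(F.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())
```

Per the abstract, the social-engagement objective supplements text-based self-supervision during pre-training; at inference time the model is applied like any other BERT-style encoder, which is the assumption behind the mean-pooling above.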