FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics

Paper title

FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics

Authors

Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo

Abstract

Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development; otherwise, comparing the models would require resource-expensive human evaluation. While chatbot challenges have recently managed to provide a plethora of such resources for English, resources in other languages are not yet available. In this work, we provide a starting point for Finnish open-domain chatbot research. We describe our collection efforts to create the Finnish chat conversation corpus FinChat, which is made available publicly. FinChat includes unscripted conversations on seven topics from people of different ages. Using this corpus, we also construct a retrieval-based evaluation task for Finnish chatbot development. We observe that off-the-shelf chatbot models trained on conversational corpora do not perform better than chance at choosing the right answer based on automatic metrics, while humans can do the same task almost perfectly. Similarly, in a human evaluation, responses to questions from the evaluation set generated by the chatbots are predominantly marked as incoherent. Thus, FinChat provides a challenging evaluation set, meant to encourage chatbot development in Finnish.
