A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis

Paper title

A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis

Authors

Konstantina Dritsa, Kaiti Thoma, John Pavlopoulos, Panos Louridas

Abstract

Large, diachronic datasets of political discourse are hard to come by, especially for resource-lean languages such as Greek. In this paper, we introduce a curated dataset of the Greek Parliament Proceedings that extends chronologically from 1989 to 2020. It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files. We explain how it was constructed and the challenges we had to overcome. The dataset can be used for both computational linguistics and political analysis, ideally combining the two. We present such an application, showing how the dataset can be used to study changes in word usage (i) through time and (ii) between significant historical events and political parties, (iii) by evaluating and employing algorithms for detecting semantic shifts.

Paper title