Paper Title

Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

Paper Authors

Eder, Maciej; Górski, Rafał L.

Abstract

In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was notoriously worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15%.
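
The three feature types compared in the abstract (word forms, lemmatized forms, and POS-tag n-grams) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the sample sentence, the tag labels, and the helper names `relative_frequencies` and `ngrams` are assumptions made for illustration, and the input is assumed to be already tokenized, lemmatized, and tagged by some external tool.

```python
from collections import Counter


def relative_frequencies(items, top_n):
    """Return the top_n most frequent items mapped to their relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.most_common(top_n)}


def ngrams(seq, n):
    """Return all contiguous n-grams of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Hypothetical pre-tagged sample: (word form, lemma, POS tag).
# The tag labels are illustrative only.
tagged = [
    ("kot", "kot", "subst"),
    ("widzi", "widzieć", "fin"),
    ("kota", "kot", "subst"),
    ("i", "i", "conj"),
    ("kot", "kot", "subst"),
    ("widzi", "widzieć", "fin"),
    ("psa", "pies", "subst"),
]

words = [word for word, _, _ in tagged]
lemmas = [lemma for _, lemma, _ in tagged]
tags = [tag for _, _, tag in tagged]

# Three alternative style-marker sets built from the same text.
mfw = relative_frequencies(words, 3)                    # most frequent word forms
lemma_freq = relative_frequencies(lemmas, 3)            # most frequent lemmas
pos_bigrams = relative_frequencies(ngrams(tags, 2), 3)  # most frequent POS-tag bigrams
```

In this toy sample the forms "kot" and "kota" are counted separately as word forms but merge into the single lemma "kot", which is the sparsity effect the abstract attributes to inflection. In an attribution experiment, each text would be represented by such a frequency vector and passed to a supervised classifier.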
