Paper Title

TF-IDFC-RF: A Novel Supervised Term Weighting Scheme

Paper Authors

Flavio Carvalho, Gustavo Paiva Guedes

Paper Abstract

Sentiment Analysis is a branch of Affective Computing that is usually treated as a binary classification task. In this line of reasoning, Sentiment Analysis can be applied in several contexts to classify the attitude expressed in text samples, for example movie reviews or sarcasm. A common approach to representing text samples is to use the Vector Space Model to compute numerical feature vectors made up of term weights. The most popular term weighting scheme is TF-IDF (Term Frequency - Inverse Document Frequency). It is an Unsupervised Weighting Scheme (UWS), since it does not consider class information when weighting terms. In contrast, Supervised Weighting Schemes (SWS) use class information when computing term weights. Several SWS have been proposed recently, demonstrating better results than TF-IDF. In this scenario, this work presents a comparative study of different term weighting schemes and proposes a novel supervised term weighting scheme named TF-IDFC-RF (Term Frequency - Inverse Document Frequency in Class - Relevance Frequency). The effectiveness of TF-IDFC-RF is validated with SVM (Support Vector Machine) and NB (Naive Bayes) classifiers on four commonly used Sentiment Analysis datasets. TF-IDFC-RF shows promising results, outperforming all other weighting schemes on two datasets.
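
As a rough illustration of the unsupervised baseline mentioned in the abstract, the Python sketch below computes TF-IDF weights for a few tokenized documents. The function name `tf_idf`, the toy documents, and the specific variant (raw term counts, unsmoothed natural-log IDF) are assumptions made for this example and are not taken from the paper. The TF-IDFC-RF formula is not stated in the abstract, so it is not reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term count for TF and log(N / df) for IDF.
    Class labels are never consulted, which is what makes this an
    unsupervised weighting scheme.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [
    "great movie great cast".split(),
    "boring movie".split(),
    "great fun".split(),
]
vectors = tf_idf(docs)
print(vectors[0])
```

A supervised scheme would additionally take the label of each document as input and let the weight of a term depend on how its document frequency differs between classes.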
