Paper Title
Massive Enhanced Extracted Email Features Tailored for Cosine Distance
Authors
Abstract
This paper explains and evaluates the process of converting the Enron email dataset (the version cited in the preprint) into thousands of features per email for a selected set of 2400 labelled emails. The final features are tailored for Cosine distance, so that the Cosine distance between two emails inversely reflects, in an explainable and normalized fashion, the number of top indicative words of each email that the two emails have in common. The labelling is based on the leaf folder name in the folder tree of the Enron email dataset (the version cited in the preprint), and the 2400 selected emails consist of 300 emails for each of the 8 labels. The evaluation is based on the accuracy of a k-nearest-neighbours (KNN) majority voting classification using Cosine distance. In addition to the KNN majority voting classification accuracy and the confusion matrix, some statistics for the process are reported. The KNN majority voting classification accuracy using Cosine distance is 76.75%, which shows at least some level of success given the 8 labels involved. The conversion yields 48557 features per selected email, of which exactly 40 per email are non-zero. The resulting dataset, named MeeefTCD (Massive Enhanced Extracted Email Features Tailored for Cosine Distance), is available at https://web.cs.dal.ca/~barahimi/data-sets/meeeftcd/ and in a GitHub repository mentioned in this paper.
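To illustrate why a fixed number of non-zero features can make Cosine distance easy to interpret, the following minimal sketch assumes (the abstract does not state this) that the 40 non-zero features of each email are binary indicators with equal weight. Under that assumption the Cosine distance between two emails is 1 - c/40, where c is the number of indicative words they share. The function names, the use of integer feature indices, and the default value of k are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter

import numpy as np

N_FEATURES = 48557  # features per email, as reported in the abstract
N_NONZERO = 40      # non-zero features per email, as reported in the abstract


def make_email_vector(word_indices):
    """Build a binary feature vector from 40 distinct feature indices.

    Assumption: every non-zero feature has the same weight (1.0).
    """
    idx = sorted(set(word_indices))
    if len(idx) != N_NONZERO:
        raise ValueError("expected exactly 40 distinct feature indices")
    vec = np.zeros(N_FEATURES)
    vec[idx] = 1.0
    return vec


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def knn_majority_vote(query, vectors, labels, k=3):
    """Label the query by majority vote among its k nearest neighbours.

    Ties are resolved by Counter ordering, which is an arbitrary choice
    made for this sketch.
    """
    dists = [cosine_distance(query, v) for v in vectors]
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Under the equal-weight assumption, two emails that share 10 of their 40 indicative words have a Cosine distance of 1 - 10/40 = 0.75, and two emails with no indicative words in common have a distance of 1.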