Paper Title


Parallel feature selection based on the trace ratio criterion

Paper Authors

Nguyen, Thu; Phan, Thanh Nhan; Nguyen, Van Nhuong; Nguyen, Thanh Binh; Halvorsen, Pål; Riegler, Michael

Paper Abstract


The growth of data today poses a challenge in management and inference. While feature extraction methods are capable of reducing the size of the data for inference, they do not help in minimizing the cost of data storage. Feature selection, on the other hand, removes redundant features and is therefore helpful not only for inference but also for reducing management costs. This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST), which scales up to very large datasets. Our method uses the trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness, and we analyze the criterion's desirable properties theoretically. Based on the criterion, PFST rapidly finds important features in big datasets by first making a forward selection in parallel, with early removal of seemingly redundant features. After the most important features are included in the model, we re-examine their contributions to account for possible interactions that may improve the fit. Lastly, we perform a backward elimination to remove any redundant features added by the forward steps. We evaluate our method via various experiments, using Linear Discriminant Analysis as the classifier on the selected features. The experiments show that our method can produce a small set of features in a fraction of the time required by the other methods under comparison. In addition, a classifier trained on the features selected by PFST not only achieves better accuracy than one trained on features chosen by other approaches, but can also outperform classification on all available features.
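The abstract describes scoring feature subsets by the trace ratio of the between-class to within-class scatter from Fisher's Discriminant Analysis, followed by a forward selection stage. A minimal sketch of that idea is below; the function names `trace_ratio` and `forward_select` are illustrative, and the sequential greedy loop is a simplification of the paper's parallel procedure with early redundancy removal.

```python
import numpy as np

def trace_ratio(X, y):
    """Score a feature subset by tr(S_b) / tr(S_w), where S_b and S_w are
    the between- and within-class scatter matrices of Fisher's Discriminant
    Analysis. Higher values mean better class separability."""
    mean_all = X.mean(axis=0)
    s_b = 0.0  # trace of the between-class scatter
    s_w = 0.0  # trace of the within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        s_b += len(Xc) * np.sum((mean_c - mean_all) ** 2)
        s_w += np.sum((Xc - mean_c) ** 2)
    return s_b / s_w

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the feature that most
    improves the trace ratio of the current subset. A sequential sketch
    only; PFST evaluates candidates in parallel and prunes seemingly
    redundant features early."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda j: trace_ratio(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, on a two-class dataset where one feature separates the classes and another is pure noise, `forward_select(X, y, 1)` picks the separating feature, since it yields the larger trace ratio.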
