Paper title
DoubleEnsemble: A New Ensemble Method Based on Sample Reweighting and Feature Selection for Financial Data Analysis
Paper authors
Paper abstract
Modern machine learning models (such as deep neural networks and boosting decision tree models) have become increasingly popular in financial market prediction, due to their superior capacity to extract complex non-linear patterns. However, since financial datasets have a very low signal-to-noise ratio and are non-stationary, complex models are often prone to overfitting and suffer from instability issues. Moreover, as various machine learning and data mining tools become more widely used in quantitative trading, many trading firms have been producing an increasing number of features (aka factors). Therefore, how to automatically select effective features has become a pressing problem. To address these issues, we propose DoubleEnsemble, an ensemble framework leveraging learning-trajectory-based sample reweighting and shuffling-based feature selection. Specifically, we identify the key samples based on the training dynamics on each sample and elicit key features based on the ablation impact of each feature via shuffling. Our model is applicable to a wide range of base models and is capable of extracting complex patterns while mitigating the overfitting and instability issues in financial market prediction. We conduct extensive experiments, including price prediction for cryptocurrencies and stock trading, using both DNN and gradient boosting decision tree as base models. Our experimental results demonstrate that DoubleEnsemble achieves superior performance compared with several baseline methods.
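
To make the idea of shuffling-based feature selection more concrete, the following is a minimal sketch. It is not the paper's algorithm: it only illustrates the general principle of scoring a feature by how much the loss grows when that feature's column is shuffled. The function names (`shuffle_importance`, `make_toy_data`), the least-squares base model, the synthetic data, and the choice to keep the top two features are all assumptions made for illustration. The learning-trajectory-based sample reweighting is not shown.

```python
import numpy as np


def shuffle_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in squared error when each feature column is shuffled.

    Hypothetical helper for illustration; `predict` is any fitted model's
    prediction function mapping an (n, d) array to an (n,) array.
    """
    rng = np.random.default_rng(seed)
    base_loss = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to y but keeps its distribution.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores[j] += np.mean((predict(X_perm) - y) ** 2) - base_loss
    return scores / n_repeats


def make_toy_data(n=500, seed=0):
    """Synthetic data (assumption): only features 0 and 1 carry signal."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 4))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)
    return X, y


X, y = make_toy_data()

# Stand-in base model (assumption): ordinary least squares without intercept.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def predict(Z):
    return Z @ coef


scores = shuffle_importance(predict, X, y)

# Keep the two features whose shuffling hurts the loss the most.
keep = np.argsort(scores)[::-1][:2]
print("importance scores:", np.round(scores, 4))
print("selected features:", sorted(keep.tolist()))
```

Because the score only needs a prediction function, the same helper could in principle wrap a DNN or a gradient boosting model instead of the linear stand-in used here.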