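Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties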

Paper title

Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties

Authors

Durdy, Samantha; Gaultois, Michael; Gusev, Vladimir; Bollegala, Danushka; Rosseinsky, Matthew J.

Abstract

With machine learning a popular topic in current computational materials science literature, creating representations for compounds has become commonplace. These representations are rarely compared, as evaluating their performance, and the performance of the algorithms they are used with, is non-trivial. Because many materials datasets contain bias and skew caused by the research process, leave one cluster out cross-validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials. This raises the question of the impact, and control, of the range of cluster sizes on LOCO-CV measurement outcomes. We present a thorough comparison between composition-based representations and investigate how kernel approximation functions can be used to better separate data to enhance LOCO-CV applications. We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception. We also find that the radial basis function improves the linear separability of chemical datasets in all 10 datasets tested, and we provide a framework for applying this function in the LOCO-CV process to improve the outcome of LOCO-CV measurements regardless of machine learning algorithm, choice of metric, and choice of compound representation. We recommend kernelised LOCO-CV as a training paradigm for those looking to measure the extrapolatory power of an algorithm on materials data.
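To make the idea in the abstract concrete, the following is a minimal sketch of one plausible reading of kernelised LOCO-CV. It is not the authors' implementation. It assumes the kernel approximation is a random Fourier feature map for the RBF kernel, that clusters come from plain k-means on the mapped features, and that each cluster is held out in turn as a test fold. The function names, the toy data, and the parameters (`n_components`, `gamma`, `k`) are all hypothetical choices made for illustration.

```python
import math
import random


def rbf_random_features(X, n_components=16, gamma=1.0, seed=0):
    """Approximate the feature map of exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    dim = len(X[0])
    # Frequencies ~ N(0, 2 * gamma) per dimension, phases ~ U(0, 2 * pi).
    W = [[rng.gauss(0.0, math.sqrt(2.0 * gamma)) for _ in range(dim)]
         for _ in range(n_components)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    scale = math.sqrt(2.0 / n_components)
    return [[scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
             for w, bi in zip(W, b)]
            for x in X]


def kmeans_labels(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of Z."""
    rng = random.Random(seed)
    centres = [list(z) for z in rng.sample(Z, k)]
    labels = [0] * len(Z)
    for _ in range(n_iter):
        # Assign each point to its nearest centre (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((a - m) ** 2
                                        for a, m in zip(z, centres[c])))
                  for z in Z]
        for c in range(k):
            members = [z for z, lab in zip(Z, labels) if lab == c]
            if members:  # keep the previous centre if a cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def loco_splits(labels):
    """Yield (train_indices, test_indices), holding out one cluster per fold."""
    for held_out in sorted(set(labels)):
        train = [i for i, lab in enumerate(labels) if lab != held_out]
        test = [i for i, lab in enumerate(labels) if lab == held_out]
        yield train, test


# Toy stand-in for a composition-based representation (hypothetical data).
rng = random.Random(42)
X = [[rng.random() for _ in range(4)] for _ in range(40)]

Z = rbf_random_features(X, n_components=16, gamma=1.0)  # kernelise first
labels = kmeans_labels(Z, k=4)                          # then cluster
for train_idx, test_idx in loco_splits(labels):
    # A model would be fitted on X[train_idx] and scored on X[test_idx].
    print(len(train_idx), len(test_idx))
```

In this sketch the kernel map is used only to decide the cluster assignments, so the model itself can still be trained on the original representation. In practice a library such as scikit-learn offers equivalent building blocks (`RBFSampler`, `KMeans`, `LeaveOneGroupOut`), which would be the more idiomatic choice than the hand-written versions above.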
