Making Machine Learning Datasets and Models FAIR for HPC: A Methodology and Case Study

Paper Title

Making Machine Learning Datasets and Models FAIR for HPC: A Methodology and Case Study

Authors

Pei-Hung Lin, Chunhua Liao, Winson Chen, Tristan Vanderbruggen, Murali Emani, Hailu Xu

Abstract

The FAIR Guiding Principles aim to improve the findability, accessibility, interoperability, and reusability of digital content by making it both human- and machine-actionable. However, these principles have not yet been broadly adopted in the domain of machine learning-based program analyses and optimizations for High-Performance Computing (HPC). In this paper, we design a methodology to make HPC datasets and machine learning models FAIR after investigating existing FAIRness assessment and improvement techniques. Our methodology includes a comprehensive, quantitative assessment of selected data, followed by concrete, actionable suggestions to improve FAIRness with respect to common issues related to persistent identifiers, rich metadata descriptions, licenses, and provenance information. Moreover, we select a representative training dataset to evaluate our methodology. The experiment shows that the methodology can effectively improve the dataset and model's FAIRness from an initial score of 19.1% to a final score of 83.0%.
