Paper Title
Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models
Paper Authors
Paper Abstract
In this work, we present recommendations on the evaluation of state-of-the-art generative models for constrained generation tasks. Progress on generative models has been rapid in recent years. These large-scale models have had three impacts: first, the fluency of generation in both the language and vision modalities has rendered common average-case evaluation metrics much less useful for diagnosing system errors. Second, the same substrate models now form the basis of many applications, driven both by the utility of their representations and by phenomena such as in-context learning, which raise the abstraction level of interacting with such models. Third, user expectations around these models and their feted public releases have made the technical challenge of out-of-domain generalization much less excusable in practice. Yet our evaluation methodologies have not adapted to these changes: while the utility of generative models and the methods of interacting with them have expanded, a similar expansion has not been observed in their evaluation practices. In this paper, we argue that the scale of generative models can be exploited to raise the abstraction level at which evaluation itself is conducted, and we provide recommendations to that end. Our recommendations are based on leveraging specifications as a powerful instrument for evaluating generation quality and are readily applicable to a variety of tasks.
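Below is a minimal, hypothetical sketch of what operationalizing a specification as an executable check over model outputs might look like, as opposed to scoring against a fixed test set alone. The `Spec` class, the example terminology and length constraints, and the toy translation pairs are illustrative assumptions for this sketch, not the authors' method.

```python
# Hypothetical sketch: specifications as named pass/fail predicates over
# (source, output) pairs, evaluated as pass rates over a batch of outputs.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Spec:
    """A named pass/fail predicate over a (source, output) pair."""
    name: str
    check: Callable[[str, str], bool]


def must_contain_term(term: str) -> Spec:
    # Illustrative terminology constraint: the output must include a required term.
    return Spec(f"contains '{term}'", lambda src, out: term in out)


def length_ratio_within(low: float, high: float) -> Spec:
    # Illustrative length constraint: output/source word-count ratio stays in a band.
    return Spec(
        f"length ratio in [{low}, {high}]",
        lambda src, out: low <= len(out.split()) / max(len(src.split()), 1) <= high,
    )


def evaluate(pairs: List[Tuple[str, str]], specs: List[Spec]) -> Dict[str, float]:
    """Report the fraction of (source, output) pairs satisfying each specification."""
    return {
        spec.name: sum(spec.check(src, out) for src, out in pairs) / len(pairs)
        for spec in specs
    }


if __name__ == "__main__":
    # Toy constrained-translation outputs; in practice these would come from a model.
    pairs = [("hello world", "hallo welt"), ("good morning", "guten morgen")]
    specs = [must_contain_term("welt"), length_ratio_within(0.5, 2.0)]
    print(evaluate(pairs, specs))
    # -> {"contains 'welt'": 0.5, 'length ratio in [0.5, 2.0]': 1.0}
```

Framing each requirement as a declarative pass/fail check lets the same evaluation harness be reused across tasks by swapping only the list of specifications, which is one way to read the rise in abstraction level that the abstract argues for.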