Paper Title
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Paper Authors
Paper Abstract
BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks on which prior language model evaluations did not outperform the average human rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.
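The abstract contrasts answer-only few-shot prompting with chain-of-thought prompting. The following minimal Python sketch illustrates that difference on a BBH-style Boolean-expressions question; the exemplar wording, the task instruction text, and the `build_prompt` helper are illustrative assumptions, not the exact prompts or code used in the paper.

```python
# Minimal sketch contrasting answer-only few-shot prompting with
# chain-of-thought (CoT) prompting on a BBH-style task (boolean expressions).
# The exemplar text and formatting below are illustrative assumptions,
# not the paper's actual prompts.

TASK_INSTRUCTION = "Evaluate the result of a random Boolean expression."

# Answer-only exemplar: the model sees only the final answer.
ANSWER_ONLY_EXEMPLAR = (
    "Q: not ( True ) and ( True ) is\n"
    "A: False"
)

# CoT exemplar: the model sees intermediate reasoning before the answer.
COT_EXEMPLAR = (
    "Q: not ( True ) and ( True ) is\n"
    "A: Let's think step by step.\n"
    "not ( True ) evaluates to False. False and True evaluates to False.\n"
    "So the answer is False."
)

def build_prompt(exemplar: str, question: str) -> str:
    """Assemble a few-shot prompt from the task instruction, one exemplar,
    and the new question to be answered."""
    return f"{TASK_INSTRUCTION}\n\n{exemplar}\n\nQ: {question}\nA:"

if __name__ == "__main__":
    question = "True and not not ( not False ) is"
    print("--- answer-only prompt ---")
    print(build_prompt(ANSWER_ONLY_EXEMPLAR, question))
    print("\n--- chain-of-thought prompt ---")
    print(build_prompt(COT_EXEMPLAR, question))
```

Either prompt would be sent to a language model for completion; the paper's finding is that, on most BBH tasks, completions conditioned on the CoT-style exemplars score substantially higher than completions conditioned on answer-only exemplars.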