Paper Title

Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety

Paper Authors

Joshua Albrecht, Ellie Kitanidis, Abraham J. Fetterman

Paper Abstract

Large language models (LLMs) have exploded in popularity in the past few years and have achieved undeniably impressive results on benchmarks as varied as question answering and text summarization. We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result, this time outperforming humans at common sense ethical reasoning (as measured by accuracy on a subset of the ETHICS dataset). Unfortunately, we find that relying on average performance to judge capabilities can be highly misleading. LLM errors differ systematically from human errors in ways that make it easy to craft adversarial examples, or even perturb existing examples to flip the output label. We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions. Our results highlight how human-like performance does not necessarily imply human-like understanding or reasoning.
