Paper Title
Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
Paper Authors
Paper Abstract
Video game testing requires game-specific knowledge as well as common sense reasoning about the events in the game. While AI-driven agents can satisfy the first requirement, it is not yet possible to meet the second requirement automatically. Therefore, video game testing often still relies on manual testing, and human testers are required to play the game thoroughly to detect bugs. As a result, fully automating game testing remains a challenge. In this study, we explore the possibility of leveraging the zero-shot capabilities of large language models for video game bug detection. By formulating the bug detection problem as a question-answering task, we show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game. To this end, we introduce the GameBugDescriptions benchmark dataset, which consists of 167 buggy gameplay videos and a total of 334 question-answer pairs across 8 games. We extensively evaluate the performance of six models across the OPT and InstructGPT large language model families on our benchmark dataset. Our results show that employing language models to detect video game bugs is promising: with the proper prompting technique, we achieve an accuracy of 70.66%, and on some video games up to 78.94%. Our code, evaluation data, and the benchmark are available at https://asgaardlab.github.io/LLMxBugs
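To make the question-answering formulation concrete, here is a minimal sketch (not the authors' code) of how a sequence of textual event descriptions could be turned into a bug-detection prompt. The event descriptions, the prompt wording, and the query_llm placeholder are illustrative assumptions; any completion-style model API (e.g., for OPT or InstructGPT) could be plugged in.

```python
# Minimal sketch of the question-answering formulation described in the
# abstract: number a sequence of in-game event descriptions and ask the
# model which one is buggy. All names and strings here are illustrative.

def build_bug_detection_prompt(events: list[str]) -> str:
    """Format textual event descriptions as a numbered QA prompt."""
    numbered = "\n".join(f"{i + 1}. {event}" for i, event in enumerate(events))
    return (
        "The following is a sequence of events from a video game:\n"
        f"{numbered}\n"
        "Q: Which event is a bug?\n"
        "A:"
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a real model call; replace with an actual LLM API client."""
    raise NotImplementedError("plug in an OPT- or InstructGPT-style API here")

if __name__ == "__main__":
    # Hypothetical gameplay description; event 2 would be the buggy one.
    events = [
        "The player character walks toward a brick wall.",
        "The player character passes through the solid wall.",
        "The player character falls out of the map.",
    ]
    print(build_bug_detection_prompt(events))
```

Running the sketch prints the assembled prompt; in an actual evaluation, the prompt would be sent to the model via query_llm and the answer compared against the ground-truth buggy event.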