Paper Title
Pre-Trained Language Models for Interactive Decision-Making
Paper Authors
Paper Abstract
Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. Next, we integrate an active data gathering procedure in which agents iteratively interact with the environment, relabel past "failed" experiences with new goals, and update their policies in a self-supervised loop. Active data gathering further improves combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and LM-based weight initialization are both important for generalization. Surprisingly, however, the format in which policy inputs are encoded (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.
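To make the described setup concrete, below is a minimal sketch (not the authors' code) of the two ideas the abstract names: a policy whose backbone is a pre-trained LM fed a sequence of goal/observation embeddings and trained by behavior cloning, followed by hindsight relabeling of "failed" trajectories. The class and function names, the choice of GPT-2 as the pre-trained LM, the input dimensions, and the discrete action space are all illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: LM-initialized policy over goal/observation embedding sequences.
# Assumes the Hugging Face `transformers` library and a discrete action space.
import torch
import torch.nn as nn
from transformers import GPT2Model

class LMPolicy(nn.Module):
    def __init__(self, num_actions: int, input_dim: int):
        super().__init__()
        self.lm = GPT2Model.from_pretrained("gpt2")        # pre-trained LM weights (assumed backbone)
        hidden = self.lm.config.n_embd                      # LM hidden width (768 for gpt2)
        self.project = nn.Linear(input_dim, hidden)         # map environment features to LM width
        self.action_head = nn.Linear(hidden, num_actions)   # predict the next action

    def forward(self, goal_obs_seq: torch.Tensor) -> torch.Tensor:
        # goal_obs_seq: (batch, seq_len, input_dim) sequence of goal and observation embeddings
        embeds = self.project(goal_obs_seq)
        out = self.lm(inputs_embeds=embeds).last_hidden_state
        return self.action_head(out[:, -1])                 # logits over the next action

def hindsight_relabel(trajectory, achieved_goal):
    """Relabel a 'failed' trajectory with the goal it actually achieved (illustrative only)."""
    return [(achieved_goal, obs, act) for (_, obs, act) in trajectory]

# Behavior cloning step: cross-entropy against the demonstrated (or relabeled) action.
policy = LMPolicy(num_actions=50, input_dim=128)
logits = policy(torch.randn(4, 16, 128))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 50, (4,)))
```

In this sketch the pre-trained weights enter only through the backbone initialization; swapping `GPT2Model.from_pretrained("gpt2")` for a randomly initialized transformer of the same shape corresponds to the ablation the abstract describes when it isolates the effect of LM-based weight initialization.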