Paper Title
Scalable and Robust Self-Learning for Skill Routing in Large-Scale Conversational AI Systems
Authors
Abstract
Skill routing is an important component of large-scale conversational systems. In contrast to traditional rule-based skill routing, state-of-the-art systems use a model-based approach to enable natural conversations. To provide the supervision signal required to train such models, ideas such as human annotation, replication of a rule-based system, relabeling based on user paraphrases, and bandit-based learning have been suggested. However, these approaches: (a) do not scale in terms of the number of skills and skill on-boarding, (b) require very costly expert annotation or rule design, and (c) introduce risk to the user experience with each model update. In this paper, we present a scalable self-learning approach that explores routing alternatives without causing abrupt policy changes that break the user experience, learns from user interactions, and incrementally improves routing via frequent model refreshes. To enable such robust, frequent model updates, we suggest a simple and effective approach that ensures controlled policy updates for individual domains, followed by an off-policy evaluation for making deployment decisions without any need for lengthy A/B experimentation. We conduct various offline and online A/B experiments on a commercial large-scale conversational system to demonstrate the effectiveness of the proposed method in real-world production settings.
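The abstract mentions off-policy evaluation as a gate for deployment decisions but does not say which estimator is used. As a rough illustration only, the sketch below uses clipped inverse propensity scoring (IPS), a standard off-policy estimator, which may differ from the paper's actual method. The function names, the `clip` and `min_gain` parameters, and the toy data are all hypothetical.

```python
from typing import Sequence


def ips_estimate(
    rewards: Sequence[float],
    logging_probs: Sequence[float],
    target_probs: Sequence[float],
    clip: float = 10.0,
) -> float:
    """Clipped IPS estimate of a candidate policy's mean reward.

    rewards[i]       : observed reward for the skill actually routed to
    logging_probs[i] : probability the deployed policy gave that skill
    target_probs[i]  : probability the candidate policy gives that skill
    clip             : upper bound on importance weights (assumed value)
    """
    if not (len(rewards) == len(logging_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")
    if len(rewards) == 0:
        raise ValueError("at least one logged interaction is required")

    total = 0.0
    for reward, p_log, p_new in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        # Reweight each logged reward by how much more (or less) likely
        # the candidate policy is to choose the logged skill.
        weight = min(p_new / p_log, clip)
        total += weight * reward
    return total / len(rewards)


def should_deploy(candidate_value: float, baseline_value: float,
                  min_gain: float = 0.0) -> bool:
    """Hypothetical gate: deploy only if the estimated gain is large enough."""
    return candidate_value - baseline_value >= min_gain


# Toy logged data (made up for illustration).
rewards = [1.0, 0.0, 1.0, 1.0]
logging_probs = [0.5, 0.5, 0.25, 0.8]
target_probs = [0.5, 0.25, 0.5, 0.4]

baseline = sum(rewards) / len(rewards)  # on-policy mean reward
candidate = ips_estimate(rewards, logging_probs, target_probs)
decision = should_deploy(candidate, baseline)
```

Clipping the importance weights trades a small bias for lower variance, which matters when the candidate policy differs sharply from the logged one. Keeping policy updates controlled, as the abstract describes, also keeps these weights close to 1.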