Paper Title
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
Paper Authors
Paper Abstract
For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 21 CoW baselines across Habitat, RoboTHOR, and Pasture. In total, we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration -- and no additional training -- matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.
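The core CoW recipe described above, CLIP-based object localization gating a classical exploration policy, can be sketched with an off-the-shelf CLIP model. The snippet below is a minimal illustration rather than the authors' implementation: it scores each egocentric frame against the free-form language goal and a small set of distractor prompts, and reports whether the goal appears to be in view. The distractor prompts and the confidence threshold are assumptions made for this example.

```python
# Minimal sketch of CLIP-based, zero-shot object localization for a
# CoW-style agent. NOT the paper's implementation; the prompt set and
# threshold below are illustrative assumptions.
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def goal_in_view(frame: Image.Image, goal: str, threshold: float = 0.8) -> bool:
    """Return True if CLIP judges the language goal to be visible in the frame.

    `frame` is the agent's egocentric RGB observation; `goal` is a free-form
    description such as "a red mug on the kitchen counter".
    """
    # Contrast the goal prompt against generic distractor prompts so the
    # softmax score reflects "goal vs. something else", not raw similarity.
    prompts = [f"a photo of {goal}", "a photo of a room", "a photo of a wall"]
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
        image_feat /= image_feat.norm(dim=-1, keepdim=True)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        # Cosine similarities scaled and softmaxed over candidate prompts.
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

    return probs[0, 0].item() > threshold

# A CoW-style loop would run a classical exploration policy (e.g., frontier
# exploration) and switch to goal-directed navigation once goal_in_view fires.
```

Note that this frame-level check requires no navigation training or fine-tuning, which is what lets a CoW remain zero-shot: all task-specific knowledge enters through the text prompt at inference time.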