Paper title
Kubric: A scalable dataset generator
Paper authors
Paper abstract
Data is the driving force of machine learning, and the amount and quality of training data often matter more for a system's performance than architecture and training details. But collecting, processing, and annotating real data at scale is difficult and expensive, and frequently raises additional privacy, fairness, and legal concerns. Synthetic data is a powerful tool with the potential to address these shortcomings: 1) it is cheap, 2) it supports rich ground-truth annotations, 3) it offers full control over the data, and 4) it can circumvent or mitigate problems regarding bias, privacy, and licensing. Unfortunately, software tools for effective data generation are less mature than those for architecture design and training, which leads to fragmented generation efforts. To address these problems we introduce Kubric, an open-source Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes with rich annotations, and that scales seamlessly to large jobs distributed over thousands of machines, generating TBs of data. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation. We release Kubric, the assets used, all of the generation code, and the rendered datasets for reuse and modification.