Paper Title
An Experience-based Direct Generation approach to Automatic Image Cropping
Paper Authors
Paper Abstract
Automatic Image Cropping is a challenging task with many practical downstream applications. The task is often divided into sub-problems: generating crop candidates, finding the visually important regions, and determining aesthetics to select the most appealing candidate. Prior approaches model one or more of these sub-problems separately and often combine them sequentially. We propose a novel convolutional neural network (CNN) based method that crops images directly, without explicitly modeling image aesthetics, evaluating multiple crop candidates, or detecting visually salient regions. Our model is trained on a large dataset of images cropped by experienced editors and can simultaneously predict bounding boxes for multiple fixed aspect ratios. We consider the aspect ratio of the cropped image to be a critical factor influencing its aesthetics. Prior approaches to automatic image cropping did not enforce the aspect ratio of the outputs, likely due to a lack of datasets for this task. We therefore benchmark our method on public datasets for two related tasks: first, aesthetic image cropping without regard to aspect ratio, and second, thumbnail generation, which requires fixed-aspect-ratio outputs but where aesthetics are not crucial. We show that our strategy is competitive with, or performs better than, existing methods on both tasks. Furthermore, our one-stage model is easier to train and significantly faster at inference than existing two-stage or end-to-end methods. We present a qualitative evaluation study and find that our model generalizes to diverse images from unseen datasets and often retains the compositional properties of the original images after cropping. Our results demonstrate that explicitly modeling image aesthetics or visual attention regions is not necessarily required to build a competitive image cropping algorithm.
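To make the "direct generation" idea concrete, the following is a minimal, hypothetical sketch of a single-stage model of this kind: a CNN backbone with a regression head that emits one bounding box per target aspect ratio in a single forward pass. The backbone choice, the output parameterization, and all names here are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class DirectCropper(nn.Module):
    """Single-stage cropper: one box per fixed aspect ratio (illustrative)."""

    def __init__(self, num_aspect_ratios: int = 3):
        super().__init__()
        # Generic pretrained feature extractor (an assumption; the paper
        # does not prescribe this backbone).
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        # One (x, y, w, h) box, normalized to [0, 1], per aspect ratio.
        self.head = nn.Linear(2048, num_aspect_ratios * 4)
        self.num_aspect_ratios = num_aspect_ratios

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.features(images).flatten(1)           # (B, 2048)
        boxes = torch.sigmoid(self.head(feats))            # (B, R*4)
        return boxes.view(-1, self.num_aspect_ratios, 4)   # (B, R, 4)


# All aspect-ratio boxes come from one forward pass; no candidate
# generation, saliency map, or aesthetics scorer is run at inference.
model = DirectCropper(num_aspect_ratios=3)
boxes = model(torch.randn(2, 3, 224, 224))  # shape: (2, 3, 4)
```

Training such a model would regress the predicted boxes against editor-provided crops, for example with an L1 or IoU-style loss, consistent with the abstract's claim that no explicit aesthetics or saliency module is required.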