Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets

Paper Title

Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets

Authors

Camila Laranjeira, João Macedo, Sandra Avila, Jefersson A. dos Santos

Abstract

The online sharing and viewing of Child Sexual Abuse Material (CSAM) are growing so fast that human experts can no longer keep up with manual inspection. However, the automatic classification of CSAM is a challenging field of research, largely because the target data is inaccessible: it is, and should forever remain, private and in the sole possession of law enforcement agencies. To help researchers draw insights from unseen data and to safely provide further understanding of CSAM images, we propose an analysis template that goes beyond the statistics of the dataset and its labels. It focuses on the extraction of automatic signals, provided both by pre-trained machine learning models (e.g., object categories and pornography detection) and by image metrics such as luminance and sharpness. Only aggregated statistics of sparse signals are provided, to guarantee the anonymity of the victimized children and adolescents. The pipeline allows filtering the data by applying thresholds to each specified signal, and it reports the distribution of such signals within the resulting subset, the correlations between signals, and a bias evaluation. We demonstrate our proposal on the Region-based annotated Child Pornography Dataset (RCPD), one of the few CSAM benchmarks in the literature, composed of over 2000 samples among regular and CSAM images and produced in partnership with Brazil's Federal Police. Although automatic signals are noisy and limited in several senses, we argue that they can highlight important aspects of the overall distribution of the data, which is valuable for databases that cannot be disclosed. Our goal is to safely publicize the characteristics of CSAM datasets, encouraging researchers to join the field and perhaps other institutions to provide similar reports on their benchmarks.
