Paper Title
Automatic Document Selection for Efficient Encoder Pretraining
Paper Authors
Paper Abstract
Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and several downstream tasks in the target domain, it consistently outperforms random selection while using 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
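The abstract refers to Cynical Data Selection, which scores candidate sentences by how much adding them to the already-selected set would reduce the cross-entropy of a representative target corpus, and selects greedily. The paper's exact formulation is not reproduced here; the following is a minimal Python sketch of that greedy, entropy-gain-driven loop, assuming an add-alpha smoothed unigram model and hypothetical helper names (cross_entropy, greedy_select). It is an illustration of the general idea, not the authors' implementation.

import math
from collections import Counter

def cross_entropy(target_counts, selected_counts, vocab_size, alpha=1.0):
    # Cross-entropy of the target (representative) corpus under an
    # add-alpha smoothed unigram model estimated from the selected data.
    total_target = sum(target_counts.values())
    denom = sum(selected_counts.values()) + alpha * vocab_size
    h = 0.0
    for word, count in target_counts.items():
        p = (selected_counts.get(word, 0) + alpha) / denom
        h -= (count / total_target) * math.log(p)
    return h

def greedy_select(candidates, target_counts, budget, vocab_size):
    # Greedily pick the candidate sentence whose addition most lowers the
    # target cross-entropy, repeating until the budget is exhausted.
    selected, selected_counts = [], Counter()
    for _ in range(budget):
        best, best_h = None, float("inf")
        for sent in candidates:
            trial = selected_counts + Counter(sent.split())
            h = cross_entropy(target_counts, trial, vocab_size)
            if h < best_h:
                best, best_h = sent, h
        if best is None:
            break
        selected.append(best)
        selected_counts.update(best.split())
        candidates = [s for s in candidates if s is not best]
    return selected

# Toy usage: the sentence closest to the (tiny) target domain is picked first.
target = Counter("the court ruled on the appeal".split())
pool = [
    "stock prices fell sharply today",
    "the appeal was heard by the court",
    "recipe for chocolate cake",
]
print(greedy_select(pool, target, budget=2, vocab_size=50000))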