Paper Title

Semi-Supervised Data Programming with Subset Selection

Authors

Ayush Maheshwari, Oishik Chatterjee, KrishnaTeja Killamsetty, Ganesh Ramakrishnan, Rishabh Iyer

Abstract

The paradigms of data programming, which uses weak supervision in the form of rules/labelling functions, and semi-supervised learning, which augments small amounts of labelled data with a large unlabelled dataset, have shown great promise in several text classification scenarios. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performance, particularly when the labelling functions are noisy. The first contribution of this work is a framework, SPEAR, a semi-supervised data programming paradigm that learns a joint model that effectively uses the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we also study SPEAR-SS, which additionally performs subset selection on top of the joint semi-supervised data programming objective and selects a set of examples that can be used as the labelled set by SPEAR. The goal of SPEAR-SS is to ensure that the labelled data can complement the labelling functions, thereby benefiting from both data programming and appropriately selected data for human labelling. We demonstrate that by effectively combining the semi-supervision, data programming, and subset selection paradigms, we significantly outperform the current state-of-the-art on seven publicly available datasets. The source code is available at https://github.com/ayushbits/Semi-Supervised-LFs-Subset-Selection
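
To make the notion of rules/labelling functions concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the keyword rules, the spam/ham labels, and the majority-vote aggregation are all illustrative assumptions. The abstract describes learning a joint model over the labelling functions and the feature space, which this sketch does not attempt.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a labelling function may decline to vote on an example


# Hypothetical keyword rules for a toy spam/ham task (assumptions, not from the paper).
def lf_contains_free(text: str) -> Optional[str]:
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_contains_meeting(text: str) -> Optional[str]:
    return "ham" if "meeting" in text.lower() else ABSTAIN


def lf_many_exclamations(text: str) -> Optional[str]:
    return "spam" if text.count("!") >= 3 else ABSTAIN


LABELLING_FUNCTIONS: List[Callable[[str], Optional[str]]] = [
    lf_contains_free,
    lf_contains_meeting,
    lf_many_exclamations,
]


def weak_label(text: str, lfs=LABELLING_FUNCTIONS) -> Optional[str]:
    """Aggregate labelling-function votes by simple majority.

    Returns ABSTAIN when no function fires or when the top two labels tie.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return counts[0][0]
```

Examples on which every rule abstains, or on which the rules conflict, receive no weak label in this sketch. Under the assumptions above, those are the kind of instances where a small, well-chosen human-labelled set could complement noisy labelling functions.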
