Paper title
Text classification in the shipping industry using unsupervised models and Transformer-based supervised models
Paper authors
Paper abstract
Obtaining labelled data in a particular context can be expensive and time-consuming. Although different approaches, including unsupervised learning, semi-supervised learning, and self-learning, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we proposed a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry using Standard International Trade Classification (SITC) codes. Our method represents words using pretrained GloVe word embeddings and finds the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Owing to the lack of training data, the SITC numerical codes and their corresponding textual descriptions were used as training data. A small set of manually labelled cargo content data was used to evaluate the classification performance of both the unsupervised classification and the Transformer-based supervised classification. The comparison reveals that unsupervised classification significantly outperforms Transformer-based supervised classification, even after the size of the training dataset is increased by 30%. A lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from being applied successfully in practice. Unsupervised classification can provide an efficient and effective alternative for classifying text when training data is scarce.
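The unsupervised approach described in the abstract (word embeddings plus cosine similarity against label descriptions) might look roughly like the minimal sketch below. Several things here are assumptions, not details taken from the paper: the three-dimensional vectors in `TOY_EMBEDDINGS` are made-up stand-ins for pretrained GloVe vectors, the entries in `LABELS` are placeholder categories and not actual SITC codes, and averaging word vectors to represent a phrase is one common choice that the abstract does not confirm. The function names `embed`, `cosine_similarity`, and `classify` are hypothetical.

```python
import math

# Toy 3-dimensional vectors standing in for pretrained GloVe embeddings.
# These values are invented for illustration only.
TOY_EMBEDDINGS = {
    "crude": [0.90, 0.10, 0.00],
    "oil": [0.80, 0.20, 0.10],
    "petroleum": [0.85, 0.15, 0.05],
    "wheat": [0.10, 0.90, 0.10],
    "grain": [0.15, 0.85, 0.05],
    "cereals": [0.10, 0.80, 0.20],
    "steel": [0.10, 0.10, 0.90],
    "iron": [0.05, 0.15, 0.85],
}

# Placeholder label codes and descriptions (not real SITC entries).
LABELS = {
    "LABEL_A": "petroleum oil",
    "LABEL_B": "cereals grain",
    "LABEL_C": "iron steel",
}


def embed(text, embeddings):
    """Average the vectors of in-vocabulary tokens (an assumed pooling choice).

    Returns None when no token of the text is in the vocabulary.
    """
    vectors = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, labels, embeddings):
    """Return the label whose description is most similar to the text.

    Returns None if the text or every label description is out of vocabulary.
    """
    query = embed(text, embeddings)
    if query is None:
        return None
    scores = {}
    for code, description in labels.items():
        label_vector = embed(description, embeddings)
        if label_vector is not None:
            scores[code] = cosine_similarity(query, label_vector)
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for cargo in ["crude oil", "wheat", "steel", "unknown cargo"]:
        print(cargo, "->", classify(cargo, LABELS, TOY_EMBEDDINGS))
```

With real data, `TOY_EMBEDDINGS` would be replaced by vectors loaded from a pretrained GloVe file and `LABELS` by the SITC codes with their textual descriptions. No labelled training examples are needed, which matches the setting the abstract describes.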