Paper Title
Pretrained Domain-Specific Language Model for General Information Retrieval Tasks in the AEC Domain
Paper Authors
Paper Abstract
As an essential task for the architecture, engineering, and construction (AEC) industry, information retrieval (IR) from unstructured textual data based on natural language processing (NLP) is gaining increasing attention. Although various deep learning (DL) models for IR tasks have been investigated in the AEC domain, it is still unclear how domain corpora and domain-specific pretrained DL models can improve performance in various IR tasks. To this end, this work systematically explores the impacts of domain corpora and various transfer learning techniques on the performance of DL models for IR tasks and proposes a pretrained domain-specific language model for the AEC domain. First, both in-domain and close-domain corpora are developed. Then, two types of pretrained models, including traditional word embedding models and BERT-based models, are pretrained based on various domain corpora and transfer learning strategies. Finally, several widely used DL models for IR tasks are further trained and tested based on various configurations and pretrained models. The results show that domain corpora have opposite effects on traditional word embedding models for the text classification and named entity recognition tasks but can further improve the performance of BERT-based models in all tasks. Meanwhile, BERT-based models dramatically outperform traditional methods in all IR tasks, with maximum improvements in the F1 score of 5.4% and 10.1% in the text classification and named entity recognition tasks, respectively. This research contributes to the body of knowledge in two ways: 1) demonstrating the advantages of domain corpora and pretrained DL models, and 2) releasing what is, to the best of our knowledge, the first open domain-specific dataset and pretrained language model for the AEC domain. Thus, this work sheds light on the adoption and application of pretrained models in the AEC domain.
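The workflow described in the abstract, continuing the pretraining of a BERT model on an in-domain corpus and then fine-tuning the adapted checkpoint for a downstream IR task such as text classification, can be illustrated with a minimal sketch using the Hugging Face transformers library. This is not the authors' released code: the corpus file name aec_corpus.txt, the bert-base-uncased starting checkpoint, the label count, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's released code):
# Step 1 continues masked-language-model pretraining on an in-domain corpus;
# Step 2 fine-tunes the domain-adapted checkpoint for text classification.
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    BertForSequenceClassification,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Step 1: domain-adaptive pretraining with the masked-language-model objective.
# "aec_corpus.txt" (one sentence per line) is a hypothetical in-domain corpus file.
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="aec_corpus.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
Trainer(
    model=mlm_model,
    args=TrainingArguments(
        output_dir="aec-bert",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    data_collator=collator,
    train_dataset=dataset,
).train()
mlm_model.save_pretrained("aec-bert")
tokenizer.save_pretrained("aec-bert")

# Step 2: fine-tune the domain-adapted checkpoint on a labeled downstream task.
# num_labels depends on the task; a labeled dataset and a second Trainer call
# (analogous to the one above) would complete the fine-tuning step.
clf_model = BertForSequenceClassification.from_pretrained("aec-bert", num_labels=4)
```

The same domain-adapted checkpoint could be loaded into a token-classification head (e.g., BertForTokenClassification) for the named entity recognition task mentioned in the abstract; the pretraining step is shared across downstream tasks.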