Study of Encoder-Decoder Architectures for Code-Mix Search Query Translation

Paper Title

Study of Encoder-Decoder Architectures for Code-Mix Search Query Translation

Authors

Mandar Kulkarni, Soumya Chennabasavaraj, Nikesh Garera

Abstract

With the broad reach of the internet and smartphones, e-commerce platforms have an increasingly diversified user base. Since native-language users are not conversant in English, their preferred browsing mode is their regional language or a combination of their regional language and English. From our recent study of the query data, we noticed that many of the queries we receive are code-mix, specifically Hinglish, i.e., queries with one or more Hindi words written in English (Latin) script. We propose a transformer-based approach for code-mix query translation to enable users to search with these queries. We demonstrate the effectiveness, for this task, of encoder-decoder models pre-trained on a large corpus of unlabeled English text. Using generic domain translation models, we created a pseudo-labelled dataset for training the model on search queries and verified the effectiveness of various data augmentation techniques. Further, to reduce the latency of the model, we use knowledge distillation and weight quantization. The effectiveness of the proposed method has been validated through experimental evaluations and A/B testing. The model is currently live on the Flipkart app and website, serving millions of queries.
