Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Paper title

Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Authors

Jiaming Luo, Frederik Hartmann, Enrico Santus, Yuan Cao, Regina Barzilay

Abstract

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the position favored by current scholarship.
