Paper Title
ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models
Paper Authors
Paper Abstract
State-of-the-art poetry generation systems are often complex. They either consist of task-specific model pipelines, incorporate prior knowledge in the form of manually created constraints, or both. In contrast, end-to-end models would not suffer from the overhead of having to model prior knowledge and could learn the nuances of poetry from data alone, reducing the degree of human supervision required. In this work, we investigate end-to-end poetry generation conditioned on styles such as rhyme, meter, and alliteration. We identify and address a lack of training data and mismatched tokenization algorithms as possible limitations of past attempts. In particular, we successfully pre-train ByGPT5, a new token-free decoder-only language model, and fine-tune it on a large custom corpus of English and German quatrains annotated with our styles. We show that ByGPT5 outperforms other models such as mT5, ByT5, GPT-2, and ChatGPT, while also being more parameter-efficient and performing favorably compared to humans. In addition, we analyze its runtime performance and demonstrate that it is not prone to memorization. We make our code, models, and datasets publicly available.
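The abstract's point about mismatched tokenization has a concrete intuition: rhyme is a property of character-level word endings, which subword vocabularies can split into unrelated token IDs, whereas a token-free model observes the endings directly as bytes. The following minimal Python sketch illustrates this; it is not code from the paper, and the helper names byte_ids and shared_suffix are hypothetical.

    # Token-free models such as ByGPT5 consume raw UTF-8 bytes, so the
    # shared ending that makes two words rhyme shows up as an identical
    # byte suffix; no subword segmentation can hide it.

    def byte_ids(text):
        """Byte-level 'tokenization': one integer per UTF-8 byte."""
        return list(text.encode("utf-8"))

    def shared_suffix(a, b):
        """Length of the common byte suffix of two byte sequences."""
        n = 0
        while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
            n += 1
        return n

    a, b = byte_ids("bright"), byte_ids("light")
    n = shared_suffix(a, b)
    print(bytes(a[len(a) - n:]).decode("utf-8"))  # -> "ight"

A subword tokenizer, by contrast, may map "bright" and "light" to token IDs that share no internal structure, so the rhyming suffix is not directly observable by the model.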