DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Paper Title

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Authors

Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim

Abstract

Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. The conventional platforms such as GPU are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
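The two stages named in the abstract can be illustrated with a minimal sketch. It is not the DFX implementation: `generate` and `toy_next_token` are hypothetical names, and the toy "model" is an arithmetic stand-in for a GPT-2 forward pass. It also assumes that the summarization stage produces the first output token. The only point is that the summarization stage consumes the whole context in one pass, while the generation stage needs one dependent pass per additional token.

```python
from typing import Callable, List, Tuple


def generate(
    context: List[int],
    n_new: int,
    next_token: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Return (generated tokens, number of sequential model passes)."""
    if n_new <= 0:
        return [], 0

    tokens = list(context)
    passes = 0

    # Summarization stage: the whole input context is processed in a
    # single pass (parallel across context tokens on real hardware).
    # Assumption: this pass also yields the first output token.
    tokens.append(next_token(tokens))
    passes += 1

    # Generation stage: each further token depends on the previous one,
    # so these passes cannot be overlapped with each other.
    for _ in range(n_new - 1):
        tokens.append(next_token(tokens))
        passes += 1

    return tokens[len(context):], passes


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass (not GPT-2)."""
    return (sum(tokens) + 1) % 50257


if __name__ == "__main__":
    out, passes = generate([1, 2, 3], 4, toy_next_token)
    print(out, passes)
```

In this sketch the number of sequential passes equals the number of generated tokens, regardless of the context length. That is the latency pattern the abstract attributes to the generation stage.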
