Paper title
Exploring Thread Coarsening on FPGA
Paper authors
Paper abstract
Over the past few years, there has been an increased interest in including FPGAs in data centers and high-performance computing clusters along with GPUs and other accelerators. As a result, it has become increasingly important to have a unified, high-level programming interface for CPUs, GPUs and FPGAs. This has led to the development of compiler toolchains to deploy OpenCL code on FPGA. However, the fundamental architectural differences between GPUs and FPGAs have led to performance portability issues: it has been shown that OpenCL code optimized for GPU does not necessarily map well to FPGA, often requiring manual optimizations to improve performance. In this paper, we explore the use of thread coarsening - a compiler technique that consolidates the work of multiple threads into a single thread - on OpenCL code running on FPGA. While this optimization has been explored on CPU and GPU, the architectural features of FPGAs and the nature of the parallelism they offer lead to different performance considerations, making an analysis of thread coarsening on FPGA worthwhile. Our evaluation, performed on our microbenchmarks and on a set of applications from open-source benchmark suites, shows that thread coarsening can yield performance benefits (up to 3-4x speedups) to OpenCL code running on FPGA at a limited resource utilization cost.
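To make the transformation concrete, the following is a minimal sketch of thread coarsening in plain C rather than OpenCL C, so that it can be compiled and checked on any host. It is not taken from the paper. The SAXPY computation, the coarsening factor `CF`, the contiguous mapping of original work-items onto coarsened ones, and all function names are assumptions chosen for illustration. The `gid` parameter stands in for the value an OpenCL kernel would obtain from `get_global_id(0)`, and the `run_*` functions stand in for an NDRange launch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coarsening factor: each coarsened work-item performs
 * the work of CF original work-items. */
enum { CF = 4 };

/* Baseline kernel body: one work-item computes one output element. */
static void saxpy_work_item(size_t gid, float a,
                            const float *x, const float *y, float *out)
{
    out[gid] = a * x[gid] + y[gid];
}

/* Coarsened kernel body: one work-item computes CF consecutive elements.
 * A strided mapping (base = gid, step = n / CF) is another possible choice;
 * the contiguous mapping here is an assumption of this sketch. */
static void saxpy_coarsened_work_item(size_t gid, float a,
                                      const float *x, const float *y,
                                      float *out)
{
    size_t base = gid * CF;
    for (size_t k = 0; k < CF; ++k) {
        out[base + k] = a * x[base + k] + y[base + k];
    }
}

/* Stand-in for launching n work-items of the baseline kernel. */
void run_baseline(size_t n, float a,
                  const float *x, const float *y, float *out)
{
    for (size_t gid = 0; gid < n; ++gid) {
        saxpy_work_item(gid, a, x, y, out);
    }
}

/* Stand-in for launching n / CF work-items of the coarsened kernel.
 * For simplicity, the sketch requires n to be a multiple of CF. */
void run_coarsened(size_t n, float a,
                   const float *x, const float *y, float *out)
{
    assert(n % CF == 0);
    for (size_t gid = 0; gid < n / CF; ++gid) {
        saxpy_coarsened_work_item(gid, a, x, y, out);
    }
}
```

Both launch functions write the same values to `out`. The coarsened version uses a quarter as many work-items, each with a loop body that a compiler is free to unroll. How that trade-off affects performance and resource usage on an FPGA is the question the paper evaluates; the sketch only illustrates the transformation.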