Paper Title

Reducing Inference Latency with Concurrent Architectures for Image Recognition

Paper Authors

Ramyad Hadidi, Jiashen Cao, Michael S. Ryoo, Hyesoon Kim

Paper Abstract

The high computation demand of modern deep learning architectures makes it challenging to achieve low inference latency. Current approaches to decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with higher concurrency (i.e., simultaneous execution of one inference across devices). Such single-chain dependencies are so widespread that they even implicitly bias recent neural architecture search (NAS) studies. In this visionary paper, we draw attention to an entirely new space of NAS that relaxes the single-chain dependency to provide higher concurrency and distribution opportunities. To quantitatively compare these architectures, we propose a score that encapsulates crucial metrics such as communication, concurrency, and load balancing. Additionally, we propose a new generator and transformation block that consistently deliver superior architectures compared to current state-of-the-art methods. Finally, our preliminary results show that these new architectures reduce inference latency and deserve more attention.
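
The abstract does not spell out how the proposed score combines its metrics, so the following is only a minimal sketch under stated assumptions: communication is taken as the tensor volume crossing device boundaries, concurrency as total work divided by critical-path work (exactly 1.0 for a single-chain architecture), and load balance as the lightest-to-heaviest device load ratio. The function `concurrency_score`, its weights, and the toy DAGs are all hypothetical, not the paper's actual formulation.

```python
# Hypothetical sketch -- the paper's actual scoring formula is not given in
# the abstract. Assumption: the score rewards concurrency and load balance
# and penalizes cross-device communication, with tunable weights.
from collections import defaultdict

def concurrency_score(layers, edges, placement, weights=(1.0, 1.0, 1.0)):
    """Score an architecture DAG for distributed inference.

    layers    -- {layer: compute cost, e.g. FLOPs}
    edges     -- {(src, dst): communication volume of the dependency}
    placement -- {layer: device id}
    """
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Critical path (most expensive dependency chain) via topological order.
    dist = {n: cost for n, cost in layers.items()}
    ready = [n for n in layers if indeg[n] == 0]
    while ready:
        u = ready.pop()
        for v in succ[u]:
            dist[v] = max(dist[v], dist[u] + layers[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    total_work = sum(layers.values())
    concurrency = total_work / max(dist.values())  # 1.0 for a single chain

    load = defaultdict(float)
    for n, cost in layers.items():
        load[placement[n]] += cost
    balance = min(load.values()) / max(load.values())  # 1.0 = perfectly even

    # Tensor volume crossing device boundaries, normalized into [0, 1).
    comm = sum(v for (a, b), v in edges.items() if placement[a] != placement[b])
    comm_norm = comm / (comm + total_work)

    w_c, w_b, w_m = weights
    return w_c * concurrency + w_b * balance - w_m * comm_norm

# A single chain offers no concurrency; a two-branch fork does.
chain = {"a": 2.0, "b": 2.0, "c": 2.0, "d": 2.0}
chain_deps = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
fork = {"a": 2.0, "b1": 2.0, "b2": 2.0, "d": 2.0}
fork_deps = {("a", "b1"): 1.0, ("a", "b2"): 1.0,
             ("b1", "d"): 1.0, ("b2", "d"): 1.0}

print(concurrency_score(chain, chain_deps, {"a": 0, "b": 0, "c": 1, "d": 1}))  # ~1.89
print(concurrency_score(fork, fork_deps, {"a": 0, "b1": 0, "b2": 1, "d": 1}))  # ~2.13
```

On these toy graphs the forked architecture out-scores the single chain precisely because relaxing the single-chain dependency lets two devices work on one inference at the same time, which is the opportunity the paper's NAS space is meant to expose.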
