The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems

Paper Title

The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems

Authors

Lei Zhang, Vaastav Anand, Zhiqiang Xie, Ymir Vigfusson, Jonathan Mace

Abstract

Today's distributed tracing frameworks are ill-equipped to troubleshoot rare edge-case requests. The crux of the problem is a trade-off between specificity and overhead. On the one hand, frameworks can indiscriminately select requests to trace when they enter the system (head sampling), but this is unlikely to capture a relevant edge-case trace because the framework cannot know which requests will be problematic until after-the-fact. On the other hand, frameworks can trace everything and later keep only the interesting edge-case traces (tail sampling), but this has high overheads on the traced application and enormous data ingestion costs. In this paper we circumvent this trade-off for any edge-case with symptoms that can be programmatically detected, such as high tail latency, errors, and bottlenecked queues. We propose a lightweight and always-on distributed tracing system, Hindsight, which implements a retroactive sampling abstraction: instead of eagerly ingesting and processing traces, Hindsight lazily retrieves trace data only after symptoms of a problem are detected. Hindsight is analogous to a car dash-cam that, upon detecting a sudden jolt in momentum, persists the last hour of footage. Developers using Hindsight receive the exact edge-case traces they desire without undue overhead or dependence on luck. Our evaluation shows that Hindsight scales to millions of requests per second, adds nanosecond-level overhead to generate trace data, handles GB/s of data per node, transparently integrates with existing distributed tracing systems, and successfully persists full, detailed traces in real-world use cases when edge-case problems are detected.
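
The retroactive sampling idea in the abstract can be illustrated with a toy, single-node sketch. This is not Hindsight's API or implementation: the class `RetroactiveTraceBuffer`, the function `handle_request`, and the 100 ms latency threshold are all hypothetical names and values chosen for illustration. The sketch only shows the general shape of the abstraction: trace data is always recorded cheaply into a bounded buffer, and a trace is kept only if a symptom is detected before it is evicted.

```python
from collections import OrderedDict


class RetroactiveTraceBuffer:
    """Toy illustration of retroactive sampling (hypothetical, not Hindsight's API)."""

    def __init__(self, max_traces):
        self.max_traces = max_traces  # capacity of the in-memory buffer
        self.recent = OrderedDict()   # trace_id -> list of events, oldest trace first
        self.persisted = {}           # traces kept because a trigger fired

    def record(self, trace_id, event):
        """Append an event; evict the oldest buffered trace when at capacity."""
        if trace_id not in self.recent:
            if len(self.recent) >= self.max_traces:
                self.recent.popitem(last=False)
            self.recent[trace_id] = []
        self.recent[trace_id].append(event)

    def trigger(self, trace_id):
        """Persist a trace after a symptom is detected, if it is still buffered."""
        events = self.recent.pop(trace_id, None)
        if events is None:
            return False  # already evicted: the symptom was detected too late
        self.persisted[trace_id] = events
        return True


def handle_request(buf, trace_id, latency_ms, slow_threshold_ms=100.0):
    """Record every request, but keep its trace only if it turns out to be slow."""
    buf.record(trace_id, ("start", trace_id))
    buf.record(trace_id, ("end", latency_ms))
    if latency_ms > slow_threshold_ms:  # assumed symptom: high latency
        buf.trigger(trace_id)


buf = RetroactiveTraceBuffer(max_traces=3)
for trace_id, latency in [("a", 5.0), ("b", 250.0), ("c", 7.0), ("d", 6.0), ("e", 4.0)]:
    handle_request(buf, trace_id, latency)

print(sorted(buf.persisted))  # only the slow request "b" was kept
```

In this sketch the fast requests cost only a buffer append and are eventually overwritten, while the slow request is persisted in full. A real system would also have to collect the buffered data for a triggered trace from every node the request touched, which this single-node sketch does not attempt.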
