Paper title

The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems

Authors

Lei Zhang, Vaastav Anand, Zhiqiang Xie, Ymir Vigfusson, Jonathan Mace

Abstract

Today's distributed tracing frameworks are ill-equipped to troubleshoot rare edge-case requests. The crux of the problem is a trade-off between specificity and overhead. On the one hand, frameworks can indiscriminately select requests to trace when they enter the system (head sampling), but this is unlikely to capture a relevant edge-case trace because the framework cannot know which requests will be problematic until after-the-fact. On the other hand, frameworks can trace everything and later keep only the interesting edge-case traces (tail sampling), but this has high overheads on the traced application and enormous data ingestion costs. In this paper we circumvent this trade-off for any edge-case with symptoms that can be programmatically detected, such as high tail latency, errors, and bottlenecked queues. We propose a lightweight and always-on distributed tracing system, Hindsight, which implements a retroactive sampling abstraction: instead of eagerly ingesting and processing traces, Hindsight lazily retrieves trace data only after symptoms of a problem are detected. Hindsight is analogous to a car dash-cam that, upon detecting a sudden jolt in momentum, persists the last hour of footage. Developers using Hindsight receive the exact edge-case traces they desire without undue overhead or dependence on luck. Our evaluation shows that Hindsight scales to millions of requests per second, adds nanosecond-level overhead to generate trace data, handles GB/s of data per node, transparently integrates with existing distributed tracing systems, and successfully persists full, detailed traces in real-world use cases when edge-case problems are detected.
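
The retroactive sampling idea in the abstract can be illustrated with a small sketch. This is not Hindsight's implementation or API: the class `RetroactiveBuffer`, the function `handle_request`, and the 100 ms threshold are hypothetical names and values chosen for illustration. The sketch assumes a single node that records every event into a bounded in-memory buffer and copies out a trace only when a symptom (here, high latency) is detected.

```python
from collections import deque


class RetroactiveBuffer:
    """Hypothetical per-node buffer: records every event, persists only on trigger."""

    def __init__(self, capacity):
        # Bounded buffer: the oldest events are evicted once capacity is reached.
        self.recent = deque(maxlen=capacity)
        self.persisted = {}

    def record(self, trace_id, event):
        # Always-on path: append locally, nothing is shipped anywhere.
        self.recent.append((trace_id, event))

    def trigger(self, trace_id):
        # Symptom detected: lazily collect whatever is still buffered for this trace.
        events = [e for t, e in self.recent if t == trace_id]
        self.persisted[trace_id] = events
        return events


def handle_request(buf, trace_id, latency_ms, slow_threshold_ms=100):
    # Every request is traced into the buffer, whether or not it turns out to be slow.
    buf.record(trace_id, f"start {trace_id}")
    buf.record(trace_id, f"done in {latency_ms} ms")
    # Assumed symptom detector: a fixed latency threshold.
    if latency_ms > slow_threshold_ms:
        return buf.trigger(trace_id)
    return []
```

In this sketch a fast request leaves nothing behind once its events are evicted, while a slow request has its buffered events copied out at trigger time, which mirrors the dash-cam analogy. A real system would also have to collect the matching trace data from every other node the request touched.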
