Outline
- Observability Primer
- *What is Observability?*
- Reliability & Metrics
- *Understanding Distributed Tracing*
- *Logs*
- Spans
- Span attributes
- Distributed Traces
Observability Primer
Core observability concepts.
What is Observability?
Observability lets us understand a system from the outside by letting us ask questions about that system without knowing its inner workings. (Think of a doctor who cannot look inside our blood vessels but can still ask what our blood pressure is.) It also lets us troubleshoot and handle novel problems (i.e. “unknown unknowns”), and helps us answer the question, “Why is this happening?”
To ask those questions of a system, the application must be properly instrumented. That is, the application code must emit signals such as traces, metrics, and logs. An application is properly instrumented when developers don’t need to add more instrumentation (extra code, for example) to troubleshoot an issue, because they already have all of the information they need.
OpenTelemetry is the mechanism by which application code is instrumented, to help make a system observable.
Reliability & Metrics
Telemetry refers to data emitted from a system about its behavior. The data can come in the form of traces, metrics, and logs.
Reliability answers the question: “Is the service doing what users expect it to be doing?” A system could be up 100% of the time, but if a user clicks “Add to Cart” to add a black pair of shoes to their shopping cart and the system doesn’t always add black shoes, the system would be said to be unreliable. In other words, a service can be available and still behave incorrectly.
Metrics are aggregations over a period of time of numeric data about your infrastructure or application. Examples include: system error rate, CPU utilization, request rate for a given service. For more on metrics and how they pertain to OpenTelemetry, see Metrics.
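As a rough illustration of "aggregations over a period of time", the sketch below turns raw request events into a request rate and an error rate for one service. It uses only the Python standard library and is not the OpenTelemetry metrics API; the `Request` type, its field names, the `summarize` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float   # seconds since the start of the observation (hypothetical)
    service: str
    status_code: int


def summarize(requests, service, window_start, window_seconds):
    """Aggregate raw request events for one service into two metrics."""
    window_end = window_start + window_seconds
    in_window = [
        r for r in requests
        if r.service == service and window_start <= r.timestamp < window_end
    ]
    total = len(in_window)
    # Assumption for this sketch: any 5xx response counts as an error.
    errors = sum(1 for r in in_window if r.status_code >= 500)
    return {
        "request_rate_per_s": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
    }


# Made-up events, for illustration only.
requests = [
    Request(0.5, "cart", 200),
    Request(1.2, "cart", 500),
    Request(2.8, "cart", 200),
    Request(3.1, "cart", 200),
    Request(4.0, "search", 200),  # different service, ignored
    Request(12.0, "cart", 200),   # outside the 10-second window, ignored
]
metrics = summarize(requests, "cart", window_start=0.0, window_seconds=10.0)
print(metrics)
```

The individual events are discarded after aggregation; only the summary numbers remain, which is what makes metrics cheap to store and easy to alert on.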
SLI, or Service Level Indicator, represents a measurement of a service’s behavior. A good SLI measures your service from the perspective of your users. An example SLI can be the speed at which a web page loads.
SLO, or Service Level Objective, is the means by which reliability is communicated to an organization/other teams. This is accomplished by attaching one or more SLIs to business value.
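One common way to make the SLI/SLO relationship concrete is to express the SLI as a ratio of "good" events and the SLO as a target for that ratio. The sketch below assumes a page-load SLI with a 1000 ms threshold and a 95% target; the threshold, the target, the function names, and the sample load times are illustrative assumptions, not values from this primer.

```python
def page_load_sli(load_times_ms, threshold_ms):
    """SLI: fraction of page loads that finished within threshold_ms."""
    if not load_times_ms:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for t in load_times_ms if t <= threshold_ms)
    return good / len(load_times_ms)


def meets_slo(sli, target):
    """SLO check: the measured SLI must be at least the agreed target."""
    return sli >= target


# Made-up page load times in milliseconds.
load_times_ms = [120, 340, 95, 2100, 410, 180, 260, 305, 150, 990]
sli = page_load_sli(load_times_ms, threshold_ms=1000)
print(f"SLI: {sli:.2f}, SLO met: {meets_slo(sli, target=0.95)}")
```

Here nine of the ten loads finish within the threshold, so the SLI is 0.90 and a 95% objective is missed.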
Understanding Distributed Tracing
To understand Distributed Tracing, let’s start with some basics.
Logs
A log is a timestamped message emitted by services or other components. Unlike traces, however, logs are not necessarily associated with any particular user request or transaction. They are found almost everywhere in software, and developers and operators alike have long relied on them heavily to understand system behavior.
Sample log:
I, [2021-02-23T13:26:23.505892 #22473] INFO -- : [6459ffe1-ea53-4044-aaa3-bf902868f730] Started GET "/" for ::1 at 2021-02-23 13:26:23 -0800
Unfortunately, logs aren’t extremely useful for tracking code execution, as they typically lack contextual information, such as where they were called from, so the call chain stays unclear.
They become far more useful when they are included as part of a span, or when they are correlated with a trace and a span.
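A minimal sketch of what "correlated with a trace and a span" can look like in practice: every log record is stamped with the identifiers of the trace and span that were active when it was emitted. The example uses only Python's standard `logging` module rather than any OpenTelemetry integration, and the `TraceContextFilter` class, the logger name, and the ID values are hypothetical.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach hypothetical trace and span IDs to every log record."""

    def __init__(self, trace_id, span_id):
        super().__init__()
        self.trace_id = trace_id
        self.span_id = span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drop the record, only enrich it


# Write to an in-memory stream so the result is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))

logger = logging.getLogger("primer.example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
# Made-up identifiers; a real tracer would supply the active ones.
logger.addFilter(TraceContextFilter(trace_id="4bf92f35", span_id="00f067aa"))

logger.info('Started GET "/" for ::1')
log_line = stream.getvalue().strip()
print(log_line)
```

With the IDs on every line, a backend can look up all log messages that belong to one request instead of matching them by timestamp alone.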
For more on logs and how they pertain to OTel, see Logs.
Spans
A span represents a unit of work or operation. It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
A span contains a name, time-related data, structured log messages, and other metadata (that is, Attributes) to provide information about the operation it tracks.
Span attributes
The following table contains examples of span attributes:
Key | Value |
---|---|
http.request.method | “GET” |
network.protocol.version | “1.1” |
url.path | “/webshop/articles/4” |
url.query | “?s=1” |
server.address | “example.com” |
server.port | 8080 |
url.scheme | “https” |
http.route | “/webshop/articles/:article_id” |
http.response.status_code | 200 |
client.address | “192.0.2.4” |
client.socket.address | “192.0.2.5” (the client goes through a proxy) |
user_agent.original | “Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0” |
For more on spans and how they pertain to OpenTelemetry, see Spans.
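To make the parts of a span tangible, here is a simplified sketch of a data structure holding a name, timestamps, structured log messages (events), and attributes. It is an illustration only and not the OpenTelemetry SDK's span type; the class layout, method names, and the event message are assumptions, while the attribute keys and values are taken from the table above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def set_attribute(self, key, value):
        """Record one piece of metadata about the operation."""
        self.attributes[key] = value

    def add_event(self, message):
        """Record a structured log message with its own timestamp."""
        self.events.append({"time": time.time(), "message": message})

    def end(self):
        """Mark the operation as finished."""
        self.end_time = time.time()


span = Span(name="GET /webshop/articles/:article_id")
span.set_attribute("http.request.method", "GET")
span.set_attribute("url.path", "/webshop/articles/4")
span.set_attribute("server.port", 8080)
span.add_event("article loaded")  # hypothetical event message
span.set_attribute("http.response.status_code", 200)
span.end()
print(span.name, span.attributes)
```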
Distributed Traces
A distributed trace, more commonly known as a trace, records the paths taken by requests (made by an application or end-user) as they propagate through multi-service architectures, like microservice and serverless applications.
Without tracing, it is challenging to pinpoint the cause of performance problems in a distributed system.
Tracing improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. It is essential for distributed systems, whose problems are commonly nondeterministic or too complicated to reproduce locally.
Tracing makes debugging and understanding distributed systems less daunting by breaking down what happens within a request as it flows through a distributed system.
A trace is made of one or more spans. The first span is the root span. Each root span represents a request from start to finish. The spans underneath the parent provide a more in-depth context of what occurs during a request (or what steps make up a request).
Many Observability backends visualize traces as waterfall diagrams that may look something like this:
Waterfall diagrams show the parent-child relationship between a root span and its child spans. When a span encapsulates another span, this also represents a nested relationship.
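The parent-child structure can be sketched in a few lines: each span carries the ID of its parent, the root span has none, and a waterfall view is a depth-first walk that indents children and draws a bar proportional to each span's duration. The code below is a plain-text approximation under those assumptions, not how any particular backend renders traces; the `Span` fields, the `render_waterfall` helper, and the sample trace are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int


def render_waterfall(spans, width=40):
    """Return one text row per span, indented by depth, with a timing bar."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    roots = children.get(None, [])
    if len(roots) != 1:
        raise ValueError("expected exactly one root span")
    root = roots[0]
    total = max(root.end_ms - root.start_ms, 1)
    rows = []

    def visit(span, depth):
        # Integer arithmetic keeps bar positions exact.
        offset = (span.start_ms - root.start_ms) * width // total
        length = max(1, (span.end_ms - span.start_ms) * width // total)
        label = "  " * depth + span.name
        rows.append(f"{label:<24}|{' ' * offset}{'#' * length}")
        kids = sorted(children.get(span.span_id, []), key=lambda c: c.start_ms)
        for child in kids:
            visit(child, depth + 1)

    visit(root, 0)
    return rows


# A made-up trace: one root span, two children, one grandchild.
trace = [
    Span("a", None, "GET /checkout", 0, 100),
    Span("b", "a", "auth", 5, 25),
    Span("c", "a", "cart-service", 30, 90),
    Span("d", "c", "db query", 40, 70),
]
rows = render_waterfall(trace)
for row in rows:
    print(row)
```

Each level of indentation corresponds to one level of nesting, and every child's bar falls within the horizontal extent of its parent's bar.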
For more on traces and how they pertain to OpenTelemetry, see Traces.