1:1 When something notable happens in your system, you emit an event for it. An event has a timestamp that tells you when it happened. It has a user-facing message that tells you what happened. It has contextual data that tells you where and why it happened.


1:2 Some events are more important than others so you categorise them by their severity. Higher severity events can be distinguished from, and handled differently to, lower severity ones.

Three log events in a stream showing their timestamp, level, service, and message
Events can be visualised as a stream, ordered by their timestamps. The error is called out more prominently.
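As a minimal sketch of the event described above, here is one way it could be modelled. The `Event` and `Level` names, the numeric severity scale, and the `at_least` helper are all hypothetical and chosen for illustration; they are not taken from any particular library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical severity scale; higher values are more severe.
    DEBUG = 10
    INFO = 20
    WARN = 30
    ERROR = 40


@dataclass
class Event:
    timestamp: datetime                          # when it happened
    message: str                                 # what happened
    level: Level = Level.INFO                    # how important it is
    context: dict = field(default_factory=dict)  # where and why it happened


def at_least(events, level):
    """Return the events at or above `level`, ordered by timestamp."""
    return sorted((e for e in events if e.level >= level), key=lambda e: e.timestamp)


def _at(second):
    # Helper for building example timestamps.
    return datetime(2024, 1, 1, 12, 0, second, tzinfo=timezone.utc)


stream = [
    Event(_at(2), "payment failed", Level.ERROR, {"service": "checkout"}),
    Event(_at(0), "request received", Level.INFO, {"service": "gateway"}),
    Event(_at(1), "cache miss", Level.DEBUG, {"service": "catalog"}),
]

# Only the higher severity events, which could be routed somewhere more visible.
urgent = at_least(stream, Level.WARN)
```

Because severity is an ordered value rather than a label, filtering and routing become simple comparisons.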

2:1 Your system is organised into procedures, both logically and mechanically. Procedures take time, and that’s useful information, so you emit an event when each one completes, carrying both the time it started and the time it ended. That lets you calculate how long it took.

Three span events in a stream showing their timestamp, status, and duration
Adding a second timestamp to events changes them from representing a point in time to representing a span of time.
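A rough sketch of emitting an event on completion with both timestamps might look like the following. The `SpanEvent` type, the `span` context manager, and the in-memory `emitted` list are assumptions made for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class SpanEvent:
    name: str
    start: float  # seconds, when the procedure started
    end: float    # seconds, when the procedure ended

    @property
    def duration(self) -> float:
        return self.end - self.start


emitted = []  # stand-in for wherever events are actually sent


@contextmanager
def span(name, clock=time.time):
    # Wall-clock time is used here so the event carries real timestamps;
    # a monotonic clock is a safer choice if you only need the duration.
    start = clock()
    try:
        yield
    finally:
        # The event is emitted once, when the procedure completes.
        emitted.append(SpanEvent(name, start, clock()))


with span("load_user"):
    time.sleep(0.01)  # pretend to do some work
```

The `finally` block means an event is still emitted when the procedure raises, which is usually when the duration is most interesting.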

2:2 You want to correlate events emitted by the same procedure, so you add a shared identifier to them. Procedures call each other, so you organise their identifiers into a hierarchy. The caller becomes the parent, and callees become the children.

A trace of eight spans arranged in a tree showing their span ids and relative duration
Using hierarchical identifiers lets us visualise procedures as a tree of who-called-who. The timespans on events show the relative cost of each call, and how they parallelise.

2:3 Your system makes inter-process calls that are logically part of the same procedure. You want these to be correlated too, so you include the caller’s identifier as a header when making remote calls. The callee then reconstitutes it as if it had been called directly in-process.
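Here is a simplified sketch of passing the caller's identifier across a process boundary. The header layout is loosely modelled on the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`), but this is not a complete implementation of that specification, and the `inject` and `extract` function names are assumptions.

```python
import secrets


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current identifiers into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: rebuild the caller's context, or None if absent or malformed."""
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, parent_id, flags = parts
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        return None
    try:
        sampled = bool(int(flags, 16) & 1)
    except ValueError:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": sampled}


trace_id = secrets.token_hex(16)       # 32 hex characters
caller_span_id = secrets.token_hex(8)  # 16 hex characters

outgoing = inject({}, trace_id, caller_span_id)
incoming = extract(outgoing)  # what the remote callee would reconstruct
```

The callee would use `incoming["parent_id"]` as the parent of its own first span, so the remote work slots into the same tree.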


2:4 Events are normally independent, but hierarchical identifiers make events from procedure calls dependent on each other. You have to retain all of them, or none of them. You make this decision upfront, before the outermost procedure call, or you defer the decision to some later point when the events are available. Other independent events emitted by procedures are retained even if the call events themselves are not.
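One possible way to make the upfront decision is to derive it from the shared trace identifier, so every span in the same trace gets the same answer without any coordination. The sketch below assumes events are plain dictionaries with hypothetical `kind` and `trace_id` keys, and uses a simple modulo rule rather than any specific sampler from a real library.

```python
def keep_trace(trace_id: str, percent: int) -> bool:
    """Decide all-or-nothing for a trace, using only its (hex) identifier."""
    return int(trace_id, 16) % 100 < percent


def retain(events, percent):
    kept = []
    for event in events:
        if event.get("kind") == "span":
            # Spans depend on each other: keep the whole trace or none of it.
            if keep_trace(event["trace_id"], percent):
                kept.append(event)
        else:
            # Independent events are retained regardless of the trace decision.
            kept.append(event)
    return kept


events = [
    {"kind": "span", "trace_id": "0a", "name": "handle_request"},
    {"kind": "span", "trace_id": "0a", "name": "load_user"},
    {"kind": "span", "trace_id": "63", "name": "handle_request"},
    {"kind": "log", "trace_id": "63", "message": "cache miss"},
]

kept = retain(events, percent=50)
```

Here both spans of trace `0a` survive, the span of trace `63` is dropped, and the log event from trace `63` is kept anyway. Deferring the decision instead would mean buffering a trace's spans until enough of them are available to judge.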


3:1 Your system uses resources that you want to track over time and correlate with other behaviour. You don’t know when usage changes, you can only ask for its current value, so you sample it at regular intervals and emit that value as an event. The frequency of sampling trades volume for resolution. You aggregate samples to preserve different properties of the underlying distribution.

A line chart showing the mean, min, and max together
Samples can be visualised as charts. This one is a line chart where the x-axis is time, and the y-axis is value, for example memory usage. These samples may have been collected at an interval of 1s, and re-aggregated with the mean, min, and max at a coarser granularity. This could happen on the producer of the samples, or when processing them later. Including the min and max gives you a sense of how the mean is skewed.
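The re-aggregation described in that caption can be sketched as follows, assuming samples are `(timestamp_seconds, value)` pairs. The `downsample` name and the example readings are made up for illustration.

```python
from statistics import mean


def downsample(samples, window):
    """Group samples into fixed windows and summarise each with mean, min, max."""
    buckets = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % window  # start of the window this sample falls in
        buckets.setdefault(start, []).append(value)
    return [
        {"start": start, "mean": mean(values), "min": min(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    ]


# Memory usage sampled every 1s (arbitrary units), re-aggregated into 3s windows.
samples = [(0, 10), (1, 12), (2, 50), (3, 11), (4, 9), (5, 10)]
rollup = downsample(samples, window=3)
```

In the first window the mean is 24 while the min is 10 and the max is 50, which shows how a single spike pulls the mean upward. Keeping the min and max alongside the mean preserves that information at the coarser granularity.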

3:2 When events are high volume, or you’re particularly interested in their frequency, you count their occurrence instead of emitting them directly. You sample the count at regular intervals, and emit that value as an event.

A bar chart showing the count
Counts can be visualised as a bar chart where the x-axis is time and the y-axis is the count. Higher bars correspond to higher counts, meaning more of the event in question occurred in that interval.
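A minimal sketch of counting instead of emitting could look like this. The `IntervalCounter` class is hypothetical, and resetting the count on each flush (reporting the change per interval rather than a running total) is one design choice among several.

```python
class IntervalCounter:
    """Counts occurrences and emits the total as one event per interval."""

    def __init__(self, name):
        self.name = name
        self._count = 0

    def increment(self, n=1):
        # Called in place of emitting an event for every occurrence.
        self._count += n

    def flush(self, timestamp):
        # Called at regular intervals by whatever drives the sampling.
        event = {"timestamp": timestamp, "name": self.name, "count": self._count}
        self._count = 0  # start the next interval from zero
        return event


requests = IntervalCounter("http_requests")
for _ in range(3):
    requests.increment()

first = requests.flush(timestamp=60)    # three requests in the first interval
second = requests.flush(timestamp=120)  # none in the second
```

However many times `increment` is called, each interval produces exactly one event, which is what keeps the volume bounded.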

3:3 When sampling procedures that take time, you don’t just want to know how many calls were made, you also want to know how long they took. You construct a histogram by dividing the total count into buckets by duration, with procedures taking about the same time sharing the same bucket. Bucket granularity trades volume for resolution.

A bar chart showing the count
Adding an extra dimension to a bar chart can turn it into a heatmap. In this chart, the x-axis is time, the y-axis is the span duration, and the z-axis (shading intensity) is the count of spans in that bucket. More prominently shaded regions correspond to higher counts, meaning more spans completed in about that duration, at about that time.
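To make the bucketing concrete, here is a small sketch that divides a set of durations into buckets. The bucket boundaries in `BOUNDS_MS` are arbitrary values chosen for the example, and the `histogram` function is hypothetical.

```python
from bisect import bisect_left

# Hypothetical upper bounds, in milliseconds. Each bucket holds durations up to
# and including its bound; one extra overflow bucket catches everything larger.
BOUNDS_MS = [10, 50, 100, 500]


def histogram(durations_ms, bounds=BOUNDS_MS):
    counts = [0] * (len(bounds) + 1)
    for duration in durations_ms:
        # bisect_left finds the first bound that is >= duration.
        counts[bisect_left(bounds, duration)] += 1
    return counts


counts = histogram([3, 8, 10, 42, 75, 120, 900])
# counts[0]: <=10ms, counts[1]: <=50ms, counts[2]: <=100ms,
# counts[3]: <=500ms, counts[4]: >500ms
```

Adding more boundaries gives finer resolution at the cost of more values to emit per interval. Computing one of these per interval and stacking them along the time axis gives the rows of a heatmap like the one described above.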

Good diagnostics compress and externalise the state of your systems with sufficient detail to understand and manage them from the outside. We’ve just covered the three main observability signals that make up OpenTelemetry: 1 logs, 2 traces, and 3 metrics. I’ve presented them as a neat series of incremental progressions over events, but the real history isn’t so serial. Each one has evolved to serve different needs at different times. Many systems today make use of all of these generic signals, plus others that may be more specialised. Over time, those specialisations that prove more broadly valuable find their way into the general toolkit, and the list grows.