OpenTelemetry signals from first principles
1:1 When something notable happens in your system, you emit an event for it. An event has a timestamp that tells you when it happened. It has a user-facing message that tells you what happened. It has contextual data that tells you where and why it happened.
1:2 Some events are more important than others, so you categorise them by their severity. Higher severity events can be distinguished from, and handled differently to, lower severity ones.
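As a minimal sketch of such an event, assuming nothing about any particular logging library: the names `Event`, `Severity`, and `at_least` are hypothetical, and the four severity levels are only one possible scale.

```python
import time
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical scale; higher numbers mean more important events.
    DEBUG = 1
    INFO = 2
    WARN = 3
    ERROR = 4


@dataclass
class Event:
    message: str                                  # what happened
    severity: Severity = Severity.INFO            # how important it is
    context: dict = field(default_factory=dict)   # where and why it happened
    timestamp: float = field(default_factory=time.time)  # when it happened


def at_least(events, minimum):
    """Keep only events at or above the given severity."""
    return [event for event in events if event.severity >= minimum]


events = [
    Event("cache warmed", Severity.DEBUG, {"component": "cache"}),
    Event("payment failed", Severity.ERROR, {"order_id": 42, "reason": "card declined"}),
]

# Higher severity events can be routed differently, for example to an alert.
urgent = at_least(events, Severity.WARN)
```

Using an ordered enum for severity means "handled differently" can be as simple as a comparison against a threshold.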
2:1 Your system is organised into procedures, both logically and mechanically. Procedures take time, and that’s useful information, so you emit an event when they complete with both the time they started and the time they ended. That lets you calculate how long they took.
2:2 You want to correlate events emitted by the same procedure, so you add a shared identifier to them. Procedures call each other, so you organise their identifiers into a hierarchy. The caller becomes the parent, and callees become the children.
2:3 Your system makes inter-process calls that are logically part of the same procedure. You want these to be correlated too, so you include the caller’s identifier as a header when making remote calls. The callee then reconstitutes it as if it had been called directly in-process.
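The text above doesn't name a header format, so the following sketch assumes the W3C Trace Context `traceparent` layout (`version-traceid-parentid-flags`). The `inject` and `extract` helpers are hypothetical and skip parts of the specification, such as rejecting all-zero identifiers.

```python
import re

# version "00", 32 hex chars of trace id, 16 hex chars of parent id, 2 hex chars of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id, span_id, sampled, headers):
    """Caller side: write the current identifiers into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: rebuild the remote caller's identifiers, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # The low bit of the flags says whether the caller is recording this trace.
        "sampled": (int(flags, 16) & 1) == 1,
    }


headers = inject("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", True, {})
remote_parent = extract(headers)
```

The callee would start its own procedure with `remote_parent["parent_id"]` as the parent, exactly as it would for an in-process caller.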
2:4 Events are normally independent, but hierarchical identifiers make events from procedure calls dependent on each other. You have to retain all of them, or none of them. You make this decision upfront before the outermost procedure call, or you defer the decision to some later point when the events are available. Other independent events emitted by procedures are retained even if the call events themselves are not.
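A sketch of the upfront variant, assuming the decision is derived deterministically from the shared identifier so that every participant reaches the same answer without coordinating. The `keep_trace` and `retain` helpers, the `kind` field, and the 50% ratio are all hypothetical.

```python
def keep_trace(trace_id, ratio):
    """Decide once per trace: treat the low 64 bits of the id as a uniform number."""
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def retain(events, ratio):
    kept = []
    for event in events:
        # Call events live or die together with the rest of their trace.
        if event["kind"] == "span" and not keep_trace(event["trace_id"], ratio):
            continue
        # Independent events are retained regardless of the trace decision.
        kept.append(event)
    return kept


TRACE_A = "0" * 31 + "1"  # low bits are small, so it is kept at ratio 0.5
TRACE_B = "f" * 32        # low bits are large, so it is dropped at ratio 0.5

events = [
    {"kind": "span", "trace_id": TRACE_A, "name": "handle_request"},
    {"kind": "span", "trace_id": TRACE_A, "name": "query_database"},
    {"kind": "span", "trace_id": TRACE_B, "name": "handle_request"},
    {"kind": "span", "trace_id": TRACE_B, "name": "query_database"},
    {"kind": "log", "trace_id": TRACE_B, "message": "query was slow"},
]

kept = retain(events, 0.5)
```

Deferring the decision instead would mean buffering every call event for a trace until it completes, then keeping or discarding the whole group based on what it contains.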
3:1 Your system uses resources that you want to track over time and correlate with other behaviour. You don’t know when usage changes; you can only ask for its current value, so you sample it at regular intervals and emit that value as an event. The frequency of sampling trades volume for resolution. You aggregate samples, for example into a minimum, maximum, or mean, to preserve different properties of the underlying distribution.
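To make the trade-off concrete, here is a sketch that samples a made-up resource at two different intervals. `memory_in_use` is a hypothetical stand-in for a real reading, and timestamps are plain integers so the example doesn't need to sleep.

```python
from collections import defaultdict
from statistics import mean


def take_samples(read_value, start, stop, interval):
    """Ask for the current value at regular intervals."""
    return [(t, read_value(t)) for t in range(start, stop, interval)]


def aggregate(samples, window):
    """Summarise samples per time window, keeping several properties of each."""
    grouped = defaultdict(list)
    for timestamp, value in samples:
        grouped[timestamp - timestamp % window].append(value)
    return {
        window_start: {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "last": values[-1],
        }
        for window_start, values in sorted(grouped.items())
    }


def memory_in_use(t):
    # Hypothetical resource: steady at 100 MB with a brief spike at t=7.
    return 900 if t == 7 else 100


fine = take_samples(memory_in_use, 0, 20, 1)    # 20 samples, sees the spike
coarse = take_samples(memory_in_use, 0, 20, 5)  # 4 samples, misses it entirely
summary = aggregate(fine, 10)                   # 2 windows, spike survives in "max"
```

Keeping only the mean would flatten the spike to 180; keeping the maximum preserves it. Which aggregates you keep depends on which properties you care about.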
3:2 When events are high volume, or you’re particularly interested in their frequency, you count their occurrences instead of emitting them directly. You sample the count at regular intervals, and emit that value as an event.
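A sketch of the idea, assuming a running total that is never reset; the `Counter` class and the numbers are hypothetical. Three thousand occurrences collapse into three emitted events.

```python
class Counter:
    """Counts occurrences instead of emitting each one as its own event."""

    def __init__(self):
        self._total = 0

    def increment(self, amount=1):
        self._total += amount

    def sample(self, timestamp):
        # The running total so far, emitted as an ordinary event.
        return {"timestamp": timestamp, "value": self._total}


requests = Counter()
emitted = []

for second in range(1, 31):
    for _ in range(100):       # 100 occurrences every second
        requests.increment()
    if second % 10 == 0:       # sample the count every 10 seconds
        emitted.append(requests.sample(second))

# Frequency falls out of the difference between consecutive samples.
rate = (emitted[1]["value"] - emitted[0]["value"]) / (
    emitted[1]["timestamp"] - emitted[0]["timestamp"]
)
```

Emitting the running total rather than the change since the last sample means a lost sample costs you resolution, not correctness.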
3:3 When sampling procedures that take time, you don’t just want to know how many calls were made; you also want to know how long they took. You construct a histogram by dividing the total count into buckets by duration, with procedures taking about the same time sharing the same bucket. Bucket granularity trades volume for resolution.
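A sketch of a histogram with fixed bucket boundaries. The `Histogram` class and the boundaries (10 ms, 100 ms, 1 s) are hypothetical, and treating each boundary as an inclusive upper bound is a choice made for this example.

```python
from bisect import bisect_left


class Histogram:
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One bucket per boundary, plus a final bucket for everything larger.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0
        self.sum = 0.0

    def record(self, duration):
        # bisect_left puts a duration equal to a boundary in that boundary's bucket.
        self.counts[bisect_left(self.bounds, duration)] += 1
        self.total += 1
        self.sum += duration


latency = Histogram([0.01, 0.1, 1.0])  # seconds
for duration in [0.004, 0.007, 0.05, 0.1, 0.3, 2.5]:
    latency.record(duration)

# latency.counts is now [2, 2, 1, 1]:
#   <= 10 ms: 2, <= 100 ms: 2, <= 1 s: 1, > 1 s: 1
```

Six calls are reduced to four numbers. More boundaries would tell you more about where each call landed, at the cost of more numbers to emit on every sample.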
Good diagnostics compress and externalise the state of your systems with sufficient detail to understand and manage them from the outside. We’ve just covered the three main observability signals that make up OpenTelemetry: (1) logs, (2) traces, and (3) metrics. I’ve presented them as a neat series of incremental progressions over events, but the real history isn’t so serial. Each one has evolved to serve different needs at different times. Many systems today make use of all of these generic signals, plus others that may be more specialised. Over time, those specialisations that prove more broadly valuable find their way into the general toolkit, and the list grows.