Tagging log events for effective correlation

Correlation, at least when we’re talking diagnostic logs, is the process of finding events that are related to each other in some way. It’s typical to correlate events raised during a transaction, on a specific thread, on a machine or from steps in some kind of process.

Correlation is king in the world of operating complex, distributed and asynchronous apps: it’s how the signal is picked out of the noise – the wheat from the irrelevant chaff – the diamonds that will help you solve a critical issue from the mud of day-to-day happenings.

I gave a presentation a few weeks back at DDD Brisbane on building better apps by taking an active approach to logging. The last post should give you an idea of the flavor of it. After the talk some nice folks approached me and said I’d filled in a missing link for them – it made me realize I’ve been using a little strategy for correlation without really thinking about it, and you might find this useful too.

Contextual correlation

For a long time, diagnostic logs have incorporated contextual information into events. Context tells you what was happening when an event was raised, or where the event was raised from. In its earliest form it might be the thread ID you sometimes see formatted into text logs:

2015-01-22 10:55:05.776 INFO [22] Serving catalog index
2015-01-22 10:55:05.902 WARN [8] Query took 109 ms
2015-01-22 10:55:05.902 INFO [22] Responding with OK

The thread ID in this case is included in each log message between brackets after the level. It tells us that the warning on the second line was raised on thread 8, which isn’t the same thread as was concurrently serving the product catalog (thread 22), so the source of the warning was elsewhere.

Searching a simple text log like this we could grep for /[8]/ to see what else was happening on thread 8 and hopefully narrow down on the source of the slow query.

Serilog does contextual correlation in a few different ways.

Enrichers lke the thread id enricher can be used to attach different contextual properties to events:

var log = new LoggerConfiguration()
  .Enrich.WithThreadId()
   // Other config here, then…
  .CreateLogger();

Enrichers work well when the context of an event can be determined statically – i.e. it’s something ambient like the thread id, machine name, the logged-in user, and so-on.

For context that’s more localized, there’s a special enricher that can be added to enable the Log Context:

  .Enrich.FromLogContext()

Once the log context is enabled, a block of related messages can be tagged using the methods on the LogContext class:

using (LogContext.PushProperty("MessageId", GetCurrentMessageId()))
{
  Log.Information("Processing {OrderId}", message.Order.Id); 

  // Later
  Log.Information("Order processing complete");
}

If we look at these logs in Seq, we’ll see an MessageId property attached to each event within the using block:

1-MessageId

This can be used as a filter, e.g. MessageId == 12345, to view the events raised when processing the message.

Contextual correlation is bread-and-butter stuff. If your application processes work asynchronously, you need to be able to filter down to a single task, and properties tagging events from a context make this as painless as possible.

It’s worth noting that contexts overlap – an event can belong to a thread as well as to a message. (This is one of the arguments for structured logs – it’s much easier to work with events with real, named properties than it is to devise readable, unambiguous ways to embed all of these into the format of a text log message.)

Setting up and using the log context does add some noise to application code, so it’s something to be used judiciously. To tag events for correlation without additional machinery, another technique is useful.

Lateral correlation

If you think about all the contexts in your running application, you might come up with a mental picture like this:

Contextual

The green lines show sequences of events tagged with the same RequestId property, while blue shows a MessageId property that might be added to events during back-end processing. There’s some simplification here, given the way contexts overlap, but essentially each context is a temporal sequence of correlated events.

When it comes time to use these logs however, scenarios frequently cross these little islands of navigability, so something else is needed.

Let’s say we’ve received an issue report tagged with an exception message. We find the exception using the text:

2-InvOrder

And then follow the MessageId contextual property to see what happened in the message handler that failed.

3-Mid

Now, we know somehow we were sent an order without any items – where did it come from? This is where a lateral correlation identifier comes in handy. The second event in the three above implements this:

Log.Information("Processing {OrderId}", message.Order.Id);

OrderId is a business-domain identifier included in the message itself that links the event across lines of causality to others related to the same order.

Once we jump across contexts using the OrderId:

4-Find

We can now see an event from the HTTP request that initiated the order.

5-Req

(A keen-eyed reader might note here that searching on ItemCount == 0 would have narrowed things down a little faster - but identifying this opportunity still requires some navigation through the logs from the point of failure too.)

Correlating by HTTP request Id we get all of the events from the request and should (on a real event stream) be able to get to the bottom of the issue:

6-Final

Patching this into our mental model, OrderId let us navigate laterally between the two sequences of events.

Lateral

Logging in two dimensions

Armed with a tool like structured logging that can easily be overused and create clutter, the last thing anyone wants to do is bung every possible property onto every event in an attempt ease diagnostics later on.

Keeping a clear distinction between the contextual and lateral correlation links makes it possible to navigate effectively through events raised by the app with a minimum of syntactic noise.

It turns out that in most situations I’ve encountered, architectural and infrastructural concerns dominate the notion of context. HTTP request id, thread id, message id – these are all mechanical concerns that get buried into plumbing and attached using enrichers or the log context. This is pretty convenient - the noisier APIs for adding contextual properties are best kept in infrastructural code when possible.

Identifiers from the business domain provide lateral navigation through logs – customer ids, order ids and the like – letting you skip between the various contexts in which an entity was referenced. Using structured logging these links can be created without much fanfare - just logging an event with the identifier in its message text is enough to make the connection.

It’s a simple perspective that helps me use structured logs more effectively, and it can be a nice foundation for guidelines shared with a team. If you use this strategy already, or another similar one, I’d love to hear more about your experiences!