Trace sampling in SerilogTracing

Sampling in SerilogTracing is taking shape. If you haven’t heard of SerilogTracing, that’s not surprising - it’s a brand new project launched in January 2024. But, we’ve reached 250k downloads of the core package, and we’re still picking up steam. 😎

This post takes a look at the sampling functionality that’s arrived in SerilogTracing 2.2.

Trace sampling

There are a few things we can do to reduce the volume of structured log and trace data an application generates. We can reduce the granularity of events, reduce event size, and conditionally switch events on and off using leveling.

Sampling is another popular option: if enough data is being generated, collecting only some portion of it can still provide reasonable visibility. Because it’s externally-controllable using standardized HTTP headers, sampling is known for being especially effective in situations where the source applications can’t otherwise be modified.

The first thing to understand about sampling in SerilogTracing is that it’s trace sampling: a trace is a collection of spans, and sampling aims to record either all spans in a trace, or none of them.

An example distributed trace, crossing service boundaries

Why not (also) sample spans?

First, turning individual spans on and off using sampling would cause operations executing the same sequence of steps to have traces with different shapes. When you’re troubleshooting using tracing, comparing the overall shape of two traces should reveal whether each follwed the same sequence of steps - and sampling individual spans prevents this.

Second, a span on its own isn’t as interesting as a span in the context of its parent and child spans. By sampling individual spans, recorded spans would often be the only recorded span in their containing trace, and at that point a lot of their value would have been lost.

Sampling in a monolith

If we first look at tracing within a monolith or single service, we can see how this plays out. I prefer self-contained code samples over realistic ones, so to illustrate the basic sampling API we’ll run this program:

using Serilog;
using SerilogTracing;
using SerilogTracing.Expressions;

Log.Logger = new LoggerConfiguration()
    // The custom formatter here is provided by SerilogTracing.Expressions.
    .WriteTo.Console(Formatters.CreateConsoleTextFormatter())
    .CreateLogger();

// Set up SerilogTracing with the default options.
using var _ = new ActivityListenerConfiguration()
    .TraceToSharedLogger();

for (var i = 0; i < 10000; ++i)
{
    // The `outer` activity is the root of each trace the program generates.
    using var outer = Log.Logger.StartActivity("Outer {i}", i);

    // The `inner` activity is a child of `outer`.
    using var inner = Log.Logger.StartActivity("Inner {i}", i);
    
    await Task.Delay(100);
}

The program produces this output:

> dotnet run --project example/Sampling/Sampling.csproj
[09:41:44 INF] ├ Inner 0 (102.677 ms)
[09:41:44 INF] └─ Outer 0 (127.013 ms)
[09:41:44 INF] ├ Inner 1 (101.166 ms)
[09:41:44 INF] └─ Outer 1 (101.259 ms)
[09:41:44 INF] ├ Inner 2 (100.795 ms)
[09:41:44 INF] └─ Outer 2 (100.879 ms)
[09:41:44 INF] ├ Inner 3 (101.062 ms)
[09:41:44 INF] └─ Outer 3 (101.169 ms)
^C

Note that when tracing to the console, spans appear in the order that they complete, which is the opposite of what you’d see in a tracing tool. It’s not an ideal presentation, but it’s easier to get multiple example traces into a compact format this way.

The program isn’t currently using sampling: all of the traces are recorded.

Now, let’s add Sample.OneTraceIn(7) to the ActivityListenerConfiguration:

using var _ = new ActivityListenerConfiguration()
    .Sample.OneTraceIn(7)
    .TraceToSharedLogger();

Running again:

> dotnet run --project example/Sampling/Sampling.csproj
[09:42:56 INF] ├ Inner 0 (102.621 ms)
[09:42:56 INF] └─ Outer 0 (125.868 ms)
[09:42:56 INF] ├ Inner 7 (100.233 ms)
[09:42:56 INF] └─ Outer 7 (100.703 ms)
[09:42:57 INF] ├ Inner 14 (101.172 ms)
[09:42:57 INF] └─ Outer 14 (101.462 ms)
^C

Now, every seventh trace is now recorded. The child inner activity is always recorded if its outer parent is — so we see one in seven complete traces.

That’s just about all there is to say about Sample.OneTraceIn(); it’s intentionally the simplest possible strategy, serving as an example to build from, and an entrypoint to exploring the SerilogTracing API.

Sampling across services

Sampling gets more interesting when multiple services are involved. The first service involved in an operation might apply a sampler such as Sample.OneTraceIn(), but when it calls out to other services, how do they know that any spans they generate will be part of a sampled trace?

Propagating sampling flags

Along with the trace and span ids of the calling span, the W3C traceparent HTTP header carries sampling information from client to server in its final two hexadecimal digits:

00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00

In this example from the spec, the trailing -00 means that the parent is not being recorded.

The alternative -01, indicates that the parent has been chosen for recording:

00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

This is a curious system because headers often come from untrusted sources. Should an application or service receiving traceparent respect the caller’s sampling decision?

Public and internet-facing endpoints

If the service is public or internet-facing then the answer is an obvious “no”: a malicious actor could use the -00 sampling flags to turn off recording of their nefarious activities. Or, an external/third-party caller that always records all traces could consume excess storage by propagating its sampling decision with -01.

By default, SerilogTracing’s ASP.NET Core integration assumes that the caller is untrusted, and sampling decisions by external callers are ignored.

SerilogTracing will pick up the trace and parent span ids from callers, however, so you’ll still see that they’re present unless you disable them entirely using IncomingTraceParent.Ignore, which is the cleanest option for public or internet-facing services:

using var _ = new ActivityListenerConfiguration()
    .Instrument.AspNetCoreRequests(opts => 
        // Ignore all tracing information provided by callers - start a new trace for
        // each request, and apply local sampling policies (if any).
        opts.IncomingTraceParent = IncomingTraceParent.Ignore)
    .TraceToSharedLogger();

Internal endpoints

Conversely, if you’re “inside the firewall” and want sampling decisions to propagate between your services (the whole point, really), then you’ll need to trust the sampling decision made in the incoming traceparent:

using var _ = new ActivityListenerConfiguration()
    .Instrument.AspNetCoreRequests(opts =>
        // Accept trace and span ids, and sampling decisions provided by callers,
        // overriding local sampling decisions.
        opts.IncomingTraceParent = IncomingTraceParent.Trust)
    .TraceToSharedLogger();

If you choose to trust the incoming traceparent sampling decision, this will override any decisions made locally by sampling policies like Sample.AllTraces(), for activities with remote parents.

Sampling algorithms

Out of the box, SerilogTracing provides Sample.AllTraces() (the default), Sample.OneTraceIn(), and Sample.Using().

The argument accepted by Sample.Using() is a .NET ActivityListener sampling delegate, which you can learn a bit more about in this brief write-up, and use to construct custom sampling algorithms of your own.

Many more pre-built algorithms are possible, and we’re hoping to collect examples of these to share more widely in the community: please open a ticket if you have one to request or contribute.