.NET's ActivityListener sampling API
Distributed tracing can generate a lot of data, and sampling is the most established method to keep data volumes manageable. In .NET, the System.Diagnostics.ActivityListener class exposes two properties to control sampling: Sample and SampleUsingParentId.
How do these work?
If you just switched browser tabs to check the documentation, I know why you’re back. There’s barely any useful information to be found about these, either in the .NET documentation or elsewhere online. Even the original API design proposal is a dead end. The canonical ActivityListener example, copied everywhere, includes a Sample implementation that’s something like this:
var listener = new ActivityListener();
listener.Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
    ActivitySamplingResult.AllData;
// ...
You see, you need to specify a Sample function when creating an activity listener. The default sampling decision is ActivitySamplingResult.None, and if all registered ActivityListeners return this value, no activities will be created at all.
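To make that default concrete, here is a minimal sketch of what I'd expect to see, assuming .NET 6 or later with top-level statements. The source name "Demo.Source" and the variable names are made up for illustration.

```csharp
using System;
using System.Diagnostics;

// "Demo.Source" is an arbitrary name chosen for this sketch.
using var source = new ActivitySource("Demo.Source");

// No listener registered yet: StartActivity has nobody to ask, so nothing is created.
var withoutListener = source.StartActivity("no-listener");

// A listener that subscribes to the source but leaves Sample unset.
// Its contribution to the sampling decision stays at None.
using var silent = new ActivityListener { ShouldListenTo = _ => true };
ActivitySource.AddActivityListener(silent);
var withoutSample = source.StartActivity("no-sample");

// A second listener that asks for data. One listener answering with something
// other than None is enough for the activity to be created.
using var sampling = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllData
};
ActivitySource.AddActivityListener(sampling);
var withSample = source.StartActivity("sampled");

Console.WriteLine($"no listener:      created = {withoutListener is not null}");
Console.WriteLine($"listener, no Sample: created = {withoutSample is not null}");
Console.WriteLine($"listener + Sample:   created = {withSample is not null}");

withSample?.Stop();
```

Only the last call should produce an Activity; the first two return null.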
If you want to use the sampling function to do something more sophisticated than simply capture all traces, it’s assumed you’ll plug in the OpenTelemetry SDK and use its sampling APIs to achieve this, and there really isn’t any guidance out there otherwise. Depending on your circumstances, the OpenTelemetry SDK might be the right tool for the job, but it’s still deeply unsatisfying to rely on a core .NET diagnostics API that’s practically undocumented.
This year I’ve spent some time bridging System.Diagnostics.Activity and Serilog, and in the process had to dig deeper into how ActivityListener sampling works. Here are my conclusions, wrapped up in a tiny but non-trivial sampler. I’m fully aware that some of my conclusions and assumptions might be wrong; if you’re kind enough to send corrections I’ll make sure this article is updated.
IntervalSampler
The sampler presented here is called IntervalSampler. Its source code lives in a SerilogTracing example project on GitHub.
static class IntervalSampler
{
    public static SampleActivity<ActivityContext> Create(ulong interval)
    {
        ArgumentOutOfRangeException.ThrowIfZero(interval);
        // Start one short of the interval so that the first root trace is sampled.
        var next = interval - 1;
        return (ref ActivityCreationOptions<ActivityContext> options) =>
        {
            // Child activities follow the decision already made for their parent.
            if (options.Parent != default)
            {
                return (options.Parent.TraceFlags & ActivityTraceFlags.Recorded) ==
                       ActivityTraceFlags.Recorded ?
                    ActivitySamplingResult.AllDataAndRecorded :
                    options.Parent.IsRemote ?
                        ActivitySamplingResult.PropagationData :
                        ActivitySamplingResult.None;
            }
            // Root activities: record one in every `interval` traces.
            var n = Interlocked.Increment(ref next) % interval;
            return n == 0
                ? ActivitySamplingResult.AllDataAndRecorded
                : ActivitySamplingResult.PropagationData;
        };
    }
}
IntervalSampler aims to collect one in every N possible traces (the “interval”), selected using modulo arithmetic. A more robust sampler might introduce some randomness into this process to avoid skewing the sample when an application produces the same types of traces in a very regular sequence, but those kinds of details would obscure the parts of the sampler that are important for our current purposes.
The sampler creates a sampling function that is wired up like so:
var listener = new ActivityListener();
listener.Sample = IntervalSampler.Create(7);
// ...
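For a fuller picture of the wiring, here is a sketch that registers the listener against an ActivitySource and counts how many root activities end up recorded. It assumes .NET 6 or later, uses a condensed copy of the sampler written as a local function (CreateIntervalSampler, a name invented here) so that it compiles on its own, and uses the made-up source name "Demo.Source".

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Condensed copy of the article's sampler, as a local function for self-containment.
static SampleActivity<ActivityContext> CreateIntervalSampler(ulong interval)
{
    if (interval == 0) throw new ArgumentOutOfRangeException(nameof(interval));
    var next = interval - 1;
    return (ref ActivityCreationOptions<ActivityContext> options) =>
    {
        if (options.Parent != default)
        {
            // Follow the parent's decision.
            if ((options.Parent.TraceFlags & ActivityTraceFlags.Recorded) != 0)
                return ActivitySamplingResult.AllDataAndRecorded;
            return options.Parent.IsRemote
                ? ActivitySamplingResult.PropagationData
                : ActivitySamplingResult.None;
        }
        // Root activity: record one in every `interval`.
        return Interlocked.Increment(ref next) % interval == 0
            ? ActivitySamplingResult.AllDataAndRecorded
            : ActivitySamplingResult.PropagationData;
    };
}

using var source = new ActivitySource("Demo.Source");
using var listener = new ActivityListener
{
    // Only listen to the source used in this sketch.
    ShouldListenTo = s => s.Name == "Demo.Source",
    Sample = CreateIntervalSampler(7)
};
ActivitySource.AddActivityListener(listener);

var createdRoots = 0;
var recordedRoots = 0;
for (var i = 0; i < 21; i++)
{
    // Disposing the activity stops it, so the next iteration starts a fresh root.
    using var root = source.StartActivity("root");
    if (root is not null) createdRoots++;
    if (root?.Recorded == true) recordedRoots++;
}

Console.WriteLine($"{recordedRoots} of {createdRoots} root activities were recorded");
```

With an interval of 7, the first root and every seventh one after it are recorded, so 21 roots should yield 3 recorded traces. The other 18 are still created, but only to carry propagation data.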
Sample vs SampleUsingParentId
The first thing you’ll encounter when setting up a sampler is the apparent duplication of the sampling function between ActivityListener.Sample, which describes the parent of the sampled activity using ActivityContext, and ActivityListener.SampleUsingParentId, which describes the parent using string.
public SampleActivity<string>? SampleUsingParentId { get; set; }
public SampleActivity<ActivityContext>? Sample { get; set; }
Both APIs were added in .NET 5, so one isn’t an obsolete alternative to the other. When should each be used?
It turns out that SampleUsingParentId supports both W3C and Microsoft’s legacy “hierarchical” tracing schemes. If a listener has both SampleUsingParentId and Sample configured, then SampleUsingParentId will be used. Otherwise, if the activity is using the W3C tracing scheme, Sample will be used.
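One way to check which delegate the runtime actually calls is to set both and log the invocations. The sketch below assumes .NET 6 or later; the source name and the traceparent-style parent id are made up. As far as I can tell, SampleUsingParentId only comes into play when the activity is started with a string parent id, and Sample handles the remaining cases, but treat that as my reading of the behavior rather than documented fact.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Records which sampling delegate was invoked, in order.
var calls = new List<string>();

using var source = new ActivitySource("Demo.Source");
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
    {
        calls.Add("Sample");
        return ActivitySamplingResult.AllDataAndRecorded;
    },
    SampleUsingParentId = (ref ActivityCreationOptions<string> _) =>
    {
        calls.Add("SampleUsingParentId");
        return ActivitySamplingResult.AllDataAndRecorded;
    }
};
ActivitySource.AddActivityListener(listener);

// A W3C-format parent id, invented for this sketch.
var parentId = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";

// Started with a string parent id.
source.StartActivity("string-parent", ActivityKind.Internal, parentId)?.Stop();

// Started as a root, with no explicit parent.
source.StartActivity("no-parent")?.Stop();

Console.WriteLine(string.Join(", ", calls));
```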
So this suggests SampleUsingParentId is the best, most general thing to implement? No, not really. Non-W3C tracing is on its way to extinction, and within SampleUsingParentId you can’t directly access the modern, fundamental properties describing the parent activity, such as its trace id, span id, or trace flags.
IntervalSampler supports the Sample delegate signature:
return (ref ActivityCreationOptions<ActivityContext> options) =>
{
    // ...
};
TL;DR: unless you’re writing code that has to work in a legacy tracing scheme, Sample is the way to go, and you can safely ignore SampleUsingParentId.
Sampling traces, vs sampling activities
The next thing to confront is the subtle difference between the purpose of the Sample API, which is to determine whether or not to create an Activity, and the reason that you’re interested in it, which is to determine whether the trace to which the Activity belongs should be recorded.
An Activity is just one single span within a hierarchical trace. Sampling generally aims to either create all of the spans in a trace, or none of them. Once a decision has been made for the Activity corresponding to the root span in a trace, then all of its child activities should be included in the sample, too.
That’s what the first condition in our sampling delegate is concerned with:
if (options.Parent != default)
{
    return (options.Parent.TraceFlags & ActivityTraceFlags.Recorded) ==
           ActivityTraceFlags.Recorded ?
        ActivitySamplingResult.AllDataAndRecorded :
        options.Parent.IsRemote ?
            ActivitySamplingResult.PropagationData :
            ActivitySamplingResult.None;
}
If the activity we’re being asked to make a decision about would become the child of an existing activity, then we use the sampling decision already made for that activity.
If the parent activity is recorded (included in the sample), then the child is too, and ActivitySamplingResult.AllDataAndRecorded is the correct result.
Take care: the very similarly-named ActivitySamplingResult.AllData causes an Activity to be created, but it doesn’t mark the trace as being recorded. If you return ActivitySamplingResult.AllData from your sampler, activities likely won’t show up in your tracing system, and the sampling decision won’t be propagated downstream to other services and systems you call.
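The difference is easy to observe on a root activity. This sketch (assuming .NET 6 or later; the source name and the Observe helper are invented for illustration) returns each sampling result in turn and reports whether an Activity was created, whether IsAllDataRequested is set, and whether the Recorded flag is set.

```csharp
using System;
using System.Diagnostics;

// The result the listener will return next; changed between calls below.
var result = ActivitySamplingResult.None;

using var source = new ActivitySource("Demo.Source");
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => result
};
ActivitySource.AddActivityListener(listener);

// Starts a root activity under the given sampling result and reports how it was marked.
(bool Created, bool AllData, bool Recorded) Observe(ActivitySamplingResult r)
{
    result = r;
    using var activity = source.StartActivity("root");
    return (activity is not null,
            activity?.IsAllDataRequested ?? false,
            activity?.Recorded ?? false);
}

var none = Observe(ActivitySamplingResult.None);
var propagation = Observe(ActivitySamplingResult.PropagationData);
var allData = Observe(ActivitySamplingResult.AllData);
var recorded = Observe(ActivitySamplingResult.AllDataAndRecorded);

Console.WriteLine($"None:               {none}");
Console.WriteLine($"PropagationData:    {propagation}");
Console.WriteLine($"AllData:            {allData}");
Console.WriteLine($"AllDataAndRecorded: {recorded}");
```

Only AllDataAndRecorded should set the Recorded flag. AllData creates a fully populated Activity whose trace flags still say "not sampled".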
In the case that the parent isn’t included in the sample, we return ActivitySamplingResult.PropagationData to ensure a local activity is still created when the parent is remote, and otherwise return ActivitySamplingResult.None to save allocation of a new Activity instance.
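To see the three outcomes of the parent check side by side, the following sketch starts activities under hand-built parent contexts. It assumes .NET 6 or later; the source name and the Parent and Describe helpers are invented for this example, and the root branch is simplified because roots aren't the focus here.

```csharp
using System;
using System.Diagnostics;

using var source = new ActivitySource("Demo.Source");
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
    {
        // Roots are not the subject of this sketch; just keep them alive.
        if (options.Parent == default)
            return ActivitySamplingResult.PropagationData;
        // Same parent-based logic as the article's sampler.
        if ((options.Parent.TraceFlags & ActivityTraceFlags.Recorded) != 0)
            return ActivitySamplingResult.AllDataAndRecorded;
        return options.Parent.IsRemote
            ? ActivitySamplingResult.PropagationData
            : ActivitySamplingResult.None;
    }
};
ActivitySource.AddActivityListener(listener);

// Builds a parent context with random ids and the given flags.
static ActivityContext Parent(ActivityTraceFlags flags, bool isRemote) =>
    new(ActivityTraceId.CreateRandom(), ActivitySpanId.CreateRandom(),
        flags, traceState: null, isRemote: isRemote);

// Starts a child under the given parent and describes what came back.
string Describe(ActivityContext parent)
{
    using var child = source.StartActivity("child", ActivityKind.Server, parent);
    return child is null ? "not created"
        : child.Recorded ? "recorded"
        : "created, not recorded";
}

var recordedRemote = Describe(Parent(ActivityTraceFlags.Recorded, isRemote: true));
var unrecordedRemote = Describe(Parent(ActivityTraceFlags.None, isRemote: true));
var unrecordedLocal = Describe(Parent(ActivityTraceFlags.None, isRemote: false));

Console.WriteLine($"recorded remote parent:   {recordedRemote}");
Console.WriteLine($"unrecorded remote parent: {unrecordedRemote}");
Console.WriteLine($"unrecorded local parent:  {unrecordedLocal}");
```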
At the root of the trace
The next, and final, part of IntervalSampler is concerned with root activities. These don’t have a parent, so when we make a sampling decision for them, we’re really making a decision about the whole trace: this Activity, and its (potential) future children.
var n = Interlocked.Increment(ref next) % interval;
return n == 0
    ? ActivitySamplingResult.AllDataAndRecorded
    : ActivitySamplingResult.PropagationData;
That’s why, when an activity isn’t included in the sample, we return ActivitySamplingResult.PropagationData instead of ActivitySamplingResult.None. If we returned ActivitySamplingResult.None, no activity would be created, and so later on we’d have no way to remember our decision when looking at more deeply-nested activities. The ActivitySamplingResult.PropagationData option does cause creation of an activity, but it’ll be marked in such a way that only minimal processing is performed on it, and it will ultimately be discarded.
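The "no way to remember" problem can be made visible by counting how often the sampler is asked for a root decision within what should be a single trace. This is a sketch under the same assumptions as before (.NET 6 or later, made-up source name, and a helper called RootDecisionsForOneTrace that exists only in this example).

```csharp
using System;
using System.Diagnostics;

// What the sampler returns for roots that are not sampled; varied below.
var unsampledRootResult = ActivitySamplingResult.None;
// How many times the sampler saw an activity with no parent.
var rootDecisions = 0;

using var source = new ActivitySource("Demo.Source");
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
    {
        if (options.Parent != default)
        {
            // Follow the parent's decision.
            return (options.Parent.TraceFlags & ActivityTraceFlags.Recorded) != 0
                ? ActivitySamplingResult.AllDataAndRecorded
                : ActivitySamplingResult.None;
        }
        rootDecisions++;
        return unsampledRootResult;
    }
};
ActivitySource.AddActivityListener(listener);

// Starts a root and then a nested activity, and reports how many root decisions were made.
int RootDecisionsForOneTrace(ActivitySamplingResult resultForUnsampledRoot)
{
    unsampledRootResult = resultForUnsampledRoot;
    rootDecisions = 0;
    using var root = source.StartActivity("root");
    using var child = source.StartActivity("child");
    return rootDecisions;
}

var withNone = RootDecisionsForOneTrace(ActivitySamplingResult.None);
var withPropagation = RootDecisionsForOneTrace(ActivitySamplingResult.PropagationData);

Console.WriteLine($"root decisions when roots return None:            {withNone}");
Console.WriteLine($"root decisions when roots return PropagationData: {withPropagation}");
```

When the root returns None there is no current Activity, so the nested call looks like a brand new root and the sampler gets a second chance to say yes, which could produce a trace that starts halfway down. With PropagationData the nested call sees a parent and simply follows its decision.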
So there you have it
Hopefully the information here helps you to skip some of the digging I’ve had to do, and sheds some light on what ActivityListener.Sample is all about. Corrections and errata welcome - and if you spot other examples or documentation surrounding ActivityListener.Sample that I’ve missed, I’d love to know about those, too.
Happy tracing! 👋
2024-10-05: added the IsRemote check when sampling by parent, to ensure an activity is always created for propagation purposes.