An efficient filtering DSL for Serilog

One of the premises of Serilog is that since it models log events as, well, events, it’s possible to do the kinds of interesting things with them that you’d normally associate with CEP and libraries like Rx.

In practice, working with raw Serilog event data isn’t always that much fun. When you associate data with a log event using Serilog:

Log.Information("Hello, {Name}!", name);

… properties like Name are attached to the log event as LogEventPropertyValue objects, where a value may be a ScalarValue as in this case, a StructureValue (key/value pairs), a SequenceValue (array of values), and so-on. This re-encoding of property data is an important part of Serilog’s design, but it’s not a great API to write everyday code against.

Even simple predicate-based fiters, when written by hand, have to deal with a lot of machinery:

Log.Logger = new LoggerConfiguration()
    // ...
    .Filter.ByExcluding(evt =>
    {
        LogEventPropertyValue name;
        return evt.Properties.TryGetValue("Name", out name) &&
            (name as ScalarValue)?.Value as string == "World";
    })
    // ...

That’s a substantial effort just to compare two strings. It’s also not something that can be easily encoded in a configuration file. Frankly, I’m a little bit peturbed that we’ve made it nearly four years without a better way to do this kind of thing :-)

What I’d really like to write is more along the lines of:

    .Filter.ByExcluding("Name = 'World'")

A filtering DSL will work nicely with Serilog’s XML and JSON configuration providers, too:

  <add key="serilog:Filter:ByExcluding.expression"
       value="Name = 'World'" />

Now, it just so happens that Superpower makes it easy to parse little languages like this, and between Seq and my other projects I had enough bits and pieces of related code that gluing together a reasonably complete implementation wasn’t too hard. Definitely a suitable warm-up to get 2017 kicked off after the Christmas break!

The rest of this post introduces Serilog.Filters.Expressions, including a few pointers to interesting parts of the codebase and a quick benchmark.

Serilog.Filters.Expressions

This new package implements Filter.ByExcluding(expression) and Filter.ByIncludingOnly(expression) as extensions to Serilog’s configuration syntax.

You can install it with:

Install-Package Serilog.Filters.Expressions -Pre

The extension methods are in the Serilog namespace, so everything should just work:

Log.Logger = new LoggerConfiguration()
    .Filter.ByExcluding("Name = 'World'")
    .WriteTo.LiterateConsole()
    .CreateLogger();

Filter language

I chose to implement the SQL-style filtering syntax used with Seq because it’s already well-thought-out and reasonably familiar. I’m as lukewarm about SQL as the next developer, but when it comes to data-oriented tasks, SQL really still is king. SQL’s single-quoted strings are also nicer to embed in C# string literals, which I suspect will be a pretty common occurrence.

The example in the project README gives you a pretty good sense of the flavour of it:

@Level = 'Information' and AppId is not null and Items[?] like 'C%'

Logical operations such as and, the string syntax, is not null and like are bread-and-butter SQL.

AppId and Items are properties attached to the log event either in the message template or through enrichers. The @Level property corresponds to LogEvent.Level; all the direct properties of LogEvent are available using the @ prefix.

Finally, Items[?] is a “wildcard comprehension”. The ? means any: Items[?] like 'C%' evaluates to true if any element of the Items collection starts with C. Substituting * instead of ? would test whether all elements of the collection start with C. Thanks to some mind-bending tree transformations these bake down to LINQ-style Any() and All() calls. If Items contains objects, rather than just strings, you can instead write Items[?].Name like 'C%' and so-on.

The table below is a reasonably complete description of what’s supported:

Category	Examples
Literals	`123`, `123.4`, `'Hello'`, `true`, `false`, `null`
Properties	`A`, `A.B`, `@Level`, `@Timestamp`, `@Exception`, `@Message`, `@MessageTemplate`, `@Properties['A-b-c']`
Comparisons	`A = B`, `A <> B`, `A > B`, `A >= B`, `A is null`, `A is not null`
Text	`A like 'H%'`, `A not like 'H%'`, `A like 'Hel_o'`, `Contains(A, 'H')`, `StartsWith(A, 'H')`, `EndsWith(A, 'H')`, `IndexOf(A, 'H')`
Regular expressions	`A = /H.*o/`, `Contains(A, /[lL]/)`, other string functions
Collections	`A[0] = 'Hello'`, `A[?] = 'Hello'` (any), `StartsWith(A[*], 'H')` (all)
Maths	`A + 2`, `A - 2`, `A * 2`, `A % 2`
Logic	`not A`, `A and B`, `A or B`
Grouping	`A * (B + C)`
Other	`Has(A)`, `TypeOf(A)`

The implementation

Let me be the first to say that this code carries some scars as a result of its hasty assembly. The various pieces came from other projects and some corners were cut when making them work nicely with Serilog. In particular, the test cases are a bit light, and some parts are more complex than they’d otherwise need to be. With that out of the way - I hope there are a few examples in there to learn from, and I’m looking forward to suggestions (and PRs) for improvements.

`LoggerFilterConfigurationExtensions`

Starting from the outer API surface, LoggerFilterConfigurationExtensions integrates filtering into the Serilog LoggerConfiguration system.

public static LoggerConfiguration ByExcluding(
    this LoggerFilterConfiguration loggerFilterConfiguration, string expression)
{
    var filter = FilterLanguage.CreateFilter(expression);
    return loggerFilterConfiguration.ByExcluding(e => true.Equals(filter(e)));
}

(I’ve snipped out argument checks and condensed the code where possible for the post.)

You’ll notice that the result of evaluating the compiled filter is an object - not necessarily a Boolean. Comparing with true means we’ll consider everything else - null, 123, 'Hello' and so-on, to be effectively false.

`FilterExpressionTokenParser`

Zooming in, the parser works in two parts. First, the tokenizer breaks the expression into individual tokens. Second, the token parser constructs an abstract syntax tree (AST) out of the expression.

This is all using Superpower, which was built primarily to improve on Sprache’s error reporting for Seq 3.4. There are a few more Superpower examples out there, but probably none as complete as the one implemented here.

An experimental aspect of Superpower’s design allows small, reusable text parsers to be shared between the tokenizer and the full parser. Check out FilterExpressionTextParsers to see some of these.

The AST

The AST is implemented by subclasses of the abstract FilterExpression class, like FilterPropertyExpression (represents a reference to a property of the log event) and FilterCallExpression (a call to a function such as Contains(), or the underlying function behind an operator like =).

You can see the various kinds of nodes in the Serilog.Filters.Expressions.Ast namespace.

Notice that there are AST node types like FilterLambdaExpression that don’t appear in the concrete syntax. These support a number of transformations that are applied to the AST before it’s finally compiled using the System.Linq.Expressions machinery.

Transformations

The filter you write isn’t necessarily the filter that is ultimately executed. I mentioned earlier how wildcard comprehensions get transformed down into Any() and All() calls. There are a few more of these kinds of transformations included in the code.

For example, it’s much more expensive to match a string against a regular expression than it is to compare two numbers. If these appear in an expression like A = /H.*/ and B = 1 then it makes sense to evaluate the B = 1 test first. That’s the job of FilterExpressionCostReordering. This and a few other transformations are based on the abstract FilterExpressionTransformer<T> base class, which provides the scaffolding necessary to make a pass over an AST and convert it into another representation.

One thing I like about this model is that each transformation is a modular, standalone piece of code. The trade-off is that there are implicit dependencies between the transformations - each expects the tree to be in a certain state, produced by its predecessors. Most of the transformations run in the FilterExpressionCompiler.CompileAndExpose() method, where the order of execution is made explicit.

Compilation

The most exciting-sounding part of the process is probably the least surprising. Eventually, we get a FilterExpression AST that corresponds closely to the C# code we’d write in order to evaluate an expression. In the LinqExpressionCompiler class, AST nodes are swapped for corresponding nodes from the System.Linq.Expressions AST, and Compile() turns the result into an executable delegate.

Trying to figure out what code should actually run brings the problem of representation to the fore. When we get the raw data from a LogEvent, it is represented as ScalarValue, StructureValue and so-on. But literals in the filtering expression are strings and numbers. When we compare them, it’s desirable to compare like-against-like - text should be represented either as string or ScalarValue, for example, not a mix of both.

In Serilog.Filtering.Expressions, the values operated on by the generated code obey the following rules:

If the value is scalar, whether a literal or a property value, it’s converted into its underlying type before being operated on
The underlying values are converted into decimal if they’re numeric, string if they’re text, or passed through as-is if the’re a temporal type like DateTime or TimeSpan (though there are few operators that can be applied to these latter temporal values in the current version)
StructureValue, SequenceValue and DictionaryValue are passed around as-is
The absence of a value, or the result of an unrepresentable computation like 1 / 0, is represented by the special Undefined.Value object

Calls to Representation.Represent() are inserted in a few places to make this happen.

These few rules keep the built-in operators straightforward to implement, and avoid most unnecessary conversions. Since we don’t have a schema with type information for the property values, some boxing overhead is unavoidable (though I think there are still places where this could be removed).

Execution

There’s one small remaining detail; internally, values are represented according to the rules in the last section - as values, as StructureValues and so-on. When the value of an expression has been computed however, it’s nice to get that value in a form that is independent of these implementation concerns.

FilterExpressionCompiler wraps the compiled delegate and calls Representation.Expose() to turn things like SequenceValue into regular arrays, StructureValue into dictionaries, and Undefined.Value into null.

Benchmarks

The tool that made the biggest impact on my development workflow in 2016 was undoubtedly BenchmarkDotNet. If you haven’t tried it out - it’s for microbenchmarking what xUnit.net is for unit testing - a drop-in solution that gets out of the way and mosty “just works”. I rarely start a new codebase now without adding a BenchmarkDotNet harness. Performance might only matter in 1% of the code, but in that 1% measurement beats guesswork every time.

The solution has only one benchmark currently, ComparisonBenchmark. This tests how long it takes to compare a log event property (A) with a constant value (3). Here are the interesting bits.

public class ComparisonBenchmark
{
    Func<LogEvent, bool> _trivialFilter, _handwrittenFilter, _expressionFilter;
    readonly LogEvent _event = Some.InformationEvent("{A}", 3);

    public ComparisonBenchmark()
    {
        // Just the delegate invocation overhead
        _trivialFilter = evt => true;

        // `A == 3`, the old way
        _handwrittenFilter = evt =>
        {
            LogEventPropertyValue a;
            if (evt.Properties.TryGetValue("A", out a) &&
                    a is ScalarValue &&
                    ((ScalarValue)a).Value is int)
            {
                return (int)((ScalarValue)a).Value == 3;
            }

            return false;
        };

        // The code we're interested in; the `true.Equals()` overhead is normally added when
        // this is used with Serilog
        var compiled = FilterLanguage.CreateFilter("A = 3");
        _expressionFilter = evt => true.Equals(compiled(evt));

        Assert.True(_trivialFilter(_event) && _handwrittenFilter(_event) &&
                    _expressionFilter(_event));
    }

Just like a unit test fixture, benchmarks in BenchmarkDotNet live in classes that have methods decorated with various attributes. In this case, there’s no per-benchmark setup code to run, so the constructor gets everything ready to go. We want to do as much work as we can outside the benchmark itself, so that ideally the benchmark methods invoke only the precise code we’re interested in.

Here, we’ve created three different predicates - a trival true one that just demonstrates how much overhead the delegate invocation machinery entails, a handwritten test to determine if the log event carries A == 3, and an expression-based one that we’re interested in.

The benchmarks themselves look like this:

    [Benchmark]
    public void TrivialFilter()
    {
        _trivialFilter(_event);
    }  
    
    [Benchmark(Baseline = true)]
    public void HandwrittenFilter()
    {
        _handwrittenFilter(_event);
    } 
            
    [Benchmark]
    public void ExpressionFilter()
    {
        _expressionFilter(_event);
    }

Notice that the handwritten version is annotated with Baseline = true; this instructs BenchmarkDotNet to show other timings relative to it. Since the handwritten code is what we’re replacing, it’s interesting to know how the expression-based filtering performs relative to it.

Dynamically-generated code is practically never as fast as the handwritten version compiled and optimized alongside the application. For tiny chunks of code this is especially true: the few additional method calls, conversions and boxing allocations done by the dynamic version quickly add up when compared with the bare minimum version.

My goal was to get within an order-of-magnitude of the handwritten version; 10× slower is about as much overhead as I’d be prepared to pay vs. writing the filter myself.

Here’s the output:

BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-3720QM CPU 2.60GHz, ProcessorCount=8
Frequency=2533308 ticks, Resolution=394.7408 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1586.0
Type=ComparisonBenchmark  Mode=Throughput

Method	Median	StdDev	Scaled	Scaled-SD
TrivialFilter	1.5038 ns	2.0923 ns	0.07	0.05
HandwrittenFilter	38.1024 ns	0.7464 ns	1.00	0.00
ExpressionFilter	236.4340 ns	9.3174 ns	6.29	0.27

This run was performed on my laptop under battery power, so it’s not as accurate as possible, but certainly validates that we’re within the same order-of-magnitude as the handwritten filter. Win! Although the library has inherited a number of optimizations from the parent codebase, it hasn’t yet been optimised at all in the log-event-filtering role, so I have no doubt that there’s plenty of room left for improvement.

Summing up

While this was a fun little project to get the year kicked off, I think it’s a pretty substantial improvement on the status quo of filtering in Serilog. I hope you’ve enjoyed poking into a few of the gory details!

Have a great 2017 :-)