Pre-filtering to speed up JSON search

Seq’s query processing architecture is based on a quick in-memory store, over a slower, disk-backed, persistent store.

The big win from this architecture is consistently quick search results on real-time data, i.e. there’s no indexing delay, and no performance cliff to fall off when oddball queries (the bread and butter of diagnostic analysis) can’t use a pre-computed index. These are major contributors to Seq’s usability and its gentle learning curve.

Searching outside Seq’s memory store, into historical data from the persistent store, started out very slow, and was really only intended for occasional use. Over time, we’ve been improving its performance so that longer, heavier event histories can be comfortably queried day-to-day on the same hardware. In version 5.0, the new native storage engine and signal indexes achieved gains for many types of query.

Ashley Mannix and I have been working on a new tactic called “pre-filtering” to improve performance of another class of query in Seq 5.1.

Low- and high-cardinality fields

Native storage, and signal indexes, were all about I/O. Loading fewer pages from disk is the cornerstone of performance in practically every storage engine since the beginning of time. We decided to pair indexing with signals - predicate-style filters - because these kinds of indexes can be created and maintained without adding a large additional write workload.

This makes signal indexing a good fit for search over low-cardinality fields like Application, Environment Version and so-on: cardinality refers to the number of unique values that might appear for the field: for Environment you might see events with values of Development, Test, Production and so-on. Creating signals to index and filter on each value works naturally.

High-cardinality fields are the other kind: RequestId, CustomerId, MessageId - correlation identifiers with large numbers of values, sometimes used once and then forgotten. These aren’t good candidates for creation of signals, and underneath, indexing them takes more storage space.

Now, this isn’t to say that we won’t consider indexing high-cardinality fields in the future, but because searches on high-cardinality fields in Seq tend to be bottlenecked on JSON deserialization, we can make reasonable gains by addressing this, first.

How much does JSON deserialization dominate search time?

A lot. Seq’s query engine is written in C#, and feeding it with raw UTF-8 encoded JSON documents means (on top of loading the documents themselves):

Parsing the JSON syntax
Allocating objects for each of the values in the JSON document
Marshalling string data from UTF-8 to .NET’s internal UTF-16 representation
Collecting garbage afterwards

For every second we spend loading and decompressing event data in the storage engine, we spend as much as 30 seconds in JSON processing. Optimizing how we handle JSON is one obvious tactic to get results, faster, and there’s plenty of room for us to make improvements there.

Pre-filtering is a lower-level technique to avoid wasting cycles on unnecessary JSON parsing altogether.

JSON fragments

Why do we parse JSON, anyway? If we’re given a JSON document, and want to find out whether it has a top-level field CustomerId with value c123, we can’t simply search the document for "CustomerId":"c123". For a start, the JSON document might contain white space between : and ", or it might contain a nested object that has that property and value in it. It might even use Unicode character escape sequences \u**** for some of the character values.

What we can be sure of, though, is that matching documents will contain the strings "CustomerId" and "c123", or an escape sequence. This lets us use an “avoidance strategy” to speed up search, by filtering out documents that couldn’t possibly meet the search criteria.

Put another way, if the document contains neither "CustomerId" and "c123", nor \u, we can skip it altogether. Documents that contain the fragments of text we’re looking for can be parsed and the expression properly evaluated to determine whether there’s a genuine match.

Since evaluating regular expressions on the raw UTF-8 JSON is much faster for Seq than marshalling the data to .NET and parsing it into an object model, spending some CPU time on pre-filtering works out to be a very appealing trade-off - as long as enough events are being searched, over which the cost of compiling the regular exprssion can be amortized.

Expressions to `RegexSet`s

Seq searches are rarely as simple as a single key-value comparison. As an example, let’s consider:

@Message like 'Hello%' and (Source = 'Web' or Source = 'API')

Pre-filtering for this query, based on the property values, would require "hello (matched case-insensitively), and either "Web" or "API".

Although we could search for the property name - CustomerId in this case, Seq doesn’t do this because the existence of a property doesn’t usually reduce the search space much, and each fragment to search for costs additional CPU time.

This looks like a task for regular expressions! I wouldn’t normally reach for regular expressions with much joy, except that Rust has a very nice, highly-optimized regular expression implementation (thanks @burntsushi!), with just the right feature set to make this use case efficient.

Here’s the query above, as a list of regular expressions that all must match:

(?i:"hello)
"Web"|"API"

The first expression uses a case-insensitive non-capturing group, so that we don’t have to worry about all of the possible variations Hello, hello, heLLo, etc. Since the text Hello needs to appear at the start of the message, we can check for the first open double-quote " to cut out false positives, for example where the word might appear in a longer tract of text.

The second expression tests for either of the two acceptable values, since or from the query maps nicely to | alternation in regular expressions.

You might recall that we need to accept and post-process documents that contain escape sequences. To simplify the engine we just check for \ and accept any document containing it:

\\|(?i:"hello)
\\|"Web"|"API"

Now, if the document contains \, then all of the expressions will match and we’ll fall through to the slow path of parsing the JSON.

Testing for the existence of each expression seems like it might require multiple (CPU-cache-unfriendly) searches through the document, but Rust’s regex::RegexSet has us covered. We can create a RegexSet containing all of the expressions we need, and perform a single pass behind an is_match() test.

Putting it all together

So how does performance stack up? I have a test data set with around 2.4M/1GB log events spread over 90 days, on my T470p laptop. I’ve turned off the in-memory store, so that all results are computed on the persistent store, which is where the optimization kicks in. (Running from the memory store as they normally would, these all return results in 400-600 ms, so don’t be too horrified by some of the larger numbers below.)

select * from stream where RequestId = '8832706879a90713'

The first query is the first target scenario - correlation id search. I’m expressing it as a SQL select * rather than just a simple filter, since SQL queries return timings, and aren’t complicated by result set paging.

Seq 5.0 returns a result in 29,288 ms. That’s a lot of JSON parsing. Seq 5.1 gives the same result in 3,663 ms, an 8× improvement. Much more usable!

Since the events are spread across 13 separate on disk storage extents, the 5.1 time includes quite a lot of redundant regular expression compilation; with some changes to the storage layout there are more gains to be had here in the future.

Next query:

select count(*)
from stream
where @Message like '%test012.pricing-05835n6X.seqcafe%'
      or @Exception like '%test012.pricing-05835n6X.seqcafe%'

The second query is more representative of what happens when you type free text into the Seq filter box. Behind the scenes, this generates (case-insensitive) like filters to test the message and exception (backtrace) properties.

On the test data set we get a result from Seq 5.0 in 27,540 ms, while 5.1 completes in 2,907 ms, giving an eye-popping 9.5× improvement.

select count(*) from stream

The third query is a count, and I’m running it over the indexed Warnings signal. The signal itself just checks the value of @Level against a couple of well-known level names.

The signal index speeds things up quite a bit, and Seq 5.0 returns the count in 6,523 ms. Version 5.1, using pre-filtering, comes in a fraction slower at 6,793 ms. The slight regression here suggests that the matching events in this data set are probably clustered quite closely - other data sets we’ve tested still see some improvement over signal indexes, so this one is bookmarked for a deeper dig. We’ll be trying to claw this time back if possible :-)

On the whole, pre-filtering turns out to be the predictable, incremental kind of performance improvement that fits nicely into Seq 5.1. It’s surprising how much benefit this change yields, especially considering how low its impact has been on the rest of the codebase. In future versions we’re expecting much bigger gains by pushing more filtering down into the native storage layer, but those changes will be quite a bit more complex.

Low- and high-cardinality fields

How much does JSON deserialization dominate search time?

JSON fragments

Expressions to RegexSets

Putting it all together

Further reading

Expressions to `RegexSet`s