Aggregate Queries in Seq Part 2: Defining a Syntax

So, before we go any farther we’re going to need to pin down a bit more tightly what form aggregate queries will take. There are options, options, options – but hopefully a lot will fall out of how queries are expressed in Seq today.

The Seq filter box accepts predicates – expressions that evaluate to a Boolean in the context of an event.

Environment == "Production"

To express something like a sum, writing:

sum(ItemsOrdered)

…in the filter box seems reasonable, except when predicates get involved again. Combining the two expressions above into something that says “the number of items ordered in the production environment” is not obvious.

To introduce aggregates the filter syntax is going to have to stretch a bit, so that Seq can tell the difference between a simple predicate and a full-fledged query. Here goes:

select sum(ItemsOrdered) where Environment == "Production"

The plan is to reappropriate select and where from SQL to mark out the clauses. SQL-like queries have the strong advantage being familiar, and loosely align with the rest of the “C#-like” syntax by their analogy to LINQ. Starting queries with the keyword select gives the UI a chance to intelligently determine the type of query being written – staying with just a single input box is an explicit goal.

So what else can a SQL-like syntax offer? Grouping the number of items ordered by the item itself:

select sum(ItemsOrdered)
where Environment == "Production"
group by ItemId

Groupings are bread-and-butter for aggregate queries, so it’s handy that they carry over fairly naturally.

What about from? I don’t think I’m going to go after from at this point. The context of a query in Seq will initially be the events viewed in the UI, filtered down to whatever signals are active. There’s lots of room to extend this down the track a bit, but I think filtering and grouping is enough to bite off for a start.

There’s one last core concept the query syntax needs. Log events are time-dimensioned, and dealing with time requires some up-front attention.

Poking around, time groupings seem to have been grafted onto SQL in a few different ways, in traditional databases, event processing systems and timeseries databases. Approaching this the obvious way by making use of the built-in @Timestamp property attached to Seq events could look like:

select sum(ItemsOrdered)
where Environment == "Production"
group by hour(@Timestamp), ItemId

The awkwardness of this approach isn’t apparent until more exotic requirements show up – grouping by 20 hour blocks, or offsetting queries into another (non-UTC) timezone. I’m also not sure I want to type @Timestamp dozens of times a day.

Instead, I’m exploring the idea of a “time expression” syntax like the one used by InfluxDB, where the size of the interval is specified as a literal like time(10s):

select sum(ItemsOrdered)
where Environment == "Production"
group by time(1h), ItemId

Melding this to the existing expression parser is going to be fun!

Read Part 3: An Opportunistic Parser