Visualizing the Serilog 4.1 batch retry algorithm

This is a quick post exploring how Serilog 4.1 (currently in preview) schedules retries when logging fails.

Logs tucked away in a file on a remote server aren’t very useful, so sending log events across a network to a centralized collector, log server, or service is very common. But, networks are unreliable and services have outages, so logging systems need to support retries when a network request fails.

Serilog 4.1 updates the algorithm used for scheduling retries, including the addition of BatchingOptions.RetryTimeLimit to control how long a failed request batch will be retried before giving up.

Since the algorithm uses exponential back-off and some internal thresholds to decide when a failed batch should be tried again, you might like to whip up a small test project to visualize this behavior under different conditions, with different options set.

To save you from needing to write this yourself, here it is:

using System.Diagnostics;
using Serilog;
using Serilog.Configuration;
using Serilog.Core;
using Serilog.Events;

var clock = Stopwatch.StartNew();

Log.Logger = new LoggerConfiguration()
    .WriteTo.FallbackChain(
        wt => wt.Sink(
            new NetworkSink(clock),
            new BatchingOptions
            {
                RetryTimeLimit = TimeSpan.FromSeconds(30),
            }),
        wt => wt.Sink(new LocalSink(clock))
    )
    .CreateLogger();

for (var i = 0; i < 1000; ++i)
{
    Log.Information("Event {I}", i);
    await Task.Delay(TimeSpan.FromSeconds(3));
}

The program logs 1000 events at three second intervals. It uses a batched network sink and a local fallback sink, both included below, to simulate some failures and reveal what’s going on behind the scenes.

The network sink accepts the first batch, and then fails for the next four:

class NetworkSink(Stopwatch clock) : IBatchedLogEventSink
{
    int _i;

    public async Task EmitBatchAsync(IReadOnlyCollection<LogEvent> batch)
    {
        var succeed = _i is 0 or > 4; 
        var symbol = succeed ? "\ud83d\udfe2" : "\ud83d\uded1";
        Console.WriteLine($"[{clock.Elapsed} Network] {symbol} Batch {_i} ({batch.Count} event(s))");

        ++_i;

        await Task.Delay(TimeSpan.FromMilliseconds(500));

        if (!succeed)
            throw new InvalidOperationException();
    }
}

Whether the batches succeed or fail, it takes 500 ms to make an attempt.

The local sink just prints the events it receives to the terminal:

class LocalSink(Stopwatch clock) : ILogEventSink
{
    public void Emit(LogEvent logEvent)
    {
        Console.WriteLine($"[{clock.Elapsed}   Local] \ud83d\udfe2 {logEvent.RenderMessage()}");
    }
}

Here’s the program’s output:

[00:00:00.0241652 Network] 🟢 Batch 0 (1 event(s))
[00:00:04.5297888 Network] 🛑 Batch 1 (1 event(s))
[00:00:07.0362428 Network] 🛑 Batch 2 (2 event(s))
[00:00:17.5382555 Network] 🛑 Batch 3 (5 event(s))
[00:00:18.0405741   Local] 🟢 Event 1
[00:00:18.0451773   Local] 🟢 Event 2
[00:00:18.0452174   Local] 🟢 Event 3
[00:00:18.0452246   Local] 🟢 Event 4
[00:00:18.0452297   Local] 🟢 Event 5
[00:00:38.0465922 Network] 🛑 Batch 4 (7 event(s))
[00:00:38.5476455   Local] 🟢 Event 6
[00:00:38.5477721   Local] 🟢 Event 7
[00:00:38.5478050   Local] 🟢 Event 8
[00:00:38.5478325   Local] 🟢 Event 9
[00:00:38.5478585   Local] 🟢 Event 10
[00:00:38.5478879   Local] 🟢 Event 11
[00:00:38.5479167   Local] 🟢 Event 12
[00:01:18.5473823 Network] 🟢 Batch 5 (14 event(s))
[00:01:21.0489512 Network] 🟢 Batch 6 (1 event(s))
[00:01:25.5509366 Network] 🟢 Batch 7 (1 event(s))
[00:01:28.0519726 Network] 🟢 Batch 8 (1 event(s))

It’s much more interesting to watch this in real-time: dotnet new console, dotnet add package serilog --prerelease, and pasting everything above into Program.cs will get you running, if you’re curious!

Breaking down the output:

[00:00:00.0241652 Network] 🟢 Batch 0 (1 event(s))

The first batch succeeds as expected.

[00:00:04.5297888 Network] 🛑 Batch 1 (1 event(s))
[00:00:07.0362428 Network] 🛑 Batch 2 (2 event(s))
[00:00:17.5382555 Network] 🛑 Batch 3 (5 event(s))

The next three batches fail. We start with just one event, but by the time we’re trying the third failed batch, four more events have accumulated.

The spacing between these is interesting:

Batch 1 comes ~3 seconds after its predecessor,
Batch 2 comes ~3 seconds later,
Batch 3 comes ~10 seconds after that.

The retry algorithm makes a couple of attempts at fairly short intervals, and then begins to exponentially back off. Exponential back-off helps conserve network bandwidth, and if the receiver is overloaded or restarting, gives it some breathing space to get back up and running. The gap between Batch 3 and Batch 4 will be even greater, as we’ll see in a moment.

[00:00:18.0405741   Local] 🟢 Event 1
[00:00:18.0451773   Local] 🟢 Event 2
[00:00:18.0452174   Local] 🟢 Event 3
[00:00:18.0452246   Local] 🟢 Event 4
[00:00:18.0452297   Local] 🟢 Event 5

Next, the five events that were queued up in Batch 3 are written to the local, fallback sink. Why does the system give up at roughly 17 seconds from the first failure, and not at 30 seconds, as we configured using RetryTimeLimit?

The answer is that because of exponential back-off, the next batch isn’t scheduled until around 37 seconds after the first failure, and this would exceed the time limit we’ve set, so the algorithm immediately triggers the failure handler and the batch is dropped.

Here’s that next retry attempt:

[00:00:38.0465922 Network] 🛑 Batch 4 (7 event(s))

The system has collected seven more events, the batch fails, and the events are immediately passed to the fallback sink:

[00:00:38.5476455   Local] 🟢 Event 6
[00:00:38.5477721   Local] 🟢 Event 7
[00:00:38.5478050   Local] 🟢 Event 8
[00:00:38.5478325   Local] 🟢 Event 9
[00:00:38.5478585   Local] 🟢 Event 10
[00:00:38.5478879   Local] 🟢 Event 11
[00:00:38.5479167   Local] 🟢 Event 12

Why aren’t these seven events retried in additional batches? After the retry delay has been reached, the algorithm switches modes, and all subsequent failing batches are immediately passed through to the fallback sink. This starts reducing the size of the internal buffer, while continuing to probe the remote service at reasonable intervals.

In this particular simulation, we’re now back online, so the next batch attempt succeeds and all subsequent ones do, too:

[00:01:18.5473823 Network] 🟢 Batch 5 (14 event(s))
[00:01:21.0489512 Network] 🟢 Batch 6 (1 event(s))
[00:01:25.5509366 Network] 🟢 Batch 7 (1 event(s))
[00:01:28.0519726 Network] 🟢 Batch 8 (1 event(s))

If we’d kept the network sink offline for longer, and simulated a higher logging rate or lower maximum batch size, we’d see the final failure state of the algorithm, which is to both drop batches and drain the entire queue at each subsequent failure. Ultimately, Serilog’s first priority is to avoid destabilizing the host application, and it’s not a good idea to keep on buffering events forever.

serilog/serilog is the place to find BatchingSink and FallbackAwareBatchFailure, which implement the above behavior. Bug reports and suggestions are most welcome over there, or here below.

I hope this post turns out to be useful when you need to better understand or tune batch retries in the future. We’ll have Serilog 4.1 out and ready for general production use soon. 😎