Update 2016: Sprache long ago moved to NuGet and GitHub. Links and instructions have been updated.

Download the example as a VS2010 solution.

Our current project includes a small workflow for submission and approval of user account creation requests. This makes a good example to discuss domain-specific languages and Sprache, so I’ll paraphrase some of the requirements here.

The set of user account types is open-ended; currently “employee”, “contractor”, “temporary” and so on. To request a particular kind of account, a user must fill in a questionnaire associated with that account type.

In the data collection and approval parts of the system, the content of the questionnaire is irrelevant so long as the necessary information is recorded and presented to the administrator, who will ultimately approve or decline the request.

The Challenge

Much of the system design is driven by the fact that the set of possible account types (and thus different questionnaires) is open. Creating new questionnaires should be possible without redeploying the application. Furthermore, the content of a particular questionnaire needs to be easily extended and modified, possibly by end-users themselves.

There are many possible ways to represent the questionnaires:

  • Map a questionnaire domain model to relational database tables

  • Create an XML-based questionnaire format

  • Use Windows Workflow Foundation with loose XAML files

  • Read the questionnaires from CSV files or spreadsheets

Each has its pros and cons with regard to ease of implementation, maintainability, user-friendliness and flexibility.

In this article, we’ll examine another appealing option: building a user-friendly mini-language to represent questionnaires.

Questionnaire Definition Language

You may have read some discussion of the differences between internal and external DSLs.

An internal DSL is a specially-constructed API in a general-purpose language like C# that, when used, reads more like a definition of the problem than a program to solve it.
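
As a point of comparison, a very plain “internal” version of a questionnaire could be built by constructing the model classes that appear later in this article directly in C# (a sketch only; a real internal DSL would usually layer a friendlier fluent API over this):

// Readable enough, but it has to be compiled and redeployed with the application.
var questionnaire = new Questionnaire(new[]
{
    new Section("identification", "Personal Details", new[]
    {
        new Question("name", "Full Name"),
        new Question("department", "Department")
    })
});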

An external DSL is a standalone language that must be parsed from source text before a program can work with it. Importantly for this example, an external DSL encourages a minimum of syntactic noise, and its definitions can be loaded and changed without recompiling the application.

The example questionnaire DSL looks like:

identification    "Personal Details"
[
    name          "Full Name"
    department    "Department"
]

employment        "Current Employer"
[
    name          "Your Employer"
    contact       "Contact Number"
    #months       "Total Months Employed"
]

Above is a two-step questionnaire that collects personal and employment details.

  • Each section has an id, a title, and a list of questions.

  • Each question has an id and some prompting text.

  • The declaration of a question id (e.g. #months) can be prefixed with a symbol to indicate what kind of data is being collected – the pound sign/hash in this case denotes a natural number.

The application will read the questionnaire definition associated with a particular kind of account request, and present the steps to the user in a wizard-like interface.

Approaches to Parsing the Questionnaire Definitions

Parsing is the process of taking text in a source language, like the questionnaire above, and converting it into a representation - usually some kind of object model - that a program can work with. For the C# programmer, there are several ways to achieve this.

Hand-built Parsers

The very simplest as well as the most complex parsers are often built by hand. The simple ones because the solution is readily evident to the programmer (e.g. “split the string by finding commas in a loop”) and the most complex because the programmer needs an extreme level of control (e.g. the C# compiler.) Parsing anything in between by hand is usually more trouble than it is worth, unless you’re really into languages and know what you’re doing (that definitely rules me out!)

Regular Expressions

These are a convenient way of matching and extracting patterns from text. .NET includes the built-in System.Text.RegularExpressions.Regex class for working efficiently with regular expressions, so they’re usually the first option to consider when faced with a parsing task. Beyond quite simple grammars, regular expressions rapidly become hard to read and maintain. This is perhaps the biggest of their shortcomings, but there are also many grammars that regular expressions are unable to parse (starting with those that allow nesting.)
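
To get a feel for where that line sits, a single question line from the example language is easy enough to match (a sketch assuming a using directive for System.Text.RegularExpressions), but extending the same approach to bracketed sections and useful error messages is where it gets painful:

// Matching one question line with named capture groups.
var line = "    #months    \"Total Months Employed\"";
var match = Regex.Match(line,
    @"^\s*(?<type>#?)(?<id>[A-Za-z]+)\s+""(?<prompt>[^""]*)""\s*$");

var id = match.Groups["id"].Value;          // "months"
var prompt = match.Groups["prompt"].Value;  // "Total Months Employed"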

Parser Construction Toolkits

Industrial-strength parser construction toolkits and ‘language workbenches’ allow a grammar to be specified in a declarative format. The toolkit includes build-time tools to generate classes in the target programming language (e.g. C#) that can parse the grammar. Using such a toolkit requires an investment in both learning how the toolkit works, and integrating it into a project’s build process. For small parsing tasks this can be overkill, but for anything of significant complexity or requiring high parsing throughput, learning and using such a toolkit is highly recommended.

Parser Combinators

This function-based technique is often used by default in functional languages like Haskell and F#, both of which have high-quality parser combinator libraries. C# also has a fledgling combinator library called Sprache, built by yours truly and used in the rest of the article. Sprache makes it very easy to write and maintain simple parsers, without a steep learning curve or build-time tasks. It fits very nicely with a test-driven development process. The drawbacks are currently speed, efficiency and sometimes the quality of error messages – none of which is a big deal when parsing simple DSLs. [Edit: since this article was written, parsing speed and error handling in Sprache have been significantly improved.]

Getting Started

First, install Sprache from NuGet. This article is organised in such a way that you should be able to follow along with the text by creating and testing parsers in Visual Studio with NUnit or similar.

Grammars are built from the bottom up. First, parsers are created for the low-level syntax elements like identifiers, strings and numbers. We then work upwards, combining these basic units into more complex ones until we have the complete language.

Parsing an Identifier

In our questionnaire definition language, the most-nested significant element is the question:

name    "Full Name"

The basic parts here are an identifier and some quoted text. The first task for our parser will be to parse the identifier, in this case ‘name’.

[Test]
public void AnIdentifierIsASequenceOfCharacters()
{
    var input = "name";
    var id = QuestionnaireGrammar.Identifier.Parse(input);
    Assert.AreEqual("name", id);
}

Parsers in Sprache are declared as static members (fields or properties) on a class representing the grammar. QuestionnaireGrammar.Identifier is of type Parser<string>, i.e. it returns values of type string from the input.

The definition of the parser is:

public static readonly Parser<string> Identifier = Parse.Letter.AtLeastOnce().Text().Token();

This reads fairly well – we’re going to parse a non-empty sequence of letters and return the text representation of those letters. The elements of the parser here are:

Parse.Letter – the Sprache Parse class contains helper methods and properties for common parsing tasks. Letter is a simple parser of type Parser<char>, that reads a letter from the input and returns it as a char. If the next character on the input isn’t a letter, this parser won’t match.

AtLeastOnce() – all parsers created by Sprache support repetition this way. AtLeastOnce() takes a parser for a single element of type T and returns a new parser that will parse a sequence of elements, returning IEnumerable<T>.

Text() - the AtLeastOnce() modifier took our Parser<char> and converted it into a parser of type Parser<IEnumerable<char>>. The Text() helper function takes a parser of IEnumerable<char> and converts it into a Parser<string>, so that the result is more convenient to work with.

Token() - finally, the token modifier accepts then discards leading and trailing whitespace.
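
To see the whitespace handling at work, a test along these lines should also pass:

[Test]
public void AnIdentifierIgnoresSurroundingWhitespace()
{
    // Token() trims the whitespace on either side of the identifier
    var input = "   name   ";
    var id = QuestionnaireGrammar.Identifier.Parse(input);
    Assert.AreEqual("name", id);
}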

Simple?

There are a couple more tests we might be interested in for the Identifier parser.

[Test]
public void AnIdentifierDoesNotIncludeSpace()
{
    var input = "a b";
    var parsed = QuestionnaireGrammar.Identifier.Parse(input);
    Assert.AreEqual("a", parsed);
}

A Sprache parser will parse as much as it can before returning. In this test case the parser will succeed, but won’t consume all of the input. (Later we’ll see how to make assertions about end-of-input.)

[Test]
public void AnIdentifierCannotStartWithQuote()
{
    var input = "\"name";
    Assert.Throws<ParseException>(() => QuestionnaireGrammar.Identifier.Parse(input));
}

The Parse() extension method throws ParseException when the parser doesn’t match. You can alternatively use the non-throwing TryParse().
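
TryParse reports failure through a result object rather than an exception; a quick sketch (the member names WasSuccessful and Message are those found in recent Sprache versions):

var result = QuestionnaireGrammar.Identifier.TryParse("\"name");
if (!result.WasSuccessful)
    Console.WriteLine(result.Message); // describes what was expected and where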

Once we’re happy that identifiers are being parsed correctly we can move on.

Parsing Quoted Text

Quoted text isn’t much harder to parse than an identifier – our basic version doesn’t support escaped characters or anything fancy.

Looking again at the input:

name    "Full Name"

To parse the quoted text, we need to match:

  1. The opening quote

  2. Anything except another quote

  3. The closing quote

The quotes themselves aren’t particularly interesting to the program, so we’ll return only the text between them.

One test for the parser will look like:

[Test]
public void QuotedTextReturnsAValueBetweenQuotes()
{
    var input = "\"this is text\"";
    var content = QuestionnaireGrammar.QuotedText.Parse(input);
    Assert.AreEqual("this is text", content);
}

Jumping straight to the parser:

public static readonly Parser<string> QuotedText =
    (from open in Parse.Char('"')
     from content in Parse.CharExcept('"').Many().Text()
     from close in Parse.Char('"')
     select content).Token();

This opportunistic repurposing of the LINQ query comprehension syntax was first described (AFAIK) by Luke Hoban from the F# team. The from clauses lay out the individual units of the syntax, while the select clause transforms them into the return value of the overall parser.
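
If the query syntax looks too magical, roughly the same parser can be written with Sprache’s explicit combinators instead (a sketch for comparison only; the query form above is the one used in the rest of the article):

// Then sequences two parsers, passing the first result along; Return swaps in the
// value we actually care about once the closing quote has been matched.
public static readonly Parser<string> QuotedTextExplicit =
    Parse.Char('"')
        .Then(open => Parse.CharExcept('"').Many().Text())
        .Then(content => Parse.Char('"').Return(content))
        .Token();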

Parsing the Question

You’ve probably noticed by now that parsers are strongly-typed: a parser for a character returns char, and a parser for text returns string. A parser for a question will return – Question!

public class Question
{
    public Question(string id, string prompt)
    {
        Id = id;
        Prompt = prompt;
    }

    public string Id { get; private set; }

    public string Prompt { get; private set; }
}

This is a great strength of combinator-based parsing when constructing DSLs. Once a semantic model for the problem has been built, the parser can translate input directly into it.

public static readonly Parser<Question> Question =
    from id in Identifier
    from prompt in QuotedText
    select new Question(id, prompt);

A unit test for a basic question now passes:

[Test]
public void AQuestionIsAnIdentifierFollowedByAPrompt()
{
    var input = "name \"Full Name\"";
    var question = QuestionnaireGrammar.Question.Parse(input);
    Assert.AreEqual("name", question.Id);
    Assert.AreEqual("Full Name", question.Prompt);
}

Parsing a Section

Parsing a section is just like parsing a question: first we build the semantic model, then use the existing parsers to translate the source text.

Remember a section looks like:

identification    "Personal Details"
[
    name          "Full Name"
    department    "Department"
]

We can represent this in the object model as:

public class Section
{
    public Section(string id, string title, IEnumerable<Question> questions)
    {
        Id = id;
        Title = title;
        Questions = questions;
    }

    public string Id { get; private set; }

    public string Title { get; private set; }

    public IEnumerable<Question> Questions { get; private set; }
}

Building a parser is as easy as building the object model:

public static readonly Parser<Section> Section =
    from id in Identifier
    from title in QuotedText
    from lbracket in Parse.Char('[').Token()
    from questions in Question.Many()
    from rbracket in Parse.Char(']').Token()
    select new Section(id, title, questions);

To complete the example, we have one more model class to define:

public class Questionnaire
{
    public Questionnaire(IEnumerable<Section> sections)
    {
        Sections = sections;
    }

    public IEnumerable<Section> Sections { get; private set; }
}

Its corresponding parser (this time without the comprehension syntax):

public static Parser<Questionnaire> Questionnaire =
    Section.Many().Select(sections => new Questionnaire(sections)).End();

By affixing .End() to the parser, we require that the full input is parsed (i.e. there’s no trailing garbage.)
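
A quick end-to-end test ties the pieces together (a sketch assuming NUnit and a using directive for System.Linq):

[Test]
public void AQuestionnaireIsASequenceOfSections()
{
    var input = @"
identification ""Personal Details""
[
    name ""Full Name""
]

employment ""Current Employer""
[
    name ""Your Employer""
]";

    var questionnaire = QuestionnaireGrammar.Questionnaire.Parse(input);

    Assert.AreEqual(2, questionnaire.Sections.Count());
    Assert.AreEqual("identification", questionnaire.Sections.First().Id);
}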

That’s all we need for the example without any data type qualifiers.

Supporting Answer Data Types

The finishing touches of our grammar add support for the answer type qualifiers.

#months    "Total Months Employed"

We can use an enumeration for the possible answer types, with Text as the default when no qualifier is given.

public enum AnswerType
{
    Text,
    Natural,
    Number,
    Date,
    YesNo
}

This is a pretty limited set, so by brute force we’ll check for each possible qualifier.

public static Parser<AnswerType> AnswerTypeIndicator =
    Parse.Char('#').Return(AnswerType.Natural)
        .Or(Parse.Char('$').Return(AnswerType.Number))
        .Or(Parse.Char('%').Return(AnswerType.Date))
        .Or(Parse.Char('?').Return(AnswerType.YesNo));

The Question object is modified to accept an AnswerType as a constructor parameter, and a simple modification of the Question parser completes our work.

public static Parser<Question> Question =
    from at in AnswerTypeIndicator.Or(Parse.Return(AnswerType.Text))
    from id in Identifier
    from prompt in QuotedText
    select new Question(id, prompt, at);
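
For completeness, the extended Question might look like the following (the AnswerType property name is my assumption; the downloadable example may differ):

public class Question
{
    public Question(string id, string prompt, AnswerType answerType)
    {
        Id = id;
        Prompt = prompt;
        AnswerType = answerType;
    }

    public string Id { get; private set; }

    public string Prompt { get; private set; }

    // The answer type parsed from the optional leading qualifier (Text by default)
    public AnswerType AnswerType { get; private set; }
}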

Summary

The complete parser is just six rules, totaling about 25 nicely-formatted lines of code.

While robust parsing is a non-trivial task in the real world, I hope this article shows that there are simple, low-friction options that fill some of the gaps between regular expressions and language workbenches.