Building a JSON validator with Sylver - Part 1/3: Writing a JSON parser in 49 lines of code

Sylver is a language-agnostic platform for building custom source code analyzers (think ESLint, but for every language). This might be a lot to unpack, so let us explore this tool by solving a real-world problem: our application's configuration is stored in complex JSON documents, and we'd like to build a tool to automatically validate these documents against our business rules.

In this series of tutorials, we'll go from having zero knowledge of Sylver or static analysis to building a fully-fledged linter for our configuration files. We will use JSON as an example, but the tools and techniques presented apply to many data formats and even complete programming languages! Also, note that while we will be building everything from scratch using Sylver's domain-specific languages (DSL), a catalog of built-in specifications for the most common languages will be included in future releases of the tool.

  • In part 1, we will discover Sylver's meta language, a DSL used to describe the shape of programming languages and data formats. After completing this tutorial, we'll have a complete spec for the JSON language, allowing us to turn JSON documents into Sylver parse trees.

  • In part 2, we will load the parse trees into Sylver's query engine, and we will find nodes that violate our business rules using an SQL-like query language.

  • In part 3, we will learn how to turn our language spec and queries into a set of linting rules so that we can run them conveniently and share them.

Installation

Sylver is distributed as a single static binary. Installing it is as simple as:

  1. Go to https://sylver.dev to download the binary for your platform
  2. Unpack the downloaded archive
  3. Move the sylver binary to a location in your $PATH

Prelude

Let us create a blank workspace for this tutorial!
We'll start by creating a new folder and a json.syl file to write the Sylver specification for the JSON language.

mkdir sylver_getting_started
cd sylver_getting_started
touch json.syl

We will also store our test JSON file in config.json.

{
    "variables": [
        {
            "name": "country",
            "description": "Customer's country of residence",
            "values": ["us", "fr", "it"]
        },
        {
            "name": "age",
            "description": "Customer's age",
            "type": "number"
        }
    ]
}

This file specifies the variables used to describe customer profiles in a fictional customer database. Variables are assigned a type or a list of potential values.

Types definition

Sylver parses raw text into typed structures called parse trees. The first step in defining a language spec is to define a set of types for our tree nodes. We'll add the following node type declarations to our json.syl spec:

node JsonNode { }

node Null: JsonNode { }

node Bool: JsonNode { }

node Number: JsonNode { }

node String: JsonNode { }

node Array: JsonNode { 
    elems: List<JsonNode> 
}

node Object: JsonNode {
    members: List<Member>
}

node Member: JsonNode {
    key: String,
    value: JsonNode
}

These declarations resemble object type declarations in many mainstream languages. The : syntax denotes inheritance.
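To build intuition, the same hierarchy can be sketched as plain classes. The Python sketch below simply mirrors the spec's names and fields; it is an illustration of the tree shape, not Sylver's actual runtime representation:

```python
from dataclasses import dataclass, field
from typing import List

# Base type for every node in the parse tree (JsonNode in the spec).
class JsonNode:
    pass

@dataclass
class Null(JsonNode):
    pass

@dataclass
class Bool(JsonNode):
    value: bool = False

@dataclass
class Number(JsonNode):
    value: float = 0.0

@dataclass
class String(JsonNode):
    value: str = ""

@dataclass
class Array(JsonNode):
    elems: List[JsonNode] = field(default_factory=list)

@dataclass
class Member(JsonNode):
    key: String = None
    value: JsonNode = None

@dataclass
class Object(JsonNode):
    members: List[Member] = field(default_factory=list)

# The tree for a tiny document: {"age": 42}
doc = Object(members=[Member(key=String("age"), value=Number(42))])
```

Because every concrete type inherits from JsonNode, an Array can hold any mix of values, and a Member's value can itself be an Object or Array, giving us arbitrarily nested trees.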

Now that we have a set of types to describe JSON documents, we need to specify how to build a parse tree from a sequence of characters. This process is done in two steps:

  1. Lexical analysis: in this step, individual characters that form an indivisible entity (such as the digits of a number or the characters of a string) are grouped together into tokens. Some tokens are only one character wide (for example, the braces, brackets, colons, and commas in JSON).
  2. Syntactic analysis, in which tree nodes are built from the stream of tokens.

Lexical analysis

Tokens are described using declarations of the form term NAME = <term_content> where <term_content> is either a literal surrounded by single-quotes (') or a regex between backticks (` ). The regexes use a syntax similar to Perl-style regular expressions.
Characters in the input string that match one of the terminal literals or regexes will be grouped into a token of the given name.

term COMMA = ','
term COLON = ':'
term L_BRACE = '{'
term R_BRACE = '}'
term L_BRACKET = '['
term R_BRACKET = ']'
term NULL = 'null'

term BOOL_LIT = `true|false`
term NUMBER_LIT = `\-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?`
term STRING_LIT = `"([^"\\]|(\\[\\/bnfrt"])|(\\u[a-fA-F0-9]{4}))*"`


ignore term WHITESPACE = `\s`

The term rules for numbers and strings are slightly more involved, to account for some of JSON's peculiarities (escape sequences in strings, and the optional sign, fraction, and exponent in numbers).

Note that the WHITESPACE term declaration (matching a single whitespace character) is prefixed with the ignore keyword. This means that WHITESPACE tokens do not affect the structure of the document and can be ignored during syntactic analysis.
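As a sanity check, the terminal definitions above can be exercised with Python's re module. This is only an illustration of the lexing step, not how Sylver implements it; the patterns mirror the term declarations, and ignored terms are skipped just as WHITESPACE is:

```python
import re

# (name, pattern, ignored) triples mirroring the term declarations above.
TERMS = [
    ("WHITESPACE", r"\s", True),
    ("NULL", r"null", False),
    ("BOOL_LIT", r"true|false", False),
    ("NUMBER_LIT", r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?", False),
    ("STRING_LIT", r'"([^"\\]|\\[\\/bnfrt"]|\\u[a-fA-F0-9]{4})*"', False),
    ("COMMA", r",", False),
    ("COLON", r":", False),
    ("L_BRACE", r"\{", False),
    ("R_BRACE", r"\}", False),
    ("L_BRACKET", r"\[", False),
    ("R_BRACKET", r"\]", False),
]

# One alternation with a named group per terminal.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat, _ in TERMS))
IGNORED = {name for name, _, ignored in TERMS if ignored}

def tokenize(text):
    """Group input characters into (name, lexeme) tokens, skipping ignored terms."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if m.lastgroup not in IGNORED:
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# e.g. tokenize('{"a": 1}') yields L_BRACE, STRING_LIT, COLON, NUMBER_LIT, R_BRACE
```

Note that the WHITESPACE tokens are matched but never emitted, which is exactly what the ignore keyword expresses in the spec.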

Syntactic analysis

In this last part of the language spec, we write rules describing how tree nodes are built by matching tokens from the input stream.

For example, a rule specifying: "if the current token is a STRING_LIT, build a String node" can be written as follows:

rule string = String { STRING_LIT }

Rules can refer to other rules to construct nested nodes. For example, here is a rule specifying that a Member node (corresponding to an object member in JSON) can be built by building a node using the string rule and then matching a COLON token followed by any valid JSON value:

rule member = Member { key@string COLON value@main }

Nested nodes are associated with a field using the @ syntax. The main rule is the entry point for the parser, so in our case, it designates any valid JSON value.

A valid JSON document can be made of a 'null' literal, a number, a boolean value, a string, an array of JSON values, or a JSON object, which is reflected in the main rule:

rule main =
    Null { NULL }
  | Number { NUMBER_LIT }
  | Bool { BOOL_LIT }
  | string
  | Array { L_BRACKET elems@sepBy(COMMA, main) R_BRACKET }
  | Object { L_BRACE members@sepBy(COMMA, member) R_BRACE }

The sepBy(TOKEN, rule_name) syntax parses a list of nodes using the given rule, matching a TOKEN token between every two consecutive nodes.
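These grammar rules translate almost mechanically into a recursive-descent parser. The Python sketch below illustrates the strategy (it is not Sylver's internals): it consumes a list of (name, text) tokens, as produced by the lexing step, and each rule of the spec becomes a method, with sepBy becoming a helper that parses separator-delimited lists:

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens  # list of (name, text) pairs
        self.pos = 0

    def peek(self):
        """Name of the current token, or None at end of input."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, name):
        """Consume and return the current token, which must have the given name."""
        if self.peek() != name:
            raise SyntaxError(f"expected {name}, got {self.peek()}")
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def sep_by(self, sep, rule):
        """sepBy(sep, rule): parse `rule (sep rule)*`, allowing an empty list."""
        items = []
        if self.peek() in {"R_BRACKET", "R_BRACE"}:  # empty array/object
            return items
        items.append(rule())
        while self.peek() == sep:
            self.expect(sep)
            items.append(rule())
        return items

    def string(self):  # rule string = String { STRING_LIT }
        return ("String", self.expect("STRING_LIT")[1])

    def member(self):  # rule member = Member { key@string COLON value@main }
        key = self.string()
        self.expect("COLON")
        return ("Member", key, self.main())

    def main(self):  # rule main = Null | Number | Bool | string | Array | Object
        kind = self.peek()
        if kind == "NULL":
            self.expect("NULL")
            return ("Null",)
        if kind == "NUMBER_LIT":
            return ("Number", self.expect("NUMBER_LIT")[1])
        if kind == "BOOL_LIT":
            return ("Bool", self.expect("BOOL_LIT")[1])
        if kind == "STRING_LIT":
            return self.string()
        if kind == "L_BRACKET":
            self.expect("L_BRACKET")
            elems = self.sep_by("COMMA", self.main)
            self.expect("R_BRACKET")
            return ("Array", elems)
        if kind == "L_BRACE":
            self.expect("L_BRACE")
            members = self.sep_by("COMMA", self.member)
            self.expect("R_BRACE")
            return ("Object", members)
        raise SyntaxError(f"unexpected token: {kind}")

# Tokens for the document {"age": 42}
tokens = [("L_BRACE", "{"), ("STRING_LIT", '"age"'), ("COLON", ":"),
          ("NUMBER_LIT", "42"), ("R_BRACE", "}")]
tree = Parser(tokens).main()
```

Notice how main calls member, which in turn calls main for the member's value: this mutual recursion is what lets the grammar describe arbitrarily nested documents.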

Conclusion

We now have a complete language spec for the JSON language:

node JsonNode { }

node Null: JsonNode { }

node Bool: JsonNode { }

node Number: JsonNode { }

node String: JsonNode { }

node Array: JsonNode { 
    elems: List<JsonNode> 
}

node Object: JsonNode {
    members: List<Member>
}

node Member: JsonNode {
    key: String,
    value: JsonNode
}

term COMMA = ','
term COLON = ':'
term L_BRACE = '{'
term R_BRACE = '}'
term L_BRACKET = '['
term R_BRACKET = ']'
term NULL = 'null'

term BOOL_LIT = `true|false`
term NUMBER_LIT = `\-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?`
term STRING_LIT = `"([^"\\]|(\\[\\/bnfrt"])|(\\u[a-fA-F0-9]{4}))*"`


ignore term WHITESPACE = `\s`

rule string = String { STRING_LIT }

rule member = Member { key@string COLON value@main }

rule main =
    Null { NULL }
  | Number { NUMBER_LIT }
  | Bool { BOOL_LIT }
  | string
  | Array { L_BRACKET elems@sepBy(COMMA, main) R_BRACKET }
  | Object { L_BRACE members@sepBy(COMMA, member) R_BRACE }

The last step is to test it on our test file! This is done by invoking the following command:

sylver parse --spec=json.syl --file=config.json

Which yields the following parse tree:

Object {
. ● members: List<Member> {
. . Member {
. . . ● key: String { "variables" }
. . . ● value: Array {
. . . . ● elems: List<JsonNode> {
. . . . . Object {
. . . . . . ● members: List<Member> {
. . . . . . . Member {
. . . . . . . . ● key: String { "name" }
. . . . . . . . ● value: String { "country" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "description" }
. . . . . . . . ● value: String { "Customer's country of residence" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "values" }
. . . . . . . . ● value: Array {
. . . . . . . . . ● elems: List<JsonNode> {
. . . . . . . . . . String { "us" }
. . . . . . . . . . String { "fr" }
. . . . . . . . . . String { "it" }
. . . . . . . . . }
. . . . . . . . }
. . . . . . . }
. . . . . . }
. . . . . }
. . . . . Object {
. . . . . . ● members: List<Member> {
. . . . . . . Member {
. . . . . . . . ● key: String { "name" }
. . . . . . . . ● value: String { "age" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "description" }
. . . . . . . . ● value: String { "Customer's age" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "type" }
. . . . . . . . ● value: String { "number" }
. . . . . . . }
. . . . . . }
. . . . . }
. . . . }
. . . }
. . }
. }
}

In the next part, we'll define business rules to validate our JSON configuration (for example, the possible values for each variable must be of the same type), and we will use a query DSL to identify the tree nodes that violate these rules.