Parsing absolutely anything in JavaScript using the Earley algorithm
Let me start by saying — I was surprised how easy it was to write a grammar for an Earley parser. I have been using regular expressions for over a decade, and I am used to parsing things with them. It is fragile, it is not always possible, and so on. But it is fast and for the most part it serves the purpose.
Familiarising myself with parsing algorithms changed this attitude forever.
This is a long article. Therefore, I have used the He-Man meme to keep you entertained throughout the journey. I promise you an all-powerful weapon at the end of the article.
I am working on a declarative HTML scraper. Its syntax is a custom DSL that extends the CSS3 selector specification.
Here is an example of a scraper manifest used to declare what to parse, and how to validate and format the resulting data:
selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: .body::property(innerHTML)
      summary: .body p {0,}[0]
      imageUrl: img::attribute(src)
      title: .title::text()::test(/foo/)::format(upperCase)

There is a bunch of stuff going on there that isn't part of the CSS spec:
- A quantifier expression ({0,})
- An accessor expression ([0])
- An XPath-esque function invocation (::test(/foo/)::format(upperCase))
I needed a way to parse it.
My first thought was to use regular expressions. In fact, I used regular expressions to write a prototype of the parser. After all, in the early stages of program design you need to be able to prototype the solution quickly; the prototype stage is not the time to be anal about the edge cases.
This is not to say that regular expressions cannot be used for parsing at all. Regular expressions can parse regular languages; however, CSS selectors form a context-free language, which puts them beyond what regular expressions can reliably handle.
By the way, if the terms "context-free" or "regular language" do not make much sense, I recommend reading The true power of regular expressions (5 minutes worth of reading).
However, for the production releases, I needed a strict, extendable parser.
I started looking for parsers in JavaScript and found Jison and PEG.js. However, neither of them supports left-recursion. I wanted a parser that supports left-recursion!
I kid you not — I didn't even know what left-recursion was at the time of making this decision. However, I found it odd that it was emphasised that these algorithms do not support it. Nevertheless, it was a good hunch — as I learned later, left-recursion keeps the parser grammar simple and can be a lot more performant.
Long story short, on the second page of a Google search for "JavaScript parser" I found http://nearley.js.org/, an implementation of the Earley parsing algorithm.
The author describes it as:
Earley parsers are awesome, because they will parse anything you give them. Depending on the algorithm specified, popular parsers such as lex/yacc, flex/bison, Jison, PEGjs, and Antlr will break depending on the grammar you give it. And by break, I mean infinite loops caused by left recursion, crashes, or stubborn refusals to compile because of a "shift-reduce error".
– Better Earley than never (http://hardmath123.github.io/earley.html)
It sounded like a skill I want to learn.
So I continued reading.
Install the nearley package:

$ npm install nearley

nearley consists of the main package (the parser API) and several CLI programs:
$ ls -1 ./node_modules/.bin
nearley-railroad
nearley-test
nearley-unparse
nearleyc

These programs are:

- nearley-railroad is used to generate railroad diagrams.
- nearley-test is used to test an arbitrary input against the compiled grammar.
- nearley-unparse is used to generate random strings that satisfy the grammar.
- nearleyc is used to compile a nearley grammar to JavaScript.
To make these programs available in your shell, add ./node_modules/.bin to your $PATH (export PATH=./node_modules/.bin:$PATH) or install nearley with the --global option.
We are only going to be using nearleyc and nearley-test.
A parser needs a grammar to parse the input.
The Earley algorithm parses a string based on a grammar in Backus-Naur Form (BNF). A BNF grammar consists of a set of production rules, which are expansions of nonterminals.
A grammar to parse the "1+2+3" input is:

expression -> "1+2+3"

In layman's terms, this grammar says: match "1+2+3" as "expression".
A nonterminal is a construction of the language. A nonterminal has a name (expression) and a list of production rules. A production rule defines what to match. A production rule consists of a series of other nonterminals and/or strings ("1+2+3" is a production rule consisting of a single terminal).
Note: expression is an arbitrary name. It does not have a semantic meaning.
To test it, compile the grammar using nearleyc:
$ cat <<'EOF' > ./grammar.ne
expression -> "1+2+3"
EOF
$ nearleyc ./grammar.ne --out ./grammar.js

Instruct nearley-test to use the resulting ./grammar.js to parse an input:
nearley-test ./grammar.js --input '1+2+3'
Table length: 6
Number of parses: 1
Parse Charts
Chart: 0
0: {expression → ● expression$string$1}, from: 0
1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0
Chart: 1
0: {expression$string$1 → "1" ● "+" "2" "+" "3"}, from: 0
Chart: 2
0: {expression$string$1 → "1" "+" ● "2" "+" "3"}, from: 0
Chart: 3
0: {expression$string$1 → "1" "+" "2" ● "+" "3"}, from: 0
Chart: 4
0: {expression$string$1 → "1" "+" "2" "+" ● "3"}, from: 0
Chart: 5
0: {expression$string$1 → "1" "+" "2" "+" "3" ● }, from: 0
1: {expression → expression$string$1 ● }, from: 0
Parse results:
[ [ '1+2+3' ] ]

Hooray! Our program parsed the string literal "1+2+3".
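Although the rest of this article sticks to the CLI tools, the compiled ./grammar.js can also be consumed programmatically through the parser API. A minimal sketch, assuming a nearley 2.x style constructor (newer releases expect new nearley.Parser(nearley.Grammar.fromCompiled(grammar)) instead):

const nearley = require('nearley');
const grammar = require('./grammar.js');

// construct a parser from the compiled grammar
const parser = new nearley.Parser(grammar.ParserRules, grammar.ParserStart);

// feed the input; .results holds every possible parsing of it
parser.feed('1+2+3');

console.log(parser.results);
// [ [ '1+2+3' ] ]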
It is important to understand the output of nearley-test, because this is the tool you will be using to debug the grammars.
Earley works by producing a table of partial parsings, i.e. the nth column of the table contains all possible ways to parse s[:n], the first n characters of s.
Chart: 0
0: {expression → ● expression$string$1}, from: 0
1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0

In the above example, "Chart: 0" is the first column. It shows that we have a single nonterminal expression that can be equal to expression$string$1.
As far as I understand, expression$string$1 is just a temporary variable used to represent terminal structures to avoid repeating them. More about that later.
● is a marker used to denote how far we have parsed. At the moment, we are at the beginning of the string.
As we progress, we continue to match the terminal character by character.
If the entire terminal is matched, the program produces a token for the match.
To broaden your understanding, add another terminal to the expression production rule and use nearley-test to test different inputs, e.g.
expression ->
    "1+2+0"
    | "1+2+3"

Multiple production rules (nonterminal expansions) are separated using the vertical bar (|) character.
If you are still feeling lost on the subject, the Better Earley than never article takes you through an equivalent journey, documenting every step.
If you are paying close attention, you will have noticed that the result of the program (the "1+2+3") is contained in an array, which itself is contained in an array.
There are two reasons for this.
The first reason is that each production rule of a nonterminal is a list itself. In the original example, this list contains a single terminal "1+2+3". We could have written it as:
expression ->
    "1" "+" "2" "+" "3"

Note: white spaces surrounding the nonterminals and terminals have no meaning in the grammar declaration.
This grammar is equivalent to the original example. However, the result is different.
This is because the Earley algorithm operates on individual characters. Internally, nearley converts strings longer than one character into a parser rule declared as a series of symbols (string literals) that make up the original string. If this parser rule is matched, the resulting series of symbols is joined back into the original string.
Tip: Have a look at the compiled grammar file (grammar.js) to further understand the internal format of the parser rules.
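For instance, for the original one-line grammar the compiled file contains something along these lines (abridged and paraphrased here; the exact output depends on the nearley version):

ParserRules: [
    {"name": "expression$string$1",
     "symbols": [{"literal": "1"}, {"literal": "+"}, {"literal": "2"}, {"literal": "+"}, {"literal": "3"}],
     "postprocess": function joiner (d) { return d.join(''); }},
    {"name": "expression", "symbols": ["expression$string$1"]}
],
ParserStart: "expression"

The generated expression$string$1 rule matches the five characters one by one, and its postprocessor joins them back into the "1+2+3" string.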
We can use a postprocessor to do this joining ourselves.
A postprocessor is a JavaScript function that is used to process the result of a production rule. A postprocessor can format the result or it can reject the match. A postprocessor rule is declared after the production rule surrounded by {% … %}, e.g.
$ cat <<'EOF' > ./grammar.ne
expression ->
    "1" "+" "2" "+" "3" {% d => d.join('') %}
EOF

A postprocessor is invoked with 3 parameters. The first parameter is a list of all the matches. In the above example, that's: "1", "+", "2", "+", "3".
The result of the postprocessor becomes the value of the production rule.
We are going to come back to the other two parameters later in the article. If you are feeling impatient, refer to the official documentation.
Let's return to the definition of a nonterminal I gave earlier:
A nonterminal is a construction of the language. A nonterminal has a name and a list of production rules. A production rule defines what to match. A production rule consists of a series of other nonterminals and/or strings.
That means we can declare a nonterminal for matching the numbers and a nonterminal to match the mathematical symbols.
expression ->
    N MS N MS N {% d => d.join('') %}

MS ->
    "+" {% d => d[0] %}
    | "-" {% d => d[0] %}

N ->
    "1" {% d => d[0] %}
    | "2" {% d => d[0] %}
    | "3" {% d => d[0] %}
    | "4" {% d => d[0] %}
    | "5" {% d => d[0] %}
    | "6" {% d => d[0] %}
    | "7" {% d => d[0] %}
    | "8" {% d => d[0] %}
    | "9" {% d => d[0] %}
    | "0" {% d => d[0] %}

Hopefully, this is self-explanatory.
As it is, the N nonterminal from the previous example can only parse single-digit numbers. However, we can refer to N itself in its own production rules, as shown in the sketch below.
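A minimal sketch of such a recursive rule (the DIGIT helper is my own shorthand for the ten single-digit alternatives from the previous example, not something nearley provides):

N ->
    DIGIT {% d => d[0] %}
    | DIGIT N {% d => d[0] + d[1] %}

DIGIT ->
    "0" {% d => d[0] %}
    | "1" {% d => d[0] %}
    | "2" {% d => d[0] %}
    | "3" {% d => d[0] %}
    | "4" {% d => d[0] %}
    | "5" {% d => d[0] %}
    | "6" {% d => d[0] %}
    | "7" {% d => d[0] %}
    | "8" {% d => d[0] %}
    | "9" {% d => d[0] %}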
This is simply telling the parser that an N is one of [0–9] or N is one of [0–9] followed by N.
Similar to the above, we can implement a zero-or-more quantifier.

To match nothing, we use an epsilon rule. An epsilon rule is denoted using null.
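A sketch of that, reusing the DIGIT helper from the previous sketch:

N ->
    null {% () => '' %}
    | DIGIT N {% d => d[0] + d[1] %}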
This is telling the parser that an N is nothing or one of [0-9] followed by N.
These are examples of BNF's recursive repetition construct.
If you are thinking "[0–9] and quantifiers", then you are in luck: nearley supports character sets and quantifiers (:+, :*, :?).
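For example, the whole N nonterminal can be collapsed into a single production rule using a character set and the :+ (one or more) modifier. A sketch:

N ->
    [0-9]:+ {% d => d[0].join('') %}

Here d[0] is an array of the individual characters matched by [0-9]:+, which the postprocessor joins back into a single string.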
Here is a grammar to parse any valid mathematical expression consisting of addition, subtraction, multiplication and division.
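A sketch of such a grammar (the nonterminal names AS, MD, U and N are illustrative; each postprocessor evaluates its part of the expression as it is parsed):

expression ->
    AS {% d => d[0] %}

# addition and subtraction (lowest precedence; left recursion makes it left-associative)
AS ->
    AS "+" MD {% d => d[0] + d[2] %}
    | AS "-" MD {% d => d[0] - d[2] %}
    | MD {% d => d[0] %}

# multiplication and division
MD ->
    MD "*" U {% d => d[0] * d[2] %}
    | MD "/" U {% d => d[0] / d[2] %}
    | U {% d => d[0] %}

# a parenthesised sub-expression or a number
U ->
    "(" AS ")" {% d => d[1] %}
    | N {% d => d[0] %}

# a number: one or more digits converted to an integer
N ->
    [0-9]:+ {% d => parseInt(d[0].join(''), 10) %}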
You already know all the components that make up this grammar. Understanding it is only a matter of compiling the grammar and using nearley-test to debug several expressions, e.g. 10/((1+4)*8)*4 yields 1.
When I introduced the postprocessors, I talked about how the result is contained in an array of arrays.
This is how nearley informs you about grammar ambiguity. If there is more than one result, that means that the grammar is ambiguous, i.e. there is more than one route to parse the input. This is a bad sign.
Ambiguity will present performance issues (multiple routes need to be analysed to complete the parsing). It will also lead to hard-to-catch bugs – different routes can produce different parsing results.
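As a hypothetical illustration (this grammar is not part of the DSL), the following naive expression grammar is ambiguous, because an input such as 1+2+3 can be grouped either as (1+2)+3 or as 1+(2+3):

expression ->
    expression "+" expression {% d => d[0] + d[2] %}
    | [0-9] {% d => parseInt(d[0], 10) %}

Running nearley-test against it with the input 1+2+3 should report "Number of parses: 2". The left-recursive calculator sketch above avoids this, because the grouping is fixed by the grammar itself.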
Whenever you encounter ambiguity, use the nearley-test program to trace the parsing route and see where the parser diverges.
When I first introduced the postprocessor functions, I said that a postprocessor is invoked with three parameters. The third parameter is a rejection sentinel.
We can filter out null results (when a quantifier does not match anything, it produces null), and if there are no matches at all, we return the rejection sentinel to reject the match.
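A hypothetical sketch of that pattern (the rule and the names are purely illustrative); the :? modifier yields null when the optional part is absent:

item ->
    word:? number:? {% (d, location, reject) => {
        // d is [wordOrNull, numberOrNull]; drop whatever did not match
        const matches = d.filter((part) => part !== null);

        // neither part matched: reject this parsing of the rule
        return matches.length === 0 ? reject : matches;
    } %}

word ->
    [a-z]:+ {% d => d[0].join('') %}

number ->
    [0-9]:+ {% d => parseInt(d[0].join(''), 10) %}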
Writing postprocessor functions inline after each production rule is not good for code readability. Use the @{% … %} syntax to declare helper functions instead.
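For example (the join helper is an illustrative name, not something nearley ships with):

@{%
    // code inside @{% … %} is copied verbatim into the compiled grammar,
    // so every postprocessor below can refer to it
    const join = (d) => d.join('');
%}

expression ->
    "1" "+" "2" "+" "3" {% join %}

nearley also provides a built-in id postprocessor, equivalent to d => d[0], which is handy for pass-through rules.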
Now that you know how to parse a string, how do you tokenise it?
The answer is simple: a postprocessor can return anything. Just like we used a postprocessor to convert strings to numbers in the calculator example, we can return objects describing the tokens and delegate the math logic to whatever program understands those tokens.
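For example, the addition and subtraction rules from the calculator sketch above could emit token objects instead of computing a number right away (the token shape is illustrative):

AS ->
    AS "+" MD {% d => ({type: 'ADD', left: d[0], right: d[2]}) %}
    | AS "-" MD {% d => ({type: 'SUBTRACT', left: d[0], right: d[2]}) %}
    | MD {% d => d[0] %}

The parser then produces a tree of such objects, and evaluating (or transpiling, or validating) the expression becomes the job of whatever consumes the parser results.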
There are plugins for different IDEs that enable Nearley syntax highlighting.
I promised you an all-powerful weapon: you have it! You now have the skill to parse absolutely any string!
At the beginning, I said that I started learning about parsing because I needed to build a CSS selector parser. I have built it using nearley, and I am pretty sure it is the most thorough CSS selector parser available in JavaScript.
Meet Scalpel, a CSS selector parser.
If you are looking for more examples of using the Earley algorithm to parse a context-free grammar, I encourage you to take a look at the source code of Scalpel.
Good luck! :-)