An intro to parser combinators

You’ve probably heard a lot of clever people talk about parser combinators and there’s a good reason for it. They are brilliantly simple in their design but can be used to build arbitrarily complex parsers. The resulting code reads like grammar. They demonstrate the simplicity and power of functional programming. They are beautiful and elegant, period.

Parsers

Parsers are simple functions that will parse a small part of the string like a single character or a number. You can then combine these tiny parsers using combinators to create a big parser.

Parser is a function that takes string (that needs to be parsed) and returns an array of (string, string) tuple.

The tuple’s first element is the parsed content and the remaining string as the second element.

type Parser = (code: string) => [string,string][];

You might be wondering why are we returning an array of tuples and not just a single tuple. We could have multiple parse results if the parser we’re writing is indeterminate for eg; a greedy vs non-greedy parser. An array of tuples allows us to represent such results.

An empty array returned from a parser thus means failure to parse.

That’s cool, but what the heck is a combinator?

A combinator is a function that takes one or more parsers and return a single parser, effectively combining them.

type Combinator = (...p: Parser[]) => Parser;

Things will get much more clearer once you see a simple implementation of a combinator.

Let’s code

Let’s write a couple of very simple parsers in typescript.

export const digit: Parser = (s: string) =>
  s.length && (+s[0]).toString() == s[0]
    ? [[s[0], s.slice(1)]]
    : [];
    
export const alpha: Parser = (s: string) =>
  s.length && s[0].toUpperCase() != s[0].toLowerCase()
    ? [[s[0], s.slice(1)]]
    : [];
  
export const literal: (l: string) => Parser = 
  (l: string) => (s: string) =>
    s.length && s.startsWith(l)
      ? [[l, s.slice(l.length)]]
      : [];


digit('123abc');         // [['1','23abc']]
digit('abc');            // []  empty array represents failure
alpha('123abc');         // [] 
alpha('abc');            // [['a','bc']]
literal(';')(';abc');    // [[';','abc']]

The digit parser will parse a single digit from the front of the string. Similarly, char parser parses a single char from the front of the string.

literal matches a literal string. It is a little different, it’s a function that returns a parser. Since we need to pass additional data to literal, we use currying instead of modifying our Parser type.

How do we combine them?

Introducing sequence combinator, lovingly called as seq

export const seq = (...parsers: Parser[]) => {
  return (s: string) => {
    let remaining = s;
    let results: string[] = [];
    for (let i = 0; i < parsers.length; i++) {
      const r = parsers[i](remaining);
      if (r.length) {
        results.push(r[0][0]);
        remaining = r[0][1];
      } else return [];
    }
    return [[results.join(''), remaining]];
  }
}

// Parse a 3 digit number
const threeDigitNumber = seq(digit, digit, digit);
threeDigitNumber('123abc'); // [['123','abc']];
threeDigitNumber('12a'); // [] (fails)

// Parse string of 3 alphabets enclosed in parentheses
const parenthesesOpen = literal('(');
const parenthesesClose = literal(')');
const threeAlphabetString = seq(parenthesesOpen, alpha, alpha, alpha, parenthesesClose);
threeAlphabetString('abc'); // [] (fails)
threeAlphabetString('(abc)123'); // ['(abc)','123'] // successfully parsed

Sequence combinator allows us to use multiple parsers one after another in a sequence. If any of the parser fails, the whole sequence fails.

Similarly we can have an either combinator which returns the first successful parser’s result.

export const either: Combinator = (...parsers: Parser[]) => {
  return (s: string) => {
    for (let i = 0; i < parsers.length; i++) {
      const res = parsers[i](s);
      if (res.length) {
        return res;
      }
    }
    return [];
  }
}

The cool thing is we can build combinators by composing other combinators. Let’s write many and manySeparatedBy combinators using other combinators:

export const many: Combinator = (parser) =>
  either(
    seq(
      parser,
      lazy(() => many(parser))
    ),
    parser
  );

export const manySeparatedBy: (separator: string) => Combinator =
  (separator: string) => (parser) =>
    either(seq(parser, many(seq(literal(separator), parser))), parser);

Let’s write a cron parser

With these parsers and combinators, we can finally go ahead and start to write a cron parser.

const wildcard = literal("*");

const singleOrDoubleDigits = either(seq(digit, digit), digit);

const range = seq(singleOrDoubleDigits, literal("-"), singleOrDoubleDigits);

export const step = seq(
  either(range, wildcard),
  literal("/"),
  singleOrDoubleDigits
);

const list = manySeparatedBy(",")(either(step, range, singleOrDoubleDigits));

const validValues = either(list, wildcard, singleOrDoubleDigits);

const space = many(literal(" "));

const minute = validValues;

const hour = validValues;

const dayOfMonth = validValues;

const month = validValues;

const dayOfWeek = validValues;

export const cronParser = seq(
  minute,
  space,
  hour,
  space,
  dayOfMonth,
  space,
  month,
  space,
  dayOfWeek
);

const result = cronParser("1,2,3 1-3,5,6-12/2 1-31/2 */2 *");

if(result.length) {
  console.log("Cron expression is valid");
}
else {
  console.log("Cron expression is invalid");
}

We can use this parser to validate if a cron expression is valid or not. Note that we have not taken the command part of the cron expression into account.

In later parts of this series, we will extend this parser combinator to add support for capturing parsed values of the string into a parse tree.

You can play around with the cron-parser here:


Deepak Mittal

Powered by WordPress