Parsing Techniques

Ledger Mac App: Parsing Techniques

In this episode25:24

This episode is freely available thanks to the support of our subscribers

Subscribers get exclusive access to new and all previous subscriber-only episodes, video downloads, and 30% discount for team members. Become a Subscriber →

We look at two different techniques to parse a simple expression language: handwritten parsers and parser combinators.

00:07 For our Ledger file, we needed to parse expressions (and parse the rest of the file). Parsers might seem like an obscure topic, but once you start thinking about it, you can use them in many cases. As such, knowing how to write a parser can really help.

00:36 For example, formulas are parsed in spreadsheet applications. And in an app like Photoshop, you can enter expressions like 100/2 into text fields and the result is evaluated.

01:00 Of course, the Swift compiler also parses our code, and understanding how parsers work helps us in understanding the entire build chain. It might seem like it's complicated to write a parser, but in this episode, we'll show that it's not. Interestingly enough, many people use regular expressions for tasks where parsers would be more appropriate (and easier). Additionally, regular expressions are strictly more limited in what they can do. Because of these things, parsers are good tools to have in your tool belt.

Grammars

01:57 Let's start by defining a grammar for a simple expression language, which allows us to express simple arithmetic expressions like 5+2*3. The grammar contains the rules for valid expressions in this little language, and the precedence of the operators is encoded in the grammar. The precedence ensures that the expression above is parsed as 5+(2*3) rather than (5+2)*3, because * has a different precedence than +.

02:48 We try not to think about precedence because we learned it in school, but with obscure operators, we sometimes have to think about it (and add parentheses if we're unsure).

03:05 Let's consider multiplication expressions first, such as 2*3. The grammar looks like this:

multiplicationExpression = int "*" int

03:32 To parse expressions like 5*6 + 2*3, we can define an additional rule in our grammar:

additionExpression = multiplicationExpression "+" multiplicationExpression

04:00 However, this grammar doesn't allow for expressions like 5+2*3. Our multiplication expression always needs to be of the form int*int, and 5 doesn't have the star symbol and the second integer. We can fix that by making the final part optional:

multiplicationExpression = int ("*" int)?

04:25 For addition, we have to do the same thing, and we end up with the following grammar:

additionExpression = multiplicationExpression ("+" multiplicationExpression)?
multiplicationExpression = int ("*" int)?

04:36 This is the pseudo code for our grammar, and it's incomplete. We can't write expressions like 2*3*4, but let's just stick with the current example.

04:48 To turn this pseudo grammar into a parser, we could use parser generators such as Bison, YACC, or Lemon. This approach is used by SQLite, among others.

05:26 A second approach is to write a parser manually. This is what most programming languages, such as the Swift parser, do. It gives you full control, and you can hand tune manual parsers to be very fast. The problem is that they can get very complicated; it's not easy to write them in such a way that they're understandable.

05:59 The third approach is to use parser combinators. This approach comes from functional programming, and it's a way of writing parsers by combining them out of other parsers. Because of all the operators and types, they're strange (but interesting).

Writing a Parser by Hand

06:16 We'll ignore the parser generators for today. Instead, with the help of NSScanner, let's start by writing a parser for our current grammar by hand, and then we can have a look at the more obscure technique of parser combinators.

06:45 We start by creating the class Parser for our parser. We need to store a String and the current parsing location, and we can use the Scanner class that's in Foundation. A Scanner takes a String and a position and provides us with methods to parse simple things like an Int. We can create the Scanner in our initializer:

final class Parser {
    let scanner: Scanner
    
    init(_ string: String) {
        scanner = Scanner(string: string)
    }
}

07:35 To parse a single integer, we could add a method on Parser, which tries to parse an integer. The result type is optional, in case it can't parse the integer. To implement that method, we can directly use Scanner to scan the integer. Scanner has an old-fashioned API in which you need to pass the result as a mutable pointer. Only if the scanInt returns true can we return the result; otherwise we return nil:

func int() -> Int? {
    var result: Int = 0
    guard scanner.scanInt(&result) else { return nil }
    return result
}

08:49 We can test it, and we get 23 out:

let parser = Parser("23")
parser.int() // 23

08:58 Looking back at our grammar, we want to create the parser for a multiplicationExpression. We can add another method, multiplication, which also returns an optional integer. We start by parsing a single integer. Then we try to parse a * symbol, and finally, we parse the right-hand-side integer and return the multiplication of the two operands.

By the way, we currently return optional values, but in a real parser, we probably want to have more detailed information about what went wrong.

func multiplication() -> Int? {
    if let lhs = int() {
        if scanner.scanString("*", into: nil) {
            if let rhs = int() {
                return lhs * rhs
            }
        }
    }
}

10:44 We left out all the else cases. If we can't parse the first integer, we return nil. If we can parse the left-hand side, but not the right-hand side, we can return just the left-hand side (because the * and the right-hand side are optional).

In the case that we scan both the left-hand side and the *, but not the right-hand side, we have an invalid expression. We could choose to return nil, but our scanner has already increased its location, so we'd need to reset it. In order to do that, we need to store the initial location of where we started parsing the multiplication expression:

func multiplication() -> Int? {
    let oldLocation = scanner.scanLocation
    if let lhs = int() {
        if scanner.scanString("*", into: nil) {
            if let rhs = int() {
                return lhs * rhs
            } else {
                scanner.scanLocation = oldLocation
                return nil
            }
        } else {
            return lhs
        }
    } else {
        return nil
    }
}

Refactoring

12:16 This gets tricky, and it's hard to see what's going on here, which is one of the problems when you handwrite a parser: it's difficult to make it correct and to encode all the behavior. When you read the multiplication parser we just wrote, aside from the name, it's not obvious that it reflects the multiplicationExpression rule in our grammar. However, we can still improve the code by using guard statements:

func multiplication() -> Int? {
    let oldLocation = scanner.scanLocation
    guard let lhs = int() else { return nil }
    guard scanner.scanString("*", into: nil) else { return lhs }
    guard let rhs = int() else {
        scanner.scanLocation = oldLocation
        return nil
    }
    return lhs * rhs
}

13:53 To parse an addition expression, we can copy-paste this code. But first, we should test that everything works:

let parser = Parser("23*2")
parser.multiplication() // 46

14:30 If we paste the definition of multiplication and replace the name and the operator, we end up with a parser that parses simple addition expressions:

func addition() -> Int? {
    let oldLocation = scanner.scanLocation
    guard let lhs = int() else { return nil }
    guard scanner.scanString("+", into: nil) else { return lhs }
    guard let rhs = int() else {
        scanner.scanLocation = oldLocation
        return nil
    }
    return lhs + rhs
}

Parsing More Complicated Expressions

14:55 However, we can't yet parse expressions such as 2*3. This is because our addition method doesn't reflect the grammar. Instead of parsing with int(), we should be parsing a multiplication expression:

func addition() -> Int? {
    let oldLocation = scanner.scanLocation
    guard let lhs = multiplication() else { return nil }
    guard scanner.scanString("+", into: nil) else { return lhs }
    guard let rhs = multiplication() else {
        scanner.scanLocation = oldLocation
        return nil
    }
    return lhs + rhs
}

15:29 The simple example we've shown demonstrates how tricky it is to handwrite a parser. It's easy to forget things, it's easy to write code that doesn't exactly match the grammar, and it's hard to diagnose both those things without a lot of test cases.

Parser Combinators

16:04 Let's look at solving the same problem using parser combinators. We'll take a lot of shortcuts and write the parsers straightaway without explaining all the operators we're using. In a future episode, we'll explain how parser combinators work and really dive into that topic. But for now, we simply want to show that if you squint while reading the parser-combinator code, you'll recognize the grammar.

16:46 The first step is parsing an integer. We used the Scanner class for this in our handwritten parser, and we got it for free. But here, we'll write an integer parser from scratch because it's a good example of parser combinators.

17:07 First, we'll parse a single digit. We can use the character parser to verify the character is a member of the decimalDigits character set:

let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
digit.run("123")

17:52 The output is a bit cryptic: it's a tuple that contains the parsed value and the remaining string to be parsed. Because we're only parsing a single digit, the next step is parsing multiple digits in a row by using the many combinator:

let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many

digits.run("123")

18:40 If we run digits, we get back an array of characters rather than a single digit. The next step is combining all the digits into a string by using map:

let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
let digitString = digits.map { characters in String(characters) }

digits.run("123")

19:23 Finally, we want to convert the string into an integer by using map again. We know that we'll always pass a valid integer string into the Int initializer, so we can force-unwrap the result. (One caveat: we're using the many combinator, which also parses an empty string. For this code to be correct, we should use many1, which parses at least one digit.)

let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
let digitString = digits.map { characters in String(characters) }
let int = digitString.map { Int($0)! }

int.run("123")

20:08 We can make the code much shorter by inlining the intermediate steps:

let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let int = digit.many.map { characters in Int(String(characters))! }

Multiplication Expressions

20:49 Looking at our grammar again, the next step is a multiplication expression. There's a lot of magic involved here, and if you're not used to parser combinators, the code will be hard to understand at first.

21:14 A multiplication expression is an integer, followed by a *, followed by another integer. To combine the parts into a result, we can use the <^> operator and the curry function. The actual calculation is simple: it's a function that multiplies its two arguments. The code below almost works:

let multiplication = curry({ x, y in x*y }) <^> int <*> (string("*") <*> int)

22:15 We need to ignore the result of parsing the * string. To do that, we can use the *> operator, which parses but ignores the left-hand side. Finally, we can run our parser:

let multiplication = curry({ x, y in x*y }) <^> int <*> (string("*") *> int)
multiplication.run("2*3") // 6

22:54 We can no longer parse a single integer using our multiplication parser because we didn't make the second part of our parser optional. We fix this by using the optional combinator and providing a default value in case the second parser returns nil.

let multiplication = curry({ x, y in x * (y ?? 1) }) <^> int <*> (string("*") *> int).optional

23:34 To add addition, we'll copy and paste the multiplication parser and replace the name and operators:

let addition = curry({ x, y in x + (y ?? 0) }) <^> int <*> (string("+") *> int).optional

23:58 We can test it to verify that it works. However, just like before, we can only parse integer operands. To fix this, we need to use the multiplication parser from within addition:

let addition = curry({ x, y in x + (y ?? 0) }) <^> multiplication <*> (string("+") *> multiplication).optional

24:26 If you look to the right-hand side of the <^> operator, you can see the grammar — an int, followed by an optional string and an int — and you can see the operator string gets ignored. In a way, it's very readable, and at the same time, it's very unreadable if you're not used to these operators.

24:48 This demonstrates that you can write a parser with very little code by using parser combinators. The parsers are very close to the grammar. If you're not used to them, they're also very magical, and in a future episode, we'll have a look at how this works. It'll be a challenge for those of us who are used to object-oriented design rather than functional programming.