Swift Talk #13

## Parsing Techniques

### This episode is freely available thanks to the support of our subscribers

Subscribers get exclusive access to new and all previous subscriber-only episodes, video downloads, and 30% discount for team members. Become a Subscriber →

We look at two different techniques to parse a simple expression language: handwritten parsers and parser combinators.

0:07 For our Ledger file, we needed to parse expressions (and parse the rest of the file). Parsers might seem like an obscure topic, but once you start thinking about it, you can use them in many cases. As such, knowing how to write a parser can really help.

0:36 For example, formulas are parsed in spreadsheet applications. And in
an app like Photoshop, you can enter expressions like `100/2`

into text fields
and the result is evaluated.

1:00 Of course, the Swift compiler also parses our code, and understanding how parsers work helps us in understanding the entire build chain. It might seem like it's complicated to write a parser, but in this episode, we'll show that it's not. Interestingly enough, many people use regular expressions for tasks where parsers would be more appropriate (and easier). Additionally, regular expressions are strictly more limited in what they can do. Because of these things, parsers are good tools to have in your tool belt.

## Grammars

1:57 Let's start by defining a grammar for a simple expression language,
which allows us to express simple arithmetic expressions like `5+2*3`

. The
grammar contains the rules for valid expressions in this little language, and
the precedence of the operators is encoded in the grammar. The precedence
ensures that the expression above is parsed as `5+(2*3)`

rather than `(5+2)*3`

,
because `*`

has a different precedence than `+`

.

2:48 We try not to think about precedence because we learned it in school, but with obscure operators, we sometimes have to think about it (and add parentheses if we're unsure).

3:05 Let's consider multiplication expressions first, such as `2*3`

. The
grammar looks like this:

```
multiplicationExpression = int "*" int
```

3:32 To parse expressions like `5*6 + 2*3`

, we can define an additional
rule in our grammar:

```
additionExpression = multiplicationExpression "+" multiplicationExpression
```

4:00 However, this grammar doesn't allow for expressions like `5+2*3`

.
Our multiplication expression always needs to be of the form `int*int`

, and `5`

doesn't have the star symbol and the second integer. We can fix that by making
the final part optional:

```
multiplicationExpression = int ("*" int)?
```

4:25 For addition, we have to do the same thing, and we end up with the following grammar:

```
additionExpression = multiplicationExpression ("+" multiplicationExpression)?
multiplicationExpression = int ("*" int)?
```

4:36 This is the pseudo code for our grammar, and it's incomplete. We
can't write expressions like `2*3*4`

, but let's just stick with the current
example.

4:48 To turn this pseudo grammar into a parser, we could use parser generators such as Bison, YACC, or Lemon. This approach is used by SQLite, among others.

5:26 A second approach is to write a parser manually. This is what most programming languages, such as the Swift parser, do. It gives you full control, and you can hand tune manual parsers to be very fast. The problem is that they can get very complicated; it's not easy to write them in such a way that they're understandable.

5:59 The third approach is to use parser combinators. This approach comes from functional programming, and it's a way of writing parsers by combining them out of other parsers. Because of all the operators and types, they're strange (but interesting).

## Writing a Parser by Hand

6:16 We'll ignore the parser generators for today. Instead, with the
help of `NSScanner`

, let's start by writing a parser for our current grammar by
hand, and then we can have a look at the more obscure technique of parser
combinators.

6:45 We start by creating the class `Parser`

for our parser. We need to
store a `String`

and the current parsing location, and we can use the `Scanner`

class that's in Foundation. A `Scanner`

takes a `String`

and a position and
provides us with methods to parse simple things like an `Int`

. We can create the
`Scanner`

in our initializer:

```
final class Parser {
let scanner: Scanner
init(_ string: String) {
scanner = Scanner(string: string)
}
}
```

7:35 To parse a single integer, we could add a method on `Parser`

, which
tries to parse an integer. The result type is optional, in case it can't parse
the integer. To implement that method, we can directly use `Scanner`

to scan the
integer. `Scanner`

has an old-fashioned API in which you need to pass the result
as a mutable pointer. Only if the `scanInt`

returns true can we return the
result; otherwise we return `nil`

:

```
func int() -> Int? {
var result: Int = 0
guard scanner.scanInt(&result) else { return nil }
return result
}
```

8:49 We can test it, and we get `23`

out:

```
let parser = Parser("23")
parser.int() // 23
```

8:58 Looking back at our grammar, we want to create the parser for a
`multiplicationExpression`

. We can add another method, `multiplication`

, which
also returns an optional integer. We start by parsing a single integer. Then we
try to parse a `*`

symbol, and finally, we parse the right-hand-side integer and
return the multiplication of the two operands.

By the way, we currently return optional values, but in a real parser, we probably want to have more detailed information about what went wrong.

```
func multiplication() -> Int? {
if let lhs = int() {
if scanner.scanString("*", into: nil) {
if let rhs = int() {
return lhs * rhs
}
}
}
}
```

10:44 We left out all the `else`

cases. If we can't parse the first
integer, we return `nil`

. If we can parse the left-hand side, but not the
right-hand side, we can return just the left-hand side (because the `*`

and the
right-hand side are optional).

In the case that we scan both the left-hand side and the `*`

, but not the
right-hand side, we have an invalid expression. We could choose to return `nil`

,
but our scanner has already increased its location, so we'd need to reset it. In
order to do that, we need to store the initial location of where we started
parsing the multiplication expression:

```
func multiplication() -> Int? {
let oldLocation = scanner.scanLocation
if let lhs = int() {
if scanner.scanString("*", into: nil) {
if let rhs = int() {
return lhs * rhs
} else {
scanner.scanLocation = oldLocation
return nil
}
} else {
return lhs
}
} else {
return nil
}
}
```

## Refactoring

12:16 This gets tricky, and it's hard to see what's going on here, which
is one of the problems when you handwrite a parser: it's difficult to make it
correct and to encode all the behavior. When you read the `multiplication`

parser we just wrote, aside from the name, it's not obvious that it reflects the
`multiplicationExpression`

rule in our grammar. However, we can still improve
the code by using `guard`

statements:

```
func multiplication() -> Int? {
let oldLocation = scanner.scanLocation
guard let lhs = int() else { return nil }
guard scanner.scanString("*", into: nil) else { return lhs }
guard let rhs = int() else {
scanner.scanLocation = oldLocation
return nil
}
return lhs * rhs
}
```

13:53 To parse an addition expression, we can copy-paste this code. But first, we should test that everything works:

```
let parser = Parser("23*2")
parser.multiplication() // 46
```

14:30 If we paste the definition of `multiplication`

and replace the
name and the operator, we end up with a parser that parses simple addition
expressions:

```
func addition() -> Int? {
let oldLocation = scanner.scanLocation
guard let lhs = int() else { return nil }
guard scanner.scanString("+", into: nil) else { return lhs }
guard let rhs = int() else {
scanner.scanLocation = oldLocation
return nil
}
return lhs + rhs
}
```

## Parsing More Complicated Expressions

14:55 However, we can't yet parse expressions such as `2*3`

. This is
because our `addition`

method doesn't reflect the grammar. Instead of parsing
with `int()`

, we should be parsing a multiplication expression:

```
func addition() -> Int? {
let oldLocation = scanner.scanLocation
guard let lhs = multiplication() else { return nil }
guard scanner.scanString("+", into: nil) else { return lhs }
guard let rhs = multiplication() else {
scanner.scanLocation = oldLocation
return nil
}
return lhs + rhs
}
```

15:29 The simple example we've shown demonstrates how tricky it is to handwrite a parser. It's easy to forget things, it's easy to write code that doesn't exactly match the grammar, and it's hard to diagnose both those things without a lot of test cases.

## Parser Combinators

16:04 Let's look at solving the same problem using parser combinators. We'll take a lot of shortcuts and write the parsers straightaway without explaining all the operators we're using. In a future episode, we'll explain how parser combinators work and really dive into that topic. But for now, we simply want to show that if you squint while reading the parser-combinator code, you'll recognize the grammar.

16:46 The first step is parsing an integer. We used the `Scanner`

class
for this in our handwritten parser, and we got it for free. But here, we'll
write an integer parser from scratch because it's a good example of parser
combinators.

17:07 First, we'll parse a single digit. We can use the `character`

parser to verify the character is a member of the `decimalDigits`

character
set:

```
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
digit.run("123")
```

17:52 The output is a bit cryptic: it's a tuple that contains the
parsed value and the remaining string to be parsed. Because we're only parsing a
single digit, the next step is parsing multiple digits in a row by using the
`many`

combinator:

```
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
digits.run("123")
```

18:40 If we run `digits`

, we get back an array of characters rather
than a single digit. The next step is combining all the digits into a string by
using
`map`

:

```
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
let digitString = digits.map { characters in String(characters) }
digits.run("123")
```

19:23 Finally, we want to convert the string into an integer by using
`map`

again. We know that we'll always pass a valid integer string into the
`Int`

initializer, so we can force-unwrap the result. (One caveat: we're using
the `many`

combinator, which also parses an empty string. For this code to be
correct, we should use `many1`

, which parses at least one
digit.)

```
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
let digitString = digits.map { characters in String(characters) }
let int = digitString.map { Int($0)! }
int.run("123")
```

20:08 We can make the code much shorter by inlining the intermediate steps:

```
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let int = digit.many.map { characters in Int(String(characters))! }
```

## Multiplication Expressions

20:49 Looking at our grammar again, the next step is a multiplication expression. There's a lot of magic involved here, and if you're not used to parser combinators, the code will be hard to understand at first.

21:14 A multiplication expression is an integer, followed by a `*`

,
followed by another integer. To combine the parts into a result, we can use the
`<^>`

operator and the `curry`

function. The actual calculation is simple: it's
a function that multiplies its two arguments. The code below almost works:

```
let multiplication = curry({ x, y in x*y }) <^> int <*> (string("*") <*> int)
```

22:15 We need to ignore the result of parsing the `*`

string. To do
that, we can use the `*>`

operator, which parses but ignores the left-hand side.
Finally, we can run our parser:

```
let multiplication = curry({ x, y in x*y }) <^> int <*> (string("*") *> int)
multiplication.run("2*3") // 6
```

22:54 We can no longer parse a single integer using our multiplication
parser because we didn't make the second part of our parser optional. We fix
this by using the `optional`

combinator and providing a default value in case
the second parser returns
`nil`

.

```
let multiplication = curry({ x, y in x * (y ?? 1) }) <^> int <*> (string("*") *> int).optional
```

23:34 To add addition, we'll copy and paste the `multiplication`

parser
and replace the name and
operators:

```
let addition = curry({ x, y in x + (y ?? 0) }) <^> int <*> (string("+") *> int).optional
```

23:58 We can test it to verify that it works. However, just like
before, we can only parse integer operands. To fix this, we need to use the
`multiplication`

parser from within
`addition`

:

```
let addition = curry({ x, y in x + (y ?? 0) }) <^> multiplication <*> (string("+") *> multiplication).optional
```

24:26 If you look to the right-hand side of the `<^>`

operator, you can
see the grammar — an int, followed by an optional string and an int — and you
can see the operator string gets ignored. In a way, it's very readable, and at
the same time, it's very unreadable if you're not used to these operators.

24:48 This demonstrates that you can write a parser with very little code by using parser combinators. The parsers are very close to the grammar. If you're not used to them, they're also very magical, and in a future episode, we'll have a look at how this works. It'll be a challenge for those of us who are used to object-oriented design rather than functional programming.