00:07 For our Ledger file, we needed to parse expressions (and parse the
rest of the file). Parsers might seem like an obscure topic, but once you start
thinking about it, you can use them in many cases. As such, knowing how to write
a parser can really help.
00:36 For example, formulas are parsed in spreadsheet applications. And
in an app like Photoshop, you can enter expressions like 100/2
into text
fields and the result is evaluated.
01:00 Of course, the Swift compiler also parses our code, and
understanding how parsers work helps us in understanding the entire build chain.
It might seem like it's complicated to write a parser, but in this episode,
we'll show that it's not. Interestingly enough, many people use regular
expressions for tasks where parsers would be more appropriate (and easier).
Additionally, regular expressions are strictly more limited in what they can do.
Because of these things, parsers are good tools to have in your tool belt.
Grammars
01:57 Let's start by defining a grammar for a simple expression
language, which allows us to express simple arithmetic expressions like 5+2*3
.
The grammar contains the rules for valid expressions in this little language,
and the precedence of the operators is encoded in the grammar. The precedence
ensures that the expression above is parsed as 5+(2*3)
rather than (5+2)*3
,
because *
has a different precedence than +
.
02:48 We try not to think about precedence because we learned it in
school, but with obscure operators, we sometimes have to think about it (and add
parentheses if we're unsure).
03:05 Let's consider multiplication expressions first, such as 2*3
.
The grammar looks like this:
multiplicationExpression = int "*" int
03:32 To parse expressions like 5*6 + 2*3
, we can define an additional
rule in our grammar:
additionExpression = multiplicationExpression "+" multiplicationExpression
04:00 However, this grammar doesn't allow for expressions like 5+2*3
.
Our multiplication expression always needs to be of the form int*int
, and 5
doesn't have the star symbol and the second integer. We can fix that by making
the final part optional:
multiplicationExpression = int ("*" int)?
04:25 For addition, we have to do the same thing, and we end up with the
following grammar:
additionExpression = multiplicationExpression ("+" multiplicationExpression)?
multiplicationExpression = int ("*" int)?
04:36 This is the pseudo code for our grammar, and it's incomplete. We
can't write expressions like 2*3*4
, but let's just stick with the current
example.
04:48 To turn this pseudo grammar into a parser, we could use parser
generators such as Bison, YACC, or Lemon. This approach is used by
SQLite, among others.
05:26 A second approach is to write a parser manually. This is what most
programming languages, such as the Swift parser, do. It gives you full control,
and you can hand tune manual parsers to be very fast. The problem is that they
can get very complicated; it's not easy to write them in such a way that they're
understandable.
05:59 The third approach is to use parser combinators. This approach
comes from functional programming, and it's a way of writing parsers by
combining them out of other parsers. Because of all the operators and types,
they're strange (but interesting).
Writing a Parser by Hand
06:16 We'll ignore the parser generators for today. Instead, with the
help of NSScanner
, let's start by writing a parser for our current grammar by
hand, and then we can have a look at the more obscure technique of parser
combinators.
06:45 We start by creating the class Parser
for our parser. We need to
store a String
and the current parsing location, and we can use the Scanner
class that's in Foundation. A Scanner
takes a String
and a position and
provides us with methods to parse simple things like an Int
. We can create the
Scanner
in our initializer:
final class Parser {
let scanner: Scanner
init(_ string: String) {
scanner = Scanner(string: string)
}
}
07:35 To parse a single integer, we could add a method on Parser
,
which tries to parse an integer. The result type is optional, in case it can't
parse the integer. To implement that method, we can directly use Scanner
to
scan the integer. Scanner
has an old-fashioned API in which you need to pass
the result as a mutable pointer. Only if the scanInt
returns true can we
return the result; otherwise we return nil
:
func int() -> Int? {
var result: Int = 0
guard scanner.scanInt(&result) else { return nil }
return result
}
08:49 We can test it, and we get 23
out:
let parser = Parser("23")
parser.int()
08:58 Looking back at our grammar, we want to create the parser for a
multiplicationExpression
. We can add another method, multiplication
, which
also returns an optional integer. We start by parsing a single integer. Then we
try to parse a *
symbol, and finally, we parse the right-hand-side integer and
return the multiplication of the two operands.
By the way, we currently return optional values, but in a real parser, we
probably want to have more detailed information about what went wrong.
func multiplication() -> Int? {
if let lhs = int() {
if scanner.scanString("*", into: nil) {
if let rhs = int() {
return lhs * rhs
}
}
}
}
10:44 We left out all the else
cases. If we can't parse the first
integer, we return nil
. If we can parse the left-hand side, but not the
right-hand side, we can return just the left-hand side (because the *
and the
right-hand side are optional).
In the case that we scan both the left-hand side and the *
, but not the
right-hand side, we have an invalid expression. We could choose to return nil
,
but our scanner has already increased its location, so we'd need to reset it. In
order to do that, we need to store the initial location of where we started
parsing the multiplication expression:
func multiplication() -> Int? {
let oldLocation = scanner.scanLocation
if let lhs = int() {
if scanner.scanString("*", into: nil) {
if let rhs = int() {
return lhs * rhs
} else {
scanner.scanLocation = oldLocation
return nil
}
} else {
return lhs
}
} else {
return nil
}
}
Refactoring
12:16 This gets tricky, and it's hard to see what's going on here, which
is one of the problems when you handwrite a parser: it's difficult to make it
correct and to encode all the behavior. When you read the multiplication
parser we just wrote, aside from the name, it's not obvious that it reflects the
multiplicationExpression
rule in our grammar. However, we can still improve
the code by using guard statements:
func multiplication() -> Int? {
let oldLocation = scanner.scanLocation
guard let lhs = int() else { return nil }
guard scanner.scanString("*", into: nil) else { return lhs }
guard let rhs = int() else {
scanner.scanLocation = oldLocation
return nil
}
return lhs * rhs
}
13:53 To parse an addition expression, we can copy-paste this code. But
first, we should test that everything works:
let parser = Parser("23*2")
parser.multiplication()
14:30 If we paste the definition of multiplication
and replace the
name and the operator, we end up with a parser that parses simple addition
expressions:
func addition() -> Int? {
let oldLocation = scanner.scanLocation
guard let lhs = int() else { return nil }
guard scanner.scanString("+", into: nil) else { return lhs }
guard let rhs = int() else {
scanner.scanLocation = oldLocation
return nil
}
return lhs + rhs
}
Parsing More Complicated Expressions
14:55 However, we can't yet parse expressions such as 2*3
. This is
because our addition
method doesn't reflect the grammar. Instead of parsing
with int()
, we should be parsing a multiplication expression:
func addition() -> Int? {
let oldLocation = scanner.scanLocation
guard let lhs = multiplication() else { return nil }
guard scanner.scanString("+", into: nil) else { return lhs }
guard let rhs = multiplication() else {
scanner.scanLocation = oldLocation
return nil
}
return lhs + rhs
}
15:29 The simple example we've shown demonstrates how tricky it is to
handwrite a parser. It's easy to forget things, it's easy to write code that
doesn't exactly match the grammar, and it's hard to diagnose both those things
without a lot of test cases.
Parser Combinators
16:04 Let's look at solving the same problem using parser combinators.
We'll take a lot of shortcuts and write the parsers straightaway without
explaining all the operators we're using. In a future episode, we'll explain how
parser combinators work and really dive into that topic. But for now, we simply
want to show that if you squint while reading the parser-combinator code, you'll
recognize the grammar.
16:46 The first step is parsing an integer. We used the Scanner
class
for this in our handwritten parser, and we got it for free. But here, we'll
write an integer parser from scratch because it's a good example of parser
combinators.
17:07 First, we'll parse a single digit. We can use the character
parser to verify the character is a member of the decimalDigits
character set:
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
digit.run("123")
17:52 The output is a bit cryptic: it's a tuple that contains the
parsed value and the remaining string to be parsed. Because we're only parsing a
single digit, the next step is parsing multiple digits in a row by using the
many
combinator:
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
digits.run("123")
18:40 If we run digits
, we get back an array of characters rather
than a single digit. The next step is combining all the digits into a string by
using map
:
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
let digitString = digits.map { characters in String(characters) }
digits.run("123")
19:23 Finally, we want to convert the string into an integer by using
map
again. We know that we'll always pass a valid integer string into the
Int
initializer, so we can force-unwrap the result. (One caveat: we're using
the many
combinator, which also parses an empty string. For this code to be
correct, we should use many1
, which parses at least one digit.)
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let digits = digit.many
let digitString = digits.map { characters in String(characters) }
let int = digitString.map { Int($0)! }
int.run("123")
20:08 We can make the code much shorter by inlining the intermediate
steps:
let digit = character(condition: { CharacterSet.decimalDigits.contains($0.unicodeScalar) })
let int = digit.many.map { characters in Int(String(characters))! }
Multiplication Expressions
20:49 Looking at our grammar again, the next step is a multiplication
expression. There's a lot of magic involved here, and if you're not used to
parser combinators, the code will be hard to understand at first.
21:14 A multiplication expression is an integer, followed by a *
,
followed by another integer. To combine the parts into a result, we can use the
<^>
operator and the curry
function. The actual calculation is simple: it's
a function that multiplies its two arguments. The code below almost works:
let multiplication = curry({ x, y in x*y }) <^> int <*> (string("*") <*> int)
22:15 We need to ignore the result of parsing the *
string. To do
that, we can use the *>
operator, which parses but ignores the left-hand side.
Finally, we can run our parser:
let multiplication = curry({ x, y in x*y }) <^> int <*> (string("*") *> int)
multiplication.run("2*3")
22:54 We can no longer parse a single integer using our multiplication
parser because we didn't make the second part of our parser optional. We fix
this by using the optional
combinator and providing a default value in case
the second parser returns nil
.
let multiplication = curry({ x, y in x * (y ?? 1) }) <^> int <*> (string("*") *> int).optional
23:34 To add addition, we'll copy and paste the multiplication
parser
and replace the name and operators:
let addition = curry({ x, y in x + (y ?? 0) }) <^> int <*> (string("+") *> int).optional
23:58 We can test it to verify that it works. However, just like
before, we can only parse integer operands. To fix this, we need to use the
multiplication
parser from within addition
:
let addition = curry({ x, y in x + (y ?? 0) }) <^> multiplication <*> (string("+") *> multiplication).optional
24:26 If you look to the right-hand side of the <^>
operator, you can
see the grammar — an int, followed by an optional string and an int — and you
can see the operator string gets ignored. In a way, it's very readable, and at
the same time, it's very unreadable if you're not used to these operators.
24:48 This demonstrates that you can write a parser with very little
code by using parser combinators. The parsers are very close to the grammar. If
you're not used to them, they're also very magical, and in a future episode,
we'll have a look at how this works. It'll be a challenge for those of us who
are used to object-oriented design rather than functional programming.