0:07 For our Ledger file, we needed to parse expressions (and parse the
rest of the file). Parsers might seem like an obscure topic, but once you start
thinking about it, you can use them in many cases. As such, knowing how to write
a parser can really help.
0:36 For example, formulas are parsed in spreadsheet applications. And in
an app like Photoshop, you can enter expressions like 100/2 into text fields
and the result is evaluated.
1:00 Of course, the Swift compiler also parses our code, and
understanding how parsers work helps us in understanding the entire build chain.
It might seem like it's complicated to write a parser, but in this episode,
we'll show that it's not. Interestingly enough, many people use regular
expressions for tasks where parsers would be more appropriate (and easier).
Additionally, regular expressions are strictly more limited in what they can do.
Because of these things, parsers are good tools to have in your tool belt.
1:57 Let's start by defining a grammar for a simple expression language,
which allows us to express simple arithmetic expressions like 5+2*3. The
grammar contains the rules for valid expressions in this little language, and
the precedence of the operators is encoded in the grammar. The precedence
ensures that the expression above is parsed as 5+(2*3) rather than (5+2)*3,
because * has a different precedence than +.
2:48 We try not to think about precedence because we learned it in
school, but with obscure operators, we sometimes have to think about it (and add
parentheses if we're unsure).
3:05 Let's consider multiplication expressions first, such as 2*3. The
grammar looks like this:
multiplicationExpression = int "*" int
3:32 To parse expressions like 5*6 + 2*3, we can define an additional
rule in our grammar:
4:00 However, this grammar doesn't allow for expressions like 5+2*3.
Our multiplication expression always needs to be of the form int*int, and 5
doesn't have the star symbol and the second integer. We can fix that by making
the final part optional:
multiplicationExpression = int ("*" int)?
4:25 For addition, we have to do the same thing, and we end up with the
additionExpression = multiplicationExpression ("+" multiplicationExpression)?
multiplicationExpression = int ("*" int)?
4:36 This is the pseudo code for our grammar, and it's incomplete. We
can't write expressions like 2*3*4, but let's just stick with the current
4:48 To turn this pseudo grammar into a parser, we could use parser
generators such as Bison, YACC, or Lemon. This approach is used by
SQLite, among others.
5:26 A second approach is to write a parser manually. This is what most
programming languages, such as the Swift parser, do. It gives you full control,
and you can hand tune manual parsers to be very fast. The problem is that they
can get very complicated; it's not easy to write them in such a way that they're
5:59 The third approach is to use parser combinators. This approach
comes from functional programming, and it's a way of writing parsers by
combining them out of other parsers. Because of all the operators and types,
they're strange (but interesting).
Writing a Parser by Hand
6:16 We'll ignore the parser generators for today. Instead, with the
help of NSScanner, let's start by writing a parser for our current grammar by
hand, and then we can have a look at the more obscure technique of parser
6:45 We start by creating the class Parser for our parser. We need to
store a String and the current parsing location, and we can use the Scanner
class that's in Foundation. A Scanner takes a String and a position and
provides us with methods to parse simple things like an Int. We can create the
Scanner in our initializer:
7:35 To parse a single integer, we could add a method on Parser, which
tries to parse an integer. The result type is optional, in case it can't parse
the integer. To implement that method, we can directly use Scanner to scan the
integer. Scanner has an old-fashioned API in which you need to pass the result
as a mutable pointer. Only if the scanInt returns true can we return the
result; otherwise we return nil:
8:58 Looking back at our grammar, we want to create the parser for a
multiplicationExpression. We can add another method, multiplication, which
also returns an optional integer. We start by parsing a single integer. Then we
try to parse a * symbol, and finally, we parse the right-hand-side integer and
return the multiplication of the two operands.
By the way, we currently return optional values, but in a real parser, we
probably want to have more detailed information about what went wrong.
10:44 We left out all the else cases. If we can't parse the first
integer, we return nil. If we can parse the left-hand side, but not the
right-hand side, we can return just the left-hand side (because the * and the
right-hand side are optional).
In the case that we scan both the left-hand side and the *, but not the
right-hand side, we have an invalid expression. We could choose to return nil,
but our scanner has already increased its location, so we'd need to reset it. In
order to do that, we need to store the initial location of where we started
parsing the multiplication expression:
12:16 This gets tricky, and it's hard to see what's going on here, which
is one of the problems when you handwrite a parser: it's difficult to make it
correct and to encode all the behavior. When you read the multiplication
parser we just wrote, aside from the name, it's not obvious that it reflects the
multiplicationExpression rule in our grammar. However, we can still improve
the code by using guard statements:
14:55 However, we can't yet parse expressions such as 2*3. This is
because our addition method doesn't reflect the grammar. Instead of parsing
with int(), we should be parsing a multiplication expression:
15:29 The simple example we've shown demonstrates how tricky it is to
handwrite a parser. It's easy to forget things, it's easy to write code that
doesn't exactly match the grammar, and it's hard to diagnose both those things
without a lot of test cases.
16:04 Let's look at solving the same problem using parser combinators.
We'll take a lot of shortcuts and write the parsers straightaway without
explaining all the operators we're using. In a future episode, we'll explain how
parser combinators work and really dive into that topic. But for now, we simply
want to show that if you squint while reading the parser-combinator code, you'll
recognize the grammar.
16:46 The first step is parsing an integer. We used the Scanner class
for this in our handwritten parser, and we got it for free. But here, we'll
write an integer parser from scratch because it's a good example of parser
17:07 First, we'll parse a single digit. We can use the character
parser to verify the character is a member of the decimalDigits character
17:52 The output is a bit cryptic: it's a tuple that contains the
parsed value and the remaining string to be parsed. Because we're only parsing a
single digit, the next step is parsing multiple digits in a row by using the
19:23 Finally, we want to convert the string into an integer by using
map again. We know that we'll always pass a valid integer string into the
Int initializer, so we can force-unwrap the result. (One caveat: we're using
the many combinator, which also parses an empty string. For this code to be
correct, we should use many1, which parses at least one
20:49 Looking at our grammar again, the next step is a multiplication
expression. There's a lot of magic involved here, and if you're not used to
parser combinators, the code will be hard to understand at first.
21:14 A multiplication expression is an integer, followed by a *,
followed by another integer. To combine the parts into a result, we can use the
<^> operator and the curry function. The actual calculation is simple: it's
a function that multiplies its two arguments. The code below almost works:
22:54 We can no longer parse a single integer using our multiplication
parser because we didn't make the second part of our parser optional. We fix
this by using the optional combinator and providing a default value in case
the second parser returns
24:26 If you look to the right-hand side of the <^> operator, you can
see the grammar — an int, followed by an optional string and an int — and you
can see the operator string gets ignored. In a way, it's very readable, and at
the same time, it's very unreadable if you're not used to these operators.
24:48 This demonstrates that you can write a parser with very little
code by using parser combinators. The parsers are very close to the grammar. If
you're not used to them, they're also very magical, and in a future episode,
we'll have a look at how this works. It'll be a challenge for those of us who
are used to object-oriented design rather than functional programming.