00:00 Let's talk about Swift strings. As an example, we're going to write
a CSV parser. Of course, you could use a library for this, but we'll build it as
an example of some of the complicated things related to Swift strings. In this
episode, we're not focusing on how to parse CSV files correctly; instead, we
want to use the example to look at all the pitfalls of dealing with Swift
strings.
Parsing a Single Line
00:43 We're using a test-driven approach, and as the first test, we parse
a single line of comma-separated values. To parse a line, we need to write a
function, parse(line:). Here's our first test:
    func testLine() {
        let line = "one,2,three"
        XCTAssertEqual(parse(line: line), ["one", "2", "three"])
    }
01:42 To make the test compile, we write an empty implementation:
    func parse(line: String) -> [String] {
        return []
    }
01:49 We could implement this in a number of different ways, but we'll
start with a simple approach, using line.components(separatedBy:) to separate
the line by commas. This returns an array of Strings:
    func parse(line: String) -> [String] {
        return line.components(separatedBy: ",")
    }
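As a minimal, self-contained sketch of this first version: components(separatedBy:) comes from Foundation rather than the standard library, so the import is needed outside of an Xcode test target that already pulls it in. The second call is only an illustration of how this method treats empty fields.

```swift
import Foundation  // components(separatedBy:) is a Foundation method

func parse(line: String) -> [String] {
    // Splits at every comma; empty fields are kept as empty strings.
    return line.components(separatedBy: ",")
}

let simple = parse(line: "one,2,three")     // ["one", "2", "three"]
let withEmpty = parse(line: "one,2,,three") // ["one", "2", "", "three"]
```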
Working with Substrings
02:45 Alternatively, we could've used the split(separator:) method in
the standard library. However, split returns an array of SubSequence values.
SubSequence is an associated type, and in the case of String, it's defined as
Substring. Our return type is an array of Strings, so the code no longer
compiles. If we decide to use split, we have to change the return type to
[Substring]. In doing so, we're making our test slightly more complex, but
it's still a smart move, because a Substring is a pointer to the original
string, combined with a start and end index, and by returning Substrings, we
don't need to copy as many Strings around. When we need a String later on, we
can convert the Substring explicitly.
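To make the relationship between String and Substring concrete, here's a small sketch. The variable names are only for illustration; the point is that the fields share indices with the original string and that the copy into a String is an explicit step.

```swift
let line = "one,2,three"

// split returns [Substring]; with the default arguments, empty fields are dropped.
let fields = line.split(separator: ",")

// A Substring uses the indices of the string it was sliced from.
let first: Substring = fields[0]
let sharesStart = first.startIndex == line.startIndex
let sharesEnd = fields[2].endIndex == line.endIndex

// Converting to String is explicit and copies the characters.
let owned = String(first)
```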
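03:49 We change the return type of parse(line:) to [Substring]. The test
itself stays the same: type inference now treats the string literals in the
expected array as Substrings, so the test still compiles:
    func testLine() {
        let line = "one,2,three"
        XCTAssertEqual(parse(line: line), ["one", "2", "three"])
    }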
04:17 If we change our input to include an empty field and change the
expected test output, our test will break:
    func testLine() {
        let line = "one,2,,three"
        XCTAssertEqual(parse(line: line), ["one", "2", "", "three"])
    }
04:33 We can now use another version of split, one where we pass
maxSplits and omittingEmptySubsequences explicitly instead of relying on their
defaults. We set maxSplits to zero and omittingEmptySubsequences to false in
order to try and make our test pass. However, it fails unexpectedly. After
analyzing the code, we realize it's because we set maxSplits to zero, which
means no splits at all. Unlike in some older APIs, zero doesn't mean "as many
splits as possible." We should have set maxSplits to Int.max, and once we
actually do, our test passes:
    func parse(line: String) -> [Substring] {
        return line.split(separator: ",", maxSplits: Int.max, omittingEmptySubsequences: false)
    }
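The effect of the two parameters is easiest to see side by side. This is just a sketch with illustrative variable names; note that Int.max is also the default for maxSplits, so leaving that argument out gives the same result as passing it.

```swift
let line = "one,2,,three"

// Defaults: maxSplits is Int.max, omittingEmptySubsequences is true.
let defaults = line.split(separator: ",")
// ["one", "2", "three"] (the empty field is dropped)

// maxSplits: 0 performs no splits, so we get the whole line back as one element.
let zero = line.split(separator: ",", maxSplits: 0, omittingEmptySubsequences: false)
// ["one,2,,three"]

// Keeping empty subsequences and not limiting the number of splits.
let all = line.split(separator: ",", omittingEmptySubsequences: false)
// ["one", "2", "", "three"]
```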
Parsing Multiple Lines
05:48 Now we'll write another test that parses multiple lines. We add a
newline in the input, and we expect another row of output:
    func testLines() {
        let line = "one,2,,three\nfour,five"
        XCTAssertEqual(parse(lines: line), [["one", "2", "", "three"], ["four", "five"]])
    }
We start with an empty implementation:
    func parse(lines: String) -> [[Substring]] {
        return []
    }
06:42 Here we take an approach similar to the one in parse(line:) and
use split(separator:) combined with a map to parse each line. However, the
first split gives us an array of Substrings, and parse(line:) expects a
String. Again, we need to think about whether our API should take a String
or a Substring as its input. As parse(line:) is an internal function, we'll
let it take a Substring, and we won't change parse(lines:):
    func parse(lines: String) -> [[Substring]] {
        return lines.split(separator: "\n").map { line in
            parse(line: line)
        }
    }
08:09 We also need to change our test to pass in a Substring, and then
all tests pass:
    func testLine() {
        let line = "one,2,,three" as Substring
        XCTAssertEqual(parse(line: line), ["one", "2", "", "three"])
    }
08:47 To ensure we can also use a carriage return (CR) followed by a line
feed (LF) as our newline, we'll write another test:
    func testLinesWithCRLF() {
        let line = "one,2,,three\r\nfour,five"
        XCTAssertEqual(parse(lines: line), [["one", "2", "", "three"], ["four", "five"]])
    }
08:54 Now we have a failing test case, and we can start to fix the
implementation. To solve this, we should change the separator to be either a
newline or a carriage return. We call a different version of split, which
takes a function instead of a single character:
    func parse(lines: String) -> [[Substring]] {
        return lines.split(whereSeparator: { char in
            char == "\n" || char == "\r"
        }).map { line in
            parse(line: line)
        }
    }
09:32 Unfortunately, our test still fails. The combined \r\n wasn't
treated as a separator, because Unicode combines the two scalars into a
single character (a grapheme cluster), and that character is equal to neither
"\n" nor "\r". In most programming languages, you only get access to the
scalars, but Swift's string type gives you access to the characters. This is why
we have to change our closure to also test for the combined character.
Admittedly, it's a bit tricky if you're not used to it:
    func parse(lines: String) -> [[Substring]] {
        return lines.split(whereSeparator: { char in
            switch char {
            case "\r", "\n", "\r\n": return true
            default: return false
            }
        }).map { line in
            parse(line: line)
        }
    }
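A quick sketch shows why the two-way comparison missed the CRLF case. The splitLines helper is a hypothetical name that isolates the separator logic from above. The last line uses Character.isNewline, which is available from Swift 5 and covers a few more newline characters than the three we test for, so it could be an alternative to the switch if that broader behavior is acceptable.

```swift
let crlf = "\r\n"
let characterCount = crlf.count              // 1: CR followed by LF is one Character
let scalarCount = crlf.unicodeScalars.count  // 2: but it consists of two Unicode scalars

// Hypothetical helper: the same separator test as in parse(lines:).
func splitLines(_ text: String) -> [Substring] {
    return text.split(whereSeparator: { char in
        switch char {
        case "\r", "\n", "\r\n": return true
        default: return false
        }
    })
}

let rows = splitLines("one,2\r\nfour,five\nsix")
// ["one,2", "four,five", "six"]

// Swift 5 and later: the standard library's own newline check.
let combinedIsNewline = ("\r\n" as Character).isNewline
```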
11:25 It's interesting that the code we've written so far doesn't use
any string indices. In other languages, you might start incrementing an index
and accessing the characters of the string that way. However, Swift's string
indices are not integers, and they can be tricky to work with if you're used to
other programming languages. So far, we've done all our processing using
high-level methods on String.
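For comparison, here's a sketch of what index-based access looks like. The names are illustrative; firstIndex(of:) is the Swift 4.2 and later spelling of index(of:).

```swift
let line = "one,2,three"

// line[3] doesn't compile: String is subscripted with String.Index, not Int.
// We have to derive an index from an existing one instead.
let comma = line.index(line.startIndex, offsetBy: 3)
let separator = line[comma]         // ","

// Ranges of indices give us Substrings.
let firstField = line[..<comma]     // "one"

// Searching returns an index as well, which we can use directly.
let found = line.firstIndex(of: ",")
```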
Handling Quoted Fields
12:31 Let's write a test for quoted fields. It's possible to have quoted
fields in CSV:
    func testLineWithQuotes() {
        let line = "one,\"quote\",2,,three" as Substring
        XCTAssertEqual(parse(line: line), ["one", "quote", "2", "", "three"])
    }
13:17 To make this work, we have to change our parse(line:) method.
Currently, we only split the rows by commas. Instead of directly returning our
fields, we can map over the result of splitting the string and remove any
surrounding quotes:
    func parse(line: Substring) -> [Substring] {
        return line.split(separator: ",", maxSplits: Int.max, omittingEmptySubsequences: false).map { field in
            if field.first == "\"" && field.last == "\"" {
                var result = field
                result.removeFirst()
                result.removeLast()
                return result
            } else {
                return field
            }
        }
    }
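One possible alternative for the quote-stripping step, sketched as a separate helper: unquote is a hypothetical name, and the count check is an addition that guards against a field consisting of a single quote character, for which removing both the first and the last character would fail.

```swift
// Hypothetical helper: strips one pair of surrounding quotes, if present.
func unquote(_ field: Substring) -> Substring {
    guard field.count >= 2, field.first == "\"", field.last == "\"" else {
        return field
    }
    // dropFirst and dropLast return slices, so nothing is copied here.
    return field.dropFirst().dropLast()
}

let quoted = unquote("\"quote\"")   // "quote"
let plain = unquote("2")            // "2"
let lone = unquote("\"")            // "\"" (left untouched)
```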
15:06 There are different ways to write the method above, but we'll
leave it as is. Now let's break the test by introducing a comma in a quoted
field:
    func testLineWithQuotes() {
        let line = "one,\"qu,ote\",2,,three" as Substring
        XCTAssertEqual(parse(line: line), ["one", "qu,ote", "2", "", "three"])
    }
15:41 The test fails because we're trying to first split the line with
commas as separators and then process the quoted fields. Instead, we should
process the line field by field. If the field starts with a quote, we should
look until we see the next quote; if it doesn't start with a quote, we look
until we see the comma. The approach we've taken thus far, using the split
method, no longer works, so the easiest way to do this is by iteratively
removing parts of the substring until it's empty. We'll write this as a
mutating method on Substring:
    extension Substring {
        mutating func parseField() -> Substring {
            assert(!self.isEmpty)
            // The body is filled in step by step below.
        }
    }
17:22 We'll use this method inside parse(line:):
    func parse(line: Substring) -> [Substring] {
        var remainder = line
        var result: [Substring] = []
        while !remainder.isEmpty {
            result.append(remainder.parseField())
        }
        return result
    }
18:02 In the code above, we can see why parseField has to be a
mutating method: it returns the parsed field, and at the same time, it removes
the field from the substring. Inside parseField(), we should switch on the
first character of the string. We know the string isn't empty, so we can use
startIndex to access the character:
    mutating func parseField() -> Substring {
        assert(!self.isEmpty)
        switch self[startIndex] {
        case "\"":
            // quoted field: filled in below
        default:
            // unquoted field: filled in below
        }
    }
18:48 Inside the default case, we need to find a comma and read until
that point. If we don't find a comma, we should read until the end of the
string, as it's the last field of the row. We'll use index(of:) (renamed to
firstIndex(of:) as of Swift 4.2) to find the position of the first comma, and
if we don't find a comma, we return the entire remaining string. At the same
time, we need to clear everything from self. When we have a comma, we want to
return everything up to the comma and remove everything up to and including
the comma. We remove the comma by setting self to the suffix after the comma:
    default:
        if let commaIdx = index(of: ",") {
            let result = prefix(upTo: commaIdx)
            self = self[index(after: commaIdx)...]
            return result
        } else {
            let result = self
            removeAll()
            return result
        }
21:58 In the case where we see a quote, the first thing we have to do
is remove that quote character from self. Then, we start looking for the next
quote, which closes the quotation. If we can't find a closing quote, it means
the file is malformed, and we should throw an error. As a temporary measure, we
use fatalError instead:
    case "\"":
        removeFirst()
        guard let quoteIdx = index(of: "\"") else {
            fatalError("expected quote")
        }
22:59 In a strict mode, we might want to throw, and in a non-strict
mode, we might want to ignore that error and try to continue parsing. With
quoteIdx, we can parse the rest of the field: the result will be the substring
up until quoteIdx. We also need to change self so that it starts after the
closing quote: if what remains is empty, we've reached the end of the line,
and we don't need to do anything. If it isn't empty, the next character
should be a comma, which we remove as well:
    case "\"":
        removeFirst()
        guard let quoteIdx = index(of: "\"") else {
            fatalError("expected quote")
        }
        let result = prefix(upTo: quoteIdx)
        self = self[index(after: quoteIdx)...]
        if !isEmpty {
            let comma = removeFirst()
            assert(comma == ",")
        }
        return result
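For the strict mode mentioned above, a throwing variant could look roughly like the following sketch. ParseError and parseQuotedField are hypothetical names that don't appear in the episode, and firstIndex(of:) is the Swift 4.2 and later spelling of index(of:).

```swift
// Hypothetical error type for malformed input.
enum ParseError: Error {
    case expectedQuote
}

extension Substring {
    // Hypothetical throwing version of the quoted-field case.
    mutating func parseQuotedField() throws -> Substring {
        assert(first == "\"")
        removeFirst() // the opening quote
        guard let quoteIdx = firstIndex(of: "\"") else {
            // Note: the opening quote has already been removed from self here.
            throw ParseError.expectedQuote
        }
        let result = prefix(upTo: quoteIdx)
        self = self[index(after: quoteIdx)...]
        if !isEmpty {
            let comma = removeFirst()
            assert(comma == ",")
        }
        return result
    }
}
```

A caller in non-strict mode could catch the error and decide how to continue, for example by treating the rest of the line as the field.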
Refactoring
25:36 To clean up the code, we'd like to have a method called
remove(upToAndIncluding:):
    extension Substring {
        mutating func remove(upToAndIncluding idx: Index) {
            self = self[index(after: idx)...]
        }
    }
We can now replace the two instances where we assigned
self = self[index(after: ...)...] (one after the comma, one after the closing
quote) with our new method.
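Putting the pieces from this episode together, the parser might look like the sketch below. It uses firstIndex(of:), the Swift 4.2 and later spelling of index(of:), and still has the limitations of the episode's version: a missing closing quote is a fatalError, and a trailing comma doesn't produce a final empty field.

```swift
extension Substring {
    // Drops everything up to and including the character at idx.
    mutating func remove(upToAndIncluding idx: Index) {
        self = self[index(after: idx)...]
    }

    // Removes the first field (and its trailing comma) from self and returns it.
    mutating func parseField() -> Substring {
        assert(!isEmpty)
        switch self[startIndex] {
        case "\"":
            removeFirst() // the opening quote
            guard let quoteIdx = firstIndex(of: "\"") else {
                fatalError("expected quote") // temporary, as in the episode
            }
            let result = prefix(upTo: quoteIdx)
            remove(upToAndIncluding: quoteIdx)
            if !isEmpty {
                let comma = removeFirst()
                assert(comma == ",")
            }
            return result
        default:
            if let commaIdx = firstIndex(of: ",") {
                let result = prefix(upTo: commaIdx)
                remove(upToAndIncluding: commaIdx)
                return result
            } else {
                // Last field of the row: return everything and empty self.
                let result = self
                removeAll()
                return result
            }
        }
    }
}

func parse(line: Substring) -> [Substring] {
    var remainder = line
    var result: [Substring] = []
    while !remainder.isEmpty {
        result.append(remainder.parseField())
    }
    return result
}

func parse(lines: String) -> [[Substring]] {
    return lines.split(whereSeparator: { char in
        switch char {
        case "\r", "\n", "\r\n": return true
        default: return false
        }
    }).map { line in
        parse(line: line)
    }
}

let quotedRow = parse(line: "one,\"qu,ote\",2,,three")
// ["one", "qu,ote", "2", "", "three"]
let table = parse(lines: "one,2,,three\r\nfour,five")
// [["one", "2", "", "three"], ["four", "five"]]
```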
27:35 There are a number of possible improvements to what we showed. We
don't support escaped quotes within quoted fields yet, nor have we looked at the
performance, which is an interesting exercise on its own. Still, our current
version isn't completely naive.