Swift Strings and Substrings

We write a simple CSV parser as an example demonstrating how to work with Swift's String and Substring types.

00:00 Let's talk about Swift strings. As an example, we're going to write a CSV parser. Of course, you could use a library for this, but we'll build it as an example of some of the complicated things related to Swift strings. In this episode, we're not focusing on how to parse CSV files correctly; instead, we want to use the example to look at all the pitfalls of dealing with Swift strings.

Parsing a Single Line

00:43 We're using a test-driven approach, and as the first test, we parse a single line of comma-separated values. To parse a line, we need to write a function, parse(line:). Here's our first test:

func testLine() {
    let line = "one,2,three"
    XCTAssertEqual(parse(line: line), ["one", "2", "three"])
}

01:42 To make the test compile, we write an empty implementation:

func parse(line: String) -> [String] {
    return []
}

01:49 We could implement this in a number of different ways, but we'll start with a simple approach, using line.components(separatedBy:) to separate the line by commas. This returns an array of Strings:

func parse(line: String) -> [String] {
    return line.components(separatedBy: ",")
}

Working with Substrings

02:45 Alternatively, we could've used the split(separator:) method in the standard library. However, split returns an array of SubSequence, which is an associated type, and in the case of String, SubSequence is defined as Substring. Our return type is an array of Strings, so the code no longer compiles. If we decide to use split, we have to change the return type to [Substring]. In doing so, we're making our test slightly more complex, but it's still a smart move: a Substring is a reference to the original string's storage, combined with a start and end index, so by returning Substrings, we avoid copying string data around. Once we need an independent String, we can convert the Substring into one.
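To make the relationship between the two types concrete, here is a minimal sketch. The variable names are ours and are not part of the episode's code. It shows that split yields Substrings, that converting to String is an explicit step, and that a Substring shares its indices with the string it was sliced from:

```swift
let line = "one,2,three"

// split(separator:) returns [Substring]: slices of the original string.
let fields = line.split(separator: ",")

// Converting a slice into an independent String is explicit and copies the characters.
let first = String(fields[0])

// A Substring uses the same indices as its base string,
// so its bounds can be used to subscript the original string.
let same = line[fields[1].startIndex..<fields[1].endIndex]
```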

03:49 We have to change our return type, and type inference will make sure that the test compiles:

func testLine() {
    let line = "one,2,three"
    XCTAssertEqual(parse(line: line), ["one", "2", "three"])
}

04:17 If we change our input to include an empty field and change the expected test output, our test will break:

func testLine() {
    let line = "one,2,,three"
    XCTAssertEqual(parse(line: line), ["one", "2", "", "three"])
}

04:33 We now can use another version of split, without the default arguments for maxSplits and omittingEmptySubsequences. We set maxSplits to zero and set omittingEmptySubsequences to false in order to try and make our test pass. However, it fails unexpectedly. After analyzing the code, we realize it's because we set maxSplits to zero. Unlike in some older APIs, zero doesn't mean "as many splits as possible." We should have set maxSplits to Int.max, and once we actually do, our test passes:

func parse(line: String) -> [Substring] {
    return line.split(separator: ",", maxSplits: Int.max, omittingEmptySubsequences: false)
}
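The effect of the two parameters can be seen side by side in a short sketch (the constant names are ours, chosen for illustration):

```swift
let line = "one,2,,three"

// Default arguments: maxSplits is Int.max and empty subsequences are omitted.
let defaults = line.split(separator: ",")
// ["one", "2", "three"]

// maxSplits: 0 means "split zero times", so the whole line comes back as one element.
let noSplits = line.split(separator: ",", maxSplits: 0, omittingEmptySubsequences: false)
// ["one,2,,three"]

// Keeping empty subsequences preserves the empty field between the two commas.
let keepEmpty = line.split(separator: ",", omittingEmptySubsequences: false)
// ["one", "2", "", "three"]
```

Because maxSplits already defaults to Int.max, passing only omittingEmptySubsequences: false is enough to get the behavior the test expects.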

Parsing Multiple Lines

05:48 Now we'll write another test that parses multiple lines. We add a newline in the input, and we expect another row of output:

func testLines() {
    let line = "one,2,,three\nfour,five"
    XCTAssertEqual(parse(lines: line), [["one", "2", "", "three"], ["four","five"]])
}

We start with an empty implementation:

func parse(lines: String) -> [[Substring]] {
    return []
}

06:42 Here we take an approach similar to the one in parse(line:) and use split(separator:) combined with a map to parse each line. However, the first split gives us an array of Substrings, and parse(line:) expects a String. Again, we need to think about whether our API should take a String or a Substring as its input. As parse(line:) is an internal function, we'll let it take a Substring, and we won't change parse(lines:):

func parse(lines: String) -> [[Substring]] {
    return lines.split(separator: "\n").map { line in
        parse(line: line)
    }
}

08:09 We also need to change our test to pass in a Substring, and then all tests pass:

func testLine() {
    let line = "one,2,,three" as Substring
    XCTAssertEqual(parse(line: line), ["one", "2", "", "three"])
}

08:47 To ensure we can also use a carriage return (CR) and a line feed (LF) as our newline, we'll write another test:

func testLinesWithCRLF() {
    let line = "one,2,,three\r\nfour,five"
    XCTAssertEqual(parse(lines: line), [["one", "2", "", "three"], ["four","five"]])
}

08:54 Now we have a failing test case, and we can start to fix the implementation. To solve this, we should change the separator to be either a newline or a carriage return. We call a different version of split, which takes a function instead of a single character:

func parse(lines: String) -> [[Substring]] {
    return lines.split(whereSeparator: { char in
        char == "\n" || char == "\r"
    }).map { line in
        parse(line: line)
    }
}

09:32 Unfortunately, our test still fails. The combined \r\n wasn't treated as a separator at all, because Unicode's grapheme rules combine the two scalars into a single character, and that character is equal to neither "\r" nor "\n". In most programming languages, you only get access to the scalars, but Swift's string type gives you access to the characters. This is why we have to change the way we test whether or not a character is a newline. Admittedly, it's a bit tricky if you're not used to it:

func parse(lines: String) -> [[Substring]] {
    return lines.split(whereSeparator: { char in
        switch char {
        case "\r", "\n", "\r\n": return true
        default: return false
        }
    }).map { line in
        parse(line: line)
    }
}
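The behavior described above can be checked directly. The following sketch uses names of our own choosing, and its last line assumes Swift 5 or later, where Character has an isNewline property that could stand in for the hand-written switch:

```swift
let crlf = "\r\n"

// One Character (grapheme cluster) made of two Unicode scalars.
let characterCount = crlf.count                 // 1
let scalarCount = crlf.unicodeScalars.count     // 2

// The single Character "\r\n" is not equal to "\n" on its own.
let crlfChar: Character = "\r\n"
let matchesLF = crlfChar == "\n"                // false

// Assumption: Swift 5+, where Character.isNewline covers "\n", "\r", and "\r\n".
let rows = "a,b\r\nc,d\ne,f".split(whereSeparator: { $0.isNewline })
```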

11:25 It's interesting that the code we've written so far doesn't use any string indices. In other languages, you might start incrementing an index and accessing the characters of the string that way. However, Swift's string indices are not integers, and they can be tricky to work with if you're used to other programming languages. So far, we've done all our processing using high-level methods on String.
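For comparison, this is roughly what index-based access looks like. It's a sketch rather than code from the episode: splitAtFirstComma is a hypothetical helper, and it uses firstIndex(of:), the name newer Swift versions give to index(of:):

```swift
let text = "one,2,three"

// String.Index is an opaque type, so text[4] does not compile.
// Instead, an index is derived from an existing one by offsetting it.
let fifth = text.index(text.startIndex, offsetBy: 4)
let character = text[fifth]   // "2"

// Hypothetical helper: searching returns an index, which can be used in range subscripts.
func splitAtFirstComma(_ text: String) -> (head: Substring, tail: Substring)? {
    guard let comma = text.firstIndex(of: ",") else { return nil }
    return (text[..<comma], text[text.index(after: comma)...])
}

let parts = splitAtFirstComma(text)   // head "one", tail "2,three"
```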

Handling Quoted Fields

12:31 Let's write a test for quoted fields. It's possible to have quoted fields in CSV:

func testLineWithQuotes() {
    let line = "one,\"quote\",2,,three" as Substring
    XCTAssertEqual(parse(line: line), ["one", "quote", "2", "", "three"])
}

13:17 To make this work, we have to change our parse(line:) method. Currently, we only split the rows by commas. Instead of directly returning our fields, we can map over the result of splitting the string and remove any surrounding quotes:

func parse(line: Substring) -> [Substring] {
    return line.split(separator: ",", maxSplits: Int.max, omittingEmptySubsequences: false).map { field in
        if field.first == "\"" && field.last == "\"" {
            var result = field
            result.removeFirst()
            result.removeLast()
            return result
        } else {
            return field
        }
    }
}

15:06 There are different ways to write the method above, but we'll leave it as is. Now let's break the test by introducing a comma in a quoted field:

func testLineWithQuotes() {
    let line = "one,\"qu,ote\",2,,three" as Substring
    XCTAssertEqual(parse(line: line), ["one", "qu,ote", "2", "", "three"])
}

15:41 The test fails because we're trying to first split the line with commas as separators and then process the quoted fields. Instead, we should process the line field by field. If the field starts with a quote, we should look until we see the next quote; if it doesn't start with a quote, we look until we see the comma. The approach we've taken thus far, using the split method, no longer works, so the easiest way to do this is by iteratively removing parts of the substring until it's empty. We'll write this as a mutating method on Substring:

extension Substring {
    mutating func parseField() -> Substring {
        assert(!self.isEmpty)
        // todo
    }
}

17:22 We'll use this method inside parse(line:):

func parse(line: Substring) -> [Substring] {
    var remainder = line
    var result: [Substring] = []
    while !remainder.isEmpty {
        result.append(remainder.parseField())
    }
    return result
}

18:02 In the code above, we can see why parseField has to be a mutating method: it returns the parsed field, and at the same time, it removes the field from the substring. Inside parseField(), we should switch on the first character of the string. We know the string isn't empty, so we can use startIndex to access the character:

mutating func parseField() -> Substring {
    assert(!self.isEmpty)
    switch self[startIndex] {
    case "\"":
        // todo
    default:
        // todo
    }
}

18:48 Inside the default case, we need to find a comma and read until that point. If we don't find a comma, we should read until the end of the string, as it's the last field of the row. We'll use index(of:) to find the position of the first comma, and if we don't find a comma, we return the entire remaining string. At the same time, we need to clear everything from self. When we have a comma, we want to return everything up to the comma and remove everything up to and including the comma. We remove the comma by setting self to the suffix after the comma:

default:
    if let commaIdx = index(of: ",") {
        let result = prefix(upTo: commaIdx)
        self = self[index(after: commaIdx)...]
        return result
    } else {
        let result = self
        removeAll()
        return result
    }

21:58 In the case where we see a quote, the first thing we have to do is remove that quote character from self. Then, we start looking for the next quote, which closes the quotation. If we can't find a closing quote, it means the file is malformed, and we should throw an error. As a temporary measure, we use fatalError instead:

case "\"":
    removeFirst()
    guard let quoteIdx = index(of: "\"") else {
        fatalError("expected quote") // todo throws
    }

22:59 In a strict mode, we might want to throw, and in a non-strict mode, we might want to ignore that error and try to continue parsing. With quoteIdx, we can parse the rest of the field: the result will be the substring up until quoteIdx. We also need to change self: if the current value is empty, we've reached the end of the line, and we don't need to do anything. If the current value isn't empty, the next character should be a comma:

case "\"":
    removeFirst()
    guard let quoteIdx = index(of: "\"") else {
        fatalError("expected quote") // todo throws
    }
    let result = prefix(upTo: quoteIdx)
    self = self[index(after: quoteIdx)...]
    if !isEmpty {
        let comma = removeFirst()
        assert(comma == ",") // todo throws
    }
    return result
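The two todo comments above hint at a throwing version. One possible shape is sketched below, with loud assumptions: CSVError and parseQuotedField are hypothetical names that don't appear in the episode, the method covers only the quoted branch, and firstIndex(of:) is the newer name for index(of:):

```swift
// Hypothetical error type; the episode does not define one.
enum CSVError: Error, Equatable {
    case missingClosingQuote
    case expectedComma
}

extension Substring {
    // Throwing variant of the quoted-field branch only.
    // Expects self to start with an opening quote.
    mutating func parseQuotedField() throws -> Substring {
        precondition(first == "\"")
        removeFirst()
        guard let quoteIdx = firstIndex(of: "\"") else {
            throw CSVError.missingClosingQuote
        }
        let result = self[..<quoteIdx]
        self = self[index(after: quoteIdx)...]
        if !isEmpty {
            // After the closing quote, only a comma may follow.
            guard removeFirst() == "," else { throw CSVError.expectedComma }
        }
        return result
    }
}
```

A non-strict mode could catch these errors at the call site and decide how to continue, rather than aborting as fatalError does.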

Refactoring

25:36 To clean up the code, we'd like to have a method called remove(upToAndIncluding:):

extension Substring {
    mutating func remove(upToAndIncluding idx: Index) {
        self = self[index(after: idx)...]
    }
}

We can now replace the two instances where we assigned self = self[index(after: ...)...], once after the comma and once after the closing quote, with calls to our new method.
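Putting the pieces from this episode together gives roughly the following. This is our assembly of the snippets shown so far, not a listing from the episode, and it assumes a recent Swift version, so it uses firstIndex(of:) where the episode uses index(of:):

```swift
extension Substring {
    // Drops everything up to and including the element at idx.
    mutating func remove(upToAndIncluding idx: Index) {
        self = self[index(after: idx)...]
    }

    // Removes the next field from the front of self and returns it.
    mutating func parseField() -> Substring {
        assert(!isEmpty)
        switch self[startIndex] {
        case "\"":
            removeFirst()
            guard let quoteIdx = firstIndex(of: "\"") else {
                fatalError("expected quote") // todo throws
            }
            let result = prefix(upTo: quoteIdx)
            remove(upToAndIncluding: quoteIdx)
            if !isEmpty {
                let comma = removeFirst()
                assert(comma == ",") // todo throws
            }
            return result
        default:
            if let commaIdx = firstIndex(of: ",") {
                let result = prefix(upTo: commaIdx)
                remove(upToAndIncluding: commaIdx)
                return result
            } else {
                // Last field of the row: return everything and empty self.
                let result = self
                removeAll()
                return result
            }
        }
    }
}

func parse(line: Substring) -> [Substring] {
    var remainder = line
    var result: [Substring] = []
    while !remainder.isEmpty {
        result.append(remainder.parseField())
    }
    return result
}

let fields = parse(line: "one,\"qu,ote\",2,,three")
// ["one", "qu,ote", "2", "", "three"]
```

Note that because the loop stops as soon as the remainder is empty, a line ending in a comma doesn't produce a final empty field.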

27:35 There are a number of possible improvements to what we showed. We don't support escaped quotes within quoted fields yet, nor have we looked at the performance, which is an interesting exercise on its own. Still, our current version isn't completely naive.