Sequence & Iterator

In this episode16:54

This episode is freely available thanks to the support of our subscribers

Subscribers get exclusive access to new and all previous subscriber-only episodes, video downloads, and 30% discount for team members. Become a Subscriber →

We make our collection extension even more generic by implementing it on the Sequence protocol.

00:06 In our last episode, we wrote the split(batchSize:) extension on Array, on ArraySlice, and on the Collection protocol. In this episode, we're going to dive deeper into Swift's collection protocol hierarchy and make our split method even more generic.

00:29 There are many types that don't conform to the Collection protocol but that would still benefit from a method like split(batchSize:). An example of this is the standard input: it makes sense to split lines from it into batches, but it doesn't make sense to subscript the standard input (and subscripting is a capability defined in the Collection protocol).

00:53 For use cases like this, Swift has the Sequence protocol. The only capability a sequence has is to ask for its next element, i.e. to iterate over it element by element. Sequences do not come with the promise that you can iterate over them multiple times and get the same elements out.

A Sequence from the File System

01:11 As an example of a sequence, we'll look at reading from the file system. You might have a very large file you don't want to read in its entirety; you'd rather read it batch by batch as you iterate through it. In order to have something to work with, we define a simple class that wraps reading from /dev/urandom so that we're reading from an infinitely large file:

import Foundation
final class ReadRandom: IteratorProtocol {
    let handle = FileHandle(forReadingAtPath: "/dev/urandom")!
    
    deinit {
        handle.closeFile()
    }
    
    func getByte() -> UInt8 {
        let data = handle.readData(ofLength: 1)
        return data[0]
    }
}

02:01 In this class, we open a file handle, for which we need to import Foundation. In the class' deinit, we close the file handle. Other than that, we only provide a single method, called getByte, which returns an UInt8 value from /dev/urandom.

02:58 Now we can instantiate ReadRandom and try it out:

let randomSource = ReadRandom()
randomSource.getByte()

Creating an Iterator

03:14 It turns out that with ReadRandom we've already implemented a type that almost conforms to IteratorProtocol. An iterator is the entity that lies below Sequence and provides the functionality of asking for a next element. To do this, IteratorProtocol only requires the next method, which returns an optional element of the sequence. The return type is optional, because with a finite number of elements, we'll return nil as soon as we've iterated over all available elements.

03:48 To make ReadRandom conform to IteratorProtocol, we have to add the protocol conformance to the declaration and rename getByte into next and change its return type to be an optional UInt8 (we'll never return nil here, because /dev/urandom is an infinite source of random bytes):

final class ReadRandom: IteratorProtocol {
    let handle = FileHandle(forReadingAtPath: "/dev/urandom")!
    
    deinit {
        handle.closeFile()
    }
    
    func next() -> UInt8? {
        let data = handle.readData(ofLength: 1)
        return data[0]
    }
}

Creating a Sequence

04:15 To iterate over ReadRandom, we could now write a while loop and keep calling next. We can't yet use a for loop for iteration; for this we need a sequence around the iterator. An easy way to create a sequence from an iterator is to use the AnySequence type, which has an initializer that takes a function returning an iterator:

let randomSequence = AnySequence { ReadRandom() }

04:50 Now we can loop over randomSequence using a for loop:

for el in randomSequence {
    print(el)
}

Of course, this would result in an endless loop, since we're dealing with an infinite sequence.

05:07 Now that we have a sequence around the ReadRandom iterator, we can make use of all the functionality that's already defined in the standard library for sequences. For example, we can use the prefix method to take only a certain amount of elements from our infinite sequence:

randomSequence.prefix(10)

05:39 The statement above doesn't yet trigger any file system activity. Only once we actually access those 10 elements within it — for example, by creating an array out of this sequence — will the data be read:

Array(randomSequence.prefix(10))

05:59 Once we've created an array out of the sequence of random bytes, we can also use the split(batchSize:) method from the last episode on it. However, we can't yet use this method directly on the sequence, since we've only implemented it on the Collection protocol thus far.

Extensions on Sequence

06:28 To make split(batchSize:) available on all sequences, we have to write an extension like this:

extension Sequence {
    func split(batchSize: Int) -> [[Iterator.Element]] {
        // ...
    }
}

07:13 The implementation is actually very similar to a problem we discussed in the Splitting Arrays episode, where we showed multiple solutions. The full implementation of split looks like this:

extension Sequence {
    func split(batchSize: Int) -> [[Iterator.Element]] {
        var result: [[Iterator.Element]] = []
        var batch: [Iterator.Element] = []
        for element in self {
            batch.append(element)
            if batch.count == batchSize {
                result.append(batch)
                batch = []
            }
        }
        if !batch.isEmpty { result.append(batch) }
        return result
    }
}

08:39 Now we can call split(batchSize:) directly on a sequence, like this:

randomSequence.prefix(10).split(batchSize: 3)

08:54 However, if we try to call split(batchSize:) on randomSequence without the call to prefix in between, we end up with an infinite loop: the split method tries to convert the whole sequence into an array of batches, but we're dealing with an infinite sequence.

Infinite Sequences

09:28 To make our split method work on infinite sequences, we have to change its return type: instead of returning an array of batches, we have to return a sequence of batches, because a sequence can be evaluated lazily, whereas an array always has all elements precomputed. We can change the signature of our method to the following:

func split(batchSize: Int) -> AnySequence<[Iterator.Element]>

10:03 Now we have to approach the implementation very differently. We start by returning an AnySequence. We use the initializer that takes a function that returns an iterator:

extension Sequence {
    func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
        return AnySequence { () -> AnyIterator<[Iterator.Element]> in
            // ...
        }
    }
}

10:51 Next, we create an iterator using the AnyIterator type. This type provides a convenient way to create an iterator without defining our own type conforming to IteratorProtocol, as we did with ReadRandom before. AnyIterator has an initializer that takes the next function of the iterator as its argument:

extension Sequence {
    func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
        return AnySequence { () -> AnyIterator<[Iterator.Element]> in
            return AnyIterator {
                // implement next ...
            }
        }
    }
}

11:45 To implement the iterator's next function, we first create an iterator from the sequence we're splitting into batches by calling its makeIterator method. Then we call this iterator's next method until a batch of batchSize has filled up, or until no more elements are available:

extension Sequence {
    func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
        return AnySequence { () -> AnyIterator<[Iterator.Element]> in
            var iterator = self.makeIterator()
            return AnyIterator {
                var batch: [Iterator.Element] = []
                while batch.count < batchSize, let el = iterator.next() {
                    batch.append(el)
                }
                return batch.isEmpty ? nil : batch
            }
        }
    }
}

It's important to return nil in case we can't add any elements to the next batch. Otherwise, we'd be creating a sequence that returns empty arrays indefinitely.

13:23 With this extension in place, we can now call split(batchSize:) on an infinite sequence, like our randomSequence:

randomSequence.split(batchSize: 3).prefix(5)

Lazy Evaluation

13:28 This is much nicer, because now we don't have to call prefix with the correct number before we split the sequence into batches. We can just specify whatever batch size we want and then decide how many batches we want to have.

13:38 The code above still doesn't trigger data to be read from the file system. This gets evaluated lazily once we actually iterate through the elements of the sequence, for example by using a for loop or by creating an array out of the sequence.

14:10 If we try to use a method from the standard library — like map — on our infinite sequence, we'll get stuck in an infinite loop again:

randomSequence.split(batchSize: 3).map { "\($0)" } // runs indefinitely

This happens because the return type of map is an array, and arrays always have all their elements precomputed. To avoid this problem, we can turn our sequence into a lazy sequence using the lazy property:

randomSequence.split(batchSize: 3).lazy.map { "\($0)" }

14:45 lazy returns a lazy sequence that has specific lazy implementations for methods like map so that they can be used on infinite sequences as well.

Conclusion

15:32 The requirements for a sequence are minimal: you only need to be able to provide a next element. There are many types that can be modeled as sequences that wouldn't make sense as collections (where you need to be able to provide subscripting and non-destructive iteration). And since sequences can be evaluated lazily, they're powerful tools for dealing with large or even infinite data sets.

16:17 Although the Sequence protocol is simple and only requires us to provide an iterator with a next method, we can build a surprising amount of functionality on top of it. We get methods like map, filter, and reduce from the standard library, but we were also able to implement our custom split(batchSize:) helper in such a way that it works on infinite sequences.