00:06 In our last episode, we wrote the split(batchSize:)
extension on Array
, on ArraySlice
, and on the Collection
protocol. In this episode, we're going to dive deeper into Swift's collection protocol hierarchy and make our split
method even more generic.
00:29 There are many types that don't conform to the Collection
protocol but that would still benefit from a method like split(batchSize:)
. An example of this is the standard input: it makes sense to split lines from it into batches, but it doesn't make sense to subscript the standard input (and subscripting is a capability defined in the Collection
protocol).
00:53 For use cases like this, Swift has the Sequence
protocol. The only capability a sequence has is to ask for its next element, i.e. to iterate over it element by element. Sequences do not come with the promise that you can iterate over them multiple times and get the same elements out.
A Sequence from the File System
01:11 As an example of a sequence, we'll look at reading from the file system. You might have a very large file you don't want to read in its entirety; you'd rather read it batch by batch as you iterate through it. In order to have something to work with, we define a simple class that wraps reading from /dev/urandom
so that we're reading from an infinitely large file:
import Foundation
final class ReadRandom: IteratorProtocol {
let handle = FileHandle(forReadingAtPath: "/dev/urandom")!
deinit {
handle.closeFile()
}
func getByte() -> UInt8 {
let data = handle.readData(ofLength: 1)
return data[0]
}
}
02:01 In this class, we open a file handle, for which we need to import Foundation. In the class' deinit
, we close the file handle. Other than that, we only provide a single method, called getByte
, which returns an UInt8
value from /dev/urandom
.
02:58 Now we can instantiate ReadRandom
and try it out:
let randomSource = ReadRandom()
randomSource.getByte()
Creating an Iterator
03:14 It turns out that with ReadRandom
we've already implemented a type that almost conforms to IteratorProtocol
. An iterator is the entity that lies below Sequence
and provides the functionality of asking for a next element. To do this, IteratorProtocol
only requires the next
method, which returns an optional element of the sequence. The return type is optional, because with a finite number of elements, we'll return nil
as soon as we've iterated over all available elements.
03:48 To make ReadRandom
conform to IteratorProtocol
, we have to add the protocol conformance to the declaration and rename getByte
into next
and change its return type to be an optional UInt8
(we'll never return nil
here, because /dev/urandom
is an infinite source of random bytes):
final class ReadRandom: IteratorProtocol {
let handle = FileHandle(forReadingAtPath: "/dev/urandom")!
deinit {
handle.closeFile()
}
func next() -> UInt8? {
let data = handle.readData(ofLength: 1)
return data[0]
}
}
Creating a Sequence
04:15 To iterate over ReadRandom
, we could now write a while
loop and keep calling next
. We can't yet use a for loop for iteration; for this we need a sequence around the iterator. An easy way to create a sequence from an iterator is to use the AnySequence
type, which has an initializer that takes a function returning an iterator:
let randomSequence = AnySequence { ReadRandom() }
04:50 Now we can loop over randomSequence
using a for loop:
for el in randomSequence {
print(el)
}
Of course, this would result in an endless loop, since we're dealing with an infinite sequence.
05:07 Now that we have a sequence around the ReadRandom
iterator, we can make use of all the functionality that's already defined in the standard library for sequences. For example, we can use the prefix
method to take only a certain amount of elements from our infinite sequence:
randomSequence.prefix(10)
05:39 The statement above doesn't yet trigger any file system activity. Only once we actually access those 10 elements within it — for example, by creating an array out of this sequence — will the data be read:
Array(randomSequence.prefix(10))
05:59 Once we've created an array out of the sequence of random bytes, we can also use the split(batchSize:)
method from the last episode on it. However, we can't yet use this method directly on the sequence, since we've only implemented it on the Collection
protocol thus far.
Extensions on Sequence
06:28 To make split(batchSize:)
available on all sequences, we have to write an extension like this:
extension Sequence {
func split(batchSize: Int) -> [[Iterator.Element]] {
// ...
}
}
07:13 The implementation is actually very similar to a problem we discussed in the Splitting Arrays episode, where we showed multiple solutions. The full implementation of split
looks like this:
extension Sequence {
func split(batchSize: Int) -> [[Iterator.Element]] {
var result: [[Iterator.Element]] = []
var batch: [Iterator.Element] = []
for element in self {
batch.append(element)
if batch.count == batchSize {
result.append(batch)
batch = []
}
}
if !batch.isEmpty { result.append(batch) }
return result
}
}
08:39 Now we can call split(batchSize:)
directly on a sequence, like this:
randomSequence.prefix(10).split(batchSize: 3)
08:54 However, if we try to call split(batchSize:)
on randomSequence
without the call to prefix
in between, we end up with an infinite loop: the split
method tries to convert the whole sequence into an array of batches, but we're dealing with an infinite sequence.
Infinite Sequences
09:28 To make our split
method work on infinite sequences, we have to change its return type: instead of returning an array of batches, we have to return a sequence of batches, because a sequence can be evaluated lazily, whereas an array always has all elements precomputed. We can change the signature of our method to the following:
func split(batchSize: Int) -> AnySequence<[Iterator.Element]>
10:03 Now we have to approach the implementation very differently. We start by returning an AnySequence
. We use the initializer that takes a function that returns an iterator:
extension Sequence {
func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
return AnySequence { () -> AnyIterator<[Iterator.Element]> in
// ...
}
}
}
10:51 Next, we create an iterator using the AnyIterator
type. This type provides a convenient way to create an iterator without defining our own type conforming to IteratorProtocol
, as we did with ReadRandom
before. AnyIterator
has an initializer that takes the next
function of the iterator as its argument:
extension Sequence {
func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
return AnySequence { () -> AnyIterator<[Iterator.Element]> in
return AnyIterator {
// implement next ...
}
}
}
}
11:45 To implement the iterator's next
function, we first create an iterator from the sequence we're splitting into batches by calling its makeIterator
method. Then we call this iterator's next
method until a batch of batchSize
has filled up, or until no more elements are available:
extension Sequence {
func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
return AnySequence { () -> AnyIterator<[Iterator.Element]> in
var iterator = self.makeIterator()
return AnyIterator {
var batch: [Iterator.Element] = []
while batch.count < batchSize, let el = iterator.next() {
batch.append(el)
}
return batch.isEmpty ? nil : batch
}
}
}
}
It's important to return nil
in case we can't add any elements to the next batch. Otherwise, we'd be creating a sequence that returns empty arrays indefinitely.
13:23 With this extension in place, we can now call split(batchSize:)
on an infinite sequence, like our randomSequence
:
randomSequence.split(batchSize: 3).prefix(5)
Lazy Evaluation
13:28 This is much nicer, because now we don't have to call prefix
with the correct number before we split the sequence into batches. We can just specify whatever batch size we want and then decide how many batches we want to have.
13:38 The code above still doesn't trigger data to be read from the file system. This gets evaluated lazily once we actually iterate through the elements of the sequence, for example by using a for loop or by creating an array out of the sequence.
14:10 If we try to use a method from the standard library — like map
— on our infinite sequence, we'll get stuck in an infinite loop again:
randomSequence.split(batchSize: 3).map { "\($0)" } // runs indefinitely
This happens because the return type of map
is an array, and arrays always have all their elements precomputed. To avoid this problem, we can turn our sequence into a lazy sequence using the lazy
property:
randomSequence.split(batchSize: 3).lazy.map { "\($0)" }
14:45 lazy
returns a lazy sequence that has specific lazy implementations for methods like map
so that they can be used on infinite sequences as well.
Conclusion
15:32 The requirements for a sequence are minimal: you only need to be able to provide a next element. There are many types that can be modeled as sequences that wouldn't make sense as collections (where you need to be able to provide subscripting and non-destructive iteration). And since sequences can be evaluated lazily, they're powerful tools for dealing with large or even infinite data sets.
16:17 Although the Sequence
protocol is simple and only requires us to provide an iterator with a next
method, we can build a surprising amount of functionality on top of it. We get methods like map
, filter
, and reduce
from the standard library, but we were also able to implement our custom split(batchSize:)
helper in such a way that it works on infinite sequences.