00:06 In our last
episode, we wrote
the split(batchSize:)
extension on Array
, on ArraySlice
, and on the
Collection
protocol. In this episode, we're going to dive deeper into Swift's
collection protocol hierarchy and make our split
method even more generic.
00:29 There are many types that don't conform to the Collection
protocol but that would still benefit from a method like split(batchSize:)
. An
example of this is the standard input: it makes sense to split lines from it
into batches, but it doesn't make sense to subscript the standard input (and
subscripting is a capability defined in the Collection
protocol).
00:53 For use cases like this, Swift has the Sequence
protocol. The
only capability a sequence has is to ask for its next element, i.e. to iterate
over it element by element. Sequences do not come with the promise that you
can iterate over them multiple times and get the same elements out.
A Sequence from the File System
01:11 As an example of a sequence, we'll look at reading from the file
system. You might have a very large file you don't want to read in its entirety;
you'd rather read it batch by batch as you iterate through it. In order to have
something to work with, we define a simple class that wraps reading from
/dev/urandom
so that we're reading from an infinitely large file:
import Foundation
final class ReadRandom: IteratorProtocol {
let handle = FileHandle(forReadingAtPath: "/dev/urandom")!
deinit {
handle.closeFile()
}
func getByte() -> UInt8 {
let data = handle.readData(ofLength: 1)
return data[0]
}
}
02:01 In this class, we open a file handle, for which we need to import
Foundation. In the class' deinit
, we close the file handle. Other than that,
we only provide a single method, called getByte
, which returns an UInt8
value from /dev/urandom
.
02:58 Now we can instantiate ReadRandom
and try it out:
let randomSource = ReadRandom()
randomSource.getByte()
Creating an Iterator
03:14 It turns out that with ReadRandom
we've already implemented a
type that almost conforms to IteratorProtocol
. An iterator is the entity that
lies below Sequence
and provides the functionality of asking for a next
element. To do this, IteratorProtocol
only requires the next
method, which
returns an optional element of the sequence. The return type is optional,
because with a finite number of elements, we'll return nil
as soon as we've
iterated over all available elements.
03:48 To make ReadRandom
conform to IteratorProtocol
, we have to add
the protocol conformance to the declaration and rename getByte
into next
and
change its return type to be an optional UInt8
(we'll never return nil
here,
because /dev/urandom
is an infinite source of random bytes):
final class ReadRandom: IteratorProtocol {
let handle = FileHandle(forReadingAtPath: "/dev/urandom")!
deinit {
handle.closeFile()
}
func next() -> UInt8? {
let data = handle.readData(ofLength: 1)
return data[0]
}
}
Creating a Sequence
04:15 To iterate over ReadRandom
, we could now write a while
loop
and keep calling next
. We can't yet use a for loop for iteration; for this we
need a sequence around the iterator. An easy way to create a sequence from an
iterator is to use the AnySequence
type, which has an initializer that takes a
function returning an iterator:
let randomSequence = AnySequence { ReadRandom() }
04:50 Now we can loop over randomSequence
using a for loop:
for el in randomSequence {
print(el)
}
Of course, this would result in an endless loop, since we're dealing with an
infinite sequence.
05:07 Now that we have a sequence around the ReadRandom
iterator, we
can make use of all the functionality that's already defined in the standard
library for sequences. For example, we can use the prefix
method to take only
a certain amount of elements from our infinite sequence:
randomSequence.prefix(10)
05:39 The statement above doesn't yet trigger any file system activity.
Only once we actually access those 10 elements within it — for example, by
creating an array out of this sequence — will the data be read:
Array(randomSequence.prefix(10))
05:59 Once we've created an array out of the sequence of random bytes,
we can also use the split(batchSize:)
method from the last
episode on it.
However, we can't yet use this method directly on the sequence, since we've only
implemented it on the Collection
protocol thus far.
Extensions on Sequence
06:28 To make split(batchSize:)
available on all sequences, we have to
write an extension like this:
extension Sequence {
func split(batchSize: Int) -> [[Iterator.Element]] {
}
}
07:13 The implementation is actually very similar to a problem we
discussed in the Splitting Arrays episode,
where we showed multiple solutions. The full implementation of split
looks
like this:
extension Sequence {
func split(batchSize: Int) -> [[Iterator.Element]] {
var result: [[Iterator.Element]] = []
var batch: [Iterator.Element] = []
for element in self {
batch.append(element)
if batch.count == batchSize {
result.append(batch)
batch = []
}
}
if !batch.isEmpty { result.append(batch) }
return result
}
}
08:39 Now we can call split(batchSize:)
directly on a sequence, like
this:
randomSequence.prefix(10).split(batchSize: 3)
08:54 However, if we try to call split(batchSize:)
on randomSequence
without the call to prefix
in between, we end up with an infinite loop: the
split
method tries to convert the whole sequence into an array of batches, but
we're dealing with an infinite sequence.
Infinite Sequences
09:28 To make our split
method work on infinite sequences, we have to
change its return type: instead of returning an array of batches, we have to
return a sequence of batches, because a sequence can be evaluated lazily,
whereas an array always has all elements precomputed. We can change the
signature of our method to the following:
func split(batchSize: Int) -> AnySequence<[Iterator.Element]>
10:03 Now we have to approach the implementation very differently. We
start by returning an AnySequence
. We use the initializer that takes a
function that returns an iterator:
extension Sequence {
func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
return AnySequence { () -> AnyIterator<[Iterator.Element]> in
}
}
}
10:51 Next, we create an iterator using the AnyIterator
type. This
type provides a convenient way to create an iterator without defining our own
type conforming to IteratorProtocol
, as we did with ReadRandom
before.
AnyIterator
has an initializer that takes the next
function of the iterator
as its argument:
extension Sequence {
func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
return AnySequence { () -> AnyIterator<[Iterator.Element]> in
return AnyIterator {
}
}
}
}
11:45 To implement the iterator's next
function, we first create an
iterator from the sequence we're splitting into batches by calling its
makeIterator
method. Then we call this iterator's next
method until a batch
of batchSize
has filled up, or until no more elements are available:
extension Sequence {
func split(batchSize: Int) -> AnySequence<[Iterator.Element]> {
return AnySequence { () -> AnyIterator<[Iterator.Element]> in
var iterator = self.makeIterator()
return AnyIterator {
var batch: [Iterator.Element] = []
while batch.count < batchSize, let el = iterator.next() {
batch.append(el)
}
return batch.isEmpty ? nil : batch
}
}
}
}
It's important to return nil
in case we can't add any elements to the next
batch. Otherwise, we'd be creating a sequence that returns empty arrays
indefinitely.
13:23 With this extension in place, we can now call split(batchSize:)
on an infinite sequence, like our randomSequence
:
randomSequence.split(batchSize: 3).prefix(5)
Lazy Evaluation
13:28 This is much nicer, because now we don't have to call prefix
with the correct number before we split the sequence into batches. We can just
specify whatever batch size we want and then decide how many batches we want to
have.
13:38 The code above still doesn't trigger data to be read from the file
system. This gets evaluated lazily once we actually iterate through the elements
of the sequence, for example by using a for loop or by creating an array out of
the sequence.
14:10 If we try to use a method from the standard library — like map
—
on our infinite sequence, we'll get stuck in an infinite loop again:
randomSequence.split(batchSize: 3).map { "\($0)" }
This happens because the return type of map
is an array, and arrays always
have all their elements precomputed. To avoid this problem, we can turn our
sequence into a lazy sequence using the lazy
property:
randomSequence.split(batchSize: 3).lazy.map { "\($0)" }
14:45 lazy
returns a lazy sequence that has specific lazy
implementations for methods like map
so that they can be used on infinite
sequences as well.
Conclusion
15:32 The requirements for a sequence are minimal: you only need to be
able to provide a next element. There are many types that can be modeled as
sequences that wouldn't make sense as collections (where you need to be able to
provide subscripting and non-destructive iteration). And since sequences can be
evaluated lazily, they're powerful tools for dealing with large or even infinite
data sets.
16:17 Although the Sequence
protocol is simple and only requires us to
provide an iterator with a next
method, we can build a surprising amount of
functionality on top of it. We get methods like map
, filter
, and reduce
from the standard library, but we were also able to implement our custom
split(batchSize:)
helper in such a way that it works on infinite sequences.