This episode is freely available thanks to the support of our subscribers

Subscribers get exclusive access to new and all previous subscriber-only episodes, video downloads, and 30% discount for team members. Become a Subscriber

We begin implementing the ThumbHash algorithm by creating a basic discrete cosine transform.

00:06 Today we're going to begin reimplementing part of ThumbHash, which encodes an image into a compact representation, just a handful of bytes, that can later be expanded back into a low-resolution preview. The idea is to generate a placeholder that captures the overall look of the image:

00:28 We could simply shrink the image to a tiny resolution and scale it back up, but the hashing algorithm produces previews that resemble the original more convincingly than naive downsampling.

00:51 ThumbHash was inspired by and aims to improve upon BlurHash, one of the first implementations of this concept.

01:06 The core idea is: downsample the image, convert it from RGB to a perceptual color space, and encode the components of this color space using a discrete cosine transform (DCT).

01:36 DCTs can be found everywhere — JPEG and many other compression algorithms use them. The idea with ThumbHash is to squeeze out a tiny representation you can embed directly in JSON or similar, to give users a vague but meaningful sense of what image is coming, before the actual full-resolution image loads.

02:10 We can see it in action on the website: the original photo, the ThumbHash preview, and the BlurHash preview are displayed side-by-side. Both hashes are far better than showing nothing, especially for photos.

02:39 Our goal today is to start understanding how the algorithm works by implementing the encoding and decoding parts. We'll work with simple numbers and visualize the results, working our way towards images.

Discrete Cosine Transform

03:09 Let's start with a list of numbers — these could represent luminance samples from one row of an image. By storing this list as a string, we can use a text field to quickly modify the input at runtime:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    // ...
}

03:39 We parse the text input into an array of Double values by splitting the string on commas and converting each component:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    // ...
}

04:02 Then we can plot those values in a graph. That way, after encoding and later decoding them, we can compare shapes visually without needing to render an actual image. We can iterate over the values with enumerated() and use LineMark for the line and PointMark for the sample points, using the enumerated index as the X coordinate and the value itself for the Y coordinate:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    var encoded: [Double] {
        dctEncode(input: values)
    }

    var decoded: [Double] {
        dctDecode(coefficients: encoded)
    }

    var body: some View {
        VStack {
            TextField("Values", text: $input)
            Chart {
                ForEach(Array(values.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
        }
        .padding()
    }
}

05:17 We can run this and change one of the values, and we see the graph update accordingly:

Encoding

05:32 Now let's outline the encoding function. It will take an array of doubles as input and it returns another array of doubles representing the DCT coefficients:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []


    return output
}

05:58 We can find the algorithm in math notation on Wikipedia. For each output coefficient, we sum over every input sample, multiplying it by a cosine term:

06:27 We can already see why it's called a "cosine transform" — we're decomposing our input signal into a set of coefficients representing cosine waves. With the disclaimer that we're no experts on this at all, we can think of this encoding as trying to find out which combination of cosine curves best resembles the original signal.

07:16 The first coefficient corresponds to a constant value, like the average or the sum of all input values. Each next coefficient represents a wave with a higher frequency than the previous one. The more input samples we want to encode, the more rapidly the wave needs to change. The more we compress an image, the more of the higher frequencies we discard.

08:18 The output of the encoding algorithm is an array of coefficients indicating the weighting of different cosine frequencies. We compute this in a nested loop. The outer loop iterates over the input values and computes the same number of coefficients. The inner loop accumulates a sum of values.

09:25 So, if we have eight input samples, we also produce eight coefficients. That makes sense — if we want to reconstruct something like a row of a checkerboard pattern, we need enough frequency bands to represent it.

09:52 Let's start with the two nested loops. The outer loop iterates over coefficient indices, the inner loop iterates over the sample values. Together, they compute a sum for each coefficient:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {


        }
        output.append(sum)
    }
    return output
}

11:11 Inside the inner loop, we multiply each value with the cosine of a product of three terms. The first term is π divided by the number of elements:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {
            sum += value * cos(.pi/Double(input.count))
        }
        output.append(sum)
    }
    return output
}

12:02 The second term is the current index plus a half. If we imagine wanting to sample 8 points — meaning we'll iterate over indices 0 through 7 — then this term makes it so that we look at values from 0.5 to 7.5:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {
            sum += value * cos(.pi/Double(input.count) * (Double(idx) + 0.5))
        }
        output.append(sum)
    }
    return output
}

12:39 Lastly, we multiply with the coefficient index:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {
            sum += value * cos(.pi/Double(input.count) * (Double(idx) + 0.5) * Double(coefficientIdx))
        }
        output.append(sum)
    }
    return output
}

13:38 After these loops, we have an encoded sum, i.e. a coefficient, for each data point. Let's visualize the resulting coefficients on a second chart, in which we should see the magnitude of each coefficient:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    var encoded: [Double] {
        dctEncode(input: values)
    }

    var body: some View {
        VStack {
            TextField("Values", text: $input)
            Chart {
                ForEach(Array(values.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
            Chart {
                ForEach(Array(encoded.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
        }
        .padding()
    }
}

15:04 Now we can see how the coefficients behave: the first couple have significant values, and the rest get very small. Our test data forms a simple rising line, so the first coefficient captures the average values of our whole line. The second coefficient is negative because it inverts the downward cosine curve to match our upward line. The smaller higher-frequency terms finetune the result:

16:15 It would be nice to have the inverse operation as well, so that we can see if we can reconstruct our input signal from the encoded values.

Decoding

16:28 Looking again at Wikipedia, the formula for the inverse DCT looks almost identical. The cosine part is the same, but instead of multiplying the cosine by input samples, we multiply it with the coefficient values. Also, we have to add half of the first coefficient:

17:01 So let's implement that. We'll reuse most of our structure but flip the loops. Now the outer loop iterates over coefficient indices, and the inner loop iterates over coefficient values:

func dctDecode(coefficients: [Double]) -> [Double] {
    var output: [Double] = []
    for idx in 0..<coefficients.count {
        var sum: Double = 0
        for (coefficientIdx, coefficient) in coefficients.enumerated() {
            sum += coefficient * cos(.pi/Double(coefficients.count) * (Double(idx) + 0.5) * Double(coefficientIdx))
        }
        output.append(sum)
    }
    return output
}

19:06 And then we still need to add the 0.5 * coefficients[0] term to each sum:

func dctDecode(coefficients: [Double]) -> [Double] {
    var output: [Double] = []
    for idx in 0..<coefficients.count {
        var sum: Double = coefficients[0] * 0.5
        for (coefficientIdx, coefficient) in coefficients.enumerated() {
            sum += coefficient * cos(.pi/Double(coefficients.count) * (Double(idx) + 0.5) * Double(coefficientIdx))
        }
        output.append(sum)
    }
    return output
}

19:27 Now we can plot the reconstructed signal on top of the original. We compute a decoded array using our decode function:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    var encoded: [Double] {
        dctEncode(input: values)
    }

    var decoded: [Double] {
        dctDecode(coefficients: encoded)
    }

    var body: some View {
        // ...
    }
}

19:46 We plot the decoded values with a different color, so that the Charts framework can tell these two lines apart:

struct ContentView: View {
    // ...

    var body: some View {
        VStack {
            TextField("Values", text: $input)
            Chart {
                ForEach(Array(values.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
                ForEach(Array(decoded.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value), series: .value("Decoded", "Decoded"))
                        .foregroundStyle(.orange)
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                        .foregroundStyle(.orange)
                }
            }
            Chart {
                ForEach(Array(encoded.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
        }
        .padding()
    }
}

20:21 The blue line is our input data and the orange line shows the decoded values. The shapes of these lines match, but the scaling doesn't. The decoded line has the same general slope and curvature but it's offset vertically and scaled differently:

20:59 Trying fewer samples, like four instead of eight, gives a similar result: the decoded curve matches the shape but not the exact values or scale. So the transform works in principle, but we're still missing some scaling, perhaps.

21:22 Wikipedia tells us that DCT-III is the inverse of DCT-II up to a scale factor. That's probably the issue here: we need some scale factors to make our lines match. The goal in the end is to have an encoding and decoding operation that are each other's inverse. For that, we'll have to take scale factors and offsets into account. Let's pick that up in the next episode.

Resources

  • Sample Code

    Written in Swift 6.0

  • Episode Video

    Become a subscriber to download episode videos.

In Collection

201 Episodes · 70h14min

See All Collections

Episode Details

Recent Episodes

See All

Unlock Full Access

Subscribe to Swift Talk

  • Watch All Episodes

    A new episode every week

  • icon-benefit-download Created with Sketch.

    Download Episodes

    Take Swift Talk with you when you're offline

  • Support Us

    With your help we can keep producing new episodes