ThumbHash (Part 1)

In this episode22:26

This episode is freely available thanks to the support of our subscribers

Subscribers get exclusive access to new and all previous subscriber-only episodes, video downloads, and 30% discount for team members. Become a Subscriber →

We begin implementing the ThumbHash algorithm by creating a basic discrete cosine transform.

00:06 Today we're going to begin reimplementing part of ThumbHash, which encodes an image into a compact representation, just a handful of bytes, that can later be expanded back into a low-resolution preview. The idea is to generate a placeholder that captures the overall look of the image:

00:28 We could simply shrink the image to a tiny resolution and scale it back up, but the hashing algorithm produces previews that resemble the original more convincingly than naive downsampling.

00:51 ThumbHash was inspired by and aims to improve upon BlurHash, one of the first implementations of this concept.

01:06 The core idea is: downsample the image, convert it from RGB to a perceptual color space, and encode the components of this color space using a discrete cosine transform (DCT).

01:36 DCTs can be found everywhere — JPEG and many other compression algorithms use them. The idea with ThumbHash is to squeeze out a tiny representation you can embed directly in JSON or similar, to give users a vague but meaningful sense of what image is coming, before the actual full-resolution image loads.

02:10 We can see it in action on the website: the original photo, which is the ThumbHash preview, and the BlurHash preview are displayed side-by-side. Both hashes are far better than showing nothing, especially for photos.

02:39 Our goal today is to start understanding how the algorithm works by implementing the encoding and decoding parts. We'll work with simple numbers and visualize the results, working our way toward images.

Discrete Cosine Transform

03:09 Let's start with a list of numbers — these could represent luminance samples from one row of an image. By storing this list as a string, we can use a text field to quickly modify the input at runtime:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    // ...
}

03:39 We parse the text input into an array of Double values by splitting the string on commas and converting each component:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    // ...
}

04:02 Then we can plot those values in a graph. That way, after encoding and later decoding them, we can compare shapes visually without needing to render an actual image. We can iterate over the values with enumerated() and use LineMark for the line and PointMark for the sample points, using the enumerated index as the X coordinate and the value itself for the Y coordinate:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    var encoded: [Double] {
        dctEncode(input: values)
    }

    var decoded: [Double] {
        dctDecode(coefficients: encoded)
    }

    var body: some View {
        VStack {
            TextField("Values", text: $input)
            Chart {
                ForEach(Array(values.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
        }
        .padding()
    }
}

05:17 We run this and change one of the values, and we see the graph update accordingly:

Encoding

05:32 Now let's outline the encoding function. It takes an array of doubles as input, and it returns another array of doubles representing the DCT coefficients:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []


    return output
}

05:58 We can find the algorithm in math notation on Wikipedia. For each output coefficient, we sum over every input sample, multiplying it by a cosine term:

06:27 We can already see why it's called a "cosine transform" — we're decomposing our input signal into a set of coefficients representing cosine waves. With the disclaimer that we're no experts on this at all, we can think of this encoding as trying to find out which combination of cosine curves best resembles the original signal.

07:16 The first coefficient corresponds to a constant value, like the average or the sum of all input values. Each next coefficient represents a wave with a higher frequency than the previous one. The more input samples we want to encode, the more rapidly the wave needs to change. The more we compress an image, the more of the higher frequencies we discard.

08:18 The output of the encoding algorithm is an array of coefficients indicating the weighting of different cosine frequencies. We compute this in a nested loop. The outer loop iterates over the input values and computes the same number of coefficients. The inner loop accumulates a sum of values.

09:25 So, if we have eight input samples, we also produce eight coefficients. That makes sense — if we want to reconstruct something like a row of a checkerboard pattern, we need enough frequency bands to represent it.

09:52 Let's start with the two nested loops. The outer loop iterates over coefficient indices, and the inner loop iterates over the sample values. Together, they compute a sum for each coefficient:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {


        }
        output.append(sum)
    }
    return output
}

11:11 Inside the inner loop, we multiply each value with the cosine of a product of three terms. The first term is π divided by the number of elements:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {
            sum += value * cos(.pi/Double(input.count))
        }
        output.append(sum)
    }
    return output
}

12:02 The second term is the current index plus a half. If we imagine wanting to sample 8 points — meaning we'll iterate over indices 0 through 7 — then this term makes it so that we look at values from 0.5 to 7.5:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {
            sum += value * cos(.pi/Double(input.count) * (Double(idx) + 0.5))
        }
        output.append(sum)
    }
    return output
}

12:39 Lastly, we multiply with the coefficient index:

func dctEncode(input: [Double]) -> [Double] {
    var output: [Double] = []
    for coefficientIdx in 0..<input.count {
        var sum: Double = 0
        for (idx, value) in input.enumerated() {
            sum += value * cos(.pi/Double(input.count) * (Double(idx) + 0.5) * Double(coefficientIdx))
        }
        output.append(sum)
    }
    return output
}

13:38 After these loops, we have an encoded sum, i.e. a coefficient, for each data point. Let's visualize the resulting coefficients on a second chart, in which we should see the magnitude of each coefficient:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    var encoded: [Double] {
        dctEncode(input: values)
    }

    var body: some View {
        VStack {
            TextField("Values", text: $input)
            Chart {
                ForEach(Array(values.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
            Chart {
                ForEach(Array(encoded.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
        }
        .padding()
    }
}

15:04 Now we can see how the coefficients behave: the first couple have significant values, and the rest get very small. Our test data forms a simple rising line, so the first coefficient captures the average values of our whole line. The second coefficient is negative because it inverts the downward cosine curve to match our upward line. The smaller higher-frequency terms finetune the result:

16:15 It'd be nice to have the inverse operation as well, so that we can see if we can reconstruct our input signal from the encoded values.

Decoding

16:28 Looking at Wikipedia again, the formula for the inverse DCT looks almost identical. The cosine part is the same, but instead of multiplying the cosine by input samples, we multiply it with the coefficient values. Also, we have to add half of the first coefficient:

17:01 So let's implement that. We'll reuse most of our structure, but we'll flip the loops. Now the outer loop iterates over coefficient indices, and the inner loop iterates over coefficient values:

func dctDecode(coefficients: [Double]) -> [Double] {
    var output: [Double] = []
    for idx in 0..<coefficients.count {
        var sum: Double = 0
        for (coefficientIdx, coefficient) in coefficients.enumerated() {
            sum += coefficient * cos(.pi/Double(coefficients.count) * (Double(idx) + 0.5) * Double(coefficientIdx))
        }
        output.append(sum)
    }
    return output
}

19:06 And then we still need to add the 0.5 * coefficients[0] term to each sum:

func dctDecode(coefficients: [Double]) -> [Double] {
    var output: [Double] = []
    for idx in 0..<coefficients.count {
        var sum: Double = coefficients[0] * 0.5
        for (coefficientIdx, coefficient) in coefficients.enumerated() {
            sum += coefficient * cos(.pi/Double(coefficients.count) * (Double(idx) + 0.5) * Double(coefficientIdx))
        }
        output.append(sum)
    }
    return output
}

19:27 Now we can plot the reconstructed signal on top of the original. We compute a decoded array using our decode function:

struct ContentView: View {
    @State var input = "1,2,3,4,5,6,7,8"

    var values: [Double] {
        input.split(separator: ",").compactMap { Double($0) }
    }

    var encoded: [Double] {
        dctEncode(input: values)
    }

    var decoded: [Double] {
        dctDecode(coefficients: encoded)
    }

    var body: some View {
        // ...
    }
}

19:46 We plot the decoded values with a different color, so that the Charts framework can tell these two lines apart:

struct ContentView: View {
    // ...

    var body: some View {
        VStack {
            TextField("Values", text: $input)
            Chart {
                ForEach(Array(values.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
                ForEach(Array(decoded.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value), series: .value("Decoded", "Decoded"))
                        .foregroundStyle(.orange)
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                        .foregroundStyle(.orange)
                }
            }
            Chart {
                ForEach(Array(encoded.enumerated()), id: \.offset) { index, value in
                    LineMark(x: .value("Index", index), y: .value("Value", value))
                    PointMark(x: .value("Index", index), y: .value("Value", value))
                }
            }
        }
        .padding()
    }
}

20:21 The blue line is our input data, and the orange line shows the decoded values. The shapes of these lines match, but the scaling doesn't. The decoded line has the same general slope and curvature, but it's offset vertically and scaled differently:

20:59 Trying fewer samples, like four instead of eight, gives a similar result: the decoded curve matches the shape but not the exact values or scale. So the transform works in principle, but we're still missing some scaling, perhaps.

21:22 Wikipedia tells us that DCT-III is the inverse of DCT-II up to a scale factor. That's probably the issue here: we need some scale factors to make our lines match. The goal in the end is to have an encoding operation and a decoding operation that are each other's inverse. For that, we'll have to take scale factors and offsets into account. Let's pick that up in the next episode.