Git Under the Hood: Blobs

In this episode26:10

This episode is freely available thanks to the support of our subscribers

Subscribers get exclusive access to new and all previous subscriber-only episodes, video downloads, and 30% discount for team members. Become a Subscriber →

We implement Git's data structure in Swift, starting with blobs.

00:06 Today, we'll start working on a pure Swift implementation of a small part of Git. Although there's a lot to Git, the actual underlying storage system is quite simple — for sake of this project, we can reduce it to blobs, trees, commits, and tags. Reimplementing this will be interesting because we'll deal with a real project and handle binary data. And we'll also get to know more about how Git works under the hood.

Inspecting a Repository

00:51 We can start by creating an actual Git repository with a single file, and then we'll try to read that file using Swift. So, we create a new subdirectory, sample-repo, and we create a file containing the phrase "Hello, world". Then, we initialize a Git repository in this directory by running git init:

01:25 This creates the .git directory containing a config file, a few empty directories, and a number of sample hooks. We can now add our file and create our first commit. When we run find .git, we see that some things have been added to the objects directory:

There are now three objects: a blob object containing the contents of the file, a tree object containing the name of our file and a pointer to the contents blob, and a commit object that contains a pointer to the tree. In short, the commit points to the tree and the tree points to the contents.

02:48 We can have a quick look at each of these objects by running git cat-file with an object's hash. We get this hash by combining the two digits of the directory name and the object's file name (the hash is split into these two parts by Git to keep the objects directory organized):

git cat-file -p a5c19667710254f835085b99726e523457150e03

03:06 This outputs the contents we just wrote to the file:

Hello, world

03:14 The second object is the commit. It has a pointer to the tree, along with the author, the committer, and the commit message:

tree 7b003be7f0523bf4ea1632ef04b39463555b0eb8
author Chris ...
committer Chris ...

First commit

03:36 The third object is the tree, and it contains a list of files:

100644 blob a5c19667710254f835085b99726e523457150e03	my-file

This list might also contain a pointer to another tree, constituting a subdirectory. But in this case, there's just one blob, which is an actual file.

04:00 A commit typically also has a parent, since commits are built on top of each other. What might be surprising to a lot of people is that there are no diffs stored anywhere; we just have blobs and trees. Diffs we see in Git clients, for example, are derived by comparing blobs.

Reading the Repository from Swift

04:32 Let's try to read this repository from Swift. The first thing we need is a Repository struct. We initialize this struct with a URL so that we have a container to work with:

public struct Repository {
    var url: URL
    init(_ url: URL) {
        self.url = url
    }
}

05:26 We can infer the Git directory from this URL, because this is conventionally a subdirectory named .git:

public struct Repository {
    var url: URL
    init(_ url: URL) {
        self.url = url
    }

    var gitURL: URL {
        url.appendingPathComponent(".git")
    }
}

06:08 Let's define how we want this to work by writing a test. The first thing we need to do is read a blob. This requires a Repository, for which we need the file URL of our test repo. By using the #file command, we can create a URL that's relative to this Swift file:

final class PureSwiftGitTests: XCTestCase {
    func testReadBlob() throws {
        let fileURL = URL(fileURLWithPath: #file)
        let repoURL = fileURL.deletingLastPathComponent().appendingPathComponent("../../Fixtures/sample-repo")
        let repo = Repository(repoURL)

    }
}

08:24 Now we want to be able to call a readObject method with a hash. Let's pick the blob object's hash for this test:

final class PureSwiftGitTests: XCTestCase {
    func testReadBlob() throws {
        let fileURL = URL(fileURLWithPath: #file)
        let repoURL = fileURL.deletingLastPathComponent().appendingPathComponent("../../Fixtures/sample-repo")
        let repo = Repository(repoURL)
        let obj = try repo.readObject("a5c19667710254f835085b99726e523457150e03")

    }
}

09:01 We then need an assertion about the return object, but let's create an Object type first. This type can be enum, because we know each of its values will be one of three different kinds of objects:

enum Object {
    case blob(Data)
    case commit
    case tree
}

09:50 In the readObject method, we receive a hash and we need to read the contents of a file. First, we construct the path to the file by splitting the hash into two parts — the first two characters, and the remainder. Then we try to read the data:

public struct Repository {
    // ...

    func readObject(_ hash: String) throws -> Object {
        // todo verify that this is an actual hash
        let objectPath = "\(hash.prefix(2))/\(hash.dropFirst(2))"
        let data = try Data(contentsOf: gitURL.appendingPathComponent("objects/\(objectPath)"))

    }
}

11:38 Our aim isn't to have a secure implementation; if people put .. in the hash, this might let us read other stuff in the file system. So we just add a note that the received hash should be validated.

12:00 When we try decoding the read data into a string, we can see that it doesn't work:

public struct Repository {
    // ...

    func readObject(_ hash: String) throws -> Object {
        // todo verify that this is an actual hash
        let objectPath = "\(hash.prefix(2))/\(hash.dropFirst(2))"
        let data = try Data(contentsOf: gitURL.appendingPathComponent("objects/\(objectPath)"))
        print(String(decoding: data, as: UTF8.self))
        fatalError()
    }
}

Decompressing

12:22 This is because the objects are all compressed using zlib. We have a few episodes about async sequences and streams, in which we read and parsed XML in chunks, but rather than reimplementing that, we're just going to take some decompression code with an MIT license from the internet. This code is simple because it only decompresses data with a fixed buffer size, which means it won't work with data larger than 8 MB, but that won't be a problem in this particular project:

import Foundation
import Compression

// From https://github.com/mezhevikin/Zlib/blob/master/Sources/Zlib/Zlib.swift

extension Data {
    var decompressed: Data {
        let size = 8_000_000
        let buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: size)
        defer { buffer.deallocate() }
        let result = dropFirst(2).withUnsafeBytes({
            let read = compression_decode_buffer(
                buffer,
                size,
                $0.baseAddress!.bindMemory(
                    to: UInt8.self,
                    capacity: 1
                ),
                $0.count,
                nil,
                COMPRESSION_ZLIB
            )
            return Data(bytes: buffer, count: read)
        })
        return result
    }
}

14:03 We should now be able to append decompressed to the data and then print our string:

let data = try Data(contentsOf: gitURL.appendingPathComponent("objects/\(objectPath)")).decompressed

14:13 If we run the test, we can see it prints blob 13Hello, world. What we cannot see is that there's a zero byte between 13 and Hello. A blob object always starts with the four letters blob, followed by a space, then the size of the blob, the zero byte, and finally the data. The blob's size is included to allow C programmers to allocate memory, but we can ignore it.

Parsing a Blob

15:51 We can use this pattern to parse the blob object. We assign the data to a mutable remainder so that we can read and remove pieces from the beginning of it:

var remainder = data

17:14 Then, we want to get the type string, i.e. blob or commit or tree. We do this by taking the data's prefix up to the space character's byte, i.e. 0x20:

remainder.prefix(while: { $0 != 0x20 })

17:53 To check the contents of the prefix, we decode it into a string:

let typeStr = String(decoding: remainder.prefix(while: { $0 != 0x20 }), as: UTF8.self)

18:26 When we print this string, we can see that we're getting the word blob, so we're on the right track.

18:55 Next, we need to skip over the space and the blob's size. And since we'll be doing this a lot — taking a prefix up until a certain point and then removing it — we can extend Data with a helper method that does this. This method takes a separator element, reads the prefix off the start of the data, removes both the prefix and the separator from the data, and returns the prefix:

extension Data {
    mutating func remove(upTo separator: Element) -> Data {
        let part = prefix(while: { $0 != separator })
        removeFirst(part.count)
        popFirst()
        return part
    }
}

22:05 We can now call the helper to parse the type string:

let typeStr = String(decoding: remainder.remove(upTo: 0x20), as: UTF8.self)

22:34 We then discard the blob size by removing everything up to the zero byte:

remainder.remove(upTo: 0)

22:53 We've now reached the point of the blob where its data starts. We switch over the type string to check if it is in fact a blob object, and if it is, we return the .blob case with remainder as its associated value:

public struct Repository {
    // ...

    func readObject(_ hash: String) throws -> Object {
        // todo verify that this is an actual hash
        let objectPath = "\(hash.prefix(2))/\(hash.dropFirst(2))"
        let data = try Data(contentsOf: gitURL.appendingPathComponent("objects/\(objectPath)"))
        print(String(decoding: data, as: UTF8.self))
        var remainder = data
        let typeStr = String(decoding: remainder.remove(upTo: 0x20), as: UTF8.self)
        remainder.remove(upTo: 0)
        switch typeStr {
        case "blob": return .blob(remainder)
        default: fatalError()
        }
    }
}

23:53 To get rid of the compiler warning about not using the removed blob size, we mark the remove(upTo) method as returning a discardable result. And inside the method, we assign the result of popFirst() to an underscore to make it explicit that we're not interested in the removed separator:

extension Data {
    @discardableResult
    mutating func remove(upTo separator: Element) -> Data {
        let part = prefix(while: { $0 != separator })
        removeFirst(part.count)
        _ = popFirst()
        return part
    }
}

24:11 We can now write our assertion about the read object. To compare the result and the expected object, we need to make Object equatable. We do so by conforming it to Hashable, which inherits Equatable:

enum Object: Hashable {
    case blob(Data)
    case commit
    case tree
}

24:44 We construct the object we expect to read, and we compare the actual result to it:

final class PureSwiftGitTests: XCTestCase {
    func testReadBlob() throws {
        let fileURL = URL(fileURLWithPath: #file)
        let repoURL = fileURL.deletingLastPathComponent().appendingPathComponent("../../Fixtures/sample-repo")
        let repo = Repository(repoURL)
        let obj = try repo.readObject("a5c19667710254f835085b99726e523457150e03")
        let expected = Object.blob("Hello, world".data(using: .utf8)!)
        XCTAssertEqual(obj, expected)
    }
}

25:12 The test fails because there's a difference of one byte. This is because of the newline character appended to the file's contents (from using echo). After we adjust our expectation, the test passes:

let expected = Object.blob("Hello, world\n".data(using: .utf8)!)

25:30 In summary, we can now get into a repository and read a blob object from it. We do this by unzipping and parsing the blob. These are by far the easiest objects, but trees and commits should also not be very hard. Let's continue with those next time.