00:06 Today we'll look at a very simple language model. While everyone
talks about large language models, we remember seeing a demo of someone who
implemented n-grams or Markov chains maybe 10 or 15 years ago — it was already
very impressive. When we think about it, LLMs fundamentally predict the next
token. There's a lot of complexity to make that work, but ultimately you have a
sequence of words and the model predicts what comes next.
00:49 We'll start with a small language model rather than a large one.
We've done various projects in the past — for example, using the OpenAI API to
build a simple coding
agent —
but that approach outsources the real work to large models. What we want to
explore now is how this actually works under the hood, starting with a simple
approach we might have used 15 years ago.
01:27 Our corpus is the book Crime and Punishment, which we downloaded
from Project Gutenberg. That'll be the training set for
our n-grams. The idea behind an n-gram is to use a sliding window over words.
For example, with a three-gram, we look at each combination of two words and
predict what word comes next. Every time we encounter a word sequence, we
increment a counter for the next word after those starting words. The starting
words form the context, just like in an LLM. By incrementing counters, we end up
with a list of candidate words for any two input words, along with their
occurrence frequency in the training set. The word that appears most often
becomes the most likely prediction.
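As a concrete (made-up) example before we write the real implementation, counting trigrams over a short sentence looks like this; the sentence and variable names are our own, not from the book:

```swift
// Toy trigram counting over a made-up sentence: each two-word
// context maps to counts of the word that follows it.
let words = ["the", "old", "man", "and", "the", "old", "woman"]
var counts: [[String]: [String: Int]] = [:]
for i in 0..<(words.count - 2) {
    let context = Array(words[i..<(i + 2)])
    let next = words[i + 2]
    counts[context, default: [:]][next, default: 0] += 1
}
// The context ["the", "old"] was seen twice, once followed by "man"
// and once by "woman", so both are equally frequent predictions.
```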
Building the N-gram Data Structure
02:38 This should be straightforward to implement. Let's begin with our
data structure — an NGram struct that stores frequencies, mapping contexts to
word counts. Each context is a combination of two starting words. We keep track
of the number of occurrences of each third word following the starting words,
and we store these word counts in dictionaries from String to Int:
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]
}
03:22 We can make n configurable. Setting n = 3 creates a trigram,
which means the context is two words long, and the output is one word.
03:35 We'll store plain words without tokenizing them; we'll only apply
a very basic normalization like lowercasing the words.
03:48 Tokenization — the simplest way to tokenize would be taking a word
and turning that into a number, so the same word becomes the same number. But in
practice, tokenization is more complex than that. For example, not every word
translates into a single token. By skipping tokenization, we waste a lot of
memory space because we have this dictionary from String to Int, and these
strings could be represented more efficiently, but it doesn't matter for our
purposes.
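To make that idea concrete, a minimal word-level tokenizer could intern each distinct word as an Int. This is a sketch of our own (the type and method names are invented), not something we build in this episode:

```swift
// Minimal word-to-ID tokenizer sketch: the same word always maps to
// the same Int, and previously unseen words get fresh IDs.
struct WordTokenizer {
    private(set) var ids: [String: Int] = [:]
    private(set) var vocabulary: [String] = []

    mutating func tokenize(_ word: String) -> Int {
        if let id = ids[word] { return id }
        let id = vocabulary.count
        ids[word] = id
        vocabulary.append(word)
        return id
    }
}
```

Real tokenizers (byte-pair encoding, for example) split words into subword units, which is why a single word can become multiple tokens.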
Implementing the Training Algorithm
04:19 Next, we need a train method that takes an array of words as
input. These represent the actual word sequence from the book. We create a
sliding window on this array using a for loop, starting at index zero and going
to words.count - n.
05:17 In each iteration, we take the last n - 1 words as the context,
and we increment the counter for the last word of the window:
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]

    mutating func train(words: [String]) {
        for i in 0..<(words.count - n) {
            let context = Array(words[i..<(i + n - 1)])
            let word = words[i + n - 1]
            frequencies[context, default: [:]][word, default: 0] += 1
        }
    }
}
06:08 Since the context might not exist as a key in the frequencies
dictionary yet, we use an empty dictionary as the default value. In the second
subscript, where we increment the word count, we default to 0 for new words.
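In isolation, that defaulting subscript behaves like this:

```swift
// Dictionary's subscript with a `default:` argument reads the default
// when the key is missing and writes back through it, so counting
// needs no explicit nil check.
var counts: [String: Int] = [:]
counts["punishment", default: 0] += 1
counts["punishment", default: 0] += 1
// counts["punishment"] is now 2; "crime" is still absent.
```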
Loading Input and Training the Model
06:44 Let's try this out by calling a run function from an onAppear
block on our ContentView. Inside the function, we load the corpus.txt
resource and we convert its contents into a UTF-8 string:
func run() {
    let input = try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
}

struct ContentView: View {
    var body: some View {
        VStack {
        }
        .padding()
        .onAppear {
            run()
        }
    }
}
07:20 Now we need to convert the string into individual words — that
requires splitting the text. We could use input.split() with spaces, newlines,
or other delimiters, but we can also use the older NSString API with
enumerateSubstrings, which has a built-in option to enumerate by words:

func run() {
    let input = try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
    let nsString = (input as NSString)
    var words: [String] = []
    nsString.enumerateSubstrings(in: NSRange(location: 0, length: nsString.length), options: .byWords) { word, _, _, _ in
        words.append(word!)
    }
    print(words.suffix(1000))
}
09:40 Printing the last words confirms we're getting output, but some of
the words are uppercased. We implement simple normalization by calling
lowercased() on each word during enumeration. We could use a more
sophisticated normalization algorithm, but this suffices for our purposes.
10:27 And now that we have the words, we can create an NGram with
them. The mutating train function populates our frequency dictionary. We
return the trained NGram from the run function:
func run() -> NGram {
    let input = try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
    let nsString = (input as NSString)
    var words: [String] = []
    nsString.enumerateSubstrings(in: NSRange(location: 0, length: nsString.length), options: .byWords) { word, _, _, _ in
        words.append(word!.lowercased())
    }
    var ngram = NGram()
    ngram.train(words: words)
    return ngram
}
11:05 We assign the returned NGram to a state property on
ContentView:
struct ContentView: View {
    @State var ngram = NGram()

    var body: some View {
        VStack {
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
Using the Trained NGram
11:29 We replace the contents of the VStack with a text field for user
input:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        VStack {
            TextField("Input", text: $text)
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
12:03 We want to take the last two words of the input and generate
predictions for the next word. To convert the input string into words, we can
reuse the code we wrote earlier by extracting that into a reusable function. We
create an extension on String with a words method:
extension String {
    func words() -> [String] {
        let nsString = (self as NSString)
        var words: [String] = []
        nsString.enumerateSubstrings(in: NSRange(location: 0, length: nsString.length), options: .byWords) { word, _, _, _ in
            words.append(word!.lowercased())
        }
        return words
    }
}
13:02 To quickly see if this works, we can display the input string as a
list of individual words:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        VStack {
            TextField("Input", text: $text)
            Text("\(text.words())")
                .monospaced()
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
TODO screenshot at 13:38
13:39 Now we want to predict the next word, using our frequencies
dictionary in the NGram. Let's write a predict function that takes an input
string and returns the possible next words with their frequencies.
14:07 For a trigram — n = 3 — we want to take the last two words from
the input as the context using suffix. We look that context up in the
frequencies dictionary, which gives us a dictionary of word counts. If the
context doesn't exist as a key, we return an empty dictionary:
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]

    mutating func train(words: [String]) {
        for i in 0..<(words.count - n) {
            let context = Array(words[i..<(i + n - 1)])
            let word = words[i + n - 1]
            frequencies[context, default: [:]][word, default: 0] += 1
        }
    }

    func predict(input: [String]) -> [String: Int] {
        let context = Array(input.suffix(n - 1))
        return frequencies[context] ?? [:]
    }
}
14:42 Below the input text field, we add a list using ForEach to
iterate over predictions coming from ngram.predict(input: text). We display
each key and its frequency value in an HStack. We also sort the predictions to
show the most frequent words first:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        VStack {
            TextField("Input", text: $text)
            List {
                let pred = ngram.predict(input: text.words())
                    .sorted(by: { $0.value > $1.value })
                ForEach(pred, id: \.key) { (key, value) in
                    HStack {
                        Text(key)
                        Text(value, format: .number)
                            .bold()
                    }
                }
            }
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
16:01 Excellent — we're getting predictions that make sense for the
given context, and they're properly sorted:
TODO screenshot at 16:14
Interactive Input Suggestions
16:28 We could add one final feature: text input suggestions to accept
predictions as autocompletions. To make the suggestions clickable, we need to
specify the textInputCompletion string for each suggestion, which is the
predicted word appended to the existing text:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        let pred = ngram.predict(input: text.words())
            .sorted(by: { $0.value > $1.value })
        VStack {
            TextField("Input", text: $text)
                .textInputSuggestions {
                    ForEach(pred, id: \.key) { (key, value) in
                        Text(key)
                            .textInputCompletion(text + " " + key)
                    }
                }
            List {
                ForEach(pred, id: \.key) { (key, value) in
                    HStack {
                        Text(key)
                        Text(value, format: .number)
                            .bold()
                    }
                }
            }
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
18:04 Now we can repeatedly click the first completion and see what
happens. We quickly notice that only the last two words determine the
predictions — similar to iPhone keyboard suggestions. The completions don't
take the full input into account because we're only using a two-word
context window.
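Clicking the top suggestion over and over is effectively greedy generation, and we can do the same programmatically. This generate function is a sketch of our own, not part of the episode; the NGram type is repeated from above so the example compiles on its own:

```swift
// NGram as built above, plus a greedy generate method: repeatedly
// append the most frequent next word — the programmatic equivalent
// of always clicking the first suggestion.
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]

    mutating func train(words: [String]) {
        for i in 0..<(words.count - n) {
            let context = Array(words[i..<(i + n - 1)])
            frequencies[context, default: [:]][words[i + n - 1], default: 0] += 1
        }
    }

    func predict(input: [String]) -> [String: Int] {
        let context = Array(input.suffix(n - 1))
        return frequencies[context] ?? [:]
    }

    func generate(from start: [String], count: Int) -> [String] {
        var words = start
        for _ in 0..<count {
            // Stop when the current context has no continuations.
            guard let best = predict(input: words).max(by: { $0.value < $1.value }) else { break }
            words.append(best.key)
        }
        return words
    }
}
```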
18:38 If we increase the context by setting n = 4, we get three words
of context. In theory, this should be more intelligent, but the trade-off
becomes apparent: more context words means fewer available completions,
because the training text contains fewer exact matches for longer phrases. A
fuzzier matching strategy would be needed to make longer contexts work.
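One simple fuzzy strategy is backoff: train counts for every context length up to n - 1, and when the longest context has no match, retry with a shorter suffix. The following is a sketch of our own (we don't implement this in the episode):

```swift
// Backoff n-gram sketch: keep a frequency table per context length,
// and fall back to shorter contexts when a longer one has no matches.
struct BackoffNGram {
    var n = 4
    // Context length → (context → next-word counts).
    var tables: [Int: [[String]: [String: Int]]] = [:]

    mutating func train(words: [String]) {
        for length in 1..<n {
            for i in 0..<(words.count - length) {
                let context = Array(words[i..<(i + length)])
                tables[length, default: [:]][context, default: [:]][words[i + length], default: 0] += 1
            }
        }
    }

    func predict(input: [String]) -> [String: Int] {
        // Try the longest context first, then back off one word at a time.
        for length in stride(from: n - 1, through: 1, by: -1) {
            let context = Array(input.suffix(length))
            if let candidates = tables[length]?[context] { return candidates }
        }
        return [:]
    }
}
```

Real n-gram language models refine this further (weighting shorter contexts lower, or smoothing the counts), but the fallback idea is the core of it.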
Discussion
19:56 But this approach is still valuable. We can see how the model
works and observe how the training data shapes the predictions. Training on
Shakespeare versus Dostoevsky would yield completely different results. We're
not sure exactly how iPhone autocomplete works, but this feels very similar.
20:26 There are many paths toward an actual LLM implementation:
tokenization instead of string storage, or more sophisticated matching. We'd
have to implement fuzzy matching that finds results for contexts that don't
exactly match phrases in the training text. The training algorithm of a real LLM
would also be significantly more complex. Let's see how we can continue
exploring this next time.