00:06 Today we'll look at a very simple language model. While everyone
talks about large language models, we remember seeing a demo of someone who
implemented n-grams or Markov chains maybe 10 or 15 years ago — it was already
very impressive. When we think about it, LLMs fundamentally predict the next
token. There's a lot of complexity to make that work, but ultimately you have a
sequence of words and the model predicts what comes next.
00:49 We'll start with a small language model rather than a large one.
We've done various projects in the past — for example, using the OpenAI API to
build a simple coding
agent —
but that approach outsources the real work to large models. What we want to
explore now is how this actually works under the hood, starting with a simple
approach we might have used 15 years ago.
01:27 Our corpus is the book Crime and Punishment, which we downloaded
from Project Gutenberg. That'll be the training set for
our n-grams. The idea behind an n-gram is to use a sliding window over words.
For example, with a three-gram, we look at each combination of two words and
predict what word comes next. Every time we encounter a word sequence, we
increment a counter for the next word after those starting words. The starting
words form the context, just like in an LLM. By incrementing counters, we end up
with a list of candidate words for any two input words, along with their
occurrence frequency in the training set. The word that appears most often
becomes the most likely prediction.
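As a concrete (made-up) example before we write the real implementation, counting trigrams over a short sentence looks like this; the sentence and variable names are our own, not from the book:

```swift
// Toy trigram counting over a made-up sentence: each two-word
// context maps to counts of the word that follows it.
let words = ["the", "old", "man", "and", "the", "old", "woman"]
var counts: [[String]: [String: Int]] = [:]
for i in 0..<(words.count - 2) {
    let context = Array(words[i..<(i + 2)])
    let next = words[i + 2]
    counts[context, default: [:]][next, default: 0] += 1
}
// The context ["the", "old"] was seen twice, once followed by "man"
// and once by "woman", so both are equally frequent predictions.
```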
Building the N-gram Data Structure
02:38 This should be straightforward to implement. Let's begin with our
data structure — an NGram struct that stores frequencies, mapping contexts to
word counts. Each context is a combination of two starting words. We keep track
of the number of occurrences of each third word following the starting words,
and we store these word counts in dictionaries from String to Int:
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]
}
03:22 We can make n configurable. Setting n = 3 creates a trigram,
which means the context is two words long, and the output is one word.
03:35 We'll store plain words without tokenizing them; we'll only apply
a very basic normalization like lowercasing the words.
03:48 Tokenization — the simplest way to tokenize would be taking a word
and turning that into a number, so the same word becomes the same number. But in
practice, tokenization is more complex than that. For example, not every word
translates into a single token. By skipping tokenization, we waste a lot of
memory space because we have this dictionary from String to Int, and these
strings could be represented more efficiently, but it doesn't matter for our
purposes.
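To make that idea concrete, a minimal word-level tokenizer could intern each distinct word as an Int. This is a sketch of our own (the type and method names are invented), not something we build in this episode:

```swift
// Minimal word-to-ID tokenizer sketch: the same word always maps to
// the same Int, and previously unseen words get fresh IDs.
struct WordTokenizer {
    private(set) var ids: [String: Int] = [:]
    private(set) var vocabulary: [String] = []

    mutating func tokenize(_ word: String) -> Int {
        if let id = ids[word] { return id }
        let id = vocabulary.count
        ids[word] = id
        vocabulary.append(word)
        return id
    }
}
```

Real tokenizers (byte-pair encoding, for example) split words into subword units, which is why a single word can become multiple tokens.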
Implementing the Training Algorithm
04:19 Next, we need a train method that takes an array of words as
input. These represent the actual word sequence from the book. We create a
sliding window on this array using a for loop, starting at index zero and going
to words.count - n.
05:17 In each iteration, we take the last n - 1 words as the context,
and we increment the counter for the last word of the window:
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]

    mutating func train(words: [String]) {
        for i in 0..<(words.count - n) {
            let context = Array(words[i..<(i + n - 1)])
            let word = words[i + n - 1]
            frequencies[context, default: [:]][word, default: 0] += 1
        }
    }
}
06:08 Since the context might not exist as a key in the frequencies
dictionary yet, we use an empty dictionary as the default value. In the second
subscript, where we increment the word count, we default to 0 for new words.
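In isolation, that defaulting subscript behaves like this:

```swift
// Dictionary's subscript with a `default:` argument reads the default
// when the key is missing and writes back through it, so counting
// needs no explicit nil check.
var counts: [String: Int] = [:]
counts["punishment", default: 0] += 1
counts["punishment", default: 0] += 1
// counts["punishment"] is now 2; "crime" is still absent.
```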
Loading Input and Training the Model
06:44 Let's try this out by calling a run function from an onAppear
block on our ContentView. Inside the function, we load the corpus.txt
resource and we convert its contents into a UTF-8 string:
func run() {
    let input = try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
}

struct ContentView: View {
    var body: some View {
        VStack {
        }
        .padding()
        .onAppear {
            run()
        }
    }
}
07:20 Now we need to convert the string into individual words — that
requires splitting the text. We could use input.split() with spaces, newlines,
or other delimiters, but we can also use the older NSString API with
enumerateSubstrings, which has a built-in option to enumerate by words:

func run() {
    let input = try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
    let nsString = (input as NSString)
    var words: [String] = []
    nsString.enumerateSubstrings(in: NSRange(location: 0, length: nsString.length), options: .byWords) { word, _, _, _ in
        words.append(word!)
    }
    print(words.suffix(1000))
}
09:40 Printing the last words confirms we're getting output, but some of
the words are uppercased. We implement simple normalization by calling
lowercased() on each word during enumeration. We could use a more
sophisticated normalization algorithm, but this suffices for our purposes.
10:27 And now that we have the words, we can create an NGram with
them. The mutating train function populates our frequency dictionary. We
return the trained NGram from the run function:
func run() -> NGram {
    let input = try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
    let nsString = (input as NSString)
    var words: [String] = []
    nsString.enumerateSubstrings(in: NSRange(location: 0, length: nsString.length), options: .byWords) { word, _, _, _ in
        words.append(word!.lowercased())
    }
    var ngram = NGram()
    ngram.train(words: words)
    return ngram
}
11:05 We assign the returned NGram to a state property on
ContentView:
struct ContentView: View {
    @State var ngram = NGram()

    var body: some View {
        VStack {
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
Using the Trained NGram
11:29 We replace the contents of the VStack with a text field for user
input:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        VStack {
            TextField("Input", text: $text)
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
12:03 We want to take the last two words of the input and generate
predictions for the next word. To convert the input string into words, we can
reuse the code we wrote earlier by extracting that into a reusable function. We
create an extension on String with a words method:
extension String {
    func words() -> [String] {
        let nsString = (self as NSString)
        var words: [String] = []
        nsString.enumerateSubstrings(in: NSRange(location: 0, length: nsString.length), options: .byWords) { word, _, _, _ in
            words.append(word!.lowercased())
        }
        return words
    }
}
13:02 To quickly see if this works, we can display the input string as a
list of individual words:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        VStack {
            TextField("Input", text: $text)
            Text("\(text.words())")
                .monospaced()
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
TODO screenshot at 13:38
13:39 Now we want to predict the next word, using our frequencies
dictionary in the NGram. Let's write a predict function that takes an input
string and returns the possible next words with their frequencies.
14:07 For a trigram — n = 3 — we want to take the last two words from
the input as the context using suffix. We look that context up in the
frequencies dictionary, which gives us a dictionary of word counts. If the
context doesn't exist as a key, we return an empty dictionary:
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]

    mutating func train(words: [String]) {
        for i in 0..<(words.count - n) {
            let context = Array(words[i..<(i + n - 1)])
            let word = words[i + n - 1]
            frequencies[context, default: [:]][word, default: 0] += 1
        }
    }

    func predict(input: [String]) -> [String: Int] {
        let context = Array(input.suffix(n - 1))
        return frequencies[context] ?? [:]
    }
}
14:42 Below the input text field, we add a list using ForEach to
iterate over predictions coming from ngram.predict(input: text). We display
each key and its frequency value in an HStack. We also sort the predictions to
show the most frequent words first:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        VStack {
            TextField("Input", text: $text)
            List {
                let pred = ngram.predict(input: text.words())
                    .sorted(by: { $0.value > $1.value })
                ForEach(pred, id: \.key) { (key, value) in
                    HStack {
                        Text(key)
                        Text(value, format: .number)
                            .bold()
                    }
                }
            }
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
16:01 Excellent — we're getting predictions that make sense for the
given context, and they're properly sorted:
TODO screenshot at 16:14
Interactive Input Suggestions
16:28 We could add one final feature: text input suggestions to accept
predictions as autocompletions. To make the suggestions clickable, we need to
specify the textInputCompletion string for each suggestion, which is the
predicted word appended to the existing text:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = ""

    var body: some View {
        let pred = ngram.predict(input: text.words())
            .sorted(by: { $0.value > $1.value })
        VStack {
            TextField("Input", text: $text)
                .textInputSuggestions {
                    ForEach(pred, id: \.key) { (key, value) in
                        Text(key)
                            .textInputCompletion(text + " " + key)
                    }
                }
            List {
                ForEach(pred, id: \.key) { (key, value) in
                    HStack {
                        Text(key)
                        Text(value, format: .number)
                            .bold()
                    }
                }
            }
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
18:04 Now we can repeatedly click the first completion and see what
happens. We quickly notice that only the last two words determine the
predictions — similar to iPhone keyboard suggestions. The completions don't
take the full input into account because we're only using a two-word
context window.
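Clicking the top suggestion over and over is effectively greedy generation, and we can do the same programmatically. This generate function is a sketch of our own, not part of the episode; the NGram type is repeated from above so the example compiles on its own:

```swift
// NGram as built above, plus a greedy generate method: repeatedly
// append the most frequent next word — the programmatic equivalent
// of always clicking the first suggestion.
struct NGram {
    var n = 3
    typealias Context = [String]
    var frequencies: [Context: [String: Int]] = [:]

    mutating func train(words: [String]) {
        for i in 0..<(words.count - n) {
            let context = Array(words[i..<(i + n - 1)])
            frequencies[context, default: [:]][words[i + n - 1], default: 0] += 1
        }
    }

    func predict(input: [String]) -> [String: Int] {
        let context = Array(input.suffix(n - 1))
        return frequencies[context] ?? [:]
    }

    func generate(from start: [String], count: Int) -> [String] {
        var words = start
        for _ in 0..<count {
            // Stop when the current context has no continuations.
            guard let best = predict(input: words).max(by: { $0.value < $1.value }) else { break }
            words.append(best.key)
        }
        return words
    }
}
```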
18:38 If we increase the context by setting n = 4, we get three words
of context. In theory, this should be more intelligent, but the trade-off
becomes apparent: more context words means fewer available completions,
because the training text contains fewer exact matches for longer phrases. A
fuzzier matching strategy would be needed to make longer contexts work.
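One simple fuzzy strategy is backoff: train counts for every context length up to n - 1, and when the longest context has no match, retry with a shorter suffix. The following is a sketch of our own (we don't implement this in the episode):

```swift
// Backoff n-gram sketch: keep a frequency table per context length,
// and fall back to shorter contexts when a longer one has no matches.
struct BackoffNGram {
    var n = 4
    // Context length → (context → next-word counts).
    var tables: [Int: [[String]: [String: Int]]] = [:]

    mutating func train(words: [String]) {
        for length in 1..<n {
            for i in 0..<(words.count - length) {
                let context = Array(words[i..<(i + length)])
                tables[length, default: [:]][context, default: [:]][words[i + length], default: 0] += 1
            }
        }
    }

    func predict(input: [String]) -> [String: Int] {
        // Try the longest context first, then back off one word at a time.
        for length in stride(from: n - 1, through: 1, by: -1) {
            let context = Array(input.suffix(length))
            if let candidates = tables[length]?[context] { return candidates }
        }
        return [:]
    }
}
```

Real n-gram language models refine this further (weighting shorter contexts lower, or smoothing the counts), but the fallback idea is the core of it.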
Discussion
19:56 But this approach is still valuable. We can see how the model
works and observe how the training data shapes the predictions. Training on
Shakespeare versus Dostoevsky would yield completely different results. We're
not sure exactly how iPhone autocomplete works, but this feels very similar.
20:26 There are many paths toward an actual LLM implementation:
tokenization instead of string storage, or more sophisticated matching. We'd
have to implement fuzzy matching that finds results for contexts that don't
exactly match phrases in the training text. The training algorithm of a real LLM
would also be significantly more complex. Let's see how we can continue
exploring this next time.