Published on: 3/23/2026
A quick explainer of some commonly used terms related to LLMs.
LLMs may seem magical, but underneath they are computer programs. By understanding how they work, we can understand some of their pitfalls and use them more efficiently. My aim here is to explain the concepts of token, model context, and model context size, and to cover context rot as one of the potential pitfalls. I use LLM and model interchangeably in this text.
Model context is everything that the model uses for generating the next piece of text. The main piece of context is usually the chat message that you wrote, but it also includes the uploaded documents and other information that the system automatically includes.
The automatically included parts can be high-level and not specific to the brand of LLM used, like the system prompt (a description of how the system should behave), or low-level and specific to the brand of LLM used, such as turn-taking details.
Here’s a literal example of what goes into the LLM when a chat template for the MiniMax model is applied:
<|BOS|><|START_SYSTEM|>
You are MiniMax, and you are an expert scientist. Make sure to give very erudite intellectual answer to the users - you need to show that you're smart.
<|END_TURN|>
<|START_USER|>
Hi! How are you?
<|END_TURN|>
<|START_AI|>
Good.
<|END_TURN|>
<|START_USER|>
Can you tell me when life appeared on earth?
<|END_TURN|>
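The template above can be rendered programmatically. Here is a minimal sketch of how a chat template turns a list of messages into the single string the model actually sees; the special tokens mirror the illustrative ones above, not the real MiniMax vocabulary.

```python
# Render a list of (role, text) messages into one prompt string.
# The special tokens here are illustrative, not real MiniMax tokens.
ROLE_TAGS = {
    "system": "<|START_SYSTEM|>",
    "user": "<|START_USER|>",
    "ai": "<|START_AI|>",
}

def render_chat(messages):
    parts = ["<|BOS|>"]  # beginning-of-sequence marker
    for role, text in messages:
        parts.append(f"{ROLE_TAGS[role]}\n{text}\n<|END_TURN|>\n")
    return "".join(parts)

chat = [
    ("system", "You are MiniMax, and you are an expert scientist."),
    ("user", "Hi! How are you?"),
    ("ai", "Good."),
    ("user", "Can you tell me when life appeared on earth?"),
]
print(render_chat(chat))
```

Every turn, including the model's own earlier replies, is serialized into this one string before generation.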
Notably, when you chat with an LLM, all of the messages in a single chat, both yours and the LLM’s, are included in its history.
The transformer architecture that underlies LLM systems is a text-producing architecture.
What does it mean to produce text? In transformers, it means computing the most likely next token.
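A toy illustration of that last step (not a real model): the model emits one score, a logit, per vocabulary token; softmax turns the scores into probabilities, and greedy decoding picks the most probable next token. The vocabulary and scores below are made up.

```python
import math

# Made-up vocabulary and logits for continuing "Life appeared on ..."
vocab = ["Earth", "Mars", "cheese", "."]
logits = [3.2, 1.1, -0.5, 0.7]

def softmax(xs):
    # Subtract the max for numerical stability, then normalize.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding: pick the argmax
print(next_token)
```

Generation repeats this step, appending each chosen token to the context and scoring again, which is why everything in the context influences every subsequent token.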
Tokens are combinations of characters, computed from the text on which the LLM is trained. Each token has a unique embedding, a representation as a vector of real numbers, and an LLM is a “next token generation engine”. Tokens are the technical solution that proved to work best: there are too many words in a vocabulary to capture all of them, and characters are too small as building blocks. The solution: variable-length sequences of characters, i.e. tokens.
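A heavily simplified sketch of how such variable-length tokens arise, in the style of byte-pair encoding (BPE): start from characters and repeatedly merge the most frequent adjacent pair into a new token. Real tokenizers learn tens of thousands of merges from a huge corpus; this shows just two merge steps on a tiny string.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent pair and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair):
    # Replace every occurrence of `pair` with a single merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(2):  # two merge rounds: 'l'+'o' -> 'lo', then 'lo'+'w' -> 'low'
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)
```

After two merges, the frequent fragment "low" has become a single token while rarer characters like "s" and "t" remain on their own, which is exactly the word/character trade-off described above.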
Here’s an example of the tokenized text from above: it comes out to 69 tokens.
Context size is the maximum number of tokens that a model can ingest when producing text. It is impossible for a model to attend to more tokens than its context size.
There are simple rules of thumb for converting between tokens and words/characters; the most common one for English text is that one token corresponds to roughly 4 characters, or about three quarters of a word.
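That rule of thumb is easy to turn into a quick estimator. This is only an approximation, assuming English prose and the common "about 4 characters per token" heuristic; real counts depend on the specific model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic for English text: ~4 characters per token.
    # Real token counts depend on the model's actual tokenizer.
    return max(1, round(len(text) / 4))

question = "Can you tell me when life appeared on earth?"
print(estimate_tokens(question))  # 44 characters -> about 11 tokens
```

Such estimates are handy for checking whether a document will fit in a model's context before sending it.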
Flagship models advertise context sizes of 1 million tokens - enough to fit several books’ worth of text. This is a lot of text - and even though we could fit whole books inside the context, it might not be the best idea.
Context rot is the phenomenon where more tokens in the context worsen the quality of the output. Ideally, you wouldn’t give the LLM 3 full books, but the most relevant chapter from each book. That’s why, even though it might seem to make sense to put in all the scientific papers you care about at once, choosing is still beneficial. LLMs will get better and better at handling larger contexts, but this is fundamentally hard to test, and it will take solid effort to improve in this regard. Until then, we have other techniques, such as dynamically searching for content and putting it into the context (also known as RAG - Retrieval Augmented Generation).
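The retrieval step of a RAG system can be sketched in a few lines. This toy version scores each document chunk by word overlap with the question and puts only the best-matching chunks into the context; real systems typically use vector embeddings and a vector database instead of word overlap, and the chunks below are invented for illustration.

```python
import string

def words(text: str) -> set[str]:
    # Lowercase and strip punctuation so "life." matches "life".
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by how many question words they share; keep the top k.
    q_words = words(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & words(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Chapter 3: The earliest evidence of life on Earth dates to about 3.7 billion years ago.",
    "Chapter 7: Plate tectonics reshaped the continents over hundreds of millions of years.",
    "Chapter 9: Fossilised microbial mats called stromatolites record early life.",
]
context = retrieve("when did life appear on earth", chunks)
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

The point is that only the two relevant chapters end up in the prompt, keeping the context small and focused instead of pasting in every document you have.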