Decoding the Signal
Concept:
LLMs don't read words — they read tokens. Tokenization splits text into subword pieces using Byte Pair Encoding (BPE). 'Hello' is one token, but 'tokenization' might be split into ['token', 'ization']. Spaces often attach to the next word. This is why API pricing is per-token, and why context windows are measured in tokens (e.g., 128K tokens ≈ 300 pages).
Science Officer Chen:
Commander, I've been analyzing how ARIA processes our transmissions. It doesn't read our words the way we do.
Commander Vega:
What do you mean? We send text, it responds with text.
Science Officer Chen:
Yes, but between sending and receiving, ARIA breaks our text into fragments called tokens. 'Hello world' isn't two words to ARIA — it's two tokens: 'Hello' and ' world'. Notice the space is attached to 'world'.
Commander Vega:
Why not just use words?
Science Officer Chen:
Because words are messy. Different languages, compound words, code, numbers... Instead, ARIA uses Byte Pair Encoding: starting from individual characters, training repeatedly merges the most frequent adjacent pair in the data until the vocabulary reaches a fixed size. Common words like 'the' end up as single tokens. Rare words get split: 'tokenization' becomes 'token' + 'ization'.
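The merging Chen describes can be sketched as a toy BPE trainer in a few lines of Python. This is a minimal sketch for intuition only: the corpus, the merge count, and the tie-breaking are illustrative assumptions, and a real tokenizer trains on terabytes of text and learns tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters and apply a few merges, as BPE training does.
tokens = list("low lower lowest")
for _ in range(4):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a couple of merges the frequent sequence 'low' has become a single symbol, while rarer endings like 'er' and 'est' remain split, which is exactly why common words are one token and rare words are several.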
Commander Vega:
So that's what 'tokens' means when ARIA reports usage. Our last transmission was 56 tokens.
Science Officer Chen:
Exactly. And ARIA has a context window — the maximum number of tokens it can process at once. 128,000 tokens is roughly 300 pages. Everything beyond that is invisible to the intelligence.
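Chen's "300 pages" figure can be sanity-checked with the common rule of thumb of roughly 4 characters per token for English prose. The characters-per-page constant below is an assumption for illustration, and the estimate is only a heuristic: the true count always requires the actual tokenizer.

```python
CONTEXT_WINDOW = 128_000   # tokens, as in Chen's example
CHARS_PER_TOKEN = 4        # rough heuristic for English text
CHARS_PER_PAGE = 1_800     # ~one typical printed page (assumption)

def estimate_tokens(text: str) -> int:
    """Rough token estimate; the real count needs the real tokenizer."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(text: str) -> bool:
    """Would this text fit in ARIA's context window, by the heuristic?"""
    return estimate_tokens(text) <= CONTEXT_WINDOW

# 128,000 tokens * 4 chars/token / 1,800 chars/page ≈ 284 pages,
# which matches the 'roughly 300 pages' figure.
pages = CONTEXT_WINDOW * CHARS_PER_TOKEN / CHARS_PER_PAGE
print(round(pages))
```

Anything past the window is simply never seen: the model does not summarize or skim the overflow, it is cut off.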
Commander Vega:
Let me try the signal decoder. I want to see how ARIA would read my transmissions.
Example Code:
Input: "Hello world"
Token splits:
[0] "Hello" (ID: 9906)
[1] " world" (ID: 1917)
Total tokens: 2
Characters: 11
Ratio: 5.5 chars per token
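The decoder output above can be reproduced with a greedy longest-match tokenizer over a tiny hardcoded vocabulary. This is a sketch under heavy assumptions: the two-entry vocabulary reuses the IDs shown above, real vocabularies hold on the order of 100,000 entries, and production BPE applies learned merge rules rather than longest-match lookup.

```python
# Miniature vocabulary; the two IDs come from the decoder output above.
VOCAB = {"Hello": 9906, " world": 1917}

def greedy_tokenize(text, vocab):
    """Longest-match tokenization against a fixed vocabulary.
    (A simplification: real BPE applies learned merge rules instead.)"""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j]
            if piece in vocab:
                tokens.append((piece, vocab[piece]))
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

text = "Hello world"
tokens = greedy_tokenize(text, VOCAB)
for idx, (piece, tid) in enumerate(tokens):
    print(f'[{idx}] "{piece}" (ID: {tid})')
print(f"Total tokens: {len(tokens)}")
print(f"Ratio: {len(text) / len(tokens)} chars per token")
```

Note how the space travels with 'world' rather than standing alone, exactly as Chen pointed out, and how the chars-per-token ratio (5.5 here) is what makes per-token pricing and context budgeting differ from a simple character count.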
Your Assignment
Type any text and see how the tokenizer breaks it into tokens. Try different things: a simple sentence, a long word like 'internationalization', or even some code like 'print("hello")'.
LLM Console