Decoding the Signal
Concept:
LLMs don't read words — they read tokens. Tokenization splits text into subword pieces using Byte Pair Encoding (BPE). 'Hello' is one token, but 'tokenization' might be split into ['token', 'ization']. Spaces often attach to the next word. This is why API pricing is per-token, and why context windows are measured in tokens (e.g., 128K tokens ≈ 300 pages).
Science Officer Chen:
Commander, I've been analyzing how ARIA processes our transmissions. It doesn't read our words the way we do.
Commander Vega:
What do you mean? We send text, it responds with text.
Science Officer Chen:
Yes, but between sending and receiving, ARIA breaks our text into fragments called tokens. 'Hello world' isn't two words to ARIA — it's two tokens: 'Hello' and ' world'. Notice the space is attached to 'world'.
Commander Vega:
Why not just use words?
Science Officer Chen:
Because words are messy. Different languages, compound words, code, numbers... Instead, ARIA uses Byte Pair Encoding: starting from individual characters, training repeatedly merges the most frequent adjacent pair in the data until the vocabulary reaches a fixed size. Common words like 'the' end up as single tokens. Rare words get split: 'tokenization' becomes 'token' + 'ization'.
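The merging Chen describes can be sketched as a toy BPE trainer in a few lines of Python. This is a minimal sketch for intuition only: the corpus, the merge count, and the tie-breaking are illustrative assumptions, and a real tokenizer trains on terabytes of text and learns tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters and apply a few merges, as BPE training does.
tokens = list("low lower lowest")
for _ in range(4):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a couple of merges the frequent sequence 'low' has become a single symbol, while rarer endings like 'er' and 'est' remain split, which is exactly why common words are one token and rare words are several.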
Commander Vega:
So that's what 'tokens' means when ARIA reports usage. Our last transmission was 56 tokens.
Science Officer Chen:
Exactly. And ARIA has a context window — the maximum number of tokens it can process at once. 128,000 tokens is roughly 300 pages. Everything beyond that is invisible to the intelligence.
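Chen's "300 pages" figure can be sanity-checked with the common rule of thumb of roughly 4 characters per token for English prose. The characters-per-page constant below is an assumption for illustration, and the estimate is only a heuristic: the true count always requires the actual tokenizer.

```python
CONTEXT_WINDOW = 128_000   # tokens, as in Chen's example
CHARS_PER_TOKEN = 4        # rough heuristic for English text
CHARS_PER_PAGE = 1_800     # ~one typical printed page (assumption)

def estimate_tokens(text: str) -> int:
    """Rough token estimate; the real count needs the real tokenizer."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(text: str) -> bool:
    """Would this text fit in ARIA's context window, by the heuristic?"""
    return estimate_tokens(text) <= CONTEXT_WINDOW

# 128,000 tokens * 4 chars/token / 1,800 chars/page ≈ 284 pages,
# which matches the 'roughly 300 pages' figure.
pages = CONTEXT_WINDOW * CHARS_PER_TOKEN / CHARS_PER_PAGE
print(round(pages))
```

Anything past the window is simply never seen: the model does not summarize or skim the overflow, it is cut off.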
Commander Vega:
Let me try the signal decoder. I want to see how ARIA would read my transmissions.
Example Code:
Input: "Hello world"
Token splits:
[0] "Hello" (ID: 9906)
[1] " world" (ID: 1917)
Total tokens: 2
Characters: 11
Ratio: 5.5 chars per token
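The decoder output above can be reproduced with a greedy longest-match tokenizer over a tiny hardcoded vocabulary. This is a sketch under heavy assumptions: the two-entry vocabulary reuses the IDs shown above, real vocabularies hold on the order of 100,000 entries, and production BPE applies learned merge rules rather than longest-match lookup.

```python
# Miniature vocabulary; the two IDs come from the decoder output above.
VOCAB = {"Hello": 9906, " world": 1917}

def greedy_tokenize(text, vocab):
    """Longest-match tokenization against a fixed vocabulary.
    (A simplification: real BPE applies learned merge rules instead.)"""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j]
            if piece in vocab:
                tokens.append((piece, vocab[piece]))
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

text = "Hello world"
tokens = greedy_tokenize(text, VOCAB)
for idx, (piece, tid) in enumerate(tokens):
    print(f'[{idx}] "{piece}" (ID: {tid})')
print(f"Total tokens: {len(tokens)}")
print(f"Ratio: {len(text) / len(tokens)} chars per token")
```

Note how the space travels with 'world' rather than standing alone, exactly as Chen pointed out, and how the chars-per-token ratio (5.5 here) is what makes per-token pricing and context budgeting differ from a simple character count.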
Your Assignment
Type any text and see how the tokenizer breaks it into tokens. Try different things: a simple sentence, a long word like 'internationalization', or even some code like 'print("hello")'.
LLM Console