Dev · May 27, 2026 · 8 min read

Why AI charges per 'piece of a word': I built an in-browser tokenizer so you can see it

I opened my AI API bill, got a scare, and realized almost nobody truly gets what a 'token' is — the unit you pay for. Paste any text and watch the model chop your words into colored pieces, live.

The other day I opened the monthly bill for the AI API I use in a few projects and got an honest scare. It wasn't an outrageous amount, but it had grown far more than I expected — and the bill was itemized in a unit that, let's be honest, almost nobody truly understands: tokens. “You used 4.2 million input tokens and 1.1 million output tokens.” Right. And what is a token, anyway?

I tried to explain it to a client in a meeting and realized the visual part was missing. We say “a token is a piece of a word” and everyone nods, but nobody actually sees it happen. So I did what I always do when an idea won't leave my head: I opened the editor and built the thing. There's a tokenizer running right below, in your browser, with no server at all. Paste some text and watch the model chop your words into colored little pieces, live. But first, let me tell you why this matters for your wallet.

The model doesn't read letters. Or words. It reads tokens.

When you send “Good morning, how are you?” to an AI, the model doesn't see it as letters or as whole words. Before anything else, a little program called a tokenizer chops the text into a sequence of pieces — the tokens — and each piece becomes a number. The model only talks in numbers.
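To make "each piece becomes a number" concrete, here is a minimal sketch. The vocabulary is made up by hand for this one sentence (a real model ships its own list with tens of thousands of entries), and `toyVocab`, `piecesToIds` and `idsToText` are illustrative names, not any provider's API.

```javascript
// Hypothetical, hand-made vocabulary. Real models pick their own pieces and IDs.
const toyVocab = ["?", ",", " you", " are", " how", "Good", " morning"];

// A piece's ID is simply its position in the vocabulary.
function piecesToIds(pieces, vocab) {
  return pieces.map((piece) => {
    const id = vocab.indexOf(piece);
    if (id === -1) throw new Error(`Piece not in vocabulary: ${piece}`);
    return id;
  });
}

// Reverse the lookup and glue the pieces back together.
function idsToText(ids, vocab) {
  return ids.map((id) => vocab[id]).join("");
}

const pieces = ["Good", " morning", ",", " how", " are", " you", "?"];
const ids = piecesToIds(pieces, toyVocab);
console.log(ids);                      // [5, 6, 1, 4, 3, 2, 0]
console.log(idsToText(ids, toyVocab)); // Good morning, how are you?
```

The list of numbers is all the model ever receives. The text only comes back when the IDs are looked up again on the way out.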

The rule of thumb worth memorizing: in English, 1 token ≈ 4 characters ≈ ¾ of a word. Common words (“the”, “dog”, “house”) are usually a single token. Rare words, proper nouns and technical terms get broken into several pieces. That fragmentation is exactly the part that affects your bill.
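The rule of thumb turns into a back-of-the-envelope estimator in a few lines. This is a rough sketch: the 4-characters-per-token ratio only holds for English-like text, and the prices in the usage example are placeholders I made up, not any provider's real rates.

```javascript
// Rule of thumb from the text: about 4 characters per token in English.
const CHARS_PER_TOKEN = 4;

// Rough estimate only. text.length counts UTF-16 units, so emoji count double.
function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Prices are in USD per million tokens and are passed in by the caller.
function estimateCostUsd(inputTokens, outputTokens, pricePerMInput, pricePerMOutput) {
  return (inputTokens / 1e6) * pricePerMInput + (outputTokens / 1e6) * pricePerMOutput;
}

console.log(estimateTokens("Good morning, how are you?")); // 26 characters, so 7

// The bill from the intro, with placeholder prices of $3 (input) and $15 (output).
console.log(estimateCostUsd(4.2e6, 1.1e6, 3, 15).toFixed(2)); // "29.10"
```

Output tokens are usually priced higher than input tokens, which is why the two counts are kept separate instead of summed.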

Why “coração” costs more than “heart”

Here's the catch that surprises a lot of people: Portuguese text, and accented characters in particular, costs more. The tokenizers of the big models were trained mostly on English text, so English “fits” better: each token packs in more letters. When you write with accents, cedillas and tildes, the tokenizer often needs more tokens to represent the same idea.

Accents. A letter like “ç” or “ã” can internally become more than one piece, because many tokenizers process text as UTF-8 bytes and each of those characters takes two bytes.
Emojis. That cute 🚀 you pasted into the email can be worth 2 or 3 tokens by itself, since it takes four bytes in UTF-8. Emoji is expensive.
Code and URLs. Full of symbols, indentation and odd names, they tend to fragment a lot, which explains why asking an AI to “read” a big code file weighs so much.
Language. The same text translated into another language can have a very different token count. Portuguese and Spanish almost always spend more tokens than the equivalent English.
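You can check the byte claim yourself with the standard `TextEncoder`, which is available in browsers and in Node. The sketch below only counts UTF-8 bytes. It does not count tokens, but it shows the raw material a byte-level tokenizer starts from (`utf8ByteLength` is just a helper name I picked).

```javascript
const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

function utf8ByteLength(text) {
  // Normalize so "ç" is one precomposed character, not "c" plus a combining mark.
  return encoder.encode(text.normalize("NFC")).length;
}

for (const sample of ["heart", "coração", "ç", "ã", "🚀"]) {
  const chars = [...sample.normalize("NFC")].length; // code points, not UTF-16 units
  console.log(`${sample}: ${chars} character(s), ${utf8ByteLength(sample)} byte(s)`);
}
// heart: 5 character(s), 5 byte(s)
// coração: 7 character(s), 9 byte(s)
// ç: 1 character(s), 2 byte(s)
// ã: 1 character(s), 2 byte(s)
// 🚀: 1 character(s), 4 byte(s)
```

More bytes does not automatically mean more tokens, since frequent byte sequences get merged back into single pieces. It does mean the tokenizer has more to merge before an accented word can become as cheap as an unaccented one.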

In other words: your AI bill has a slight built-in “language tax”. You can't fully escape it, but you can see it — and once you see it, you start making better decisions about prompts, context and what's worth sending to the model at all.

A token isn't a word, a letter or a syllable. It's a statistical chunk the tokenizer learned to keep whole because its characters show up together a lot. “ção” becomes a chunk because Portuguese is full of it. “ing” becomes a chunk in English for the same reason.

How the machine decides where to cut: BPE

The most common technique is called BPE (Byte Pair Encoding), and the idea is delightfully simple. You start by treating each character (or, in byte-level variants, each byte) as a separate piece. Then you find the pair of neighboring pieces that appears most often across all the training text and merge the two into a single piece. Repeat that thousands of times, and each merge adds one new piece to the vocabulary.

In the end, the pieces that appear a lot (“ção”, “ment”, “the”) become single tokens, and the rare ones stay fragmented into letters. That's why a common word costs 1 token and a rare word costs several: the model's vocabulary “memorized” the frequent pieces and improvises the rest letter by letter.
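The merge loop described above fits in a short sketch. It is a toy, under a few assumptions of mine: it works on characters rather than bytes, trains on a four-word "corpus", breaks ties by whichever pair it saw first, and stops when no pair occurs at least twice. The function names are illustrative.

```javascript
// Count how often each adjacent pair of pieces occurs across all words.
function countPairs(words) {
  const counts = new Map();
  for (const pieces of words) {
    for (let i = 0; i < pieces.length - 1; i++) {
      const key = pieces[i] + "\u0000" + pieces[i + 1]; // NUL as a safe separator
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  return counts;
}

// Replace every occurrence of the pair (left, right) with the merged piece.
function mergePair(pieces, left, right) {
  const merged = [];
  for (let i = 0; i < pieces.length; i++) {
    if (i < pieces.length - 1 && pieces[i] === left && pieces[i + 1] === right) {
      merged.push(left + right);
      i++; // skip the right-hand piece we just consumed
    } else {
      merged.push(pieces[i]);
    }
  }
  return merged;
}

// Learn up to `numMerges` merge rules from a list of words.
function trainBpe(corpus, numMerges) {
  let words = corpus.map((word) => [...word.normalize("NFC")]); // single characters
  const merges = [];
  for (let step = 0; step < numMerges; step++) {
    let best = null;
    let bestCount = 1; // a pair must occur at least twice to be worth merging
    for (const [key, count] of countPairs(words)) {
      if (count > bestCount) { best = key; bestCount = count; }
    }
    if (best === null) break;
    const [left, right] = best.split("\u0000");
    merges.push([left, right]);
    words = words.map((pieces) => mergePair(pieces, left, right));
  }
  return { merges, words };
}

const { merges, words } = trainBpe(["ação", "nação", "canção", "coração"], 2);
console.log(merges); // [["ç", "ã"], ["çã", "o"]]
console.log(words);  // [["a","ção"], ["n","a","ção"], ["c","a","n","ção"], ["c","o","r","a","ção"]]
```

After just two merges, “ção” is already a single piece in every word, while the rarer beginnings are still spelled out letter by letter. Real tokenizers run the same kind of loop over far more text, for tens of thousands of merges.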

The tokenizer I built below is a simplified, educational version of this process: it has a small dictionary of common Portuguese and English merges baked in, and when it doesn't recognize a piece, it falls back to splitting by character. It's not the exact tokenizer of OpenAI or Anthropic — each model has its own — but the intuition it gives you is the same: you see right away where the cuts happen and why accents and emoji cost so much.
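To give a feel for the "small dictionary plus character fallback" idea, here is one way it could look. This is not the code behind the widget: the chunk list is a tiny made-up sample, and the matching strategy (greedy longest match, which is simpler than replaying BPE merges in order) is my assumption.

```javascript
// Hypothetical chunk list. The real tokenizer.js has its own dictionary.
const KNOWN_CHUNKS = new Set(["the", "ing", "ment", "ção"].map((c) => c.normalize("NFC")));
const MAX_CHUNK_LENGTH = Math.max(...[...KNOWN_CHUNKS].map((c) => [...c].length));

function toyTokenize(text) {
  const chars = [...text.normalize("NFC")]; // code points, so an emoji stays whole
  const tokens = [];
  let i = 0;
  while (i < chars.length) {
    let matched = null;
    // Try the longest candidate first, then shorter ones.
    for (let len = Math.min(MAX_CHUNK_LENGTH, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (KNOWN_CHUNKS.has(candidate)) { matched = candidate; break; }
    }
    if (matched === null) matched = chars[i]; // fallback: a single character
    tokens.push(matched);
    i += [...matched].length;
  }
  return tokens;
}

console.log(toyTokenize("coração")); // ["c", "o", "r", "a", "ção"]
console.log(toyTokenize("payment")); // ["p", "a", "y", "ment"]
```

With a dictionary this small almost everything falls back to single characters, which exaggerates the counts. The shape of the result is still the useful part: known chunks stay whole and everything else shatters.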

My honest take

After that bill, I changed a few things in how I build AI features for clients. I started measuring tokens before shipping to production, trimming context that added nothing, and explaining to the client, numbers in hand, why “sending the whole PDF every time” gets expensive. It's not about saving pennies — it's about understanding the economic unit of the product you're building. Whoever treats tokens as a technical detail eventually gets a scare on the bill. Whoever treats them as a product decision sleeps soundly.

Enough theory. Paste some text below — one of your emails, a snippet of code, a sentence full of accents — and watch the magic (and the cost) happen 👇

tokenizer.js

Educational tokenizer in pure JS, 100% in your browser — no text leaves this page. Counts are approximate (each real model uses its own tokenizer).

Played with the examples? Noticed how the emoji blows up the count and how the code snippet fragments everything? That's more or less how I work: I take a real question — in this case, “why did my AI bill go up?” — and turn it into a little thing you can touch and understand in thirty seconds. If you have a product with AI in the mix, or just want to understand how much that brilliant idea will actually cost to run, I'd love to chat.
