BrunoP.Blog

My AI agent looped and nearly torched the budget: the 'token bucket' that keeps the bill in check

An autonomous agent looped, hammering an expensive tool, and became a financial DoS against me. I'll show you the classic brake that serious systems put in front of expensive calls, visualized as a dripping token bucket.

The other night I left an AI agent running on its own — one of those that takes a task, thinks, calls tools, reads the result, and decides the next step. The idea was noble: it would tidy up a pile of data for me while I slept. I woke up, opened the API cost dashboard, and nearly knocked my coffee onto the keyboard. The thing had gone into a loop — endlessly calling an expensive tool that hits an external service — and spent the whole night hammering the same call thousands of times.

It wasn't an attack from outside. It was me, by accident, building a financial DoS against my own account. The agent had no notion of a "brake": as far as it was concerned, if the task wasn't done, just try again. And again. And again. Luckily the damage was small because I had a spending cap on the platform — but the scare sent me back to a backend pattern I already knew well and that, honestly, should have been there from day one: the token bucket.

The new 2026 headache: the agent that attacks itself

We always thought of rate limiting as a defense against others: the bot trying to log in a thousand times, the scraper sucking your API dry, the joker hammering F5. But autonomous agents created a brand-new category of problem — the system that attacks itself. A badly closed loop, a stop condition that's never met, a tool that returns an error the agent reinterprets as "try again": there you go, a perfect machine for burning money and blowing through quota.

The cruel part is that the cost is real and immediate. Every expensive tool call (a paid search, an image generation, a query to an external database) has a price. Multiply that by thousands of iterations per minute and the bill climbs call after call while you sleep peacefully, thinking it's "working."

The token bucket, explained for real

The token bucket is one of the most widely used rate-limiting algorithms around: AWS API Gateway documents its throttling as a token bucket, nginx's limit_req uses the closely related leaky bucket, and plenty of public APIs behave like one or the other. And its beauty is that it fits into one simple mental image: picture a bucket.

  • The bucket has a maximum capacity of tokens (say, 10).
  • A tap drips new tokens into it at a constant rate (for example, 2 tokens per second).
  • Every time a request arrives, it has to grab a token to be served.
  • If there's a token, the request passes and the token vanishes. If the bucket is empty, the request is rejected on the spot.
  • The bucket never overflows: once it's full, the extra tokens that would drip in are simply lost.

That's it. And that simplicity is exactly what gives it its most valuable property: it allows short bursts but limits the average. If the bucket is full and 10 requests arrive at once, they all pass, which is great for legitimate spikes. But if requests keep coming faster than the tap refills, the bucket empties and the excess starts hitting the wall. That's the brake: it lets normal use flow and cuts off abuse, with two knobs (capacity for the burst, refill rate for the sustained pace) instead of one rigid "X per second" cap.
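
To make the bucket concrete, here is a minimal sketch in JavaScript using the numbers from the list above (capacity 10, tap at 2 tokens per second). The class name TokenBucket, the method tryTake, and the injected now clock are my own illustrative choices, not taken from any library; the fake clock is only there so the behavior can be checked without actually waiting.

```javascript
// Minimal token bucket. Tokens are refilled lazily: nothing runs in the
// background, the level is recomputed whenever a request arrives.
// `now` returns the current time in seconds and is injected for testing.
class TokenBucket {
  constructor(capacity, refillPerSec, now = () => Date.now() / 1000) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity; // start full, so an initial burst is allowed
    this.lastRefill = now();
  }

  // Returns true and spends one token if one is available, false otherwise.
  tryTake() {
    const t = this.now();
    const elapsed = t - this.lastRefill;
    // Add what dripped in since the last check, never above capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Demo with a fake clock: capacity 10, tap at 2 tokens per second.
let clock = 0;
const bucket = new TokenBucket(10, 2, () => clock);

// 12 requests at the same instant: the first 10 pass, the last 2 are rejected.
const burst = Array.from({ length: 12 }, () => bucket.tryTake());
const passedInBurst = burst.filter(Boolean).length;

// One second later 2 tokens have dripped in, so 2 of the next 3 pass.
clock += 1;
const afterOneSecond = [bucket.tryTake(), bucket.tryTake(), bucket.tryTake()];

console.log(passedInBurst, afterOneSecond); // 10 [ true, true, false ]
```

The burst is served from the full bucket, and after that the tap alone sets the pace.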

My agent's loop would have hit that wall within a few seconds. Instead of thousands of expensive calls, it would have gotten only the initial burst plus whatever the tap refills afterwards, and every rejection would have been a clear signal of "hey, something's wrong here."
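
The ceiling is easy to put a number on. A bucket that starts full can never admit more than its capacity plus whatever dripped in during the interval. The sketch below is back-of-the-envelope arithmetic, not a measurement of my actual incident: the loop speed (50 attempts per second) and the helper name maxAdmitted are assumptions for illustration.

```javascript
// Upper bound on the calls a token bucket admits over `seconds`,
// assuming it starts full and every call costs one token.
function maxAdmitted(capacity, refillPerSec, seconds) {
  return Math.floor(capacity + refillPerSec * seconds);
}

// Hypothetical runaway loop: 50 attempts per second for 10 seconds,
// against a bucket with capacity 10 and a tap of 2 tokens per second.
const attempts = 50 * 10;
const admitted = Math.min(attempts, maxAdmitted(10, 2, 10));
const rejected = attempts - admitted;
console.log({ attempts, admitted, rejected });
// { attempts: 500, admitted: 30, rejected: 470 }

// The same settings left alone for 8 hours still admit a lot of calls.
const overnight = maxAdmitted(10, 2, 8 * 3600);
console.log(overnight); // 57610
```

The second number matters: a bucket limits the pace, not the total. For an expensive tool, the refill rate has to be chosen with the price per call in mind.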

Why not just a simple counter?

The obvious question is: why not just count "max 100 per minute" and reset the counter every minute? That's the fixed window, and it has a classic flaw: the window-boundary problem. If your limit is 100 per minute, you can send 100 at 12:00:59 and another 100 at 12:01:00 — 200 calls in two seconds, all within the rules. The token bucket doesn't have that hole because it reasons about a continuous refill rate, not about chopped-up time blocks.
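
Here is the boundary problem reproduced in a few lines. This is a sketch of a naive fixed-window counter; the class name FixedWindowCounter is mine, and timestamps are passed in as seconds since midnight so the 12:00:59 and 12:01:00 example can be replayed exactly.

```javascript
// Naive fixed-window counter: at most `limit` requests per window,
// and the count resets whenever a new window starts.
class FixedWindowCounter {
  constructor(limit, windowSec) {
    this.limit = limit;
    this.windowSec = windowSec;
    this.windowId = null;
    this.count = 0;
  }

  // `t` is a timestamp in seconds.
  tryTake(t) {
    const id = Math.floor(t / this.windowSec);
    if (id !== this.windowId) {
      this.windowId = id; // a new window has started: reset the counter
      this.count = 0;
    }
    if (this.count < this.limit) {
      this.count += 1;
      return true;
    }
    return false;
  }
}

const counter = new FixedWindowCounter(100, 60);
const at120059 = 12 * 3600 + 59; // 12:00:59
const at120100 = 12 * 3600 + 60; // 12:01:00

let passedAtBoundary = 0;
for (let i = 0; i < 100; i++) if (counter.tryTake(at120059)) passedAtBoundary++;
for (let i = 0; i < 100; i++) if (counter.tryTake(at120100)) passedAtBoundary++;

console.log(passedAtBoundary); // 200
```

For comparison, a token bucket with capacity 100 that refills 100 tokens per minute gains fewer than 2 tokens in that one-second gap, so at most 101 of those 200 calls would pass.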

It has a cousin called the leaky bucket, which thinks the other way around: requests enter a bucket and leak out through a hole at a fixed rate, smoothing the output flow. The token bucket is more lenient with bursts; the leaky bucket is stricter and steadier. To rein in a looping agent, I prefer the token bucket: it tolerates legitimate batch work and only bites when the pace exceeds the sustainable limit.
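
To see the difference in temperament, here is a sketch of one common reading of the leaky bucket, as a queue that drains at a fixed rate. The names LeakyBucket, offer, and drain are hypothetical, and time is passed in explicitly (in seconds) to keep the example deterministic.

```javascript
// Leaky bucket as a queue: requests wait inside the bucket and leak out
// one at a time, at a fixed interval. A full bucket overflows (rejects).
class LeakyBucket {
  constructor(capacity, leakPerSec) {
    this.capacity = capacity;
    this.interval = 1 / leakPerSec; // seconds between two leaks
    this.queue = [];
    this.nextLeak = 0; // earliest time the next request may leave
  }

  // Queue a request arriving at time `t`; false means it overflowed.
  offer(request, t) {
    if (this.queue.length >= this.capacity) return false;
    // After an idle period no "credit" is saved up: leaking restarts at t.
    if (this.queue.length === 0) this.nextLeak = Math.max(this.nextLeak, t);
    this.queue.push(request);
    return true;
  }

  // Release every request whose leak time is at or before `t`.
  drain(t) {
    const released = [];
    while (this.queue.length > 0 && this.nextLeak <= t) {
      released.push({ request: this.queue.shift(), at: this.nextLeak });
      this.nextLeak += this.interval;
    }
    return released;
  }
}

// 8 requests arrive at once; the bucket holds 5 and leaks 2 per second.
const leaky = new LeakyBucket(5, 2);
const accepted = [];
for (let i = 0; i < 8; i++) if (leaky.offer(`req-${i}`, 0)) accepted.push(i);

const leakTimes = leaky.drain(2).map((r) => r.at);
console.log(accepted.length, leakTimes); // 5 [ 0, 0.5, 1, 1.5, 2 ]
```

The same burst that a token bucket would serve instantly comes out here evenly spaced, half a second apart.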

Curiosities I enjoy

  • It lives hidden right in front of you. That Retry-After or X-RateLimit-Remaining header you've seen on an HTTP 429 response? It's often a token bucket (or a cousin) saying "your bucket is empty, come back in this many seconds."
  • You can implement it in ~15 lines. You don't even need a thread dripping tokens. The trick is to store only two numbers — how many tokens you had and when you last checked — and compute "how many dripped since then" the moment a request arrives. Lazy refill. Beautifully elegant.
  • It's the basis of the "burst" you love. When a service lets you blow past the limit for a few seconds before throttling you, that's the full bucket being spent. The bucket's capacity is the size of the allowed burst.
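
The "come back in this many seconds" number falls straight out of the bucket's state. Below is a sketch of how a server might derive it, assuming a bucket that tracks a fractional token level. The function names and the response shape are hypothetical, and X-RateLimit-Remaining is a common convention rather than a standardized header.

```javascript
// Seconds until `cost` tokens are available, given the current level.
// Rounded up, since Retry-After carries a whole number of seconds.
function retryAfterSeconds(tokens, refillPerSec, cost = 1) {
  if (tokens >= cost) return 0;
  return Math.ceil((cost - tokens) / refillPerSec);
}

// Hypothetical shape of a "too many requests" response built from that.
function rateLimitResponse(tokens, refillPerSec) {
  return {
    status: 429,
    headers: {
      "Retry-After": String(retryAfterSeconds(tokens, refillPerSec)),
      "X-RateLimit-Remaining": String(Math.floor(tokens)),
    },
  };
}

// Empty bucket refilling at half a token per second: wait 2 seconds.
const emptyBucket = rateLimitResponse(0, 0.5);
console.log(emptyBucket.headers["Retry-After"]); // "2"

// Half a token left, tap at 2 per second: 0.25 s, rounded up to 1.
const halfToken = rateLimitResponse(0.5, 2);
console.log(halfToken.headers["Retry-After"]); // "1"
```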

My honest take

After that scare, my rule became simple: every autonomous agent is born on a leash. Before unleashing any loop that calls a paid tool, I put a token bucket in front of the expensive call and, on top of that, an absolute spending cap per run. It's not distrust of the model; it's engineering hygiene. We put a fuse in the circuit not because we think the appliance is bad, but because failures happen, and the cost of not having the brake is too high.
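
Here is roughly what that leash can look like: a wrapper that checks both brakes before letting the expensive call through. Treat it as a sketch under assumptions, not my production code. The function guardTool, its option names, the prices, and the simulated loop are all made up for illustration, and the clock is injected so the run is reproducible.

```javascript
// Wrap an expensive tool with two brakes: a token bucket (limits the pace)
// and an absolute spending cap per run (limits the total).
function guardTool(tool, { capacity, refillPerSec, pricePerCall, maxSpend, now }) {
  let tokens = capacity;
  let last = now();
  let spent = 0;

  return function guarded(...args) {
    // Lazy refill: compute what dripped in since the previous attempt.
    const t = now();
    tokens = Math.min(capacity, tokens + (t - last) * refillPerSec);
    last = t;

    if (spent + pricePerCall > maxSpend) {
      throw new Error("budget exhausted: stop the run");
    }
    if (tokens < 1) {
      throw new Error("rate limited: back off before retrying");
    }
    tokens -= 1;
    spent += pricePerCall;
    return tool(...args);
  };
}

// Simulated looping agent: 100 attempts per second for 10 seconds.
let fakeNow = 0;
let calls = 0;
const expensiveTool = () => { calls += 1; return "ok"; };

const guarded = guardTool(expensiveTool, {
  capacity: 5,
  refillPerSec: 1,
  pricePerCall: 0.5, // fictional price
  maxSpend: 5,       // cap for the whole run: at most 10 paid calls
  now: () => fakeNow,
});

let rateStops = 0;
let budgetStops = 0;
for (let i = 0; i < 1000; i++) {
  fakeNow = i / 100;
  try {
    guarded("query");
  } catch (err) {
    if (err.message.startsWith("budget")) budgetStops++;
    else rateStops++;
  }
}

console.log({ calls, rateStops, budgetStops });
```

Out of 1000 attempts only 10 paid calls go through. The bucket slows the loop down first, then the cap ends the run for good. In a real agent the two errors deserve different reactions: wait and retry on the first, stop and alert a human on the second.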

And here's the point I most want to land: rate limiting isn't "boring infra stuff." It's cost control, it's predictability, it's what separates a prototype that becomes a surprise bill from a product you can leave running without fear. When I build something for a client, that brake ships with it — because the worst bug isn't the one that crashes, it's the one that runs perfectly while doing the wrong thing thousands of times.

Enough talk — feel the brake yourself 👇

I built a bucket simulator right below. Tweak the tap (tokens per second) and the bucket capacity, and fire requests by clicking. When you want to see the nightmare, flip on "looping agent" mode: it'll hammer requests just like mine did at 3 a.m. Watch the bucket drain, the requests hit the wall (red), and the "cost saved" counter climb — that's exactly the money the brake saves.

[Interactive simulator: token-bucket.js]

100% local JS simulation, no network. The per-request "price" is fictional, just so you feel the damage.

If you flipped on loop mode and watched the bucket try (and fail) to keep up with the machine gun, you felt firsthand why this little pattern is so beloved. Fifteen or so lines of logic that separate a well-behaved agent from a bill you're scared to open. This is more or less how I work: I take a real 3 a.m. scare and turn it into a piece of engineering you can touch. If you have a product, an agent, or an API that needs to run without giving you a fright at the end of the month, want to chat?
