Quelm
Tags: llm, redis, rate-limiting, ai-agents, race-conditions

The race condition hiding in every LLM cost cap

You've probably seen this story by now. A developer wakes up to an LLM bill four figures higher than the night before. Their agent entered a loop around 2 AM. The threshold alerts they'd configured had fired — three of them, escalating through the night — but the emails arrived after the spending was already done. By morning, the damage was irreversible.

This isn't a rare story. It happens often enough that "wrapper to prevent agent runaway costs" has become a regular pattern in side projects on GitHub. But most of these wrappers have a subtle problem that only surfaces under concurrent load — the exact conditions where agents most often go wrong.

Here's the bug, and what actually works.

What provider-level caps actually do

Provider-level spend controls have come a long way. Both OpenAI and Anthropic now offer hard-blocking monthly caps in their dashboards. OpenAI ships per-project budgets via their Projects feature, with email threshold notifications. Anthropic offers an org-wide monthly cap with hard blocking at 100%. Both providers will refuse to serve requests once the cap is hit — with the caveat that OpenAI's own docs acknowledge calls "may continue to be processed" briefly past the cap.

If you're a single team using only OpenAI with a monthly budget that's fine to cap globally, you don't have a problem. The dashboard controls work.

The pain shows up when your shape doesn't fit theirs. Specifically:

  • Per-customer or per-agent caps. Providers cap at org or project level. If you're a SaaS where each end customer runs their own agent, you need 1,000 caps, not one — and they need to be created programmatically, not in a dashboard.
  • Sub-monthly windows. Provider caps are monthly. A runaway loop on day 2 locks you out for the rest of the month. A daily cap fails closed only until the next day.
  • Multi-provider unified view. Anthropic's cap is blind to your OpenAI spend. OpenAI's cap is blind to your Anthropic spend. You'd configure separate caps in each dashboard with no aggregate view.
  • Real-time enforcement with zero lag. A short window where calls "may continue to be processed" past the cap is fine for monthly billing reconciliation. It's not fine for an agent firing 100 requests per second: even a 30-second lag lets 3,000 requests through.
  • Programmatic alerting. Both providers send threshold emails. Neither does Slack, webhook, or any other channel. If your ops practice runs on Slack, you're polling email.

For any of these, you have to build it yourself — or use something built on top of the provider APIs that does it for you. And here's where the race condition shows up.

The naive approach (and why it fails)

If you're building a wrapper that enforces a spend cap, the obvious algorithm looks like this:

def check_and_call(agent_id, prompt):
    # Read the agent's current spend (GET returns None for a missing key,
    # and a string/bytes value otherwise, so convert before comparing)
    spent = int(redis.get(f"spend:{agent_id}") or 0)

    if spent >= AGENT_CAP:
        return {"error": "Cap exceeded"}

    # Make the API call
    response = llm_client.chat.completions.create(...)
    cost = calculate_cost(response)  # must be an integer (e.g. pence) for INCRBY

    # Update spend
    redis.incrby(f"spend:{agent_id}", cost)
    return response

This is fine when one agent makes one call at a time. It breaks the moment you have concurrent requests.

Imagine the cap is £10, the agent has spent £9.95, and each call costs roughly £0.78. Two requests arrive simultaneously. Both read the current spend (£9.95), both see they're under the cap, both proceed with the API call, both succeed, and now you've spent about £11.50. The cap has been exceeded by 15%.

That doesn't sound catastrophic, but two things make it worse than it appears. First, agent runaway scenarios involve not two concurrent requests but hundreds, sometimes thousands. The cap can be exceeded by 10x or 100x in the time it takes to notice. Second, the failure mode is most likely to occur exactly when you most need the cap — when something has gone wrong and the agent is firing requests as fast as it can.
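To see how the overshoot scales with concurrency, here is a minimal simulation of the naive check-then-update flow. It is a sketch under stated assumptions: `FakeStore` is a hypothetical in-memory stand-in for Redis (each single command is atomic, as in Redis), and a barrier models the slow LLM call by forcing every request to finish its read before any request writes.

```python
import threading

CAP_PENCE = 1000  # the £10 cap from the example, in pence


class FakeStore:
    """Hypothetical in-memory stand-in for Redis: single commands are atomic."""

    def __init__(self, spent_pence):
        self._spent = spent_pence
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._spent

    def incrby(self, amount):
        with self._lock:
            self._spent += amount
            return self._spent


def simulate_naive(n_requests, start_pence, cost_pence):
    """Run n concurrent naive check-then-update requests; return final spend."""
    store = FakeStore(start_pence)
    all_checked = threading.Barrier(n_requests)

    def worker():
        allowed = store.get() < CAP_PENCE  # the check
        all_checked.wait()                 # the "LLM call": everyone has read by now
        if allowed:
            store.incrby(cost_pence)       # the update, far too late

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.get()


print(simulate_naive(2, 995, 78))    # 1151 pence: about 15% over the cap
print(simulate_naive(100, 995, 78))  # 8795 pence: almost 9x the cap
```

Every request passes the check because none of them has recorded its cost yet, so the overshoot grows linearly with the number of requests in flight.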

Pessimistic locking is too slow

The textbook fix is pessimistic locking. Before reading the spend, acquire a lock. Only release it after the spend has been updated.

def check_and_call(agent_id, prompt):
    with redis.lock(f"lock:{agent_id}"):
        # GET returns None for a missing key, so default to 0
        spent = int(redis.get(f"spend:{agent_id}") or 0)
        if spent >= AGENT_CAP:
            return {"error": "Cap exceeded"}

        # The lock stays held for the whole LLM call
        response = llm_client.chat.completions.create(...)
        cost = calculate_cost(response)  # integer (e.g. pence) for INCRBY
        redis.incrby(f"spend:{agent_id}", cost)
        return response

This works correctly but kills throughput. Every request to the same agent now blocks on every other request to that agent. Worse, you're holding the lock for the duration of the LLM call — which can be 5–30 seconds for a streaming response. For an agent making bursty parallel calls, you've just serialised them all.

You can release the lock before the API call and re-acquire it afterwards, but then you've reintroduced the original race condition.
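A quick back-of-the-envelope calculation shows what the serialisation costs. This is only a sketch: both helper functions are hypothetical, and they assume every call takes the same time.

```python
import math


def wall_time_with_lock(n_calls, seconds_per_call):
    """Every call waits for the per-agent lock, so latencies add up."""
    return n_calls * seconds_per_call


def wall_time_without_lock(n_calls, seconds_per_call, max_parallel):
    """Calls run in waves of up to max_parallel at a time."""
    return math.ceil(n_calls / max_parallel) * seconds_per_call


# A burst of 20 parallel calls at 10 seconds each:
print(wall_time_with_lock(20, 10))         # 200 seconds
print(wall_time_without_lock(20, 10, 20))  # 10 seconds
```

The last request in the burst waits more than three minutes for a lock that protects a check taking microseconds.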

Optimistic locking with CAS doesn't help under burst load

Redis offers optimistic concurrency control via WATCH/MULTI/EXEC. You watch a key, read it, prepare a transaction, and the transaction only succeeds if the key hasn't changed in the meantime.

from redis.exceptions import WatchError

# `redis` below is a client instance, as in the earlier snippets
def check_and_decrement_optimistic(agent_id, cost):
    while True:
        with redis.pipeline() as pipe:
            # After WATCH the pipeline runs commands immediately until MULTI
            pipe.watch(f"spend:{agent_id}")
            spent = int(pipe.get(f"spend:{agent_id}") or 0)

            if spent + cost > AGENT_CAP:
                pipe.unwatch()
                return False

            pipe.multi()
            pipe.incrby(f"spend:{agent_id}", cost)
            try:
                pipe.execute()
                return True
            except WatchError:
                continue  # Someone else modified it, retry

This is correct, but it has its own failure mode. Under contention (exactly when you have many concurrent requests) most transactions retry. With ten concurrent requests, one commits and nine retry. With a hundred, ninety-nine retry, and all but one of those retry again. In the worst case the total number of attempts grows quadratically with the number of contenders, so throughput collapses precisely when load is highest.
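The retry cost is easy to put a number on. The sketch below assumes a worst-case lockstep model, which is a simplification of real scheduling: in every round all waiting requests attempt EXEC at once, exactly one succeeds, and the rest hit WatchError and go again.

```python
def total_attempts(n_contenders):
    """Count EXEC attempts when contenders retry in lockstep rounds."""
    attempts = 0
    remaining = n_contenders
    while remaining > 0:
        attempts += remaining  # everyone still waiting tries this round
        remaining -= 1         # exactly one transaction commits
    return attempts


print(total_attempts(10))   # 55 attempts for 10 successful reservations
print(total_attempts(100))  # 5050 attempts for 100 successful reservations
```

Under this model the work is n(n+1)/2 round trips, so a 10x increase in concurrency costs roughly 100x the Redis traffic.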

The Redis Lua approach

The clean solution is to push the entire check-and-reserve logic into Redis itself as a Lua script. Redis executes Lua scripts atomically: no other command runs while your script is executing. This gives you an atomic check-and-update without holding a lock across the LLM call and without the retry overhead.

Here's the core script:

-- KEYS[1] = the spend key (e.g. "spend:agent_42")
-- ARGV[1] = the cap (in pence to avoid floats)
-- ARGV[2] = the estimated cost of this call (in pence)
-- ARGV[3] = the period TTL in seconds

local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local cap = tonumber(ARGV[1])
local cost = tonumber(ARGV[2])
local ttl = tonumber(ARGV[3])

if current + cost > cap then
    return {0, current, cap}  -- denied
end

-- Reserve the cost atomically
local new_spend = redis.call('INCRBY', KEYS[1], cost)

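-- Set the TTL whenever the key has none (TTL returns -1 for a key without
-- an expiry), so the counter always resets at period end. Checking
-- current == 0 instead would miss a key recreated without a TTL and would
-- restart the window whenever the spend is reconciled back to zero.
if redis.call('TTL', KEYS[1]) == -1 then
    redis.call('EXPIRE', KEYS[1], ttl)
end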

return {1, new_spend, cap}  -- allowed

You call this from your application like so:

RESERVE_SCRIPT = redis_client.register_script(open("reserve.lua").read())

def reserve_spend(agent_id, estimated_cost_pence, cap_pence, period_seconds):
    result = RESERVE_SCRIPT(
        keys=[f"spend:{agent_id}"],
        args=[cap_pence, estimated_cost_pence, period_seconds],
    )
    allowed, current, cap = result
    return bool(allowed), current, cap

Now your flow becomes:

  1. Estimate the cost of the upcoming call from the prompt token count and the maximum output length.
  2. Reserve that estimated cost atomically. If denied, return immediately.
  3. Make the API call.
  4. After the response, reconcile the reservation with the actual cost (since estimates aren't perfect).

Step 2 is atomic. No matter how many concurrent requests arrive, the script admits only as many as fit under the cap. Reserved spend can never exceed the cap; actual spend can overshoot only by the amount that in-flight calls cost beyond their estimates.
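The guarantee is easy to check with an in-memory model of the reservation step. This is a sketch, not the Redis implementation: `InMemoryReserver` is a hypothetical class whose lock stands in for Redis running the script without interleaving. Unlike the pessimistic lock earlier, it is held only for the check and the increment, never for the LLM call.

```python
import threading


class InMemoryReserver:
    """Hypothetical model of the reserve script: check and increment as one step."""

    def __init__(self, cap_pence):
        self.cap_pence = cap_pence
        self.spent_pence = 0
        self._atomic = threading.Lock()  # stands in for Redis script atomicity

    def reserve(self, cost_pence):
        with self._atomic:
            if self.spent_pence + cost_pence > self.cap_pence:
                return False  # denied
            self.spent_pence += cost_pence
            return True       # allowed


def run_burst(n_requests, cap_pence, cost_pence):
    """Fire n concurrent reservations; return (allowed_count, final_spend)."""
    reserver = InMemoryReserver(cap_pence)
    results = []
    results_lock = threading.Lock()

    def worker():
        ok = reserver.reserve(cost_pence)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), reserver.spent_pence


print(run_burst(100, 1000, 78))  # (12, 936): 12 calls fit, 88 are denied
```

However the threads interleave, 12 reservations of 78 pence are admitted and the thirteenth would take the total to 1,014 pence, so it is refused.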

The reconciliation step matters

Cost estimation from prompt tokens isn't perfectly accurate. Output token count is what you mostly pay for, and you don't know that until the response comes back. So the reservation is necessarily approximate.

To handle this, the reservation step uses a conservative estimate (assume the response will be maxed out), and after the response you settle up:

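-- KEYS[1] = the spend key
-- ARGV[1] = the reserved amount
-- ARGV[2] = the actual cost

-- If the period rolled over while the call was in flight, the reservation
-- expired with the old key. Skip the adjustment rather than recreate the
-- key without a TTL (a design choice: the call is not charged to the new period).
if redis.call('EXISTS', KEYS[1]) == 0 then
    return 0
end

local diff = tonumber(ARGV[2]) - tonumber(ARGV[1])
local new_spend = redis.call('INCRBY', KEYS[1], diff)
return new_spend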

If the actual cost was lower than the reservation, this is a negative INCRBY and the spend goes down. If it was higher, the spend goes up. With a conservative estimate the second case should be rare. When it does happen, the running total can briefly exceed the cap by the shortfall, after which subsequent requests are denied until the period resets.
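Here is one way the estimate, reserve, call, and reconcile steps might fit together in application code. It is a sketch under stated assumptions: `conservative_estimate_pence`, `capped_call`, and `CapExceeded` are hypothetical names, the prices are passed in rather than looked up, `reserve` and `reconcile` are callables wrapping the two Lua scripts, and a call that raises is assumed not to be billed.

```python
class CapExceeded(Exception):
    """Raised when the reservation is denied."""


def conservative_estimate_pence(prompt_tokens, max_output_tokens,
                                input_pence_per_mtok, output_pence_per_mtok):
    """Worst-case cost, assuming the response uses every allowed output token."""
    total = (prompt_tokens * input_pence_per_mtok
             + max_output_tokens * output_pence_per_mtok)
    return -(-total // 1_000_000)  # ceiling division, stays in integers


def capped_call(reserve, reconcile, call_llm, estimate_pence):
    """Reserve, call, then settle the reservation against the actual cost.

    reserve(cost) -> bool, reconcile(reserved, actual) -> None, and
    call_llm() -> (response, actual_cost_pence) are supplied by the caller.
    """
    if not reserve(estimate_pence):
        raise CapExceeded(f"reservation of {estimate_pence}p denied")
    try:
        response, actual_pence = call_llm()
    except Exception:
        # Assumption: a failed call costs nothing, so release the reservation.
        reconcile(estimate_pence, 0)
        raise
    reconcile(estimate_pence, actual_pence)
    return response
```

The part that is easy to forget is the error path. Without it, every failed call leaves its reservation on the books and the agent is slowly locked out by spend that never happened.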

Fail-open vs fail-closed

There's a design question worth thinking about: what should happen if Redis itself is unavailable?

Fail-closed means: if you can't enforce the cap, deny the request. This protects against runaway spend but breaks the application's normal operation if Redis goes down — every LLM call fails until Redis recovers.

Fail-open means: if you can't enforce the cap, allow the request through unchecked. This keeps the application working but defeats the purpose of having a cap during the outage.

The right answer depends on your risk profile, and a good enforcement layer should let you configure it. In practice, a graduated approach works well: fail open for the first one or two Redis failures (treat them as transient), but if the failure persists for more than a few seconds, fail closed. This gives you the best of both worlds — short blips don't take down your application, but a sustained Redis outage doesn't let an agent quietly spend your entire month's budget.
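A graduated policy can be very small. The sketch below is one possible shape, not a prescribed design: `GraduatedFailurePolicy` is a hypothetical class, the five-second grace period is an arbitrary default, and "transient" is judged by time since the first consecutive failure rather than by a failure count.

```python
import time


class GraduatedFailurePolicy:
    """Fail open during a short grace period, then fail closed."""

    def __init__(self, grace_seconds=5.0, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self._clock = clock               # injectable for testing
        self._first_failure_at = None     # start of the current outage, if any

    def record_success(self):
        """Call after any successful Redis round trip to end the outage."""
        self._first_failure_at = None

    def should_fail_open(self):
        """Call when Redis is unreachable; True means let the request through."""
        now = self._clock()
        if self._first_failure_at is None:
            self._first_failure_at = now
        return (now - self._first_failure_at) <= self.grace_seconds
```

Keep the grace period short. At 100 requests per second, five seconds of failing open is 500 unchecked calls.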

Other edges worth thinking about

A few other things that catch teams off guard:

Period boundaries are race-condition prone too. When the daily window resets, a burst of requests that were just being denied can all flood through simultaneously. The Lua script above handles this implicitly: when the key has expired, GET returns nil, the spend starts from zero, INCRBY creates a new key, and EXPIRE gives it a fresh TTL, all inside one atomic script. Note that this window starts at the first write rather than at midnight. If your reset logic involves multiple operations, make sure those are also atomic.
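If you want calendar-aligned days, one option (a different technique from the TTL-only reset described above) is to put the date in the key, so the reset is just a switch to a new key and the TTL only cleans up old ones. A minimal sketch, assuming UTC days and a hypothetical helper name `daily_window`:

```python
from datetime import datetime, timedelta, timezone


def daily_window(agent_id, now):
    """Return (spend_key, ttl_seconds) for the UTC day containing `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    next_midnight = (datetime(now.year, now.month, now.day, tzinfo=timezone.utc)
                     + timedelta(days=1))
    key = f"spend:{agent_id}:{now:%Y-%m-%d}"
    ttl_seconds = int((next_midnight - now).total_seconds())
    return key, max(ttl_seconds, 1)  # never pass EXPIRE a zero TTL


key, ttl = daily_window("agent_42", datetime(2024, 5, 1, 23, 59, 30, tzinfo=timezone.utc))
print(key, ttl)  # spend:agent_42:2024-05-01 30
```

The key and TTL can be passed straight to the reservation script as KEYS[1] and ARGV[3], so the boundary still involves no extra operations.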

Estimates can be manipulated. If you're reporting spend from a client-side wrapper, a buggy or malicious client can under-report and bypass the cap. Server-side enforcement (i.e. a proxy that sits between the agent and the LLM provider) is the only way to be sure.

Multiple caps interact. Real systems have per-agent, per-team, and per-organisation caps simultaneously. The Lua script needs to check all of them and only allow the call if none would be exceeded. Easy enough — just multiple GETs and a combined check inside the script. Still atomic.
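The combined check has one rule that matters: verify every cap before incrementing any counter, so a denial never leaves a partial reservation behind. The sketch below models that logic in memory. `MultiCapLedger` and its key names are hypothetical, and the lock again stands in for Redis script atomicity.

```python
import threading


class MultiCapLedger:
    """Hypothetical model of an all-or-nothing reservation across several caps."""

    def __init__(self, caps_pence):
        self.caps_pence = dict(caps_pence)
        self.spent_pence = {key: 0 for key in caps_pence}
        self._atomic = threading.Lock()  # stands in for Redis script atomicity

    def reserve(self, keys, cost_pence):
        """Return (True, None) if allowed, else (False, first_key_that_denied)."""
        with self._atomic:
            # Pass 1: check every cap without changing anything.
            for key in keys:
                if self.spent_pence[key] + cost_pence > self.caps_pence[key]:
                    return False, key
            # Pass 2: all caps have room, so reserve against each of them.
            for key in keys:
                self.spent_pence[key] += cost_pence
            return True, None


ledger = MultiCapLedger({"agent:42": 1000, "team:7": 1500, "org:1": 100000})
ledger.spent_pence["team:7"] = 1450  # the team is nearly out of budget
print(ledger.reserve(["agent:42", "team:7", "org:1"], 78))  # (False, 'team:7')
print(ledger.spent_pence["agent:42"])                       # 0: nothing was reserved
```

Returning the key that denied the request makes the structured error more useful. If you run Redis Cluster, keep in mind that all keys touched by one script must live in the same hash slot, so the key naming needs to account for that.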

What this looks like in practice

Putting this together, the production flow becomes:

  1. Request arrives at your enforcement layer.
  2. The layer extracts the agent ID, looks up the relevant caps (agent, team, org), and runs a single atomic Lua script that reserves estimated cost against all of them.
  3. If any cap denies, the request is rejected immediately with a structured error.
  4. Otherwise, the request proxies through to the actual LLM provider.
  5. When the response returns, a reconciliation script updates the reserved amount to the actual cost.

This gives you guarantees that provider-level caps don't: per-agent enforcement at sub-monthly granularity, multi-provider unification, sub-millisecond enforcement overhead, and a clear story about what happens during infrastructure failures.


If you'd rather not build this from scratch, TokenCapAI is the productised version — atomic Redis-Lua enforcement, monitoring and proxy modes, multi-provider unified caps (OpenAI, Anthropic, Gemini, 220+ models), and per-customer usage data so you can bill your own customers based on actual AI consumption. Free tier available.