Rate Limiting
Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
Intent & Description
🎯 Intent
Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
📋 Context
A team runs a multi-tenant agent product where many users share the same backend resources — token budgets with model providers, tool API quotas, compute capacity. Any one of those users can, accidentally or maliciously, send much more traffic than the operator priced for: a runaway script, a compromised account, or simply a single power user opening hundreds of concurrent sessions.
💡 Solution
Define limits per identity at multiple horizons (per minute, per hour, per day). Use token-bucket or sliding-window counters. Apply at API gateway and at agent loop level. Surface limit hits to the user clearly.
Real-world Use Case
- A single user or compromised account could otherwise bankrupt the product or starve others.
- Limits per identity can be enforced at API gateway and inside the agent loop.
- Limit hits can be surfaced to users in a clear, actionable way.
Source
Advantages
- Cost predictability.
- Abuse becomes detectable as limit hits.
Disadvantages
- Legitimate burst usage is throttled.
- Tier definitions ossify.