AI API Rate Limit Comparison

Compare rate limits across OpenAI, Anthropic, Google, and Mistral APIs by tier. Calculate queue throughput, analyze burst vs sustained capacity, and decode rate limit headers from API responses. Updated for May 2026 pricing tiers.


Queue Management Calculator

Estimate how long it takes to process a queue of requests given your rate limits.
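The estimate can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name and parameters are hypothetical, every request is assumed to use the same number of tokens, and daily (RPD) caps are ignored.

```python
def estimate_queue_minutes(num_requests: int, tokens_per_request: int,
                           rpm_limit: int, tpm_limit: int) -> float:
    """Estimate the minutes needed to drain a queue under RPM and TPM limits.

    Assumes uniform request size and no daily (RPD) cap.
    """
    if min(num_requests, tokens_per_request, rpm_limit, tpm_limit) <= 0:
        raise ValueError("all inputs must be positive")
    # Requests per minute that the token budget alone would allow.
    tpm_bound_rpm = tpm_limit / tokens_per_request
    # The tighter of the two limits sets the sustained rate.
    effective_rpm = min(rpm_limit, tpm_bound_rpm)
    return num_requests / effective_rpm


if __name__ == "__main__":
    # 10,000 requests of 1,000 tokens each at 500 RPM / 30,000 TPM.
    print(round(estimate_queue_minutes(10_000, 1_000, 500, 30_000), 1))
```

With those inputs the token budget allows only 30 requests per minute, so the queue takes roughly 333 minutes even though the RPM limit is 500.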


Burst vs Sustained Throughput

Analyze whether your workload is RPM-limited or TPM-limited, and find the effective throughput.
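One way to make that classification concrete is sketched below. The names are illustrative assumptions rather than the tool's implementation, and the average request size is supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class ThroughputAnalysis:
    bottleneck: str            # "RPM" or "TPM"
    effective_rpm: float       # sustained requests per minute
    tokens_per_minute: float   # token usage at the sustained rate


def analyze_throughput(rpm_limit: int, tpm_limit: int,
                       avg_tokens_per_request: int) -> ThroughputAnalysis:
    """Report which limit binds first for a given average request size."""
    if min(rpm_limit, tpm_limit, avg_tokens_per_request) <= 0:
        raise ValueError("all inputs must be positive")
    # Requests per minute that the token budget alone would allow.
    tpm_bound_rpm = tpm_limit / avg_tokens_per_request
    if tpm_bound_rpm < rpm_limit:
        # The token budget runs out before the request budget does.
        return ThroughputAnalysis("TPM", tpm_bound_rpm, float(tpm_limit))
    # The request budget binds; part of the token budget stays unused.
    return ThroughputAnalysis("RPM", float(rpm_limit),
                              float(rpm_limit * avg_tokens_per_request))
```

The break-even request size is TPM divided by RPM. At 500 RPM and 30,000 TPM that is 60 tokens, so any workload averaging more than 60 tokens per request is TPM-limited.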


Rate Limit Header Decoder

Paste your API response headers to decode rate limit status. Works with OpenAI, Anthropic, and Google response headers.

Understanding AI API Rate Limits: The Complete Reference

Rate limits are the guardrails that API providers set to ensure fair resource allocation, prevent abuse, and maintain system stability. Every AI API call you make is counted against multiple rate limit dimensions — requests per minute (RPM), tokens per minute (TPM), requests per day (RPD), and sometimes concurrent connections. Understanding these limits is critical for building production AI applications that handle load reliably without hitting 429 errors and dropping user requests.

How Rate Limits Work Across Providers

OpenAI Rate Limit Tiers

OpenAI uses a 5-tier system that automatically upgrades based on cumulative API spend. Tier 1 starts after your first $5 payment and provides 500 RPM / 30,000 TPM for GPT-4o. Each subsequent tier unlocks higher limits — Tier 5 (the highest) offers 10,000 RPM and 30,000,000 TPM. Importantly, rate limits are per-model and per-organization, so GPT-4o and GPT-4o Mini have separate buckets. OpenAI also applies daily limits (RPD) at lower tiers, which can be more restrictive than per-minute limits for batch workloads. The Batch API offers 50% cost savings but processes requests asynchronously within a 24-hour window rather than real-time.

Anthropic Rate Limit Structure

Anthropic separates rate limits into Build and Scale tiers based on monthly spend and approval. The Build tier provides 50 RPM and 40,000 input TPM for Claude Opus 4 — significantly lower than OpenAI's entry tier for premium models. Scale tier customers get 4,000 RPM and 400,000 TPM after reaching spend thresholds. A unique aspect of Anthropic's system: input tokens and output tokens are rate-limited separately, and concurrent request limits apply in addition to RPM caps. This means even if you have RPM headroom, sending too many simultaneous long-running requests can trigger throttling.

Google Gemini Rate Limits

Google Gemini API rate limits vary by billing plan. The free tier allows 15 RPM and 32,000 TPM for Gemini 2.5 Pro, sufficient for development and testing. The paid tier increases to 2,000 RPM and 4,000,000 TPM — among the most generous default limits of any provider. Google rate limits are applied per-project (not per-key), so all API keys within a Google Cloud project share the same limit pool. Rate limit increases can be requested through the Google Cloud Console quota management interface.

Mistral Rate Limits

Mistral offers competitive rate limits starting at 1 RPM for free tier users and scaling to 500+ RPM for enterprise customers. Mistral Large has default limits of 200 RPM and 500,000 TPM on the standard paid plan. Mistral's La Plateforme provides a simpler tier structure than OpenAI or Anthropic, with limits primarily based on your subscription plan rather than cumulative spend. For open-source Mistral models self-hosted through vLLM or TGI, there are no rate limits — only GPU capacity constraints.

Rate Limit Strategies for Production

Token Bucket Algorithm

The most effective client-side rate limiting approach is the token bucket algorithm. Maintain a bucket of permits (the algorithm's "tokens", not to be confused with LLM tokens) that refills at your allowed rate, e.g., a 500-per-minute limit refills at ~8.33 permits per second. Each request consumes permits from the bucket: one per request when enforcing RPM, or its estimated token count when enforcing TPM. If the bucket does not hold enough permits, the request waits. This smooths traffic into a steady stream rather than bursting and getting throttled. Some client libraries implement this pattern internally, but you should also add it at the application level to coordinate across multiple service instances.
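A minimal single-process sketch of the algorithm follows. The class and method names are illustrative, the injectable `clock` parameter exists only to make the bucket testable, and the code is not thread-safe. Coordinating several service instances would need a shared store, which is outside the scope of this sketch.

```python
import time


class TokenBucket:
    """Single-process token bucket (illustrative; not thread-safe)."""

    def __init__(self, rate_per_minute: float, capacity: float,
                 clock=time.monotonic):
        self.rate = rate_per_minute / 60.0   # permits added per second
        self.capacity = capacity             # maximum burst size
        self.tokens = capacity               # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        # Add permits for the time elapsed since the last check, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Consume `cost` permits if available; return False otherwise."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` permits will be available (0 if ready now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

A caller would typically loop on `try_acquire`, sleeping for `wait_time()` seconds whenever it returns False. Use a cost of 1 for an RPM bucket and the request's estimated token count for a TPM bucket.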

Multi-Provider Failover

Production applications should route requests across multiple providers. If OpenAI rate limits are hit, automatically fail over to Anthropic or Google. This requires normalizing your prompt format across providers and maintaining separate rate limit tracking for each. The available capacity adds up across providers: 500 RPM on OpenAI + 50 RPM on Anthropic + 2,000 RPM on Google gives you 2,550 RPM of total capacity. Cost-aware routing can also direct cheaper requests to the most affordable provider that has available capacity.
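The failover step might look like the sketch below. Everything here is hypothetical scaffolding: `RateLimited` stands in for whatever exception your client raises on a 429, and each provider is represented by a plain callable that already handles prompt normalization.

```python
from typing import Callable, List, Sequence, Tuple


class RateLimited(Exception):
    """Hypothetical exception a provider callable raises on HTTP 429."""


def call_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, response)."""
    errors: List[str] = []
    for name, send in providers:
        try:
            return name, send(prompt)
        except RateLimited as exc:
            # This provider is throttled; record it and try the next one.
            errors.append(f"{name}: {exc}")
    raise RateLimited("all providers rate-limited: " + "; ".join(errors))
```

Ordering the `providers` list by cost gives a simple form of cost-aware routing: the cheapest provider with capacity answers first.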

Batch Processing Optimization

For non-real-time workloads, leverage batch APIs when available. OpenAI's Batch API processes requests at 50% cost with 24-hour turnaround, with separate (higher) rate limits. Structure batch jobs to maximize throughput: group requests by model, pre-validate all inputs before submission, and implement checkpointing so failed batches can resume from the last successful request. A 100,000-request job at Tier 1's 500 RPM would take at least ~200 minutes in real time (100,000 / 500, ignoring TPM limits); submitted through the Batch API, it completes within 24 hours at half the cost.
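Checkpointing can be as simple as recording the index of the next unprocessed item after each success. The sketch below assumes a local JSON file and a caller-supplied `handle` function. Both are illustrative choices and not part of any provider's batch API.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def process_with_checkpoint(items: Sequence[str],
                            handle: Callable[[str], None],
                            checkpoint: Path) -> int:
    """Process items in order, resuming after the last recorded success.

    Returns the number of items processed during this run.
    """
    start = 0
    if checkpoint.exists():
        start = json.loads(checkpoint.read_text())["next_index"]
    for i in range(start, len(items)):
        # If handle() raises, the checkpoint still points at item i,
        # so the next run retries it instead of skipping it.
        handle(items[i])
        checkpoint.write_text(json.dumps({"next_index": i + 1}))
    return len(items) - start
```

Because the checkpoint is written only after `handle` returns, a crash mid-item causes that item to be retried, so `handle` should be safe to run twice on the same input.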

Rate Limit Headers Reference

Header (OpenAI)                  Description
x-ratelimit-limit-requests       Maximum RPM for this model
x-ratelimit-remaining-requests   Requests remaining in current window
x-ratelimit-limit-tokens         Maximum TPM for this model
x-ratelimit-remaining-tokens     Tokens remaining in current window
x-ratelimit-reset-requests       Time until request limit resets
x-ratelimit-reset-tokens         Time until token limit resets

Header (Anthropic)                       Description
anthropic-ratelimit-requests-limit       Maximum RPM
anthropic-ratelimit-requests-remaining   Remaining requests in window
anthropic-ratelimit-tokens-limit         Maximum input TPM
anthropic-ratelimit-tokens-remaining     Remaining tokens in window
retry-after                              Seconds to wait before retrying (on 429)
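
A decoder for these headers could be sketched as follows. It uses only the limit and remaining header names listed above and assumes their values are plain integers. The function name and the shape of the returned dictionary are illustrative, and the reset headers are left out because their value format varies.

```python
from typing import Mapping, Optional

# (limit header, remaining header) per dimension, as listed in the table above.
_HEADER_NAMES = {
    "openai": {
        "requests": ("x-ratelimit-limit-requests",
                     "x-ratelimit-remaining-requests"),
        "tokens": ("x-ratelimit-limit-tokens",
                   "x-ratelimit-remaining-tokens"),
    },
    "anthropic": {
        "requests": ("anthropic-ratelimit-requests-limit",
                     "anthropic-ratelimit-requests-remaining"),
        "tokens": ("anthropic-ratelimit-tokens-limit",
                   "anthropic-ratelimit-tokens-remaining"),
    },
}


def decode_rate_limit_headers(headers: Mapping[str, str]) -> Optional[dict]:
    """Return normalized rate limit status, or None if no known headers exist."""
    lower = {k.lower(): v for k, v in headers.items()}  # names are case-insensitive
    for provider, dims in _HEADER_NAMES.items():
        result: dict = {"provider": provider}
        for dim, (limit_key, remaining_key) in dims.items():
            if limit_key in lower and remaining_key in lower:
                limit = int(lower[limit_key])
                remaining = int(lower[remaining_key])
                result[dim] = {
                    "limit": limit,
                    "remaining": remaining,
                    "used_fraction": 1 - remaining / limit if limit else 0.0,
                }
        if len(result) > 1:  # at least one dimension was decoded
            return result
    return None
```

A client could pause or slow down whenever `used_fraction` for either dimension crosses a threshold such as 0.9, rather than waiting for a 429.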

Exponential Backoff Implementation

When you receive a 429 status code, do not retry immediately. Implement exponential backoff: wait 1 second, then 2, then 4, then 8, capping at 60 seconds. Add random jitter (0 to 1 second) to each wait so that multiple clients do not retry in lockstep; synchronized retries of this kind are known as the "thundering herd" problem.

In Python, a common pattern is to set max_retries=5 and base_delay=1.0 and, for each retry, calculate delay = min(base_delay * (2 ** attempt) + random.random(), 60). The OpenAI and Anthropic Python SDKs have this built in, but always verify the configuration matches your requirements. For high-throughput systems, proactive rate limiting using the response headers is far more efficient than reactive retry-after patterns.
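
The pattern described above can be sketched as follows. `RateLimitError` is a hypothetical stand-in for whatever your HTTP client raises on a 429, and the `sleep` parameter is injectable only so the function can be tested without real delays.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 60.0) -> float:
    """Exponential delay plus 0-1 s of jitter, capped at max_delay seconds."""
    return min(base_delay * (2 ** attempt) + random.random(), max_delay)


def call_with_backoff(fn, max_retries: int = 5, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface the error to the caller
            sleep(backoff_delay(attempt))
```

If the response carries a retry-after header, waiting for at least that many seconds is usually a better choice than the computed delay.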

Frequently Asked Questions

What are the rate limits for OpenAI's GPT-4o API?

OpenAI's GPT-4o rate limits depend on your usage tier. Tier 1 (after first $5 payment) allows 500 RPM (requests per minute) and 30,000 TPM (tokens per minute). Tier 2 ($50+ spent) increases to 5,000 RPM and 450,000 TPM. Tier 3 ($100+ spent) allows 5,000 RPM and 800,000 TPM. Tier 4 and 5 offer up to 10,000 RPM and 10,000,000+ TPM. Rate limits are applied per-organization and per-model, meaning GPT-4o and GPT-4o Mini have separate limits.

How do Anthropic's Claude API rate limits compare to OpenAI?

Anthropic's Claude API uses a tier system based on monthly spend. The Build tier (default) allows 50 RPM and 40,000 input TPM for Claude Opus 4, and 50 RPM with 40,000 TPM for Sonnet 4. The Scale tier (after spending thresholds) increases to 4,000 RPM and 400,000 TPM. Anthropic generally has lower default limits than OpenAI but scales comparably at higher tiers. A key difference: Anthropic counts input and output tokens separately for rate limiting, while OpenAI combines them.

What HTTP headers indicate rate limit status?

Most AI APIs return rate limit headers with every response. OpenAI returns x-ratelimit-limit-requests, x-ratelimit-limit-tokens, x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-ratelimit-reset-requests/tokens. Anthropic returns anthropic-ratelimit-requests-limit, anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-limit, and anthropic-ratelimit-tokens-remaining. When you hit a limit, you receive a 429 Too Many Requests status code. Always parse these headers to implement proactive throttling rather than waiting for 429 errors.

How should I implement retry logic for rate-limited API calls?

Use exponential backoff with jitter for rate limit retries. Start with a 1-second delay, double it on each retry (2s, 4s, 8s), and add random jitter (0-1s) to prevent thundering herd problems. Cap the maximum delay at 60 seconds and limit total retries to 5-7 attempts. The official Python SDKs (openai-python, anthropic-sdk-python) have built-in retry logic. For production systems, implement a token bucket or sliding window rate limiter on your side to proactively stay under limits rather than relying on retry-after-error patterns.

Can I increase my AI API rate limits?

Yes. All major providers offer rate limit increases. OpenAI automatically upgrades your tier as you accumulate spend — after $100+ in API usage, limits increase significantly. You can also request a manual increase through their support portal for enterprise needs. Anthropic offers custom rate limits for Scale tier customers. Google Cloud AI rate limits can be increased by requesting quota adjustments in the Google Cloud Console. For urgent production needs, contacting the provider's sales team is the fastest path to higher limits.

About the Author

Built by Michael Lip, a solo developer with 10+ years of experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ developer tools across the Zovo network. No tracking, no ads, no data collection.