Rate limits are restrictions that our API imposes on the number of times a user or client can access our services within a specified period of time.
Why do we have rate limits?
Rate limits are a common practice for APIs, and they’re put in place for a few different reasons:
- They help protect against abuse or misuse of the API. For example, a malicious actor could flood the API with requests in an attempt to overload it or cause disruptions in service. By setting rate limits, we can prevent this kind of activity.
- Rate limits help ensure that everyone has fair access to the API. If one person or organization makes an excessive number of requests, it could bog down the API for everyone else. By throttling the number of requests a single user can make, we give as many people as possible the chance to use the API without experiencing slowdowns.
- Rate limits can help manage the aggregate load on our infrastructure. If requests to the API increase dramatically, it could tax the servers and cause performance issues. By setting rate limits, we can help maintain a smooth and consistent experience for all users.
Please work through this document in its entirety to better understand how our rate limit system works. We include code examples and possible solutions to handle common issues.
How do these rate limits work?
Rate limits are measured in four ways: RPS (requests per second), TPM (tokens per minute), concurrent requests, and APM (audio minutes per minute). You can hit any of these limits depending on which threshold is reached first. For example, if your concurrent request limit is 5 and your TPM limit is 150k, sending 5 simultaneous requests of only 100 tokens each to the chat completions endpoint fills your concurrent limit, even though you are nowhere near 150k tokens.
Other important things worth noting:
- Rate limits are defined at the organization level. All users and API keys within an organization share the same rate limit pool.
- Rate limits vary by the request type being used. General API requests and inference requests each have different rate limits.
- Limits are also placed on token usage and audio processing time. These limits refill continuously based on your plan’s refill rate.
Identifier scope
Whenever possible, the rate limiter uses the most specific and reliable identifier available to track usage:
- Organization ID – Primary rate limit scope (tied to your API key)
- API Key – Used when organization-level context cannot be reliably determined
- User ID – Used when neither organization nor API key context is available
- IP Address – Only used as a last resort (e.g., unauthenticated or anonymous requests)
As a result, all requests made using the same organization’s API keys will share the same rate-limiting bucket.
Rate limits by tier
Rate limits vary based on your organization’s tier. Each tier defines different capacities and refill rates depending on the type of request.
Available request types
- DEFAULT: General-purpose API requests (e.g., models, projects, settings)
- INFERENCE: Paid model inference requests (e.g., chat completions, embeddings)
- INFERENCE_FREE: Free model inference requests
- AUTH: Authentication-related requests (e.g., login, token exchanges)
Usage tiers
You can view the rate and usage limits for your organization under the limits section of your account settings. As your usage on our API increases, we automatically graduate you to the next usage tier. This usually results in an increase in rate limits across most endpoints.
| Tier | Qualification | Usage limits |
|---|---|---|
| Free | Default tier for new users | $100 / month |
| Tier 1 | $5 paid | $100 / month |
| Tier 2 | $50 paid and 7+ days since first successful payment | $500 / month |
| Tier 3 | $100 paid and 7+ days since first successful payment | $1,000 / month |
Request limits
Request limits control how many API requests you can make per second. Different request types have different limits based on their sensitivity and resource intensity.
| Tier | Type | Capacity | Refill Rate (tokens/sec) |
|---|---|---|---|
| BASE | DEFAULT | 50 | 5 |
| BASE | INFERENCE | 5 | 1 |
| BASE | INFERENCE_FREE | 1 | 1 |
| BASE | AUTH | 5 | 1 |
| TIER_1 | DEFAULT | 150 | 15 |
| TIER_1 | INFERENCE | 5 | 1 |
| TIER_1 | INFERENCE_FREE | 1 | 1 |
| TIER_1 | AUTH | 5 | 1 |
| TIER_2 | DEFAULT | 450 | 45 |
| TIER_2 | INFERENCE | 5 | 1 |
| TIER_2 | INFERENCE_FREE | 1 | 1 |
| TIER_2 | AUTH | 5 | 1 |
| TIER_3 | DEFAULT | 1000 | 100 |
| TIER_3 | INFERENCE | 5 | 1 |
| TIER_3 | INFERENCE_FREE | 1 | 1 |
| TIER_3 | AUTH | 5 | 1 |
We use a token bucket algorithm where each request consumes 1 token from your bucket, which refills automatically at the specified rate per second. This allows burst periods while preventing sustained overuse.
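To make this concrete, here is a minimal client-side sketch of the same token bucket idea. The class and usage below are illustrative assumptions, not the server-side implementation; the capacity and refill values are taken from the BASE tier DEFAULT row above.

```ts
// A minimal token bucket sketch (illustrative, not the server implementation).
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number, // maximum burst size
    private refillPerSecond: number // tokens restored per second
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  // Returns true and consumes one token if a request may proceed.
  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    // Refill continuously, capped at the bucket's capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// BASE tier, DEFAULT requests: capacity 50, refill 5 tokens/sec.
const bucket = new TokenBucket(50, 5);
if (!bucket.tryConsume()) {
  console.log("Out of tokens locally; wait before sending the request.");
}
```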
If you’re unsure about your current tier or want to upgrade, please contact [email protected].
Token limits (TPM)
For inference endpoints, a tokens per minute (TPM) limit applies. This limits the total number of tokens you can process within a given time period.
| Tier | Token Limit (tokens/min) | Refill Rate (tokens/min) |
|---|---|---|
| BASE | 100,000 | 100,000 |
| TIER_1 | 250,000 | 250,000 |
| TIER_2 | 500,000 | 500,000 |
| TIER_3 | 1,000,000 | 1,000,000 |
Token limits work similarly to request rate limits:
- Each tier has a maximum capacity and a refill rate
- When you process a request, the total tokens used are consumed from your bucket
- Your bucket refills continuously at the specified rate per minute
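For example, on the BASE tier a request that uses 10,000 prompt and completion tokens leaves 90,000 tokens in the bucket; at the 100,000 tokens/min refill rate, those 10,000 tokens are restored in about six seconds.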
Audio processing limits
For audio transcription and translation endpoints, an additional audio duration limit applies. This limits the total minutes of audio you can process within a given time period.
| Tier | Audio Limit (minutes) | Refill Rate (min/min) |
|---|---|---|
| BASE | 60 | 60 |
| TIER_1 | 120 | 120 |
| TIER_2 | 240 | 240 |
| TIER_3 | 480 | 480 |
Audio limits work similarly to request rate limits:
- Each tier has a maximum capacity and a refill rate.
- When you process audio, the duration of that audio is consumed from your bucket (measured in minutes, matching the table above)
- Your bucket refills continuously at the specified rate per minute
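For example, transcribing a 10-minute recording on the BASE tier consumes 10 of your 60 available minutes; at the 60 min/min refill rate, those minutes are restored in about ten seconds.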
Concurrent request limits
Concurrent request limits control how many inference requests can be processed simultaneously for your organization. This is separate from the request rate limit (RPS).
| Tier | Concurrent Requests |
|---|---|
| BASE | 1 |
| TIER_1 | 1 |
| TIER_2 | 1 |
| TIER_3 | 1 |
When you hit your concurrent request limit, additional requests will receive a 429 error until an ongoing request completes.
In addition to seeing your rate limit in your account settings, you can also view important information about your rate limits in the headers of the HTTP response.
You can expect to see the following header fields:
| Field | Sample Value | Description |
|---|---|---|
| Retry-After | 1 | The time in seconds until you can retry the request. |
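For instance, here is a minimal sketch of reading this header from a raw HTTP response. The URL and request body below are placeholders, not documented values:

```ts
// Minimal sketch: inspecting the Retry-After header on a 429 response.
// The endpoint URL below is a placeholder for whichever endpoint you call.
async function checkRetryAfter(): Promise<void> {
  const res = await fetch("https://api.example.com/rvenc/chat/completions", {
    method: "POST",
    headers: {
      Authorization: "Bearer your-api-key",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-oss-120b",
      messages: [{ role: "user", content: "Hello" }],
    }),
  });
  if (res.status === 429) {
    const retryAfter = res.headers.get("Retry-After"); // seconds, as a string
    console.log(`Rate limited; retry after ${retryAfter ?? "?"} second(s).`);
  }
}
```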
Error mitigation
What are some steps I can take to mitigate this?
You should exercise caution when providing programmatic access, bulk processing features, and automated posting; consider enabling these only for trusted customers.
To protect against automated and high-volume misuse, set a usage limit for individual users within a specified time frame (daily, weekly, or monthly). Consider implementing a hard cap or a manual review process for users who exceed the limit.
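As one possible approach, here is a hypothetical sketch of a per-user daily cap. It is in-memory for brevity; a production system would persist counters in a database or shared cache, and the limit value is an assumption:

```ts
// Hypothetical per-user daily usage cap (in-memory for illustration only).
const DAILY_LIMIT = 1000; // requests per user per day (an assumed value)
const usage = new Map<string, { day: string; count: number }>();

function allowRequest(userId: string): boolean {
  const today = new Date().toISOString().slice(0, 10); // e.g., "2025-01-01"
  const entry = usage.get(userId);
  if (!entry || entry.day !== today) {
    usage.set(userId, { day: today, count: 1 }); // first request of the day
    return true;
  }
  if (entry.count >= DAILY_LIMIT) {
    return false; // hard cap reached; flag for manual review if desired
  }
  entry.count++;
  return true;
}
```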
Retrying with exponential backoff
One easy way to avoid rate limit errors is to automatically retry requests with a random exponential backoff. Retrying with exponential backoff means performing a short sleep when a rate limit error is hit, then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process is repeated. This continues until the request is successful or until a maximum number of retries is reached. This approach has many benefits:
- Automatic retries means you can recover from rate limit errors without crashes or missing data
- Exponential backoff means that your first retries can be tried quickly, while still benefiting from longer delays if your first few retries fail
- Adding random jitter to the delay helps prevent all retries from hitting at the same time
Note that unsuccessful requests contribute to your rate limit, so continuously resending a request won’t work.
Below is an example solution that uses exponential backoff.
Example:
An implementation that respects the Retry-After header from the API response when it is present, and falls back to exponential backoff with random jitter when it is not:
```ts
import createRvencClient from "@premai/pcci-sdk-ts";

async function retryWithExponentialBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 6,
  baseDelay: number = 1000
): Promise<T> {
  let retries = 0;
  while (true) {
    try {
      return await fn();
    } catch (error: any) {
      const isRateLimitError = error?.status === 429;
      if (!isRateLimitError || retries >= maxRetries) {
        throw error;
      }
      // Prefer the server's Retry-After hint (in seconds); fall back to
      // exponential backoff with random jitter if the header is missing.
      const retryAfter = Number(error?.headers?.["retry-after"]);
      const delay =
        Number.isFinite(retryAfter) && retryAfter > 0
          ? retryAfter * 1000
          : baseDelay * 2 ** retries * (1 + Math.random());
      retries++;
      console.log(
        `Rate limited (${error?.log?.rate_limit?.tier || "unknown"} tier). ` +
          `Retrying in ${Math.round(delay)}ms... (attempt ${retries}/${maxRetries})`
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
// Usage
async function main() {
  const client = await createRvencClient({ apiKey: "your-api-key" });
  try {
    const response = await retryWithExponentialBackoff(
      () =>
        client.chat.completions.create({
          model: "openai/gpt-oss-120b",
          messages: [
            { role: "system", content: "You are a helpful assistant." },
            { role: "user", content: "What is the capital of France?" },
          ],
        }),
      6,
      1000
    );
    console.log(response.choices[0].message.content);
  } catch (error) {
    console.error("Request failed after retries:", error);
  }
}

main().catch(console.error);
```
Handle concurrent request limits
If your use case involves multiple simultaneous requests, be mindful of concurrent request limits. Consider implementing a request queue that processes requests sequentially or in controlled batches to avoid hitting the concurrent request limit.
```ts
import createRvencClient from "@premai/pcci-sdk-ts";

class RequestQueue {
  private maxConcurrent: number;
  private running: number = 0;
  private queue: Array<() => void> = [];

  constructor(maxConcurrent: number = 3) {
    this.maxConcurrent = maxConcurrent;
  }

  async execute<T>(requestFn: () => Promise<T>): Promise<T> {
    while (this.running >= this.maxConcurrent) {
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.running++;
    try {
      return await requestFn();
    } finally {
      this.running--;
      const resolve = this.queue.shift();
      if (resolve) resolve();
    }
  }
}
// Usage
async function main() {
  const client = await createRvencClient({ apiKey: "your-api-key" });
  // Set this to your organization's concurrent request limit (1 on standard tiers).
  const queue = new RequestQueue(1);
  const prompts = [
    "Explain photosynthesis",
    "What is machine learning?",
    "Describe the water cycle",
    "Explain gravity",
    "What is DNA?",
  ];
  try {
    const results = await Promise.all(
      prompts.map((prompt) =>
        queue.execute(() =>
          client.chat.completions.create({
            model: "openai/gpt-oss-120b",
            messages: [
              { role: "system", content: "You are a helpful assistant." },
              { role: "user", content: prompt },
            ],
          })
        )
      )
    );
    results.forEach((response, index) => {
      console.log(`Response ${index + 1}:`, response.choices[0].message.content);
    });
  } catch (error) {
    console.error("One or more requests failed:", error);
  }
}

main().catch(console.error);
```
Tips for developers
- Group related operations to reduce the number of requests.
- Cache frequently accessed data instead of refetching it constantly (see the caching sketch after this list).
- Monitor response headers for usage patterns and implement alerts when nearing limits.
- Use the Retry-After header from the error response to delay your retry appropriately.
- Implement proper error handling: Always check for 429 status codes and handle them gracefully.
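As a simple illustration of the caching tip above, here is a hypothetical in-memory cache with a time-to-live; the key, TTL, and fetch function are assumptions for the sketch:

```ts
// Hypothetical TTL cache: serve repeated reads locally instead of refetching.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  async getOrFetch(key: string, fetchFn: () => Promise<V>): Promise<V> {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // cache hit: no API request consumed
    }
    const value = await fetchFn(); // cache miss: one API request
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}

// Usage: reuse a fetched result for five minutes instead of refetching it.
const cache = new TtlCache<unknown>(5 * 60 * 1000);
// const settings = await cache.getOrFetch("settings", () => fetchSettings());
```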
Example error response
When your bucket runs out of tokens, your request will return an error with a 429 Too Many Requests status. The response will include a Retry-After header (in seconds), which tells you how long to wait before retrying.
```json
{
  "status": 429,
  "error": "Rate limit exceeded, try again in 1 seconds",
  "log": {
    "support": "Reach out to [email protected] to request a higher tier, or upgrade your plan. Read more at docs.cci.prem.io/docs/developer-resources/get-started/rate-limits.",
    "rate_limit": {
      "resource": "/rvenc/chat/completions",
      "tier": "BASE",
      "type": "INFERENCE"
    }
  }
}
```
Need higher limits?
If you need higher rate limits for your use case, you can:
- Contact our support team at [email protected] for enterprise plans with custom rate limits