Confidential AI API — Secure LLM Inference at Scale

No encryption expertise required. Prem API is designed so that the encryption layer is invisible in your application code. If you’ve built anything with the OpenAI API, you already know how to use Prem API. The SDK handles all cryptography automatically: you write normal API calls and get normal responses.

Two Ways to Integrate

Option 1: Prem API TypeScript SDK (Recommended)

Install the SDK and use it like any OpenAI client:

import { createRvencClient } from "@premai/api-sdk";

const client = await createRvencClient({
  apiKey: process.env.PREM_API_KEY, // Same variable name the proxy example uses
  clientKEK: process.env.CLIENT_KEK, // Your master key: you generate it, we never see it
});

// This looks exactly like an OpenAI call — because it is
const chat = await client.chat.completions.create({
  model: "your-model",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize this quarterly report." },
  ],
  stream: true,
});

for await (const chunk of chat) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Behind the scenes, the SDK encrypts your messages before sending, performs a secure key exchange with the enclave, and decrypts each streaming chunk as it arrives. You see none of this — your code looks like any other OpenAI integration.

Option 2: Local Proxy Server (Any Language)

If you use Python, Go, Java, or any other language with an OpenAI-compatible client library, the SDK includes a local proxy server that handles encryption transparently:

# Start the local proxy (one command); quote the variables so empty or spaced values fail visibly
npx @premai/api-sdk pcci-proxy --api-key "$PREM_API_KEY" --kek "$CLIENT_KEK"

Then point your existing code at localhost:

from openai import OpenAI

# Your existing OpenAI code — just change the base URL
client = OpenAI(
    base_url="http://localhost:3100/v1",
    api_key="unused",  # Auth is handled by the local proxy
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello, privately."}],
)

Zero code changes to your application logic. The local proxy encrypts outbound requests and decrypts responses automatically.
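Streaming should work through the proxy in the same way. The sketch below assumes the proxy forwards `stream=True` requests unchanged, which this page does not state explicitly. `collect_text` and `stream_reply` are hypothetical helper names, not part of the SDK.

```python
from typing import Iterable


def collect_text(chunks: Iterable) -> str:
    """Join the text deltas of OpenAI-style streaming chunks into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:  # skip None and empty deltas
            parts.append(content)
    return "".join(parts)


def stream_reply(prompt: str, model: str = "your-model",
                 base_url: str = "http://localhost:3100/v1") -> str:
    """Send one prompt through the local proxy and return the full reply."""
    from openai import OpenAI  # imported lazily so collect_text works without it

    client = OpenAI(base_url=base_url, api_key="unused")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # assumption: the proxy passes streamed responses through
    )
    return collect_text(stream)
```

Calling `stream_reply("Hello, privately.")` requires the proxy to be running on the port shown above; `collect_text` on its own works with any iterable of chunk-shaped objects.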

What You Can Do

Chat with AI Models

Full OpenAI-compatible chat API:

Feature	Details
Streaming	Real-time word-by-word output, each chunk encrypted individually
JSON mode	Structured output for reliable parsing
System messages	Control model behavior and personality
Multi-turn conversations	Full conversation history and context management
Audio transcription	Convert speech to text (Whisper, Deepgram)
Audio translation	Translate audio to English
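As a rough illustration of multi-turn conversations and JSON mode together, the sketch below keeps a running message list and builds the keyword arguments for `client.chat.completions.create`. It assumes the OpenAI-style `response_format={"type": "json_object"}` parameter is accepted as-is; `Conversation` and `parse_json_reply` are hypothetical names introduced here.

```python
import json


class Conversation:
    """Keeps the running message list for a multi-turn chat."""

    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, text: str) -> None:
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text: str) -> None:
        self.messages.append({"role": "assistant", "content": text})

    def request(self, model: str, json_mode: bool = False) -> dict:
        """Build keyword arguments for client.chat.completions.create(**kwargs)."""
        kwargs = {"model": model, "messages": list(self.messages)}
        if json_mode:
            # OpenAI-style JSON mode; assumed to carry over unchanged.
            kwargs["response_format"] = {"type": "json_object"}
        return kwargs


def parse_json_reply(text: str) -> dict:
    """Parse a JSON-mode reply, failing loudly if it is not a JSON object."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data
```

After each response, pass the assistant's text to `add_assistant` so the next request carries the full history. With the OpenAI API, JSON mode also expects the word "JSON" to appear somewhere in the messages; whether the same holds here is not stated on this page.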

Error Handling

Standard HTTP status codes with structured error responses:

Code	What It Means	What to Do
400	Invalid request format	Check your input against the API spec
401	Invalid API key	Verify your API key is correct and active
403	Insufficient permissions	Check your API key’s scopes
429	Rate limited	Implement exponential backoff (examples in Rate Limits)
503	Temporarily unavailable	Wait and retry

Every error includes a support_id you can share with our team for debugging.
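One way to act on the table above is to map each status code to its advice and to whether a retry makes sense. This is a minimal sketch: the exact shape of the error body is not documented on this page, so the code assumes `support_id` sits either at the top level or inside an `error` object. All function names are hypothetical.

```python
from typing import Optional

# Statuses the table above suggests retrying.
RETRYABLE_STATUSES = {429, 503}

ADVICE = {
    400: "Check your input against the API spec",
    401: "Verify your API key is correct and active",
    403: "Check your API key's scopes",
    429: "Implement exponential backoff",
    503: "Wait and retry",
}


def find_support_id(body: object) -> Optional[str]:
    """Look for support_id in the two places assumed above."""
    if not isinstance(body, dict):
        return None
    if "support_id" in body:
        return body["support_id"]
    error = body.get("error")
    if isinstance(error, dict):
        return error.get("support_id")
    return None


def summarize_error(status: int, body: object) -> dict:
    """Turn a status code and parsed JSON body into something loggable."""
    return {
        "status": status,
        "retryable": status in RETRYABLE_STATUSES,
        "advice": ADVICE.get(status, "No guidance listed for this status"),
        "support_id": find_support_id(body),
    }
```

Logging the returned `support_id` alongside the status makes it easy to hand the identifier to support later.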

Rate Limits

Rate limits are applied per organization, across the following dimensions:

Dimension	What It Limits	Why
RPS (Requests per second)	How fast you can send requests	Prevents bursts from overwhelming the system
TPM (Tokens per minute)	Total token throughput	Manages inference capacity
Concurrent	Simultaneous active requests	Ensures fair resource sharing

Limits increase across tiers (Free, Tier 1, Tier 2, Tier 3) as your usage grows. See Rate Limits for specific values and retry strategies with code examples.
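For 429 and 503 responses, a common pattern is exponential backoff with jitter. The sketch below is one possible approach, not the retry strategy from the Rate Limits page. `RateLimited` is a hypothetical stand-in for whatever exception your client raises, and the default delays are illustrative.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a client's 429 / 503 error."""


def with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                 sleep=time.sleep, rng=random.random):
    """Run call(), retrying on RateLimited with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The upper bound doubles each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": wait a random fraction of the bound so that
            # many clients do not retry at the same instant.
            sleep(cap * rng())
```

The `sleep` and `rng` parameters exist only so the function can be exercised without real waiting. Backoff addresses RPS and TPM limits; the concurrent limit is better handled by capping in-flight requests on the client side.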

Ready to start building? See the Quickstart for step-by-step setup, or browse Recipes for copy-paste examples.
