Request access

Integration

Integration guide

Every SDK endpoint, cross-SDK routing, model deployments, parameter quirks, virtual-key day-2 operations, fail-safe semantics.

14 min read

Every SDK endpoint Verosek exposes, what the gateway rewrites, and how admin operations on your virtual keys work once you have a tenant.

TL;DR

  • Three SDKs land natively: OpenAI at /v1/*, Anthropic at /v1/messages, Gemini at /v1beta/models/{model}:*.
  • Every endpoint accepts your virtual key. You can run any configured model through any SDK — the gateway translates request and response shapes.
  • Per-model quirks (token-param renames, sampling-param stripping, role conversion) are applied on your behalf.
  • Virtual keys have day-2 operations: rotate, block, unblock, update allowed models / budget / rate limits.
  • Fail-safe defaults to FAIL_CLOSED — an unreachable Verosek blocks tool calls rather than pass them through.

Overview

The gateway uses the OpenAI Chat Completions shape as its internal canonical format. Clients that hit the OpenAI endpoints pass through; clients that hit the Anthropic or Gemini endpoints are translated on the way in and on the way out.

"Cross-SDK translation" means, concretely: you can send an Anthropic SDK request whose model resolves to an OpenAI deployment on our side, and the response comes back in Anthropic shape. The reverse works too.

OpenAI SDK endpoints

All endpoints accept Authorization: Bearer vsk_.... The router prefix is /v1.

MethodPathRequest body shapeWhat the gateway does
POST/v1/chat/completionsOpenAI Chat CompletionsShield pre-scan → translate to provider → forward → internal tool loop if tool_calls returned → Shield post-scan → audit.
POST/v1/completionsOpenAI legacy completionRoutes to a chat-compatible deployment, adapts to a completion response.
POST/v1/responsesOpenAI Responses APIScans input, runs against a chat deployment, returns Responses-shaped JSON.
POST/v1/embeddingsOpenAI embeddingsForwards to the embedding deployment. Shield pre-scans the input string.
POST/v1/images/generationsOpenAI image generationShield pre-scans the prompt, then forwards.
POST/v1/images/editsMultipart image editProxies multipart upload to the selected image deployment.
POST/v1/images/variationsMultipart image variationProxies multipart upload.
POST/v1/audio/speechTTSShield pre-scans the input text.
POST/v1/audio/transcriptionsMultipart audio-inProxies to the transcription deployment, Shield post-scans the transcript.
POST/v1/audio/translationsMultipart audio-inSame shape as transcriptions, different upstream operation.
POST/v1/moderationsOpenAI moderationsShield pre-scans the input; upstream moderation is forwarded.
GET/v1/modelsReturns the list of deployments the virtual key is allowed to call.

All Pydantic request shapes live in gateway/models/provider.py (virtual key models) and inside the handlers themselves for endpoint-specific bodies. The OpenAI shapes are mirrored 1:1 from the upstream API.

Streaming is not supported today. Requests with stream: true receive a clear error response.

Anthropic SDK endpoint

MethodPathAuthWhat the gateway does
POST/v1/messagesx-api-key: vsk_...Accepts native Anthropic request body (system top-level, messages[] with content blocks, tools with input_schema). Translates to OpenAI canonical, runs the internal pipeline, translates the response back to Anthropic shape.

What the gateway translates on input:

  • Top-level system (string or text-block list) → OpenAI system message.
  • image content blocks with source.type = base64 | url → OpenAI image_url content blocks.
  • tool_use / tool_result blocks → OpenAI tool_calls and role=tool messages.

What the gateway translates on output:

  • OpenAI content + tool_calls → Anthropic content blocks (type=text, type=tool_use).
  • OpenAI finish_reason → Anthropic stop_reason (stopend_turn, lengthmax_tokens, tool_callstool_use).

Anthropic prompt caching (cache_control) is not forwarded today. Audio content blocks sent to Claude are dropped by the translator (Claude does not accept them).

Gemini SDK endpoints

MethodPathAuthWhat the gateway does
POST/v1beta/models/{model}:generateContentx-goog-api-key: vsk_... header or ?key=vsk_... queryAccepts native Gemini request body, translates, runs the pipeline, returns Gemini shape.
POST/v1beta/models/{model}:embedContentSameSingle embedding.
POST/v1beta/models/{model}:batchEmbedContentsSameBatch embeddings.

:streamGenerateContent and :countTokens are not supported today. The gateway returns a clear error for both.

Cross-SDK routing

You can call any configured model through any SDK. The gateway resolves model to a deployment and then translates shapes as needed.

Client SDK ↓ / Provider →OpenAI backendAnthropic backendGemini backend
OpenAI SDK (POST /v1/chat/completions)Direct passthroughTranslated to Anthropic /v1/messages, response translated backTranslated to Gemini OpenAI-compatible endpoint, response translated back
Anthropic SDK (POST /v1/messages)Inbound translation to canonical, passthrough to OpenAI, response re-shaped to AnthropicDirect passthrough after inbound translationInbound translation to canonical, then Gemini, response re-shaped to Anthropic
Gemini SDK (POST /v1beta/models/*)Inbound translation to canonical, passthrough to OpenAI, response re-shaped to GeminiInbound → canonical → Anthropic → re-shaped to GeminiDirect passthrough

Model routing is driven by the model_name on each deployment, not by the SDK the request came from. See the next section.

Model deployments

A model deployment is a mapping from a friendly model_name (what your client sends) to a concrete provider_model on a provider account we hold for you.

{
  "id": "mdl_...",
  "provider_id": "prov_...",
  "model_name": "gpt-4o-fast",
  "provider_model": "gpt-4o-2024-08-06",
  "rpm_limit": 10000,
  "tpm_limit": 1000000,
  "input_cost_per_token": 2.5e-6,
  "output_cost_per_token": 1.0e-5,
  "priority": 1,
  "cooldown_seconds": 5,
  "status": "ACTIVE",
  "model_type": "chat",
  "system_prompt": null,
  "default_temperature": null,
  "default_max_tokens": null
}

Admin endpoints for deployments (create/list/get/update/delete):

  • POST /api/v1/models — body is DeploymentCreate.
  • GET /api/v1/models, GET /api/v1/models/{deployment_id}, PATCH /api/v1/models/{deployment_id}, DELETE /api/v1/models/{deployment_id}.
  • POST /api/v1/models/reclassify re-runs the automatic model_type classifier on every deployment.

Runtime playgrounds for each model_type exist at POST /api/v1/models/{deployment_id}/{chat,embed,generate-image,speak,transcribe,moderate,complete}.

Routing strategy

When a client sends model_name, the gateway picks one deployment and forwards the request:

  1. Look up all ACTIVE deployments with that model_name, ordered by priority ascending (lower number = higher priority).
  2. Filter out deployments currently in cooldown. A deployment enters cooldown after _ALLOWED_FAILS = 3 consecutive failures for cooldown_seconds (default 5).
  3. Filter out deployments whose RPM counter in Redis has already reached rpm_limit this minute.
  4. Apply weighted random selection across the surviving set. Lower-priority deployments get more weight, so priority acts as "pick me first unless I'm sick".
  5. Decrypt the chosen provider's API key and forward.

If all deployments for the model are cooled down, the highest-priority one is tried anyway (warned in logs).

Parameter translation quirks

The gateway normalises per-model differences so clients can always send the same OpenAI-shaped body. Every translation is logged to the audit trace as a modifications note.

  • o-series reasoning models (pattern matches o1, o1-mini, o1-preview, o2, o3, o3-mini, …): rename max_tokens → max_completion_tokens; strip temperature, top_p, frequency_penalty, presence_penalty; convert system role → developer role.
  • gpt-5 family (pattern matches gpt-5, gpt-5-mini, gpt-5-nano, gpt-5.1, gpt-5.4-nano, …): same max_completion_tokens rename; strip temperature/top_p/penalties; keep system role.
  • Anthropic (backend): strip frequency_penalty and presence_penalty; extract system message to top-level; translate image_urlimage with source object; translate tool definitions into input_schema shape; backfill max_tokens if missing (default 1024).
  • Non-reasoning OpenAI models that receive a stray max_completion_tokens: rename back to max_tokens.

The capability object per model lives at gateway/core/model_translator.py:34-61 (ModelCapabilities) and is selected via get_model_capabilities(provider_type, provider_model).

Virtual keys — day-2 operations

Admin endpoints all live under /api/v1/keys. Paths use the non-secret reference ID vkr_... — the secret vsk_... value is only ever returned in the POST creation response and is never echoed back by any other endpoint.

MethodPathPurpose
POST/api/v1/keysCreate a new virtual key. Response includes the plaintext vsk_... once.
GET/api/v1/keysList keys for this tenant — secret value is never returned.
PATCH/api/v1/keys/{key_ref}Update name, allowed_models, max_budget, budget_period, rpm_limit, tpm_limit, expires_at, security_profile.
POST/api/v1/keys/{key_ref}/blockBlock the key. Takes effect on the next request once the Redis cache is invalidated.
POST/api/v1/keys/{key_ref}/unblockUnblock.
POST/api/v1/keys/{key_ref}/rotateGenerate a new key with the same policy, block the old one. Response contains the new plaintext vsk_... once.
GET/api/v1/keys/{key_ref}/spendCurrent-period spend + budget status.
GET/api/v1/keys/{key_ref}/analytics?days=30Aggregate spend, token totals, by-model and by-day breakdowns, recent traces.
GET/api/v1/keys/{key_ref}/toolsEvery MCP tool this key has access to (namespaced).

Budget model:

  • max_budget in USD.
  • budget_period = daily | weekly | monthly | null. Reset timestamps are computed at period boundaries.
  • rpm_limit / tpm_limit are counted in Redis per key. When max_budget is set and current_spend >= max_budget the key returns 429.

TTL (expires_at) is enforced on every request — an expired key returns 401.

Allowed models are a list. ["*"] means all; [] means MCP-only (no LLM endpoints); otherwise a concrete list like ["gpt-4o-fast", "claude-3-5-sonnet-fast"].

Fail-safe behaviour

  • FAIL_CLOSED is the default. If the gateway is unreachable from your application, your SDK gets a network error and your application code sees a failure. Tool calls do not pass through unguarded.
  • Per-request decision failures inside the gateway default to BLOCK. A crash in the tool-access enforcer returns a block, not an allow.
  • Shield fail behaviour is per check. fail_behavior: fail_closed returns a 503; fail_open lets the request through with a warning verdict. Default varies per check — see Shield configuration.

To verify

TODO: unverified — The SDK-side circuit breaker (3-fail-open / 2-success-close semantics) is described in the internal design doc but I could not locate an explicit circuit-breaker implementation in the verosek/ Python SDK source when writing this doc. Confirm with Vaibhav whether the breaker ships in the current SDK release and, if so, add a verosek/ file-path reference here.

Observability hooks

  • Structured logs, JSON-formatted, emitted via structlog. Every log entry includes agent_id and trace_id when present.
  • Trace events of interest include virtual_key_generated, virtual_key_rotated, mcp_connection_restored, mcp_connection_restore_failed, tool_access_blocked, tool_args_modified, audit_session_started, audit_drain_cycle, audit_drain_entry_error, audit_drain_batch_commit_failed, spend_db_update_failed, gateway_started, gateway_shutdown, postgres_connected, redis_connected, shield_startup_failed.
  • Trace API — the same data is available as queryable JSON via /api/v1/traces. See Audit API.

Onboarding-only

Handled during onboarding — not public. Metric scrape endpoints, internal service addresses, and centralized-log shipping targets are wired up during the onboarding engagement based on your stack (Datadog / Splunk / ELK / CloudWatch / etc.). We do not publish those specifics here.

What's next

Read the MCP connector catalog to see which tools you can wire up and how the access-rule schema controls what each key can do with them.