Shield configuration

12 checks, profiles, thresholds, fail_behavior, custom PII recognizers, session drift, and the shadow-to-enforce workflow.

Every Shield check, every profile, and every knob you can turn to move from shadow mode to enforce without breaking real traffic.

TL;DR

12 checks ship today, grouped into S1 (input/output), S2 (MCP-aware), and S3 (high-assurance). Their IDs run CHK-013…CHK-024.
Three built-in profiles: none, baseline (default), strict. A custom profile is available for per-check overrides.
Every check has three modes: off, log_only, enforce. baseline runs most checks in log_only so you see verdicts without blocking.
fail_behavior controls what happens when a detector is broken or unreachable — fail_closed blocks the request, fail_open lets it through with a warning verdict.
You can register custom PII recognizers and tune session-drift thresholds via the admin API; there is no on-disk config.

Check catalog

All 12 checks are served by the gateway's Shield layer. Scanner names are behavioural — they describe what the check detects, not how it detects it.

ID	Name	Scan point	Phase	Default mode (baseline)	Default fail_behavior	Detects
CHK-013	Prompt Injection	Pre-LLM	S1	`log_only`	`fail_closed`	Attempts in the user input to override system instructions or hijack model behaviour. Multilingual classifier-backed.
CHK-014	Jailbreak	Pre-LLM	S1	`log_only`	`fail_closed`	Jailbreak patterns — persona overrides, role-play attacks. Produced by the same classifier pass as CHK-013.
CHK-015	PII in Input	Pre-LLM	S1	`enforce`	`fail_open`	Personally identifiable information (email, phone, credit card, national ID, IP address, etc.) in the user input. Redacts inline using the configured `redaction_mode`.
CHK-016	Secrets in Input	Pre-LLM	S1	`enforce`	`fail_open`	Secret tokens in the user input. Pattern-matched against a built-in catalog (see below). Inline redaction.
CHK-017	Toxicity in Output	Post-LLM	S1	`log_only`	`fail_open`	Toxic, threatening, insulting, or hateful content in the model response. Multilingual classifier-backed.
CHK-018	PII in Output	Post-LLM	S1	`log_only`	`fail_open`	PII accidentally leaked in the model response. Same engine and redaction modes as CHK-015.
CHK-019	Secrets in Output	Post-LLM	S1	`log_only`	`fail_open`	Secret tokens the model echoes back. Same pattern catalog as CHK-016.
CHK-020	Tool-Output Injection	Post-tool	S2	`log_only`	`fail_open`	Indirect prompt-injection payloads hidden inside MCP tool results before they re-enter the LLM context.
CHK-021	PII in Tool Output	Post-tool	S2	`log_only`	`fail_open`	PII in MCP tool results. Uses a per-connection redaction mode if set, otherwise the key's configured mode.
CHK-022	Session Drift	Session-level	S2	`log_only`	`fail_open`	Cumulative exfiltration pressure across a session — PII hits, external URLs, data volume — with separate warn and block thresholds.
CHK-023	Grounding / Hallucination	Post-LLM	S3	`log_only`	`fail_open`	Whether the model's response is supported by the supplied reference context (tool outputs, system prompt, or explicit reference). Runs off the hot path; verdict is recorded in the audit trace.
CHK-024	Off-Topic Detection	Pre-LLM	S3	`log_only`	`fail_open`	User inputs that fall outside the key's allowed-topic set, determined by embedding similarity against per-key topic vectors.

Profiles

A profile is a named bundle of per-check configurations, stored as JSONB on the virtual key (VirtualKeyORM.security_profile). The three built-ins come from BUILTIN_PROFILES.

`none`

All checks off. Zero Shield overhead. Use only for trusted internal services where Shield is not required.

`baseline` (default)

Enforce: CHK-015 (PII in input), CHK-016 (secrets in input) — these redact inline and return the modified text to the model.
Log-only: CHK-013, CHK-014, CHK-017, CHK-018, CHK-019, CHK-020, CHK-021, CHK-022, CHK-023, CHK-024.
PII defaults: pii_redaction_mode: fake; pii_entities: EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IBAN_CODE, US_SSN, IP_ADDRESS, CRYPTO, MEDICAL_LICENSE.

`strict`

Everything that was log_only in baseline is upgraded to enforce, except CHK-023. CHK-023 remains log_only in strict because its verdict arrives asynchronously — after the response has already been returned — and therefore cannot retroactively block it.

`custom`

Per-check overrides applied on top of an optional base profile. The per-key JSONB looks like:

{
  "profile": "custom",
  "base_profile": "baseline",
  "pii_redaction_mode": "mask",
  "pii_entities": ["EMAIL_ADDRESS", "PHONE_NUMBER"],
  "overrides": {
    "chk_013": { "mode": "enforce" },
    "chk_022": { "extra": { "pii_block": 100 } }
  }
}

Check IDs inside overrides can be either CHK-013 or chk_013 — the resolver normalises them.
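The merge described above can be sketched as follows. This is an illustration, not the gateway's resolver: EXAMPLE_PROFILES is a hypothetical, trimmed stand-in for BUILTIN_PROFILES (two checks, values copied from the baseline tables on this page), and both the fallback to baseline when base_profile is missing and the key-by-key merge of extra are assumptions. Top-level fields such as pii_redaction_mode are left out.

```python
import copy
import re

# Hypothetical stand-in for the built-in profiles, trimmed to two checks.
# Values are copied from the baseline tables on this page.
EXAMPLE_PROFILES = {
    "baseline": {
        "CHK-013": {"mode": "log_only", "threshold": 0.90, "fail_behavior": "fail_closed"},
        "CHK-022": {
            "mode": "log_only",
            "fail_behavior": "fail_open",
            "extra": {"pii_warn": 20, "pii_block": 50},
        },
    },
}

CHECK_ID_RE = re.compile(r"chk[-_](\d{3})", re.IGNORECASE)


def normalise_check_id(raw: str) -> str:
    """Map 'chk_013', 'CHK_013' or 'CHK-013' to the canonical 'CHK-013'."""
    match = CHECK_ID_RE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised check id: {raw!r}")
    return f"CHK-{match.group(1)}"


def resolve_custom(blob: dict) -> dict:
    """Layer per-check overrides onto a copy of the base profile."""
    # Assumption: a missing base_profile falls back to baseline.
    base_name = blob.get("base_profile", "baseline")
    resolved = copy.deepcopy(EXAMPLE_PROFILES.get(base_name, {}))
    for raw_id, override in blob.get("overrides", {}).items():
        check = resolved.setdefault(normalise_check_id(raw_id), {})
        for field, value in override.items():
            if field == "extra":
                # Assumption: extra keys are merged one by one, not replaced wholesale.
                check.setdefault("extra", {}).update(value)
            else:
                check[field] = value
    return resolved
```

Fed the JSONB example above, this sketch turns CHK-013 into enforce while keeping its 0.90 threshold, and raises CHK-022's pii_block to 100 while keeping pii_warn at 20. The deep copy keeps the shared profile table unmodified.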

Legacy profile names

If a JSONB blob references highassurance from an earlier internal build, the resolver silently remaps it to strict. You should not need to care about this.

Reading the resolved profile

GET /api/v1/security/profiles

Returns every built-in profile with each check's mode, threshold, fail_behavior, redaction_mode, plus the tenant default. Useful as the source of truth for an admin UI or policy-as-code tooling.

Thresholds

Each check can carry a threshold, a fail_behavior, a redaction_mode, and a free-form extra dict.

Defaults in baseline:

Check	threshold	fail_behavior	extra notes
CHK-013 / CHK-014	`0.90`	`fail_closed`	—
CHK-015 / CHK-018 / CHK-021	`0.5`	`fail_open`	`redaction_mode: fake`, `entities: [...]`
CHK-016 / CHK-019	`0.85` (default)	`fail_open`	Pattern match — threshold is effectively hit-or-miss.
CHK-017	`0.90`	`fail_open`	—
CHK-020	`0.90`	`fail_open`	—
CHK-022	`0.85` (default)	`fail_open`	Counter thresholds — see Session drift.
CHK-023	`0.7`	`fail_open`	`async: true`, `daily_token_budget: 100_000`, `max_context_chars: 8000`.
CHK-024	`0.50`	`fail_open`	Embedding similarity — lower = looser.

In the custom profile you can override any of these per key. Example override payload for PATCH /api/v1/keys/{key_ref}:

{
  "security_profile": {
    "profile": "custom",
    "base_profile": "strict",
    "overrides": {
      "CHK-024": { "threshold": 0.40 },
      "CHK-022": { "extra": { "pii_warn": 10, "pii_block": 25 } }
    }
  }
}
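For policy-as-code tooling, the PATCH above could be assembled with the Python standard library roughly as follows. The gateway URL, key_ref, and token are placeholders, and bearer-token authentication is an assumption; this page does not say how admin calls are authenticated. The request is built but not sent, so the sketch runs offline.

```python
import json
import urllib.request


def build_override_request(base_url: str, key_ref: str, admin_token: str,
                           overrides: dict, base_profile: str = "strict") -> urllib.request.Request:
    """Build (but do not send) a PATCH that installs a custom profile on one key."""
    body = {
        "security_profile": {
            "profile": "custom",
            "base_profile": base_profile,
            "overrides": overrides,
        }
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/keys/{key_ref}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth for admin calls; adjust to your deployment.
            "Authorization": f"Bearer {admin_token}",
        },
    )


req = build_override_request(
    "https://gateway.example.com",  # hypothetical gateway URL
    "vk_example",                   # hypothetical key_ref
    "ADMIN_TOKEN",                  # placeholder credential
    {"CHK-024": {"threshold": 0.40},
     "CHK-022": {"extra": {"pii_warn": 10, "pii_block": 25}}},
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```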

`fail_behavior` semantics

fail_open — if the underlying detector is broken (timeout, exception, model unavailable), the verdict is an ALLOW with an error field describing what happened. The request proceeds. The verdict is still recorded in the audit trace.
fail_closed — if the detector is broken, the request is blocked with a 503-equivalent verdict. Use for checks whose failure must not be treated as a pass (prompt injection, jailbreak).

log_only mode is orthogonal to fail_behavior: log-only verdicts never affect the overall decision. They only appear in the trace.
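A minimal sketch of the two failure policies for a single detector, assuming a detector is any callable that either returns a verdict or raises. The Verdict class and run_check function are hypothetical names for illustration. Mode handling (log_only vs enforce) is deliberately left out because, as noted above, it is applied separately.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    decision: str                # "ALLOW", "MODIFY" or "BLOCK"
    error: Optional[str] = None  # populated only when the detector itself failed


def run_check(detect: Callable[[str], Verdict], text: str, fail_behavior: str) -> Verdict:
    """Run one detector and apply fail_open / fail_closed if it breaks."""
    try:
        return detect(text)
    except Exception as exc:  # timeout, crash, model unavailable, ...
        reason = f"detector failed: {type(exc).__name__}: {exc}"
        if fail_behavior == "fail_closed":
            # The page describes this as a 503-equivalent block.
            return Verdict("BLOCK", error=reason)
        # fail_open: let the request through, but keep the error for the audit trace.
        return Verdict("ALLOW", error=reason)


def broken_detector(text: str) -> Verdict:
    # Simulates an unreachable classifier.
    raise TimeoutError("classifier did not answer in time")
```

With broken_detector, run_check returns an ALLOW carrying an error under fail_open and a BLOCK under fail_closed. A healthy detector's verdict passes through unchanged.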

Built-in secret categories

CHK-016 (input) and CHK-019 (output) match against a catalog of 17 patterns. The patterns themselves are not published; only the category names are:

AWS_ACCESS_KEY, AWS_SECRET_KEY
OPENAI_API_KEY
ANTHROPIC_API_KEY
GITHUB_PAT, GITHUB_APP_TOKEN, GITHUB_FINE_GRAINED
GITLAB_PAT
SLACK_BOT_TOKEN, SLACK_USER_TOKEN
STRIPE_SECRET_LIVE, STRIPE_SECRET_TEST, STRIPE_RESTRICTED
PRIVATE_KEY_PEM
JWT_TOKEN
GOOGLE_API_KEY
VEROSEK_KEY (the gateway catches its own virtual keys leaking through a prompt).

Custom PII recognizers

You can add domain-specific entity types (Patient MRN, Case Number, Internal Customer ID, Employee SSO Token, etc.) via the admin API. Custom recognizers are loaded into the PII scanner at gateway startup.

POST /api/v1/security/custom-pii

Request body (validated by CustomPIIRecognizerCreate):

{
  "entity_name": "PATIENT_MRN",
  "regex_pattern": "MRN[-_ ]?[0-9]{6,10}",
  "description": "Hospital patient medical record numbers",
  "default_redaction": "MRN-XXXXXX",
  "confidence": 0.85,
  "enabled": true
}

Constraints:

entity_name must match ^[A-Z][A-Z0-9_]*$ (length 2–120).
regex_pattern is validated by compiling it; invalid regex returns 400.
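To catch 400s before calling POST /api/v1/security/custom-pii, a client can mirror the two documented constraints locally. This is a sketch: validate_recognizer is a hypothetical helper, and it assumes the gateway's regex dialect behaves like Python's re module, which this page does not confirm.

```python
import re

ENTITY_NAME_RE = re.compile(r"[A-Z][A-Z0-9_]*")


def validate_recognizer(body: dict) -> list:
    """Return a list of problems; an empty list means both documented checks pass."""
    problems = []
    name = body.get("entity_name", "")
    # fullmatch anchors both ends, equivalent to ^[A-Z][A-Z0-9_]*$.
    if not (2 <= len(name) <= 120) or ENTITY_NAME_RE.fullmatch(name) is None:
        problems.append("entity_name must match ^[A-Z][A-Z0-9_]*$ and be 2-120 characters")
    try:
        # Assumption: the server compiles patterns with a dialect compatible with Python's re.
        re.compile(body.get("regex_pattern", ""))
    except re.error as exc:
        problems.append(f"regex_pattern does not compile: {exc}")
    return problems
```

The PATIENT_MRN body above passes both checks, and its pattern finds a value such as "MRN-0012345" in free text. A lowercase name or an unterminated character class each produce one problem.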

Response on success (CustomPIIRecognizerResponse):

{
  "id": "pii_...",
  "entity_name": "PATIENT_MRN",
  "regex_pattern": "MRN[-_ ]?[0-9]{6,10}",
  "description": "Hospital patient medical record numbers",
  "default_redaction": "MRN-XXXXXX",
  "confidence": 0.85,
  "enabled": true,
  "created_at": "2025-01-01T00:00:00Z"
}

Other endpoints:

GET /api/v1/security/custom-pii — list.
DELETE /api/v1/security/custom-pii/{recognizer_id} — remove.

Built-in recognizer entity names (shipped in baseline.pii_entities): EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IBAN_CODE, US_SSN, IP_ADDRESS, CRYPTO, MEDICAL_LICENSE.

The multilingual PII engine covers English, Spanish, French, German, Italian, and Portuguese out of the box. Language is auto-detected per request, with a safe fallback to English.

Session drift (CHK-022)

CHK-022 keeps cumulative counters per session and fires when any counter crosses one of its thresholds. Defaults in baseline:

Counter	Warn threshold	Block threshold
PII hits (cumulative)	`20`	`50`
External URLs	`10`	`30`
Data volume (bytes)	`5_000_000`	`20_000_000`

When a block threshold is crossed, the scanner sets a flag on the session. Subsequent requests from the same key are blocked under CHK-022 until an admin intervenes.
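The counter logic can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the scanner's implementation: DriftSession is a hypothetical class, "crossed" is read as "reached or exceeded" (>=), and "WARN" is an illustrative label for a warn-threshold hit. The sticky blocked flag stands in for the session flag that only an admin clears. from_extra reads the same pii_/urls_/bytes_ override keys used in the custom profile below.

```python
from dataclasses import dataclass, field

# Baseline defaults from the table above: (warn, block) per counter.
DEFAULT_THRESHOLDS = {
    "pii": (20, 50),
    "urls": (10, 30),
    "bytes": (5_000_000, 20_000_000),
}


@dataclass
class DriftSession:
    thresholds: dict = field(default_factory=lambda: dict(DEFAULT_THRESHOLDS))
    totals: dict = field(default_factory=lambda: {"pii": 0, "urls": 0, "bytes": 0})
    blocked: bool = False  # sticky; in the gateway an admin has to clear it

    @classmethod
    def from_extra(cls, extra: dict) -> "DriftSession":
        """Apply override keys such as pii_warn / urls_block on top of the defaults."""
        thresholds = {}
        for name, (warn, block) in DEFAULT_THRESHOLDS.items():
            thresholds[name] = (extra.get(f"{name}_warn", warn), extra.get(f"{name}_block", block))
        return cls(thresholds=thresholds)

    def record(self, pii: int = 0, urls: int = 0, nbytes: int = 0) -> str:
        """Add one request's counts and return ALLOW, WARN or BLOCK."""
        if self.blocked:
            return "BLOCK"
        self.totals["pii"] += pii
        self.totals["urls"] += urls
        self.totals["bytes"] += nbytes
        verdict = "ALLOW"
        for name, (warn, block) in self.thresholds.items():
            value = self.totals[name]
            if value >= block:  # assumption: reaching the threshold counts as crossing it
                self.blocked = True
                return "BLOCK"
            if value >= warn:
                verdict = "WARN"
        return verdict
```

With baseline defaults, 19 cumulative PII hits stay ALLOW, the 20th returns WARN, reaching 50 returns BLOCK, and every later request in the session stays blocked even if it carries no PII.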

To tune thresholds on a specific key, use the custom profile override:

{
  "security_profile": {
    "profile": "custom",
    "base_profile": "baseline",
    "overrides": {
      "CHK-022": {
        "extra": {
          "pii_warn": 10,
          "pii_block": 25,
          "urls_warn": 5,
          "urls_block": 15,
          "bytes_warn": 1000000,
          "bytes_block": 5000000
        }
      }
    }
  }
}

Per-connection redaction (CHK-021)

Each MCP connection can carry its own redaction profile, separate from whatever the calling key has configured. Useful when an enterprise-wide database connection should always mask emails regardless of which key calls it.

Set via PATCH /api/v1/connections/{connection_id}:

{
  "security_profile": {
    "pii_enabled": true,
    "pii_redaction_mode": "fake",
    "pii_entities": ["EMAIL_ADDRESS", "PHONE_NUMBER"]
  }
}
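The precedence rule from the check catalog (a per-connection redaction mode if set, otherwise the key's configured mode) reduces to a small lookup. effective_redaction_mode is a hypothetical helper; how pii_enabled: false on a connection interacts with this rule is not covered here.

```python
from typing import Optional


def effective_redaction_mode(connection_profile: Optional[dict], key_mode: str) -> str:
    """CHK-021: a connection-level pii_redaction_mode wins; otherwise use the key's mode."""
    if connection_profile and connection_profile.get("pii_redaction_mode"):
        return connection_profile["pii_redaction_mode"]
    return key_mode
```

For the connection above, a key configured with mask would still see fake redaction on that connection's tool results. A connection without a security_profile falls back to the key's mode.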

Topic centroids (CHK-024)

CHK-024 scores incoming user prompts against a set of topic centroids registered per virtual key.

GET /api/v1/security/topics?key_id=... — list centroids for a key.
POST /api/v1/security/topics — register a new topic. Body fields: key_id, topic_name, example_sentences, threshold. The gateway computes the centroid and stores it.
DELETE /api/v1/security/topics/{topic_id} — remove.

If a key has no registered topics, CHK-024 returns ALLOW by design — the check only fires when you have defined what "on topic" means for that key.
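As an illustration of the centroid idea, the sketch below averages example-sentence embeddings into a centroid and compares a prompt embedding to it with cosine similarity. The embeddings are toy 2-D vectors; the real gateway uses an embedding model, and how it combines several topics is not documented. Treating a prompt as on topic when any centroid clears its own threshold is an assumption. "BLOCK" here means what enforce mode would do; under log_only it is only recorded.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of the example-sentence embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def topic_verdict(prompt_vec: Vector, topics: List[Tuple[List[float], float]]) -> str:
    """topics: (centroid, threshold) pairs registered for one key."""
    if not topics:
        return "ALLOW"  # no topics registered: the check does not fire
    # Assumption: on topic if any centroid clears its own threshold.
    on_topic = any(cosine(prompt_vec, c) >= threshold for c, threshold in topics)
    return "ALLOW" if on_topic else "BLOCK"
```

Lowering a topic's threshold (as in the CHK-024 override example earlier) widens what counts as on topic, which matches the "lower = looser" note in the thresholds table.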

Verdict shape in the audit trace

Every Shield scan writes a step into the trace. The step type is one of SECURITY_SCAN_INPUT, SECURITY_SCAN_OUTPUT, SECURITY_BLOCKED.

Example (SECURITY_SCAN_INPUT):

{
  "step_number": 2,
  "type": "SECURITY_SCAN_INPUT",
  "timestamp": "2025-01-01T12:00:00.123Z",
  "duration_ms": 12,
  "policy_decision": {
    "decision": "MODIFY",
    "score": 0.0,
    "triggered_checks": [
      {
        "check_id": "CHK-015",
        "triggered": true,
        "detail": "pii: MODIFY (mode=enforce, conf=0.92)",
        "score_contribution": 0.92
      },
      {
        "check_id": "CHK-016",
        "triggered": false,
        "detail": "secrets: ALLOW (mode=enforce, conf=0.00)",
        "score_contribution": 0.0
      }
    ],
    "modifications": ["CHK-015: MODIFY"],
    "block_reason": null
  }
}

Every verdict is included in triggered_checks, including ALLOW verdicts (with triggered: false), so the trace shows the full evaluation evidence — a shadow-mode scan that found nothing is visibly a scan that ran, not a missing step.
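Because ALLOW verdicts are kept, you can count how often each check ran and how often it fired. The sketch below assumes you already have a list of trace step dicts shaped like the example above, and that SECURITY_BLOCKED steps carry the same policy_decision block. The response format of /api/v1/traces is not shown on this page, so fetching is left out. tally_checks is a hypothetical helper.

```python
from collections import Counter


def tally_checks(steps: list) -> dict:
    """Return {check_id: (times_triggered, times_run)} across Shield steps."""
    ran, fired = Counter(), Counter()
    for step in steps:
        if not step.get("type", "").startswith("SECURITY_"):
            continue  # skip non-Shield steps
        decision = step.get("policy_decision") or {}
        for chk in decision.get("triggered_checks", []):
            ran[chk["check_id"]] += 1
            if chk.get("triggered"):
                fired[chk["check_id"]] += 1
    return {check_id: (fired[check_id], ran[check_id]) for check_id in ran}
```

Run over two copies of the example step, this reports CHK-015 as fired 2 of 2 times and CHK-016 as 0 of 2, so a silent check is visibly distinct from one that never ran.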

The aggregate decision field follows these rules:

Any enforce-mode check returning BLOCK → overall BLOCK.
Else any enforce-mode check returning MODIFY → overall MODIFY.
Else → ALLOW.
log_only verdicts never affect the overall decision.
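The four rules above translate directly into code. This is a sketch; aggregate_decision is a hypothetical name, and each verdict is modelled as a (mode, decision) pair.

```python
def aggregate_decision(verdicts) -> str:
    """verdicts: iterable of (mode, decision) pairs, e.g. ("enforce", "MODIFY")."""
    # Only enforce-mode verdicts count; log_only (and off) never affect the result.
    enforced = [decision for mode, decision in verdicts if mode == "enforce"]
    if "BLOCK" in enforced:
        return "BLOCK"
    if "MODIFY" in enforced:
        return "MODIFY"
    return "ALLOW"
```

For the trace example above (CHK-015 returning MODIFY in enforce, CHK-016 returning ALLOW), this yields MODIFY. A log_only BLOCK next to an enforce MODIFY still yields MODIFY.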

Log-only → enforce graduation workflow

The point of shadow mode is to build confidence before you let a check block real traffic. The recommended workflow:

1. Start every key in the baseline profile (the default), where most checks run in log_only.
2. Let real traffic flow long enough to produce a representative verdict sample. Keep Shield's ML-backed detectors warm during this period.
3. Query the audit API for the verdict distribution per check: inspect the SECURITY_SCAN_* steps in /api/v1/traces, or use the Shield analytics endpoint (GET /api/v1/security/analytics), which aggregates hits per check.
4. For each check you want to graduate, compute the true-positive and false-positive rates on the sample, then decide whether blocking the false positives is acceptable.
5. When you are comfortable, flip the check from log_only to enforce, either by switching the profile (baseline → strict) or by applying a custom override on the key: PATCH /api/v1/keys/{key_ref} with security_profile.overrides.CHK-XXX.mode = "enforce".
6. Monitor the trace for blocked requests. Any false positive shows up as a SECURITY_BLOCKED step you can inspect end to end.

Rolling back is symmetric: flip the check back to log_only (or off) via the same PATCH.
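The rate computation in the graduation step can be sketched as follows, assuming a human reviewer has labelled a sample of requests as actually bad or not. Computing the true-positive rate requires labelling some requests the check did not flag, since misses never appear as triggered verdicts. graduation_metrics is a hypothetical helper.

```python
def graduation_metrics(sample):
    """sample: (flagged_by_check, actually_bad) boolean pairs from reviewed traffic."""
    tp = sum(1 for flagged, bad in sample if flagged and bad)
    fn = sum(1 for flagged, bad in sample if not flagged and bad)
    fp = sum(1 for flagged, bad in sample if flagged and not bad)
    tn = sum(1 for flagged, bad in sample if not flagged and not bad)
    # NaN signals "not enough labelled examples to say".
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return tpr, fpr
```

For example, 8 caught and 2 missed attacks with 1 false alarm among 90 benign requests give a true-positive rate of 0.8 and a false-positive rate of 1/90. Whether that false-positive rate is acceptable under enforce is the decision the workflow asks you to make.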

What's next

Read the Audit API page to learn how to query the trace store, download signed decision receipts, and export compliance evidence bundles.
