

Shield configuration

12 checks, profiles, thresholds, fail_behavior, custom PII recognizers, session drift, and the shadow-to-enforce workflow.


Every Shield check, every profile, and every knob you can turn to move from shadow mode to enforce without breaking real traffic.

TL;DR

  • 12 checks ship today, grouped into S1 (input/output), S2 (MCP-aware), and S3 (high-assurance). Their IDs run from CHK-013 to CHK-024.
  • Three built-in profiles: none, baseline (default), strict. A custom profile is available for per-check overrides.
  • Every check has three modes: off, log_only, enforce. baseline runs most checks in log_only so you see verdicts without blocking.
  • fail_behavior controls what happens when a detector is broken or unreachable — fail_closed blocks the request, fail_open lets it through with a warning verdict.
  • You can register custom PII recognizers through the admin API and tune session-drift thresholds per key through profile overrides. Neither requires on-disk configuration.

Check catalog

All 12 checks are served by the gateway's Shield layer. Scanner names are behavioural — they describe what the check detects, not how it detects it.

| ID | Name | Scan point | Phase | Default mode (baseline) | Default fail_behavior | Detects |
|---|---|---|---|---|---|---|
| CHK-013 | Prompt Injection | Pre-LLM | S1 | log_only | fail_closed | Attempts in the user input to override system instructions or hijack model behaviour. Multilingual, classifier-backed. |
| CHK-014 | Jailbreak | Pre-LLM | S1 | log_only | fail_closed | Jailbreak patterns (persona overrides, role-play attacks). Produced by the same classifier pass as CHK-013. |
| CHK-015 | PII in Input | Pre-LLM | S1 | enforce | fail_open | Personally identifiable information (email, phone, credit card, national ID, IP address, etc.) in the user input. Redacts inline using the configured redaction_mode. |
| CHK-016 | Secrets in Input | Pre-LLM | S1 | enforce | fail_open | Secret tokens in the user input. Pattern-matched against a built-in catalog (see below). Inline redaction. |
| CHK-017 | Toxicity in Output | Post-LLM | S1 | log_only | fail_open | Toxic, threatening, insulting, or hateful content in the model response. Multilingual, classifier-backed. |
| CHK-018 | PII in Output | Post-LLM | S1 | log_only | fail_open | PII accidentally leaked in the model response. Same engine and redaction modes as CHK-015. |
| CHK-019 | Secrets in Output | Post-LLM | S1 | log_only | fail_open | Secret tokens the model echoes back. Same pattern catalog as CHK-016. |
| CHK-020 | Tool-Output Injection | Post-tool | S2 | log_only | fail_open | Indirect prompt-injection payloads hidden inside MCP tool results before they re-enter the LLM context. |
| CHK-021 | PII in Tool Output | Post-tool | S2 | log_only | fail_open | PII in MCP tool results. Uses a per-connection redaction mode if set, otherwise the key's configured mode. |
| CHK-022 | Session Drift | Session-level | S2 | log_only | fail_open | Cumulative exfiltration pressure across a session (PII hits, external URLs, data volume), with separate warn and block thresholds. |
| CHK-023 | Grounding / Hallucination | Post-LLM | S3 | log_only | fail_open | Whether the model's response is supported by the supplied reference context (tool outputs, system prompt, or explicit reference). Runs off the hot path; the verdict is recorded in the audit trace. |
| CHK-024 | Off-Topic Detection | Pre-LLM | S3 | log_only | fail_open | User inputs that fall outside the key's allowed-topic set, determined by embedding similarity against per-key topic vectors. |

Profiles

A profile is a named bundle of per-check configurations, stored as JSONB on the virtual key (VirtualKeyORM.security_profile). The three built-ins come from BUILTIN_PROFILES.

none

All checks off. Zero Shield overhead. Use only for trusted internal services where Shield is not required.

baseline (default)

  • Enforce: CHK-015 (PII in input), CHK-016 (secrets in input) — these redact inline and return the modified text to the model.
  • Log-only: CHK-013, CHK-014, CHK-017, CHK-018, CHK-019, CHK-020, CHK-021, CHK-022, CHK-023, CHK-024.
  • PII defaults: pii_redaction_mode: fake; pii_entities: EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IBAN_CODE, US_SSN, IP_ADDRESS, CRYPTO, MEDICAL_LICENSE.

strict

Everything that was log_only in baseline is upgraded to enforce, except CHK-023. CHK-023 remains log_only in strict because its verdict arrives asynchronously — after the response has already been returned — and therefore cannot retroactively block it.

custom

Per-check overrides applied on top of an optional base profile. The per-key JSONB looks like:

{
  "profile": "custom",
  "base_profile": "baseline",
  "pii_redaction_mode": "mask",
  "pii_entities": ["EMAIL_ADDRESS", "PHONE_NUMBER"],
  "overrides": {
    "chk_013": { "mode": "enforce" },
    "chk_022": { "extra": { "pii_block": 100 } }
  }
}

Check IDs inside overrides can be either CHK-013 or chk_013 — the resolver normalises them.

Legacy profile names

If a JSONB blob references highassurance from an earlier internal build, the resolver silently remaps it to strict. You should not need to care about this.
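The resolution rules above (built-in lookup, a custom profile layered on a base profile, check-ID normalisation, and the legacy remap) can be sketched as follows. This is a minimal illustration, not the gateway's resolver: `resolve_profile`, `normalise_check_id`, and the two-check `BUILTIN_PROFILES` stand-in are hypothetical. It also assumes that a missing base_profile means baseline, that an override's `extra` dict is merged one level deep rather than replaced, and that strict keeps baseline thresholds. Top-level PII fields such as pii_redaction_mode are ignored here.

```python
import copy
import re

# Hypothetical, heavily trimmed stand-in for BUILTIN_PROFILES (two checks only),
# with values taken from the baseline tables on this page.
_BASELINE = {
    "CHK-013": {"mode": "log_only", "threshold": 0.90, "fail_behavior": "fail_closed", "extra": {}},
    "CHK-022": {"mode": "log_only", "threshold": 0.85, "fail_behavior": "fail_open",
                "extra": {"pii_warn": 20, "pii_block": 50}},
}
BUILTIN_PROFILES = {
    "none": {cid: {**cfg, "mode": "off"} for cid, cfg in _BASELINE.items()},
    "baseline": _BASELINE,
    # Both sample checks are log_only in baseline, so strict enforces them.
    # Thresholds are assumed unchanged from baseline.
    "strict": {cid: {**cfg, "mode": "enforce"} for cid, cfg in _BASELINE.items()},
}
LEGACY_NAMES = {"highassurance": "strict"}


def normalise_check_id(raw):
    """Accept "CHK-013" or "chk_013" (any case) and return "CHK-013"."""
    m = re.fullmatch(r"chk[-_](\d{3})", raw.strip(), flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"unrecognised check id: {raw!r}")
    return f"CHK-{m.group(1)}"


def resolve_profile(blob):
    """Turn a per-key security_profile blob into per-check settings."""
    name = blob.get("profile", "baseline")
    name = LEGACY_NAMES.get(name, name)
    if name != "custom":
        return copy.deepcopy(BUILTIN_PROFILES[name])
    base = blob.get("base_profile", "baseline")  # assumed default base
    base = LEGACY_NAMES.get(base, base)
    resolved = copy.deepcopy(BUILTIN_PROFILES[base])  # never mutate the built-ins
    for raw_id, override in blob.get("overrides", {}).items():
        check = resolved.setdefault(normalise_check_id(raw_id), {"extra": {}})
        for key, value in override.items():
            if key == "extra":
                # Merge so counters you did not override keep their defaults.
                check.setdefault("extra", {}).update(value)
            else:
                check[key] = value
    return resolved
```

Feeding it the custom example above yields CHK-013 in enforce mode and CHK-022 with pii_block raised to 100 while pii_warn stays at 20. For the authoritative result, compare against GET /api/v1/security/profiles.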

Reading the resolved profile

GET /api/v1/security/profiles

Returns every built-in profile with each check's mode, threshold, fail_behavior, redaction_mode, plus the tenant default. Useful as the source of truth for an admin UI or policy-as-code tooling.

Thresholds

Each check can carry a threshold, a fail_behavior, a redaction_mode, and a free-form extra dict.

Defaults in baseline:

| Check | threshold | fail_behavior | extra / notes |
|---|---|---|---|
| CHK-013 / CHK-014 | 0.90 | fail_closed | |
| CHK-015 / CHK-018 / CHK-021 | 0.5 | fail_open | redaction_mode: fake, entities: [...] |
| CHK-016 / CHK-019 | 0.85 (default) | fail_open | Pattern match; the threshold is effectively hit-or-miss. |
| CHK-017 | 0.90 | fail_open | |
| CHK-020 | 0.90 | fail_open | |
| CHK-022 | 0.85 (default) | fail_open | Counter thresholds; see Session drift. |
| CHK-023 | 0.7 | fail_open | async: true, daily_token_budget: 100_000, max_context_chars: 8000. |
| CHK-024 | 0.50 | fail_open | Embedding similarity; lower = looser. |

In the custom profile you can override any of these per key. Example override payload for PATCH /api/v1/keys/{key_ref}:

{
  "security_profile": {
    "profile": "custom",
    "base_profile": "strict",
    "overrides": {
      "CHK-024": { "threshold": 0.40 },
      "CHK-022": { "extra": { "pii_warn": 10, "pii_block": 25 } }
    }
  }
}
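If you apply overrides from a script, a small helper keeps the payload shape consistent. This sketch uses only the Python standard library and builds the request without sending it, so you can inspect it first. The helper name `build_override_request`, the base URL, the key reference, and the `Authorization: Bearer` header are assumptions for illustration, not documented parts of the API.

```python
import json
import urllib.request


def build_override_request(base_url, key_ref, token, overrides, base_profile="baseline"):
    """Build, but do not send, a PATCH that applies per-check overrides to one key."""
    payload = {
        "security_profile": {
            "profile": "custom",
            "base_profile": base_profile,
            "overrides": overrides,
        }
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/keys/{key_ref}",
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme: replace with whatever your deployment uses.
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical host, key reference and token; reproduces the payload above.
req = build_override_request(
    "https://gateway.example.com",
    "vk_example",
    "ADMIN_TOKEN",
    {
        "CHK-024": {"threshold": 0.40},
        "CHK-022": {"extra": {"pii_warn": 10, "pii_block": 25}},
    },
    base_profile="strict",
)
# with urllib.request.urlopen(req) as resp:  # uncomment to actually send it
#     print(resp.status)
```

Sending the complete custom profile each time, as this helper does, avoids depending on merge semantics for nested objects that this page does not describe.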

fail_behavior semantics

  • fail_open — if the underlying detector is broken (timeout, exception, model unavailable), the verdict is an ALLOW with an error field describing what happened. The request proceeds. The verdict is still recorded in the audit trace.
  • fail_closed — if the detector is broken, the request is blocked with a 503-equivalent verdict. Use for checks whose failure must not be treated as a pass (prompt injection, jailbreak).

log_only mode is orthogonal to fail_behavior: log-only verdicts never affect the overall decision, even when a fail_closed detector breaks. They only appear in the trace.
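The two behaviours can be sketched as a wrapper around a single detector call. `Verdict`, `run_detector`, and `broken_detector` are hypothetical names, and representing the fail_closed outcome as a BLOCK verdict carrying a 503 hint is an assumption; the page only specifies ALLOW-with-error for fail_open and a blocking, 503-equivalent verdict for fail_closed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    decision: str                      # "ALLOW", "MODIFY" or "BLOCK"
    error: Optional[str] = None        # set when the detector itself failed
    status_hint: Optional[int] = None  # 503-equivalent for fail_closed failures


def run_detector(detector: Callable[[str], Verdict], text: str, fail_behavior: str) -> Verdict:
    """Call one detector and translate a detector fault according to fail_behavior."""
    try:
        return detector(text)
    except Exception as exc:  # timeout, exception, model unavailable
        error = f"{type(exc).__name__}: {exc}"
        if fail_behavior == "fail_closed":
            # A broken detector must not be treated as a pass.
            return Verdict("BLOCK", error=error, status_hint=503)
        # fail_open: let the request through, but keep the error for the trace.
        return Verdict("ALLOW", error=error)


def broken_detector(text: str) -> Verdict:
    """Stand-in for an unreachable classifier."""
    raise TimeoutError("classifier did not respond")
```

In this sketch the check's mode is applied afterwards: a BLOCK produced this way by a log_only check is recorded in the trace but, per the rule above, does not change the overall decision.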

Built-in secret categories

CHK-016 (input) and CHK-019 (output) match against a catalog of 17 patterns. The patterns themselves are not published, only the category names:

  • AWS_ACCESS_KEY, AWS_SECRET_KEY
  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • GITHUB_PAT, GITHUB_APP_TOKEN, GITHUB_FINE_GRAINED
  • GITLAB_PAT
  • SLACK_BOT_TOKEN, SLACK_USER_TOKEN
  • STRIPE_SECRET_LIVE, STRIPE_SECRET_TEST, STRIPE_RESTRICTED
  • PRIVATE_KEY_PEM
  • JWT_TOKEN
  • GOOGLE_API_KEY
  • VEROSEK_KEY (the gateway catches its own virtual keys leaking through a prompt).

Custom PII recognizers

You can add domain-specific entity types (Patient MRN, Case Number, Internal Customer ID, Employee SSO Token, etc.) via the admin API. Custom recognizers are loaded into the PII scanner at gateway startup.

POST /api/v1/security/custom-pii

Request body (validated by CustomPIIRecognizerCreate):

{
  "entity_name": "PATIENT_MRN",
  "regex_pattern": "MRN[-_ ]?[0-9]{6,10}",
  "description": "Hospital patient medical record numbers",
  "default_redaction": "MRN-XXXXXX",
  "confidence": 0.85,
  "enabled": true
}

Constraints:

  • entity_name must match ^[A-Z][A-Z0-9_]*$ (length 2–120).
  • regex_pattern is validated by compiling it; invalid regex returns 400.
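Both constraints can be checked client-side before calling the endpoint, which saves a round trip that would end in a 400. The helper below is a hypothetical mirror of the documented rules; the server's validator remains authoritative. It assumes the gateway's regex dialect is close to Python's `re`, which this page does not state.

```python
import re

# ^[A-Z][A-Z0-9_]*$ from the constraints; fullmatch avoids "$" accepting a trailing newline.
ENTITY_NAME_RE = re.compile(r"[A-Z][A-Z0-9_]*")


def validate_recognizer(body):
    """Return a list of problems; an empty list means the documented checks pass."""
    problems = []
    name = body.get("entity_name", "")
    if not 2 <= len(name) <= 120:
        problems.append("entity_name must be 2-120 characters long")
    if ENTITY_NAME_RE.fullmatch(name) is None:
        problems.append("entity_name must match ^[A-Z][A-Z0-9_]*$")
    try:
        re.compile(body.get("regex_pattern", ""))
    except re.error as exc:
        problems.append(f"regex_pattern does not compile: {exc}")
    return problems
```

Because regex dialects differ between engines, treat a local pass as a hint rather than a guarantee that the POST will succeed.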

Response on success (CustomPIIRecognizerResponse):

{
  "id": "pii_...",
  "entity_name": "PATIENT_MRN",
  "regex_pattern": "MRN[-_ ]?[0-9]{6,10}",
  "description": "Hospital patient medical record numbers",
  "default_redaction": "MRN-XXXXXX",
  "confidence": 0.85,
  "enabled": true,
  "created_at": "2025-01-01T00:00:00Z"
}

Other endpoints:

  • GET /api/v1/security/custom-pii — list.
  • DELETE /api/v1/security/custom-pii/{recognizer_id} — remove.

Built-in recognizer entity names (shipped in baseline.pii_entities): EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IBAN_CODE, US_SSN, IP_ADDRESS, CRYPTO, MEDICAL_LICENSE.

The multilingual PII engine covers English, Spanish, French, German, Italian, and Portuguese out of the box. Language is auto-detected per request with a safe fallback to English.

Session drift (CHK-022)

CHK-022 keeps cumulative counters per session and fires when any of them crosses a threshold. Defaults in baseline:

| Counter | Warn threshold | Block threshold |
|---|---|---|
| PII hits (cumulative) | 20 | 50 |
| External URLs | 10 | 30 |
| Data volume (bytes) | 5_000_000 | 20_000_000 |

When a block threshold is crossed, the scanner sets a flag on the session. With CHK-022 in enforce mode, subsequent requests from the same key are then blocked under CHK-022 until an admin intervenes.
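The counter-and-threshold logic can be sketched like this. `SessionDrift`, its methods, and the `WARN` label are hypothetical. The sketch also assumes a counter crosses a threshold when it reaches or exceeds it (the page does not say whether the comparison is inclusive) and that an admin reset clears both the flag and the counters. `from_extra` reads the pii_warn / urls_block / bytes_warn style keys used by the custom-profile override.

```python
from dataclasses import dataclass, field

# Baseline defaults from the table above: counter -> (warn, block).
DEFAULT_THRESHOLDS = {
    "pii": (20, 50),
    "urls": (10, 30),
    "bytes": (5_000_000, 20_000_000),
}


@dataclass
class SessionDrift:
    thresholds: dict = field(default_factory=lambda: dict(DEFAULT_THRESHOLDS))
    totals: dict = field(default_factory=lambda: {"pii": 0, "urls": 0, "bytes": 0})
    blocked: bool = False  # sticky once set; only admin_reset() clears it

    @classmethod
    def from_extra(cls, extra):
        """Apply <counter>_warn / <counter>_block override keys on top of the defaults."""
        thresholds = {
            name: (extra.get(f"{name}_warn", warn), extra.get(f"{name}_block", block))
            for name, (warn, block) in DEFAULT_THRESHOLDS.items()
        }
        return cls(thresholds=thresholds)

    def record(self, pii=0, urls=0, nbytes=0):
        """Add one request's counts; return "ALLOW", "WARN" or "BLOCK"."""
        if self.blocked:
            return "BLOCK"
        self.totals["pii"] += pii
        self.totals["urls"] += urls
        self.totals["bytes"] += nbytes
        decision = "ALLOW"
        for name, (warn, block) in self.thresholds.items():
            if self.totals[name] >= block:
                self.blocked = True
                return "BLOCK"
            if self.totals[name] >= warn:
                decision = "WARN"
        return decision

    def admin_reset(self):
        """Hypothetical admin intervention: clear the flag and the counters."""
        self.totals = {name: 0 for name in self.totals}
        self.blocked = False
```

Because the counters are cumulative, many small requests eventually trip a threshold that no single request would, which is the exfiltration pattern this check targets.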

To tune thresholds on a specific key, use the custom profile override:

{
  "security_profile": {
    "profile": "custom",
    "base_profile": "baseline",
    "overrides": {
      "CHK-022": {
        "extra": {
          "pii_warn": 10,
          "pii_block": 25,
          "urls_warn": 5,
          "urls_block": 15,
          "bytes_warn": 1000000,
          "bytes_block": 5000000
        }
      }
    }
  }
}

Per-connection redaction (CHK-021)

Each MCP connection can carry its own redaction profile, separate from whatever the calling key has configured. Useful when an enterprise-wide database connection should always mask emails regardless of which key calls it.

Set via PATCH /api/v1/connections/{connection_id}:

{
  "security_profile": {
    "pii_enabled": true,
    "pii_redaction_mode": "fake",
    "pii_entities": ["EMAIL_ADDRESS", "PHONE_NUMBER"]
  }
}
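The precedence rule stated for CHK-021 in the check catalog (the connection's own mode wins, otherwise the calling key's mode applies) reduces to a small helper. The function name and dict shapes are illustrative, and only the redaction mode is resolved, since that is all the catalog specifies; "fake" is the baseline default named on this page.

```python
def effective_redaction_mode(connection_profile, key_profile):
    """Pick the CHK-021 redaction mode for one MCP tool result."""
    conn = connection_profile or {}
    if conn.get("pii_redaction_mode"):
        return conn["pii_redaction_mode"]  # per-connection setting wins
    # Fall back to the calling key, then to the baseline default.
    return key_profile.get("pii_redaction_mode", "fake")
```

With the payload above on a shared database connection, a key configured for mask still gets fake redaction on that connection's tool results.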

Topic centroids (CHK-024)

CHK-024 scores incoming user prompts against a set of topic centroids registered per virtual key.

  • GET /api/v1/security/topics?key_id=... — list centroids for a key.
  • POST /api/v1/security/topics — register a new topic. Body: { "key_id", "topic_name", "example_sentences", "threshold" }. The gateway computes the centroid and stores it.
  • DELETE /api/v1/security/topics/{topic_id} — remove.

If a key has no registered topics, CHK-024 returns ALLOW by design — the check only fires when you have defined what "on topic" means for that key.
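The scoring idea can be sketched with plain vectors. This is a hypothetical illustration rather than the gateway's math: embeddings are passed in precomputed (the embedding model is out of scope), each centroid is assumed to be the element-wise mean of its example-sentence embeddings, and similarity is assumed to be cosine similarity compared against each topic's registered threshold.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length example-sentence embeddings."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def off_topic_verdict(prompt_vec, topics):
    """topics: list of (centroid, threshold) pairs registered for one key."""
    if not topics:
        return "ALLOW"  # nothing registered: the check does not fire
    if any(cosine(prompt_vec, c) >= t for c, t in topics):
        return "ALLOW"
    return "BLOCK"  # only recorded unless CHK-024 runs in enforce mode


# Toy 2-D "embeddings" standing in for a real embedding model's output.
billing = centroid([[1.0, 0.0], [0.8, 0.2]])
topics = [(billing, 0.50)]
```

Lowering a topic's threshold widens the region counted as on topic, which matches the "lower = looser" note in the thresholds table.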

Verdict shape in the audit trace

Every Shield scan writes a step into the trace. The step type is one of SECURITY_SCAN_INPUT, SECURITY_SCAN_OUTPUT, SECURITY_BLOCKED.

Example (SECURITY_SCAN_INPUT):

{
  "step_number": 2,
  "type": "SECURITY_SCAN_INPUT",
  "timestamp": "2025-01-01T12:00:00.123Z",
  "duration_ms": 12,
  "policy_decision": {
    "decision": "MODIFY",
    "score": 0.0,
    "triggered_checks": [
      {
        "check_id": "CHK-015",
        "triggered": true,
        "detail": "pii: MODIFY (mode=enforce, conf=0.92)",
        "score_contribution": 0.92
      },
      {
        "check_id": "CHK-016",
        "triggered": false,
        "detail": "secrets: ALLOW (mode=enforce, conf=0.00)",
        "score_contribution": 0.0
      }
    ],
    "modifications": ["CHK-015: MODIFY"],
    "block_reason": null
  }
}

Every verdict is included in triggered_checks, including ALLOW verdicts (with triggered: false), so the trace shows the full evaluation evidence — a shadow-mode scan that found nothing is visibly a scan that ran, not a missing step.

The aggregate decision field follows these rules:

  • Any enforce-mode check returning BLOCK → overall BLOCK.
  • Else any enforce-mode check returning MODIFY → overall MODIFY.
  • Else → ALLOW.
  • log_only verdicts never affect the overall decision.
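These rules fit in a small function. The input shape, a list of `(mode, decision)` pairs, is a simplification invented for this sketch; a real trace step carries the richer triggered_checks objects shown above.

```python
def aggregate(verdicts):
    """verdicts: iterable of (mode, decision) pairs; decision is ALLOW, MODIFY or BLOCK."""
    enforced = [decision for mode, decision in verdicts if mode == "enforce"]
    if "BLOCK" in enforced:
        return "BLOCK"
    if "MODIFY" in enforced:
        return "MODIFY"
    return "ALLOW"  # off and log_only verdicts never reach this decision
```

A log_only BLOCK next to an enforced MODIFY therefore yields MODIFY, as in a baseline key that redacts PII while shadowing prompt injection.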

Log-only → enforce graduation workflow

The point of shadow mode is to build confidence before you let a check block real traffic. The recommended workflow:

  1. Start every key in the baseline profile (the default). Most checks are log_only.
  2. Let real traffic flow for enough time to produce a representative verdict sample. Keep Shield's ML-backed detectors warm.
  3. Query the audit API for the verdict distribution per check. Look at the SECURITY_SCAN_* steps in /api/v1/traces. The Shield analytics endpoint (GET /api/v1/security/analytics) aggregates hits per check.
  4. For each check you want to graduate, compute the true-positive and false-positive rates on the sample, then decide whether blocking the false positives is acceptable.
  5. When comfortable, flip the check from log_only to enforce, either by switching the profile (baseline → strict) or by applying a custom override on the key: PATCH /api/v1/keys/{key_ref} with security_profile.overrides.CHK-XXX.mode = "enforce".
  6. Monitor the trace for blocked requests. Any false positive should show up as a SECURITY_BLOCKED step you can inspect end-to-end.

Rolling back is symmetric: flip the check back to log_only (or off) via the same PATCH.
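Steps 3 and 4 come down to counting. The sketch below walks trace steps shaped like the SECURITY_SCAN_INPUT example above and tallies how often each check triggered. Labels are yours to supply: here, a hypothetical set of `(trace_id, check_id)` pairs a reviewer confirmed as real threats, where `trace_id` is whatever identifier you fetched the trace by. Paging through /api/v1/traces is not shown, and the function name is illustrative. Note that `false_pos_share` is the fraction of triggers that were wrong (what enforcing would cost you); a classic false-positive rate would also need labels on the scans that did not trigger.

```python
from collections import Counter


def trigger_stats(steps, confirmed):
    """
    steps: iterable of (trace_id, step_dict) pairs.
    confirmed: set of (trace_id, check_id) pairs labelled as true positives.
    Returns {check_id: {"triggered", "true_pos", "false_pos", "false_pos_share"}}.
    """
    triggered, true_pos = Counter(), Counter()
    for trace_id, step in steps:
        if not step.get("type", "").startswith("SECURITY_SCAN_"):
            continue  # skip SECURITY_BLOCKED and any non-Shield steps
        for check in step["policy_decision"]["triggered_checks"]:
            if check["triggered"]:
                cid = check["check_id"]
                triggered[cid] += 1
                if (trace_id, cid) in confirmed:
                    true_pos[cid] += 1
    return {
        cid: {
            "triggered": n,
            "true_pos": true_pos[cid],
            "false_pos": n - true_pos[cid],
            "false_pos_share": (n - true_pos[cid]) / n,
        }
        for cid, n in triggered.items()
    }
```

Because ALLOW verdicts are also present in triggered_checks with triggered: false, the same loop could count scans per check as the denominator for a trigger rate.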

What's next

Read the Audit API for how to query the trace store, download signed decision receipts, and export compliance evidence bundles.