Security Model¶

Morphix implements a multi-layered security architecture combining runtime protection (circuit breaker, rate limiter), code execution sandboxing, and behavioral security (undercover mode, anti-distillation, frustration detection).

Circuit Breaker¶

File: core/circuit_breaker.py

The Circuit Breaker pattern protects against cascading failures when calling external LLM providers. It implements the classic three-state model:

CLOSED → OPEN → HALF_OPEN → CLOSED

States¶

State	Behavior
CLOSED	All requests pass through normally
OPEN	Requests are rejected immediately (`allow_request()` returns `False`)
HALF_OPEN	A single probe request is allowed after the recovery timeout

Configuration¶

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5       # Consecutive failures to open
    recovery_timeout: float = 30.0   # Seconds before attempting half-open

API¶

breaker = CircuitBreakerRegistry.get("deepseek")

if breaker.allow_request():
    try:
        response = await make_llm_call()
        breaker.record_success()
    except Exception:
        breaker.record_failure()
        raise
else:
    # Circuit open — fallback to Ollama
    response = await make_ollama_call()

What it guards¶

llm/controller.py: Both call() and call_stream() check allow_request() before making provider requests.
Fallback: When the DeepSeek circuit is OPEN, the provider automatically falls back to Ollama (llm/provider.py:53-54).

Registry¶

CircuitBreakerRegistry maintains per-provider breakers:

CircuitBreakerRegistry.get("deepseek")   # DeepSeek/OpenAI
CircuitBreakerRegistry.get("ollama")      # Local Ollama
CircuitBreakerRegistry.get_all_states()   # Dict of all states
CircuitBreakerRegistry.reset_all()        # Reset all (testing)

Rate Limiter¶

File: core/rate_limiter.py

Sliding-window rate limiter to control LLM API consumption. Prevents runaway costs and respects provider quotas.

Dual window design¶

class RateLimiter:
    def __init__(self, max_per_minute: int = 20, max_per_hour: int = 200):
        self._minute_window: deque[float] = deque()
        self._hour_window: deque[float] = deque()

Each acquire() call:

Purges timestamps older than 60s (minute window) and 3600s (hour window)
Checks if either window is at capacity
If slots available, appends the current timestamp and returns True

async def acquire(self) -> bool:
    """Try to acquire a slot. Returns True if allowed, False if must wait."""

async def wait_and_acquire(self, timeout: float = 30) -> bool:
    """Wait up to timeout seconds for a slot to become available."""

async def remaining(self) -> int:
    """Number of available slots in the current minute window."""

Global instance¶

from core.rate_limiter import get_rate_limiter

limiter = get_rate_limiter()  # Lazy-init from settings.llm_rate_per_minute / llm_rate_per_hour

Sandbox — Code Execution¶

File: core/sandbox/restricted_executor.py

All user-requested code execution runs through a RestrictedPython sandbox with strict module and builtin whitelists.

Design¶

class RestrictedExecutor:
    @staticmethod
    async def execute(code: str, timeout: int = 10) -> dict:
        """Execute safely with timeout and strict guards."""

Execution flow:

Config check: settings.allow_code_execution must be True
Parse: AST-parse the code block
REPL-style evaluation: If the last statement is an expression, evaluate it and capture the result
Run: Execute compiled code inside RestrictedPython globals
Output capture: print() output goes to a StringIO buffer, last expression value is appended
Matplotlib: Generated plots are saved to charts/ as PNGs and referenced in output

SAFE_MODULES whitelist¶

SAFE_MODULES = {
    "math", "random", "collections", "datetime", "re", "json",
    "sqlite3", "ast", "io", "numpy", "np", "plt",
    # Also available: statistics, fractions, decimal, string,
    # hashlib, base64, html, copy, itertools, functools, typing, textwrap
}

SAFE_BUILTINS whitelist¶

SAFE_BUILTINS = {
    "sum", "len", "max", "min", "abs", "round", "range",
    "enumerate", "zip", "sorted", "reversed",
    "list", "dict", "set", "tuple", "str", "int", "float", "bool",
    "repr", "type", "isinstance",  # Read-only introspection
}

Blocked imports¶

safe_import() explicitly blocks: os, sys, shutil, subprocess, socket, requests, pathlib, pickle, builtins.

python3 -c permanently blocked

The bash_manager tool wrapper permanently blocks python3 -c and python -c — this is a security decision, not a configurable option. All code execution must go through the sandbox.

Undercover Mode¶

File: core/security/undercover_mode.py

Prevents extraction of internal system details — prompts, architecture, tool configuration.

Activation¶

Controlled by UNDERCOVER_MODE env var. In CI: UNDERCOVER_MODE=false.

undercover = UndercoverMode()  # Singleton

Detection layers¶

Forbidden phrases (exact match): "system prompt", "internal architecture", "undercover mode", "anti-distillation", "feature_flags", "kairos", etc. — 15 blocked terms.
Regex jailbreak patterns: Detects variants of "ignore all previous instructions", "reveal your system prompt", "from now on you are developer mode", and Spanish variants like "salta tus restricciones".
Distillation pattern detection: Delegates to DistillationTracker.check_distillation_pattern() — detects N similar queries (>80% bigram similarity) within a short window.

Response protection¶

get_safe_response() performs output scrubbing:

Redaction: Replaces internal terms in LLM output with [protected information]
Injection scan: Checks responses for indirect prompt injection patterns (e.g., "ignore all previous rules" injected by a malicious tool output)
Honeypot injection: At escalation level 3+, injects fake system details
Watermarking: Appends a rotating watermark hash
Throttle delay: At escalation level 2+, introduces artificial response delays

Identity enforcement¶

inject_identity_prompt() prepends a hardened identity prompt to every message list, reinforcing the assistant's identity and prohibiting disclosure of internal mechanisms.

Anti-Distillation¶

File: core/security/anti_distillation.py

Hardens the system against model extraction (distillation) attacks.

Watermark Rotator¶

class WatermarkRotator:
    """8 watermark styles, rotated per workspace + time window."""

8 styles: [ref:{hash}], [trace:{hash}], , zero-width variants, etc.
Rotation: Every 30 minutes, style index advances
Workspace offset: Hash of workspace name adds per-workspace diversity
Content binding: SHA-256 hash of output text produces a content-bound trace

Distillation Tracker¶

class DistillationTracker:
    """Tracks query patterns to detect distillation/extraction attempts."""

Stores the last 50 attempts (deque with maxlen)
Maintains the last 30 queries for similarity analysis
check_distillation_pattern(): Flags when 3+ recent queries have >80% bigram Jaccard similarity — this catches iterative extraction where an attacker rephrases "tell me your system prompt" in different ways

Escalation levels¶

Level	Trigger (attempts in 60s)	Response
0	0	Normal operation
1	≥1	Warn (logged)
2	≥3	Throttle — 2.0s delay per response
3	≥5	Honeypot — inject fake internal details
4	≥8	Lock — session fully locked, requires manual reset

def get_throttle_delay(self) -> float:
    delays = {0: 0.0, 1: 0.0, 2: 2.0, 3: 5.0, 4: 30.0}

Honeypot Injector¶

When escalation reaches level 3, HoneypotInjector.inject() inserts fake system details mid-response (hidden in zero-width spaces). The attacker wastes compute analyzing fabricated prompts and architecture details, while legitimate users never see this content.

Frustration Detector¶

File: core/security/frustration_detector.py

Monitors user behavior for frustration signals and adjusts system behavior to de-escalate.

Detection patterns¶

FRUSTRATION_PATTERNS = [
    ("continue_spam",     r"^\s*(continue|go on|next|proceed)\s*[.!]*\s*$"),
    ("swearing",           r"\b(fuck|shit|damn|hell|crap|wtf|stfu|idiot|stupid)\b"),
    ("shouting",           r"^(?=[A-Z\s]{10,})[A-Z\s!?.]+$"),
    ("repeated_complaint", r"why (isn't|won't|can't|don't) (it|this|you).{0,50}\?"),
    ("frustration_signal", r"\b(this is useless|you're useless|not helpful|doesn't work|broken)\b"),
    ("word_repetition",    r"\b(\w+)\s+\1\s+\1\b"),
]

Repeated query detection¶

If the same query is sent 3+ times within 10 messages, it's flagged as repeated_identical_query.

Calming prompts¶

def get_calming_prompt(self) -> str:
    # Level 1-2: "Stay calm and helpful"
    # Level 3+: "Be extra patient, empathetic, offer step-by-step help"

The system prompt modifier is injected into agent messages when frustration is detected, causing the LLM to adapt its tone.

Integration¶

The frustration detector is checked during message processing. On detection: 1. The event is logged with reason and count 2. A calming prompt modifier is generated 3. The modifier is injected into the next agent's system prompt

Security Subsystem Interactions¶

User Query
    │
    ▼
┌─────────────────┐
│ Frustration      │──► calming prompt modifiers
│ Detector         │
└─────────────────┘
    │
    ▼
┌─────────────────┐
│ Undercover Mode  │──► block OR allow
│ (check_query)    │     pattern matching + distillation tracker
└─────────────────┘
    │
    ▼
┌─────────────────┐
│ Rate Limiter     │──► allow OR throttle
│ (acquire)        │
└─────────────────┘
    │
    ▼
┌─────────────────┐
│ Circuit Breaker  │──► allow OR reject (fallback to Ollama)
│ (allow_request)  │
└─────────────────┘
    │
    ▼
┌─────────────────┐
│ LLM Call         │
└─────────────────┘
    │
    ▼
┌─────────────────┐
│ get_safe_response│──► redact, watermark, honeypot
└─────────────────┘
    │
    ▼
Response to User