# Detection Engine
The detection engine (guard_core.detection_engine) provides multi-layered threat detection with four components working together: pattern compilation, content preprocessing, semantic analysis, and performance monitoring.
These components are orchestrated by the SusPatternsManager handler, which adapter developers do not call directly but may need to understand for tuning and diagnostics.
## Architecture

```mermaid
flowchart TD
    DETECT["SusPatternsManager.detect()"]
    PREPROCESS["1. Preprocess content"]
    NORM["Normalize unicode"]
    DECODE["Decode URL + HTML"]
    NULL["Remove null bytes"]
    WHITESPACE["Normalize whitespace"]
    TRUNCATE["Truncate safely"]
    REGEX["2. Regex matching"]
    SAFE["Safe matcher with timeout"]
    PERF["Record performance metrics"]
    SEMANTIC["3. Semantic analysis"]
    PROB["Attack probability scoring"]
    OBFUSC["Obfuscation detection"]
    INJECT["Code injection risk"]
    AGG["4. Aggregate results"]

    DETECT --> PREPROCESS
    PREPROCESS --> NORM --> DECODE --> NULL --> WHITESPACE --> TRUNCATE
    TRUNCATE --> REGEX
    REGEX --> SAFE --> PERF
    PERF --> SEMANTIC
    SEMANTIC --> PROB --> OBFUSC --> INJECT
    INJECT --> AGG
```
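The four-stage orchestration above can be sketched as a single function. This is a simplified illustration, not the library's implementation: the function name mirrors `detect()`, but the aggregation logic and threshold here are assumptions.

```python
import re

def detect(content: str, patterns: list[str]) -> dict:
    """Simplified sketch of the four-stage detection pipeline."""
    # 1. Preprocess: strip null bytes and collapse whitespace.
    cleaned = " ".join(content.replace("\x00", "").split())

    # 2. Regex matching: run each pattern over the cleaned text.
    regex_hits = [p for p in patterns if re.search(p, cleaned, re.IGNORECASE)]

    # 3. Semantic analysis: crude structural signal (tag-like content).
    semantic_score = 1.0 if re.search(r"<[^>]+>", cleaned) else 0.0

    # 4. Aggregate: any regex hit or a high semantic score is a detection.
    return {
        "matched": bool(regex_hits) or semantic_score >= 0.5,
        "patterns": regex_hits,
        "semantic_score": semantic_score,
    }
```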
## PatternCompiler

`guard_core.detection_engine.compiler.PatternCompiler(default_timeout=5.0, max_cache_size=1000)`

Manages regex pattern compilation with LRU caching and ReDoS safety validation.

Instance attributes: `MAX_CACHE_SIZE = 1000` (class attribute), `default_timeout` (from the constructor), and `max_cache_size = min(max_cache_size, 5000)`.

Methods: `compile_pattern()` (async), `compile_pattern_sync()`, `validate_pattern_safety()`, `create_safe_matcher()`, `batch_compile()` (async), and `clear_cache()` (async).
### Constructor

| Parameter | Description | Bounds |
|---|---|---|
| `default_timeout` | Timeout for safe matchers in seconds | N/A |
| `max_cache_size` | Maximum compiled patterns to cache | Capped at 5000 |
### Key Methods

`compile_pattern(pattern, flags) -> re.Pattern` (async)

Thread-safe compilation with LRU eviction. The cache key is `f"{hash(pattern)}:{flags}"`.

`compile_pattern_sync(pattern, flags) -> re.Pattern`

Synchronous compilation without caching. Used internally by validators and safe matchers.
`validate_pattern_safety(pattern, test_strings) -> tuple[bool, str]`

Validates a pattern against ReDoS vulnerability:

- Checks for known dangerous constructs: `(.*)+`, `(.+)+`, and nested quantifiers.
- Runs the pattern against test strings (default: varying lengths of `'a'`, `'x'+'y'`, and `'<'+'>'`) with a 100 ms timeout per string.
- If any test exceeds 50 ms, the pattern is flagged as unsafe.
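The validation steps above can be sketched as follows. This is an illustrative reconstruction, not the library's code: `DANGEROUS` and the default test strings are assumptions, and the real validator also has to deal with cancelling the runaway worker thread after a timeout.

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Literal matches for the dangerous constructs (.*)+ and (.+)+ (assumed subset).
DANGEROUS = [r"\(\.\*\)\+", r"\(\.\+\)\+"]

def validate_pattern_safety(pattern: str, test_strings=None) -> tuple[bool, str]:
    # Static check: reject known catastrophic-backtracking constructs outright.
    if any(re.search(d, pattern) for d in DANGEROUS):
        return False, "dangerous construct"
    test_strings = test_strings or ["a" * n for n in (10, 100, 1000)]
    compiled = re.compile(pattern)
    with ThreadPoolExecutor(max_workers=1) as pool:
        for s in test_strings:
            start = time.monotonic()
            future = pool.submit(compiled.search, s)
            try:
                future.result(timeout=0.1)       # 100 ms hard cap per string
            except FutureTimeout:
                return False, "timeout"
            if time.monotonic() - start > 0.05:  # flag anything over 50 ms
                return False, "slow execution"
    return True, "ok"
```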
`create_safe_matcher(pattern, timeout) -> Callable[[str], Match | None]`

Returns a closure that executes the regex in a thread pool with a timeout. If the match exceeds the timeout, the future is cancelled and `None` is returned.

```python
safe_match = compiler.create_safe_matcher(r"<script.*?>", timeout=2.0)
result = safe_match(user_input)  # None if the match timed out
```
`batch_compile(patterns, validate) -> dict[str, re.Pattern]` (async)

Compiles multiple patterns, optionally validating each for safety. Unsafe or invalid patterns are silently skipped.
## ContentPreprocessor

`guard_core.detection_engine.preprocessor.ContentPreprocessor(max_content_length=10000, preserve_attack_patterns=True, agent_handler=None, correlation_id=None)`

Normalizes and sanitizes input before pattern matching.

The constructor arguments are stored as instance attributes. In addition, `attack_indicators` holds 21 regex fragments that mark content worth preserving during truncation; they are precompiled case-insensitively into `compiled_indicators`:

```python
attack_indicators = [
    '<script', 'javascript:', 'on\\w+=', 'SELECT\\s+.{0,50}?\\s+FROM',
    'UNION\\s+SELECT', '\\.\\./', 'eval\\s*\\(', 'exec\\s*\\(', 'system\\s*\\(',
    '<?php', '<%', '{{', '{%', '<iframe', '<object', '<embed',
    'onerror\\s*=', 'onload\\s*=', '\\$\\{', '\\\\x[0-9a-fA-F]{2}', '%[0-9a-fA-F]{2}',
]
```

Methods: `preprocess()` (async), `preprocess_batch()` (async), `normalize_unicode()`, `decode_common_encodings()` (async), `remove_null_bytes()`, `remove_excessive_whitespace()`, `truncate_safely()`, and `extract_attack_regions()`.
### Constructor

```python
ContentPreprocessor(
    max_content_length: int = 10000,
    preserve_attack_patterns: bool = True,
    agent_handler: Any = None,
    correlation_id: str | None = None,
)
```
### Preprocessing Pipeline

The `preprocess()` method runs five stages in order:

| Stage | Method | Purpose |
|---|---|---|
| Unicode normalization | `normalize_unicode()` | NFKC normalization + lookalike character replacement |
| Encoding detection | `decode_common_encodings()` | URL decode + HTML entity decode (up to 3 iterations) |
| Null byte removal | `remove_null_bytes()` | Strips `\x00` and control characters except tab/newline/CR |
| Whitespace normalization | `remove_excessive_whitespace()` | Collapses multiple spaces, strips leading/trailing |
| Safe truncation | `truncate_safely()` | Truncates to `max_content_length`, preserving attack regions |
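The five stages can be sketched with standard-library calls. This is a minimal sketch, not the library's implementation: it omits lookalike replacement and attack-preserving truncation, and the control-character range is an assumption.

```python
import html
import re
import unicodedata
from urllib.parse import unquote

def preprocess(content: str, max_len: int = 10000) -> str:
    """Minimal sketch of the five-stage pipeline (no attack-region preservation)."""
    # 1. Unicode normalization (NFKC folds fullwidth forms, etc.).
    content = unicodedata.normalize("NFKC", content)
    # 2. Decode URL and HTML-entity encodings, up to 3 iterations.
    for _ in range(3):
        decoded = html.unescape(unquote(content))
        if decoded == content:
            break
        content = decoded
    # 3. Strip null bytes and control characters except tab/newline/CR.
    content = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", content)
    # 4. Collapse runs of spaces/tabs and strip the ends.
    content = re.sub(r"[ \t]+", " ", content).strip()
    # 5. Naive truncation to the length bound.
    return content[:max_len]
```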
### Attack-Preserving Truncation

When content exceeds `max_content_length` and `preserve_attack_patterns` is `True`:

- `extract_attack_regions()` scans for 21 attack indicator patterns (e.g., `<script`, `SELECT ... FROM`, `eval(`, `../`).
- Regions around matches (100 characters of context on each side) are extracted.
- Overlapping regions are merged.
- Attack regions are included first, then non-attack content fills the remaining space.

This ensures that truncated content still contains the attack patterns for detection.
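The region extraction and merging steps can be sketched as follows; the reduced `INDICATORS` list here is an illustrative subset of the 21 real patterns.

```python
import re

# Hypothetical reduced indicator list; the real preprocessor uses 21 patterns.
INDICATORS = [re.compile(p, re.IGNORECASE) for p in (r"<script", r"eval\s*\(", r"\.\./")]
CONTEXT = 100  # characters of context kept on each side of a match

def extract_attack_regions(content: str) -> list[tuple[int, int]]:
    """Return merged (start, end) spans around attack-indicator matches."""
    regions = []
    for rx in INDICATORS:
        for m in rx.finditer(content):
            regions.append((max(0, m.start() - CONTEXT),
                            min(len(content), m.end() + CONTEXT)))
    # Merge overlapping regions so each attack is extracted exactly once.
    regions.sort()
    merged: list[tuple[int, int]] = []
    for start, end in regions:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```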
### Unicode Lookalike Map

The preprocessor replaces over 20 Unicode characters used for evasion:

| Unicode | Replacement | Purpose |
|---|---|---|
| `\u2044` | `/` | Fraction slash evasion |
| `\uff0f` | `/` | Fullwidth solidus |
| `\u200b` | (empty) | Zero-width space |
| `\uff1c` | `<` | Fullwidth less-than |
| `\uff1e` | `>` | Fullwidth greater-than |
| `\u037e` | `;` | Greek question mark |
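A replacement map like this is typically applied with `str.translate`. The sketch below uses only the six entries shown in the table; the real map has over 20.

```python
# Subset of the lookalike map shown above, as a translation table.
LOOKALIKES = str.maketrans({
    "\u2044": "/",   # fraction slash
    "\uff0f": "/",   # fullwidth solidus
    "\u200b": "",    # zero-width space (deleted)
    "\uff1c": "<",   # fullwidth less-than
    "\uff1e": ">",   # fullwidth greater-than
    "\u037e": ";",   # Greek question mark
})

def fold_lookalikes(content: str) -> str:
    """Replace evasion characters with their ASCII equivalents."""
    return content.translate(LOOKALIKES)
```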
## SemanticAnalyzer

`guard_core.detection_engine.semantic.SemanticAnalyzer()`

Performs structural and statistical analysis of content to detect attacks that evade pattern matching. Three instance attributes drive the analysis:

```python
attack_keywords = {
    'xss': {'script', 'javascript', 'onerror', 'onload', 'onclick', 'onmouseover',
            'alert', 'eval', 'document', 'cookie', 'window', 'location'},
    'sql': {'select', 'union', 'insert', 'update', 'delete', 'drop', 'from', 'where',
            'order', 'group', 'having', 'concat', 'substring', 'database', 'table', 'column'},
    'command': {'exec', 'system', 'shell', 'cmd', 'bash', 'powershell', 'wget', 'curl',
                'nc', 'netcat', 'chmod', 'chown', 'sudo', 'passwd'},
    'path': {'etc', 'passwd', 'shadow', 'hosts', 'proc', 'boot', 'win', 'ini'},
    'template': {'render', 'template', 'jinja', 'mustache', 'handlebars', 'ejs', 'pug', 'twig'},
}
attack_structures = {
    'tag_like': '<[^>]+>',
    'function_call': '\\w+\\s*\\([^)]*\\)',
    'command_chain': '[;&|]{1,2}',
    'path_traversal': '\\.{2,}[/\\\\]',
    'url_pattern': '[a-z]+://',
}
suspicious_chars = {
    'brackets': '[<>{}()\\[\\]]',
    'quotes': '[\'\\"`]',
    'slashes': '[/\\\\]',
    'special': '[;&|$]',
    'wildcards': '[*?]',
}
```

Methods: `analyze()`, `analyze_attack_probability()`, `analyze_code_injection_risk()`, `calculate_entropy()`, `detect_encoding_layers()`, `detect_obfuscation()`, `extract_suspicious_patterns()`, `extract_tokens()`, and `get_threat_score()`.
### Attack Probability Analysis

Returns a dictionary mapping attack types to probability scores (0.0 - 1.0):

| Attack Type | Keywords Checked | Structural Boost |
|---|---|---|
| `xss` | script, javascript, onerror, onload, alert, eval, document... | `<...>` tags |
| `sql` | select, union, insert, update, delete, drop, from, where... | SQL keywords |
| `command` | exec, system, shell, cmd, bash, wget, curl, sudo... | `;&\|` operators |
| `path` | etc, passwd, shadow, hosts, proc, boot, win, ini | `../` traversal |
| `template` | render, template, jinja, mustache, handlebars... | N/A |

The score is computed as `min(keyword_match_ratio + structural_boost, 1.0)`.
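The scoring formula above can be sketched as follows. The reduced keyword sets and the 0.3 boost value are illustrative assumptions; only the `min(ratio + boost, 1.0)` shape comes from the source.

```python
import re

# Reduced keyword sets; the analyzer tracks five attack categories.
KEYWORDS = {
    "xss": {"script", "javascript", "alert", "eval", "onerror"},
    "sql": {"select", "union", "from", "where", "drop"},
}
BOOSTS = {"xss": r"<[^>]+>", "sql": r"\b(select|union)\b"}

def analyze_attack_probability(content: str) -> dict[str, float]:
    """Score each attack type by keyword overlap plus a structural boost."""
    tokens = set(re.findall(r"[a-z_]\w*", content.lower()))
    scores = {}
    for attack, words in KEYWORDS.items():
        ratio = len(tokens & words) / len(words)
        boost = 0.3 if re.search(BOOSTS[attack], content, re.IGNORECASE) else 0.0
        scores[attack] = min(ratio + boost, 1.0)
    return scores
```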
### Entropy Calculation

Shannon entropy of the character distribution. High entropy (> 4.5) indicates potential obfuscation or encoded payloads.
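Shannon entropy over a character distribution is a standard computation; a minimal version:

```python
import math
from collections import Counter

def calculate_entropy(content: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not content:
        return 0.0
    total = len(content)
    counts = Counter(content)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A uniform alphabet of 36 characters yields log2(36) ≈ 5.17 bits per character, well above the 4.5 threshold; repetitive text scores near zero.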
### Obfuscation Detection

`detect_obfuscation()` returns `True` when any of:
- Entropy > 4.5
- More than 2 encoding layers detected
- Special character ratio > 40%
- Contiguous non-space run > 100 characters
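Three of the four signals can be sketched directly (the encoding-layer count depends on `detect_encoding_layers()` and is omitted here):

```python
import math
import re
from collections import Counter

def detect_obfuscation(content: str) -> bool:
    """Sketch of the obfuscation signals (encoding-layer count omitted)."""
    if not content:
        return False
    # Shannon entropy above 4.5 bits/char suggests encoded payloads.
    n = len(content)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(content).values())
    if entropy > 4.5:
        return True
    # More than 40% special (non-alphanumeric, non-space) characters.
    special = sum(1 for ch in content if not ch.isalnum() and not ch.isspace())
    if special / n > 0.4:
        return True
    # A contiguous run of more than 100 non-space characters.
    return any(len(run) > 100 for run in re.split(r"\s+", content))
```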
### Code Injection Risk

`analyze_code_injection_risk()` scores (0.0 - 1.0) based on:

- Code-like patterns (`{}`, function calls, variable references)
- AST parseability (attempts `ast.parse(content, mode="eval")` with a 100 ms timeout)
- Injection keywords (`eval`, `exec`, `compile`, `__import__`, `globals`, `locals`)
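A sketch of the three checks; the per-check weights are illustrative assumptions, not the library's, and the 100 ms parse timeout is elided.

```python
import ast
import re

INJECTION_KEYWORDS = {"eval", "exec", "compile", "__import__", "globals", "locals"}

def analyze_code_injection_risk(content: str) -> float:
    """Score 0.0-1.0; the 0.3/0.3/0.4 weights are illustrative."""
    score = 0.0
    # Code-like structure: function-call shapes.
    if re.search(r"\w+\s*\([^)]*\)", content):
        score += 0.3
    # Parseable as a Python expression (real check adds a 100 ms timeout).
    try:
        ast.parse(content, mode="eval")
        score += 0.3
    except (SyntaxError, ValueError):
        pass
    # Known injection primitives appearing as tokens.
    if INJECTION_KEYWORDS & set(re.findall(r"\w+", content)):
        score += 0.4
    return min(score, 1.0)
```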
### Threat Score

Aggregates all analysis results into a single 0.0 - 1.0 score:

| Component | Weight |
|---|---|
| Max attack probability | 30% |
| Obfuscation detected | 20% |
| Encoding layers | 10-20% (`min(layers * 10%, 20%)`) |
| Code injection risk | 20% |
| Suspicious patterns | 5-10% (`min(count * 5%, 10%)`) |
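The weighted combination in the table can be sketched as follows; the dictionary key names are assumptions, only the weights come from the source.

```python
def get_threat_score(results: dict) -> float:
    """Weighted aggregation per the table above (key names are illustrative)."""
    score = 0.0
    score += max(results.get("attack_probabilities", {"none": 0.0}).values()) * 0.30
    score += 0.20 if results.get("obfuscation_detected") else 0.0
    score += min(results.get("encoding_layers", 0) * 0.10, 0.20)
    score += results.get("injection_risk", 0.0) * 0.20
    score += min(results.get("suspicious_pattern_count", 0) * 0.05, 0.10)
    return min(score, 1.0)
```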
## PerformanceMonitor

`guard_core.detection_engine.monitor.PerformanceMonitor(anomaly_threshold=3.0, slow_pattern_threshold=0.1, history_size=1000, max_tracked_patterns=1000)`

Tracks pattern execution performance and detects anomalies.

Constructor arguments are clamped on assignment:

```python
anomaly_threshold = max(1.0, min(10.0, float(anomaly_threshold)))
slow_pattern_threshold = max(0.01, min(10.0, float(slow_pattern_threshold)))
history_size = max(100, min(10000, int(history_size)))
max_tracked_patterns = max(100, min(5000, int(max_tracked_patterns)))
```

Other instance attributes: `anomaly_callbacks = []`, `pattern_stats = {}`, and `recent_metrics = deque(maxlen=history_size)`.

Methods: `record_metric(pattern, execution_time, content_length, matched, timeout=False, agent_handler=None, correlation_id=None)` (async), `clear_stats()` (async), `register_anomaly_callback()`, `get_pattern_report()`, `get_slow_patterns()`, `get_problematic_patterns()`, and `get_summary_stats()`.
### Constructor

```python
PerformanceMonitor(
    anomaly_threshold: float = 3.0,
    slow_pattern_threshold: float = 0.1,
    history_size: int = 1000,
    max_tracked_patterns: int = 1000,
)
```
| Parameter | Description | Bounds |
|---|---|---|
| `anomaly_threshold` | Z-score threshold for statistical anomalies | 1.0 - 10.0 |
| `slow_pattern_threshold` | Seconds to consider a pattern slow | 0.01 - 10.0 |
| `history_size` | Recent metrics to retain | 100 - 10,000 |
| `max_tracked_patterns` | Maximum unique patterns to track | 100 - 5,000 |
### Metric Recording

Each pattern execution records a `PerformanceMetric` dataclass:

```python
@dataclass
class PerformanceMetric:
    pattern: str
    execution_time: float
    content_length: int
    timestamp: datetime
    matched: bool
    timeout: bool = False
```
### Anomaly Detection

Three types of anomalies are detected after each metric recording:

| Anomaly Type | Condition |
|---|---|
| `timeout` | The pattern execution timed out |
| `slow_execution` | Execution time > `slow_pattern_threshold` (without timeout) |
| `statistical_anomaly` | Z-score of execution time > `anomaly_threshold` (needs >= 10 samples) |
Detected anomalies are sent as events to the agent handler and forwarded to registered callbacks.
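The statistical check can be sketched as a z-score against the pattern's recorded history; the function name and the constant-baseline fallback are illustrative.

```python
import statistics

def is_statistical_anomaly(times: list[float], new_time: float,
                           threshold: float = 3.0) -> bool:
    """Flag an execution whose z-score against prior timings exceeds the threshold."""
    if len(times) < 10:          # needs >= 10 samples for a stable baseline
        return False
    mean = statistics.mean(times)
    stdev = statistics.pstdev(times)
    if stdev == 0:
        return new_time > mean   # any deviation from a constant baseline
    return (new_time - mean) / stdev > threshold
```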
### Diagnostics

| Method | Returns |
|---|---|
| `get_pattern_report(p)` | Stats for a specific pattern (executions, matches, timeouts, avg/max/min time) |
| `get_slow_patterns(n)` | Top N slowest patterns by average execution time |
| `get_problematic_patterns()` | Patterns with >10% timeout rate or consistently slow execution |
| `get_summary_stats()` | Overall summary (total executions, avg time, timeout rate, match rate) |
### Callback Registration

Callbacks registered via `register_anomaly_callback()` receive a sanitized anomaly dictionary with truncated pattern strings.