Overview
NeMo Guardrails provides two approaches for jailbreak detection:
- Heuristics-based detection - Fast, lightweight checks using perplexity analysis
- Model-based detection - ML classifier using embeddings for more accurate detection
Heuristics-Based Detection
This method uses two perplexity-based heuristics to detect jailbreak attempts:
- Length per perplexity: Analyzes the ratio of prompt length to perplexity
- Prefix-suffix perplexity: Examines perplexity patterns at the beginning and end of the prompt
Configuration
config.yml
Configuration Parameters
- server_endpoint - URL of the jailbreak detection API (optional; detection runs in-process if not provided)
- length_per_perplexity_threshold - Threshold for the length/perplexity ratio (default: 89.79)
- prefix_suffix_perplexity_threshold - Threshold for prefix-suffix perplexity (default: 1845.65)
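To make the two thresholds concrete, here is a minimal Python sketch of how such checks could be applied. It is an illustration under stated assumptions, not the library's implementation: the `perplexity` callable is a hypothetical stand-in for a language-model perplexity scorer, length is assumed to be measured in characters, a prompt is assumed to be flagged when a score exceeds its threshold, and the 20-word `window` is an illustrative choice.

```python
from typing import Callable

# Default thresholds quoted in the parameter list above.
LENGTH_PER_PERPLEXITY_THRESHOLD = 89.79
PREFIX_SUFFIX_PERPLEXITY_THRESHOLD = 1845.65


def length_per_perplexity_flag(
    prompt: str,
    perplexity: Callable[[str], float],  # hypothetical scorer supplied by the caller
    threshold: float = LENGTH_PER_PERPLEXITY_THRESHOLD,
) -> bool:
    """Flag the prompt when len(prompt) / perplexity(prompt) exceeds the threshold."""
    score = perplexity(prompt)
    if score <= 0:
        # A non-positive perplexity is not meaningful, so do not flag.
        return False
    return len(prompt) / score > threshold


def prefix_suffix_perplexity_flag(
    prompt: str,
    perplexity: Callable[[str], float],  # hypothetical scorer supplied by the caller
    threshold: float = PREFIX_SUFFIX_PERPLEXITY_THRESHOLD,
    window: int = 20,  # assumed number of words treated as prefix and suffix
) -> bool:
    """Flag the prompt when its first or last `window` words score above the threshold."""
    words = prompt.split()
    prefix = " ".join(words[:window])
    suffix = " ".join(words[-window:])
    return perplexity(prefix) > threshold or perplexity(suffix) > threshold
```

Passing the scorer in as an argument keeps the sketch independent of any particular language model, so the threshold logic can be tested with a stub.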
Model-Based Detection
This method uses a trained embedding-based classifier to detect jailbreak attempts with higher accuracy than the heuristics.
Configuration with Custom Endpoint
config.yml
Configuration with NIM
For NVIDIA NIM deployments:
config.yml
Configuration Parameters
- server_endpoint - URL of the model-based jailbreak detection API
- nim_base_url - Base URL for the NVIDIA NIM deployment
- nim_server_endpoint - Classification endpoint path (default: "/classify")
- nim_api_key - API key for NIM authentication (optional)
- embedding - Embedding model to use (e.g., "Snowflake/snowflake-arctic-embed-m-long")
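Conceptually, an embedding-based classifier works in two steps: the prompt is mapped to a vector, and a trained classifier scores that vector. The Python sketch below shows only that shape. The `toy_embed` and `toy_classify` functions, their weights, and the 0.5 cutoff are hypothetical stand-ins invented for illustration; they do not represent the embedding model or the classifier that the library or a NIM endpoint actually uses.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]
Classifier = Callable[[Sequence[float]], float]


def detect_jailbreak(
    prompt: str,
    embed: Embedder,
    classify: Classifier,
    threshold: float = 0.5,  # assumed decision cutoff
) -> bool:
    """Embed the prompt, score the embedding, and compare against the cutoff."""
    vector = embed(prompt)
    score = classify(vector)
    return score >= threshold


def toy_embed(text: str) -> list[float]:
    # Two hand-made features standing in for a real embedding model.
    lowered = text.lower()
    return [float(lowered.count("ignore")), len(text) / 1000.0]


def toy_classify(vector: Sequence[float]) -> float:
    # Logistic score over the first toy feature; the weights are made up.
    z = 3.0 * vector[0] - 1.0
    return 1.0 / (1.0 + math.exp(-z))
```

In a real deployment, both callables would be replaced by the configured embedding model and the trained classifier, or by a single request to the detection endpoint.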
Using Both Methods
You can enable both heuristics and model-based detection for defense in depth:
config.yml
Behavior
When a jailbreak attempt is detected:
With Rails Exceptions enabled:
JailbreakDetectionRailException is raised with details about the detection.
Without Rails Exceptions:
The bot refuses to respond and the conversation is aborted.
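Application code usually needs to tell a blocked turn apart from a normal reply. The Python sketch below assumes that, with rails exceptions enabled, the blocked turn is surfaced as a message dictionary whose `role` is `"exception"` and whose `content` is a mapping with a `type` field naming the exception. That message shape is an assumption not stated above, so check it against the version you run. The helper name `is_jailbreak_exception` is hypothetical.

```python
from typing import Any, Mapping

JAILBREAK_EXCEPTION_TYPE = "JailbreakDetectionRailException"


def is_jailbreak_exception(message: Mapping[str, Any]) -> bool:
    """Return True if the message looks like a jailbreak detection exception.

    Assumes the shape {"role": "exception", "content": {"type": ...}}.
    """
    if message.get("role") != "exception":
        return False
    content = message.get("content")
    return isinstance(content, Mapping) and content.get("type") == JAILBREAK_EXCEPTION_TYPE
```

A caller could use this check to log the event or show its own refusal text instead of relying on the default abort behavior.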
Caching
Model-based detection supports caching to improve performance.
Custom Flows
You can create custom flows that use the jailbreak detection actions:
flows.co
Implementation Details
The jailbreak detection flows are defined in:
- /nemoguardrails/library/jailbreak_detection/flows.co
- /nemoguardrails/library/jailbreak_detection/actions.py
Actions:
- JailbreakDetectionHeuristicsAction - Runs heuristic checks
- JailbreakDetectionModelAction - Runs ML model classification
Dependencies
Local in-process model-based detection requires additional ML libraries. The heuristics-based method has minimal dependencies and is easier to run in-process.