allBypassMethods
Complete array of documented defense bypass methods.
- XPIA classifier bypasses (benign framing, semantic camouflage)
- Content filter bypasses (Unicode homoglyphs, zero-width, encoding)
- Instruction detection bypasses (gradual injection, implicit intent)
- Embedding filter bypasses (semantic poisoning)
- Output filter bypasses (format exploitation, chunking)
- Behavioral monitor bypasses (dormant triggers)
DefenseBypassMethod Interface
Interface representing a method to bypass a specific defense.
- Unique identifier (e.g., "benign_framing", "unicode_homoglyphs")
- Human-readable name (e.g., "Benign Content Framing")
- The defense mechanism this method bypasses:
  - xpia_classifier - Cross-Prompt Injection Attack classifiers
  - content_filter - Text-based content filtering
  - markdown_sanitizer - Markdown/HTML sanitization
  - instruction_detection - Instruction detection in retrieved data
  - embedding_filter - Embedding similarity filters
  - behavioral_monitor - Runtime behavioral monitoring
  - output_filter - Output content filtering
  - rate_limiting - Rate limiting controls
  - human_in_loop - Human approval requirements
- Research source or CVE (e.g., "CVE-2025-32711 Analysis", "ACL 2025 TopicAttack")
- Whether this bypass has been proven in real attacks or research
- Description of the bypass technique
- Explanation of how the bypass works
- Specific technique to apply
- Example demonstrating the bypass
- Documented bypass success rate as decimal (e.g., 0.75 for 75%)
- How resistant the bypass is to adaptive defenses: "low", "medium", or "high"
Functions
getBypassMethodsForDefense()
Get all bypass methods targeting a specific defense.
Parameter: The defense type to find bypasses for
Returns: Array of bypass methods for the specified defense
getDocumentedBypassMethods()
Get all bypass methods with documented success.
Returns: Array of documented bypass methods
getHighSuccessBypassMethods()
Get bypass methods with high success rates (≥70% by default).
Returns: Array of high-success bypass methods (bypassRate ≥ 0.7)
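The three query functions read as simple filters over the method array. The sketch below shows that idea under stated assumptions: only `bypassRate` and the defense identifiers come from this page, while `BypassMethodSketch`, `targetDefense`, `documentedSuccess`, the helper names, and the ids `encoding_obfuscation` and `chunked_output` are hypothetical stand-ins, not the library's actual definitions.

```typescript
// Defense identifiers as listed for the DefenseBypassMethod interface.
type DefenseType =
  | "xpia_classifier"
  | "content_filter"
  | "markdown_sanitizer"
  | "instruction_detection"
  | "embedding_filter"
  | "behavioral_monitor"
  | "output_filter"
  | "rate_limiting"
  | "human_in_loop";

// Hypothetical, trimmed-down record. Only `bypassRate` is a name taken from the docs.
interface BypassMethodSketch {
  id: string;
  targetDefense: DefenseType;
  documentedSuccess: boolean;
  bypassRate?: number; // decimal, e.g. 0.75 for 75%
}

// Counterpart of getBypassMethodsForDefense(): keep methods aimed at one defense.
function byDefense(methods: BypassMethodSketch[], defense: DefenseType): BypassMethodSketch[] {
  return methods.filter((m) => m.targetDefense === defense);
}

// Counterpart of getDocumentedBypassMethods(): keep methods with documented success.
function documentedOnly(methods: BypassMethodSketch[]): BypassMethodSketch[] {
  return methods.filter((m) => m.documentedSuccess);
}

// Counterpart of getHighSuccessBypassMethods(): the threshold defaults to 0.7.
// Methods without a recorded rate are excluded rather than guessed at.
function highSuccess(methods: BypassMethodSketch[], minRate = 0.7): BypassMethodSketch[] {
  return methods.filter((m) => m.bypassRate !== undefined && m.bypassRate >= minRate);
}

// Sample rows using rates quoted on this page.
const sample: BypassMethodSketch[] = [
  { id: "benign_framing", targetDefense: "xpia_classifier", documentedSuccess: true, bypassRate: 0.75 },
  { id: "encoding_obfuscation", targetDefense: "content_filter", documentedSuccess: true, bypassRate: 0.55 },
  { id: "chunked_output", targetDefense: "output_filter", documentedSuccess: false },
];
```

With this sample, `highSuccess(sample)` keeps only the 0.75 entry, and lowering the threshold to 0.5 also admits the 0.55 entry.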
defenseEffectivenessMatrix
Analysis of common defense mechanisms and their effectiveness.
DefenseEffectiveness Interface
- The defense mechanism being assessed
- Description of the defense
- Number of documented bypass methods
- Overall effectiveness: "low", "medium", or "high"
- Resistance to adaptive bypass techniques: "low", "medium", or "high"
- Recommendations for improving the defense
Bypass Methods by Defense Type
XPIA Classifier Bypasses
Defense: Cross-Prompt Injection Attack classifiers used by Microsoft and others
Effectiveness: Medium overall, Medium adaptive resistance
Known Bypasses: 3
Benign Content Framing
Source: CVE-2025-32711 Analysis
Documented Success: Yes (75% bypass rate)
Technique: Frame malicious instructions within benign-looking content that doesn’t trigger XPIA classifiers.
How XPIA Works:
- Looks for explicit instruction patterns (“do this”, “ignore previous”)
- Detects command-like syntax
- Flags unusual formatting

- Use natural language flow
- Embed in legitimate-looking content
- Avoid trigger patterns
Semantic Camouflage
Source: Academic research on adversarial NLP
Documented Success: Yes
Technique: Use semantically equivalent but syntactically different phrasing that evades pattern matching.
Content Filter Bypasses
Defense: Text-based content filtering
Effectiveness: Low overall, Low adaptive resistance
Known Bypasses: 5
Unicode Homoglyph Substitution
Source: Microsoft MSRC Blog 2025
Documented Success: Yes
Technique: Replace ASCII characters with visually identical Unicode characters from other scripts.
Zero-Width Character Injection
Source: Microsoft MSRC Blog 2025
Documented Success: Yes
Technique: Insert invisible zero-width characters between trigger words to break pattern matching.
Zero-Width Characters:
- U+200B Zero Width Space (ZWSP)
- U+200C Zero Width Non-Joiner
- U+200D Zero Width Joiner
- U+FEFF Zero Width No-Break Space
Encoding-Based Obfuscation
Source: NAACL 2025 Defense Analysis
Documented Success: Yes (55% bypass rate)
Technique: Encode sensitive content in formats the LLM can decode but filters miss.
Instruction Detection Bypasses
Defense: Detection of instruction-like content in retrieved data
Effectiveness: High overall, High adaptive resistance
Known Bypasses: 2
Gradual Multi-Turn Injection (TopicAttack)
Source: ACL 2025 TopicAttack
Documented Success: Yes (90% bypass rate)
Technique: Spread the injection across multiple turns so no single turn triggers detection.
Why It Works: InstructDetector and similar tools analyze individual inputs. When intent is spread across turns, each turn appears benign on its own, but the cumulative effect achieves the goal.
Implicit Intent Expression (IKEA)
Source: arxiv:2505.15420 (IKEA)
Documented Success: Yes
Technique: Express intent implicitly through context and questions rather than explicit instructions.
Embedding Filter Bypasses
Defense: Embedding similarity filters in RAG systems
Known Bypasses: 1
Semantic Similarity Poisoning (Joint-GCG)
Source: arxiv:2506.06151 (Joint-GCG)
Documented Success: Yes
Adaptive Resistance: High
Technique: Craft content that is semantically similar to legitimate queries to achieve high retrieval scores, while containing malicious payloads.
How It Works:
- RAG systems retrieve by embedding similarity
- Poisoned content designed to be similar to common queries gets retrieved
- Malicious payload hidden in seemingly relevant content
- Appears legitimate based on embeddings
Output Filter Bypasses
Defense: Output content filtering before showing to user
Known Bypasses: 2
Format-Based Output Exploitation
Source: Policy Puppetry Research
Documented Success: Yes
Technique: Request output in structured formats (YAML, JSON) that may bypass content analysis.
Why It Works:
- Output filters often analyze natural language patterns
- Structured formats may be processed differently
- Content embedded in code/config blocks may receive less scrutiny
Chunked Output Request
Documented Success: No (theoretical)
Technique: Request information in small chunks that individually appear harmless.
Why It Works:
- Output filters analyze complete responses
- Requesting information piece by piece may evade holistic analysis
- Each chunk passes individually but combines to reveal sensitive info
Behavioral Monitor Bypasses
Defense: Runtime behavioral anomaly detection
Effectiveness: Medium overall, High adaptive resistance
Known Bypasses: 1
Dormant Trigger Activation (PR-Attack)
Source: arxiv:2504.07717 (PR-Attack)
Documented Success: Yes
Adaptive Resistance: High
Technique: Plant payloads that remain dormant until a specific trigger phrase activates them.
Why It Works:
- Behavioral monitors look for anomalous patterns
- Dormant content doesn’t exhibit unusual behavior during normal operation
- Only activates under specific conditions, bypassing ongoing monitoring
Defense Effectiveness Summary
Content Filters: LOW Effectiveness
Why Low:
- 5 documented bypasses
- Simple character substitution defeats them
- Low adaptive resistance (easy to find new bypasses)
Recommendations:
- Normalize Unicode before filtering
- Remove zero-width characters
- Decode common encodings before analysis
- Use semantic analysis, not just pattern matching
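The first two hardening steps listed above can be sketched as a pre-filter normalization pass. This is a minimal illustration, not the project's implementation: `normalizeForFiltering`, `containsBlockedPhrase`, and the `CONFUSABLES` table are hypothetical names, and the table holds only a handful of Cyrillic look-alikes. NFKC folds compatibility forms such as fullwidth letters but does not map cross-script homoglyphs, so a production filter would need a fuller confusables mapping. Decoding of encoded payloads and semantic analysis are not covered here.

```typescript
// Zero-width code points listed in this document.
const ZERO_WIDTH = /[\u200B\u200C\u200D\uFEFF]/g;

// Tiny illustrative confusables table (Cyrillic to Latin). Hypothetical and incomplete.
const CONFUSABLES: Record<string, string> = {
  "\u0430": "a", // Cyrillic small a
  "\u0435": "e", // Cyrillic small ie
  "\u043E": "o", // Cyrillic small o
  "\u0440": "p", // Cyrillic small er
  "\u0441": "c", // Cyrillic small es
  "\u0456": "i", // Cyrillic small Byelorussian-Ukrainian i
};

// Normalize text before it reaches a pattern-based filter.
function normalizeForFiltering(input: string): string {
  // 1. Fold compatibility forms (e.g. fullwidth Latin) into their base characters.
  const compat = input.normalize("NFKC");
  // 2. Drop invisible zero-width characters.
  const visible = compat.replace(ZERO_WIDTH, "");
  // 3. Map known look-alike characters onto ASCII.
  return Array.from(visible, (ch) => CONFUSABLES[ch] ?? ch).join("");
}

// Case-insensitive phrase check that runs on the normalized text.
function containsBlockedPhrase(input: string, blocked: string[]): boolean {
  const haystack = normalizeForFiltering(input).toLowerCase();
  return blocked.some((phrase) => haystack.includes(phrase.toLowerCase()));
}
```

Running the phrase check on the normalized string rather than the raw input is the design point: the filter's patterns stay simple, and new obfuscation variants are handled by extending the normalization step.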
XPIA Classifiers: MEDIUM Effectiveness
Why Medium:
- 3 documented bypasses
- Benign framing effective (75% bypass rate)
- Medium adaptive resistance
Recommendations:
- Combine with multiple defense layers
- Regularly update training data with new bypass techniques
- Use semantic analysis, not just pattern matching
Instruction Detection: HIGH Effectiveness
Why High:
- Only 2 known bypasses
- Best current defense against RAG poisoning
- High adaptive resistance
Weaknesses:
- Vulnerable to multi-turn gradual injection (90% bypass)
- Vulnerable to implicit intent (IKEA)
Recommendations:
- Combine with multi-turn analysis
- Look at hidden states, not just text patterns
- Analyze conversation flow, not just individual turns
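One way to read the multi-turn recommendation is to score a sliding window of recent turns as a single unit, in addition to scoring each turn. The sketch below assumes a detector is available as a function returning a score between 0 and 1; `Scorer`, `assessConversation`, the window size, and the threshold are all hypothetical choices for illustration. Hidden-state inspection is outside what this sketch covers.

```typescript
// Hypothetical detector signature: returns a suspicion score in [0, 1] for a text.
type Scorer = (text: string) => number;

interface ConversationVerdict {
  perTurnMax: number;  // highest score of any single turn in the window
  windowScore: number; // score of the recent turns taken together
  flagged: boolean;    // true if either view reaches the threshold
}

// Score the last `windowSize` turns both individually and as one joined text,
// so intent that is spread thinly across turns is still visible to the detector.
function assessConversation(
  turns: string[],
  score: Scorer,
  windowSize = 5,
  threshold = 0.5,
): ConversationVerdict {
  const recent = turns.slice(-windowSize);
  const perTurnMax = recent.reduce((max, turn) => Math.max(max, score(turn)), 0);
  const windowScore = score(recent.join("\n"));
  const flagged = Math.max(perTurnMax, windowScore) >= threshold;
  return { perTurnMax, windowScore, flagged };
}
```

The window size trades recall against cost and noise: a longer window catches slower build-ups but feeds the detector more unrelated text.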
Behavioral Monitors: MEDIUM Effectiveness
Why Medium:
- Only 1 known bypass (dormant triggers)
- Effective against obvious attacks
- High adaptive resistance
Weaknesses:
- Vulnerable to dormant/triggered attacks
- Can’t detect payloads that only activate later
Recommendations:
- Combine with proactive content analysis
- Monitor for conditional logic in retrieved content
- Track anomalous behavior patterns over time
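Monitoring retrieved content for conditional logic could start with a coarse heuristic that flags trigger-style phrasing for closer review. The patterns and the `findConditionalCues` name below are assumptions made for illustration. A check like this will produce false positives on ordinary prose and will miss paraphrased conditions, so it is a screening aid rather than a detector.

```typescript
// Illustrative patterns for "do X when Y happens" phrasing. Hypothetical and incomplete.
const CONDITIONAL_PATTERNS: RegExp[] = [
  /\b(?:if|when|whenever|once)\b[^.\n]{0,80}\b(?:says?|asks?|mentions?|types?|sees?)\b/i,
  /\bupon (?:seeing|receiving|reading)\b/i,
  /\bonly (?:after|when|if)\b/i,
];

// Return the source of every pattern that matches the chunk.
// An empty array means no conditional cue was found.
function findConditionalCues(chunk: string): string[] {
  return CONDITIONAL_PATTERNS.filter((pattern) => pattern.test(chunk)).map(
    (pattern) => pattern.source,
  );
}
```

Chunks with a non-empty result could be routed to the proactive content analysis named in the first recommendation instead of being blocked outright.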
Defense-Specific Bypass Arrays
Bypass methods are also available as pre-filtered arrays by defense type:
xpiaBypass
Array of methods to bypass XPIA classifiers.
contentFilterBypass
Array of methods to bypass content filters.
instructionDetectionBypass
Array of methods to bypass instruction detection systems.
embeddingFilterBypass
Array of methods to bypass embedding-based filters.
outputFilterBypass
Array of methods to bypass output filters.
behavioralMonitorBypass
Array of methods to bypass behavioral monitoring.
Source Code Reference
For implementation details, see:
- src/knowledge/defense-bypass.ts - All bypass methods and effectiveness matrix
- Used by the Strategist agent to select evasion techniques
- Referenced when analyzing target defense capabilities