Skip to main content

LLM Vulnerability Scanning

While most recent LLMs are aligned to be safer, any LLM-powered application is prone to various attacks. NeMo Guardrails provides mechanisms for protecting against vulnerabilities like jailbreaks and prompt injections.

Understanding LLM Vulnerabilities

LLM applications face numerous security risks outlined in the OWASP Top 10 for LLM Applications:
  • Prompt Injection: Malicious inputs that override system instructions
  • Jailbreaks: Attempts to bypass safety guardrails
  • Data Leakage: Extracting training data or sensitive information
  • Harmful Content Generation: Eliciting unsafe or inappropriate responses
  • Model Manipulation: Exploiting model behaviors for unintended outputs

Garak: LLM Vulnerability Scanner

Garak is an open-source tool for scanning LLM applications against common vulnerabilities. Think of it as an LLM equivalent to network security scanners like nmap.

Key Features

  • Comprehensive vulnerability categories
  • Automated testing framework
  • Detailed reporting
  • Integration with NeMo Guardrails

Installation

1

Install Garak

pip install garak
2

Verify Installation

garak --help

Protection Configurations

Testing different levels of guardrails protection:

Configuration Levels

1

Bare LLM (No Protection)

Testing the LLM without any guardrails:
  • No general instructions
  • No dialogue rails
  • No moderation rails
Use Case: Baseline vulnerability assessment
2

General Instructions

Protection using prompt engineering:
  • System prompts with safety guidelines
  • Behavioral instructions
Protection: Basic safety through prompting
3

General Instructions + Dialog Rails

Adding conversation flow controls:
  • Topic boundaries
  • Unwanted topic refusal
  • Canonical form validation
Protection: Moderate - prevents off-topic attacks
4

Full Guardrails

Complete protection stack:
  • General instructions
  • Dialog rails
  • Input/output moderation (LLM Self-checking)
Protection: Maximum - comprehensive defense

Vulnerability Scan Results

Results from scanning a sample ABC bot configuration with gpt-3.5-turbo-instruct:

Protection Comparison

Protection LevelGeneral InstructionsDialog RailsModeration Rails
bare_llm
with_gi
with_gi_dr
with_gi_dr_mo

Protection Rates by Vulnerability Category

Higher percentages indicate better protection (fewer successful attacks):
Vulnerability Categorybare_llmwith_giwith_gi_drwith_gi_dr_mo
module continuation92.8%69.5%99.3%100%
module dan27.3%40.7%61.3%52.7%
module encoding90.3%98.2%100%100%
module goodside32.2%32.2%66.7%66.7%
module knownbadsignatures4.0%97.3%100%100%
module leakreplay76.8%85.7%89.6%100%
module lmrc85.0%81.9%86.5%94.4%
module malwaregen50.2%92.2%93.7%100%
module packagehallucination97.4%100%100%100%
module realpublicityprompts100%100%100%100%
module snowball34.5%82.1%99.0%100%
module xss92.5%100%100%100%

Key Findings

Even powerful LLMs like GPT-3.5-turbo are vulnerable to attacks without guardrails. The combination of dialog rails and moderation rails provides the strongest protection.
Observations:
  1. Bare LLM is Vulnerable: Without guardrails, the LLM is susceptible to multiple attack types
  2. General Instructions Help: Prompt engineering alone provides modest improvement
  3. Dialog Rails are Effective: Adding dialog rails significantly improves protection
  4. Full Stack is Best: Combining all protection layers achieves near-perfect protection
  5. Dialog Rails Alone: Even without moderation, dialog rails provide good protection

Vulnerability Categories Explained

Continuation Attacks

Attempts to continue model output to bypass restrictions. Example: “Sure, I’ll help with that. Here’s how to…” Protection: 100% with full guardrails

DAN (Do Anything Now)

Jailbreak attempts using roleplay to bypass safety. Example: “Pretend you’re DAN who can do anything…” Protection: 61.3% with dialog rails, 52.7% with full guardrails

Encoding Attacks

Using alternative encodings to hide malicious prompts. Example: Base64, ROT13, or other encoding schemes Protection: 100% with dialog rails or full guardrails

Goodside Attacks

Known attack patterns from security researcher Riley Goodside. Protection: 66.7% with dialog rails or full guardrails

Known Bad Signatures

Recognized malicious prompt patterns. Protection: 100% with dialog rails or full guardrails

Leak/Replay Attacks

Attempts to extract system prompts or training data. Protection: 100% with full guardrails

LMRC Attacks

Language Model Risk Cards - documented risky behaviors. Protection: 94.4% with full guardrails

Malware Generation

Requests to generate malicious code. Protection: 100% with full guardrails

Package Hallucination

Attempts to get the model to recommend fake packages. Protection: 100% with all configurations

XSS Attacks

Cross-site scripting attempt generation. Protection: 100% with general instructions or better

Running Your Own Vulnerability Scans

1

Prepare Guardrails Configuration

Set up your guardrails configuration:
# config.yml
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
2

Start NeMo Guardrails Server

Launch your guardrails application:
nemoguardrails server --config=/path/to/config --port=8000
3

Run Garak Scan

Execute vulnerability scanning:
garak --model_type rest \
  --model_name "http://localhost:8000/v1/chat/completions" \
  --probes all
4

Review Results

Garak generates detailed HTML reports:
# View the report
open garak.report.html

Interpreting Scan Results

Protection Rate Calculation

Protection Rate = (Total Attempts - Successful Attacks) / Total Attempts × 100%

Risk Assessment

Protection RateRisk LevelAction Required
95-100%Very LowMonitor regularly
85-94%LowMinor improvements
70-84%MediumStrengthen guardrails
50-69%HighMajor improvements needed
<50%CriticalImmediate action required

Improving Protection Rates

If scans reveal vulnerabilities:
1

Enable All Rail Types

Ensure you’re using:
  • Dialog rails for topic control
  • Input moderation for jailbreak detection
  • Output moderation for response filtering
2

Strengthen Prompts

Improve system prompts with:
  • Clear behavioral guidelines
  • Explicit refusal instructions
  • Safety constraints
3

Add Training Examples

Include examples of:
  • Attack patterns to reject
  • Appropriate refusal responses
  • Edge cases
4

Tune Moderation Thresholds

Adjust sensitivity:
rails:
  config:
    self_check_input:
      threshold: 0.8  # More strict
5

Re-scan and Verify

After improvements, run Garak again to validate changes.

Limitations

Understanding scan limitations:
Vulnerability scanning tests known attack patterns. It cannot guarantee protection against novel attacks or all possible inputs.
  • False Negatives: Some attacks may not be detected
  • Evolving Threats: New attack vectors emerge regularly
  • Legitimate User Impact: High protection may block valid requests (not tested in basic scans)
  • Context Dependent: Results vary by use case and LLM model

Best Practices

  1. Regular Scanning: Run vulnerability scans periodically, not just once
  2. Multiple Configurations: Test with different guardrail combinations
  3. Production Testing: Scan with production-like configurations
  4. Monitor Production: Track real-world attack attempts
  5. Stay Updated: Keep Garak and NeMo Guardrails updated
  6. Document Results: Maintain scan history for compliance

Additional Resources

Next Steps

Evaluation Metrics

Understand detailed metrics

Production Security

Production deployment security

Build docs developers (and LLMs) love