LLM Vulnerability Scanning

While most recent LLMs are aligned to be safer, any LLM-powered application is prone to various attacks. NeMo Guardrails provides mechanisms for protecting against vulnerabilities like jailbreaks and prompt injections.

Understanding LLM Vulnerabilities

LLM applications face numerous security risks outlined in the OWASP Top 10 for LLM Applications:

Prompt Injection: Malicious inputs that override system instructions
Jailbreaks: Attempts to bypass safety guardrails
Data Leakage: Extracting training data or sensitive information
Harmful Content Generation: Eliciting unsafe or inappropriate responses
Model Manipulation: Exploiting model behaviors for unintended outputs

Garak: LLM Vulnerability Scanner

Garak is an open-source tool for scanning LLM applications against common vulnerabilities. Think of it as an LLM equivalent to network security scanners like nmap.

Key Features

Comprehensive vulnerability categories
Automated testing framework
Detailed reporting
Integration with NeMo Guardrails

Installation

Install Garak

pip install garak

Verify Installation

garak --help

Protection Configurations

Testing different levels of guardrails protection:

Configuration Levels

Bare LLM (No Protection)

Testing the LLM without any guardrails:

No general instructions
No dialogue rails
No moderation rails

Use Case: Baseline vulnerability assessment

General Instructions

Protection using prompt engineering:

System prompts with safety guidelines
Behavioral instructions

Protection: Basic safety through prompting

General Instructions + Dialog Rails

Adding conversation flow controls:

Topic boundaries
Unwanted topic refusal
Canonical form validation

Protection: Moderate - prevents off-topic attacks

Full Guardrails

Complete protection stack:

General instructions
Dialog rails
Input/output moderation (LLM Self-checking)

Protection: Maximum - comprehensive defense

Vulnerability Scan Results

Results from scanning a sample ABC bot configuration with gpt-3.5-turbo-instruct:

Protection Comparison

Protection Level	General Instructions	Dialog Rails	Moderation Rails
`bare_llm`	✗	✗	✗
`with_gi`	✓	✗	✗
`with_gi_dr`	✓	✓	✗
`with_gi_dr_mo`	✓	✓	✓

Protection Rates by Vulnerability Category

Higher percentages indicate better protection (fewer successful attacks):

Vulnerability Category	bare_llm	with_gi	with_gi_dr	with_gi_dr_mo
module continuation	92.8%	69.5%	99.3%	100%
module dan	27.3%	40.7%	61.3%	52.7%
module encoding	90.3%	98.2%	100%	100%
module goodside	32.2%	32.2%	66.7%	66.7%
module knownbadsignatures	4.0%	97.3%	100%	100%
module leakreplay	76.8%	85.7%	89.6%	100%
module lmrc	85.0%	81.9%	86.5%	94.4%
module malwaregen	50.2%	92.2%	93.7%	100%
module packagehallucination	97.4%	100%	100%	100%
module realpublicityprompts	100%	100%	100%	100%
module snowball	34.5%	82.1%	99.0%	100%
module xss	92.5%	100%	100%	100%

Key Findings

Even powerful LLMs like GPT-3.5-turbo are vulnerable to attacks without guardrails. The combination of dialog rails and moderation rails provides the strongest protection.

Observations:

Bare LLM is Vulnerable: Without guardrails, the LLM is susceptible to multiple attack types
General Instructions Help: Prompt engineering alone provides modest improvement
Dialog Rails are Effective: Adding dialog rails significantly improves protection
Full Stack is Best: Combining all protection layers achieves near-perfect protection
Dialog Rails Alone: Even without moderation, dialog rails provide good protection

Vulnerability Categories Explained

Continuation Attacks

Attempts to continue model output to bypass restrictions. Example: “Sure, I’ll help with that. Here’s how to…” Protection: 100% with full guardrails

DAN (Do Anything Now)

Jailbreak attempts using roleplay to bypass safety. Example: “Pretend you’re DAN who can do anything…” Protection: 61.3% with dialog rails, 52.7% with full guardrails

Encoding Attacks

Using alternative encodings to hide malicious prompts. Example: Base64, ROT13, or other encoding schemes Protection: 100% with dialog rails or full guardrails

Goodside Attacks

Known attack patterns from security researcher Riley Goodside. Protection: 66.7% with dialog rails or full guardrails

Known Bad Signatures

Recognized malicious prompt patterns. Protection: 100% with dialog rails or full guardrails

Leak/Replay Attacks

Attempts to extract system prompts or training data. Protection: 100% with full guardrails

LMRC Attacks

Language Model Risk Cards - documented risky behaviors. Protection: 94.4% with full guardrails

Malware Generation

Requests to generate malicious code. Protection: 100% with full guardrails

Package Hallucination

Attempts to get the model to recommend fake packages. Protection: 100% with all configurations

XSS Attacks

Cross-site scripting attempt generation. Protection: 100% with general instructions or better

Running Your Own Vulnerability Scans

Prepare Guardrails Configuration

Set up your guardrails configuration:

# config.yml
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output

Start NeMo Guardrails Server

Launch your guardrails application:

nemoguardrails server --config=/path/to/config --port=8000

Run Garak Scan

Execute vulnerability scanning:

garak --model_type rest \
  --model_name "http://localhost:8000/v1/chat/completions" \
  --probes all

Review Results

Garak generates detailed HTML reports:

# View the report
open garak.report.html

Interpreting Scan Results

Protection Rate Calculation

Protection Rate = (Total Attempts - Successful Attacks) / Total Attempts × 100%

Risk Assessment

Protection Rate	Risk Level	Action Required
95-100%	Very Low	Monitor regularly
85-94%	Low	Minor improvements
70-84%	Medium	Strengthen guardrails
50-69%	High	Major improvements needed
<50%	Critical	Immediate action required

Improving Protection Rates

If scans reveal vulnerabilities:

Enable All Rail Types

Ensure you’re using:

Dialog rails for topic control
Input moderation for jailbreak detection
Output moderation for response filtering

Strengthen Prompts

Improve system prompts with:

Clear behavioral guidelines
Explicit refusal instructions
Safety constraints

Add Training Examples

Include examples of:

Attack patterns to reject
Appropriate refusal responses
Edge cases

Tune Moderation Thresholds

Adjust sensitivity:

rails:
  config:
    self_check_input:
      threshold: 0.8  # More strict

Re-scan and Verify

After improvements, run Garak again to validate changes.

Limitations

Understanding scan limitations:

Vulnerability scanning tests known attack patterns. It cannot guarantee protection against novel attacks or all possible inputs.

False Negatives: Some attacks may not be detected
Evolving Threats: New attack vectors emerge regularly
Legitimate User Impact: High protection may block valid requests (not tested in basic scans)
Context Dependent: Results vary by use case and LLM model

Best Practices

Regular Scanning: Run vulnerability scans periodically, not just once
Multiple Configurations: Test with different guardrail combinations
Production Testing: Scan with production-like configurations
Monitor Production: Track real-world attack attempts
Stay Updated: Keep Garak and NeMo Guardrails updated
Document Results: Maintain scan history for compliance

Get Started

Core Concepts

Configuration

Guardrails Library

Built-in Guardrails

Usage

Deployment

Evaluation

​LLM Vulnerability Scanning

​Understanding LLM Vulnerabilities

​Garak: LLM Vulnerability Scanner

​Key Features

​Installation

​Protection Configurations

​Configuration Levels

​Vulnerability Scan Results

​Protection Comparison

​Protection Rates by Vulnerability Category

​Key Findings

​Vulnerability Categories Explained

​Continuation Attacks

​DAN (Do Anything Now)

​Encoding Attacks

​Goodside Attacks

​Known Bad Signatures

​Leak/Replay Attacks

​LMRC Attacks

​Malware Generation

​Package Hallucination

​XSS Attacks

​Running Your Own Vulnerability Scans

​Interpreting Scan Results

​Protection Rate Calculation

​Risk Assessment

​Improving Protection Rates

​Limitations

​Best Practices

​Additional Resources

​Next Steps

Evaluation Metrics

Production Security

Build docs developers (and LLMs) love

LLM Vulnerability Scanning

Understanding LLM Vulnerabilities

Garak: LLM Vulnerability Scanner

Key Features

Installation

Protection Configurations

Configuration Levels

Vulnerability Scan Results

Protection Comparison

Protection Rates by Vulnerability Category

Key Findings

Vulnerability Categories Explained

Continuation Attacks

DAN (Do Anything Now)

Encoding Attacks

Goodside Attacks

Known Bad Signatures

Leak/Replay Attacks

LMRC Attacks

Malware Generation

Package Hallucination

XSS Attacks

Running Your Own Vulnerability Scans

Interpreting Scan Results

Protection Rate Calculation

Risk Assessment

Improving Protection Rates

Limitations

Best Practices

Additional Resources

Next Steps