Honeypot Mode
Anubis includes a sophisticated honeypot system designed to detect automated scanners and malicious bots. The honeypot generates fake pages that lure bots into revealing themselves.
How It Works
The honeypot system works by:
- Generating fake content: Creates plausible-looking pages using spintax (spinner text) technology
- Tracking access patterns: Monitors how frequently specific user agents and networks access honeypot pages
- Incrementing reputation scores: Increases weight for IPs and user agents that repeatedly access honeypot URLs
- Automatically blocking repeat offenders: Once threshold limits are reached, traffic is weighted heavily or blocked
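The counting-and-weighting logic above can be sketched as follows. The type and function names are illustrative stand-ins, not Anubis's actual code, though the 25-hit, 256-hit, and +30 values match the thresholds documented in this page:

```go
package main

import "fmt"

// Thresholds as documented for the honeypot; names are illustrative.
const (
	weightThreshold = 25  // hits before extra weight is applied
	warnThreshold   = 256 // hits before a crawler warning is logged
	weightAdjust    = 30  // weight added to repeat offenders
)

// hitTracker counts honeypot hits per key (a clamped network range
// or a hashed user agent).
type hitTracker struct {
	hits map[string]int
}

func newHitTracker() *hitTracker {
	return &hitTracker{hits: make(map[string]int)}
}

// record registers one honeypot access and returns the extra request
// weight the caller should apply, plus whether to log a crawler warning.
func (t *hitTracker) record(key string) (extraWeight int, warn bool) {
	t.hits[key]++
	n := t.hits[key]
	if n >= weightThreshold {
		extraWeight = weightAdjust
	}
	return extraWeight, n >= warnThreshold
}

func main() {
	t := newHitTracker()
	var w int
	for i := 0; i < 25; i++ {
		w, _ = t.record("198.51.100.0/24")
	}
	fmt.Println(w) // the 25th hit crosses the threshold: prints 30
}
```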
Architecture
The honeypot is automatically initialized when Anubis starts. It creates special endpoints under /api/honeypot/ that catch:
- Automated scanners probing for common paths
- Bots following every link indiscriminately
- Crawlers that don’t respect robots.txt
Automatic Rule Creation
When the honeypot initializes successfully, Anubis automatically adds two policy rules: one for network-based detection and one for user-agent detection.
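These rules are registered programmatically at startup, not via your policy file. For comparison, a hand-written rule with a similar effect might look roughly like this in botPolicies.yaml; the name and the regex matcher here are illustrative, since the built-in honeypot rules consult reputation data directly rather than a static pattern:

```yaml
# Illustrative sketch, not the literal auto-generated rules: a WEIGH
# action that adds weight to a suspicious client, similar in spirit to
# what the honeypot applies to flagged networks and user agents.
bots:
  - name: honeypot-style-weight
    user_agent_regex: "ExampleScanner/"
    action: WEIGH
    weight:
      adjust: 30
```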
Content Generation
The honeypot uses spintax (spinner text) technology to generate unique, plausible-looking content for each request. This is the same technology spammers use, now repurposed for defense. Spintax allows the system to:
- Generate thousands of unique page variations at negligible computational cost
- Create convincing fake blog posts, articles, and affirmations
- Avoid pattern detection by sophisticated bots
Tracking and Metrics
The honeypot tracks:
- IP addresses: Clamped to network ranges for better pattern detection
- User agents: SHA256 hashed for privacy
- Hit counts: How many times each network/UA has accessed honeypot pages
- Access patterns: Timing and frequency of honeypot access
Prometheus Metrics
Honeypot performance is tracked via Prometheus: page-generation latency is recorded in the `anubis_honeypot_pagegen_timings` histogram, labeled with method="naive".
Storage
Honeypot data is stored in your configured Anubis store backend with the following prefixes:
- `honeypot:info` - General honeypot information
- `honeypot:user-agent` - User agent reputation scores
- `honeypot:network` - Network reputation scores
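A sketch of how store keys might be built from these documented prefixes; the exact suffix layout inside Anubis's store is an assumption:

```go
package main

import "fmt"

// Assumed key layout: documented prefix + per-entity suffix.
// uaKey takes an already-hashed user agent; netKey takes a clamped
// network range in CIDR form.
func uaKey(hashedUA string) string { return "honeypot:user-agent:" + hashedUA }
func netKey(network string) string { return "honeypot:network:" + network }

func main() {
	fmt.Println(netKey("203.0.113.0/24")) // honeypot:network:203.0.113.0/24
	fmt.Println(uaKey("9f86d0…"))         // illustrative truncated digest
}
```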
Detection Thresholds
The honeypot uses the following thresholds:

| Threshold | Action | Description |
|---|---|---|
| 25 hits | Add weight | Network or UA that has accessed honeypot 25+ times gets +30 weight |
| 256 hits | Log warning | System logs a warning about possible crawler activity |
Implementation Details
The honeypot is implemented in `/internal/honeypot/naive/` and uses:
- Spintax parsing: Pre-parsed templates for titles, bodies, and affirmations
- UUID generation: Random IDs for honeypot sessions and fake links
- SHA256 hashing: Privacy-preserving storage of user agent strings
- IP clamping: Converts individual IPs to network ranges for better pattern detection
Example Honeypot Flow
- Bot accesses `/api/honeypot/init/start`
- System generates unique content with random links
- Bot follows links to `/api/honeypot/{uuid}/{stage}`
- Each access increments network and UA counters
- After 25 accesses, the bot’s future requests get +30 weight
- At 256 accesses, the system logs a warning about the crawler
Benefits
- Automatic detection: No manual configuration needed
- Low overhead: Spintax generation is computationally cheap
- Privacy-preserving: User agents are hashed, not stored in plaintext
- Self-cleaning: Data expires after 1 hour
- Adaptive: Learns patterns specific to your traffic
Monitoring
To monitor honeypot activity:
- Check logs: Look for “found new entrance point” and “found possible crawler” messages
- Query metrics: Check the `anubis_honeypot_pagegen_timings` histogram
- Inspect store: Query the `honeypot:*` prefixes in your store backend
Advanced Usage
Integrating with Custom Policies
You can create additional rules that work with honeypot data.
Combining with Thoth
When used with Thoth integration, honeypot detection becomes even more powerful:
- Local honeypot detects bot behavior
- Thoth provides ASN and GeoIP context
- Combined data creates sophisticated bot profiles
- Shared intelligence benefits all Anubis deployments
Limitations
- Sophisticated bots: Advanced bots may avoid honeypot links
- False positives: Legitimate crawlers may trigger detection
- Storage requirements: High-traffic sites may generate significant honeypot data
- 1-hour TTL: Data expires quickly, may miss slow crawlers
Best Practices
- Monitor logs: Regularly review honeypot detection logs
- Adjust thresholds: If you get too many false positives, increase the threshold from 25
- Combine with other rules: Use honeypot as one signal among many
- Test your bots: Ensure your legitimate crawlers avoid honeypot URLs
- Use with robots.txt: Enable `SERVE_ROBOTS_TXT` to help legitimate crawlers