Overview
Crawlith implements a token bucket rate limiter to ensure respectful crawling and compliance with server rate limits. Rate limiting is applied per hostname to prevent overwhelming target servers.

Token Bucket Algorithm
The rate limiter uses a token bucket algorithm in which:
- Each hostname gets its own bucket with a configurable token refill rate
- Tokens are consumed for each request
- If no tokens are available, the request waits until tokens refill
- Tokens refill continuously at the configured rate
Implementation
From rateLimiter.ts:1:
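The actual source is not reproduced here. As a rough sketch of the approach described above (class and helper names are illustrative, not Crawlith's API):

```typescript
// Minimal per-hostname token bucket sketch (illustrative, not Crawlith's actual code).
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private ratePerSec: number) {
    this.tokens = ratePerSec; // start full; capacity equals the configured rate
    this.lastRefill = Date.now();
  }

  /** Refill tokens based on elapsed time, capped at the configured rate. */
  private refill(now: number): void {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.ratePerSec, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
  }

  /** Resolves once a token is available, consuming it. */
  async acquire(): Promise<void> {
    for (;;) {
      const now = Date.now();
      this.refill(now);
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      // Sleep roughly until the next token accrues (sub-second precision).
      const waitMs = ((1 - this.tokens) / this.ratePerSec) * 1000;
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}

// One bucket per hostname, created lazily on first request.
const buckets = new Map<string, TokenBucket>();
function bucketFor(hostname: string, rate = 2): TokenBucket {
  let bucket = buckets.get(hostname);
  if (!bucket) {
    bucket = new TokenBucket(rate);
    buckets.set(hostname, bucket);
  }
  return bucket;
}
```

Each caller awaits `acquire()` on its hostname's bucket before sending a request, which is what makes the limiting per-host rather than global.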
Configuration
Request Rate
- Default: 2 requests per second per host
- Range: Typically 0.5 to 10 requests/second
- Per-host: Each hostname has independent rate limiting
Examples
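The configuration values above map directly onto the --rate flag. The command name and URLs below are illustrative placeholders:

```shell
# Conservative crawl of an unknown site: one request every 2 seconds per host
crawlith https://example.com --rate 0.5

# Default: 2 requests per second per host
crawlith https://example.com --rate 2

# Faster crawl of your own infrastructure
crawlith https://example.com --rate 10
```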
Robots.txt Crawl-Delay
Crawlith automatically respects Crawl-delay directives from robots.txt. The effective rate is the lower of your configured rate and the rate implied by the robots.txt Crawl-delay.
Crawl-Delay Priority
From fetcher.ts:107 and crawler.ts:361:
For example, if you run with --rate 10 (0.1 s interval) but robots.txt specifies Crawl-delay: 5, Crawlith will use 5 seconds between requests.
Formula
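The rule described above reduces to one expression: the effective interval between requests is whichever is longer, the interval implied by --rate or the robots.txt Crawl-delay (a sketch; the variable names are mine):

```typescript
// Effective delay between requests to one host, in seconds.
// configuredRate: the --rate value (req/s); crawlDelaySec: robots.txt Crawl-delay (0 if absent).
function effectiveIntervalSec(configuredRate: number, crawlDelaySec: number): number {
  return Math.max(1 / configuredRate, crawlDelaySec);
}

// --rate 10 (0.1 s interval) with Crawl-delay: 5 yields 5 s between requests.
console.log(effectiveIntervalSec(10, 5)); // 5
```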
Per-Hostname Isolation
Each hostname maintains its own token bucket, allowing parallel crawling across different hosts:
- example.com: 2 req/s
- www.example.com: 2 req/s (separate bucket)
- cdn.example.com: 2 req/s (separate bucket)
Token Refill Mechanism
From rateLimiter.ts:20:
- Tokens refill continuously based on elapsed time
- Maximum tokens capped at configured rate
- Requests block until a token is available
- Sub-second precision for accurate rate control
Rate Limiting + Concurrency
Rate limiting and concurrency work together:
- Concurrency: Maximum 10 simultaneous HTTP requests
- Rate: Each hostname limited to 5 req/s
- If crawling a single host: Effective throughput is 5 req/s (rate limit)
- If crawling 3 hosts: Effective throughput is up to 15 req/s (5×3), subject to the concurrency cap
Concurrency controls parallelism; rate limiting controls request frequency per host. Use both together for optimal crawling.
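As a rough sanity check, the combined effect of both limits can be estimated with a back-of-the-envelope formula. This is my illustration, not a Crawlith calculation, and it assumes an average response time, which the page does not specify:

```typescript
// Estimated requests/second: per-host rate summed over hosts,
// capped by how many requests the concurrency pool can complete per second.
function estimatedThroughput(
  hosts: number,
  ratePerHost: number,    // req/s per hostname
  concurrency: number,    // maximum simultaneous requests
  avgResponseSec: number, // assumed average response time
): number {
  return Math.min(hosts * ratePerHost, concurrency / avgResponseSec);
}

console.log(estimatedThroughput(1, 5, 10, 0.5)); // 5  (rate limit is the bottleneck)
console.log(estimatedThroughput(3, 5, 10, 0.5)); // 15 (still under the 20 req/s pool capacity)
```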
Practical Examples
Conservative Crawling (Public Sites)
Balanced Crawling (Default)
Fast Crawling (Your Own Infrastructure)
Micro-Crawling (High-Security Targets)
Monitoring Rate Limits
Enable verbose logging to observe rate limiting behavior.

Best Practices
What rate should I use?
- Unknown sites: Start with --rate 1 or --rate 0.5
- Standard audits: Use the default --rate 2
- Your infrastructure: --rate 5 to --rate 10 is safe
- Large sites: Lower rate with a higher --limit for respectful long-running crawls
How does rate limiting interact with retries?
Retries (see retryPolicy.ts:1) are subject to the same rate limit. If a request fails with 429 (Too Many Requests) or a 5xx error, the retry still waits for an available token. The retry backoff (exponential with jitter) adds delay on top of rate limiting.
Can I disable rate limiting?
No. Rate limiting is always active to ensure safe operation. The maximum rate is --rate 10 (100 ms intervals). If you need faster crawling, increase --concurrency to crawl multiple pages in parallel.
Why is my crawl slower than the configured rate?
Several factors can reduce effective throughput:
- Network latency: Slow servers add RTT to each request
- Robots.txt crawl-delay: Overrides your configured rate if it implies a longer interval
- Response time: Large pages take longer to download
- Retry backoff: Failed requests add exponential delays
- Concurrency limit: A low --concurrency limits parallelism
Error Handling
Rate limiting interacts with HTTP errors:
- 429 Too Many Requests: Triggers retry with exponential backoff (see retryPolicy.ts:40)
- 5xx Server Errors: Triggers retry, still respecting rate limit
- Network Errors: Triggers retry with rate limiting applied
retryPolicy.ts:40:
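The page describes the backoff as exponential with jitter. One common form of that computation is shown below as a sketch; the base, cap, and jitter style are my assumptions, not necessarily retryPolicy.ts's exact constants:

```typescript
// Exponential backoff with full jitter: the delay ceiling doubles per attempt,
// and the actual delay is drawn uniformly from [0, ceiling) to desynchronize retries.
function backoffMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```

Because this delay is added after the rate-limit token wait, a retried request never goes out faster than the configured per-host rate.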
Related Topics
- Configuration - Crawling options and settings
- Troubleshooting - Debugging slow crawls
- Security - SSRF protection and safety features
Technical Details
Source Files
- plugins/core/src/core/network/rateLimiter.ts - Token bucket implementation
- plugins/core/src/crawler/fetcher.ts - Rate limiter integration
- plugins/core/src/crawler/crawler.ts - Robots.txt crawl-delay handling