Chrome Plugin Architecture
Many ArchiveBox plugins depend on Chrome/Chromium via the Chrome DevTools Protocol (CDP). The chrome plugin provides the core infrastructure that other plugins build upon.
Core Chrome Plugin
The foundation for all Chrome-based operations:
chrome
Core Chrome/Chromium integration and browser lifecycle management.
- Plugin: chrome
- Purpose: Launch and manage Chrome/Chromium sessions
- Features:
  - Browser process management
  - CDP connection handling
  - Session persistence
  - Profile management
  - Extension loading
- Config:
  - CHROME_ENABLED (default: true)
  - CHROME_BINARY (default: "chromium")
  - CHROME_HEADLESS (default: true)
  - CHROME_SANDBOX (default: true)
  - CHROME_TIMEOUT (default: 60)
  - CHROME_RESOLUTION (default: "1440,2000")
  - CHROME_USER_DATA_DIR - Persistent profile directory
  - CHROME_ARGS - Chrome command-line flags
  - CHROME_PAGELOAD_TIMEOUT (default: 60)
  - CHROME_WAIT_FOR (default: "networkidle2") - Page load condition
  - CHROME_DELAY_AFTER_LOAD (default: 0) - Extra delay for SPAs
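To illustrate how these options fit together, the sketch below sets a few of them as environment variables before a run. The values are examples rather than recommendations, and the True/False spelling for booleans is an assumption taken from the export commands in the Troubleshooting section.

```shell
# Example values only; the option names are the ones listed above.
export CHROME_BINARY="chromium"          # binary name or absolute path
export CHROME_HEADLESS=True              # run without a visible window
export CHROME_RESOLUTION="1440,2000"     # viewport as width,height
export CHROME_PAGELOAD_TIMEOUT=60        # same as the documented default
export CHROME_WAIT_FOR="networkidle2"    # page load condition
```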
Metadata Extraction Plugins
These plugins use CDP to extract metadata without modifying page content:
dns
Resolve and record DNS information.
- Output: dns/dns.json
- Config: DNS_ENABLED (default: true)
- Extracts: IP addresses, DNS records
- Use case: Track IP changes, verify domain ownership
ssl
Capture SSL/TLS certificate information.
- Output: ssl/ssl.json
- Config: SSL_ENABLED (default: true)
- Extracts: Certificate details, expiry, issuer
- Use case: Security auditing, certificate tracking
headers
Capture HTTP response headers.
- Output: headers/headers.json
- Config: HEADERS_ENABLED (default: true)
- Extracts: All HTTP response headers
- Use case: Debug caching, security headers, redirects
redirects
Track redirect chains.
- Output: redirects/redirects.json
- Config: REDIRECTS_ENABLED (default: true)
- Extracts: All redirects from initial URL to final destination
- Use case: Track URL changes, detect redirect loops
responses
Capture network responses.
- Output: responses/responses.json
- Config: RESPONSES_ENABLED (default: true)
- Extracts: All network requests and responses
- Use case: Debug API calls, track resource loading
consolelog
Capture browser console output.
- Output: consolelog/consolelog.txt
- Config: CONSOLELOG_ENABLED (default: true)
- Extracts: console.log, console.error, console.warn output
- Use case: Debug JavaScript errors, track analytics calls
title
Extract page title from the DOM.
- Output: title/title.txt
- Config: TITLE_ENABLED (default: true)
- Extracts: Page <title> tag
- Use case: Metadata for search, display in UI
favicon
Download site favicon.
- Output: favicon/favicon.ico
- Config: FAVICON_ENABLED (default: true)
- Extracts: Site icon (checks multiple locations)
- Use case: Visual identification in UI
accessibility
Extract accessibility tree.
- Output: accessibility/accessibility.json
- Config: ACCESSIBILITY_ENABLED (default: true)
- Extracts: Complete accessibility tree structure
- Use case: Accessibility auditing, screen reader testing
seo
Extract SEO metadata.
- Output: seo/seo.json
- Config: SEO_ENABLED (default: true)
- Extracts: meta tags, Open Graph, Twitter Cards
- Use case: Track SEO metadata, social sharing info
hashes
Calculate content hashes.
- Output: hashes/hashes.json
- Config: HASHES_ENABLED (default: true)
- Extracts: SHA256 hashes of page content
- Use case: Detect content changes, deduplication
Content Extraction Plugins
These plugins use CDP to extract and save page content:
dom
Capture complete DOM snapshot.
- Output: dom/dom.html
- Config: DOM_ENABLED (default: true)
- Dependencies: chrome
- Use case: Preserve JavaScript-rendered content
screenshot
Capture full-page screenshot.
- Output: screenshot/screenshot.png
- Config: SCREENSHOT_ENABLED (default: true)
- Dependencies: chrome
- Use case: Visual snapshots
pdf
Save the page as a PDF.
- Output: pdf/pdf.pdf
- Config: PDF_ENABLED (default: true)
- Dependencies: chrome
- Use case: Print-friendly archives
singlefile
Save complete HTML with inlined resources.
- Output: singlefile/singlefile.html
- Config: SINGLEFILE_ENABLED (default: true)
- Dependencies: chrome, npm
- Can run as: CLI tool or Chrome extension
- Use case: Self-contained HTML files
staticfile
Download direct file links.
- Output: staticfile/{filename}
- Config: STATICFILE_ENABLED (default: true)
- Dependencies: chrome (uses CDP download handling)
- Use case: Archive PDFs, images, executables
Browser Extension Plugins
These plugins install and control Chrome extensions:
ublock
uBlock Origin ad blocker.
- Extension: uBlock Origin
- Config: UBLOCK_ENABLED (default: false)
- Purpose: Block ads and trackers during archiving
- Use case: Cleaner archives, faster loading, privacy
Extensions persist across sessions when using CHROME_USER_DATA_DIR.
istilldontcareaboutcookies
Automatically dismiss cookie consent banners.
- Extension: I Still Don’t Care About Cookies
- Config: ISTILLDONTCAREABOUTCOOKIES_ENABLED (default: false)
- Purpose: Remove cookie banners automatically
- Use case: Cleaner screenshots, better DOM snapshots
twocaptcha
Automatic CAPTCHA solving.
- Extension: 2Captcha Solver
- Config:
  - TWOCAPTCHA_ENABLED (default: false)
  - TWOCAPTCHA_API_KEY - Your 2Captcha API key
- Purpose: Solve CAPTCHAs automatically
- Use case: Archive CAPTCHA-protected sites
- Cost: Requires paid 2Captcha account
Page Manipulation Plugins
These plugins modify page behavior before archiving:
modalcloser
Automatically close modal dialogs.
- Config: MODALCLOSER_ENABLED (default: false)
- Purpose: Dismiss popups, overlays, login prompts
- Use case: Better screenshots, cleaner DOM
- Method: Injects JavaScript to close common modal patterns
infiniscroll
Handle infinite scroll pages.
- Config: INFINISCROLL_ENABLED (default: false)
- Purpose: Scroll page to load all content
- Use case: Archive infinite-scroll sites (Twitter, Facebook, etc.)
- Method: Automatically scrolls until no new content loads
Chrome Session Management
Session Lifecycle
- Crawl Start: Chrome plugin launches browser with an on_Crawl__* hook
- Session Creation: Browser starts, CDP endpoint exposed
- Plugin Connections: Other plugins connect to existing session
- Snapshot Processing: Multiple plugins share the same session
- Crawl End: Browser closes when all snapshots complete
Session Persistence
Use persistent profiles for:
- Login sessions: Stay logged in across runs
- Cookies: Preserve authentication
- Extensions: Keep extension state
- Settings: Maintain browser preferences
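A minimal sketch of enabling a persistent profile through CHROME_USER_DATA_DIR. The directory path and the profile_dir variable are hypothetical; any writable directory that survives between runs should serve.

```shell
# Hypothetical profile location; choose any directory you control.
profile_dir="${HOME}/archivebox-chrome-profile"

# Create the directory if it does not exist yet, then point the chrome plugin at it.
mkdir -p "$profile_dir"
export CHROME_USER_DATA_DIR="$profile_dir"
```

Because cookies and login state are stored in this directory, it is worth keeping its permissions restricted to the user that runs ArchiveBox.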
Multiple Snapshots, One Session
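Chrome plugins are designed to reuse sessions: the plugins that process each snapshot connect to the browser already launched for the crawl instead of starting their own.
Chrome Configuration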
Headless vs Headful
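Headless mode (the default) runs Chrome without a visible window. Headful mode opens a real window, which can help when debugging or when logging in to a site by hand. A sketch of switching between the two with CHROME_HEADLESS, assuming booleans are written True/False as in the Troubleshooting commands:

```shell
# Headless (default): no visible browser window, suited to servers and Docker.
export CHROME_HEADLESS=True

# Headful: opens a visible window; needs a display (or a virtual one such as Xvfb).
export CHROME_HEADLESS=False
```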
Sandbox
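Chrome's sandbox isolates browser processes from the host and is enabled by default (CHROME_SANDBOX). In containers Chrome often cannot set the sandbox up, which is why the Troubleshooting section suggests disabling it in Docker. A sketch of both settings, again assuming the True/False spelling:

```shell
# Default: keep the sandbox on wherever Chrome starts successfully.
export CHROME_SANDBOX=True

# Docker and similar restricted environments: turn it off only if Chrome fails to launch.
export CHROME_SANDBOX=False
```

Disabling the sandbox removes a layer of protection, so it is best limited to environments that are already isolated.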
Viewport Size
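CHROME_RESOLUTION holds the viewport as a comma-separated pair (default "1440,2000"). The sketch below overrides it and splits the value to show its structure. The 1280,720 value is only an example, and reading the pair as width,height is an assumption based on the default.

```shell
export CHROME_RESOLUTION="1280,720"   # example value, assumed to be width,height

# Split the pair on the comma to inspect each dimension.
width="${CHROME_RESOLUTION%%,*}"
height="${CHROME_RESOLUTION##*,}"
echo "viewport: ${width}x${height}"
```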
Page Load Conditions
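Three options control when a page counts as loaded: CHROME_WAIT_FOR selects the load condition, CHROME_PAGELOAD_TIMEOUT limits how long a load may take, and CHROME_DELAY_AFTER_LOAD adds an extra delay for SPAs. A sketch for a slow, JavaScript-heavy page, reusing the values suggested under Troubleshooting:

```shell
# Slow, JavaScript-heavy page: keep the default load condition, allow more time,
# and pause after load so late-rendering content can settle.
export CHROME_WAIT_FOR="networkidle2"   # documented default
export CHROME_PAGELOAD_TIMEOUT=120      # raised from the default of 60
export CHROME_DELAY_AFTER_LOAD=3        # raised from the default of 0
```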
Custom Chrome Arguments
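Extra command-line flags can be passed to Chrome through CHROME_ARGS; the Troubleshooting section also refers to CHROME_ARGS_EXTRA. The exact value format is not stated here, so the sketch below assumes that a single flag given as a plain string is accepted. Check the plugin's config reference before relying on it.

```shell
# Assumption: a single flag passed as a plain string is accepted.
# --disable-dev-shm-usage is a real Chrome flag, commonly used in Docker
# where /dev/shm is small.
export CHROME_ARGS_EXTRA="--disable-dev-shm-usage"
```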
Troubleshooting
Chrome Won’t Start
Problem: Chrome plugin fails to launch browser
Solutions:
- Verify Chrome/Chromium is installed: chromium --version
- Check binary path: export CHROME_BINARY=/usr/bin/google-chrome
- Disable sandbox in Docker: export CHROME_SANDBOX=False
- Kill zombie processes: pkill -9 chrome
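When it is unclear which binary name is installed, a small helper can search PATH and set CHROME_BINARY accordingly. This is a sketch: find_chrome_binary is a hypothetical helper, and the candidate names are common package names, not a list taken from ArchiveBox.

```shell
# Print the first Chrome/Chromium executable found on PATH; fail if none is found.
find_chrome_binary() {
    for name in chromium chromium-browser google-chrome google-chrome-stable; do
        if command -v "$name" >/dev/null 2>&1; then
            command -v "$name"
            return 0
        fi
    done
    return 1
}

if chrome_path="$(find_chrome_binary)"; then
    export CHROME_BINARY="$chrome_path"
    echo "Using $CHROME_BINARY"
else
    echo "No Chrome/Chromium binary found on PATH" >&2
fi
```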
CDP Connection Failed
Problem: Plugins can’t connect to Chrome session
Solutions:
- Ensure chrome plugin is enabled: export CHROME_ENABLED=True
- Check if Chrome is running: ps aux | grep chrome
- Verify CDP port is accessible (usually 9222)
- Review chrome plugin logs
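One way to check the CDP port is to query Chrome's HTTP endpoint: a browser started with remote debugging answers GET /json/version with its version details and WebSocket debugger URL. The sketch below assumes curl is installed and that the session listens on 127.0.0.1:9222. CDP_HOST, CDP_PORT and cdp_is_reachable are hypothetical names used only for this example.

```shell
# Assumed endpoint; adjust host and port if your session listens elsewhere.
CDP_HOST="${CDP_HOST:-127.0.0.1}"
CDP_PORT="${CDP_PORT:-9222}"

# Succeeds if Chrome answers on the CDP HTTP endpoint within 3 seconds.
cdp_is_reachable() {
    curl --silent --fail --max-time 3 "http://${CDP_HOST}:${CDP_PORT}/json/version" >/dev/null
}

if cdp_is_reachable; then
    echo "CDP endpoint reachable on ${CDP_HOST}:${CDP_PORT}"
else
    echo "No CDP endpoint on ${CDP_HOST}:${CDP_PORT}" >&2
fi
```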
Chrome Crashes
Problem: Browser crashes during archiving
Solutions:
- Increase memory limits
- Reduce concurrent snapshots
- Disable problematic extensions
- Check consolelog output for errors
- Add --disable-dev-shm-usage to CHROME_ARGS_EXTRA (Docker)
Pages Not Loading
Problem: Blank screenshots, empty DOM
Solutions:
- Increase timeout: export CHROME_PAGELOAD_TIMEOUT=120
- Change wait condition: export CHROME_WAIT_FOR="load"
- Add delay: export CHROME_DELAY_AFTER_LOAD=3
- Disable extensions temporarily
- Check site doesn’t block automation (some sites detect headless Chrome)
Extensions Not Working
Problem: uBlock or other extensions don’t load
Solutions:
- Verify extension is enabled in config
- Check extension files are present in plugin directory
- Use persistent profile: export CHROME_USER_DATA_DIR=/path/to/profile
- Enable extension manually in headful mode first
Performance Optimization
Reuse Sessions
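Best practice: Archive multiple URLs in one command so that they share a single Chrome session.
Disable Unused Plugins
Only enable the Chrome plugins you need. Each plugin has its own *_ENABLED option (for example, SCREENSHOT_ENABLED or PDF_ENABLED) that can be turned off.
Use Lighter Wait Conditions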
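For mostly static pages, waiting for the network to go idle can add time without capturing anything extra. A sketch of a lighter setup, using the "load" value that the Troubleshooting section also suggests. That "load" returns sooner than the default "networkidle2" is an assumption based on the usual meaning of these names in browser automation tools.

```shell
# Faster runs on mostly static pages: use the lighter load condition.
export CHROME_WAIT_FOR="load"
export CHROME_DELAY_AFTER_LOAD=0   # documented default: no extra wait
```

If screenshots or DOM snapshots start coming back incomplete, switching back to the default condition is the first thing to try.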
Chrome Plugin Development
See Custom Plugins for guidance on creating Chrome-based plugins. Key requirements:
- Use chrome_utils.js for all Chrome operations
- Test both connection modes (existing session + new session)
- Handle CDP disconnections gracefully
- Don’t depend on ArchiveBox core (plugins are isolated)
Related Topics
- Extractor Plugins: More details on Chrome-based extractors
- Custom Plugins: Create your own Chrome-based plugins