Extractors Python API
Extractors (also called plugins) are the archiving methods that capture different aspects of web pages. ArchiveBox has a flexible hook-based plugin system that lets you use extractors programmatically or create custom ones.
Hook System Overview
Extractors are implemented as standalone scripts (Python, JavaScript, or shell) that run as separate processes. This keeps the plugin system simple and language-agnostic.
Hook Discovery
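As a rough illustration of how discovery could work, the sketch below scans a plugins directory for hook scripts belonging to one event and sorts them by filename. The function name `discover_hooks` and the layout `<plugins_dir>/<plugin_name>/on_<Event>__<NN>_<name>.<ext>` are assumptions made for this example, not ArchiveBox's confirmed API.

```python
from pathlib import Path


def discover_hooks(plugins_dir: Path, event: str) -> list[Path]:
    """Return hook scripts for one event, sorted by filename.

    Assumes one subdirectory per plugin, each holding files named
    on_<Event>__<NN>_<name>.<ext> (hypothetical layout).
    """
    hooks = [p for p in plugins_dir.glob(f"*/on_{event}__*") if p.is_file()]
    # The two-digit order comes right after the event name, so a plain
    # filename sort interleaves hooks from different plugins by step.
    return sorted(hooks, key=lambda p: p.name)
```

Sorting on the filename rather than the full path keeps the plugin directory name from influencing execution order.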
Hook Naming Convention
Hooks follow a strict naming pattern made up of these parts:
- ModelName: Event trigger (Snapshot, Crawl, Binary, etc.)
- order: Two-digit number (00-99) controlling execution order
- description: Descriptive name
- .bg (optional): Background hook (doesn’t block progress)
- ext: File extension (py, js, sh)
Examples:
- on_Snapshot__50_wget.py - Foreground hook, runs at step 50
- on_Snapshot__10_chrome_tab.bg.js - Background hook, starts at step 10
- on_Snapshot__63_media.bg.py - Background hook for media downloads
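To make the convention concrete, here is a small sketch that splits a hook filename into the parts listed above. The regular expression is inferred from the three example filenames, and `HookName` and `parse_hook_name` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional


class HookName(NamedTuple):
    model: str
    order: int
    description: str
    background: bool
    ext: str


# Pattern inferred from the example filenames; not taken from ArchiveBox itself.
HOOK_RE = re.compile(
    r"^on_(?P<model>[A-Za-z]+)__(?P<order>\d{2})_(?P<description>\w+?)"
    r"(?P<bg>\.bg)?\.(?P<ext>py|js|sh)$"
)


def parse_hook_name(filename: str) -> Optional[HookName]:
    """Parse a hook filename, or return None if it does not match."""
    m = HOOK_RE.match(filename)
    if m is None:
        return None
    return HookName(
        model=m["model"],
        order=int(m["order"]),
        description=m["description"],
        background=m["bg"] is not None,
        ext=m["ext"],
    )
```

For example, `parse_hook_name("on_Snapshot__10_chrome_tab.bg.js")` yields the model `Snapshot`, order `10`, description `chrome_tab`, `background=True`, and extension `js`.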
Using Extractors Programmatically
Running a Single Hook
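Because hooks are standalone scripts that write JSON to stdout, running one can be approximated with the standard library alone. The sketch below is not ArchiveBox's own API: `run_hook`, the `INTERPRETERS` table, the command-line arguments, and the `status`/`error` keys are all assumptions for illustration.

```python
import json
import subprocess
import sys
from pathlib import Path

# Hypothetical mapping from file extension to interpreter command.
INTERPRETERS = {".py": [sys.executable], ".js": ["node"], ".sh": ["sh"]}


def run_hook(script: Path, args: list[str], timeout: float = 60.0) -> dict:
    """Run one hook script as a child process and parse JSON from its stdout."""
    cmd = INTERPRETERS[script.suffix] + [str(script)] + args
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # Assumption: the result is the last non-empty line printed to stdout.
    lines = [ln for ln in proc.stdout.splitlines() if ln.strip()]
    if not lines:
        return {"status": "failed", "error": proc.stderr.strip() or "no output"}
    try:
        return json.loads(lines[-1])
    except json.JSONDecodeError as exc:
        return {"status": "failed", "error": f"invalid JSON: {exc}"}
```

Treating missing or malformed output as a failure keeps a misbehaving script from crashing the caller.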
Running All Hooks for an Event
Filtering Hooks by Plugin
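One simple way to restrict a run to certain plugins, assuming each plugin lives in its own directory and hooks sit directly inside it, is to filter hook paths by their parent directory name. `filter_hooks_by_plugin` is a hypothetical helper, not a function provided by ArchiveBox.

```python
from pathlib import Path
from typing import Iterable


def filter_hooks_by_plugin(hooks: Iterable[Path], plugins: set[str]) -> list[Path]:
    """Keep only hooks whose parent directory name is in `plugins`.

    Assumes the layout <plugins_dir>/<plugin_name>/<hook_file>.
    """
    return [hook for hook in hooks if hook.parent.name in plugins]
```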
Built-in Extractors
ArchiveBox includes many built-in extractors in archivebox/plugins/:
Content Extractors
- wget - Download with wget (WARC, HTML, assets)
- singlefile - Single-file HTML snapshot
- dom - DOM snapshot via Chrome DevTools Protocol
- readability - Article text extraction
- htmltotext - Convert HTML to plain text
Media Extractors
- screenshot - Page screenshot
- pdf - PDF snapshot
- media - Download videos/audio with yt-dlp
- favicon - Download favicon
Metadata Extractors
- title - Extract page title
- headers - HTTP response headers
- dns - DNS resolution info
- ssl - SSL certificate info
- accessibility - Accessibility tree
- seo - SEO metadata
Code & Data Extractors
- git - Clone git repositories
- archive_org - Save to Internet Archive
Creating Custom Extractors
Directory Structure
Create a new plugin directory:
Plugin Configuration
Create config.json:
Python Hook Example
Create on_Snapshot__60_myplugin.py:
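A minimal sketch of what such a hook might look like. The way the URL is passed (first command-line argument), the `status`, `output`, and `error` keys, and the output filename are assumptions for illustration, not ArchiveBox's confirmed hook interface.

```python
#!/usr/bin/env python3
import json
import sys


def extract(url: str) -> dict:
    """Placeholder extraction step; replace the body with real work."""
    if not url.startswith(("http://", "https://")):
        return {"status": "failed", "error": f"unsupported URL: {url}"}
    # Hypothetical output filename; a real plugin would write this file first.
    return {"status": "succeeded", "output": "myplugin.txt"}


def main(argv: list[str]) -> dict:
    """Run the extraction and print exactly one JSON result to stdout."""
    if len(argv) < 2:
        result = {"status": "failed", "error": "missing URL argument"}
    else:
        result = extract(argv[1])
    print(json.dumps(result))
    return result


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the work in `extract` and the I/O in `main` makes the hook easy to test without spawning a process.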
JavaScript Hook Example
Create on_Snapshot__60_myplugin.js:
Background Hooks
For long-running extractors, use background hooks with .bg. in the filename:
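The sketch below shows one way a runner could tell background hooks apart and avoid blocking on them. It handles Python hooks only, and `is_background` and `start_hook` are hypothetical helpers rather than part of ArchiveBox.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def is_background(script: Path) -> bool:
    """True when '.bg' directly precedes the file extension."""
    return script.suffixes[-2:-1] == [".bg"]


def start_hook(script: Path) -> Optional[subprocess.Popen]:
    """Start a Python hook; wait for it unless it is a background hook.

    Returns the running process for background hooks, so the caller can
    wait on or terminate it later, and None for foreground hooks.
    """
    proc = subprocess.Popen([sys.executable, str(script)])
    if is_background(script):
        return proc
    proc.wait()
    return None
```

A caller would typically collect the returned processes and wait on or terminate them once the foreground hooks have finished.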
Hook Dependencies
Extractors can depend on other plugins’ output:
Plugin Configuration Access
Access plugin configuration in your hooks:
Testing Plugins
Test your plugin manually:
Hook Result Format
All hooks must output JSON to stdout with a fixed structure. The result status is one of:
- succeeded - Extraction succeeded
- failed - Extraction failed permanently
- skipped - Extraction skipped (will retry if dependencies resolve)
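The three status values can be enforced with a small helper. The sketch below also shows a hook reporting skipped when another plugin's output is not available yet. The `status` and `reason` keys and both function names are assumptions, since the exact JSON structure is not shown here.

```python
import json
from pathlib import Path

VALID_STATUSES = {"succeeded", "failed", "skipped"}


def make_result(status: str, **extra) -> str:
    """Serialize a hook result, rejecting any status outside the three above."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return json.dumps({"status": status, **extra})


def result_for_dependency(dep_output: Path) -> str:
    """Report 'skipped' when another plugin's output file is missing."""
    if not dep_output.exists():
        return make_result("skipped", reason=f"missing {dep_output.name}")
    return make_result("succeeded")
```

Returning skipped rather than failed for a missing dependency matches the retry behavior described above.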
Advanced Topics
Using Chrome via CDP
Many extractors use Chrome DevTools Protocol:
Binary Dependencies
Declare binary dependencies in config.json:
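The config.json schema is not reproduced here, so the sketch below covers only the runtime side: checking that binaries a plugin needs can be found on PATH. `missing_binaries` is a hypothetical helper and the binary names in the usage note are examples.

```python
import shutil


def missing_binaries(names: list[str]) -> list[str]:
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

A hook could call `missing_binaries(["wget", "yt-dlp"])` at startup and exit early if the list is non-empty.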
See Also
- Python API Overview - Basic Python API usage
- Models Reference - Django models documentation
- Config API - Configuration management
- Plugin Source - View built-in plugins on GitHub