Extractors Python API

Extractors (also called plugins) are the archiving methods that capture different aspects of web pages. ArchiveBox has a flexible hook-based plugin system that allows you to use extractors programmatically or create custom ones.

Hook System Overview

Extractors are implemented as standalone scripts (Python, JavaScript, or shell) that run as separate processes. This keeps the plugin system simple and language-agnostic.
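To illustrate the mechanism, here is a simplified sketch (not ArchiveBox's actual `run_hook` implementation) of how a hook script can be invoked as a separate process and its JSON result collected from stdout. The stand-in hook and `invoke_hook` helper are hypothetical:

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# A minimal stand-in hook (real hooks live under archivebox/plugins/)
HOOK_SOURCE = '''\
import json, sys
url = sys.argv[sys.argv.index("--url") + 1]
print(json.dumps({"status": "succeeded", "output": "archived " + url}))
'''

def invoke_hook(script: Path, url: str) -> dict:
    """Run a hook as a separate process and parse its JSON stdout."""
    proc = subprocess.run(
        [sys.executable, str(script), '--url', url],
        capture_output=True, text=True, timeout=60,
    )
    return {'returncode': proc.returncode, 'output_json': json.loads(proc.stdout)}

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / 'on_Snapshot__99_demo.py'
    script.write_text(HOOK_SOURCE)
    result = invoke_hook(script, 'https://example.com')

print(result['output_json'])  # {'status': 'succeeded', 'output': 'archived https://example.com'}
```

Because the contract is just "arguments in, JSON on stdout", the same mechanism works for hooks written in JavaScript or shell.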

Hook Discovery

from archivebox.hooks import discover_hooks

# Find all Snapshot hooks
hooks = discover_hooks('Snapshot', filter_disabled=True)

for hook in hooks:
    plugin_name = hook.parent.name
    hook_filename = hook.name
    print(f"{plugin_name}: {hook_filename}")
Output:
wget: on_Snapshot__50_wget.py
screenshot: on_Snapshot__51_screenshot.js
dom: on_Snapshot__53_dom.js
pdf: on_Snapshot__54_pdf.js
singlefile: on_Snapshot__50_singlefile.py
...

Hook Naming Convention

Hooks follow a strict naming pattern:
on_{ModelName}__{order}_{description}[.bg].{ext}
  • ModelName: Event trigger (Snapshot, Crawl, Binary, etc.)
  • order: Two-digit number (00-99) controlling execution order
  • description: Descriptive name
  • .bg (optional): Background hook (doesn’t block progress)
  • ext: File extension (py, js, sh)
Examples:
  • on_Snapshot__50_wget.py - Foreground hook, runs at step 50
  • on_Snapshot__10_chrome_tab.bg.js - Background hook, starts at step 10
  • on_Snapshot__63_media.bg.py - Background hook for media downloads
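The naming convention above can be parsed mechanically. As a sketch, here is a regex matching the documented pattern (the regex and the keys in the returned dict are illustrative, not part of the ArchiveBox API):

```python
import re

# Regex mirroring the documented pattern: on_{ModelName}__{order}_{description}[.bg].{ext}
HOOK_RE = re.compile(
    r'^on_(?P<model>[A-Za-z]+)__(?P<order>\d{2})_(?P<description>\w+)'
    r'(?P<bg>\.bg)?\.(?P<ext>py|js|sh)$'
)

def parse_hook_name(filename: str):
    """Split a hook filename into its documented components, or None if malformed."""
    m = HOOK_RE.match(filename)
    if not m:
        return None
    return {
        'model': m.group('model'),
        'order': int(m.group('order')),
        'description': m.group('description'),
        'background': m.group('bg') is not None,
        'ext': m.group('ext'),
    }

print(parse_hook_name('on_Snapshot__63_media.bg.py'))
# {'model': 'Snapshot', 'order': 63, 'description': 'media', 'background': True, 'ext': 'py'}
```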

Using Extractors Programmatically

Running a Single Hook

from pathlib import Path
from archivebox.hooks import run_hook
from archivebox.core.models import Snapshot

# Get a snapshot
snapshot = Snapshot.objects.first()

# Run a specific hook
result = run_hook(
    script=Path('/path/to/hook.py'),
    snapshot=snapshot,
    url=snapshot.url,
    timeout=60
)

print(f"Exit code: {result['returncode']}")
print(f"Duration: {result['duration_ms']}ms")
print(f"Output: {result['stdout']}")

if result['output_json']:
    print(f"Status: {result['output_json']['status']}")
    print(f"Files: {result['output_json'].get('output_files', [])}")

Running All Hooks for an Event

from archivebox.hooks import run_hooks
from archivebox.core.models import Snapshot

snapshot = Snapshot.objects.first()

# Run all enabled Snapshot hooks
results = run_hooks(
    event_name='Snapshot',
    snapshot=snapshot,
    url=snapshot.url
)

# Process results
for result in results:
    plugin = result['plugin']
    status = (result['output_json'] or {}).get('status', 'unknown')
    print(f"{plugin}: {status}")
    
    if result['returncode'] != 0:
        print(f"  Error: {result['stderr']}")

Filtering Hooks by Plugin

from archivebox.hooks import discover_hooks
from archivebox.config.configset import get_config
from archivebox.core.models import Snapshot

snapshot = Snapshot.objects.first()

# Get config with specific overrides
config = get_config(
    snapshot=snapshot,
    overrides={'SAVE_SCREENSHOT': 'False'}  # Disable screenshot
)

# Discover hooks with custom config
hooks = discover_hooks('Snapshot', config=config)

# Or filter hooks manually
screenshot_hooks = [
    h for h in hooks 
    if h.parent.name == 'screenshot'
]

Built-in Extractors

ArchiveBox includes many built-in extractors in archivebox/plugins/:

Content Extractors

  • wget - Download with wget (WARC, HTML, assets)
  • singlefile - Single-file HTML snapshot
  • dom - DOM snapshot via Chrome DevTools Protocol
  • readability - Article text extraction
  • htmltotext - Convert HTML to plain text

Media Extractors

  • screenshot - Page screenshot
  • pdf - PDF snapshot
  • media - Download videos/audio with yt-dlp
  • favicon - Download favicon

Metadata Extractors

  • title - Extract page title
  • headers - HTTP response headers
  • dns - DNS resolution info
  • ssl - SSL certificate info
  • accessibility - Accessibility tree
  • seo - SEO metadata

Code & Data Extractors

  • git - Clone git repositories
  • archive_org - Save to Internet Archive

Creating Custom Extractors

Directory Structure

Create a new plugin directory:
mkdir -p archivebox/plugins/myplugin
cd archivebox/plugins/myplugin

Plugin Configuration

Create config.json:
{
  "name": "myplugin",
  "version": "1.0.0",
  "enabled": true,
  "config": {
    "SAVE_MYPLUGIN": {
      "type": "bool",
      "default": true,
      "description": "Enable my custom plugin"
    },
    "MYPLUGIN_TIMEOUT": {
      "type": "int",
      "default": 30,
      "description": "Timeout for my plugin (seconds)"
    }
  }
}
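Since hooks receive their settings as environment variables (see Plugin Configuration Access below), the defaults declared in config.json map naturally to env-var strings. Here is a sketch of that mapping; `config_defaults_as_env` is a hypothetical helper, not part of the ArchiveBox API:

```python
import json
import tempfile
from pathlib import Path

def config_defaults_as_env(config_path: Path) -> dict:
    """Flatten a plugin's config.json defaults into env-var-style strings."""
    spec = json.loads(config_path.read_text())
    env = {}
    for key, field in spec.get('config', {}).items():
        default = field.get('default')
        # Environment variables are strings; booleans serialize as 'true'/'false'
        env[key] = str(default).lower() if isinstance(default, bool) else str(default)
    return env

# Demo using the example config.json fields from above
example = {
    'name': 'myplugin',
    'config': {
        'SAVE_MYPLUGIN': {'type': 'bool', 'default': True},
        'MYPLUGIN_TIMEOUT': {'type': 'int', 'default': 30},
    },
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'config.json'
    path.write_text(json.dumps(example))
    env = config_defaults_as_env(path)

print(env)  # {'SAVE_MYPLUGIN': 'true', 'MYPLUGIN_TIMEOUT': '30'}
```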

Python Hook Example

Create on_Snapshot__60_myplugin.py:
#!/usr/bin/env python3
"""Custom extractor plugin example."""
import sys
import json
import argparse
from pathlib import Path

def extract(url: str, snapshot_id: str, output_dir: Path) -> dict:
    """Extract custom data from URL.
    
    Args:
        url: URL to archive
        snapshot_id: Snapshot UUID
        output_dir: Where to write output files
        
    Returns:
        Result dict with status, output_files, etc.
    """
    try:
        # Your extraction logic here
        output_file = output_dir / 'custom_data.txt'
        output_file.write_text(f"Processed: {url}\n")
        
        return {
            'status': 'succeeded',
            'output_files': ['custom_data.txt'],
            'output': f'Successfully extracted {url}'
        }
    except Exception as e:
        return {
            'status': 'failed',
            'output': str(e)
        }

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--url', required=True)
    parser.add_argument('--snapshot-id', required=True)
    args = parser.parse_args()
    
    # Output directory is always current working directory
    output_dir = Path.cwd()
    
    # Run extraction
    result = extract(args.url, args.snapshot_id, output_dir)
    
    # Output JSON to stdout (required)
    print(json.dumps(result))
    
    # Exit with appropriate code
    sys.exit(0 if result['status'] == 'succeeded' else 1)
Make it executable:
chmod +x on_Snapshot__60_myplugin.py

JavaScript Hook Example

Create on_Snapshot__60_myplugin.js:
#!/usr/bin/env node
/**
 * Custom JavaScript extractor plugin.
 */
const fs = require('fs');
const path = require('path');

function parseArgs() {
  const args = {};
  process.argv.slice(2).forEach(arg => {
    if (arg.startsWith('--')) {
      const [key, value] = arg.slice(2).split('=');
      args[key] = value;
    }
  });
  return args;
}

async function extract(url, snapshotId, outputDir) {
  try {
    // Your extraction logic here
    const outputFile = path.join(outputDir, 'custom_data.json');
    fs.writeFileSync(outputFile, JSON.stringify({
      url,
      snapshotId,
      extracted_at: new Date().toISOString()
    }));
    
    return {
      status: 'succeeded',
      output_files: ['custom_data.json'],
      output: `Extracted ${url}`
    };
  } catch (err) {
    return {
      status: 'failed',
      output: err.message
    };
  }
}

(async () => {
  const args = parseArgs();
  const outputDir = process.cwd();
  
  const result = await extract(args.url, args['snapshot-id'], outputDir);
  
  // Output JSON to stdout (required)
  console.log(JSON.stringify(result));
  
  process.exit(result.status === 'succeeded' ? 0 : 1);
})();

Background Hooks

For long-running extractors, use background hooks with .bg. in the filename:
# on_Snapshot__63_media.bg.py
#!/usr/bin/env python3
"""Background media downloader that doesn't block other extractors."""
import sys
import json
import signal
import time

# Handle graceful shutdown: set a flag instead of exiting immediately,
# so the loop below can break and the final JSON result still gets printed
shutdown = False
def signal_handler(sig, frame):
    global shutdown
    shutdown = True

signal.signal(signal.SIGTERM, signal_handler)

# Long-running download
for i in range(100):
    if shutdown:
        break
    time.sleep(1)  # Simulate work
    
print(json.dumps({'status': 'succeeded', 'output': 'Downloaded media'}))

Hook Dependencies

Extractors can depend on other plugins’ output:
#!/usr/bin/env python3
"""Extractor that depends on chrome plugin output."""
import sys
import json
from pathlib import Path

# Check if dependency output exists
chrome_dir = Path.cwd().parent / 'chrome'
if not (chrome_dir / 'cdp_url.txt').exists():
    print(json.dumps({
        'status': 'skipped',
        'output': 'Chrome session not available, will retry later'
    }))
    sys.exit(1)  # Non-zero exit triggers retry

# Use chrome output
cdp_url = (chrome_dir / 'cdp_url.txt').read_text().strip()
print(f"Using Chrome at {cdp_url}", file=sys.stderr)

# Continue with extraction...

Plugin Configuration Access

Access plugin configuration in your hooks:
#!/usr/bin/env python3
import os
import sys
import json

# Configuration is passed via environment variables
timeout = int(os.environ.get('MYPLUGIN_TIMEOUT', '30'))
enabled = os.environ.get('SAVE_MYPLUGIN', 'true').lower() == 'true'

if not enabled:
    print(json.dumps({'status': 'skipped', 'output': 'Plugin disabled'}))
    sys.exit(0)

# Use configuration (extract_with_timeout is your plugin's own extraction function)
result = extract_with_timeout(timeout)
print(json.dumps(result))

Testing Plugins

Test your plugin manually:
# Run hook directly
cd /path/to/snapshot/output/dir
./path/to/on_Snapshot__60_myplugin.py \
  --url=https://example.com \
  --snapshot-id=01234567-89ab-cdef-0123-456789abcdef

# Check output
ls -la
cat custom_data.txt
Test via Python:
from pathlib import Path
from archivebox.hooks import run_hook
from archivebox.core.models import Snapshot

snapshot = Snapshot.objects.first()
script = Path('archivebox/plugins/myplugin/on_Snapshot__60_myplugin.py')

result = run_hook(
    script=script,
    snapshot=snapshot,
    url=snapshot.url,
    timeout=60
)

assert result['returncode'] == 0, f"Hook failed: {result['stderr']}"
assert result['output_json']['status'] == 'succeeded'
print(f"Created files: {result['output_json']['output_files']}")

Hook Result Format

All hooks must output JSON to stdout with this structure:
{
  "status": "succeeded",
  "output": "Human-readable output message",
  "output_files": [
    "relative/path/to/file1.html",
    "relative/path/to/file2.png"
  ],
  "cmd": ["command", "args"],
  "pwd": "/path/to/working/dir"
}
Status values:
  • succeeded - Extraction succeeded
  • failed - Extraction failed permanently
  • skipped - Extraction skipped (will retry if dependencies resolve)
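A hook's JSON output can be sanity-checked against this structure. Below is a minimal validator sketch; `validate_hook_result` is a hypothetical helper, not part of the ArchiveBox API:

```python
VALID_STATUSES = {'succeeded', 'failed', 'skipped'}

def validate_hook_result(result: dict) -> list:
    """Return a list of problems with a hook's JSON output (empty list = valid)."""
    problems = []
    if result.get('status') not in VALID_STATUSES:
        problems.append(f"invalid status: {result.get('status')!r}")
    if not isinstance(result.get('output', ''), str):
        problems.append("'output' must be a string")
    files = result.get('output_files', [])
    if not (isinstance(files, list) and all(isinstance(f, str) for f in files)):
        problems.append("'output_files' must be a list of relative path strings")
    return problems

print(validate_hook_result({'status': 'succeeded', 'output': 'ok', 'output_files': ['a.html']}))  # []
print(validate_hook_result({'status': 'done'}))  # ["invalid status: 'done'"]
```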

Advanced Topics

Using Chrome via CDP

Many extractors use Chrome DevTools Protocol:
import json
import asyncio
import argparse
from pathlib import Path
from pyppeteer import connect

# The target URL is passed on the command line, same as any other hook
parser = argparse.ArgumentParser()
parser.add_argument('--url', required=True)
url = parser.parse_args().url

# Read CDP URL from chrome plugin output
chrome_dir = Path.cwd().parent / 'chrome'
cdp_url = (chrome_dir / 'cdp_url.txt').read_text().strip()

async def extract_with_chrome(url):
    browser = await connect(browserWSEndpoint=cdp_url)
    page = await browser.newPage()
    await page.goto(url)
    
    # Extract data
    title = await page.title()
    content = await page.content()
    
    await page.close()
    
    return {'title': title, 'content': content}

result = asyncio.run(extract_with_chrome(url))
print(json.dumps({'status': 'succeeded', 'output': result['title']}))

Binary Dependencies

Declare binary dependencies in config.json:
{
  "name": "myplugin",
  "binaries": {
    "my_tool": {
      "type": "binary",
      "binpath": "my_tool",
      "version_flag": "--version"
    }
  }
}
Check availability in your hook:
import sys
import json
import shutil

if not shutil.which('my_tool'):
    print(json.dumps({
        'status': 'skipped',
        'output': 'my_tool not installed'
    }))
    sys.exit(1)

See Also

  • Python API Overview - Basic Python API usage
  • Models Reference - Django models documentation
  • Config API - Configuration management
  • Plugin Source - View built-in plugins on GitHub
