Overview

The Crawls API allows you to manage crawl sessions. A crawl represents a group of snapshots created from a single import or archiving command (e.g., archivebox add).

Base URL: /api/v1/crawls/

Crawl Schema

A crawl object contains:
{
  "TYPE": "crawls.models.Crawl",
  "id": "01234567-89ab-cdef-0123-456789abcdef",
  "created_at": "2024-01-15T10:30:00Z",
  "modified_at": "2024-01-15T10:35:00Z",
  "created_by_id": "1",
  "created_by_username": "admin",
  "status": "succeeded",
  "retry_at": null,
  "urls": "https://example.com\nhttps://example.org",
  "extractor": "auto",
  "max_depth": 0,
  "tags_str": "important,tutorial",
  "config": {
    "TIMEOUT": "60",
    "CHECK_SSL_VALIDITY": "True"
  }
}

Field Descriptions

Field                Type      Description
id                   UUID      Unique identifier for the crawl
created_at           datetime  When the crawl was created
modified_at          datetime  When the crawl was last modified
created_by_id        string    ID of the user who created the crawl
created_by_username  string    Username of the user who created the crawl
status               string    Current status (see Status Values)
retry_at             datetime  When to retry the crawl (null if not scheduled)
urls                 string    Newline-separated list of URLs
extractor            string    Parser used (e.g., "auto", "wget_log", "rss")
max_depth            int       Recursion depth for crawling
tags_str             string    Comma-separated tag names
config               object    Configuration overrides for this crawl
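The urls and tags_str fields are flat strings rather than arrays. A minimal Python sketch splits them into lists (parse_crawl_fields is a hypothetical helper, not part of the API):

```python
def parse_crawl_fields(crawl: dict) -> dict:
    """Split a crawl's flat string fields into lists.

    Per the field table above, urls is newline-separated and
    tags_str is comma-separated.
    """
    return {
        "urls": [u for u in crawl.get("urls", "").split("\n") if u],
        "tags": [t.strip() for t in crawl.get("tags_str", "").split(",") if t.strip()],
    }

crawl = {
    "urls": "https://example.com\nhttps://example.org",
    "tags_str": "important,tutorial",
}
parse_crawl_fields(crawl)
# → {'urls': ['https://example.com', 'https://example.org'],
#    'tags': ['important', 'tutorial']}
```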

Status Values

  • queued - Waiting to be processed
  • started - Currently crawling
  • succeeded - Successfully completed
  • failed - Crawl failed
  • sealed - Cancelled/frozen (no further processing)
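Since succeeded, failed, and sealed all mean no further processing will happen, a small helper (a sketch, not part of any official client) can tell whether a crawl is still in flight:

```python
# queued and started are the only in-flight states per the list above;
# succeeded, failed, and sealed are all terminal.
PENDING_STATUSES = {"queued", "started"}

def is_finished(status: str) -> bool:
    """Return True once a crawl has reached a terminal status."""
    return status not in PENDING_STATUSES

is_finished("started")  # → False
is_finished("sealed")   # → True
```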

List Crawls

Get all crawls in the system.
curl http://127.0.0.1:8000/api/v1/crawls/crawls \
  -H "X-ArchiveBox-API-Key: your-token-here"

Response

Returns an array of crawl objects:
[
  {
    "TYPE": "crawls.models.Crawl",
    "id": "01234567-89ab-cdef-0123-456789abcdef",
    "created_at": "2024-01-15T10:30:00Z",
    ...
  }
]
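A Python sketch of the same request, with a hypothetical filter_by_status helper for post-processing the returned array (assumes the requests package is installed and the token placeholder is filled in):

```python
import requests

API_KEY = "your-token-here"
BASE_URL = "http://127.0.0.1:8000/api/v1"

def filter_by_status(crawls: list[dict], status: str) -> list[dict]:
    """Keep only the crawl objects whose status field matches."""
    return [c for c in crawls if c.get("status") == status]

def list_crawls() -> list[dict]:
    """GET /api/v1/crawls/crawls and return the array of crawl objects."""
    resp = requests.get(
        f"{BASE_URL}/crawls/crawls",
        headers={"X-ArchiveBox-API-Key": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()

# failed_crawls = filter_by_status(list_crawls(), "failed")
```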

Get Single Crawl

Retrieve a specific crawl by ID.
curl http://127.0.0.1:8000/api/v1/crawls/crawl/01234567 \
  -H "X-ArchiveBox-API-Key: your-token-here"

Path Parameters

Parameter  Description
crawl_id   Crawl UUID (full or prefix match)

Query Parameters

Parameter            Type  Default  Description
with_snapshots       bool  false    Include snapshots array
with_archiveresults  bool  false    Include archiveresults in snapshots
as_rss               bool  false    Return snapshots as RSS XML feed
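The examples below pass these flags in the query string as lowercase true. A sketch of a URL builder (crawl_url is a hypothetical helper, not part of any official client):

```python
from urllib.parse import urlencode

def crawl_url(base: str, crawl_id: str, **params: bool) -> str:
    """Build a Get Single Crawl URL with optional boolean query
    parameters (with_snapshots, with_archiveresults, as_rss)."""
    url = f"{base}/crawls/crawl/{crawl_id}"
    query = {k: str(v).lower() for k, v in params.items() if v}
    return f"{url}?{urlencode(query)}" if query else url

crawl_url("http://127.0.0.1:8000/api/v1", "01234567", with_snapshots=True)
# → "http://127.0.0.1:8000/api/v1/crawls/crawl/01234567?with_snapshots=true"
```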

Response

Returns a single crawl object.

RSS Feed Export

Get snapshots from a crawl as an RSS feed:
curl "http://127.0.0.1:8000/api/v1/crawls/crawl/01234567?as_rss=true" \
  -H "X-ArchiveBox-API-Key: your-token-here"
Response:
<rss version="2.0">
  <channel>
    <item>
      <url>https://example.com</url>
      <title>Example Domain</title>
      <bookmarked_at>2024-01-15T10:30:00Z</bookmarked_at>
      <tags>important,tutorial</tags>
    </item>
  </channel>
</rss>
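The feed can be consumed with the standard library. A sketch using xml.etree.ElementTree against the item shape shown above (note the feed uses url, bookmarked_at, and tags elements rather than the standard RSS link and pubDate):

```python
import xml.etree.ElementTree as ET

def parse_crawl_rss(xml_text: str) -> list[dict]:
    """Extract url, title, and tags from each <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "url": item.findtext("url"),
            "title": item.findtext("title"),
            "tags": (item.findtext("tags") or "").split(","),
        }
        for item in root.iter("item")
    ]

feed = """<rss version="2.0"><channel><item>
<url>https://example.com</url><title>Example Domain</title>
<tags>important,tutorial</tags></item></channel></rss>"""
parse_crawl_rss(feed)
# → [{'url': 'https://example.com', 'title': 'Example Domain',
#     'tags': ['important', 'tutorial']}]
```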

Update Crawl

Update crawl status or retry time.
curl -X PATCH http://127.0.0.1:8000/api/v1/crawls/crawl/01234567 \
  -H "X-ArchiveBox-API-Key: your-token-here" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "sealed"
  }'

Request Body

{
  "status": "sealed",          // Optional: new status value
  "retry_at": "2024-01-20T10:00:00Z"  // Optional: schedule retry
}

Behavior

When setting status to sealed:
  • The crawl’s retry_at is set to null
  • All queued or started snapshots in this crawl are also sealed
  • Their retry_at fields are also set to null
This effectively cancels all pending work for the entire crawl.

Valid Status Transitions

You can update status to any of these values:
  • queued
  • started
  • succeeded
  • failed
  • sealed

Response

Returns the updated crawl object.

Common Workflows

Cancel a Running Crawl

Stop a crawl and all its associated snapshots:
import requests

api_key = "your-token-here"
base_url = "http://127.0.0.1:8000/api/v1"

response = requests.patch(
    f"{base_url}/crawls/crawl/01234567",
    headers={
        "X-ArchiveBox-API-Key": api_key,
        "Content-Type": "application/json"
    },
    json={"status": "sealed"}
)

print(f"Crawl sealed: {response.json()['id']}")

View All Snapshots in a Crawl

curl "http://127.0.0.1:8000/api/v1/crawls/crawl/01234567?with_snapshots=true" \
  -H "X-ArchiveBox-API-Key: your-token-here"
Or query snapshots directly by crawl:
curl "http://127.0.0.1:8000/api/v1/core/snapshots?crawl_id=01234567" \
  -H "X-ArchiveBox-API-Key: your-token-here"

Monitor Recent Crawls

curl "http://127.0.0.1:8000/api/v1/crawls/crawls" \
  -H "X-ArchiveBox-API-Key: your-token-here" \
  | jq 'sort_by(.created_at) | reverse | .[0:10]'
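The same "ten most recent" selection in Python, assuming only that created_at is an ISO 8601 timestamp (so lexicographic order matches chronological order):

```python
def latest_crawls(crawls: list[dict], n: int = 10) -> list[dict]:
    """Return the n most recently created crawls, newest first."""
    return sorted(crawls, key=lambda c: c["created_at"], reverse=True)[:n]

crawls = [
    {"id": "a", "created_at": "2024-01-15T10:30:00Z"},
    {"id": "b", "created_at": "2024-02-01T09:00:00Z"},
]
latest_crawls(crawls, 1)  # → [{'id': 'b', 'created_at': '2024-02-01T09:00:00Z'}]
```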

Export Crawl as RSS Feed

Useful for sharing or re-importing:
curl "http://127.0.0.1:8000/api/v1/crawls/crawl/01234567?as_rss=true" \
  -H "X-ArchiveBox-API-Key: your-token-here" \
  > crawl-export.xml

Understanding Crawls vs Snapshots

Relationship:
  • Crawl: Represents a single import operation (e.g., one archivebox add command)
  • Snapshot: Individual URL within a crawl
One crawl can contain multiple snapshots. For example:
archivebox add --depth=1 https://example.com
This creates:
  • 1 Crawl with max_depth=1
  • Multiple Snapshots (example.com + any linked pages)

Crawl Configuration

The config field stores configuration overrides that were active when the crawl was created:
{
  "config": {
    "TIMEOUT": "60",
    "CHECK_SSL_VALIDITY": "True",
    "SAVE_SCREENSHOT": "True"
  }
}
These settings were applied to all snapshots in the crawl.
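Note that the override values are strings even for numeric or boolean settings. A sketch of overlaying them on a set of defaults (the DEFAULTS dict here is made up for illustration; real defaults come from the ArchiveBox configuration system):

```python
# Hypothetical defaults, for illustration only.
DEFAULTS = {
    "TIMEOUT": "120",
    "CHECK_SSL_VALIDITY": "True",
    "SAVE_SCREENSHOT": "False",
}

def effective_config(crawl: dict) -> dict:
    """Overlay a crawl's config overrides on top of the defaults."""
    return {**DEFAULTS, **crawl.get("config", {})}

effective_config({"config": {"TIMEOUT": "60"}})["TIMEOUT"]  # → "60"
```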

Error Responses

404 Not Found

{
  "succeeded": false,
  "message": "ObjectDoesNotExist: Crawl matching query does not exist."
}

400 Bad Request

{
  "succeeded": false,
  "message": "Invalid status: invalid-status"
}
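Both error shapes share the succeeded and message fields, so a client can handle them uniformly. A sketch (CrawlAPIError and check_response are hypothetical names):

```python
class CrawlAPIError(Exception):
    """Raised when the API returns the error envelope shown above."""

def check_response(status_code: int, body: dict) -> dict:
    """Return the body on success; raise CrawlAPIError on the
    {"succeeded": false, "message": ...} error envelope."""
    if status_code >= 400 or body.get("succeeded") is False:
        raise CrawlAPIError(f"{status_code}: {body.get('message', 'unknown error')}")
    return body
```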

Related

  • Snapshots API - View snapshots within a crawl
  • Tags API - Crawls can have tags applied to all snapshots
