What is ArchiveBox?

ArchiveBox is a self-hosted, open source app that lets organizations and individuals preserve content from websites in a variety of formats. It can archive both public and private web content while you retain control over your data.

Without active preservation effort, everything on the internet eventually disappears or degrades. ArchiveBox aims to make your data immediately useful and to keep it in formats that other programs can read directly. Output is saved as standard HTML, PNG, PDF, TXT, JSON, WARC, and SQLite, formats chosen to remain readable for decades to come. ArchiveBox also has a CLI, REST API, and webhooks so you can set up integrations with other services.

It can be used to save copies of bookmarks, preserve evidence for legal cases, back up photos from Facebook, Instagram, or Flickr and media from YouTube, SoundCloud, and similar sites, save research papers, and more.

Key Features

Free & Open Source

Own your data and maintain your privacy by self-hosting. Licensed under MIT.

Powerful CLI & Web UI

Comprehensive command-line interface plus a self-hosted web UI for managing your archive.

Multiple Archive Formats

Saves HTML, PDF, screenshots, videos, git repos, WARC files, and more.

Import from Anywhere

Import URLs from bookmarks, RSS feeds, browser history, Pocket, Pinboard, and more.

Scheduled Archiving

Set up automated imports from RSS feeds and other sources on a schedule.

Archive.org Integration

Optionally saves all pages to archive.org for redundancy (can be disabled).

Content Extraction

Automatically extracts media (yt-dlp), articles (readability), code (git), and more.

Long-term Preservation

Uses standard, durable formats like HTML, JSON, PDF, PNG, MP4, TXT, and WARC.

What Does ArchiveBox Save?

For each web page you archive, ArchiveBox creates a snapshot and preserves its content in multiple redundant formats:
  • HTML & DOM: Original HTML+CSS+JS, SingleFile snapshot, DOM dump
  • Visual: Screenshots (PNG), PDF printouts
  • Metadata: Title, favicon, headers, response data
  • Media: Audio/video files via yt-dlp, including subtitles and metadata
  • Articles: Extracted article text using Readability & Mercury
  • Source Code: Git repository clones from GitHub, GitLab, Bitbucket
  • Archives: WARC files via wget, Archive.org permalinks
  • And more: See Output Formats for the full list

Use Cases

Journalists

Crawl websites during research, preserve cited pages for fact-checking and review.

Lawyers

Collect and preserve evidence, detect changes over time, organize with tags for review.

Researchers

Analyze social media trends, gather LLM training data, build crawling pipelines.

Individuals

Save bookmarks, preserve portfolio content, create legacy archives and memoirs.

How to Import URLs

ArchiveBox supports importing URLs from many sources:
  • Browser Extension: Real-time archiving from Chrome/Chromium/Firefox
  • Browser Exports: Import bookmarks or history from any browser
  • RSS Feeds: Automatically archive new posts from feeds
  • Bookmark Services: Import from Pocket, Pinboard, Instapaper, etc.
  • Social Media: Save posts from Reddit, Twitter bookmarks, and more
  • Plain Text: Any text file containing URLs (CSV, JSON, TXT, Markdown, etc.)
  • Manual Entry: Add individual URLs via CLI or Web UI
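
To illustrate the plain-text case, here is a minimal sketch of pulling URLs out of arbitrary text such as Markdown, JSON, or CSV. It is not ArchiveBox's own parser: the `extract_urls` helper and its regular expression are hypothetical and deliberately simple, and they only show why nearly any file containing links can serve as an import source.

```python
import re

# Hypothetical helper for illustration only; not part of ArchiveBox.
# Matches http(s) URLs and stops at whitespace, quotes, angle brackets,
# and closing parentheses or square brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'\)\]]+")


def extract_urls(text: str) -> list[str]:
    """Return unique http(s) URLs in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a URL in prose.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = """
# Reading list
- [Example](https://example.com/page), saved 2024
https://example.org/feed.xml
title,url
Post,https://example.com/page
"""

for url in extract_urls(sample):
    print(url)
```

The sample mixes a Markdown link, a bare URL, and a CSV row, and the duplicate link is reported only once. A simple pattern like this will misread some inputs (for example, a CSV row where further columns follow the URL), so treat it as a sketch of the idea and not a replacement for the built-in importers.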

Get Started

Quickstart Guide

Get ArchiveBox running in minutes with Docker Compose

Installation Guide

Detailed installation instructions for all platforms

Community & Support

ArchiveBox is free for everyone to self-host. We also provide professional support, security review, and custom integrations for NGOs, governments, and organizations.
