What is ArchiveBox?
ArchiveBox is a self-hosted app that lets you preserve content from websites in a variety of formats. Without active preservation effort, everything on the internet eventually disappears or degrades.
We aim to keep your data immediately useful by storing it in formats that other programs can read directly. As output, we save standard HTML, PNG, PDF, TXT, JSON, WARC, and SQLite, all widely supported formats chosen to stay readable for decades to come. ArchiveBox also has a CLI, REST API, and webhooks so you can set up integrations with other services.
ArchiveBox is an open source tool that lets organizations and individuals archive both public and private web content while retaining control over their data. It can be used to save copies of bookmarks, preserve evidence for legal cases, back up photos from FB/Insta/Flickr or media from YT/Soundcloud/etc., save research papers, and more.
Key Features
Free & Open Source
Own your own data & maintain your privacy by self-hosting. Licensed under MIT.
Powerful CLI & Web UI
Comprehensive command-line interface plus a self-hosted web UI for managing your archive.
Multiple Archive Formats
Saves HTML, PDF, screenshots, videos, git repos, WARC files, and more.
Import from Anywhere
Import URLs from bookmarks, RSS feeds, browser history, Pocket, Pinboard, and more.
Scheduled Archiving
Set up automated imports from RSS feeds and other sources on a schedule.
Archive.org Integration
Optionally saves all pages to archive.org for redundancy (can be disabled).
Content Extraction
Automatically extracts media (yt-dlp), articles (readability), code (git), and more.
Long-term Preservation
Uses standard, durable formats like HTML, JSON, PDF, PNG, MP4, TXT, and WARC.
What Does ArchiveBox Save?
For each web page you archive, ArchiveBox creates a snapshot and preserves its content in multiple redundant formats:
- HTML & DOM: Original HTML+CSS+JS, SingleFile snapshot, DOM dump
- Visual: Screenshots (PNG), PDF printouts
- Metadata: Title, favicon, headers, response data
- Media: Audio/video files via yt-dlp, including subtitles and metadata
- Articles: Extracted article text using Readability & Mercury
- Source Code: Git repository clones from GitHub, GitLab, Bitbucket
- Archives: WARC files via wget, Archive.org permalinks
- And more: See Output Formats for the full list
Use Cases
Journalists
Crawl websites during research, preserve cited pages for fact-checking and review.
Lawyers
Collect and preserve evidence, detect changes over time, organize with tags for review.
Researchers
Analyze social media trends, gather LLM training data, build crawling pipelines.
Individuals
Save bookmarks, preserve portfolio content, create legacy archives and memoirs.
How to Import URLs
ArchiveBox supports importing URLs from many sources:
- Browser Extension: Real-time archiving from Chrome/Chromium/Firefox
- Browser Exports: Import bookmarks or history from any browser
- RSS Feeds: Automatically archive new posts from feeds
- Bookmark Services: Import from Pocket, Pinboard, Instapaper, etc.
- Social Media: Save posts from Reddit, Twitter bookmarks, and more
- Plain Text: Any text file containing URLs (CSV, JSON, TXT, Markdown, etc.)
- Manual Entry: Add individual URLs via CLI or Web UI
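To make the "Plain Text" option above more concrete, here is a minimal Python sketch of how URLs might be pulled out of mixed-format text (Markdown, CSV, JSON, or prose). This is an illustration only, not ArchiveBox's actual parser: the `extract_urls` helper and `URL_PATTERN` regex are hypothetical names, and the pattern deliberately stops at commas, quotes, and closing brackets so that CSV fields and Markdown links split cleanly, at the cost of truncating the rare URL that contains those characters.

```python
import re

# Hypothetical pattern for illustration; NOT ArchiveBox's real URL parser.
# Stops at whitespace, angle brackets, quotes, commas, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'),\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique http(s) URLs found in text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a URL in prose.
        url = match.group(0).rstrip(".;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = (
        "Markdown: [post](https://example.com/a)\n"
        "CSV: 1,https://example.com/b,note\n"
        'JSON: {"url": "https://example.com/c"}\n'
        "Prose: see https://example.com/a.\n"
    )
    for url in extract_urls(sample):
        print(url)
```

A list produced this way could then be saved to a text file and imported, or added one URL at a time through the CLI or Web UI as described above.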
Get Started
Quickstart Guide
Get ArchiveBox running in minutes with Docker Compose
Installation Guide
Detailed installation instructions for all platforms
Community & Support
ArchiveBox is free for everyone to self-host. We also provide professional support, security review, and custom integrations for NGOs, governments, and organizations.
- Demo: Try it out at demo.archivebox.io
- Documentation: Full wiki at github.com/ArchiveBox/ArchiveBox/wiki
- GitHub: Star us at github.com/ArchiveBox/ArchiveBox
- Community: Join the discussion in our Web Archiving Community