Get up and running with olmOCR quickly by converting your first PDF document.
Try the online demo at https://olmocr.allenai.org/ for instant testing without installation.

Prerequisites

Before you begin, make sure you have:
  • ✅ Completed the installation steps
  • ✅ An NVIDIA GPU (RTX 4090, L40S, A100, or H100)
  • ✅ At least one PDF file to convert

Convert your first PDF

1. Prepare your workspace

Create a local directory where olmOCR will store results:
mkdir ./localworkspace
2. Convert a single PDF

Run the pipeline on a single PDF file:
python -m olmocr.pipeline ./localworkspace --pdfs path/to/your/document.pdf
For testing, you can use one of the test PDFs from the repository:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
The pipeline automatically handles GPU memory management and model loading.
3. View the results

Extracted text is stored as JSONL in the results directory:
cat localworkspace/results/output_*.jsonl
The output follows the Dolma format with metadata and page spans.
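Beyond `cat`, you can load the results programmatically. A minimal sketch, assuming the default output location used above (one JSON document per line in each `output_*.jsonl` file):

```python
import glob
import json

def load_dolma_docs(pattern):
    """Yield one parsed Dolma document per JSONL line across matching files."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

# Print each document's source and extracted text length
for doc in load_dolma_docs("localworkspace/results/output_*.jsonl"):
    print(doc["id"], len(doc["text"]), "chars")
```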
4. Visualize side-by-side

Generate an HTML viewer to see extracted text alongside the original PDF:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Open the generated HTML file in your browser:
# The viewer creates a file like: dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html
open ./dolma_previews/*.html  # macOS; on Linux, use xdg-open instead

Process multiple PDFs

Convert multiple PDFs in a single run using wildcards:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
Or provide a file containing a list of PDF paths:
python -m olmocr.pipeline ./localworkspace --pdfs @pdf_list.txt
For S3-based PDFs, use s3:// paths. See Cluster Usage for distributed processing.
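A list file is simply one PDF path per line. As a sketch, assuming your PDFs live under a local pdfs/ directory, you can build one with find:

```shell
# Build a list file: one PDF path per line (here from a local pdfs/ directory)
mkdir -p pdfs
find "$(pwd)/pdfs" -name '*.pdf' > pdf_list.txt
```

Then pass the resulting file to the pipeline with the @ prefix as shown above.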

Configure the pipeline

Customize the pipeline behavior with additional options:
# Filter out non-English PDFs, forms, and spam
python -m olmocr.pipeline ./localworkspace --pdfs tests/*.pdf --apply_filter

Understanding the output

Dolma format

Each output file contains JSONL with one document per line:
{
  "id": "s3://bucket/path/document.pdf",
  "text": "Extracted text content...",
  "source": "olmocr",
  "added": "2024-03-03T15:30:00",
  "created": "2024-01-15T10:20:00",
  "metadata": {
    "olmocr_version": "0.1.58",
    "pdf_page_count": 5,
    "pdf_page_numbers": [
      [0, 245, 1],
      [245, 512, 2],
      [512, 890, 3]
    ]
  }
}

Page spans

The pdf_page_numbers field maps character positions to PDF pages:
  • [0, 245, 1] - Characters 0 through 244 come from page 1
  • [245, 512, 2] - Characters 245 through 511 come from page 2
  • [512, 890, 3] - Characters 512 through 889 come from page 3
Each triple is [start, end, page], with an exclusive end offset, so consecutive spans tile the text without overlap.
This enables page-level analysis and attribution.
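For example, a small helper can slice out the text belonging to a single page (a sketch using the `pdf_page_numbers` triples from the Dolma example above):

```python
def text_for_page(doc, page):
    """Return the text span belonging to one PDF page, using the
    [start, end, page] triples in metadata.pdf_page_numbers."""
    for start, end, p in doc["metadata"]["pdf_page_numbers"]:
        if p == page:
            return doc["text"][start:end]
    return None  # page not present in this document

# Demo with the span layout shown above
doc = {
    "text": "x" * 890,
    "metadata": {"pdf_page_numbers": [[0, 245, 1], [245, 512, 2], [512, 890, 3]]},
}
print(len(text_for_page(doc, 2)))  # → 267 characters on page 2
```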

Troubleshooting

The pipeline automatically adjusts batch sizes, but if you still encounter memory issues:
# Reduce image resolution
python -m olmocr.pipeline ./localworkspace --pdfs tests/*.pdf --target_longest_image_dim 768

# Reduce the number of parallel workers
python -m olmocr.pipeline ./localworkspace --pdfs tests/*.pdf --workers 1
Make sure sglang is properly installed:
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
Install poppler-utils and required fonts:
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer fonts-crosextra-caladea
Check the debug log for errors:
tail -f olmocr-pipeline-debug.log
Common causes of failed conversions:
  • Encrypted or password-protected PDFs
  • Extremely large PDFs (>1000 pages)
  • Corrupted PDF files
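You can screen for the first two conditions before running the pipeline. This is a hypothetical pre-flight helper, not part of olmOCR; it assumes a pypdf-style reader object (install with `pip install pypdf`):

```python
def preflight_issues(reader):
    """Flag conditions olmOCR commonly struggles with, given a
    pypdf.PdfReader-like object (is_encrypted flag, pages sequence)."""
    issues = []
    if reader.is_encrypted:
        issues.append("encrypted or password-protected")
    if len(reader.pages) > 1000:
        issues.append("extremely large (>1000 pages)")
    return issues

# Usage (with pypdf installed):
#   from pypdf import PdfReader
#   print(preflight_issues(PdfReader("path/to/your/document.pdf")))
```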

Next steps

Local Usage Guide

Explore all pipeline options and parameters

Cluster Usage

Scale to millions of PDFs with distributed processing

Training

Fine-tune models on your own data

API Reference

Use olmOCR programmatically in Python
