Get up and running with olmOCR quickly by converting your first PDF document.
Try the online demo at https://olmocr.allenai.org/ for instant testing without installation.

Prerequisites

Before you begin, make sure you have:
  • ✅ Completed the installation steps
  • ✅ An NVIDIA GPU (RTX 4090, L40S, A100, or H100)
  • ✅ At least one PDF file to convert

Convert your first PDF

1. Prepare your workspace

Create a local directory where olmOCR will store results:
mkdir ./localworkspace
2. Convert a single PDF

Run the pipeline on a single PDF file:
python -m olmocr.pipeline ./localworkspace --pdfs path/to/your/document.pdf
For testing, you can use one of the test PDFs from the repository:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
The pipeline automatically handles GPU memory management and model loading.
3. View the results

Extracted text is stored as JSONL in the results directory:
cat localworkspace/results/output_*.jsonl
The output follows the Dolma format with metadata and page spans.
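Beyond `cat`, you can load the results programmatically. A minimal sketch, assuming the default output location used above (one JSON document per line in each `output_*.jsonl` file):

```python
import glob
import json

def load_dolma_docs(pattern):
    """Yield one parsed Dolma document per JSONL line across matching files."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

# Print each document's source and extracted text length
for doc in load_dolma_docs("localworkspace/results/output_*.jsonl"):
    print(doc["id"], len(doc["text"]), "chars")
```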
4. Visualize side-by-side

Generate an HTML viewer to see extracted text alongside the original PDF:
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
Open the generated HTML file in your browser:
# The viewer creates a file like: dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html
open ./dolma_previews/*.html  # macOS; on Linux, use xdg-open instead

Process multiple PDFs

Convert multiple PDFs in a single run using wildcards:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
Or provide a file containing a list of PDF paths:
python -m olmocr.pipeline ./localworkspace --pdfs @pdf_list.txt
For S3-based PDFs, use s3:// paths. See Cluster Usage for distributed processing.
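A list file is simply one PDF path per line. As a sketch, assuming your PDFs live under a local pdfs/ directory, you can build one with find:

```shell
# Build a list file: one PDF path per line (here from a local pdfs/ directory)
mkdir -p pdfs
find "$(pwd)/pdfs" -name '*.pdf' > pdf_list.txt
```

Then pass the resulting file to the pipeline with the @ prefix as shown above.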

Configure the pipeline

Customize the pipeline behavior with additional options:
# Filter out non-English PDFs, forms, and spam
python -m olmocr.pipeline ./localworkspace --pdfs tests/*.pdf --apply_filter

Understanding the output

Dolma format

Each output file contains JSONL with one document per line:
{
  "id": "s3://bucket/path/document.pdf",
  "text": "Extracted text content...",
  "source": "olmocr",
  "added": "2024-03-03T15:30:00",
  "created": "2024-01-15T10:20:00",
  "metadata": {
    "olmocr_version": "0.1.58",
    "pdf_page_count": 5,
    "pdf_page_numbers": [
      [0, 245, 1],
      [245, 512, 2],
      [512, 890, 3]
    ]
  }
}

Page spans

The pdf_page_numbers field maps character positions to PDF pages:
  • [0, 245, 1] - Characters 0 through 244 come from page 1
  • [245, 512, 2] - Characters 245 through 511 come from page 2
  • [512, 890, 3] - Characters 512 through 889 come from page 3
Each triple is [start, end, page], with an exclusive end offset, so consecutive spans tile the text without overlap.
This enables page-level analysis and attribution.
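For example, a small helper can slice out the text belonging to a single page (a sketch using the `pdf_page_numbers` triples from the Dolma example above):

```python
def text_for_page(doc, page):
    """Return the text span belonging to one PDF page, using the
    [start, end, page] triples in metadata.pdf_page_numbers."""
    for start, end, p in doc["metadata"]["pdf_page_numbers"]:
        if p == page:
            return doc["text"][start:end]
    return None  # page not present in this document

# Demo with the span layout shown above
doc = {
    "text": "x" * 890,
    "metadata": {"pdf_page_numbers": [[0, 245, 1], [245, 512, 2], [512, 890, 3]]},
}
print(len(text_for_page(doc, 2)))  # → 267 characters on page 2
```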

Troubleshooting

The pipeline automatically adjusts batch sizes, but if you still encounter memory issues:
# Reduce image resolution
python -m olmocr.pipeline ./localworkspace --pdfs tests/*.pdf --target_longest_image_dim 768

# Reduce the number of parallel workers
python -m olmocr.pipeline ./localworkspace --pdfs tests/*.pdf --workers 1
Make sure sglang is properly installed:
pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
Install poppler-utils and required fonts:
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer fonts-crosextra-caladea
Check the debug log for errors:
tail -f olmocr-pipeline-debug.log
Common causes of failed conversions:
  • Encrypted or password-protected PDFs
  • Extremely large PDFs (>1000 pages)
  • Corrupted PDF files
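You can screen for the first two conditions before running the pipeline. This is a hypothetical pre-flight helper, not part of olmOCR; it assumes a pypdf-style reader object (install with `pip install pypdf`):

```python
def preflight_issues(reader):
    """Flag conditions olmOCR commonly struggles with, given a
    pypdf.PdfReader-like object (is_encrypted flag, pages sequence)."""
    issues = []
    if reader.is_encrypted:
        issues.append("encrypted or password-protected")
    if len(reader.pages) > 1000:
        issues.append("extremely large (>1000 pages)")
    return issues

# Usage (with pypdf installed):
#   from pypdf import PdfReader
#   print(preflight_issues(PdfReader("path/to/your/document.pdf")))
```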

Next steps

Local Usage Guide

Explore all pipeline options and parameters

Cluster Usage

Scale to millions of PDFs with distributed processing

Training

Fine-tune models on your own data

API Reference

Use olmOCR programmatically in Python
