Try the online demo at https://olmocr.allenai.org/ for instant testing without installation.
## Prerequisites
Before you begin, make sure you have:
- ✅ Completed the installation steps
- ✅ An NVIDIA GPU (RTX 4090, L40S, A100, or H100)
- ✅ At least one PDF file to convert
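To confirm the GPU prerequisite, a quick driver check (this assumes the NVIDIA drivers, and therefore `nvidia-smi`, are already installed):

```shell
# List the visible GPU(s) and their total memory
nvidia-smi --query-gpu=name,memory.total --format=csv
```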
## Convert your first PDF

### Convert a single PDF
Run the pipeline on a single PDF file. For testing, you can use one of the test PDFs from the repository.
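Assuming olmOCR is installed in the current environment, a single-file run looks like this (the workspace path is an example; the test PDF ships with the repository):

```shell
# Process one PDF, writing intermediate files and results into ./localworkspace
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```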
### View the results
Extracted text is stored as JSONL in the results directory of your workspace. The output follows the Dolma format, with metadata and page spans.
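With a workspace at `./localworkspace`, the results can be inspected directly (the output filenames are illustrative):

```shell
# Each line of these JSONL files is one converted document
cat localworkspace/results/output_*.jsonl
```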
### Process multiple PDFs
Convert multiple PDFs in a single run using wildcards. For S3-based PDFs, use s3:// paths; see Cluster Usage for distributed processing.
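The wildcard and S3 invocations might look like this (the bucket name and paths are placeholders you should replace with your own):

```shell
# Glob over local PDFs
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf

# S3 workspace and S3 inputs (placeholder bucket and prefixes)
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/pdfs/*.pdf
```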
### Configure the pipeline
Customize the pipeline behavior with additional options.
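The authoritative list of options comes from the CLI itself, so rather than enumerating flags here:

```shell
# Print every supported pipeline option and its default
python -m olmocr.pipeline --help
```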
## Understanding the output

### Dolma format
Each output file contains JSONL with one document per line.
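An illustrative record, pretty-printed here for readability (in the actual files each record is a single line, and the exact metadata keys may differ):

```json
{
  "id": "output_0001",
  "text": "Full extracted text of the document...",
  "metadata": {"Source-File": "olmocr-sample.pdf"},
  "attributes": {"pdf_page_numbers": [[0, 245, 1], [245, 512, 2], [512, 890, 3]]}
}
```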
### Page spans
The `pdf_page_numbers` field maps character positions to PDF pages:
- `[0, 245, 1]`: characters 0-245 are from page 1
- `[245, 512, 2]`: characters 245-512 are from page 2
- `[512, 890, 3]`: characters 512-890 are from page 3
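As a sketch, these triples make it easy to map any character offset in `text` back to its source page (the `page_for_offset` helper below is hypothetical, not part of olmOCR):

```python
def page_for_offset(spans, offset):
    """Map a character offset in the document text to its PDF page.

    `spans` mirrors the pdf_page_numbers attribute: a list of
    [start, end, page] triples covering the text.
    """
    for start, end, page in spans:
        if start <= offset < end:
            return page
    return None  # offset falls outside every span

spans = [[0, 245, 1], [245, 512, 2], [512, 890, 3]]
print(page_for_offset(spans, 300))  # offset 300 lies in [245, 512) -> 2
```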
## Troubleshooting
### CUDA out of memory errors
The pipeline automatically adjusts batch sizes, but if you still encounter memory issues:
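A first step is to confirm nothing else is holding GPU memory before re-running:

```shell
# Shows per-process GPU memory usage; stop any stale processes, then retry
nvidia-smi
```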
### Import errors or missing dependencies
Make sure sglang is properly installed:
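A quick sanity check (if either command fails, reinstall sglang following the installation guide, which may pin specific versions):

```shell
pip show sglang            # confirm it is installed and note the version
python -c "import sglang"  # should exit silently if the install is healthy
```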
### Poppler rendering errors
Install poppler-utils and required fonts:
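On Debian/Ubuntu this looks roughly like the following (package names vary on other distributions, and the exact font list should match the installation guide):

```shell
sudo apt-get update
sudo apt-get install poppler-utils fonts-crosextra-caladea fonts-crosextra-carlito gsfonts
pdftoppm -v   # confirm poppler's renderer is on PATH
```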
### Empty or corrupted output
Check the debug log for errors. Common causes:
- Encrypted or password-protected PDFs
- Extremely large PDFs (>1000 pages)
- Corrupted PDF files
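One way to scan the debug log mentioned above (the paths are assumptions; adjust them to wherever your workspace writes its logs):

```shell
# Surface error lines from any logs under the local workspace
grep -ri "error" localworkspace/*.log 2>/dev/null
```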
## Next steps
- **Local Usage Guide**: explore all pipeline options and parameters
- **Cluster Usage**: scale to millions of PDFs with distributed processing
- **Training**: fine-tune models on your own data
- **API Reference**: use olmOCR programmatically in Python