Processing Pipeline
Zerox processes documents through a series of stages, transforming them from their original format into structured markdown.
File Download & Validation
Zerox accepts both local file paths and URLs. The file is downloaded to a temporary directory and validated:
- Downloads remote files via HTTP/HTTPS
- Copies local files to temp directory
- Detects MIME type and file extension
- Validates credentials and input parameters
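As a minimal sketch of the first step, the snippet below classifies an input as remote or local and derives its file extension. The names classifySource and SourceInfo are hypothetical and not part of the Zerox API; the real step also downloads or copies the file and inspects its MIME type, which this sketch leaves out.

```typescript
// Hypothetical helper for illustration; not part of the Zerox API.
interface SourceInfo {
  kind: "remote" | "local";
  extension: string; // lowercase, without the leading dot; "" if none
}

function classifySource(input: string): SourceInfo {
  // Anything starting with http:// or https:// is treated as a remote file.
  const kind = /^https?:\/\//i.test(input) ? "remote" : "local";

  // Drop any query string or fragment before looking for an extension.
  const [pathOnly = ""] = input.split(/[?#]/);

  // For URLs, strip the scheme and host so "example.com" is not read as ".com".
  const withoutHost =
    kind === "remote" ? pathOnly.replace(/^https?:\/\/[^/]*/i, "") : pathOnly;

  const lastSegment = withoutHost.split(/[\\/]/).pop() ?? "";
  const dot = lastSegment.lastIndexOf(".");
  const extension = dot > 0 ? lastSegment.slice(dot + 1).toLowerCase() : "";

  return { kind, extension };
}
```

A URL without a recognizable extension yields an empty string, which is where MIME type detection would have to take over.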
File Type Detection & Conversion
Based on the file type, Zerox routes the document through the appropriate conversion path:
Direct Image Processing:
- PNG, JPEG, JPG: Used directly
- HEIC: Converted to JPEG format
- Native PDFs are validated using a magic number check (%PDF)
- Legacy Compound File Binary (CFB) formats are detected
- PDFs are converted to high-resolution images (default 2048px height, 300 DPI)
- DOCX, PPTX, and other formats are converted to PDF using LibreOffice
- Then processed through the PDF pipeline
- Excel files (XLSX, XLS, XLSM, XLSB) are converted directly to HTML tables
- Each sheet becomes a separate page
- Bypasses image conversion entirely
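The routing rules above can be sketched as a small lookup plus a magic number check. The route names and function names here are hypothetical, chosen only for illustration; Zerox's internal identifiers may differ. The one hard fact relied on is that a PDF file begins with the ASCII bytes "%PDF".

```typescript
// Hypothetical route names; Zerox's internal identifiers may differ.
type ConversionRoute =
  | "image-direct"
  | "heic-to-jpeg"
  | "pdf-to-images"
  | "libreoffice-to-pdf"
  | "excel-to-html";

const DIRECT_IMAGE_EXTENSIONS = new Set(["png", "jpeg", "jpg"]);
const EXCEL_EXTENSIONS = new Set(["xlsx", "xls", "xlsm", "xlsb"]);

// A PDF file starts with the ASCII bytes "%PDF" (0x25 0x50 0x44 0x46).
function hasPdfMagicNumber(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46];
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

function chooseRoute(extension: string): ConversionRoute {
  // Accept ".PDF", "pdf", and similar spellings.
  const ext = extension.toLowerCase().replace(/^\./, "");
  if (DIRECT_IMAGE_EXTENSIONS.has(ext)) return "image-direct";
  if (ext === "heic") return "heic-to-jpeg";
  if (ext === "pdf") return "pdf-to-images";
  if (EXCEL_EXTENSIONS.has(ext)) return "excel-to-html";
  // DOCX, PPTX, and other formats go through LibreOffice, then the PDF path.
  return "libreoffice-to-pdf";
}
```

Checking the leading bytes in addition to the extension guards against files that are named .pdf but hold another format.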
Image Preprocessing
Before sending to the vision model, images undergo several optimizations:
Compression:
- Images are compressed to stay under the maxImageSize limit (default: 15MB)
- Maintains visual quality while reducing token costs
- Uses Tesseract OCR to detect incorrect orientation
- Automatically rotates images to correct reading angle
- Utilizes worker pool for parallel processing
- Removes unnecessary whitespace and borders
- Detects aspect ratios exceeding threshold (>5:1)
- Adjusts image dimensions for optimal processing
- Converts all images to PNG or JPEG
- Encodes as base64 for API transmission
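Two of the checks above are simple enough to sketch directly. The function names are hypothetical, and two details are assumptions rather than facts from this page: that the 15MB limit is counted in units of 1024 × 1024 bytes, and that the aspect ratio is measured as long side over short side.

```typescript
// Hypothetical helpers for illustration; not taken from the Zerox source.

// Assumption: 1 MB = 1024 * 1024 bytes.
function needsCompression(byteLength: number, maxImageSizeMb = 15): boolean {
  return byteLength > maxImageSizeMb * 1024 * 1024;
}

// Assumption: the ratio is long side over short side, so both very tall
// and very wide images count as extreme.
function exceedsAspectRatio(width: number, height: number, threshold = 5): boolean {
  if (width <= 0 || height <= 0) {
    throw new RangeError("width and height must be positive");
  }
  const ratio = Math.max(width, height) / Math.min(width, height);
  return ratio > threshold;
}
```

Note that an image at exactly 5:1 does not exceed the threshold, matching the ">5:1" wording above.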
Vision Model Processing
Images are sent to the configured vision model for OCR or extraction:
OCR Mode (Default):
- System prompt instructs model to convert document to markdown
- Includes specific rules for tables, charts, logos, watermarks
- maintainFormat option ensures consistent formatting across pages
- Prior page context helps maintain document structure
- Processes pages concurrently (default: 10 at a time)
- Uses structured output with JSON schema
- Can process text (from OCR), images directly, or both (hybrid mode)
- Supports per-page extraction or full-document extraction
- Returns structured data matching the provided schema
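Processing pages "10 at a time" is a bounded-concurrency pattern. The sketch below shows one common way to implement it with plain promises; mapWithConcurrency is a hypothetical helper, not Zerox's own code, and the worker stands in for a call to the vision model.

```typescript
// Hypothetical helper illustrating a concurrency limit; not Zerox's own code.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start at most `limit` runners; results keep the input order.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

Because results are written by index, the output order matches the page order even when pages finish out of order.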
Response Processing & Assembly
Results from the vision model are processed and assembled:
- Collects content from all pages
- Tracks token usage (input/output)
- Records success/failure rates
- Calculates total completion time
- Optionally saves aggregated markdown to output directory
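The assembly step amounts to a fold over the per-page results. The following is a minimal sketch under assumed shapes: PageResult, Summary, and assembleResults are hypothetical names, and Zerox's actual types may differ.

```typescript
// Hypothetical shapes for illustration; Zerox's actual types may differ.
interface PageResult {
  page: number;
  status: "SUCCESS" | "ERROR";
  content: string;
  inputTokens: number;
  outputTokens: number;
}

interface Summary {
  markdown: string;
  inputTokens: number;
  outputTokens: number;
  successfulPages: number;
  failedPages: number;
  completionTimeMs: number;
}

function assembleResults(
  pages: readonly PageResult[],
  startedAtMs: number,
  finishedAtMs: number,
): Summary {
  // Sort by page number so concurrent completion order does not matter.
  const ordered = [...pages].sort((a, b) => a.page - b.page);
  const succeeded = ordered.filter((p) => p.status === "SUCCESS");
  return {
    // Only successful pages contribute to the aggregated markdown.
    markdown: succeeded.map((p) => p.content).join("\n\n"),
    // Token usage is counted for every page, including failed ones.
    inputTokens: ordered.reduce((sum, p) => sum + p.inputTokens, 0),
    outputTokens: ordered.reduce((sum, p) => sum + p.outputTokens, 0),
    successfulPages: succeeded.length,
    failedPages: ordered.length - succeeded.length,
    completionTimeMs: finishedAtMs - startedAtMs,
  };
}
```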
Cleanup & Return
Final cleanup and result formatting:
- Terminates Tesseract worker pool
- Removes temporary files and directories (if cleanup: true)
- Returns comprehensive output including:
  - Page-by-page content
  - Extracted structured data (if schema provided)
  - Token counts and timing metrics
  - Success/failure summary
  - Optional logprobs for analysis
Operation Modes
Zerox supports three primary operation modes:
OCR Mode
The default mode that converts documents to markdown.
Extraction Mode
Extracts structured data using a JSON schema, with optional OCR.
Extract-Only Mode
Skips OCR entirely and processes images directly for extraction.
Hybrid Extraction Mode
Combines OCR text with original images for best accuracy.
Hybrid mode cannot be used with extractOnly or directImageExtraction modes.
Concurrency & Performance
Zerox uses several strategies to optimize performance:
- Parallel Page Processing: Processes multiple pages simultaneously (configurable via concurrency)
- Tesseract Worker Pool: Maintains reusable OCR workers for orientation detection
- Dynamic Worker Scaling: Automatically adjusts worker count based on document size
- Retry Logic: Automatically retries failed requests (configurable via maxRetries)
- Format Maintenance: Optional sequential processing when consistency is critical
When maintainFormat: true, pages are processed sequentially to ensure consistent formatting across the document.
Error Handling
Zerox provides two error handling modes:
- IGNORE (default): Failed pages are marked with error status, processing continues
- THROW: Processing stops immediately on first error
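The difference between the two modes, combined with a configurable retry count, can be sketched as follows. All names here are hypothetical; only the IGNORE/THROW behaviour and the idea of a retry limit come from this page. For simplicity the sketch handles pages one after another rather than concurrently.

```typescript
// Hypothetical names for illustration; not the Zerox implementation.
type ErrorMode = "IGNORE" | "THROW";

type PageOutcome<R> =
  | { page: number; status: "SUCCESS"; value: R }
  | { page: number; status: "ERROR"; error: string };

async function withRetries<R>(task: () => Promise<R>, maxRetries: number): Promise<R> {
  let lastError: unknown = new Error("task was never attempted");
  // One initial attempt plus up to maxRetries further attempts.
  const attempts = Math.max(0, maxRetries) + 1;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}

async function processPages<R>(
  pageCount: number,
  processPage: (page: number) => Promise<R>,
  options: { errorMode: ErrorMode; maxRetries: number },
): Promise<PageOutcome<R>[]> {
  const outcomes: PageOutcome<R>[] = [];
  for (let page = 1; page <= pageCount; page++) {
    try {
      const value = await withRetries(() => processPage(page), options.maxRetries);
      outcomes.push({ page, status: "SUCCESS", value });
    } catch (error) {
      // THROW stops at the first page that still fails after its retries.
      if (options.errorMode === "THROW") throw error;
      // IGNORE records the failure and moves on to the next page.
      const message = error instanceof Error ? error.message : String(error);
      outcomes.push({ page, status: "ERROR", error: message });
    }
  }
  return outcomes;
}
```

With IGNORE, callers should inspect each outcome's status, since a resolved result does not imply that every page succeeded.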
System Prompts
Zerox uses carefully crafted system prompts to guide the vision model:
Base OCR Prompt:
- Convert document to markdown
- Include all information (headers, footers, subtext)
- Return tables in HTML format
- Interpret charts and infographics
- Wrap logos, watermarks, and page numbers in brackets
- Use ☐ and ☑ for checkboxes
When maintainFormat: true, the prompt also includes the previous page's content to maintain formatting consistency.
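To illustrate how prior-page context might be attached, here is a minimal sketch. The rule text is a placeholder assembled from the list above, not the prompt Zerox actually ships, and buildSystemPrompt with its option names is hypothetical.

```typescript
// Placeholder rules paraphrased from the list above; this is not the
// prompt text that Zerox actually sends.
const BASE_OCR_RULES: readonly string[] = [
  "Convert the document to markdown.",
  "Include all information (headers, footers, subtext).",
  "Return tables in HTML format.",
  "Interpret charts and infographics.",
  "Wrap logos, watermarks, and page numbers in brackets.",
  "Use ☐ and ☑ for checkboxes.",
];

// Hypothetical helper: builds the prompt for one page.
function buildSystemPrompt(options: { maintainFormat: boolean; priorPage?: string }): string {
  const parts = [BASE_OCR_RULES.map((rule) => `- ${rule}`).join("\n")];
  // Add prior-page context only when maintainFormat is on and a previous page exists.
  if (options.maintainFormat && options.priorPage) {
    parts.push(
      `Keep formatting consistent with the previous page:\n"""\n${options.priorPage}\n"""`,
    );
  }
  return parts.join("\n\n");
}
```

Since each prompt depends on the previous page's output, this design only works when pages are processed in order.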
Temporary Files
Zerox creates temporary directories for processing:
- Location: tempDir parameter or OS temp directory
- Structure: zerox-temp-{random}/source/
- Cleanup: Automatic when cleanup: true (default)
- Contents: Downloaded files, converted images, compressed versions
Temporary files are kept on disk when cleanup is set to false for debugging purposes.
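The directory layout can be expressed as a small path builder. This is only a sketch: buildTempPaths is a hypothetical helper, the suffix is passed in because this page does not say how the {random} part is generated, and forward slashes are used for simplicity.

```typescript
// Hypothetical helper; the way Zerox generates the random suffix is not
// stated on this page, so the caller supplies it here.
function buildTempPaths(baseDir: string, suffix: string): { root: string; source: string } {
  // Remove trailing separators so the joined path has no doubled slashes.
  const trimmedBase = baseDir.replace(/[\\/]+$/, "");
  const root = `${trimmedBase}/zerox-temp-${suffix}`;
  return { root, source: `${root}/source` };
}
```

With cleanup disabled, the root path returned here is the directory to inspect for downloaded files and converted images.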
