What is MarkItDown?
MarkItDown is a lightweight Python utility for converting various file formats to Markdown, specifically designed for use with Large Language Models (LLMs) and text analysis pipelines. Built by the Microsoft AutoGen team, it preserves important document structure and content while producing clean, LLM-friendly Markdown output. While the output is often reasonably presentable and human-friendly, MarkItDown is optimized for consumption by text analysis tools, not for high-fidelity document conversions for human readers.
Quick start
Get up and running in minutes with your first conversion
Installation
Detailed setup instructions for all environments
Python API
Integrate MarkItDown into your applications
CLI reference
Command-line interface documentation
Supported formats
MarkItDown currently supports conversion from a wide range of file types:
Documents
- PDF files
- Word documents (.docx)
- PowerPoint (.pptx)
- Excel spreadsheets (.xlsx, .xls)
- EPub books
Media
- Images (JPEG, PNG)
- EXIF metadata extraction
- OCR via LLM integration
- Audio transcription
Web & text
- HTML pages
- YouTube videos
- Wikipedia articles
- CSV, JSON, XML
- ZIP archives
MarkItDown can also convert Outlook messages, Jupyter notebooks, RSS feeds, and more.
Key features
Structure preservation
MarkItDown maintains important document structure including:
- Headings and hierarchy
- Lists (ordered and unordered)
- Tables with proper formatting
- Links and references
- Code blocks and formatting
Flexible input sources
Convert documents from multiple sources:
- Local file paths
- URLs (HTTP/HTTPS)
- File URIs
- Data URIs (base64 encoded)
- Binary streams and file-like objects
- HTTP Response objects
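As a rough sketch of how several of these sources can go through one entry point: the example below assumes the `markitdown` package is installed, that `MarkItDown.convert()` accepts a path, URL, file URI, or data URI string (as recent versions appear to), and that the result exposes a `text_content` attribute. The helper names `to_data_uri`, `convert_source`, and `convert_bytes` are hypothetical and exist only for illustration.

```python
import base64
import io


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def convert_source(source: str) -> str:
    """Convert a path, URL, file URI, or data URI string to Markdown text.

    Assumes `markitdown` is installed; it is imported lazily so the
    helper above stays usable without it.
    """
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(source)
    return result.text_content


def convert_bytes(data: bytes) -> str:
    """Convert in-memory bytes by wrapping them in a binary stream.

    Assumes the installed version can work out the format from the
    stream contents alone.
    """
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert_stream(io.BytesIO(data))
    return result.text_content
```

For example, `convert_source("report.pdf")` and `convert_source(to_data_uri(html_bytes, "text/html"))` would both go through the same call, where `report.pdf` and `html_bytes` are placeholders for your own input.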
LLM integration
Enhance conversions with AI:
- Image description via GPT-4o or other multimodal models
- Custom prompts for specialized output
- Optimized for token efficiency
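A minimal sketch of wiring a multimodal model into a converter is shown below. It assumes both the `markitdown` and `openai` packages are installed, that the OpenAI client reads its API key from the environment, and that the installed MarkItDown version accepts the `llm_client`, `llm_model`, and `llm_prompt` keyword arguments. The function name `build_llm_converter` is hypothetical.

```python
from typing import Optional


def build_llm_converter(model: str = "gpt-4o", prompt: Optional[str] = None):
    """Return a MarkItDown instance set up to describe images with an LLM."""
    # Lazy imports: both packages are assumed to be installed.
    from markitdown import MarkItDown
    from openai import OpenAI

    kwargs = {"llm_client": OpenAI(), "llm_model": model}
    if prompt is not None:
        # Custom prompt for specialized output (assumed keyword).
        kwargs["llm_prompt"] = prompt
    return MarkItDown(**kwargs)
```

A call such as `build_llm_converter().convert("example.jpg").text_content` would then return Markdown that includes the model's description of the image, where `example.jpg` is a placeholder file name.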
Azure Document Intelligence
Use Microsoft’s Document Intelligence service for advanced PDF and document processing with superior accuracy and layout understanding.
Extensible architecture
- Plugin system for custom converters
- Priority-based converter registration
- Custom document converter support
- Modular optional dependencies
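To illustrate the idea of priority-based registration, here is a small self-contained sketch. It is not MarkItDown's implementation: the `ConverterRegistry` class, its method names, and the rule that a lower priority value is tried first are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    priority: float
    name: str
    accepts: Callable[[str], bool]
    convert: Callable[[bytes], str]


class ConverterRegistry:
    """Tries converters in ascending priority order (lower value first)."""

    def __init__(self) -> None:
        self._registrations: List[Registration] = []

    def register(self, name, accepts, convert, priority: float = 10.0) -> None:
        self._registrations.append(Registration(priority, name, accepts, convert))
        # list.sort is stable, so equal priorities keep registration order.
        self._registrations.sort(key=lambda r: r.priority)

    def convert(self, filename: str, data: bytes) -> Optional[str]:
        for reg in self._registrations:
            if reg.accepts(filename):
                return reg.convert(data)
        return None  # no converter accepted the file


def csv_to_table(data: bytes) -> str:
    """Render simple comma-separated text as a Markdown table."""
    rows = [line.split(",") for line in data.decode("utf-8").splitlines()]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


registry = ConverterRegistry()
# Generic fallback registered first, with a high (late) priority value.
registry.register("plain-text", lambda f: True, lambda d: d.decode("utf-8"), priority=10.0)
# Format-specific converter with a low (early) priority value.
registry.register("csv", lambda f: f.endswith(".csv"), csv_to_table, priority=0.0)
```

Because the specific converter sorts ahead of the generic one, `registry.convert("data.csv", ...)` produces a table while any other file name falls through to the plain-text fallback.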
Smart file detection
Automatic format detection using:
- MIME type analysis
- File extension matching
- Content-based detection with Magika
- Charset normalization
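The layered approach above can be sketched with the standard library alone. This is an illustration, not MarkItDown's detection code: it substitutes a tiny table of file signatures for Magika, and the function name `guess_mime` and the order of the checks are assumptions.

```python
import mimetypes
from typing import Optional

# A few well-known file signatures ("magic bytes"); illustrative, not exhaustive.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",
}


def guess_mime(
    filename: Optional[str] = None,
    head: bytes = b"",
    declared: Optional[str] = None,
) -> Optional[str]:
    """Guess a MIME type from a declared type, a file name, or leading bytes."""
    # 1. A declared MIME type (for example from an HTTP header), minus parameters.
    if declared:
        return declared.split(";")[0].strip().lower()
    # 2. File extension lookup.
    if filename:
        by_extension, _ = mimetypes.guess_type(filename)
        if by_extension:
            return by_extension
    # 3. Content-based fallback on the first bytes of the file.
    for signature, mime in SIGNATURES.items():
        if head.startswith(signature):
            return mime
    return None
```

Checking the extension before the content matters for formats such as `.docx`, `.pptx`, and `.xlsx`, which are ZIP containers and would otherwise be reported as `application/zip` by a signature check alone.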
Why Markdown for LLMs?
Markdown is the ideal format for LLM consumption, and here’s why:
Natural language alignment
Markdown is extremely close to plain text with minimal markup, making it easy for both humans and AI models to parse and understand.
Native LLM support
Mainstream LLMs like OpenAI’s GPT-4o natively “speak” Markdown and often incorporate it into their responses unprompted. This suggests they have been trained on vast amounts of Markdown-formatted text.
Token efficiency
Markdown conventions are highly token-efficient compared to HTML or other markup languages. Fewer tokens mean:
- Lower API costs
- Faster processing
- Ability to fit more content in context windows
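To make the overhead difference concrete, the sketch below renders the same heading and list as HTML and as Markdown, then counts the characters that are markup rather than content. Character counts are only a crude stand-in for tokens, since real counts depend on the tokenizer, and the sample content and function names are invented for this example.

```python
TITLE = "Report"
ITEMS = ["Revenue rose", "Costs fell"]


def as_html(title, items):
    """Render a heading and a bullet list as HTML."""
    list_items = "".join(f"<li>{item}</li>" for item in items)
    return f"<h1>{title}</h1><ul>{list_items}</ul>"


def as_markdown(title, items):
    """Render the same heading and bullet list as Markdown."""
    bullets = "\n".join(f"- {item}" for item in items)
    return f"# {title}\n\n{bullets}"


def markup_overhead(rendered, title, items):
    """Count characters in the rendering that are not part of the content."""
    payload = len(title) + sum(len(item) for item in items)
    return len(rendered) - payload


html_overhead = markup_overhead(as_html(TITLE, ITEMS), TITLE, ITEMS)
markdown_overhead = markup_overhead(as_markdown(TITLE, ITEMS), TITLE, ITEMS)
print(f"HTML overhead: {html_overhead} chars, Markdown overhead: {markdown_overhead} chars")
```

The HTML version spends most of its extra characters on paired opening and closing tags, while the Markdown version needs only a short prefix per line.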
Structure preservation
Unlike plain text, Markdown preserves document structure (headings, lists, tables) that helps LLMs understand document organization and relationships between content.
MCP server integration
The MCP server allows AI assistants to convert documents on the fly, enabling powerful document analysis workflows directly within your AI chat interface.
Get started
Ready to start?
Follow the quickstart guide to convert your first document