Introduction to MarkItDown

What is MarkItDown?

MarkItDown is a lightweight Python utility for converting various file formats to Markdown, designed for use with Large Language Models (LLMs) and text analysis pipelines. Built by the Microsoft AutoGen team, it preserves important document structure and content while producing clean, LLM-friendly Markdown output. The output is often reasonably presentable and human-friendly, but MarkItDown is optimized for consumption by text analysis tools, not for high-fidelity document conversion for human readers.

Quick start

Get up and running in minutes with your first conversion

Installation

Detailed setup instructions for all environments

Python API

Integrate MarkItDown into your applications

CLI reference

Command-line interface documentation

Supported formats

MarkItDown currently supports conversion from a wide range of file types:

Documents

PDF files
Word documents (.docx)
PowerPoint (.pptx)
Excel spreadsheets (.xlsx, .xls)
EPub books

Media

Images (JPEG, PNG)
EXIF metadata extraction
OCR via LLM integration
Audio transcription

Web & text

HTML pages
YouTube videos
Wikipedia articles
CSV, JSON, XML
ZIP archives

MarkItDown can also convert Outlook messages, Jupyter notebooks, RSS feeds, and more.

Key features

Structure preservation

MarkItDown maintains important document structure including:

Headings and hierarchy
Lists (ordered and unordered)
Tables with proper formatting
Links and references
Code blocks and formatting
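To make "tables with proper formatting" concrete, the sketch below renders tabular rows as a Markdown pipe table using only the standard library. It illustrates the kind of output meant here and is not MarkItDown's own implementation; `rows_to_markdown_table` and the sample data are hypothetical.

```python
import csv
import io


def rows_to_markdown_table(rows):
    """Render rows (first row is the header) as a Markdown pipe table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        # The separator row tells Markdown renderers where the header ends.
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Hypothetical sample data standing in for a spreadsheet or CSV input.
sample_csv = "name,format\nreport,pdf\nslides,pptx\n"
rows = list(csv.reader(io.StringIO(sample_csv)))
table = rows_to_markdown_table(rows)
print(table)
```

The header and separator rows keep column meaning attached to each cell, which flattened plain text would lose.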

Flexible input sources

Convert documents from multiple sources:

Local file paths
URLs (HTTP/HTTPS)
File URIs
Data URIs (base64 encoded)
Binary streams and file-like objects
HTTP Response objects
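The string-based sources in the list above differ mainly in their URI scheme. The sketch below shows one way to tell them apart and to build a base64 data URI with the standard library. `classify_source` and `make_data_uri` are hypothetical helpers written for illustration; they are not part of MarkItDown's API.

```python
import base64
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Label a source string by its URI scheme (hypothetical helper)."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if scheme == "file":
        return "file_uri"
    if scheme == "data":
        return "data_uri"
    # Anything else, including Windows drive letters, is treated as a path.
    return "local_path"


def make_data_uri(payload: bytes, mime_type: str = "text/plain") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


data_uri = make_data_uri(b"hello")
for source in ("report.pdf", "https://example.com/page.html",
               "file:///tmp/report.pdf", data_uri):
    print(classify_source(source), "<-", source)
```

Binary streams and HTTP response objects are not strings, so they would need to be handled separately from this scheme check.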

LLM integration

Enhance conversions with AI:

Image description via GPT-4o or other multimodal models
Custom prompts for specialized output
Optimized for token efficiency

Azure Document Intelligence

Optionally route PDFs and other documents through Microsoft’s Azure Document Intelligence service, which can improve accuracy and layout understanding on complex documents.

Extensible architecture

Plugin system for custom converters
Priority-based converter registration
Custom document converter support
Modular optional dependencies
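The idea behind priority-based registration can be sketched with a small toy registry. This is an illustration of the general pattern only: the class name, method names, and the convention that lower priority values are tried first are all assumptions of this sketch, not MarkItDown's actual plugin interface.

```python
from typing import Callable, List, Optional, Tuple

# (priority, name, accepts, convert); all names here are hypothetical.
Entry = Tuple[float, str, Callable[[str], bool], Callable[[str], str]]


class ConverterRegistry:
    """Toy registry that tries converters in ascending priority order."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def register(self, name, accepts, convert, priority: float = 10.0) -> None:
        self._entries.append((priority, name, accepts, convert))
        # list.sort is stable, so equal priorities keep registration order.
        self._entries.sort(key=lambda entry: entry[0])

    def convert(self, filename: str) -> Optional[str]:
        for _priority, _name, accepts, convert in self._entries:
            if accepts(filename):
                return convert(filename)
        return None  # no registered converter accepted the input


registry = ConverterRegistry()
# A catch-all converter registered first, with a low-precedence priority.
registry.register("generic", lambda f: True, lambda f: f"generic:{f}", priority=10.0)
# A format-specific converter registered later but tried earlier.
registry.register("csv", lambda f: f.endswith(".csv"), lambda f: f"csv:{f}", priority=0.0)

csv_result = registry.convert("data.csv")
txt_result = registry.convert("notes.txt")
print(csv_result, txt_result)
```

Ordering by priority rather than registration order lets a plugin add a specific converter without it being shadowed by a generic fallback that was registered earlier.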

Smart file detection

Automatic format detection using:

MIME type analysis
File extension matching
Content-based detection with Magika
Charset normalization
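Combining these signals might look like the sketch below, which tries the file extension first and falls back to inspecting leading bytes. The magic-byte table is a deliberately tiny stand-in for a content-based detector such as Magika, and `guess_format` is a hypothetical helper, not MarkItDown's detection logic.

```python
import mimetypes
from typing import Optional

# Minimal stand-in for content-based detection; real detectors cover far more.
MAGIC_PREFIXES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def guess_format(filename: str, head: bytes) -> Optional[str]:
    """Guess a MIME type from the extension, then from leading bytes."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is not None:
        return mime
    for prefix, sniffed in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return sniffed
    return None  # neither signal was conclusive


print(guess_format("page.html", b"<html>"))
print(guess_format("download", b"%PDF-1.7 sample"))
print(guess_format("unknown", b"\x00\x01"))
```

Falling back to content matters for inputs without a useful name, such as streams or downloads with no extension.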

Why Markdown for LLMs?

Markdown is well suited to LLM consumption for several reasons:

Natural language alignment

Markdown is extremely close to plain text with minimal markup, making it easy for both humans and AI models to parse and understand.

Native LLM support

Mainstream LLMs like OpenAI’s GPT-4o natively “speak” Markdown and often incorporate it into their responses unprompted. This suggests they have been trained on vast amounts of Markdown-formatted text.

Token efficiency

Markdown conventions are highly token-efficient compared to HTML or other markup languages. Fewer tokens means:

Lower API costs
Faster processing
Ability to fit more content in context windows
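A rough way to see the difference is to write the same content in both formats and compare sizes. The sketch below uses character count as a crude proxy; real token counts depend on the model's tokenizer, and the sample content is invented for illustration.

```python
# The same hypothetical content expressed in HTML and in Markdown.
html = (
    "<h1>Quarterly report</h1>"
    "<ul><li>Revenue grew</li><li>Costs fell</li></ul>"
    "<p><strong>Summary:</strong> margins improved.</p>"
)
markdown = (
    "# Quarterly report\n"
    "- Revenue grew\n"
    "- Costs fell\n\n"
    "**Summary:** margins improved."
)

html_chars = len(html)
markdown_chars = len(markdown)
# Fraction of characters saved by the Markdown version.
savings = 1 - markdown_chars / html_chars

print(f"HTML: {html_chars} chars, Markdown: {markdown_chars} chars")
print(f"Markdown is about {savings:.0%} smaller by character count")
```

The gap comes mostly from closing tags and wrapper elements, which Markdown expresses with one or two leading characters.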

Structure preservation

Unlike plain text, Markdown preserves document structure (headings, lists, tables) that helps LLMs understand document organization and relationships between content.

# Heading
- List item 1
- List item 2

**Bold text** and *italic text*
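Because headings are marked explicitly, the document outline can be recovered with very little code. The sketch below extracts heading levels and titles from the sample above with a regular expression. `extract_outline` is a hypothetical helper, and this simple version does not skip `#` lines inside fenced code blocks.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_outline(markdown_text: str):
    """Return (level, title) pairs for each ATX-style heading."""
    outline = []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


sample = (
    "# Heading\n"
    "- List item 1\n"
    "- List item 2\n"
    "\n"
    "**Bold text** and *italic text*\n"
)
outline = extract_outline(sample)
print(outline)
```

The same text with its markup stripped would give no reliable way to tell the heading from the list items that follow it.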

MCP server integration

MarkItDown offers an MCP (Model Context Protocol) server for seamless integration with LLM applications like Claude Desktop. See markitdown-mcp for more information.

The MCP server allows AI assistants to convert documents on-the-fly, enabling powerful document analysis workflows directly within your AI chat interface.

Get started

Install MarkItDown

pip install 'markitdown[all]'

Convert your first document

markitdown document.pdf > output.md

Explore the API

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

Ready to start?

Follow the quickstart guide to convert your first document
