Document Loader Types
File Loaders
PDF, DOCX, CSV, and other file formats
Web Loaders
Web pages, APIs, and online services
Cloud Storage
S3, Azure Blob Storage, Google Cloud Storage
Databases
SQL, NoSQL, and specialized databases
Installation
Most document loaders are in the @langchain/community package, which can be installed with npm install @langchain/community @langchain/core.
File Loaders
PDF Loader
Load and parse PDF documents.
Split PDF by Pages
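To make the per-page option concrete, here is a minimal sketch of the two output shapes: one document per page, or one document for the whole file. It does not parse a real PDF. The page texts are supplied directly, and the `Doc` interface, the `pagesToDocuments` helper, and the `totalPages` and `loc.pageNumber` metadata fields are illustrative assumptions rather than the library's implementation. In LangChain's `PDFLoader`, the corresponding switch is the `splitPages` option.

```typescript
// Minimal stand-in for a loaded document (pageContent plus metadata).
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: turn already-extracted page texts into documents.
function pagesToDocuments(pages: string[], source: string, splitPages = true): Doc[] {
  if (!splitPages) {
    // One document for the whole file, pages joined by blank lines.
    return [
      { pageContent: pages.join("\n\n"), metadata: { source, totalPages: pages.length } },
    ];
  }
  // One document per page, with a 1-based page number for traceability.
  return pages.map((text, index) => ({
    pageContent: text,
    metadata: { source, totalPages: pages.length, loc: { pageNumber: index + 1 } },
  }));
}

const pages = ["Intro text", "Methods text", "Results text"];
const perPage = pagesToDocuments(pages, "report.pdf");
const wholeFile = pagesToDocuments(pages, "report.pdf", false);
console.log(perPage.length, wholeFile.length); // 3 1
```

Per-page documents keep page numbers available for citations; a single document per file can suit short PDFs that will be re-split by a text splitter anyway.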
DOCX Loader
Load Microsoft Word documents.
CSV Loader
Load CSV files with customizable parsing.
Custom CSV Column
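The effect of choosing a single column can be sketched without any library. In this sketch every CSV row becomes one document: with no column given, the content is a list of "header: value" lines, and with a column given, the content is only that cell. The `csvToDocuments` helper, the `Doc` interface, and the naive parsing (no quoted fields) are assumptions made for illustration; a real CSV loader relies on a proper CSV parser.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper. Naive parsing: no quoted fields or escaped separators.
function csvToDocuments(csv: string, source: string, column?: string, separator = ","): Doc[] {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const header = lines[0].split(separator).map((name) => name.trim());
  if (column !== undefined && header.indexOf(column) === -1) {
    throw new Error(`Column "${column}" not found in ${source}`);
  }
  return lines.slice(1).map((line, index) => {
    const cells = line.split(separator).map((cell) => cell.trim());
    const pageContent =
      column !== undefined
        ? cells[header.indexOf(column)] ?? "" // only the chosen column
        : header.map((name, i) => `${name}: ${cells[i] ?? ""}`).join("\n"); // every column
    // Record the source and the 1-based data row for traceability.
    return { pageContent, metadata: { source, line: index + 1 } };
  });
}

const csvText = "name,role\nAda,engineer\nGrace,admiral";
const allColumns = csvToDocuments(csvText, "people.csv");
const namesOnly = csvToDocuments(csvText, "people.csv", "name");
console.log(allColumns[0].pageContent); // prints "name: Ada" and "role: engineer" on separate lines
console.log(namesOnly.map((doc) => doc.pageContent)); // the two names, Ada and Grace
```

Selecting one column is useful when only a single free-text field should be embedded and the remaining columns would add noise.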
Text Loader
Load plain text files.
JSON Loader
Load and parse JSON files.
Directory Loader
Load all files from a directory.
Web Loaders
Cheerio Web Scraper
Load and parse HTML from URLs.
Custom Selector
Puppeteer Web Scraper
Load dynamic web pages that require JavaScript.
Firecrawl
Use Firecrawl for production-ready web scraping.
GitHub Loader
Load files from GitHub repositories.
Notion
Load pages from Notion.
Cloud Storage
AWS S3
Load files from Amazon S3.
Azure Blob Storage
Load files from Azure Blob Storage.
Google Cloud Storage
Load files from Google Cloud Storage.
Specialized Loaders
Unstructured API
Use Unstructured.io for complex document parsing.
Obsidian
Load Obsidian vault notes.
ChatGPT Conversation
Load exported ChatGPT conversations.
Audio Transcription (Whisper)
Transcribe audio files using OpenAI Whisper.
Additional Loaders
Confluence
Load pages from Atlassian Confluence
Figma
Load designs from Figma files
Airtable
Load records from Airtable bases
GitBook
Load documentation from GitBook
Apify
Load data from Apify scrapers
AssemblyAI
Transcribe audio with AssemblyAI
Common Patterns
Load and Split
Combine loading with text splitting.
Load into Vector Store
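The flow from loaded documents to chunks to a searchable store can be sketched end to end without external services. Everything here is a stand-in chosen for illustration: `splitDocuments` is a fixed-size character splitter, `embed` counts words instead of calling an embedding model, and `ToyVectorStore` is a hypothetical in-memory class, not one of the library's vector stores. The point is only the shape of the pipeline and that metadata travels with each chunk.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Fixed-size character splitter with overlap (stand-in for a real text splitter).
function splitDocuments(docs: Doc[], chunkSize: number, chunkOverlap: number): Doc[] {
  if (chunkOverlap >= chunkSize) throw new Error("chunkOverlap must be smaller than chunkSize");
  const chunks: Doc[] = [];
  for (const doc of docs) {
    for (let start = 0; start < doc.pageContent.length; start += chunkSize - chunkOverlap) {
      chunks.push({
        pageContent: doc.pageContent.slice(start, start + chunkSize),
        metadata: { ...doc.metadata, start }, // keep source metadata on every chunk
      });
      if (start + chunkSize >= doc.pageContent.length) break; // last chunk reached the end
    }
  }
  return chunks;
}

// Toy "embedding": word counts. A real pipeline would call an embedding model.
function embed(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse word-count vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, key) => {
    dot += value * (b.get(key) ?? 0);
    normA += value * value;
  });
  b.forEach((value) => {
    normB += value * value;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Hypothetical in-memory store: keeps each chunk next to its vector.
class ToyVectorStore {
  private entries: { doc: Doc; vector: Map<string, number> }[] = [];

  addDocuments(docs: Doc[]): void {
    for (const doc of docs) this.entries.push({ doc, vector: embed(doc.pageContent) });
  }

  similaritySearch(query: string, k: number): Doc[] {
    const queryVector = embed(query);
    return this.entries
      .map((entry) => ({ doc: entry.doc, score: cosine(queryVector, entry.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((scored) => scored.doc);
  }
}

const loaded: Doc[] = [
  {
    pageContent: "Cats purr when they are content. Dogs bark at strangers.",
    metadata: { source: "pets.txt" },
  },
];
const pieces = splitDocuments(loaded, 32, 0);
const store = new ToyVectorStore();
store.addDocuments(pieces);
const hits = store.similaritySearch("why do dogs bark", 1);
console.log(hits[0].pageContent.trim(), hits[0].metadata.source);
```

Because each chunk copies its parent's metadata, a search hit can still be traced back to the file it came from.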
Custom Metadata
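One way to attach custom metadata is to map over the loaded documents and merge extra fields into each one. The sketch below assumes a minimal `Doc` shape and a hypothetical `withMetadata` helper; neither comes from the library. It is written so that fields set by the loader, such as `source`, are never overwritten by the extras.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: add extra metadata fields without mutating the input documents.
function withMetadata(docs: Doc[], extra: Record<string, unknown>): Doc[] {
  return docs.map((doc) => ({
    pageContent: doc.pageContent,
    // Loader-provided fields are spread last, so they win over the extras.
    metadata: { ...extra, ...doc.metadata },
  }));
}

const rawDocs: Doc[] = [{ pageContent: "Quarterly summary", metadata: { source: "q3.pdf" } }];
const tagged = withMetadata(rawDocs, { category: "finance", source: "ignored" });
console.log(tagged[0].metadata.category, tagged[0].metadata.source); // finance q3.pdf
```

Returning new objects instead of editing documents in place keeps the originals reusable, for example when the same loaded set is tagged differently for two indexes.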
Best Practices
- Choose the right loader: Match the loader to your data source and format
- Handle errors gracefully: Wrap loader calls in try-catch blocks
- Split large documents: Use text splitters for better chunk sizing
- Preserve metadata: Keep source information for traceability
- Batch processing: Load multiple documents efficiently
- Cache when possible: Store loaded documents to avoid redundant processing
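The error-handling, batching, and caching practices above can be sketched together. The sketch assumes only that a loader exposes an async `load()` method; the `LoaderLike` interface, `loadSafely`, `loadBatch`, and the in-memory `docCache` are hypothetical names, and the two loaders are fakes that stand in for real ones.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Minimal loader shape assumed here: anything with an async load() method.
interface LoaderLike {
  load(): Promise<Doc[]>;
}

// In-memory cache keyed by a caller-chosen source identifier.
const docCache = new Map<string, Doc[]>();

// Load one source, reusing cached results and turning failures into an empty result.
async function loadSafely(key: string, loader: LoaderLike): Promise<Doc[]> {
  const cached = docCache.get(key);
  if (cached !== undefined) return cached;
  try {
    const docs = await loader.load();
    docCache.set(key, docs); // only successful loads are cached, so failures are retried later
    return docs;
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    console.warn(`Skipping ${key}: ${reason}`);
    return [];
  }
}

// Load several sources concurrently; one failing loader does not abort the batch.
async function loadBatch(sources: Record<string, LoaderLike>): Promise<Doc[]> {
  const results = await Promise.all(
    Object.keys(sources).map((key) => loadSafely(key, sources[key])),
  );
  return results.reduce<Doc[]>((all, docs) => all.concat(docs), []);
}

// Fake loaders for the demonstration.
let loadCalls = 0;
const goodLoader: LoaderLike = {
  load: async () => {
    loadCalls += 1;
    return [{ pageContent: "hello", metadata: { source: "good.txt" } }];
  },
};
const brokenLoader: LoaderLike = {
  load: async () => {
    throw new Error("file not found");
  },
};

const batchDemo = (async () => {
  const first = await loadBatch({ "good.txt": goodLoader, "missing.txt": brokenLoader });
  const second = await loadBatch({ "good.txt": goodLoader }); // served from the cache
  return { first: first.length, second: second.length, loadCalls };
})();
batchDemo.then((summary) => console.log(summary));
```

Returning an empty array on failure keeps a batch running, but it also hides the problem from callers; pipelines that must not silently drop sources may prefer to collect the errors and report them alongside the documents.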
Document Structure
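All loaders return documents with the same structure: a pageContent string containing the extracted text, and a metadata object describing where that text came from.
Next Steps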
Text Splitters API
Split documents into chunks
Vector Stores
Store loaded documents as embeddings
Embeddings
Convert documents to embeddings
Retrieval Guide
Build RAG applications
