Overview
This approach showcases a modern analytics workflow where you can:- Query Parquet files directly from S3 without data movement
- Build dbt models locally using DuckDB
- Deploy the same models to MotherDuck for cloud analytics
- Leverage DuckDB’s external file capabilities for zero-copy analytics
The project queries data directly from S3 using DuckDB’s external file support—no data copying required.
Prerequisites
Set up MotherDuck
Create a MotherDuck account and obtain your access token
Project Structure
The dbt project includes:- models/sources.yml: Defines S3 Parquet files as dbt sources
- models/top_story_by_comments.sql: Top HN story by comments per month
- models/duckdb_keyword_mentions.sql: Monthly DuckDB mentions in HN
- models/top_domains.sql: Top 20 domains from HN story URLs
- profiles.yml: Configuration for local and cloud targets
Configuration
profiles.yml
Configure both local (DuckDB) and production (MotherDuck) targets:profiles.yml
Understanding the configuration
Understanding the configuration
- local target: Uses a local DuckDB file (
local.db) for development - prod target: Connects to MotherDuck using the
md:prefix - default target: Set to
localfor safe development
Source Definition
Define external S3 data as a dbt source:models/sources.yml
dbt Models
Top Story by Comments
Finds the most-commented Hacker News story for each month:models/top_story_by_comments.sql
DuckDB Keyword Mentions
Counts monthly mentions of “duckdb” in Hacker News stories:models/duckdb_keyword_mentions.sql
Top Domains
Extracts and ranks the top 20 domains from HN story URLs:models/top_domains.sql
All models use the
{{ source() }} function to reference the S3 data defined in sources.yml.Running dbt
Local Development (DuckDB)
Run models locally for testing and development:local.db).
Production Deployment (MotherDuck)
Key Features
Zero-Copy Analytics
Query S3 Parquet files directly without data duplication
Unified Workflow
Use the same dbt project for local and cloud environments
DuckDB Performance
Leverage DuckDB’s optimized Parquet reader and columnar processing
S3 Integration
Native support for reading external files from S3
Development Workflow
Example Use Cases
Hacker News Trend Analysis
Hacker News Trend Analysis
Track how often specific keywords (like “duckdb”) appear in HN stories over time to measure community interest and engagement.
Content Performance
Content Performance
Identify which stories generate the most discussion by analyzing comment counts, helping content creators understand what resonates.
Domain Popularity
Domain Popularity
Discover which websites are most frequently shared on HN, revealing trusted sources and trending platforms.
Best Practices
Use Explicit Schemas
Define clear column names and types in your models for better documentation
Test Locally First
Always validate models with the local target before deploying to production
Leverage Window Functions
Use DuckDB’s efficient window functions for ranking and partitioning
Optimize Filters
Apply WHERE clauses early to reduce data scanned from S3
Next Steps
dbt-duckdb Documentation
Learn more about the dbt-duckdb adapter
DuckDB Extensions
Explore DuckDB extensions for S3, httpfs, and more
MotherDuck Docs
Comprehensive MotherDuck documentation
dbt Best Practices
dbt style guide and conventions