This guide demonstrates how to ingest and analyze data directly from S3 using dbt with DuckDB or MotherDuck. You’ll learn how to query Parquet files in S3 without copying data and build analytical models on top of that data.

Overview

This approach showcases a modern analytics workflow where you can:
  • Query Parquet files directly from S3 without data movement
  • Build dbt models locally using DuckDB
  • Deploy the same models to MotherDuck for cloud analytics
  • Leverage DuckDB’s external file capabilities for zero-copy analytics
The project queries data directly from S3 using DuckDB’s external file support—no data copying required.
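To see this in action outside of dbt, you can run the underlying query pattern directly in a DuckDB shell. This is a minimal sketch: it assumes the bucket is publicly readable and relies on DuckDB autoloading the httpfs extension for s3:// paths.

```sql
-- Query the Parquet file in S3 directly; DuckDB reads only the
-- columns and row groups it needs, nothing is downloaded up front.
SELECT type, count(*) AS n
FROM read_parquet('s3://us-prd-motherduck-open-datasets/hacker_news/parquet/hacker_news_2024_2025.parquet')
GROUP BY type
ORDER BY n DESC;
```

The dbt models below generate queries of exactly this shape, with the S3 path filled in from the source definition.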

Prerequisites

1

Install dbt-duckdb

pip install dbt-duckdb
2

Set up MotherDuck

Create a MotherDuck account and obtain your access token
3

Create Database

Create the target database in MotherDuck (via web UI or SQL):
CREATE DATABASE IF NOT EXISTS hacker_news_stats;

Project Structure

The dbt project includes:
  • models/sources.yml: Defines S3 Parquet files as dbt sources
  • models/top_story_by_comments.sql: Top HN story by comments per month
  • models/duckdb_keyword_mentions.sql: Monthly DuckDB mentions in HN
  • models/top_domains.sql: Top 20 domains from HN story URLs
  • profiles.yml: Configuration for local and cloud targets

Configuration

profiles.yml

Configure both local (DuckDB) and production (MotherDuck) targets:
profiles.yml
dbt_ingestion_s3:
  outputs:
    local:
      type: duckdb
      path: local.db 
      threads: 1
    prod:
      type: duckdb
      path: "md:hacker_news_stats"
      threads: 1
  target: local
  • local target: Uses a local DuckDB file (local.db) for development
  • prod target: Connects to MotherDuck using the md: prefix
  • default target: Set to local for safe development

Source Definition

Define external S3 data as a dbt source:
models/sources.yml
version: 2

sources:
  - name: hn_external
    meta:
      external_location: "s3://us-prd-motherduck-open-datasets/hacker_news/parquet/{name}.parquet"
    tables:
      - name: hacker_news_2024_2025
The {name} placeholder allows dbt to dynamically construct the file path for each table in the source.
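Concretely, a reference like {{ source('hn_external', 'hacker_news_2024_2025') }} compiles to (approximately) a query against the external location with {name} replaced by the table name:

```sql
-- Rendered form of the source reference (approximate): the {name}
-- placeholder in external_location becomes the table name.
SELECT count(*)
FROM 's3://us-prd-motherduck-open-datasets/hacker_news/parquet/hacker_news_2024_2025.parquet';
```

DuckDB treats a quoted Parquet path as a table, so no staging step is needed.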

dbt Models

Top Story by Comments

Finds the most-commented Hacker News story for each month:
models/top_story_by_comments.sql
WITH ranked_stories AS (
    SELECT
        title,
        'https://news.ycombinator.com/item?id=' || id AS hn_url,
        descendants AS nb_comments,
        YEAR(timestamp) AS year,
        MONTH(timestamp) AS month,
        ROW_NUMBER() OVER (
            PARTITION BY YEAR(timestamp), MONTH(timestamp)
            ORDER BY descendants DESC
        ) AS rn
    FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
    WHERE type = 'story'
)
SELECT
    year,
    month,
    title,
    hn_url,
    nb_comments
FROM ranked_stories
WHERE rn = 1
ORDER BY year, month

DuckDB Keyword Mentions

Counts monthly mentions of “duckdb” in Hacker News titles and text (ILIKE makes the match case-insensitive, so “DuckDB” and “duckdb” both count):
models/duckdb_keyword_mentions.sql
SELECT
    YEAR(timestamp) AS year,
    MONTH(timestamp) AS month,
    COUNT(*) AS keyword_mentions
FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
WHERE
    title ILIKE '%duckdb%' OR text ILIKE '%duckdb%'
GROUP BY year, month
ORDER BY year ASC, month ASC

Top Domains

Extracts and ranks the top 20 domains from HN story URLs:
models/top_domains.sql
SELECT
    regexp_extract(url, 'http[s]?://([^/]+)/', 1) AS domain,
    count(*) AS count
FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
WHERE url IS NOT NULL AND regexp_extract(url, 'http[s]?://([^/]+)/', 1) != ''
GROUP BY domain
ORDER BY count DESC
LIMIT 20
All models use the {{ source() }} function to reference the S3 data defined in sources.yml.
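To see how the domain extraction in top_domains.sql behaves, you can try the regular expression on a sample URL in a DuckDB shell. Capture group 1 is everything between the scheme and the first slash:

```sql
-- regexp_extract(string, pattern, group) returns capture group 1,
-- i.e. the host portion of the URL.
SELECT regexp_extract('https://news.ycombinator.com/item?id=1',
                      'http[s]?://([^/]+)/', 1);
-- → news.ycombinator.com
```

Note that the pattern requires a trailing slash after the host, so bare-domain URLs without a path will not match.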

Running dbt

Local Development (DuckDB)

Run models locally for testing and development:
dbt run --target local
This builds all models in your local DuckDB database (local.db).

Production Deployment (MotherDuck)

1

Set MotherDuck Token

Export your MotherDuck token as an environment variable:
export MOTHERDUCK_TOKEN=your_token_here
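dbt-duckdb picks up MOTHERDUCK_TOKEN from the environment automatically. If you prefer to be explicit, a common alternative (shown here as a sketch) is to pass the token in the connection string via dbt's env_var function:

```yaml
# Optional: reference the token explicitly in profiles.yml instead of
# relying on the environment variable being detected automatically.
prod:
  type: duckdb
  path: "md:hacker_news_stats?motherduck_token={{ env_var('MOTHERDUCK_TOKEN') }}"
  threads: 1
```

Avoid hardcoding the token itself in profiles.yml, since that file is often committed to version control.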
2

Run dbt

Build models in MotherDuck:
dbt run --target prod
3

Verify

Query your models in the MotherDuck web UI or via a SQL client.
Ensure the hacker_news_stats database exists in MotherDuck before running dbt with the prod target.

Key Features

Zero-Copy Analytics

Query S3 Parquet files directly without data duplication

Unified Workflow

Use the same dbt project for local and cloud environments

DuckDB Performance

Leverage DuckDB’s optimized Parquet reader and columnar processing

S3 Integration

Native support for reading external files from S3

Development Workflow

1

Develop Locally

Write and test models using dbt run --target local
2

Validate Results

Query local DuckDB to verify model logic:
duckdb local.db "SELECT * FROM top_story_by_comments LIMIT 10"
3

Deploy to Production

Once validated, deploy to MotherDuck:
dbt run --target prod
4

Share Results

Collaborate with your team using MotherDuck’s cloud database

Example Use Cases

Track how often specific keywords (like “duckdb”) appear in HN stories over time to measure community interest and engagement.
Identify which stories generate the most discussion by analyzing comment counts, helping content creators understand what resonates.
Discover which websites are most frequently shared on HN, revealing trusted sources and trending platforms.

Best Practices

Use Explicit Schemas

Define clear column names and types in your models for better documentation

Test Locally First

Always validate models with the local target before deploying to production

Leverage Window Functions

Use DuckDB’s efficient window functions for ranking and partitioning

Optimize Filters

Apply WHERE clauses early to reduce data scanned from S3
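For example, filtering on a column with useful Parquet row-group statistics lets DuckDB skip entire chunks of the file in S3. A sketch using the columns from the models above (the cutoff date is illustrative):

```sql
-- An early filter on timestamp lets DuckDB prune row groups using
-- Parquet min/max statistics, reducing the bytes read from S3.
SELECT title, descendants
FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
WHERE type = 'story'
  AND timestamp >= TIMESTAMP '2025-01-01'
```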

Next Steps

dbt-duckdb Documentation

Learn more about the dbt-duckdb adapter

DuckDB Extensions

Explore DuckDB extensions for S3, httpfs, and more

MotherDuck Docs

Comprehensive MotherDuck documentation

dbt Best Practices

dbt style guide and conventions
