This guide demonstrates how to ingest and analyze data directly from S3 using dbt with DuckDB or MotherDuck. You’ll learn how to query Parquet files in S3 without copying data and build analytical models on top of that data.

Overview

This approach showcases a modern analytics workflow where you can:
  • Query Parquet files directly from S3 without data movement
  • Build dbt models locally using DuckDB
  • Deploy the same models to MotherDuck for cloud analytics
  • Leverage DuckDB’s external file capabilities for zero-copy analytics
The project queries data directly from S3 using DuckDB’s external file support—no data copying required.
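To see this in action outside of dbt, you can run the underlying query pattern directly in a DuckDB shell. This is a minimal sketch: it assumes the bucket is publicly readable and relies on DuckDB autoloading the httpfs extension for s3:// paths.

```sql
-- Query the Parquet file in S3 directly; DuckDB reads only the
-- columns and row groups it needs, nothing is downloaded up front.
SELECT type, count(*) AS n
FROM read_parquet('s3://us-prd-motherduck-open-datasets/hacker_news/parquet/hacker_news_2024_2025.parquet')
GROUP BY type
ORDER BY n DESC;
```

The dbt models below generate queries of exactly this shape, with the S3 path filled in from the source definition.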

Prerequisites

1

Install dbt-duckdb

pip install dbt-duckdb
2

Set up MotherDuck

Create a MotherDuck account and obtain your access token
3

Create Database

Create the target database in MotherDuck (via web UI or SQL):
CREATE DATABASE IF NOT EXISTS hacker_news_stats;

Project Structure

The dbt project includes:
  • models/sources.yml: Defines S3 Parquet files as dbt sources
  • models/top_story_by_comments.sql: Top HN story by comments per month
  • models/duckdb_keyword_mentions.sql: Monthly DuckDB mentions in HN
  • models/top_domains.sql: Top 20 domains from HN story URLs
  • profiles.yml: Configuration for local and cloud targets

Configuration

profiles.yml

Configure both local (DuckDB) and production (MotherDuck) targets:
profiles.yml
dbt_ingestion_s3:
  outputs:
    local:
      type: duckdb
      path: local.db 
      threads: 1
    prod:
      type: duckdb
      path: "md:hacker_news_stats"
      threads: 1
  target: local
  • local target: Uses a local DuckDB file (local.db) for development
  • prod target: Connects to MotherDuck using the md: prefix
  • default target: Set to local for safe development

Source Definition

Define external S3 data as a dbt source:
models/sources.yml
version: 2

sources:
  - name: hn_external
    meta:
      external_location: "s3://us-prd-motherduck-open-datasets/hacker_news/parquet/{name}.parquet"
    tables:
      - name: hacker_news_2024_2025
The {name} placeholder allows dbt to dynamically construct the file path for each table in the source.
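Concretely, a reference like {{ source('hn_external', 'hacker_news_2024_2025') }} compiles to (approximately) a query against the external location with {name} replaced by the table name:

```sql
-- Rendered form of the source reference (approximate): the {name}
-- placeholder in external_location becomes the table name.
SELECT count(*)
FROM 's3://us-prd-motherduck-open-datasets/hacker_news/parquet/hacker_news_2024_2025.parquet';
```

DuckDB treats a quoted Parquet path as a table, so no staging step is needed.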

dbt Models

Top Story by Comments

Finds the most-commented Hacker News story for each month:
models/top_story_by_comments.sql
WITH ranked_stories AS (
    SELECT
        title,
        'https://news.ycombinator.com/item?id=' || id AS hn_url,
        descendants AS nb_comments,
        YEAR(timestamp) AS year,
        MONTH(timestamp) AS month,
        ROW_NUMBER() OVER (
            PARTITION BY YEAR(timestamp), MONTH(timestamp)
            ORDER BY descendants DESC
        ) AS rn
    FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
    WHERE type = 'story'
)
SELECT
    year,
    month,
    title,
    hn_url,
    nb_comments
FROM ranked_stories
WHERE rn = 1
ORDER BY year, month

DuckDB Keyword Mentions

Counts monthly mentions of “duckdb” in Hacker News titles and text (ILIKE makes the match case-insensitive, so “DuckDB” and “duckdb” both count):
models/duckdb_keyword_mentions.sql
SELECT
    YEAR(timestamp) AS year,
    MONTH(timestamp) AS month,
    COUNT(*) AS keyword_mentions
FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
WHERE
    title ILIKE '%duckdb%' OR text ILIKE '%duckdb%'
GROUP BY year, month
ORDER BY year ASC, month ASC

Top Domains

Extracts and ranks the top 20 domains from HN story URLs:
models/top_domains.sql
SELECT
    regexp_extract(url, 'http[s]?://([^/]+)/', 1) AS domain,
    count(*) AS count
FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
WHERE url IS NOT NULL AND regexp_extract(url, 'http[s]?://([^/]+)/', 1) != ''
GROUP BY domain
ORDER BY count DESC
LIMIT 20
All models use the {{ source() }} function to reference the S3 data defined in sources.yml.
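To see how the domain extraction in top_domains.sql behaves, you can try the regular expression on a sample URL in a DuckDB shell. Capture group 1 is everything between the scheme and the first slash:

```sql
-- regexp_extract(string, pattern, group) returns capture group 1,
-- i.e. the host portion of the URL.
SELECT regexp_extract('https://news.ycombinator.com/item?id=1',
                      'http[s]?://([^/]+)/', 1);
-- → news.ycombinator.com
```

Note that the pattern requires a trailing slash after the host, so bare-domain URLs without a path will not match.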

Running dbt

Local Development (DuckDB)

Run models locally for testing and development:
dbt run --target local
This builds all models in your local DuckDB database (local.db).

Production Deployment (MotherDuck)

1

Set MotherDuck Token

Export your MotherDuck token as an environment variable:
export MOTHERDUCK_TOKEN=your_token_here
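dbt-duckdb picks up MOTHERDUCK_TOKEN from the environment automatically. If you prefer to be explicit, a common alternative (shown here as a sketch) is to pass the token in the connection string via dbt's env_var function:

```yaml
# Optional: reference the token explicitly in profiles.yml instead of
# relying on the environment variable being detected automatically.
prod:
  type: duckdb
  path: "md:hacker_news_stats?motherduck_token={{ env_var('MOTHERDUCK_TOKEN') }}"
  threads: 1
```

Avoid hardcoding the token itself in profiles.yml, since that file is often committed to version control.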
2

Run dbt

Build models in MotherDuck:
dbt run --target prod
3

Verify

Query your models in the MotherDuck web UI or via a SQL client.
Ensure the hacker_news_stats database exists in MotherDuck before running dbt with the prod target.

Key Features

Zero-Copy Analytics

Query S3 Parquet files directly without data duplication

Unified Workflow

Use the same dbt project for local and cloud environments

DuckDB Performance

Leverage DuckDB’s optimized Parquet reader and columnar processing

S3 Integration

Native support for reading external files from S3

Development Workflow

1

Develop Locally

Write and test models using dbt run --target local
2

Validate Results

Query local DuckDB to verify model logic:
duckdb local.db "SELECT * FROM top_story_by_comments LIMIT 10"
3

Deploy to Production

Once validated, deploy to MotherDuck:
dbt run --target prod
4

Share Results

Collaborate with your team using MotherDuck’s cloud database

Example Use Cases

Track how often specific keywords (like “duckdb”) appear in HN stories over time to measure community interest and engagement.
Identify which stories generate the most discussion by analyzing comment counts, helping content creators understand what resonates.
Discover which websites are most frequently shared on HN, revealing trusted sources and trending platforms.

Best Practices

Use Explicit Schemas

Define clear column names and types in your models for better documentation

Test Locally First

Always validate models with the local target before deploying to production

Leverage Window Functions

Use DuckDB’s efficient window functions for ranking and partitioning

Optimize Filters

Apply WHERE clauses early to reduce data scanned from S3
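For example, filtering on a column with useful Parquet row-group statistics lets DuckDB skip entire chunks of the file in S3. A sketch using the columns from the models above (the cutoff date is illustrative):

```sql
-- An early filter on timestamp lets DuckDB prune row groups using
-- Parquet min/max statistics, reducing the bytes read from S3.
SELECT title, descendants
FROM {{ source('hn_external', 'hacker_news_2024_2025') }}
WHERE type = 'story'
  AND timestamp >= TIMESTAMP '2025-01-01'
```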

Next Steps

dbt-duckdb Documentation

Learn more about the dbt-duckdb adapter

DuckDB Extensions

Explore DuckDB extensions for S3, httpfs, and more

MotherDuck Docs

Comprehensive MotherDuck documentation

dbt Best Practices

dbt style guide and conventions
