A lean analytics stack built with dlt, DuckDB, DuckLake, and dbt delivers fast insights without the cost or complexity of a traditional cloud data warehouse. For teams prioritizing speed, simplicity, and control, this architecture provides a practical path from raw data to production-ready analytics.
In practice, teams run this stack using Datacoves to standardize environments, manage workflows, and apply production guardrails without adding operational overhead.
The Lean Data Stack: Tools and Roles
A lean analytics stack works when each tool has a clear responsibility. In this architecture, ingestion, storage, and transformation are intentionally separated so the system stays fast, simple, and flexible.
- dlt handles ingestion. It reliably loads raw data from APIs, files, and databases into DuckDB with minimal configuration and strong defaults.
- DuckDB provides the analytical engine. It is fast, lightweight, and ideal for running analytical queries directly on local or cloud-backed data.
- DuckLake defines the storage layer. It stores tables as Parquet files with centralized metadata, enabling a true lakehouse pattern without a heavyweight platform.
- dbt manages transformations. It brings version control, testing, documentation, and repeatable builds to analytics workflows.
Together, these tools form a modern lakehouse-style stack without the operational cost of a traditional cloud data warehouse.
Setting Up DuckDB and DuckLake with MotherDuck
Running DuckDB locally is easy. Running it consistently across machines, environments, and teams is not. This is where MotherDuck matters.
MotherDuck provides a managed control plane for DuckDB and DuckLake, handling authentication, metadata coordination, and cloud-backed storage without changing how DuckDB works. You still query DuckDB. You just stop worrying about where it runs.
To get started:
- Create a MotherDuck account.
- In Settings → Integrations, generate an API token (MOTHERDUCK_TOKEN).
- Configure access to your object storage, such as S3, for DuckLake-managed tables.
- Export the token as an environment variable on your local machine.
This single token is used by dlt, DuckDB, and dbt to authenticate securely with MotherDuck. No additional credentials or service accounts are required.
At this point, you have:
- A DuckDB-compatible analytics engine
- An open table format via DuckLake
- Centralized metadata and storage
- A setup that works the same locally and in production
That consistency is what makes the rest of the stack reliable.
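A quick way to confirm the setup is to open a MotherDuck connection from the standard DuckDB Python client. This is a minimal sanity check, assuming a recent duckdb package is installed and MOTHERDUCK_TOKEN is exported:

import duckdb

# "md:" connects through MotherDuck; the token is picked up from MOTHERDUCK_TOKEN.
con = duckdb.connect("md:")

# Ordinary DuckDB SQL works as usual; list the databases visible through MotherDuck.
con.sql("SHOW DATABASES").show()
con.sql("SELECT 42 AS sanity_check").show()

If both queries return results, dlt, dbt, and any other DuckDB client on the machine can reach the same catalog with no further configuration.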
Ingesting Data with dlt into DuckDB
In a lean data stack, ingestion should be reliable, repeatable, and boring. That is exactly what dlt is designed to do.
dlt loads raw data into DuckDB with strong defaults for schema handling, incremental loads, and metadata tracking. It removes the need for custom ingestion frameworks while remaining flexible enough for real-world data sources.
In this example, dlt ingests a CSV file and loads it into a DuckDB database hosted in MotherDuck. The same pattern works for APIs, databases, and file-based sources.
To keep dependencies lightweight and avoid manual environment setup, we use uv to run the ingestion script with inline dependencies.
pip install uv
touch us_populations.py
chmod +x us_populations.py

The script below uses dlt’s MotherDuck destination. Authentication is handled through the MOTHERDUCK_TOKEN environment variable, and data is written to the raw database in DuckDB, under the us_population schema.
#!/usr/bin/env -S uv run
# /// script
# dependencies = [
#     "dlt[motherduck]==1.16.0",
#     "psutil",
#     "pandas",
#     "duckdb==1.3.0"
# ]
# ///
"""Loads a CSV file to MotherDuck"""
import dlt
import pandas as pd

# Project-specific helper from the Datacoves setup; resolves the directory
# where dlt keeps its pipeline working state.
from utils.datacoves_utils import pipelines_dir


@dlt.resource(write_disposition="replace")
def us_population():
    url = "https://raw.githubusercontent.com/dataprofessor/dashboard-v3/master/data/us-population-2010-2019.csv"
    df = pd.read_csv(url)
    yield df


@dlt.source
def us_population_source():
    return [us_population()]


if __name__ == "__main__":
    # Configure the MotherDuck destination with explicit credentials
    motherduck_destination = dlt.destinations.motherduck(
        destination_name="motherduck",
        credentials={
            "database": "raw",
            "motherduck_token": dlt.secrets.get("MOTHERDUCK_TOKEN")
        }
    )

    pipeline = dlt.pipeline(
        progress="log",
        pipeline_name="us_population_data",
        destination=motherduck_destination,
        pipelines_dir=pipelines_dir,
        # dataset_name is the target schema name in the "raw" database
        dataset_name="us_population"
    )

    load_info = pipeline.run([
        us_population_source()
    ])
    print(load_info)

Running the script loads the data into DuckDB:
./us_populations.py

At this point, raw data is available in DuckDB and ready for transformation. Ingestion is fully automated, reproducible, and versionable, without introducing a separate ingestion platform.
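The example above replaces the table on every run. For sources that grow over time, the same resource pattern extends naturally to incremental loading with dlt’s built-in cursor tracking. The sketch below is illustrative only; the endpoint, field names, and cursor column are hypothetical:

import dlt
import requests


@dlt.resource(write_disposition="merge", primary_key="id")
def events(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    # dlt persists the highest "updated_at" seen, so each run only requests newer records.
    response = requests.get(
        "https://api.example.com/events",  # hypothetical endpoint
        params={"updated_since": updated_at.last_value},
    )
    response.raise_for_status()
    yield response.json()

Swapping a resource like this into the same pipeline keeps the MotherDuck destination and schema handling unchanged; only the write disposition and cursor logic differ.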
Transforming Data with dbt and DuckLake
Once raw data is loaded into DuckDB, transformations should follow the same disciplined workflow teams already use elsewhere. This is where dbt fits naturally.
dbt provides version-controlled models, testing, documentation, and repeatable builds. The difference in this stack is not how dbt works, but where tables are materialized.
By enabling DuckLake, dbt materializes tables as Parquet files with centralized metadata instead of opaque DuckDB-only files. This turns DuckDB into a true lakehouse engine while keeping the developer experience unchanged.
To get started, install dbt and the DuckDB adapter:
pip install dbt-core==1.10.17
pip install dbt-duckdb==1.10.0
dbt init

Next, configure your dbt profile to target DuckLake through MotherDuck:
default:
  outputs:
    dev:
      type: duckdb
      # This requires the environment var MOTHERDUCK_TOKEN to be set
      path: 'md:datacoves_ducklake'
      threads: 4
      schema: dev # this will be the prefix used in the duckdb schema
      is_ducklake: true
  target: dev

This configuration does a few important things:
- Authenticates using the MOTHERDUCK_TOKEN environment variable
- Writes tables using DuckLake’s open format
- Separates transformed data from raw ingestion
- Keeps development and production workflows consistent
With this in place, dbt models behave exactly as expected. Models materialized as tables are stored in DuckLake, while views and ephemeral models remain lightweight and fast.
From here, teams can:
- Add dbt tests for data quality
- Generate documentation and lineage
- Run transformations locally or in shared environments
- Promote models to production without changing tooling
This is the key advantage of the stack: modern analytics engineering practices, without the overhead of a traditional warehouse.
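Because DuckLake tables are plain Parquet files with shared metadata in MotherDuck, anything that speaks DuckDB can read dbt’s output directly. A minimal sketch, assuming MOTHERDUCK_TOKEN is set; the model name below is illustrative rather than taken from the example project:

import duckdb

# Connect to the same DuckLake catalog that dbt writes to.
con = duckdb.connect("md:datacoves_ducklake")

# Inspect what dbt has materialized so far.
con.sql("SELECT table_schema, table_name FROM information_schema.tables").show()

# Query a transformed model like any other DuckDB table (hypothetical model name).
con.sql("SELECT * FROM dev.my_first_dbt_model LIMIT 10").show()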
When This Stack Makes Sense
This lean stack is not trying to replace every enterprise data warehouse. It is designed for teams that value speed, simplicity, and cost control over heavyweight infrastructure.
This approach works especially well when:
- You want fast analytics without committing to a full cloud warehouse.
- Your team prefers open, file-based storage over proprietary formats.
- You are building prototypes, internal analytics, or domain-specific data products.
- Cost predictability matters more than elastic, multi-tenant scale.
- You want modern analytics engineering practices without platform sprawl.
The trade-offs are real and intentional. DuckDB and DuckLake excel at analytical workloads and developer productivity, but they are not designed for high-concurrency BI at massive scale. Teams with hundreds of dashboards and thousands of daily users may still need a traditional warehouse.
Where this stack shines is time to value. You can move from raw data to trusted analytics quickly, with minimal infrastructure, and without locking yourself into a platform that is expensive to unwind later.
In practice, many teams use this architecture as:
- A lightweight production analytics stack
- A proving ground before scaling to a larger warehouse
- A cost-efficient alternative for departmental or embedded analytics
When paired with Datacoves, teams get the operational guardrails this stack needs to run reliably. Datacoves standardizes environments, integrates orchestration and CI/CD, and applies best practices so the simplicity of the stack does not turn into fragility or the operational drag of DIY platform management over time.
See it in action
If you want to see this stack running end to end, watch the Datacoves + MotherDuck webinar. It walks through ingestion with dlt, transformations with dbt and DuckLake, and how teams operationalize the workflow with orchestration and governance.
The session also covers:
- When DuckDB and DuckLake work well in production
- How to add orchestration with Airflow
- How teams visualize results with Streamlit
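As a small taste of that last piece, a transformed DuckLake table can feed a Streamlit app through the same DuckDB connection. This is a rough sketch with illustrative table and column names, not code from the webinar:

import duckdb
import streamlit as st

# Connect to the DuckLake catalog through MotherDuck (MOTHERDUCK_TOKEN must be set).
con = duckdb.connect("md:datacoves_ducklake")

# Illustrative model and columns; substitute a real dbt model from your project.
df = con.sql(
    "SELECT year, population FROM dev.us_population_by_year ORDER BY year"
).df()

st.title("US Population by Year")
st.bar_chart(df, x="year", y="population")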



