dbt & Airflow

Best practices, tips, and real-world use cases for building reliable data pipelines with dbt and Airflow.
5 mins read

In Apache Airflow, scheduling workflows has traditionally been managed using the schedule_interval parameter, which accepts definitions such as datetime objects or cron expressions to establish time-based intervals for DAG (Directed Acyclic Graph) executions. Airflow was a powerful scheduler but became even more efficient when Airflow introduced a significant enhancement in the incorporation of datasets into scheduling. This advancement enables data-driven DAG execution, allowing workflows to be triggered by specific data updates rather than relying on predetermined time intervals.  

In this article, we'll dive into the concept of Airflow datasets, explore their transformative impact on workflow orchestration, and provide a step-by-step guide to schedule your DAGs using Datasets!

Jump to Tutorial

Understanding Airflow Scheduling (Pre-Datasets)

DAG scheduling in Airflow was primarily time-based, relying on parameters like schedule_interval and start_date to define execution times. With this set up there were three ways to schedule your DAGs: Cron, presets, or timedelta objects. Let's examine each one.

  • Cron Expressions: These expressions allowed precise scheduling. For example, to run a DAG daily at 4:05 AM, you would set schedule_interval='5 4 * * *'.  
  • Presets: Airflow provided string presets for common intervals:  
    • @hourly: Runs the DAG at the beginning of every hour.  
    • @daily: Runs the DAG at midnight every day.  
    • @weekly: Runs the DAG at midnight on the first day of the week.  
    • @monthly: Runs the DAG at midnight on the first day of the month.  
    • @yearly: Runs the DAG at midnight on January 1st.  
  • Timedelta Objects: For intervals not easily expressed with cron, a timedelta object could be used. For instance, schedule_interval=timedelta(hours=6) would schedule the DAG every six hours.  

Limitations of Time-Based Scheduling

While effective for most complex jobs, time-based scheduling had some limitations:  

Fixed Timing: DAGs ran at predetermined times, regardless of data readiness (this is the key to Datasets). If data wasn't available at the scheduled time, tasks could fail or process incomplete data.  

Sensors and Polling: To handle data dependencies, sensors were employed to wait for data availability. However, sensors often relied on continuous polling, which could be resource-intensive and lead to inefficiencies.  

Airflow Datasets were created to overcome these scheduling limitations.

Intro to Airflow Datasets

A Dataset is a way to represent a specific set of data. Think of it as a label or reference to a particular data resource. This can be anything: a csv file, an s3 bucket or SQL table. A Dataset is defined by passing a string path to the Dataset() object. This path acts as an identifier — it doesn't have to be a real file or URL, but it should be consistent, unique, and ideally in ASCII format (plain English letters, numbers, slashes, underscores, etc.).

from airflow.datasets import Dataset

my_dataset = Dataset("s3://my-bucket/my-data.csv")
# or
my_dataset = Dataset("my_folder/my_file.txt")

When using Airflow Datasets, remember that Airflow does not monitor the actual contents of your data. It doesn’t check if a file or table has been updated.

Instead, it tracks task completion. When a task that lists a Dataset in its outlets finishes successfully, Airflow marks that Dataset as “updated.” This means the task doesn’t need to actually modify any data — even a task that only runs a print() statement will still trigger any Consumer DAGs scheduled on that Dataset. It’s up to your task logic to ensure the underlying data is actually being modified when necessary. Even though Airflow isn’t checking the data directly, this mechanism still enables event-driven orchestration because your workflows can run when upstream data should be ready.

For example, if one DAG has a task that generates a report and writes it to a file, you can define a Dataset for that file. Another DAG that depends on the report can be triggered automatically as soon as the first DAG’s task completes. This removes the need for rigid time-based scheduling and reduces the risk of running on incomplete or missing data.

Datasets give you a new way to schedule your DAGs—based on when upstream DAGs completion, not just on a time interval. Instead of relying on schedule_interval, Airflow introduced the schedule parameter to support both time-based and dataset-driven workflows. When a DAG finishes and "updates" a dataset, any DAGs that depend on that dataset can be triggered automatically. And if you want even more control, you can update your Dataset externally using the Airflow API.

When using Datasets in Airflow, you'll typically work with two types of DAGs: Producer and Consumer DAGs.

What is a Producer DAG?

A DAG responsible for defining and "updating" a specific Dataset. We say "updating" because Airflow considers a Dataset "updated" simply when a task that lists it in its outlets completes successfully — regardless of whether the data was truly modified.

A Producer DAG:
✅ Must have the Dataset variable defined or imported
✅ Must include a task with the outlets parameter set to that Dataset

What is a Consumer DAG?

A DAG that is scheduled to execute once the Producer DAG successfully completes.  

A Consumer DAG:
✅ Must reference the same Dataset using the schedule parameter

It’s this producer-consumer relationship that enables event-driven scheduling in Airflow — allowing workflows to run as soon as the data they're dependent on is ready, without relying on fixed time intervals.

Tutorial: Scheduling with Datasets  

Create a producer DAG

1. Define your Dataset.  

In a new DAG file, define a variable using the Dataset object and pass in the path to your data as a string. In this example, it’s the path to a CSV file.

# producer.py
from airflow.datasets import Dataset 

# Define the dataset representing the CSV file 
csv_dataset = Dataset("/path/to/your_dataset.csv") 

2. Create a DAG with a task that updates the CSV dataset.

We’ll use the @dag and @task decorators for a cleaner structure. The key part is passing the outlets parameter to the task. This tells Airflow that the task updates a specific dataset. Once the task completes successfully, Airflow will consider the dataset "updated" and trigger any dependent DAGs.

We’re also using csv_dataset.uri to get the path to the dataset—this is the same path you defined earlier (e.g., "/path/to/your_dataset.csv").

# producer.py
from airflow.decorators import dag, task
from airflow.datasets import Dataset
from datetime import datetime
import pandas as pd
import os

# Define the dataset representing the CSV file
csv_dataset = Dataset("/path/to/your_dataset.csv")

@dag(
    dag_id='producer_dag',
    start_date=datetime(2025, 3, 31),
    schedule='@daily',
    catchup=False,
)
def producer_dag():

    @task(outlets=[csv_dataset])
    def update_csv():
        data = {'column1': [1, 2, 3], 'column2': ['A', 'B', 'C']}
        df = pd.DataFrame(data)
        file_path = csv_dataset.uri

        # Check if the file exists to append or write
        if os.path.exists(file_path):
            df.to_csv(file_path, mode='a', header=False, index=False)
        else:
            df.to_csv(file_path, index=False)

    update_csv()

producer_dag()

Create a Consumer DAG

Now that we have a producer DAG that is updating a Dataset. We can create our DAG that will be dependent on the consumer DAG. This is where the magic happens since this DAG will no longer be time dependent but rather Dataset dependant.  

1. Instantiate the same Dataset used in the Producer DAG

In a new DAG file (the consumer), start by defining the same Dataset that was used in the Producer DAG. This ensures both DAGs are referencing the exact same dataset path.

# consumer.py
from airflow.datasets import Dataset 

# Define the dataset representing the CSV file 
csv_dataset = Dataset("/path/to/your_dataset.csv") 

2. Set the schedule to the Dataset

Create your DAG and set the schedule parameter to the Dataset you instantiated earlier (the one being updated by the producer DAG). This tells Airflow to trigger this DAG only when that dataset is updated—no need for time-based scheduling.

# consumer.py
import datetime
from airflow.decorators import dag, task
from airflow.datasets import Dataset

csv_dataset = Dataset("/path/to/your_dataset.csv")

@dag(
    default_args={
        "start_date": datetime.datetime(2024, 1, 1, 0, 0),
        "owner": "Mayra Pena",
        "email": "mayra@example.com",
        "retries": 3
    },
    description="Sample Consumer DAG",
    schedule=[csv_dataset],
    tags=["transform"],
    catchup=False,
)
def data_aware_consumer_dag():
    
    @task
    def run_consumer():
        print("Processing updated CSV file")
    
    run_consumer()

dag = data_aware_consumer_dag()

Thats it!🎉 Now this DAG will run whenever the first Producer DAG completes (updates the file).  

Dry Principles for Datasets

When using Datasets you may be using the same dataset across multiple DAGs and therfore having to define it many times. There is a simple DRY (Dont Repeat Yourself) way to overcome this.

1. Create a central datasets.py file
To follow DRY (Don't Repeat Yourself) principles, centralize your dataset definitions in a utility module.

Simply create a utils folder and add a datasets.py file.
If you're using Datacoves, your Airflow-related files typically live in a folder named orchestrate, so your path might look like:
orchestrate/utils/datasets.py

2. Import the Dataset object

Inside your datasets.py file, import the Dataset class from Airflow:

from airflow.datasets import Dataset 

3. Define your Dataset in this file

Now that you’ve imported the Dataset object, define your dataset as a variable. For example, if your DAG writes to a CSV file:

from airflow.datasets import Dataset 

# Define the dataset representing the CSV file 
CSV_DATASET= Dataset("/path/to/your_dataset.csv") 

Notice we’ve written the variable name in all caps (CSV_DATASET)—this follows Python convention for constants, signaling that the value shouldn’t change. This makes your code easier to read and maintain.

4. Import the Dataset in your DAG

In your DAG file, simply import the dataset you defined in your utils/datasets.py file and use it as needed.

from airflow.decorators import dag, task
from orchestrate.utils.datasets import CSV_DATASET
from datetime import datetime
import pandas as pd
import os

@dag(
    dag_id='producer_dag',
    start_date=datetime(2025, 3, 31),
    schedule='@daily',
    catchup=False,
)
def producer_dag():

    @task(outlets=[CSV_DATASET])
    def update_csv():
        data = {'column1': [1, 2, 3], 'column2': ['A', 'B', 'C']}
        df = pd.DataFrame(data)
        file_path = CSV_DATASET.uri

        # Check if the file exists to append or write
        if os.path.exists(file_path):
            df.to_csv(file_path, mode='a', header=False, index=False)
        else:
            df.to_csv(file_path, index=False)

    update_csv()

producer_dag()

Now you can reference CSV_DATASET in your DAG's schedule or as a task outlet, keeping your code clean and consistent across projects.🎉

Visualizing Dataset Dependencies in the UI

You can visualize your Datasets as well as events triggered by Datasets in the Airflow UI.  There are 3 tabs that will prove helpful for implementation and debugging your event triggered pipelines:  

Dataset Events

The Dataset Events sub-tab shows a chronological list of recent events associated with datasets in your Airflow environment. Each entry details the dataset involved, the producer task that updated it, the timestamp of the update, and any triggered consumer DAGs. This view is important for monitoring the flow of data, ensuring that dataset updates occur as expected, and helps with prompt identification and resolution of issues within data pipelines.  

Dependency Graph

The Dependency Graph is a visual representation of the relationships between datasets and DAGs. It illustrates how producer tasks, datasets, and consumer DAGs interconnect, providing a clear overview of data dependencies within your workflows. This graphical depiction helps visualize the structure of your data pipelines to identify potential bottlenecks and optimize your pipeline.

Datasets

The Datasets sub-tab provides a list of all datasets defined in your Airflow instance. For each dataset, it shows important information such as the dataset's URI, associated producer tasks, and consumer DAGs. This centralized view provides efficient management of datasets, allowing users to track dataset usage across various workflows and maintain organized data dependencies.  

Datasets UI
Datasets View Airflow UI

Best Practices & Considerations

When working with Datasets, there are a couple of things to take into consideration to maintain readability.  

Naming datasets meaningfully: Ensure your names are verbose and descriptive. This will help the next person who is looking at your code and even future you.

Avoid overly granular datasets: While they are a great tool too many = hard to manage. So try to strike a balance.  

Monitor for dataset DAG execution delays: It is important to keep an eye out for delays since this could point to an issue in your scheduler configuration or system performance.  

Task Completion Signals Dataset Update: It’s important to understand that Airflow doesn’t actually check the contents of a dataset (like a file or table). A dataset is considered “updated” only when a task that lists it in its outlets completes successfully. So even if the file wasn’t truly changed, Airflow will still assume it was. At Datacoves, you can also trigger a DAG externally using the Airflow API and an AWS Lambda Function to trigger your DAG once data lands in an S3 Bucket.

Datacoves provides a scalable Managed Airflow solution and handles these upgrades for you. This alleviates the stress of managing Airflow Infrastructure so you can data teams focus on their pipelines. Checkout how Datadrive saved 200 hours yearly by choosing Datacoves.  

Conclusion

The introduction of data-aware scheduling with Datasets in Apache Airflow is a big advancement in workflow orchestration. By enabling DAGs to trigger based on data updates rather than fixed time intervals, Airflow has become more adaptable and efficient in managing complex data pipelines.  

By adopting Datasets, you can enhance the maintainability and scalability of your workflows, ensuring that tasks are executed exactly when the upstream data is ready. This not only optimizes resource utilization but also simplifies dependency management across DAGs.

Give it a try! 😎

The secret to enterprise dbt analytics success
5 mins read

Enterprises are increasingly relying on dbt (Data Build Tool) for their data analytics; however, dbt wasn’t designed to be an enterprise-ready platform on its own. This leads to struggles with scalability, orchestration, governance, and operational efficiency when implementing dbt at scale. But if dbt is so amazing why is this the case? Like our title suggests, you need more than just dbt to have a successful dbt analytics implementation. Keep on reading to learn exactly what you need to super charge your data analytics with dbt successfully.  

Why Enterprises Adopt dbt for Data Transformation

dbt is popular because it solves problems facing the data analytics world. Enterprises today are dealing with growing volumes of data, making efficient data transformation a critical part of their analytics strategy. Traditionally, data transformation was handled using complex ETL (Extract, Transform, Load) processes, where data engineers wrote custom scripts to clean, structure, and prepare data before loading it into a warehouse. However, this approach has several challenges:

  • Slow Development Cycles – ETL processes often required significant engineering effort, creating bottlenecks and slowing down analytics workflows.
  • High Dependency on Engineers – Analysts and business users had to rely on data engineers to implement transformations, limiting agility.
  • Difficult Collaboration & Maintenance – Custom scripts and siloed processes made it hard to track changes, ensure consistency, and maintain documentation.
issues without dbt
Issues without dbt

dbt (Data Build Tool) transforms this paradigm by enabling SQL-based, modular, and version-controlled transformations directly inside the data warehouse. By following the ELT (Extract, Load, Transform) approach, dbt allows raw data to be loaded into the warehouse first, then transformed within the warehouse itself—leveraging the scalability and processing power of modern cloud data platforms.

Unlike traditional ETL tools, dbt applies software engineering best practices to SQL-based transformations, making it easier to develop, test, document, and scale data pipelines. This shift has made dbt a preferred solution for enterprises looking to empower analysts, improve collaboration, and create maintainable data workflows.

Key Benefits of dbt

  • SQL-Based Transformations – dbt enables data teams to perform transformations within the data warehouse using standard SQL. By managing the Data Manipulation Language (DML) statements, dbt allows anyone with SQL skills to contribute to data modeling, making it more accessible to analysts and reducing reliance on specialized engineering resources.
  • Automated Testing & Documentation – With more people contributing to data modeling things can become a mess but dbt shines by incorporating automated testing and documentation to ensure data reliability. With dbt  teams can have a decentralized development pattern but maintain centralized governance.  
  • Version Control & Collaboration – Borrowing from software engineering best practices dbt enables teams to track changes using Git. Any changes made to data models can be clearly tracked and reverted, simplifying collaboration.  
  • Modular and Reusable Code – dbt's powerful combination of SQL and Jinja enables the creation of modular and reusable code, significantly enhancing maintainability. Using Jinja, dbt allows users to define macros—reusable code snippets that encapsulate complex logic. This means less redundancies and consistent application of business rules across models.  
  • Scalability & Performance Optimization – dbt leverages the data warehouse’s native processing power, enabling incremental models that minimize recomputation and improve efficiency.
  • Extensibility & Ecosystem – dbt integrates with orchestration tools (e.g., Airflow) and metadata platforms (e.g., DataHub), supporting a growing ecosystem of plugins and APIs.

With these benefits it is clear why over 40,000 companies are leveraging dbt today!

The Challenges of Scaling dbt in the Enterprise

Despite dbt’s strengths, enterprises face several challenges when implementing it at scale for a variety of reasons:

Complexity of Scaling dbt

Running dbt in production requires robust orchestration beyond simple scheduled jobs. dbt only manages transformations, but a complete end-to-end pipeline includes Extracting, Loading and Visualizing of data. To manage the full end-to-end data pipeline (ELT + Viz) organizations will need a full-fledged orchestrator like Airflow. While there are other orchestration options on the market,  Airflow and dbt are a common pattern.  

Lack of Integrated CI/CD & Development Controls

CI/CD pipelines are essential for dbt at the enterprise level, yet one of dbt Core’s major limitations is the lack of a built-in CI/CD pipeline for managing deployments. This makes workflows more complex and increases the likelihood of errors reaching production. To address this, teams can implement external tools like Jenkins, GitHub Actions, or GitLab Workflows that provide a flexible and customizable CI/CD process to automate deployments and enforce best practices.

While dbt Cloud does offer an out-of-the-box CI/CD solution, it lacks customization options. Some organizations find that their use cases demand greater flexibility, requiring them to build their own CI/CD processes instead.

Infrastructure & Deployment Constraints

Enterprises seek alternative solutions that provide greater control, scalability, and security over their data platform. However, this comes with the responsibility of managing their own infrastructure, which introduces significant operational overhead ($$$). Solutions like dbt Cloud do not offer Virtual Private Cloud (VPC) deployment, full CI/CD flexibility, and a fully-fledged orchestrator leaving organizations to handle additional platform components.

We saw a need for a middle ground that combined the best of both worlds; something as flexible as dbt Core and Airflow, but fully managed like dbt Cloud. This led to Datacoves which provides a seamless experience with no platform maintenance overhead or  onboarding hassles. Teams can focus on generating insights from data and not worry about the platform.

Avoiding Vendor Lock-In

Vendor lock-in is a major concern for organizations that want to maintain flexibility and avoid being tied to a single provider. The ability to switch out tools easily without excessive cost or effort is a key advantage of the modern data stack. Enterprises benefit from mixing and matching best-in-class solutions that meet their specific needs.

How Datacoves Solves Enterprise dbt Challenges

Datacoves is a fully managed enterprise platform for dbt, solving the challenges outlined above. Below is how Datacoves' features align with enterprise needs:

Platform Capabilities

  • Integrated Development Environment (IDE): With in-browser VS Code, users can develop SQL and Python seamlessly within a browser-based VS Code environment. This includes full access to the terminal, python libraries, and VS Code extensions for the most customizable development experience.  
VS Code in Datacoves
  • Managed Development Environment: Pre-configured VS Code, dbt, and Airflow setup for enterprise teams. Everything is managed so project leads dont have to worry about dependencies, docker images, upgrades or onboarding. Datacoves users can be onboarded to a new project it minutes not days.  
  • Scalability & Flexibility: Kubernetes-powered infrastructure for elastic scaling. Users don’t have the operational overhead of managing their dbt and Airflow environments, they simply login and everything just works.  
  • Version Control & Collaboration: Datacoves integrates seamlessly with Git services like Github, Gitlab, Bitbucket, and Azure DevOps. When deployed in the customer’s VPC, Datacoves can even access private Git servers and Docker registries.
  • Security & User Management: Datacoves can integrate Single Sign-On (SSO) for authentication., and AD groups for role management.
  • Use of Open-Source Tools: Built on standard dbt Core, Airflow, and VS Code to ensure maximum flexibility. At the end of the day it is your code and you can take it with you.  

Data Extraction and Loading

  • With Datacoves, companies can leverage a managed Airbyte instance out of the box. However, if users are not using Airbyte or need addition EL tools Datacoves seamlessly integrates with enterprise EL solutions such as Amazon Glue, Azure Data Factory, Databricks, Streamsets, etc. Additionally, since Datacoves supports Python development organizations can leverage their custom Python frameworks or develop using tools like dlt (data load tool) with ease.
airbyte in datacoves
Airbyte in Datacoves

Data Transformation

  • Support for SQL & Python: In addition to SQL or Python modeling via dbt, users can develop non-dbt Python scripts right within VS Code.
  • Data Warehouse & Data Lake Support: As a platform, Datacoves is warehouse agnostic. It works with Snowflake, BigQuery, Redshift, Databricks, MS Fabric, and any other dbt-compatible warehouse.  

Pipeline Orchestration

  • Enterprise-Grade Managed Apache Airflow: By adopting a full fledged orchestrator, developers can orchestrate the full ELT + Viz pipeline minimizing cost and pipeline failures. One of the biggest benefits of Datacoves is its fully managed Airflow scheduler for data pipeline orchestration. Developers don’t have to worry about the infrastructure overhead or scaling headaches of managing their own Airflow.
Airflow in Datacoves
  • Developer Instance of Airflow ("My Airflow"): With a few clicks easily stand-up a solo Sandbox Airflow instance for testing DAGs before deployment. My Airflow can speed up DAG development by 20%+!
  • Orchestrator Flexibility & Extensibility: Datacoves provides templated accelerators for creating Airflow DAGs and managing dbt runs. These best practices can be invaluable to an organization getting started or looking to optimize.
  • Alerting & Monitoring: Out of the box SMTP integration as well as support for custom SMTP, Slack, and Microsoft Teams notifications for proactive monitoring.  

Data Quality and Governance

  • Cross-project lineage via Datacoves Mesh (aka dbt Mesh): Have a large dbt project that would benefit by being split into multiple projects? Datacoves enables large-scale cross-team collaboration with cross dbt project support.
  • Enterprise-Grade Data Catalog (Datahub): Datacoves provides an optionally hosted Datahub instance which comes with Column-level lineage for tracking data transformations and includes cross project column-level lineage support.
  • CI/CD Accelerators: Need a robust CI/CD pipeline? Datacoves provides accelerator scripts for Jenkins, Github Actions, and Gitlab workflows so teams dont start at square one. These scripts are fully customizable to meet any team’s needs.
  • Enterprise Ready RBAC: Datacoves provides tools and processes that simplify Snowflake permissions while maintainig the controls necessary for securing PII data and complying with GDPR and CCPA regulations.

Licensing and Pricing Plans

Datacoves offers flexible deployment and pricing options to accommodate various enterprise needs:

  • Deployment Options: Choose between Datacoves' multi-tenant SaaS platform or a customer-hosted Virtual Private Cloud (VPC) deployment, ensuring compliance with security and regulatory requirements.  
  • Scalable Pricing: Pricing structures are designed to scale to enterprise levels, optimizing costs as your data operations grow.
  • Total Cost of Ownership (TCO): By providing a fully managed environment for dbt and Airflow, Datacoves reduces the need for in-house infrastructure management, lowering TCO by up to 50%.  

Vendor Information and Support

Datacoves is committed to delivering enterprise-grade support and resources through our white-glove service:

  • Dedicated Support: Comprehensive support packages, providing direct access to Datacoves' development team for timely assistance in Teams, Slack, and or email.  
  • Documentation and Training: Extensive documentation and optional training packages to help teams effectively utilize the platform.  
  • Change Management Expertise: We know that true adoption does not lie with the tools but rather change management. As a thought leader on the subject, Datacoves has guided many organizations through the implementation and scaling of dbt, ensuring a smooth transition and adoption of best practices.  

Conclusion

Enterprises need more than just dbt to achieve scalable and efficient analytics. While dbt is a powerful tool for data transformation, it lacks the necessary infrastructure, governance, and orchestration capabilities required for enterprise-level deployments. Datacoves fills these gaps by providing a fully managed environment that integrates dbt-Core, VS Code, Airflow, and Kubernetes-based deployments, Datacoves is the ultimate solution for organizations looking to scale dbt successfully.  

Whats new in dbt 1.9
5 mins read

The latest release of dbt 1.9, introduces some exciting features and updates meant to enhance functionality and tackle some pain points of dbt. With improvements like microbatch incremental strategy, snapshot enhancements, Iceberg table format support, and streamlined CI workflows, dbt 1.9 continues to help data teams work smarter, faster, and with greater precision. All the more reason to start using dbt today!  

We looked through the release notes, so you don’t have to. This article highlights the key updates in dbt 1.9, giving you the insights needed to upgrade confidently and unlock new possibilities for your data workflows. If you need a flexible dbt and Airflow experience, Datacoves might be right for your organization. Lower total cost of ownership by 50% and shortened your time to market today!

Compatibility Note: Upgrading from Older Versions

If you are upgrading from dbt 1.7 or earlier, you will need to install both dbt-core and the appropriate adapter. This requirement stems from the decoupling introduced in dbt 1.8, a change that enhances modularity and flexibility in dbt’s architecture. These updates demonstrate dbt’s commitment to providing a streamlined and adaptable experience for its users while ensuring compatibility with modern tools and workflows.

pip install dbt-core dbt-snowflake

Microbatch Incremental Strategy: A Better Way to Handle Large Data

In dbt 1.9, the microbatch incremental strategy is a new way to process massive datasets. In earlier versions of dbt, incremental materialization was available to process datasets which were too large to drop and recreate at every build. However, it struggled to efficiently manage very large datasets that are too large to fit into one query. This limitation led to timeouts and complex query management.

The microbatch incremental strategy comes to the rescue by breaking large datasets into smaller chunks for processing using the batch_size, event_time, and lookback configurations to automatically generate the necessary filters for you. However, at the time of this publication this feature is only available on the following adapters: Postgres, Redshift, Snowflake, BigQuery, Spark, and Databricks, with more on the way.  

Key Benefits of Microbatching

  • Simplified Query Design: As mentioned earlier, dbt will handle the logic for your batch data using simple, yet powerful configurations. By setting the event_time, lookback, and batch_size configurations dbt will generate the necessary filters for each batch. One less thing to worry about!  
  • Independent Batch Processing: dbt automatically splits your data into smaller chunks based on the batch_size you set. Each batch is processed separately and in parallel, unless you disable this feature using the +concurrent_batches config. This independence in batch processing improves performance, minimizes the risk of query failures, allows you to retry failed batches using the dbt retry command, and provides the granularity to load specific batches. Gotta love the control without the extra leg work!

Compatibility Note:  Custom microbatch macros

To take advantage of the microbatch incremental strategy, first upgrade to dbt 1.9 and ensure your project is configured correctly. By default, dbt will handle the microbatch logic for you, as explained above. However, if you’re using custom logic, such as a custom microbatch macro, don’t forget to  set the require_batched_execution_for_custom_microbatch_strategy behavior flag to True in your dbt_project.yml file. This prevents deprecation warnings and ensures dbt knows how to handle your custom configuration.

If you have custom microbatch but wish to migrate, its important to note that earlier versions required setting the environment variable DBT_EXPERIMENTAL_MICROBATCH to enable microbatching, but this is no longer needed. Starting with Core 1.9, the microbatch strategy works seamlessly out of the box, so you can remove it.

Enhanced Snapshots: Smarter and More Flexible Data Tracking

With dbt 1.9, snapshots have become easier to use than ever! This is great news for dbt users since snapshots in dbt allow you to capture the state of your data at specific points in time, helping you track historical changes and maintain a clear picture of how your data evolves.  Below are a couple of improvements to implement or be aware of.

Key Improvements in Snapshots

  • YAML Configurations: Snapshots can now be defined directly in YAML files. This makes them easier to manage, read, and update, allowing for a more streamlined configuration process that aligns with other dbt project components. Lots of things are easier in YAML. 😉
  • Customizable Metadata Fields: With the snapshot_meta_column_names config you now have the option to rename metadata fields to match your project's naming conventions. This added flexibility helps ensure consistency across your data models and simplifies collaboration within teams.  
  • Default target_schema: If you do not specify a schema for your snapshots, dbt will use the schema defined for the current environment. This means that snapshots will be created in the default schema associated with your dbt environment settings.
  • Standardization of resource type: Snapshots now support the standard schema and database configurations, similar to models and seeds. This standardization allows you to define where your snapshots are stored using familiar configuration patterns.
  • New Warnings: You will now get a warning if you set an incorrect updated_at data type. This ensures it is an accepted data type or timestamp. No more silent error.  
  • Set an expiration date: Before dbt 1.9 the dbt_valid_to variable is set to NULL but you can now you can configure it to a data with the dbt_valid_to_current config. It is important to note that dbt will not automatically adjust the current value in the existing dbt_valid_to column. Meaning, any existing current records will still have dbt_valid_to set to NULL and new records will have this value set to your configured date.  You will have to manually update existing data to match. Less NULL values to handle downstream!  
  • dbt snapshot–empty: In dbt 1.9, the --empty flag is now supported for the dbt snapshot command, allowing you to execute snapshot operations without processing data. This enhancement is particularly useful in Continuous Integration (CI) environments, enabling the execution of unit tests for models downstream of snapshots without requiring actual data processing, streamlining the testing process. The empty flag, introduced in dbt 1.8, also has some powerful applications in Slim CI to optimize your CI/CD worth checking out.
  • Improved Handling of Deleted Records: In dbt version 1.9, the hard_deletes configuration enhances the management of deleted records in snapshots. This feature offers three methods: the default ignore, which takes no action on deleted records; invalidate, replacing the invalidate_hard_deletes=trueconfig, which marks deleted records as invalid by setting their dbt_valid_to timestamp to the current time; and lastly new_record, which tracks deletions by inserting a new record with a dbt_is_deleted config set to True.  

Compatibility Note:  hard_deletes

It's important to note some migration efforts will be required for this. While the invalidate_hard_deletes configuration is still supported for existing snapshots, it cannot be used alongside hard_deletes. For new snapshots, it's recommended to use hard_deletes instead of the legacy invalidate_hard_deletes. If you switch an existing snapshot to use hard_deletes without migrating your data, you may encounter inconsistent or incorrect results, such as a mix of old and new data formats. Keep this in mind when implementing these new configs.

Unit Testing Enhancements: Streamlined Testing for Better Data Quality

Testing is a vital part of maintaining high data quality and ensuring your data models work as intended. Unit testing was introduced in dbt 1.8 and has seen continued improvement in dbt 1.9.  

Key Enhancements in Unit Testing:

  • Selective Testing with Unit Test Selectors: dbt 1.9 introduces a new selection method for unit tests, allowing users to target specific unit tests directly using the unit_test: selector. This feature enables more granular control over test execution, allowing you to focus on particular tests without running the entire suite, thereby saving time and resources.
dbt test --select unit_test:my_project.my_unit_test 

dbt build --select unit_test:my_project.my_unit_test 
  • Improved Resource Type Handling: The update ensures that commands like dbt list --resource-type test now correctly include only data tests, excluding unit tests. This distinction enhances clarity and precision when managing different test types within your project.  
dbt ls --select unit_test:my_project.my_unit_test 

Slim CI State Modifications: Smarter and More Accurate Workflows

In dbt version 1.9, the state:modified selector has been enhanced to improve the accuracy of Slim CI workflows. Previously, dynamic configurations—such as setting the database based on the environment—could lead to dbt perceiving changes in models, even when the actual model remained unchanged. This misinterpretation caused Slim CI to rebuild all models unnecessarily, resulting in false positives.

dynamic database
dbt dynamic configurations

By comparing unrendered configuration values, dbt now accurately detects genuine modifications, eliminating false positives during state comparisons. This improvement ensures that only truly modified models are selected for rebuilding, streamlining your CI processes.

Key Benefits:

  • Improved Accuracy: Focusing on unrendered configurations reduces false positives during state comparisons.
  • Streamlined CI Processes: Enhanced change detection allows CI workflows to concentrate solely on resources that require updates or testing.
  • Time and Resource Efficiency: Minimizing unnecessary computations conserves both time and computational resources.

To enable this feature, set the state_modified_compare_more_unrendered_values flag to True in your dbt_project.yml file:

flags: 
	state_modified_compare_more_unrendered_values: True 

Enhanced Documentation Hosting with --host Flag in dbt 1.9

In dbt 1.9, the dbt docs serve command now has more customization abilities with a new --host flag. This flag allows users to specify the host address for serving documentation. Previously, dbt docs serve defaulted to binding the server to 127.0.0.1 (localhost) without an option to override this setting.  

Users can now specify a custom host address using the --host flag when running dbt docs serve. This enhancement provides the flexibility to bind the documentation server to any desired address, accommodating various deployment needs. The default of the --host flag will continue to bind to 127.0.0.1 by default, ensuring backward compatibility and secure defaults.

Key Benefits:

  • Deployment Flexibility: Users can bind the documentation server to different host addresses as required by their deployment environment.
  • Improved Accessibility: Facilitates access to dbt documentation across various network configurations by enabling custom host bindings.
  • Enhanced Compatibility: Addresses previous limitations and resolves issues encountered in deployments that require non-default host bindings.

Other Notable Improvements in dbt 1.9

dbt 1.9 includes several updates aimed at improving performance, usability, and compatibility across projects. These changes ensure a smoother experience for users while keeping dbt aligned with modern standards.

  • Iceburg table  support: With dbt 1.9, you can now add Iceberg table support to table, incremental, dynamic table materializations.
  • Optimized dbt clone Performance: The dbt clone command now executes clone operations concurrently, enhancing efficiency and reducing execution time.
  • Parseable JSON and Text Output in Quiet Mode: The dbt show and dbt compile commands now support parseable JSON and text outputs when run in quiet mode, facilitating easier integration with other tools and scripts by providing machine-readable outputs.
  • skip_nodes_if_on_run_start_fails Behavior Change Flag: A new behavior change flag, skip_nodes_if_on_run_start_fails, has been introduced to gracefully handle failures in on-run-start hooks. When enabled, if an on-run-start hook fails, subsequent hooks and nodes are skipped, preventing partial or inconsistent runs.

Compatibility Note:  Sans Python 3.8

  • Python 3.8 Support Removed: dbt 1.9 no longer supports Python 3.8, encouraging users to upgrade to newer Python versions. This ensures compatibility with the latest features and enhances overall performance.  

Conclusion

dbt 1.9 introduces a range of powerful features and enhancements, reaffirming its role as a cornerstone tool for modern data transformations.  The enhancements in this release reflect the community's commitment to innovation and excellence as well as its strength and vitality. There's no better time to join this dynamic ecosystem and elevate your data workflows!

If you're looking to implement dbt efficiently, consider partnering with Datacoves. We can help you reduce your total cost of ownership by 50% and accelerate your time to market. Book a call with us today to discover how we can help your organization in building a modern data stack with minimal technical debt.

Checkout the full release notes.

dbt and airflow
5 mins read

dbt and Airflow are cornerstone tools in the modern data stack, each excelling in different areas of data workflows. Together, dbt and Airflow provide the flexibility and scalability needed to handle complex, end-to-end workflows.

This article delves into what dbt and Airflow are, why they work so well together, and the challenges teams face when managing them independently. It also explores how Datacoves offers a fully managed solution that simplifies operations, allowing organizations to focus on delivering actionable insights rather than managing infrastructure.

What is dbt?

dbt (Data Build Tool) is an open-source analytics engineering framework that transforms raw data into analysis-ready datasets using SQL. It enables teams to write modular, version-controlled workflows that are easy to test and document, bridging the gap between analysts and engineers.

  • Adoption: With over 40,000 companies using dbt, the majority rely on open-source dbt Core available to anyone.
  • Key Strength: dbt empowers anyone with SQL knowledge to own the logic behind data transformations, giving them control over cleansing data and delivering actionable insights.
  • Key Weakness: Teams using open-source dbt on their own must manage infrastructure, developer environments, job scheduling, documentation hosting, and the integration of tools for loading data into their data warehouse.  

What is Airflow?

Apache Airflow is an open-source platform designed to orchestrate workflows and automate tasks. Initially created for ETL processes, it has evolved into a versatile solution for managing any sequence of tasks in data engineering, machine learning, or beyond.

  • Adoption: With over 37,000 stars on GitHub, Airflow is one of the most popular orchestration tools, seeing thousands of downloads every month.
  • Key strength: Airflow excels at handling diverse workflows. Organizations use it to orchestrate tools like Azure Data Factory, Amazon Glue, and open-source options like dlt (data load tool). Airflow can trigger dbt transformations, post-transformation processes like refreshing dashboards, or even marketing automation tasks. Its versatility extends to orchestrating AI and ML pipelines, making it a go-to solution for modern data stacks.
  • Key weakness: Scaling Airflow often requires running it on Kubernetes for its scalable nature. However, this introduces significant operational overhead and a steep learning curve to configure and maintain the Kubernetes cluster.

Why dbt and Airflow are a natural pair

Stitch together disjointed schedules

While dbt excels at SQL-based data transformations, it has no built-in scheduler, and solutions like dbt Cloud’s scheduling capabilities are limited to triggering jobs in isolation or getting a trigger from an external source. This approach risks running transformations on stale or incomplete data if upstream processes fail. Airflow eliminates this risk by orchestrating tasks across the entire pipeline, ensuring transformations occur at the right time as part of a cohesive, integrated workflow.

Tools like Airbyte and Fivetran also provide built-in schedulers, but these are designed for loading data at a given time and optionally trigger a dbt pipeline. As complexity grows and organizations need to trigger dbt pipelines after data loads via different means such as dlt and Fivetran, then this simple approach does not scale. It is also common to trigger operations after a dbt pipeline and scheduling using the data loading tool will not handle that complexity. With dbt and Airflow, a team can connect the entire process and assure that processes don’t run if upstream tasks fail or are delayed.

Airflow centralizes orchestration, automating the timing and dependencies of tasks—extracting and loading data, running dbt transformations, and delivering outputs. This connected approach reduces inefficiencies and ensures workflows run smoothly with minimal manual intervention.

Handle complexity with ease

Modern data workflows extend beyond SQL transformations. Airflow complements dbt by supporting complex, multi-stage processes such as integrating APIs, executing Python scripts, and training machine learning models. This flexibility allows pipelines to adapt as organizational needs evolve.

Airflow also provides a centralized view of pipeline health, offering data teams complete visibility. With its ability to trace issues and manage dependencies, Airflow helps prevent cascading failures and keeps workflows reliable.

By combining dbt’s transformation strengths with Airflow’s orchestration capabilities, teams can move past fragmented processes. Together, these tools enable scalable, efficient analytics workflows, helping organizations focus on delivering actionable insights without being bogged down by operational hurdles.

Managed Airflow and managed dbt in Datacoves

In our previous article, we discussed building vs buying your Airflow and dbt infrastructure. There are many cons associated with self-hosting these two tools, but Datacoves takes the complexity out of managing dbt and Airflow by offering a fully integrated, managed solution. Datacoves has given many organizations the flexibility of open-source tools with the freedom of managed tools. See how we helped Johnson and Johnson MedTech migrate to our managed dbt and airflow platform.

Managed dbt

Datacoves offers the most flexible and robust managed dbt Core environment on the market, enabling teams to fully harness the power of dbt without the complexities of infrastructure management, environment setup, or upgrades. Here’s why our customers choose Datacoves to implement dbt:

  • Seamless VS Code environment: Users can log in to a secure, browser-based VS Code development environment and start working immediately. With access to the terminal, VS Code extensions, packages, and libraries, developers have the full power of the tools they already know and love—without the hassle of managing local setups. Unlike inflexible custom IDEs, the familiar and flexible VS Code environment empowers developers to work efficiently. Scaling and onboarding new analytics engineers is streamlined so they can be productive in minutes.
Real-time SQL linting
Real-time SQL linting
  • Optimized for dbt development: Datacoves is designed to enhance the dbt development experience with features like SQL formatting, autocomplete, linting, compiled dbt preview, curated extensions, and python libraries. It ensures teams can develop efficiently and maintain high standards for their project.
  • Effortless upgrade management: Datacoves manages both platform and version upgrades. Upgrades require minimal work from the data teams and is usually as simple as “change this line in this file.”  
  • CI/CD accelerators: Many teams turn to Datacoves after outgrowing the basic capabilities of dbt Cloud CI. Datacoves integrates seamlessly with leading CI/CD tools like GitHub Actions, GitLab Workflows, and Jenkins. But we don’t stop at providing the tools—we understand that setting up and optimizing these pipelines requires expertise. That’s why we work closely with our customers to implement robust CI/CD pipelines, saving them valuable time and reducing costs.
  • dbt best practices and guidance: Datacoves provides accelerators and starting points for dbt projects, offering teams a strong foundation to begin their work or improve their project following best practices. This approach has helped teams minimize technical debt and ensure long-term project success. As an active and engaged member of the dbt community, Datacoves stays up to date on new improvements and changes. Supporting customers by providing expert guidance on required updates and optimizations.

Managed Airflow

Datacoves offers a fully managed Airflow environment, designed for scalability, reliability, and simplicity. Whether you're orchestrating complex ETL workflows, triggering dbt transformations, or integrating with third-party APIs, Datacoves takes care of the heavy lifting by managing the Kubernetes infrastructure, monitoring, and scaling. Here’s what sets Datacoves apart as a managed Airflow solution:

  • Multiple Airflow environments: Teams can seamlessly access their hosted Airflow UI and easily set up dedicated development and production instances. Managing secrets is simplified with secure options like Datacoves Secrets Manager or AWS Secrets Manager, enabling a streamlined and secure workflow without the logistical headaches.
Airflow in Datacoves
Airflow in Datacoves
  • Observability: With built-in tools like Grafana, teams gain comprehensive visibility into their Airflow jobs and workflows. Monitor performance, identify bottlenecks, and troubleshoot issues with ease—all without the operational overhead of managing Kubernetes clusters or runtime errors.
  • Upgrade management: Upgrading Airflow is simple and seamless. Teams can transition to newer Airflow versions without downtime or complexity.
  • Git sync/S3 sync: Users can effortlessly sync their Airflow DAGs using two popular methods—Git Sync or S3 Sync—without needing to worry about the complexities of setup or configuration.
  • My Airflow: Datacoves offers My Airflow, a standalone Airflow instance that lets users instantly test and develop DAGs at the push of a button. This feature provides developers with the freedom to experiment safely without affecting development or production Airflow instances.
Start My Airlfow
Start My Airlfow

  • Airflow best practices and guidance: Datacoves provides expert guidance on DAG optimization and Airflow best practices, ensuring your organization avoids costly technical debt and gets it right from the start.

Conclusion

dbt and Airflow are a natural pair in the Modern Data Stack. dbt’s powerful SQL-based transformations enable teams to build clean, reliable datasets, while Airflow orchestrates these transformations within a larger, cohesive pipeline. Their combination allows teams to focus on delivering actionable insights rather than managing disjointed processes or stale data.

However, managing these tools independently can introduce challenges, from infrastructure setup to scaling and ongoing maintenance. That’s where platforms like Datacoves make a difference. For organizations seeking to unlock the full potential of dbt and Airflow without the operational overhead, solutions like Datacoves provide the scalability and efficiency needed to modernize data workflows and accelerate insights.

Book a call today to see how Datacoves can help your organization realize the power of Airflow and dbt.

build vs buy analytics
5 mins read

Not long ago, the data analytics world relied on monolithic infrastructures—tightly coupled systems that were difficult to scale, maintain, and adapt to changing needs. These legacy setups often resulted in operational bottlenecks, delayed insights, and high maintenance costs. To overcome these challenges, the industry shifted toward what was deemed the Modern Data Stack (MDS)a suite of focused tools optimized for specific stages of the data engineering lifecycle.

This modular approach was revolutionary, allowing organizations to select best-in-class tools like Airflow for Orchestration or a managed version of Airflow from Astronomer or Amazon without the need to build custom solutions. While the MDS improved scalability, reduced complexity, and enhanced flexibility, it also reshaped the build vs. buy decision for analytics platforms. Today, instead of deciding whether to create a component from scratch, data teams face a new question: Should they build the infrastructure to host open-source tools like Apache Airflow and dbt Core, or purchase their managed counterparts? This article focuses on these two components because pipeline orchestration and data transformation lie at the heart of any organization’s data platform.

What does it mean to build vs buy?

Build

When we say build in terms of open-source solutions, we mean building infrastructure to self-host and manage mature open-source tools like Airflow and dbt. These two tools are popular because they have been vetted by thousands of companies! In addition to hosting and managing, engineers must also ensure interoperability of these tools within their stack, handle security, scalability, and reliability. Needless to say, building is a huge undertaking that should not be taken lightly.

Buy

dbt and Airflow both started out as open-source tools, which were freely available to use due to their permissive licensing terms. Over time, cloud-based managed offerings of these tools were launched to simplify the setup and development process. These managed solutions build upon the open-source foundation, incorporating proprietary features like enhanced user interfaces, automation, security integration, and scalability. The goal is to make the tools more convenient and reduce the burden of maintaining infrastructure while lowering overall development costs. In other words, paid versions arose out of the pain points of self-managing the open-source tools.

This begs the important question: Should you self-manage or pay for your open-source analytics tools?

Comparing build vs. buy: Key tradeoffs  

As with most things, both options come with trade-offs, and the “right” decision depends on your organization’s needs, resources, and priorities. By understanding the pros and cons of each approach, you can choose the option that aligns with your goals, budget, and long-term vision.

Building In-House  

Pros:  

  • Customization: The biggest advantage of building in-house is the flexibility to customize the tool to fit your exact use case. You maintain full control, allowing you to align configurations with your organization’s unique needs. However, with great power comes great responsibility—your team must have a deep understanding of the tools, their options, and best practices.  
  • Control: Owning the entire stack gives your team the ability to integrate deeply with existing systems and workflows, ensuring seamless operation within your ecosystem.  
  • Cost Perception: Without licensing fees, building in-house may initially appear more cost-effective, particularly for smaller-scale deployments.  

Cons:  

  • High Upfront Investment: Setting up infrastructure requires a considerable time commitment from developers. Tasks like configuring environments, integrating tools like Git or S3 for Airflow DAG syncing, and debugging can consume weeks of developer hours.  
  • Operational Complexity: Ongoing maintenance—such as managing dependencies, handling upgrades, and ensuring reliability—can be overwhelming, especially as the system grows in complexity.  
  • Skill Gaps: Many teams underestimate the level of expertise needed to manage Kubernetes clusters, Python virtual environments, and secure credential storage systems like AWS Secrets Manager.  
  • Experimentation: Your organization is using the first iteration the team is producing which can lead to unintended consequences, edge cases, and security issues.  

Example:

A team building Airflow in-house may spend weeks configuring a Kubernetes-backed deployment, managing Python dependencies, and setting up DAG synchronizing files via S3 or Git. While the outcome can be tailored to their needs, the time and expertise required represent a significant investment.  

Building with open-source is not free. Cons Continued

Before moving on to the buy tradeoffs, it is important to set the record straight. You may have noticed that we did not include “the tool is free to use” as one of our pros for building with open-source. You might have guessed by reading the title of this section, but many people incorrectly believe that building their MDS using open-source tools like dbt is free. When in reality there are many factors that contribute to the true dbt pricing and the same is true for Airflow.

How can that be?  Well, setting up everything you need and managing infrastructure for Airflow and dbt isn’t necessarily plug and play.  There is day-to-day work from managing Python virtual environments, keeping dependencies in check, and tackling scaling challenges which require ongoing expertise and attention. Hiring a team to handle this will be critical particularly as you scale. High salaries and benefits are needed to avoid costly mistakes; this can easily cost anywhere from $5,000 to $26,000+/month depending on the size of your team.

In addition to the cost of salaries, let’s look at other possible hidden costs that come with using open-source tools.  

Time and expertise  

The time it takes to configure, customize, and maintain a complex open-source solution is often underestimated. It’s not until your team is deep in the weeds—resolving issues, figuring out integrations, and troubleshooting configurations—that the actual costs start to surface. With each passing day your ROI is threatened. You want to start gathering insights from your data as soon as possible. Datacoves helped Johnson and Johnson set up their data stack in weeks and when issues arise, a you will need expertise to accelerate the time to resolution.

And then there’s the learning curve. Not all engineers on your team will be seniors, and turnover is inevitable. New hires will need time to get up to speed before they can contribute effectively. This is the human side of technology: while the tools themselves might move fast, people don’t. That ramp-up period, filled with training and trial-and-error, represents a hidden cost.  

Security and compliance  

Security and compliance add another layer of complexity. With open-source tools, your team is responsible for implementing best practices—like securely managing sensitive credentials with a solution like AWS Secrets Manager. Unlike managed solutions, these features don’t come prepackaged and need to be integrated with the system.  

Compliance is no different. Ensuring your solution meets enterprise governance requirements takes time, research, and careful implementation. It’s a process of iteration and refinement, and every hour spent here is another hidden cost as well as risking security if not done correctly.  

Scaling complexities  

Scaling open-source tools is where things often get complicated. Beyond everything already mentioned, your team will need to ensure the solution can handle growth. For many organizations, this means deploying on Kubernetes. But with Kubernetes comes steep learning curves and operational challenges. Making sure you always have a knowledgeable engineer available to handle unexpected issues and downtimes can become a challenge. Extended downtime due to this is a hidden cost since business users are impacted as they become reliant on your insights.    

Buying a managed solution  

A managed solution for Airflow and dbt can solve many of the problems that come with building your own solution from open-source tools such as: hassle-free maintenance, improved UI/UX experience, and integrated functionality. Let’s take a look at the pros.

Pros:  

  • Faster Time to Value: With a managed solution, your team can get up and running quickly without spending weeks—or months—on setup and configuration.  
  • Reduced Operational Overhead: Managed providers handle infrastructure, maintenance, and upgrades, freeing your team to focus on business objectives rather than operational minutiae.  
  • Predictable Costs: Managed solutions typically come with transparent pricing models, which can make budgeting simpler compared to the variable costs of in-house built tooling.  
  • Reliability: Your team is using version 1000+ of a managed solution vs the 1st version of your self-managed solution. This provides reliability and peace of mind that edge cases have been handled, and security is under wraps.  

Cons:  

  • Potentially Less Flexibility: This is the biggest con and reason why many teams choose to build. Managed solutions may not allow for the same level of customization as building in-house, which could limit certain niche use cases. Care must be taken to choose a provider that is built for enterprise level flexibility.
  • Dependency on a Vendor: Relying on a vendor for your analytics stack introduces some level of risk, such as service disruptions or limited migration paths if you decide to switch providers. Some managed solution providers simply offer the service, but leave it up to your team to “make it work” and troubleshoot. A good provider will have a vested interest in your success, because they can’t afford for you to fail.  

Example:  

Using a solution like MWAA, teams can leverage managed Airflow eliminating the need for infrastructure worries however additional configuration and development will be needed for teams to leverage it with dbt and to troubleshoot infrastructure issues suck as containers running out of memory.

Datacoves for Airflow and dbt: The buy that feels like a build

For data teams, the allure of a custom-built solution often lies in its promise of complete control and customization. However, building this requires significant time, expertise, and ongoing maintenance. Datacoves bridges the gap between custom-built flexibility and the simplicity of managed services, offering the best of both worlds.

With Datacoves, teams can leverage managed Airflow and pre-configured dbt environments to eliminate the operational burden of infrastructure setup and maintenance. This allows data teams to focus on what truly matters—delivering insights and driving business decisions—without being bogged down by tool management.

Unlike other managed solutions for dbt or Airflow, which often compromise on flexibility for the sake of simplicity, Datacoves retains the adaptability that custom builds are known for. By combining this flexibility with the ease and efficiency of managed services, Datacoves empowers teams to accelerate their analytics workflows while ensuring scalability and control.

Datacoves doesn’t just run the open-source solutions, but through real-world implementations, the platform has been molded to handle enterprise complexity while simplifying project onboarding. With Datacoves, teams don’t have to compromize on features like Datacoves-Mesh (aka dbt-mesh), column level lineage, GenAI, Semantic Layer, etc. Best of all, the company’s goal is to make you successful and remove hosting complexity without introducing vendor lock-in. What Datacove does, you can do yourself if given enough time, experience, and money. Finally, for security concious organizations, Datacoves is the only solution on the market that can be deployed in your private cloud with white-glove enterprise support.

Datacoves isn’t just a platform—it’s a partnership designed to help your data team unlock their potential. With infrastructure taken care of, your team can focus on what they do best: generating actionable insights and maximizing your ROI.

Conclusion

The build vs. buy debate has long been a challenge for data teams, with building offering flexibility at the cost of complexity, and buying sacrificing flexibility for simplicity. As discussed earlier in the article, solutions like dbt and Airflow are powerful, but managing them in-house requires significant time, resources, and expertise. On the other hand, managed offerings like dbt Cloud and MWAA simplify operations but often limit customization and control.

Datacoves bridges this gap, providing a managed platform that delivers the flexibility and control of a custom build without the operational headaches. By eliminating the need to manage infrastructure, scaling, and security. Datacoves enables data teams to focus on what matters most: delivering actionable insights and driving business outcomes.

As highlighted in Fundamentals of Data Engineering, data teams should prioritize extracting value from data rather than managing the tools that support them. Datacoves embodies this principle, making the argument to build obsolete. Why spend weeks—or even months—building when you can have the customization and adaptability of a build with the ease of a buy? Datacoves is not just a solution; it’s a rethinking of how modern data teams operate, helping you achieve your goals faster, with fewer trade-offs.

dbt pricing build vs buy
5 mins read

Organizations often opt for open-source tools because "free" seems like an easy decision, especially compared to the higher price of managed versions of the same tooling. However, as with many things, there is no such thing as a free lunch. When choosing these open-source tools, it is easy to say that the Airflow and dbt pricing is $0 dollars meaning a cost-saving choice, but hidden expenses that are hard to ignore will quickly be revealed.

dbt Core and Apache Airflow are a natural pair in modern data analytics. dbt Core simplifies SQL-based data transformations, empowering data teams to create and maintain clean, well-documented, structured pipelines. Apache Airflow takes care of orchestrating these workflows, automating the movement and processing of data through the data engineering life cycle. Together, they can drive a powerful analytics stack that’s flexible and scalable—when used correctly. But this flexibility often comes at a price.  

In this article, we’ll examine the build vs. buy dilemma, highlighting the flexibility and true costs of open-source tools like dbt Core and Apache Airflow. We’ll also compare them to managed solutions such as dbt Cloud pricing and Datacoves pricing, providing the insights you need to evaluate the trade-offs and choose the best option for your organization.  

dbt and Airflow pricing

Open-source dbt pricing

The open-source tool dbt is free to download and use. However, the actual cost emerges when considering the technical resources required for effective implementation and management. Tasks such as setting up infrastructure, ensuring scalability, and maintaining the tool demand skilled engineers.

Assuming a team of 2–4 engineers is responsible for these tasks, with annual salaries ranging from $120,000 to $160,000 (approximately $10,000 to $13,000 per month), even dedicating 25–50% of their time to managing dbt Core results in a monthly cost of $5,000 to $26,000. As your use of dbt scales, you may need to hire a dedicated team to manage the open-source solution full-time, leading to costs equating to 100% of their combined salaries.

So we can begin to see the true open source dbt pricing, especially at scale. In addition to engineering labor are other costs such as time, and effort required to maintain and scale the platform. More on that later.

dbt Cloud Pricing

Just on engineering pricing alone, we begin to see the comparison between the open-source and managed solutions. dbt Labs offers a hosted solution, dbt Cloud, with added features and tiered pricing options.

  • Developer Plan: Best for individual users, this free tier includes a single developer seat, up to 3,000 models built per month, and support for one dbt project.
  • Team Plan: Aimed at small to mid-sized teams, this plan supports up to 8 developer seats, one dbt project, and 15,000 models built per month. Priced at $100/user/month, additional model builds cost $0.01 per model.
  • Enterprise Plan: (This is the plan that most medium and large organizations will need) Designed for larger organizations, this tier supports unlimited users and projects. The MSRP jumps to a whopping $4,800 per seat and a fixed number of model runs after which the $0.01 per model price also applies.

Opting for a managed solution will allow your organization to cut engineering costs down or allow your engineers to focus on other projects. However, while dbt Cloud reduced the infrastructure burden a bit, it only focuses on the T of ELT. Meaning, you still need engineers to manage the other pieces of the stack which can result in a disconnected data pipeline.

Open-source Airflow

It is worth noting that some companies decide to use dbt cloud for the scheduler feature which can quickly become limiting as workflows become more complex. The next step is always a full fledged orchestrator such as Airflow.

Just like dbt Core, Apache Airflow is also free to use, but the true cost comes from deploying and maintaining it securely and at scale, which requires significant expertise, particularly in areas like Kubernetes, dependency management, and high-availability configurations.

Assuming 2–4 engineers with annual salaries between $130,000 and $170,000 (around $11,000 to $14,000 per month) dedicate 25–50% of their time to Airflow, the monthly cost ranges from $5,500 to $28,000. The pattern we saw with dbt Core rings true here as well. As your workflows grow, hiring a dedicated team to manage Airflow becomes necessary, leading to costs equating to 100% of their salaries.

Managed Airflow from AWS, MWAA

For teams looking to sidestep the complexities of managing Airflow in-house, managed solutions provide an appealing alternative:

  • AWS Managed Workflows for Apache Airflow (MWAA): A managed Airflow service from Amazon, MWAA simplifies deployment and scaling but has variable pricing based on environment size and execution time, which can make costs unpredictable.
  • Other Providers: Options like Astronomer and Google Cloud Composer offer similar managed Airflow solutions, each with unique features, performance considerations, and pricing structures.

A managed Airflow solution typically costs between $5,000 and $15,000 per year, depending on workload, resource requirements, and the number of Airflow instances. By choosing a managed solution, organizations can see cost savings in the infrastructure maintenance, overall maintenance stress and more.

The hidden costs of open-source tools

Setting up and managing infrastructure for Airflow and dbt Core isn’t as straightforward—or as “free”—as it might seem. The day-to-day work from managing Python virtual environments, keeping dependencies in check, and tackling scaling challenges require ongoing expertise and attention. In addition to salaries and benefits, what starts as an open-source experiment can quickly morph into a significant operational overhead full of hidden costs. Let’s dive into how by looking at time and expertise, security and compliance, and scaling complexities which, if not considered, can lead to possible side effects such as extended downtime, security issues and more.

Time and expertise

The time it takes to configure, customize, and maintain a complex open-source solution is often underestimated. It’s not until your team is deep in the weeds—resolving issues, figuring out integrations, and troubleshooting configurations—that the actual costs start to surface. With each passing day your ROI is threatened. You want to start gathering insights from your data as soon as possible. Datacoves helped Johnson and Johnson set up their data stack in weeks

And then there’s the learning curve. Not all engineers on your team will be senior, and turnover is inevitable. New hires will need time to get up to speed before they can contribute effectively. This is the human side of technology: while the tools themselves might move fast, people don’t. That ramp-up period, filled with training and trial-and-error, represents yet another hidden cost.

Security and compliance

Security and compliance add another layer of complexity. With open-source tools, your team is responsible for implementing best practices—like securely managing sensitive credentials with a solution like AWS Secrets Manager. Unlike managed solutions, these features don’t come prepackaged and need to be built integrated with the system.

Compliance is no different. Ensuring your solution meets enterprise governance requirements takes time, research, and careful implementation. It’s a process of iteration and refinement, and every hour spent here is another hidden cost as well as risking security if not done correctly.

Scaling complexities

Scaling open-source tools is where things often get complicated. Beyond everything already mentioned, your team will need to ensure the solution can handle growth. For many organizations, this means deploying on Kubernetes. But with Kubernetes comes steep learning curves and operational challenges. Making sure you always have a knowledgeable engineer available to handle unexpected issues and downtimes can become a challenge. Extended downtime due to this is a hidden cost since business user are impacted as they become reliant on your insights.  

Comparing build vs. buy: Key tradeoffs

Throughout this article, we have uncovered the true costs of open-source tools, bringing us to the critical decision between building in-house or buying a managed solution. Even after we have uncovered the actual cost of open-source, the decision isn’t just about price—it’s also about flexibility a custom build offers.

Managed solutions often adopt a one-size-fits-all approach designed to attract the widest range of customers. While this can simplify implementation for many organizations, it may not always meet the specific needs of your team. To make an informed decision, let’s examine the key advantages and challenges of each approach.

Building In-House

Pros:

  • Customization: The biggest advantage of building in-house is the flexibility to customize the tool to fit your exact use case. You maintain full control, allowing you to align configurations with your organization’s unique needs. However, with great power comes great responsibility—your team must have a deep understanding of the tools, their options, and best practices.
  • Control: Owning the entire stack gives your team the ability to integrate deeply with existing systems and workflows, ensuring seamless operation within your ecosystem.
  • Cost Perception: Without licensing fees, building in-house may initially appear more cost-effective, particularly for smaller-scale deployments.

Cons:

  • High Upfront Investment: Setting up infrastructure requires a significant time commitment from developers. Tasks like configuring environments, integrating tools like Git or S3 for Airflow DAG syncing, and debugging can consume weeks of developer hours.
  • Operational Complexity: Ongoing maintenance—such as managing dependencies, handling upgrades, and ensuring reliability—can be overwhelming, especially as the system grows in complexity.
  • Skill Gaps: Many teams underestimate the level of expertise needed to manage Kubernetes clusters, Python virtual environments, and secure credential storage systems like AWS Secrets Manager.

Example:
A team building Airflow in-house may spend weeks configuring a Kubernetes-backed deployment, managing Python dependencies, and setting up DAG synchronizing files via S3 or Git. While the outcome can be tailored to their needs, the time and expertise required represent a significant investment.

Buying a managed solution

Pros:

  • Faster Time to Value: With a managed solution, your team can get up and running quickly without spending weeks—or months—on setup and configuration.
  • Reduced Operational Overhead: Managed providers handle infrastructure, maintenance, and upgrades, freeing your team to focus on business objectives rather than operational minutiae.
  • Predictable Costs: Managed solutions typically come with transparent pricing models, which can make budgeting simpler compared to the variable costs of in-house built tooling.

Cons:

  • Potentially Less Flexibility: Managed solutions may not allow for the same level of customization as building in-house, which could limit certain niche use cases.
  • Dependency on a Vendor: Relying on a vendor for your analytics stack introduces some level of risk, such as service disruptions or limited migration paths if you decide to switch providers.

Example:

Using a solution like MWAA, teams can leverage managed Airflow eliminating the need for infrastructure worries however it may not have the flexibility or interoperability with other aspects of their stack

Whereas using a solution like Datacoves, teams can leverage managed Airflow and pre-configured environments for dbt Core. This eliminates the need for infrastructure setup, simplifies day-to-day operations, and allows teams to focus on deriving value from their analytics, not maintaining the tools that support them.  

Verdict on build vs buy

There is no universal right answer to the build vs. buy dilemma—every use case is unique. However, it’s important to recognize that many problems have already been solved. Unless there is a compelling reason to reinvent the wheel, leveraging existing solutions can save time, money, and effort.

In Fundamentals of Data Engineering, Joe Reis and Matt Housley emphasize the importance of focusing on delivering insights rather than getting entangled in the complexities of building and maintaining data infrastructure. They advocate for using existing solutions wherever possible to streamline processes and allow teams to concentrate on extracting value from data. The key question to ask is: Will building this solution provide your organization with a competitive edge? If the answer is no, it’s worth seeking out an existing solution that fits your needs. Managed platforms can reduce the need for dedicated personnel as we saw above and provide predictable costs, making them an attractive option for many teams.

This philosophy underpins why we built Datacoves. We believe data teams shouldn’t be bogged down by the operational complexities of tools like dbt and Airflow. And we also believe that Data teams should have access to the flexibility a custom-built solution has to offer. Datacoves offers the flexibility these tools are known for while removing the infrastructure burden, enabling your team to focus on what really matters: generating actionable insights that drive your organization forward.  

Why teams choose Datacoves for dbt and Airflow  

Simplifying complex analytics stacks

Datacoves delivers the best of both worlds: the flexibility of a custom-built open-source solution combined with the rich features and zero-infrastructure maintenance of a managed platform—all with minimal vendor lock-in. How does Datacoves achieve this? By focusing on open-source tools and eliminating the burden of maintenance. Datacoves has already done the challenging work of identifying the best tools for the job, configuring them to work seamlessly together, and optimizing performance.

With Datacoves, your team can stop worrying about infrastructure and focus entirely on generating insights. The platform includes bundled in-browser VS Code, dbt Core, and Python extensions, alongside ready-to-use virtual environments tailored to analytics needs. Add to this a fully managed Airflow experience, and you have a solution where the code remains yours, but the operational headaches are gone.

Managed Airflow with unique features

Datacoves has enhanced Airflow with features designed to make DAG development more intuitive and enjoyable:

  • Flexible Airflow: Datacoves can be used with or without dbt, making it an ideal choice for teams seeking a robust, scalable Airflow solution without the hassle of managing it themselves. Anything you can do with open-source Airflow, you can do with Datacoves—without requiring a dedicated team.
  • Shared “Team Airflow” for Collaboration: This shared space is designed for developing and testing DAGs in a production-like environment, ensuring that buggy DAGs don’t disrupt your production workflows. Team members can collaborate in an isolated development environment, enabling faster iteration and testing.
  • Developer-Specific “My Airflow” for Isolated Workflows: Recognizing the limitations of a shared environment, Datacoves created “My Airflow,” a standalone instance for individual developers. Changes made to DAGs immediately reflect in this instance, allowing for rapid iteration and testing before moving to “Team Airflow” for more robust validation.
  • Streamlined DAG Deployment (S3/Git Sync): Whether syncing DAGs via S3 or Git, Datacoves simplifies the process, eliminating common deployment pain points associated with Airflow orchestration.

Enterprise-grade infrastructure

  • Kubernetes-Backed Scalability: Building a scalable platform often requires Kubernetes expertise, which can be costly and time-intensive. Datacoves eliminates this need with a skilled team that manages Kubernetes deployments, handling upgrades, maintenance, and downtime risks for you.
  • SaaS and Private Cloud Options: Datacoves offers deployment flexibility, accommodating enterprises with strict security requirements. Choose between a SaaS offering or a private deployment within your organization’s VPC, depending on your needs.

Cost predictability

One of the key benefits of Datacoves is the elimination of hidden costs through its all-in-one platform approach. Teams often realize too late that piecing together the modern data stack—combining open-source tools, hosting solutions, and server infrastructure—results in unpredictable costs. A single misstep in configuration can lead to high cloud bills.

Datacoves removes the guesswork. Its optimized infrastructure provides predictable billing for non-variable services, along with clear guidelines for variable costs. By implementing best practices and optimizations, Datacoves ensures that your costs remain as low as possible without sacrificing performance.

Datacoves makes it easier for teams to harness the power of open-source tools like dbt and Airflow, without the operational burden. From simplifying complex workflows to delivering enterprise-grade infrastructure and predictable costs, Datacoves empowers teams to focus on what matters most: driving insights and business value.

Conclusion

Open-source tools like Airflow are incredibly powerful, offering flexibility and extensibility that modern analytics teams need. However, as we have seen, the initial appeal of "free" tools is not true. Actual costs exist in the form of salaries and benefits and hidden costs like costs of implementation, scaling, and long-term maintenance are very real and expensive. Paid solutions are around for a reason and finding the best one that suits your needs is essential.  

The most flexible managed data platform on the market

If your team is looking to scale its analytics stack without the operational burden of managing open-source tools, Datacoves offers the perfect balance of flexibility, simplicity, and cost-efficiency.  Explore Datacoves to learn more about our all-in-one platform for dbt Core and Airflow or check out our case studies and testimonials to see how other teams have accelerated their analytics engineering journey with Datacoves.

optimize dbt slim ci
5 mins read

Any experienced data engineer will tell you that efficiency and resource optimization are always top priorities. One powerful feature that can significantly optimize your dbt CI/CD workflow is dbt Slim CI. However, despite its benefits, some limitations have persisted. Fortunately, the recent addition of the --empty flag in dbt 1.8 addresses these issues. In this article, we will share a GitHub Action Workflow and demonstrate how the new --empty flag can save you time and resources.

What is dbt Slim CI?

dbt Slim CI is designed to make your continuous integration (CI) process more efficient by running only the models that have been changed and their dependencies, rather than running all models during every CI build. In large projects, this feature can lead to significant savings in both compute resources and time.

Key Benefits of dbt Slim CI

  • Speed Up Your Workflows: Slim CI accelerates your CI/CD pipelines by skipping the full execution of all dbt models. Instead, it focuses on only the modified models and their dependencies and uses the defer flag to pull the unmodified models from production. So, if we have model A, B and C yet only make changes to C, then only model C will be run during the CI/CD process.
  • Save Time, Snowflake Credits, and Money: By running only the necessary models, Slim CI helps you save valuable build time and Snowflake credits. This selective approach means fewer computational resources are used, leading to cost savings.

dbt Slim CI flags explained

dbt Slim CI is implemented efficiently using these flags:

--select state:modified:  The state:modified selector allows you to choose the models whose "state" has changed (modified) to be included in the run/build. This is done using the state:modified+ selector which tells dbt to run only the models that have been modified and their downstream dependencies.

--state <path to production manifest>: The --state flag specifies the directory where the artifacts from a previous dbt run are stored ie) the production dbt manifest. By comparing the current branch's manifest with the production manifest, dbt can identify which models have been modified.

--defer: The --defer flag tells dbt to pull upstream models that have not changed from a different environment (database). Why rebuild something that exists somewhere else? For this to work, dbt will need access to the dbt production manifest.

dbt build
dbt CI/CD command

You may have noticed that there is an additional flag in the command above.  

--fail-fast: The --fail-fast flag is an example of an optimization flag that is not essential to a barebones Slim CI but can provide powerful cost savings. This flag stops the build as soon as an error is encountered instead of allowing dbt to continue building downstream models, therefore reducing wasted builds. To learn more about these arguments you can use have a look at our dbt cheatsheet.

dbt Slim CI with Github Actions before dbt 1.8

The following sample Github Actions workflow below is executed when a Pull Request is opened. ie) You have a feature branch that you want to merge into main.

sample Github Actions workflow is executed when a Pull Request is opened
Same Github Action Prior to dbt 1.8

Workflow Steps

Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.

Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.  

List of Files Changed: This command lists the files changed between the PR branch and the base branch, providing context for the changes and helpful for debugging.

Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow.

Create PR Database: This step creates a dedicated database for the PR, isolating the changes and tests from the production environment.

Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.

Run dbt Build in Slim Mode or Run dbt Build Full Run: If a manifest is present in production, dbt will be run in slim mode with deferred models. This build includes only the modified models and their dependencies. If no manifest is present in production we will do a full refresh.

Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.

Generate Docs Combining Production and Branch Catalog: If a dbt test is added to a YAML file, the model will not be run, meaning it will not be present in the PR database. However, governance checks (dbt-checkpoint) will need the model in the database for some checks and if not present this will cause a failure. To solve this, the generate docs step is added to merge the catalog.json from the current branch with the production catalog.json.

Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.

Problems with the dbt CI/CD Workflow

As mentioned in the beginning of the article, there is a limitation to this setup. In the existing workflow, governance checks need to run after the dbt build step. This is because dbt-checkpoint relies on the manifest.json and catalog.json. However, if these governance checks fail, it means that the dbt build step will need to run again once the governance issues are fixed. As shown in the diagram below, after running our dbt build, we proceed with governance checks. If these checks fail, we need to resolve the issue and re-trigger the pipeline, leading to another dbt build. This cycle can lead to unnecessary model builds even when leveraging dbt Slim CI.

ci/cd process before dbt 1.8
dbt CI/CD before dbt 1.8

Leveraging the --empty Flag for Efficient dbt CI/CD Workflows

The solution to this problem is the --empty flag in dbt 1.8. This flag allows dbt to perform schema-only dry runs without processing large datasets. It's like building the wooden frame of a house—it sets up the structure, including the metadata needed for governance checks, without filling it with data. The framework is there, but the data itself is left out, enabling you to perform governance checks without completing an actual build.

Let’s see how we can rework our Github Action:

rework our Github Action
Sample Github Action with dbt 1.8

Workflow Steps

Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.

Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.  

List of Files Changed: This step lists the files changed between the PR branch and the base branch, providing context for the changes and helpful for debugging.

Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow.

Create PR Database: This command creates a dedicated database for the PR, isolating the changes and tests from the production environment.

Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.

*NEW* Governance Run of dbt (Slim or Full) with EMPTY Models: If there is a manifest in production, this step runs dbt with empty models using slim mode and using the empty flag. The models will be built in the PR database with no data inside and we can now use the catalog.json to run our governance checks since the models. Since the models are empty and we have everything we need to run our checks, we have saved on compute costs as well as run time.  

Generate Docs Combining Production and Branch Catalog: If a dbt test is added to a YAML file, the model will not be run, meaning it will not be present in the PR database. However, governance checks (dbt-checkpoint) will need the model in the database for some checks and if not present this will cause a failure. To solve this, the generate docs step is added to merge the catalog.json from the current branch with the production catalog.json.

Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.

Run dbt Build: Runs dbt build using either slim mode or full run after passing governance checks.

Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.

By leveraging the dbt --empty flag, we can materialize models in the PR database without the computational overhead, as the actual data is left out. We can then use the metadata that was generated during the empty build. If any checks fail, we can repeat the process again but without the worry of wasting any computational resources doing an actual build. The cycle still exists but we have moved our real build outside of this cycle and replaced it with an empty or fake build. Once all governance checks have passed, we can proceed with the real dbt build of the dbt models as seen in the diagram below.

ci/cd process after dbt 1.8
dbt CI/CD after dbt 1.8 --empty flag

Conclusion

dbt Slim CI is a powerful addition to the dbt toolkit, offering significant benefits in terms of speed, resource savings, and early error detection. However, we still faced an issue of wasted models when it came to failing governance checks. By incorporating dbt 1.8’s  --empty flag into your CI/CD workflows we can reduce wasted model builds to zero, improving the efficiency and reliability of your data engineering processes.

🔗 Watch the vide where Noel explains the  --empty flag implementation in Github Actions:

Dbt jinja cheat sheet
5 mins read

Jinja templating in dbt offers flexibility and expressiveness that can significantly improve SQL code organization and reusability. There is a learning curve, but this cheat sheet is designed to be a quick reference for data practitioners, helping to streamline the development process and reduce common pitfalls.

Whether you're troubleshooting a tricky macro or just brushing up on syntax, bookmark this page. Trust us, it will come in handy and help you unlock the full potential of Jinja in your dbt projects.

If you find this cheat sheet useful, be sure to check out our Ultimate dbt Jinja Functions Cheat Sheet. It covers the specialized Jinja functions created by dbt, designed to enhance versatility and expedite workflows.

dbt Jinja: Basic syntax

This is the foundational syntax of Jinja, from how to comment to the difference between statements and expressions.

dbt Jinja: Variable assignment 

Define and assign variables in different data types such as strings, lists, and dictionaries.

dbt Jinja: White space control 

Jinja allows fine-grained control over white spaces in compiled output. Understand how to strategically strip or maintain spaces.

       

dbt Jinja: Control flow

In dbt, conditional structures guide the flow of transformations. Grasp how to integrate these structures seamlessly.

Control Flow
If/elif/else/endif
{%- if target.name == 'dev' -%}
{{ some code }}
{%- elif target.name == 'prod' -%}
{{ some other code }}
{%- else -%}
{{ some other code }}
{%- endif -%}

dbt Jinja: Looping

Discover how to iterate over lists and dictionaries. Understand  simple loop syntax or accessing loop properties.

Looping
Loop Syntax
{%- for item in my_iterable -%}
  --Do something with item
  {{ item }}
{%- endfor -%}
loop.last
This boolean is False unless the current iteration is the last iteration.
          {% for item in list %}
  {% if loop.last %}   
    --This is the last item
    {{ item }}
  {% endif %}
{% endfor %}
loop.first
A boolean that is True if the current iteration is the first iteration, otherwise False.
{% for item in list %}
  {% if loop.first %}
    --first item
    {{ item }}
  {% endif %}
{% endfor %}
loop.index
An integer representing the current iteration of the loop (1-indexed). So, the first iteration would have loop.index of 1, the second would be 2, and so on.
{% for item in list %}
   --This is item number
   {{ loop.index }}
{% endfor %}
Looping a List
{% set rating_categories = ["quality_rating",
                            "design_rating",
                            "usability_rating"] %}
SELECT product_id,
 {%- for col_name in rating_categories -%}
   AVG({{ col_name }}) as {{ column_name }}_average
   {%- if not loop.last  -%} 
     , 
   {%- endif -%}
 {%- endfor -%}
 FROM product_reviews
 GROUP BY 1

Compiled code
SELECT product_id,
   AVG(quality_rating) as quality_rating_average,
   AVG(design_rating) as design_rating_average,
   AVG(usability_rating) as usability_rating_average
FROM product_reviews
GROUP BY 1
Looping a Dictionary
{% set delivery_type_dict = {"a": "digital_download",
                             "b": "physical_copy"} %}
SELECT order_id,
{%- for type, column_name in delivery_type_dict.items() -%}
COUNT(CASE 
      WHEN delivery_method = '{{ type }}' THEN order_id 
      END) as {{ column_name }}_count
      {%- if not loop.last  -%}
       , 
      {%- endif -%}
      {%- endfor -%}
FROM order_deliveries
GROUP BY 1

SELECT order_id,
COUNT(CASE 
      WHEN delivery_method = 'a' THEN order_id 
      END) as digital_download_count,
COUNT(CASE 
      WHEN delivery_method = 'b' THEN order_id 
      END) as physical_copy_count
FROM order_deliveries
GROUP BY 1

dbt Jinja: Operators 

These logical and comparison operators come in handy, especially when defining tests or setting up configurations in dbt.

Logic Operators
and
{% if condition1 and condition2 %}
or
{% if condition1 or condition2 %}
not
{{  not condition1 }}

Comparison Operators
Equal To
{% if 1 == 2 %}
Not Equal To
{% if 1 != 2 %}
Greater Than
{% if 1 > 2 %}
Less Than
{% if 1 < 2 %}
Greater Than or Equal to
{% if 1 >= 2 %}
Less Than or Equal To
{% if 1 <= 2 %}

dbt Jinja: Variable tests

Within dbt, you may need to validate if a variable is defined or a if a value is odd or even. These Jinja Variable tests allow you to validate with ease.

Jinja Variable Tests
Is Defined
                  {% if my_variable is defined %}
-- Handle conditions when variable exists
{% endif %}
Is None

{% if my_variable is none %}
-- Handle absence of my_variable
{% endif %}
Is Even

{% if my_variable is even %}
-- Handle when my_variable is even
{% endif %}
Is Odd

{% if my_variable is odd %}
-- Handle when my_variable is odd
{% endif %}
Is a String

{% if my_variable is string %}
-- Handle when my_variable is a string
{% endif %}
Is a Number

{% if my_variable is number %}
-- Handle when my_variable is a number
{% endif %}

dbt Jinja: Creating macros & tests

Macros are the backbone of advanced dbt workflows. Review how to craft these reusable code snippets and also how to enforce data quality with tests.

Creating Macros & Tests
Define a Macro
Write your macros in your project's macros directory.
{% macro ms_to_sec(col_name, precision=3) %}   
  ( {{ col_name }} / 1000 )::numeric(16, {{ precision }})   
{% endmacro %}
Use a Macro from a Model
In a model:
SELECT order_id,       
  {{ ms_to_sec(col_name=time_ms, precision=3) }} as time_sec
FROM order_timings;

Compiled code:
SELECT order_id,
(time_ms/ 1000 )::numeric(16, 3) AS time_sec
FROM order_timings;
Run a Macro from the Terminal
Define in your macros directory. Ex)macros/create_schema_macro.sql:
{% macro create_schema(schema_name) %}
    CREATE SCHEMA IF NOT EXISTS {{ schema_name }};
{% endmacro %}

In Termial:

dbt run-operation create_schema --args '{"schema_name": "my_new_schema"}'
Define a Generic Test
Generic Tests used to be defined in the macros directory. It is now recommended to write your Generic Tests in the tests/generic directory.

{% test over_10000(model, column_name) %}
  SELECT {{column_name}} 
  FROM {{ model }}   
  WHERE {{column_name}} > 10000     
{% endtest %}
Use a Generic test
In models/schema.yml add the generic test to the model and column you wish to test.
version: 2

models:
  - name: my_model
    columns:
      - name: column_to_test
        tests:
          - over_10000
          - not_null
Define a Singular Test
Write your dbt Singular tests in the tests directory and give it a descriptive name. Ex) test/test_suspicious_refunds.sql
    
SELECT order_id, 
SUM(CASE
    WHEN amount < 0 THEN amount 
    ELSE 0 
    END) as total_refunded_amount,       
COUNT(CASE 
     WHEN amount < 0 THEN 1 
     END) as number_of_refunds  
FROM {{ ref('my_model') }}  
GROUP BY 1   
HAVING number_of_refunds > 5

dbt Jinja: Filters (aka Methods)

Fine-tune your dbt data models with these transformation and formatting utilities.

String Manipulation
Lower
{{ "DATACOVES" | lower }} => "datacoves"
Upper
{{ "datacoves" | upper }} => "DATACOVES"
Default
{{ variable_name | default("Default Value") }}    
If value exists => "Sample Value"
If value does not exist => "Default Value"
Trim
{{ "Datacoves   " | trim }} => "Datacoves"  
Replace
{{ "Datacoves" | replace("v", "d") }} => "Datacodes" 
Length
{{ "Datacoves" | length }} => 9
Capitalize
{{ "datacoves" | capitalize }} => "Datacoves"  
Title
{{ "datacoves managed platform" | capitalize }}
  => "Datacoves managed platform”
Repeat a String
{{ print('-' * 20) }}
Substring
{{ "Datacoves"[0:4] }} => "Data"
Split
{{ "Data coves".split(' ') }} => ["Data", "coves"]  

Number Manipulation
Int
{{ "20" | int }} => 20 
Float
{{ 20 | float }} => 20.0 
Rounding to Nearest Whole Number
{{ 20.1434 | round }} => 21
Rounding to a Specified Decimal Place
{{ 20.1434 | round(2) }} => 20.14
Rounding Down (Floor Method)
{{ 20.5 | round(method='floor') }} => 20 
Rounding Up (Ceil Method)
{{ 20.5 | round(method='ceil') }} => 21

Please contact us with any errors or suggestions.

Get our free ebook dbt Cloud vs dbt Core

Get the PDF