CI/CD With dbt Slim CI: Optimize Using dbt 1.8 --empty Flag

5 mins read

Any experienced data engineer will tell you that efficiency and resource optimization are always top priorities. One powerful feature that can significantly optimize your dbt CI/CD workflow is dbt Slim CI. However, despite its benefits, some limitations have persisted. Fortunately, the recent addition of the --empty flag in dbt 1.8 addresses these issues. In this article, we will share a GitHub Action Workflow and demonstrate how the new --empty flag can save you time and resources.

What is dbt Slim CI?

dbt Slim CI is designed to make your continuous integration (CI) process more efficient by running only the models that have been changed and their dependencies, rather than running all models during every CI build. In large projects, this feature can lead to significant savings in both compute resources and time.

Key Benefits of dbt Slim CI

Speed Up Your Workflows: Slim CI accelerates your CI/CD pipelines by skipping the full execution of all dbt models. Instead, it focuses on only the modified models and their dependencies and uses the defer flag to pull the unmodified models from production. So, if we have model A, B and C yet only make changes to C, then only model C will be run during the CI/CD process.

Save Time, Snowflake Credits, and Money: By running only the necessary models, Slim CI helps you save valuable build time and Snowflake credits. This selective approach means fewer computational resources are used, leading to cost savings.

dbt Slim CI flags explained

dbt Slim CI is implemented efficiently using these flags:

--select state:modified: The state:modified selector allows you to choose the models whose "state" has changed (modified) to be included in the run/build. This is done using the state:modified+ selector which tells dbt to run only the models that have been modified and their downstream dependencies.

--state <path to production manifest>: The --state flag specifies the directory where the artifacts from a previous dbt run are stored ie) the production dbt manifest. By comparing the current branch's manifest with the production manifest, dbt can identify which models have been modified.

--defer: The --defer flag tells dbt to pull upstream models that have not changed from a different environment (database). Why rebuild something that exists somewhere else? For this to work, dbt will need access to the dbt production manifest.

You may have noticed that there is an additional flag in the command above.

--fail-fast: The --fail-fast flag is an example of an optimization flag that is not essential to a barebones Slim CI but can provide powerful cost savings. This flag stops the build as soon as an error is encountered instead of allowing dbt to continue building downstream models, therefore reducing wasted builds. To learn more about these arguments you can use have a look at our dbt cheatsheet.

dbt Slim CI with Github Actions before dbt 1.8

The following sample Github Actions workflow below is executed when a Pull Request is opened. ie) You have a feature branch that you want to merge into main.

sample Github Actions workflow is executed when a Pull Request is opened

Workflow Steps

Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.

Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.

List of Files Changed: This command lists the files changed between the PR branch and the base branch, providing context for the changes and helpful for debugging.

Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow. ‍

Create PR Database: This step creates a dedicated database for the PR, isolating the changes and tests from the production environment.

Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.

Run dbt Build in Slim Mode or Run dbt Build Full Run: If a manifest is present in production, dbt will be run in slim mode with deferred models. This build includes only the modified models and their dependencies. If no manifest is present in production we will do a full refresh.

Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.

Generate Docs Combining Production and Branch Catalog: If a dbt test is added to a YAML file, the model will not be run, meaning it will not be present in the PR database. However, governance checks (dbt-checkpoint) will need the model in the database for some checks and if not present this will cause a failure. To solve this, the generate docs step is added to merge the catalog.json from the current branch with the production catalog.json.

Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.

Problems with the dbt CI/CD Workflow

As mentioned in the beginning of the article, there is a limitation to this setup. In the existing workflow, governance checks need to run after the dbt build step. This is because dbt-checkpoint relies on the manifest.json and catalog.json. However, if these governance checks fail, it means that the dbt build step will need to run again once the governance issues are fixed. As shown in the diagram below, after running our dbt build, we proceed with governance checks. If these checks fail, we need to resolve the issue and re-trigger the pipeline, leading to another dbt build. This cycle can lead to unnecessary model builds even when leveraging dbt Slim CI.

‍

Leveraging the --empty Flag for Efficient dbt CI/CD Workflows

The solution to this problem is the --empty flag in dbt 1.8. This flag allows dbt to perform schema-only dry runs without processing large datasets. It's like building the wooden frame of a house—it sets up the structure, including the metadata needed for governance checks, without filling it with data. The framework is there, but the data itself is left out, enabling you to perform governance checks without completing an actual build.

Let’s see how we can rework our Github Action:

Workflow Steps

Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.

Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.

List of Files Changed: This step lists the files changed between the PR branch and the base branch, providing context for the changes and helpful for debugging.

Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow. ‍

Create PR Database: This command creates a dedicated database for the PR, isolating the changes and tests from the production environment.

Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.

*NEW* Governance Run of dbt (Slim or Full) with EMPTY Models: If there is a manifest in production, this step runs dbt with empty models using slim mode and using the empty flag. The models will be built in the PR database with no data inside and we can now use the catalog.json to run our governance checks since the models. Since the models are empty and we have everything we need to run our checks, we have saved on compute costs as well as run time.

Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.

‍Run dbt Build: Runs dbt build using either slim mode or full run after passing governance checks.

Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.

By leveraging the dbt --empty flag, we can materialize models in the PR database without the computational overhead, as the actual data is left out. We can then use the metadata that was generated during the empty build. If any checks fail, we can repeat the process again but without the worry of wasting any computational resources doing an actual build. The cycle still exists but we have moved our real build outside of this cycle and replaced it with an empty or fake build. Once all governance checks have passed, we can proceed with the real dbt build of the dbt models as seen in the diagram below.

‍

Conclusion

dbt Slim CI is a powerful addition to the dbt toolkit, offering significant benefits in terms of speed, resource savings, and early error detection. However, we still faced an issue of wasted models when it came to failing governance checks. By incorporating dbt 1.8’s --empty flag into your CI/CD workflows we can reduce wasted model builds to zero, improving the efficiency and reliability of your data engineering processes.

🔗 Watch the vide where Noel explains the --empty flag implementation in Github Actions:

Dbt & Airflow

Ultimate dbt Jinja Cheat Sheet

5 mins read

Jinja templating in dbt offers flexibility and expressiveness that can significantly improve SQL code organization and reusability. There is a learning curve, but this cheat sheet is designed to be a quick reference for data practitioners, helping to streamline the development process and reduce common pitfalls.

Whether you're troubleshooting a tricky macro or just brushing up on syntax, bookmark this page. Trust us, it will come in handy and help you unlock the full potential of Jinja in your dbt projects.

If you find this cheat sheet useful, be sure to check out our Ultimate dbt Jinja Functions Cheat Sheet. It covers the specialized Jinja functions created by dbt, designed to enhance versatility and expedite workflows.

dbt Jinja: Basic syntax

This is the foundational syntax of Jinja, from how to comment to the difference between statements and expressions.

Basic Syntax
Comments	`{# Example Comment #}`
Statements	`{% … %} e.g.: for, if`
Expressions	`{{ … }} e.g.: ref(), source(),`

dbt Jinja: Variable assignment

Define and assign variables in different data types such as strings, lists, and dictionaries.

Variable Assignment
Strings	`{% set my_string = "example" %}`
Lists	`{% set my_list = ["apple", "lemon"] %}`
Dictionaries	`{% set my_dict = {"fruit_1": "apple", "fruit_2": "lemon"} %}`

‍

dbt Jinja: White space control

Jinja allows fine-grained control over white spaces in compiled output. Understand how to strategically strip or maintain spaces.

White Space Control
Strip Before	`{%- ... %}`
Strip After	`{% ... -%}`
String Before and After	`{%- ... -%}`

dbt Jinja: Control flow

In dbt, conditional structures guide the flow of transformations. Grasp how to integrate these structures seamlessly.

Control Flow
If/elif/else/endif	`{%- if target.name == 'dev' -%} {{ some code }} {%- elif target.name == 'prod' -%} {{ some other code }} {%- else -%} {{ some other code }} {%- endif -%}`

dbt Jinja: Looping

Discover how to iterate over lists and dictionaries. Understand simple loop syntax or accessing loop properties.

Looping
Loop Syntax	`{%- for item in my_iterable -%} --Do something with item {{ item }} {%- endfor -%}`
loop.last	This boolean is False unless the current iteration is the last iteration. `{% for item in list %} {% if loop.last %} --This is the last item {{ item }} {% endif %} {% endfor %}`
loop.first	A boolean that is True if the current iteration is the first iteration, otherwise False. `{% for item in list %} {% if loop.first %} --first item {{ item }} {% endif %} {% endfor %}`
loop.index	An integer representing the current iteration of the loop (1-indexed). So, the first iteration would have loop.index of 1, the second would be 2, and so on. `{% for item in list %} --This is item number {{ loop.index }} {% endfor %}`
Looping a List	`{% set rating_categories = ["quality_rating", "design_rating", "usability_rating"] %}` `SELECT product_id, {%- for col_name in rating_categories -%} AVG({{ col_name }}) as {{ column_name }}_average {%- if not loop.last -%} , {%- endif -%} {%- endfor -%} FROM product_reviews GROUP BY 1` Compiled code `SELECT product_id, AVG(quality_rating) as quality_rating_average, AVG(design_rating) as design_rating_average, AVG(usability_rating) as usability_rating_average FROM product_reviews GROUP BY 1`
Looping a Dictionary	`{% set delivery_type_dict = {"a": "digital_download", "b": "physical_copy"} %}` `SELECT order_id, {%- for type, column_name in delivery_type_dict.items() -%} COUNT(CASE WHEN delivery_method = '{{ type }}' THEN order_id END) as {{ column_name }}_count {%- if not loop.last -%} , {%- endif -%} {%- endfor -%} FROM order_deliveries GROUP BY 1` `SELECT order_id, COUNT(CASE WHEN delivery_method = 'a' THEN order_id END) as digital_download_count, COUNT(CASE WHEN delivery_method = 'b' THEN order_id END) as physical_copy_count FROM order_deliveries GROUP BY 1`

dbt Jinja: Operators

These logical and comparison operators come in handy, especially when defining tests or setting up configurations in dbt.

Logic Operators
and	`{% if condition1 and condition2 %}`
or	`{% if condition1 or condition2 %}`
not	`{{ not condition1 }}`

‍

Comparison Operators
Equal To	`{% if 1 == 2 %}`
Not Equal To	`{% if 1 != 2 %}`
Greater Than	`{% if 1 > 2 %}`
Less Than	`{% if 1 < 2 %}`
Greater Than or Equal to	`{% if 1 >= 2 %}`
Less Than or Equal To	`{% if 1 <= 2 %}`

dbt Jinja: Variable tests

Within dbt, you may need to validate if a variable is defined or a if a value is odd or even. These Jinja Variable tests allow you to validate with ease.

Jinja Variable Tests
Is Defined	`{% if my_variable is defined %} -- Handle conditions when variable exists {% endif %}`
Is None	`{% if my_variable is none %} -- Handle absence of my_variable {% endif %}`
Is Even	`{% if my_variable is even %} -- Handle when my_variable is even {% endif %}`
Is Odd	`{% if my_variable is odd %} -- Handle when my_variable is odd {% endif %}`
Is a String	`{% if my_variable is string %} -- Handle when my_variable is a string {% endif %}`
Is a Number	`{% if my_variable is number %} -- Handle when my_variable is a number {% endif %}`

dbt Jinja: Creating macros & tests

Macros are the backbone of advanced dbt workflows. Review how to craft these reusable code snippets and also how to enforce data quality with tests.

Creating Macros & Tests
Define a Macro	Write your macros in your project's macros directory. `{% macro ms_to_sec(col_name, precision=3) %} ( {{ col_name }} / 1000 )::numeric(16, {{ precision }}) {% endmacro %}`
Use a Macro from a Model	In a model: `SELECT order_id, {{ ms_to_sec(col_name=time_ms, precision=3) }} as time_sec FROM order_timings;` Compiled code: `SELECT order_id, (time_ms/ 1000 )::numeric(16, 3) AS time_sec FROM order_timings;`
Run a Macro from the Terminal	Define in your macros directory. Ex)macros/create_schema_macro.sql: `{% macro create_schema(schema_name) %} CREATE SCHEMA IF NOT EXISTS {{ schema_name }}; {% endmacro %}` In Termial: dbt run-operation create_schema --args '{"schema_name": "my_new_schema"}'
Define a Generic Test	Generic Tests used to be defined in the macros directory. It is now recommended to write your Generic Tests in the tests/generic directory. `{% test over_10000(model, column_name) %} SELECT {{column_name}} FROM {{ model }} WHERE {{column_name}} > 10000 {% endtest %}`
Use a Generic test	In models/schema.yml add the generic test to the model and column you wish to test. `version: 2 models: - name: my_model columns: - name: column_to_test tests: - over_10000 - not_null`
Define a Singular Test	Write your dbt Singular tests in the tests directory and give it a descriptive name. Ex) test/test_suspicious_refunds.sql `SELECT order_id, SUM(CASE WHEN amount < 0 THEN amount ELSE 0 END) as total_refunded_amount, COUNT(CASE WHEN amount < 0 THEN 1 END) as number_of_refunds FROM {{ ref('my_model') }} GROUP BY 1 HAVING number_of_refunds > 5`

dbt Jinja: Filters (aka Methods)

Fine-tune your dbt data models with these transformation and formatting utilities.

String Manipulation
Lower	`{{ "DATACOVES" \| lower }} => "datacoves"`
Upper	`{{ "datacoves" \| upper }} => "DATACOVES"`
Default	`{{ variable_name \| default("Default Value") }} If value exists => "Sample Value" If value does not exist => "Default Value"`
Trim	`{{ "Datacoves " \| trim }} => "Datacoves"`
Replace	`{{ "Datacoves" \| replace("v", "d") }} => "Datacodes"`
Length	`{{ "Datacoves" \| length }} => 9`
Capitalize	`{{ "datacoves" \| capitalize }} => "Datacoves"`
Title	`{{ "datacoves managed platform" \| capitalize }} => "Datacoves managed platform”`
Repeat a String	`{{ print('-' * 20) }}`
Substring	`{{ "Datacoves"[0:4] }} => "Data"`
Split	`{{ "Data coves".split(' ') }} => ["Data", "coves"]`

‍

Number Manipulation
Int	`{{ "20" \| int }} => 20`
Float	`{{ 20 \| float }} => 20.0`
Rounding to Nearest Whole Number	`{{ 20.1434 \| round }} => 21`
Rounding to a Specified Decimal Place	`{{ 20.1434 \| round(2) }} => 20.14`
Rounding Down (Floor Method)	`{{ 20.5 \| round(method='floor') }} => 20`
Rounding Up (Ceil Method)	`{{ 20.5 \| round(method='ceil') }} => 21`

Please contact us with any errors or suggestions.

Dbt & Airflow

dbt Won’t Fix Your Data Maturity Problem

5 mins read

Implementing dbt (data build tool) can revolutionize your organization's data maturity, however, if your organization is not ready to take advantage of the benefits of dbt it might not be the right time to start. Why? Because the success of data initiatives often hinges on aspects beyond the tooling itself.

Many companies rush into implementing dbt without assessing their organization’s maturity and this leads to poor implementation. The consequences that come from a poorly implemented dbt initiative can leave the organization frustrated, overwhelmed with technical debt, and wasted resources. To avoid these pitfalls and ensure your organization is truly ready for dbt, you should complete an assessment of your organization's readiness by answering the questions presented later in this article.

What is Data Maturity?

Before diving into the maturity assessment questions, it’s important to understand what data maturity means. Data maturity is the extent to which an organization can effectively leverage its data to drive business value. It encompasses multiple areas, including:

Data-Driven Culture: Fostering an environment where data is integral to decision-making processes.

Data Quality: Ensuring data is accurate, consistent, and reliable.

Data Governance: Implementing policies and procedures to manage data assets.

Data Integration: Seamlessly combining data from various sources for a unified view.

A mature data organization not only ensures data accuracy and consistency but also embeds data-driven decision-making into its core operations.

How dbt Improves Data Maturity

By leveraging dbt's features, organizations can significantly enhance their data maturity, leading to better decision-making, improved data quality, robust governance, and seamless integration. For example:

Data-Driven Culture: By using dbt, you can improve many aspects that contribute to creating a data-driven culture within an organization. One way is by encouraging business users to be involved in providing or reviewing accurate model and column descriptions which are embedded in dbt. You can also involve them in defining what data to test with dbt. Better Data Quality will improve trust in the data. More trust in the data will always lead to more frequent use and reliance on it.

Data Quality and Observability: dbt enables automated testing and validation of data transformations. This ensures data quality by catching issues like schema changes or data anomalies early in the pipeline. As your data quality and data observability needs grow you can assess where you are on the data maturity curve. For example, in a sales data model, we can write tests to ensure there are no negative order quantities and that each order has a valid customer ID. With dbt you can also understand data lineage and this can improve impact and root cause analysis when insights don’t seem quite right.

Data Governance: dbt facilitates version control and documentation for all transformations, enhancing transparency and accountability. Organizations can track changes to data models ensuring compliance with data governance policies.

Data Integration: dbt supports the integration of data from multiple sources by providing a framework for consistent and reusable transformations. This allows for the creation of unified data models that provide a holistic view of business operations.

dbt Readiness Assessment

Now that we understand what data maturity is and how dbt can help improve it, you might be ready to jump on the dbt bandwagon. But first, we encourage you to assess your organization’s readiness for dbt. The journey to data maturity involves not only choosing the right tools but also ensuring that your organization is philosophically and operationally prepared to take full advantage of these tools. It is important to recognize that dbt’s approach requires a shift in mindset towards modern data practices, emphasizing transparency, collaboration, and automation.

To determine if your organization is mature enough for dbt or if dbt is the right fit, consider the following assessment questions:

Are you philosophically aligned?

dbt requires a philosophical alignment with its principles, such as ELT (Extract, Load, Transform) instead of the traditional ETL (Extract, Transform, Load) approach. dbt is also based on idempotency meaning that given the same input, you will always get the same output. This is different than traditional ETL that may use incompatible constructs like Auto-Incrementing Primary Keys. If your organization prefers processes that are incompatible with dbt’s methodology, you will face challenges fighting the dbt framework to make it do something it was not intended to do.

Are you just going to do a lift and shift?

Simply migrating existing processes and code to dbt without rethinking them won’t leverage dbt’s full potential. Assess whether you’re ready to redesign your workflows to take advantage of dbt’s capabilities such as incremental tables, snapshots, seeds, etc.

Are you going to take advantage of features like data quality and documentation?

dbt offers excellent features for data quality and documentation. Evaluate if your team is prepared to prioritize the utilization of these features to enhance transparency and trust in your data. Tests and model descriptions will not write themselves. When it comes to good descriptions, they shouldn't come from a data engineering team that does not know how the data is used or the best data quality rules to implement. Good descriptions must involve business user review at a minimum.

Are you going to open up dbt to teams outside of IT?

The goal of dbt is to empower various teams including IT and business users by using the same tooling. Consider if your organization is ready to foster this cross-functional collaboration. When you implement dbt correctly, you will empower anyone who knows SQL to contribute. You can have multiple teams contribute to the insight delivery process and still ensure proper governance and testing before updating production.

Are you going to automate processes like deployment and testing?

Automation is key to achieving efficiency with dbt. Implementing automated deployment, testing, and CI/CD pipelines can significantly improve your workflows. If you aren’t ready to automate, the benefits of dbt may not be fully realized. If you simply put in dbt without thinking about the end-to-end process and the failure points, you will miss opportunities for errors. The spaghetti code you have today didn't happen just because you were not using dbt.

Are you changing more than just tooling?

dbt is a framework, not a silver bullet. Merely changing tools without altering your underlying processes will not solve existing issues. This is a huge issue with organizations that have not done the work to create a data-driven culture. Assess if your team is ready to adopt better naming conventions and more structured processes to make data more understandable.

Conclusion

Data immaturity might manifest as a reliance on manual processes, lack of data quality controls, or poor documentation practices. These factors can derail the effective implementation of dbt since dbt thrives in environments where data practices are robust and standardized. In other words, dbt alone will not solve these problems.

Ensuring your organization is ready for the changes that come with implementing dbt is not just best practice, it is essential for success. By thoroughly assessing your readiness, you can avoid technical debt, optimize your workflows, and fully harness the power of dbt. Remember, dbt is a powerful tool, but its effectiveness depends on the readiness of your organization to improve data practices and its alignment with dbt’s philosophy.

Dbt & Airflow

dbt Alternatives: 10 Platforms Compared (Transformation Guide)

5 mins read

The top dbt alternatives include Datacoves, SQLMesh, Bruin Data, Dataform, and visual ETL tools such as Alteryx, Matillion, and Informatica. Code-first engines offer stronger rigor, testing, and CI/CD, while GUI platforms emphasize ease of use and rapid prototyping. Teams choose these alternatives when they need more security, governance, or flexibility than dbt Core or dbt Cloud provide.

The top dbt alternatives include Datacoves, SQLMesh, Bruin Data, Dataform, and GUI-based ETL tools such as Alteryx, Matillion, and Informatica.

The top dbt alternatives we will cover are:

dbt Cloud Alternatives
1. Datacoves
2. DIY dbt Core
Code-Based ETL Tools
1. SQLMesh
2. Dataform
3. AWS Glue
4. Bruin Data
Graphical ETL Tools
1. Matillion
2. Informatica
3. Alteryx
4. Azure Data Factory
5. Talend
6. SSIS

Why Teams Look for dbt Alternatives

Teams explore dbt alternatives when they need stronger governance, private deployments, or support for Python and code-first workflows that go beyond SQL. Many also prefer GUI-based ETL tools for faster onboarding. Recent market consolidation, including Fivetran acquiring SQLMesh and merging with dbt Labs, has increased concerns about vendor lock-in, which makes tool neutrality and platform flexibility more important than ever.

Teams look for dbt alternatives when they need stronger orchestration, consistent development environments, Python support, or private cloud deployment options that dbt Cloud does not provide.

Categories of dbt Alternatives

Organizations evaluating dbt alternatives typically compare tools across three categories. Each category reflects a different approach to data transformation, development preferences, and organizational maturity.

Category	Best For	Key Trade-Offs
dbt Cloud Alternatives	Teams that want dbt with stronger security, governance, or private/VPC deployment	Requires aligning the platform with your security, governance, and deployment needs
Code-Based ETL Tools	Engineering-first teams that want CI/CD, testing, Python workflows, and strict modeling guardrails	Have smaller communities and ecosystems compared to mature SQL-based tools like dbt
GUI-Based ETL Tools	Mixed-skill teams that prefer drag-and-drop development and faster onboarding	Less flexible for complex SQL modeling, testing, and version-controlled workflows

dbt Cloud Alternatives

Organizations consider alternatives to dbt Cloud when they need more flexibility, stronger security, or support for development workflows that extend beyond dbt. Teams comparing platform options often begin by evaluating the differences between dbt Cloud vs dbt Core.

Running enterprise-scale ELT pipelines often requires a full orchestration layer, consistent development environments, and private deployment options that dbt Cloud does not provide. Costs can also increase at scale (see our breakdown of dbt pricing considerations), and some organizations prefer to avoid features that are not open source to reduce long-term vendor lock-in.

This category includes platforms that deliver the benefits of dbt Cloud while providing more control, extensibility, and alignment with enterprise data platform requirements.

Datacoves

Datacoves provides a secure, flexible platform that supports dbt, SQLMesh, and Bruin in a unified environment with private cloud or VPC deployment.

Datacoves is an enterprise data platform that serves as a secure, flexible alternative to dbt Cloud. It supports dbt Core, SQLMesh, and Bruin inside a unified development and orchestration environment, and it can be deployed in your private cloud or VPC for full control over data access and governance.

Benefits

Flexibility and Customization:
Datacoves provides a customizable in-browser VS Code IDE, Git workflows, and support for Python libraries and VS Code extensions. Teams can choose the transformation engine that fits their needs without being locked into a single vendor.

Handling Enterprise Complexity:
Datacoves includes managed Airflow for end-to-end orchestration, making it easy to run dbt and Airflow together without maintaining your own infrastructure. It standardizes development environments, manages secrets, and supports multi-team and multi-project workflows without platform drift.

Cost Efficiency:
Datacoves reduces operational overhead by eliminating the need to maintain separate systems for orchestration, environments, CI, logging, and deployment. Its pricing model is predictable and designed for enterprise scalability.

Data Security and Compliance:
Datacoves can be deployed fully inside your VPC or private cloud. This gives organizations complete control over identity, access, logging, network boundaries, and compliance with industry and internal standards.

Reduced Vendor Lock-In:
Datacoves supports dbt, SQLMesh, and Bruin Data, giving teams long-term optionality. This avoids being locked into a single transformation engine or vendor ecosystem.

Capability	Datacoves	dbt Cloud
Supported Transformation Engines	dbt Core, SQLMesh, Bruin Data	dbt only
Deployment Model	SaaS or Private Cloud/VPC deployment	SaaS only
Integrated Orchestration	Built-in Airflow with full DAG control	Built-in dbt scheduler (limited orchestration)
Development Environment	In-browser VS Code with extensions, Python, and dbt	Local VS Code integration and web-based dbt IDE
Environment Consistency	Standardized dev environment across users	Standardized dbt development environment
Security & Compliance	Full control in Private Cloud/VPC; SaaS option available	Depends on dbt Cloud’s SaaS environment
Governance & DevOps	Editable GitHub Actions and full CI/CD control	Standardized CI/CD workflow
Ingestion & Python Workloads	Supports Python development and Airflow-based orchestration for ingestion pipelines	Requires additional ingestion tools or processes and does not support Python development in the IDE

DIY dbt Core

Running dbt Core yourself is a flexible option that gives teams full control over how dbt executes. It is also the most resource-intensive approach. Teams choosing DIY dbt Core must manage orchestration, scheduling, CI, secrets, environment consistency, and long-term platform maintenance on their own.

Benefits

Full Control:
Teams can configure dbt Core exactly as they want and integrate it with internal tools or custom workflows.

Cost Flexibility:
There are no dbt Cloud platform fees, but total cost of ownership often increases as the system grows.

Considerations

High Maintenance Overhead:
Teams must maintain Airflow or another orchestrator, build CI pipelines, manage secrets, and keep development environments consistent across users.

Requires Platform Engineering Skills:
DIY dbt Core works best for teams with strong Kubernetes, CI, Python, and DevOps expertise. Without this expertise, the environment becomes fragile over time.

Slow to Scale:
As more engineers join the team, keeping dbt environments aligned becomes challenging. Onboarding, upgrades, and platform drift create operational friction.

Security and Compliance Responsibility:
Identity, permissions, logging, and network controls must be designed and maintained internally, which can be significant for regulated organizations.

dbt alternatives – Code based ETL tools

Teams that prefer code-first tools often look for dbt alternatives that provide strong SQL modeling, Python support, and seamless integration with CI/CD workflows and automated testing. These are part of a broader set of data transformation tools. Code-based ETL tools give developers greater control over transformations, environments, and orchestration patterns than GUI platforms. Below are four code-first contenders that organizations should evaluate.

Code-first dbt alternatives like SQLMesh, Bruin Data, and Dataform provide stronger CI/CD integration, automated testing, and more control over complex transformation workflows.

SQLMesh

SQLMesh is an open-source framework for SQL and Python-based data transformations. It provides strong visibility into how changes impact downstream models and uses virtual data environments to preview changes before they reach production.

Benefits

Efficient Development Environments:
Virtual environments reduce unnecessary recomputation and speed up iteration.

Considerations

Part of the Fivetran Ecosystem:
SQLMesh was acquired by Fivetran, which may influence its future roadmap and level of independence.

Dataform

Dataform is a SQL-based transformation framework focused specifically for BigQuery. It enables teams to create table definitions, manage dependencies, document models, and configure data quality tests inside the Google Cloud ecosystem. It also provides version control and integrates with GitHub and GitLab.

Benefits

Centralized BigQuery Development:
Dataform keeps all modeling and testing within BigQuery, reducing context switching and making it easier for teams to collaborate using familiar SQL workflows.

Considerations

Focused Only on the GCP Ecosystem:
Because Dataform is geared toward BigQuery, it may not be suitable for organizations that use multiple cloud data warehouses.

AWS Glue

AWS Glue is a serverless data integration service that supports Python-based ETL and transformation workflows. It works well for organizations operating primarily in AWS and provides native integration with services like S3, Lambda, and Athena.

Benefits

Python-First ETL in AWS:
Glue supports Python scripts and PySpark jobs, making it a good fit for engineering teams already invested in the AWS ecosystem.

Considerations

Requires Engineering Expertise:
Glue can be complex to configure and maintain, and its Python-centric approach may not be ideal for SQL-first analytics teams.

Bruin Data

Bruin is a modern SQL-based data modeling framework designed to simplify development, testing, and environment-aware deployments. It offers a familiar SQL developer experience while adding guardrails and automation to help teams manage complex transformation logic.

Benefits

Modern SQL Modeling Experience:
Bruin provides a clean SQL-first workflow with strong dependency management and testing.

Considerations

Growing Ecosystem:
Bruin is newer than dbt and has a smaller community and fewer third-party integrations.

dbt alternatives – Graphical ETL tools

While code-based transformation tools provide the most flexibility and long-term maintainability, some organizations prefer graphical user interface (GUI) tools. These platforms use visual, drag-and-drop components to build data integration and transformation workflows. Many of these platforms fall into the broader category of no-code ETL tools. GUI tools can accelerate onboarding for teams less comfortable with code editors and may simplify development in the short term. Below are several GUI-based options that organizations often consider as dbt alternatives.

GUI-based dbt alternatives such as Matillion, Informatica, and Alteryx use drag-and-drop interfaces that simplify development and accelerate onboarding for mixed-skill teams.

Matillion

Matillion is a cloud-based data integration platform that enables teams to design ETL and transformation workflows through a visual, drag-and-drop interface. It is built for ease of use and supports major cloud data warehouses such as Amazon Redshift, Google BigQuery, and Snowflake.

Benefits

User-Friendly Visual Development:
Matillion simplifies pipeline building with a graphical interface, making it accessible for users who prefer low-code or no-code tooling.

Considerations

Limited Flexibility for Complex SQL Modeling:
Matillion’s visual approach can become restrictive for advanced transformation logic or engineering workflows that require version control and modular SQL development.

Informatica

Informatica is an enterprise data integration platform with extensive ETL capabilities, hundreds of connectors, data quality tooling, metadata-driven workflows, and advanced security features. It is built for large and diverse data environments.

Benefits

Enterprise-Scale Data Management:
Informatica supports complex data integration, governance, and quality requirements, making it suitable for organizations with large data volumes and strict compliance needs.

Considerations

High Complexity and Cost:
Informatica’s power comes with a steep learning curve, and its licensing and operational costs can be significant compared to lighter-weight transformation tools.

Alteryx

Alteryx is a visual analytics and data preparation platform that combines data blending, predictive modeling, and spatial analysis in a single GUI-based environment. It is designed for analysts who want to build workflows without writing code and can be deployed on-premises or in the cloud.

Benefits

Powerful GUI Analytics Capabilities:
Alteryx allows users to prepare data, perform advanced analytics, and generate insights in one tool, enabling teams without strong coding skills to automate complex workflows.

Considerations

High Cost and Limited SQL Modeling Flexibility:
Alteryx is one of the more expensive platforms in this category and is less suited for SQL-first transformation teams who need modular modeling and version control.

Azure Data Factory (ADF)

Azure Data Factory (ADF) is a fully managed, serverless data integration service that provides a visual interface for building ETL and ELT pipelines. It integrates natively with Azure storage, compute, and analytics services, allowing teams to orchestrate and monitor pipelines without writing code.

Benefits

Strong Integration for Microsoft-Centric Teams:
ADF connects seamlessly with other Azure services and supports a pay-as-you-go model, making it ideal for organizations already invested in the Microsoft ecosystem.

Considerations

Limited Transformation Flexibility:
ADF excels at data movement and orchestration but offers limited capabilities for complex SQL modeling, making it less suitable as a primary transformation engine

Talend

Talend provides an end-to-end data management platform with support for batch and real-time data integration, data quality, governance, and metadata management. Talend Data Fabric combines these capabilities into a single low-code environment that can run in cloud, hybrid, or on-premises deployments.

Benefits

Comprehensive Data Quality and Governance:
Talend includes built-in tools for data cleansing, validation, and stewardship, helping organizations improve the reliability of their data assets.

Considerations

Broad Platform, Higher Operational Complexity:
Talend’s wide feature set can introduce complexity, and teams may need dedicated expertise to manage the platform effectively.

SSIS (SQL Server Integration Services)

SQL Server Integration Services is part of the Microsoft SQL Server ecosystem and provides data integration and transformation workflows. It supports extracting, transforming, and loading data from a wide range of sources, and offers graphical tools and wizards for designing ETL pipelines.

Benefits

Strong Fit for SQL Server-Centric Teams:
SSIS integrates deeply with SQL Server and other Microsoft products, making it a natural choice for organizations with a Microsoft-first architecture.

Considerations

Not Designed for Modern Cloud Data Warehouses:
SSIS is optimized for on-premises SQL Server environments and is less suitable for cloud-native architectures or modern ELT workflows.

Why These dbt Alternatives Exist: The Full Context

Recent consolidation, including Fivetran acquiring SQLMesh and merging with dbt Labs, has increased concerns about vendor lock-in and pushed organizations to evaluate more flexible transformation platforms.

Organizations explore dbt alternatives when dbt no longer meets their architectural, security, or workflow needs. As teams scale, they often require stronger orchestration, consistent development environments, mixed SQL and Python workflows, and private deployment options that dbt Cloud does not provide.

Some teams prefer code-first engines for deeper CI/CD integration, automated testing, and strong guardrails across developers. Others choose GUI-based tools for faster onboarding or broader integration capabilities. Recent market consolidation, including Fivetran acquiring SQLMesh and merging with dbt Labs, has also increased concerns about vendor lock-in.

These factors lead many organizations to evaluate tools that better align with their governance requirements, engineering preferences, and long-term strategy.

Should You DIY a dbt Data Platform?

DIY dbt Core offers full control but requires significant engineering work to manage orchestration, CI/CD, security, and long-term platform maintenance.

Running dbt Core yourself can seem attractive because it offers full control and avoids platform subscription costs. However, building a stable, secure, and scalable dbt environment requires significantly more than executing dbt build on a server. It involves managing orchestration, CI/CD, and ensuring development environment consistency along with long-term platform maintenance, all of which require mature DataOps practices.

The true question for most organizations is not whether they can run dbt Core themselves, but whether it is the best use of engineering time. This is essentially a question of whether to build vs buy your data platform. DIY dbt platforms often start simple and gradually accumulate technical debt as teams grow, pipelines expand, and governance requirements increase.

When DIY Makes Sense

The team has strong platform engineering expertise
Pipelines are relatively simple
Security and compliance needs are minimal
The organization prefers to own and operate every part of the stack

When DIY Becomes a Liability

Multiple analytics engineers need consistent development environments
Governance, auditing, or private deployment become required
Pipelines need enterprise-grade orchestration
Upgrades and maintenance begin consuming valuable engineering time

For many organizations, DIY works in the early stages but becomes difficult to sustain as the platform matures.

How to Choose the Right dbt Alternative

The right dbt alternative depends on your team’s skills, governance requirements, pipeline complexity, and long-term data platform strategy.

Selecting the right dbt alternative depends on your team’s skills, security requirements, and long-term data platform strategy. Each category of tools solves different problems, so it is important to evaluate your priorities before committing to a solution.

1. Team Skills and Workflow Preferences

SQL-first teams: Tools like dbt and Dataform work well for analysts and analytics engineers.
Engineering-first teams: SQLMesh, and AWS Glue offer deeper CI integration, testing, and Python support.‍
Mixed-skill teams: GUI tools like Matillion, Informatica, and Alteryx provide visual development.

2. Governance and Security Requirements

Need for private cloud or VPC deployment
Centralized identity and access management
Audit logging and compliance standards
Ability to control data movement and network boundaries

If these are priorities, a platform with secure deployment options or multi-engine support may be a better fit than dbt Cloud.

3. Complexity of Pipelines

Simple pipelines may work with lightweight tools
Complex, multi-team pipelines benefit from strong orchestration, consistent environments, and guardrails
SQL-only tools may fall short when pipelines require Python-based logic or mixed-language workflows

4. Integration and Ecosystem Compatibility

Choose a tool that integrates cleanly with your cloud environment and data warehouse
Engineering-forward teams may prioritize CI/CD and Git workflows
Analytics-focused or traditional Data Engineering teams may value GUI tools

5. Vendor Lock-In and Long-Term Flexibility

Recent consolidation in the ecosystem has raised concerns about vendor dependency. Organizations that want long-term flexibility often look for:

Multi-engine support
Open-source components
Tooling that can be run in their cloud environment

6. Total Cost of Ownership

Consider platform fees, engineering maintenance, onboarding time, and the cost of additional supporting tools such as orchestrators, IDEs, and environment management

Team Profile	Pipeline Complexity	Recommended dbt Alternative Category
Small team with limited platform engineering capacity	Simple pipelines	dbt Cloud, Datacoves SaaS, or GUI tools (Alteryx)
SQL-first analytics team	Simple to moderate transformations	dbt Cloud, Dataform, Bruin Data, or Datacoves SaaS for standardized SQL development
Mixed-skill team with analysts and engineers	Moderate complexity with collaboration needs	GUI ETL tools (Matillion, Data Factory) and a code-based SQL/Python tool for advanced modeling
Highly regulated or security-focused organization	Moderate to high complexity	dbt Cloud Alternatives with Private Cloud/VPC deployment (Datacoves)
Engineering-first data platform team	Complex, multi-step pipelines	Code-based ETL tools with CI/CD (SQLMesh, Bruin Data, or AWS Glue) or Datacoves for integrated orchestration and multi-engine support

Final dbt alternative Recommendation

dbt remains a strong choice for SQL-based transformations, but it is not the only option. As organizations scale, they often need stronger orchestration, consistent development environments, Python support, and private deployment capabilities that dbt Cloud or DIY dbt Core may not provide. Evaluating alternatives helps ensure that your transformation layer aligns with your long-term platform and governance strategy.

Code-first tools like SQLMesh, Bruin Data, and Dataform offer strong engineering workflows, while GUI-based tools such as Matillion, Informatica, and Alteryx support faster onboarding for mixed-skill teams. The right choice depends on the complexity of your pipelines, your team’s technical profile, and the level of security and control your organization requires.

Datacoves provides a flexible, secure alternative that supports dbt, SQLMesh, and Bruin in a unified environment. With private cloud or VPC deployment, managed Airflow, and a standardized development experience, Datacoves helps teams avoid vendor lock-in while gaining an enterprise-ready platform for analytics engineering.

Selecting the right dbt alternative is ultimately about aligning your transformation approach with your data architecture, governance needs, and long-term strategy. Taking the time to assess these factors will help ensure your platform remains scalable, secure, and flexible for your future needs.

‍

Dbt & Airflow

dbt Deployment Options

5 mins read

dbt is wildly popular and has become a fundamental part of many data stacks. While it’s easy to spin up a project and get things running on a local machine, taking the next step and deploying dbt to production isn’t quite as simple.

In this article we will discuss options for deploying dbt to production, comparing some high, medium, and low effort options so that you can find which works best for your business and team. You might be deploying dbt using one of these patterns already; if you are, hopefully this guide will help highlight some improvements you can make to your existing deployment process.

We're going to assume you know how to run dbt on your own computer (aka your local dbt setup). We’re also going to assume that you either want to or need to run dbt in a “production” environment – a place where other tools and systems make use of the data models that dbt creates in your warehouse.

Enhancing understanding of dbt deployment

The deployment process for dbt jobs extends beyond basic scheduling and involves a multifaceted approach. This includes establishing various dbt environments with distinct roles and requirements, ensuring the reliability and scalability of these environments, integrating dbt with other tools in the (EL)T stack, and implementing effective scheduling strategies for dbt tasks. By focusing on these aspects, a comprehensive and robust dbt deployment strategy can be developed. This strategy will not only address current data processing needs but also adapt to future challenges and changes in your data landscape, ensuring long-term success and reliability.

dbt environments

In deploying dbt you have the creation and management of certain dbt environments. The development environment is the initial testing ground for creating and refining dbt models. It allows for experimentation without impacting production data. Following this, the testing environment, including stages like UAT and regression testing, rigorously evaluates the models for accuracy and performance. Finally, the production environment is where these models are executed on actual data, demanding high stability and performance.

Data reliability and scalability

Reliability and scalability of data models are also important. Ensuring that the data models produce accurate and consistent results is essential for maintaining trust in your data. As your data grows, the dbt deployment should be capable of scaling, handling increased volumes, and maintaining performance.

End-to-End data pipeline integration

Integration with other data tools and systems is another key aspect. A seamless integration of dbt with EL tools, data visualization platforms, and data warehouses ensures efficient data flow and processing, making dbt a harmonious component of your broader data stack.

dbt job scheduling

Effective dbt scheduling goes beyond mere time-based scheduling. It involves context-aware execution, such as triggering jobs based on data availability or other external events. Managing dependencies within your data models is critical to ensure that transformations occur in the correct sequence. Additionally, adapting to varying data loads is necessary to scale resources effectively and maintain the efficiency of dbt job executions.

The main flavors for dbt deployment are:

Cron Job
Cloud Runner Service (like dbt Cloud)
Integrated Platform (like Datacoves)
Fully Custom (like Airflow, Astronomer, etc)

They each have their place, and the trade-offs between setup costs and long-term maintainability is important to consider when you’re choosing one versus another.

We can compare these dbt deployment options across the following criteria:

Ease of Use / Implementation
Required Technical Ability
Configurability
Customization
Best for End-to End Deployment

Cron job

Cron jobs are scripts that run at a set schedule. They can be defined in any language. For instance, we can use a simple bash script to run dbt. It’s just like running the CLI commands, but instead of you running them by hand, a computer process would do it for you.

Here’s a simple cron script:

In order to run on schedule, you’ll need to add this file to your system’s crontab.

As you can tell, this is a very basic dbt run script; we are doing the bare minimum to run the project. There is no consideration for tagged models, test, alerting, or more advanced checks.

Even though Cron jobs are the most basic way to deploy dbt there is still a learning curve. It requires some technical skills to set up this deployment. Additionally, because of its simplicity, it is pretty limited. If you are thinking of using crons for multi-step deployments, you might want to look elsewhere.

While it's relatively easy to set up a cron job to run on your laptop this defeats the purpose of using a cron altogether. Crons will only run when the daemon is running, so unless you plan on never turning off your laptop, you’ll want to set up the cron on an EC2 instance (or another server). Now you have infrastructure to support and added complexity to keep in mind when making changes. Running a cron on an EC2 instance is certainly doable, but likely not the best use of resources. Just because it can be done does not mean it should be done. At this point, you’re better off using a different deployment method.

The biggest downside, however, is that your cron script must handle any edge cases or errors gracefully. If it doesn’t, you might wind up with silent failures – a data engineer’s worst enemy.

Who should use Cron for dbt deployment?

Cron jobs might serve you well if you have some running servers you can use, have a strong handle on the types of problems your dbt runs and cron executions might run into, and you can get away with a simple deployment with limited dbt steps. It is also a solid choice if you are running a small side-project where missed deployments are probably not a big deal.

Use crons for anything more complex, and you might be setting yourself up for future headaches.

Ease of Use / Implementation – You need to know what you’re doing

Required Technical Ability – Medium/ High

Configurability – High, but with the added complexity of managing more complex code

Customization – High, but with a lot of overhead. Best to keep things very simple

Best for End-to-End Deployment - Low.

Cloud Service Runners (like dbt Cloud)

Cloud Service Runners like dbt Cloud are probably the most obvious way to deploy your dbt project without writing code for those deployments, but they are not perfect.

dbt Cloud is a product from dbt Labs, the creators of dbt. The platform has some out-of-the-box integrations, such as Github Actions and Webhooks, but anything more will have to be managed by your team. While there is an IDE (Integrated Developer Experience) that allows the user to write new dbt models, you are adding a layer of complexity by orchestrating your deployments in another tool. If you are only orchestrating dbt runs, dbt Cloud is a reasonable choice – it's designed for just that.

However, when you want to orchestrate more than just your dbt runs – for instance, kickoff multiple Extract / Load (EL) processes or trigger other jobs after dbt completes – you will need to look elsewhere.

dbt Cloud will host your project documentation and provide access to its APIs. But that is the lion’s share of the offering. Unless you spring for the Enterprise Tier, you will not be able to do custom deployments or trigger dbt runs based on incoming data with ease.

Deploying your dbt project with dbt Cloud is straightforward, though. And that is its best feature. All deployment commands use native dbt command line syntax, and you can create various "Jobs" through their UI to run specific models at different cadences.

Who should use dbt Cloud for dbt deployment?

If you are a data team with data pipelines that are not too complex and you are looking to handle dbt deployments without the need for standing up infrastructure or stringing together advanced deployment logic, then dbt Cloud will work for you. If you are interested in more complex triggers to kickoff your dbt runs - for instance, triggering a run immediately after your data is loaded – there are other options which natively support patterns like that. The most important factor is the complexity of the pieces you need to coordinate, not necessarily the size of your team or organization.

Overall, it is a great choice if you’re okay working within its limitations and support a simple workflow. As soon as you reach any scale, however, the cost may be too high.

Ease of Use / Implementation – Very easy

Required Technical Ability – Low

Configurability – Low / Medium

Customization – Low

Best for End-to-End Deployment - Low

Integrated platform (like Datacoves)

The Modern Data Stack is a composite of tools. Unfortunately, many of those tools are disconnected because they specialize in handling one of the steps in the ELT process. Only after working with them do you realize that there are implicit dependencies between these tools. Tools like Datacoves bridge the gaps between the tools in the Modern Data Stack and enable some more flexible dbt deployment patterns. Additionally, they cover the End-to-End solution, from Extraction to Visualization, meaning it can handle steps before and after Transformation.

Coordinating dbt runs with EL processes

If you are loading your data into Snowflake with Fivetran or Airbyte, your dbt runs need to be coordinated with those EL processes. Often, this is done by manually setting the ETL schedule and then defining your dbt run schedule to coincide with your ETL completion. It is not a hard dependency, though. If you’re processing a spike in data or running a historical re-sync, your ETL pipeline might take significantly longer than usual. Your normal dbt run won’t play nicely with this extended ETL process, and you’ll wind up using Snowflake credits for nothing.

This is a common issue for companies moving from early stage / MVP data warehouses into more advanced patterns. There are ways to connect your EL processes and dbt deployments with code, but Datacoves makes it much easier. Datacoves will trigger the right dbt job immediately after the load is complete. No need to engineer a solution yourself. The value of the Modern Data Stack is being able to mix and match tools that are fit for purpose.

Seamless integration and orchestration

Meeting established data freshness and quality SLAs is challenging enough, but with Datacoves, you’re able to skip building custom solutions for these problems. Every piece of your data stack is integrated and working together. If you are orchestrating with Airflow, then you’re likely running a Docker container which may or may not have added dependencies. That’s one common challenge teams managing their own instances of Airflow will meet, but with Datacoves, container / image management and synchronization between EL and dbt executions are all handled on the platform. The setup and maintenance of the scalable Kubernetes infrastructure necessary to run Airflow is handled entirely by the Datacoves platform, which gives you flexibility but with a lower learning curve. And, it goes without saying that this works across multiple environments like development, UAT, and production.

Streamlining the Datacoves experience

With the End-to-End Pipeline in mind, one of the convenient features is that Datacoves provides a singular place to access all the tools within your normal analytics workflow - extraction + load, transformation, orchestration, and security controls are in a single place. The implicit dependencies are now codified; it is clear how a change to your dbt model will flow through to the various pieces downstream.

Who should use Datacoves for dbt deployment?

Datacoves is for teams who want to introduce a mature analytics workflow without the weight of adopting and integrating a new suite of tools on their own. This might mean you are a small team at a young company, or an established analytics team at an enterprise looking to simplify and reduce platform complexity and costs.

There are some prerequisites, though. To make use of Datacoves, you do need to write some code, but you’ll likely already be used to writing configuration files and dbt models that Datacoves expects. You won't be starting from scratch because best practices, accelerators, and expertise are already provided.

Ease of Use / Implementation – You can utilize YAML to generate DAGS for a simpler approach, but you also have the option to use Python DAGS for added flexibility and complexity in your pipelines.

Required Technical Ability – Medium

Configurability – High

Customization – High. Datacoves is modular, allowing you to embed the tools you already use

Best for End-to-End Deployment - High. Datacoves takes into account all of the factors of dbt Deployment

Fully custom (like Airflow, Dagster, etc)

What do you use to deploy your dbt project when you have a large, complex set of models and dependencies? An orchestrator like Airflow is a popular choice, with many companies opting to use managed deployments through services such as Astronomer.

For many companies – especially in the enterprise – this is familiar territory. Adoption of these orchestrators is widespread. The tools are stable, but they are not without some downsides.

These orchestrators require a lot of setup and maintenance. If you’re not using a managed service, you’ll need to deploy the orchestrator yourself, and handle the upkeep of the infrastructure running your orchestrator, not to mention manage the code your orchestrator is executing. It’s no small feat, and a large part of the reason that many large engineering groups have dedicated data engineering and infrastructure teams.

Running your dbt deployment through Airflow or any other orchestrator is the most flexible option you can find, though. The increase in flexibility means more overhead in terms of setting up the systems you need to run and maintain this architecture. You might need to get DevOps involved, you’ll need to move your dbt project into a Docker image, you’ll want an airtight CI/CD process, and ultimately have well defined SLAs. This typically requires Docker images, container management, and some DevOps work. There can be a steep learning curve, especially if you’re unfamiliar with what’s needed to take an Airflow instance to a stable production release.

There are 3 ways to run Airflow, specifically – deploying on your own, using a managed service, or using an integrated platform like Datacoves. When using a managed service or an integrated platform like Datacoves, you need to consider a few factors:

Airflow use cases

Ownership and contributing teams

Integrations with rest of your stack

Airflow use cases

Airflow is a multi-purpose tool. It’s not just for dbt deployments. Many organizations run complex data engineering pipelines with Airflow, and by design, it is flexible. If your use of Airflow extends well beyond dbt deployments or ELT jobs oriented around your data warehouse, you may be better suited for a dedicated managed service.

Ownership and contributing teams

Similarly, if your organization has numerous teams dedicated to designing, building and maintaining your data infrastructure, you may want to use a dedicated Airflow solution. However, not every organization is able to stand up platform engineering teams or DevOps squads dedicated to the data infrastructure. Regardless of the size of your team, you will need to make sure that your data infrastructure needs do not outmatch your team’s ability to support and maintain that infrastructure.

Integrations with the rest of your stack

Every part of the Modern Data Stack relies on other tools performing their jobs; data pipelines, transformations, data models, BI tools - they are all connected. Using Airflow for your dbt deployment adds another link in the dependency chain. Coordinating dbt deployments via Airflow can always be done through writing additional code, but this is an additional overhead you will need to design, implement, and maintain. With this approach, you begin to require strong software engineering and design principles. Your data models are only as useful as your data is fresh; meeting your required SLAs will require significant cross-tool integration and customization.

Who should use a fully custom setup for dbt deployment?

If you are a small team looking to deploy dbt, there are likely better options. If you are a growing team, there are certainly simpler options with less infrastructure overhead. For Data teams with complex data workflows that combine multiple tools and transformation technologies such as Python, Scala, and dbt, however, Airflow and other orchestrators can be a good choice.

Ease of Use / Implementation – Can be quite challenging starting from scratch

Required Technical Ability – High

Configurability – High

Customization – High, but build time and maintenance costs can be prohibitive ‍

Best for End-to-End Deployment - High, but requires a lot of resources to set up and maintain

dbt Deployment Options Overview

	Cron Job	Cloud Runner	Integrated Platform	Fully Custom Deployment
Ease of Use/ Implementation	Need to know what you’re doing and only good for simple use cases.	Very easy to start up	Simple using YAML with ability to build more complex pipelines with Python.	Challenging for most
Required Technical Ability	Medium/ High	Low	Medium	High
Configurability	High	Low / Medium	High	High
Customization	High	Low	High	High
End-to-End Deployment	Low	Low	High	High

Final thoughts

The way you should deploy your dbt project depends on a handful of factors – how much time you’re willing to invest up front, your level of technical expertise, as well as how much configuration and customization you need.

Small teams might have high technical acumen but not enough capacity to manage a deployment on their own. Enterprise teams might have enough resources but maintain disparate, interdependent projects for analytics. Thankfully, there are several options to move your project beyond your local and into a production environment with ease. And while specific tools like Airflow have their own pros and cons, it’s becoming increasingly important to evaluate your data stack vendor solution holistically. Ultimately, there are many ways to deploy dbt to production, and the decision comes down to spending time building a robust deployment pipeline or spending more time focusing on analytics.

Dbt & Airflow

Ultimate dbt Jinja Functions Cheat Sheet

5 mins read

Jinja is the game changing feature of dbt Core that allows us to create dynamic SQL code. In addition to the standard Jinja library, dbt Core includes additional functions and variables to make working with dbt even more powerful out of the box.

See our original post, The Ultimate dbt Jinja Cheat Sheet, to get started with Jinja fundamentals like syntax, variable assignment, looping and more. Then dive into the information below which covers Jinja additions added by dbt Core.

This cheatsheet references the extra functions, macros, filters, variables and context methods that are specific to dbt Core.

Enjoy!

dbt Core: Functions

These pre-defined dbt Jinja functions are essential to the dbt workflow by allowing you to reference models and sources, write documentation, print, write to the log and more.

Functions
ref(project_or_package, model_name, version=latest_version)	This function allows you to reference models within or accross dbt projects. The version and project_or_package parameters are optional. `select * from {{ ref('model_name') }} select * from {{ ref('model_name', version=1) }} select * from {{ ref('project_or_package', 'model_name') }}`
source(source_name, table_name )	This function allows you to create a dependency between the current model and a source. `select * from {{ source('source_name', 'table_name') }}`
log(msg, info=False)	This function writes a message to the log and/or stdout. To write to the both the log file and stdout set the info arg to True. If set to False, the function will only write to the log file. `{{ log('Logging something', info=True) }}`
print("string to print")	Use this function when you wish to print messages to the log and stdout. This is the same as log() with info=True. `{{ print("Print something: " ~ value1 ~ ", " ~ value2) }}`
doc('doc_block_name')	The doc function allows you reference {% docs blocks %} from markdown (.md) files in the description field of property files(.yml). It is important to note that the doc() function is how you reference the contents of a {% docs block %}. `{% docs table_sales_description %} # docs - go - here {% enddocs %} #schema.yml version: 2 models: - name: sales description: '{{ doc("table_sales_description") }}'` The {% docs block %} can be used to override preset overviews.
return(data)	Use this function to return data to the caller. This function will preserve the data type: dict, list, int etc. `{% macro get_list() %} {{ return(['column_1', 'column_2', 'column_2']) }} {% endmacro %}` `select -- getdata() returns a list! {% for item in get_list() %} {{ item }} {% if not loop.last %},{% endif %} {% endfor %} from {{ ref('some_model') }}`
env_var('env_var_name', 'default_value')	env_var is used to incorporate environment variables anywhere you can use Jinja code. `"{{ env_var('DB_PORT') \| as_number }}"` Use the optional second argument to set a default and avoid compilation errors. `#dbt_project.yml models: my_project: +materialized: "{{ env_var('DBT_MATERIALIZATION', 'table') }}"`
var('variable_name')	Use when configuring packages in multiple environments or defining values across multiple models in a package. For variables that rarely change, define them in the dbt_project.yml file. Variables that change frequently can be defined through the CLI. `#dbt_project.yml vars: start_date: '2023-10-01'` `select * from orders where start_date >'{{ var("start_date") }}'`

dbt Core: Macros

These macros are provided in dbt Core to improve the dbt workflow.

Macros
run_query()	This macro allows you to run SQL queries and store their results. It returns a table object with the query result. `{% set results = run_query('select order from orders_table') %}`
debug()	Call this macro within a macro to open up a python debugger. The DBT_MACRO_DEBUGGING environment variable must be set. The debug macro is only available in dbt Core/Datacoves. Only use {{ debug }} in a development environment. `{% macro my_macro() %} {% set result = my_macro() %} {{ debug() }} {% endmacro %}`

dbt Core: Filters

These dbt Jinja filters are used to modify data types.

Filters
as_bool	This Jinja filter will coerce the output into a boolean value. If the conversion can’t happen then it will return an error. `{{ <my_value> \| as_bool }}`
as_number	This Jinja filter will coerce the output into an numeric value. If the conversion can’t happen then it will return an error. `{{ <my_value> \| as_number }}`
as_native	This Jinja filter will return the native python type (set, list, tuple, dict, etc). Best for iterables, use as_bool or as_number for boolean or numeric values. `{{ <my_value> \| as_native }}`

dbt Core: Project context variables

These dbt core "variables" such as config, target, source, and others contain values that provide contextual information about your dbt project and configuration settings. They are typically used for accessing and customizing the behavior of your dbt models based on project-specific and environment-specific information.

Project Context Variables
adapters	dbt uses the adapter to communicate with the database. Setup correctly using the Database specific adapter i.e Snowflake, RedShift. The adapter has methods that will be translated into SQL statements specific to your database.
builtins	The builtins variable is a dictionary containing keys for dbt context methods ref, source, and config. `{% macro ref(modelname) %} {% set db_name = builtins.ref(modelname).database \| lower %}} {% if db_name.startswith('staging') or db_name.endswith('staging') %} {{ return(builtins.ref(modelname).include(database=false)) }} {% else %} {{ return(builtins.ref(modelname)) }} {% endif %} {% endmacro %}`
config	The config variable allows you to get configuration set for dbt models and set constraints that assure certain configurations must exist for a given model: config.get('<config_key>', default='value'): Fetches a configuration named <config_key> for example "materialization". - If no default is provided or the configuration item is not defined, it this will return None `{%- set unique_key = config.get('unique_key', default='id') -%}` config.require("<config_key>"): Strictly requires a key named <config_key> is defined in the configuration. - Throws error if not set. `{%- set unique_key = config.require('unique_key') -%}` config.get and config.require are commonly seen in the context of custom materializations, however, they can be used in other macros, if those macros are used within a model, seed, or snapshot context and they have the relevant configurations set.
dbt_version	The dbt_version variable is helpful for debugging by returning the installed dbt version. This is the dbt version running not what you define in your project. e.g. If you make a project with dbt 1.3 and run it on another machine with dbt 1.6, this will say 1.6
execute	The execute variable is set to True when dbt runs SQL against the databse such as when executing dbt run. When running dbt commands such as dbt compile where dbt parses your project but no SQL is run against the database execute is set to False. This variable is helpful when your Jinja is relying on a result from the database. Wrap the jinja in an if statement. `{% set payment_method_query %} SELECT DISTINCT payment_method FROM {{ ref('raw_payments') }} ORDER BY 1 {% endset %} {% set payment_methods = run_query(payment_method_query) %} {% if execute %} {# Extract the payment methods from the query results #} {% set payment_methods_list = payment_methods.columns[0].values() %} {% else %} {% set payment_methods_list = [] %} {% endif %}`
flags	This variable holds the value of the flags passed in the command line such as FULL_REFRESH, STORE_FAILURES, and WHICH(compile, run, build, run-operation, test, and show). It allows you to set logic based on run modes and run hooks based on current commands/type. `{% if flags.STORE_FAILURES %} --your logic {% else %} --other logic {% endif %}`
graph	The graph variable is dictionary that hold the information about the nodes(Models, Sources, Tests, Snapshots) in your dbt project.
model	This is the graph object of the current model. This object allows you to access the contents of a model, model structure and JSON schema, config settings, and the path to the model.
modules	This variable contains Python Modules for working with data including:datetime, pytz, re, and itertools. `modules.<desired_module>.<desired_function>` `{% set now = modules.datetime.datetime.now() %}`
project_name	This variable returns the name for the root-level project that is being run.
target	This variable contains information about your dbt profile target, such as your warehouse connection information. Use the dot notation to access more such as: target.name or target.schema or target.database, ect

dbt Core: Run context variables

These special variables provide information about the current context in which your dbt code is running, such as the model, schema, or project name.

Run Context Variablels
database_schemas	Only available in the context for on-run-end. This variable allows you to reference the databases and schemas. Useful if using multiple different databases
invocation_id	This function outputs a UUID every time you run or compile your dbt project. It is useful for auditing. You may access it in the query-comment, info dictionary in events and logs, and in the metadata dictionary in dbt artifacts.
results	Only available in the context for on-run-end. This variable contains a list of Results Objects. Allows access to the information populated in run results JSON artifact.
run_started_at	This variable outputs a timestamp for the start time of a dbt run and defaults to UTC. It Is a python datetime object. Use standard strftime formatting. `select '{{ run_started_at.strftime("%Y-%m-%d") }}' as run_start_utc`
schemas	Only available in the context for on-run-end. This variable allows you to reference a list of schemas for models built during a dbt run. Useful for granting privileges.
selected_resources	This variable allows you to access a list of selected nodes from the current dbt command. The items in the list depend on the parameters of —select, —exclude,—selector.
this	{{ this }} is the database representation of the current model. Use the dot notation to access more properties such as: {{ this.database }} and {{ this.schema }}.
thread_id	The thread_id is a unique identifier assigned to the Python threads that are actively executing tasks, such as running code nodes in dbt. It typically takes the form of names like "Thread-1," "Thread-2," and so on, distinguishing different threads.

dbt Core: Context methods

These methods allow you to retrieve information about models, columns, or other elements of your project.

Context Methods
set(value, default)	Allows you to use the python method set(), which converts an iterable into a unique set of values. This is NOT the same as the jinja expression set which is used to assign a value to a variable. This will return none if the value entered is not an iterable. In the example below both the python set() method and the jinja set expression are used to remove a duplicate element in the list. `{% set my_list = ['a', 'b', 'b', 'c'] %} {% set my_set = set(my_list) %} {% do print(my_set) %} {# evaluates to {'a','b', 'c'} #}`
set_strict(value)	Same as the set method above however it will raise a TypeError if the entered value is not a valid iterable. `{% set my_set = set_strict(my_list) %}`
exceptions	Is used to raise errors and warnings in a dbt run: raise_compiler_error will raise an error, print out the message. Model will FAIL. `exceptions.raise_compiler_error("Custom Error message")` warn will raise a compiler warning and print out the set method. Model will still PASS. `exceptions.warn("Custom Warning message")`
fromjson(string, default)	Is used to deserialize a JSON string into a python dict or list.
fromyaml(string, default)	Is used to deserialize a YAML string into a python dict or list.
tojson(value, default)	Serializes a Python dict or list to a JSON string.
toyaml(value, default )	Serializes a Python dict or list to a YAML string.
local_md5	This variable locally creates an MD5 hash of the given string. `{%- set hashed_value = local_md5(“String to hash”) -%} select '{{ hashed_value }}' as my_hashed_value`
zip(*args, default)	This method is allows you to combine any number of iterables. `{% set my_list_a = [1, 2] %} {% set my_list_b = ['Data', 'Coves'] %} {% set my_zip = zip(my_list_a, my_list_b) \| list %} {% do print(my_zip) %} {# Result [(1, 'Data'), (2, 'Coves')] #}`
zip_strict(value)	Same as the zip method but will raise a TypeError if one of the given values are not a valid iterable. `{% set my_list_a = 123 %} {% set my_list_b = 'Datacoves' %} {% set my_zip = zip_strict(my_list_a, my_list_b) \| list %} {# This will fail #}`

Please contact us with any errors or suggestions.

Dbt & Airflow

Ultimate dbt-utils Cheat Sheet

5 mins read

The dbt-utils package enhances the dbt experience by offering a suite of utility macros. Designed to tackle common SQL modeling patterns, it streamlines complex operations, allowing users to focus on data transformation rather than the intricacies of SQL. dbt-utils is a must-have tool for dbt aficionados!

dbt-utils Cheat sheet

The dbt-utils package is a gem in the world of data transformations. Let this cheat sheet guide you swiftly through its features, ensuring you get the most out of dbt-utils. Enjoy!

SQL Generators

The SQL generators in the dbt-utils package streamline your modeling tasks. By automating common SQL patterns, they minimize manual coding and guarantee consistent, high-quality queries. Think of it as a handy toolkit for every dbt user's SQL endeavors!

SQL Generators
date_spine(datepart, start_date, end_date) (view source)	This macro provides the SQL needed to create a sequence of dates. It starts with the 'start_date' (if it matches the specified date part) but doesn't go up to the 'end_date'.
deduplicate(relation, partition_by, order_by) (view source)	This macro produces the SQL needed to delete duplicate rows from a model, source, or CTE.
haversine_distance(lat1, lon1, lat2, lon2, unit='mi') (view source)	This macro computes the haversine distance between a pair of x/y coordinates. You can also specify a 'unit' for the distance.
group_by(n) (view source)	This macro constructs a "group by" statement for fields ranging from 1 to N.
star(from, except= [], relation_alias= [], prefix=' ', suffix=' ', quote_identifiers=True) (view source)	This macro creates a list of fields separated by commas from the 'from' argument, but it leaves out any fields mentioned in the 'except' argument. Further customizations are available with the optional arguments.
union_relations(relations, exclude, include, column_override, source_column_name='_dbt_source_relation', where) (view source)	This macro merges the items from the relations argument using 'union all'. It can handle when the columns are in different orders or when some columns are missing in certain Relations. Any unique columns will have 'null' values where they don't exist in some relations. Additionally, a new column, '_dbt_source_relation', is added to show where each record originated.
generate_series(upper_bound) (view source)	This macro creates a cross-database way to produce a list of numbers up to a specified maximum. It generates a SQL result set where the numbers start from 1 and go up to the number you've chosen.
generate_surrogate_key(<field list>) (view source)	This macro creates a cross-database way to produce a hashed surrogate key based on the fields you specify. Note: Older versions of this macro is called 'surrogate_key()'. Both nulls and blank strings were treated identically. If you need to revert to this behavior for reasons like maintaining backward compatibility, you should add the surrogate_key_treat_nulls_as_empty_strings variable to your 'dbt_project.yml'.
safe_add(<field list>) (view source)	This macro cross-database way to sum up fields that can have null values, based on the fields you indicate.
safe_divide(numerator, denominator) (view source)	This macro divides values but gives a null result if the bottom number (denominator) is 0.
safe_subtract(<field list>) (view source)	This macro implements a cross-database to calculate the difference between fields that might have null values, based on the fields you select.
pivot(column, values, alias=True, agg= 'sum', cmp='=', prefix= ' ', suffix= ' ', then_value= 1, else_value=0, quote_identifiers=True) (view source)	This macro transforms data by turning row values into column headers.
unpivot(relation, cast_to= 'varchar', exclude=none, remove=none, field_name=none, value_name=none, quote_identifiers=False) (view source)	This macro converts a table from a wide layout to a long layout, similar to the melt function in pandas. It replaces Boolean values with the strings 'true' or 'false'.
width_bucket(expr, min_value, max_value, num_buckets) (view source)	This macro creates equal-width histograms by dividing the range into uniform intervals. It then determines the bucket number for a given value after evaluating an expression. The output is either an integer representing the bucket number or null if any input is null.

Generic tests

^Within the dbt-utils package lies a set of generic tests, designed to validate your data effortlessly. These tests ensure consistency and quality, checking for common issues without the need to craft custom validation rules. It's data integrity made simple for dbt users.

Generic Tests
equal_rowcount(compare_model, group_by_columns) (view source)	This test compares two models or tables and asserts they have the same number of rows.
fewer_rows_than(compare_model, group_by_columns) (view source)	This test asserts that a specific model or table has fewer rows than a specified threshold or compared to another table.
equality(compare_model, compare_columns, group_by_columns) (view source)	This test compares two models or tables and asserts their equality. You can specify a subset of columns to compare.
expression_is_true(expression, where) (view source)	The test ensures that a given SQL expression holds true for every record, which is particularly helpful in maintaining column integrity. For instance, this can be used to confirm results from simple algebraic operations between columns, validate the length of a column, or assess the truth value of a column. Note: Use the where clause to check a subset of records.
recency(datepart, field, interval, group_by_columns) (view source)	This test ensures the timestamp column in the given model has data that's newer than a specific date range.
at_least_one(group_by_columns) (view source)	This test checks a specified column and asserts it has at least one value.
not_constant(group_by_columns) (view source)	This test makes sure a column doesn't have the same value for every row.
not_empty_string(trim_whitespace=true) (view source)	This test asserts that none of the values are equal to an empty string. The optional argument, trim_whitespace will remove whitespace from the column.
cardinality_equality(field, to) (view source)	This test asserts that a specific column has the same number of unique values as another column in a different table.
not_null_proportion(at_least, at_most=1.0, group_by_columns) (view source)	This test checks that a column has non-empty values within a certain range. If you don't specify an upper limit, it assumes up to 100% (or 1.0) of the values can be non-empty.
not_accepted_values(values) (view source)	This test checks that there aren't any rows with the specified values.
relationships_where(to, field, from_condition, to_condition) (view source)	This test checks the connection between two models, similar to the basic relationship checks. It also allows for filtering out specific rows, like test entries or recent entries, which might have temporary inconsistencies because of data processing limits.
mutually_exclusive_ranges(lower_bound_column, upper_bound_column, partition_by=None, gaps='allowed', zero_length_range_allowed=False) (view source)	By setting the ‘lower_bound_column’ and ‘upper_bound_column’ ranges, This test asserts that the ranges don't overlap with those of other rows.
sequential_values(interval=1, datepart=None, group_by_columns) (view source)	This test checks that a column has values in order, either numbers or dates
unique_combination_of_columns(combination_of_columns, quote_columns=false) (view source)	This test ensures that when certain columns are combined, their values are unique. For instance, while a month or a product alone might repeat, the pairing of a month with a product is always distinct.
accepted_range(min_value, max_value, where, Inclusive=true) (view source)	This test ensures a column's values are within a certain range, set by a minimum and maximum value. You can choose if the range includes or excludes the boundary values. If needed, you can also focus on specific records with a filter. As long as the data type can be compared using > or < signs, you can use this test.
Grouping in tests	Some tests offer an extra group_by_columns option to get more detailed checks. Arg for group_by_column must be in list format. Tests that support this include: equal_rowcount, fewer_rows_than, recency, at_least_one, not_constant, sequential_values, non_null_proportion.

Introspective macros

The introspective macros within the dbt-utils package are a window into your data's metadata. They empower you to dynamically interact with and understand the underlying structure of your datasets. It's like having a magnifying glass for the intricacies of your dbt projects!

Introspective Macros
get_column_values(table, column, where=none, order_by='count(*) desc', max_records=none, default=[]) (view source)	This macro produces an array containing the distinct values of a column from a specified relation. Note: Setting order_by='my_column' will sort alphabetically, while order_by='min(created_at)' will sort by when the value was first observed.
get_filtered_columns_in_relation(from, except=[]) (view source)	This macro provides a Jinja list, containing columns from a specific relation. It's meant for actual relations and not for CTEs.
get_relations_by_pattern(schema_pattern, table_pattern, exclude, database=target.database) (view source)	This macro produces a list of Relations that align with a specified schema or table name pattern. It pairs well with the 'union_relations' macro.
get_relations_by_prefix(schema, prefix, exclude, database= target.database) (view source)	DEPRECATED SOON: Use 'get_relations_by_pattern' instead. This macro gives a list of Relations matching a specific prefix and has an option to exclude certain patterns. It works especially well when combined with 'union_relations'.
get_query_results_as_dict(<sql query>) (view source)	This macro provides a dictionary based on a SQL query, eliminating the need to work with the Agate library to process the outcome.
get_single_value(<sql query>) (view source)	This macro fetches a single value from a SQL query, allowing you to bypass the use of the Agate library when handling the result.

‍

Please contact us with any errors or suggestions.

Dbt & Airflow

dbt vs Airflow: What's the Difference and Why You Need Both

5 mins read

dbt transforms data inside your warehouse using SQL. Airflow orchestrates workflows across your entire data pipeline. They are not competing tools -- they solve different problems. dbt handles the "T" in ELT: modeling, data testing, unit testing, documentation, and lineage. Airflow handles the scheduling, sequencing, retries, failure notifications, observability, and coordination of every step in the pipeline, including ingesting data, triggering dbt, and activation steps. Most production data teams use both. The real decision is not which tool to pick, but how to run them together without drowning in infrastructure overhead.

What is Apache Airflow?

Airflow is a workflow scheduler, but it is only one part of a broader data orchestration problem that includes dependencies, retries, visibility, and ownership across the entire data lifecycle.

How Does Airflow Work?

Imagine a scenario where you have a series of tasks: Task A, Task B, and Task C. These tasks need to be executed in sequence every day at a specific time. Airflow enables you to programmatically define the sequence of steps as well as what each step does. With Airflow you can also monitor the execution of each step and get alerts when something fails.

What Sets Airflow Apart?

Airflow's core strength is flexibility. You define the code behind each task, not just the sequence and schedule. Whether it's triggering an extraction job, calling a Python script, or integrating with an external platform, Airflow lets you control exactly how each step executes.

But that flexibility is a double-edged sword. Just because you can code everything inside Airflow doesn't mean you should. Overly complex task logic inside Airflow makes workflows harder to debug and maintain. The best practice: use Airflow as the orchestrator, not the execution engine. Let specialized tools handle the heavy lifting. Use tools like dbt for transformation and tools like Azure Data Factory, dlt, Fivetran or Airbyte for extraction. Let Airflow handle the scheduling, coordination, observability, and failure notification.

Best Use Cases for Airflow

Data Extraction and Loading: Trigger external tools like Azure Data Factory, dlt, Fivetran, or Airbyte for data ingestion. When no off-the-shelf tool exists, Airflow can call custom Python scripts or frameworks like dlt to handle the load.
Data Transformation Coordination: Trigger dbt Core to run transformation steps in the right sequence with appropriate parallelism, directly in the data warehouse.
ML Workflow Coordination: Orchestrate machine learning pipelines, from data preprocessing to model refreshes, ensuring each step runs after upstream transformations complete.
Automated Reporting: Trigger report generation or data exports to BI and reporting tools on a schedule.
System Maintenance and Monitoring: Schedule backups, manage log storage, and send alerts on system anomalies.

Airflow Benefits

Job Management: Define workflows and task dependencies with a built-in scheduler that handles sequencing and synchronization.
Retry Mechanism: Automatically retry failed tasks or entire pipelines with configurable retry logic, ensuring resilience and fault tolerance.
Alerting: Send failure notifications to email, Slack, MS Teams, or other tools so your team knows immediately when something breaks.
Monitoring: Track workflow status, task durations, and historical runs through the Airflow UI.
Scalability: Deploy on Kubernetes to scale up during peak processing and scale down when idle, keeping infrastructure costs aligned with actual usage.
Community: Active open-source project with robust support on GitHub and Slack, plus a deep library of learning resources.

Airflow Challenges

Python Required: Airflow is Python-based, so creating workflows requires Python proficiency.
Learning Curve: Concepts like DAGs, operators, hooks, and task dependencies take time to master, especially for team members without prior orchestration experience.
Production Deployment: Setting up a production-grade Airflow environment requires knowledge of Kubernetes, networking, and infrastructure tuning.
Ongoing Maintenance: Upgrading Airflow and its underlying infrastructure carries real cost and complexity, particularly for teams that don't manage distributed systems regularly.
Debugging: Identifying root causes in a complex Airflow environment requires experience. Knowing where to start is half the battle.
Cost of Ownership: Airflow is open-source and free to use, but the total cost of deployment, tuning, and ongoing support adds up quickly.
Development Experience: Testing DAGs locally requires setting up Docker-based Airflow environments, which adds friction and slows iteration cycles.

What is dbt?

dbt (data build tool) is an open-source framework that uses SQL and Jinja templating to transform data inside the warehouse. Developed by dbt Labs, dbt handles the "T" in ELT: transforming, testing, documenting, and tracking lineage of data models. It brings software engineering practices like version control, CI/CD, and modular code to analytics workflows.

How Does dbt Work?

dbt lets you write transformation logic as SQL select statements, enhanced with Jinja templating for dynamic execution. Each transformation is called a "model." When you run dbt, it compiles those models into raw SQL and executes them against your warehouse, creating or replacing tables and views. dbt resolves dependencies between models automatically, running them sequentially or in parallel as needed.

What Sets dbt Apart?

Unlike traditional ETL tools that abstract SQL behind drag-and-drop interfaces, dbt treats SQL as the primary language for transformation. This makes it immediately accessible to anyone who already knows SQL. But dbt goes further by adding Jinja templating for dynamic scripting, conditional logic, and reusable macros.

A few things set dbt apart from alternatives:

Idempotency: dbt transformations produce the same result every time they run, making pipelines predictable and repeatable.
Testing built in: Define data quality tests alongside your models. No need for a separate testing tool.
Auto-generated documentation: dbt produces a web-based docs site with descriptions, lineage graphs, and metadata -- all generated directly from your project.
Open metadata: dbt outputs metadata files (artifacts) that downstream tools like data catalogs and observability platforms can consume. This format is quickly becoming the standard for sharing information across data platforms.

dbt excels at the "T" in ELT but requires complementary tools for extraction, loading, and orchestration.

Best Use Cases for dbt

Data Transformation for Analytics: Transform and aggregate raw data into models ready for BI platforms, analytics tools, and ML pipelines.
SQL-based Workflow Creation: Build modular transformation workflows using SQL and Jinja, with reusable macros and dynamic script generation.
Data Validation and Testing: Define schema tests and custom tests alongside your models to catch data quality issues before they reach end users.
SQL Unit Testing: Verify that a specific transformation works as expected by providing sample input data and comparing against expected output.
Documentation and Lineage: Auto-generate a documentation site with descriptions, column-level lineage, and dependency graphs to simplify impact analysis and debugging.
Version Control and DataOps: Manage transformations through Git-based workflows with branching, pull requests, and environment-specific deployments.

dbt Benefits

SQL-based: dbt uses SQL and Jinja for transformation, so most data analysts and engineers can contribute without learning a new language.
Modularity: Each transformation is a SQL select statement that builds on other models. This modular approach simplifies debugging, testing, and reuse.
Testing: Built-in testing framework lets you validate data quality at every layer without introducing additional tools.
Documentation: Auto-generated docs site exposes descriptions, lineage, and metadata, making it easier to maintain projects over time and helping data consumers find and understand data before use.
Lineage and Impact Analysis: Visual dependency graphs simplify debugging when issues occur and help you trace which sources feed a specific data product like an executive dashboard.
Open-source: dbt Core is open-source with no vendor lock-in.
Packages and Libraries: A large community maintains reusable dbt packages and Python libraries that extend dbt's functionality out of the box.
Open Metadata: dbt produces artifact files that data catalogs and observability tools can consume, making it a de facto standard for sharing metadata across platforms.
AI-Ready Foundation: dbt's structured metadata, lineage, and documentation make your warehouse AI-ready. When dbt artifacts are available in your warehouse (e.g. Snowflake), AI tools can use that context to understand table relationships, column descriptions, and data quality rules, turning your transformation layer into a knowledge layer that LLMs and copilots can reason over. See Snowflake Cortex CLI running in Datacoves.
Community: 100k+ members on Slack, extensive learning resources, and a growing ecosystem of practitioners solving similar data problems.

dbt Challenges

Transformation only: dbt handles the "T" in ELT. Extraction, loading, orchestration, and downstream activation all require separate tools.
Macros complexity: dbt macros are powerful but can be hard to read, debug, and test, especially for analysts who are primarily SQL users.
Infrastructure management: Running dbt Core in production requires setting up CI/CD, environment management, secrets handling, and Git workflows. This overhead grows as the team and project scale.
No built-in orchestration: dbt Core has no scheduler. You need an external orchestrator like Airflow to automate runs and coordinate dbt with the rest of your pipeline.
Multi-project complexity: As organizations grow into multiple dbt projects, managing cross-project dependencies, shared models, and consistent conventions becomes a significant challenge.

Common Misconception: dbt Core vs. dbt Cloud

A common misunderstanding in the data community is equating "dbt" with "dbt Cloud." When people say dbt, they typically mean dbt Core -- the open-source framework. dbt Cloud is a commercial product from dbt Labs built on top of dbt Core that adds a scheduler, hosted IDE, monitoring, and managed infrastructure.

The key distinction: you can use dbt Core without paying for dbt Cloud, but you won't get a scheduler, hosted environment, or managed CI/CD. That means you'll need to solve orchestration, deployment, and infrastructure yourself. This is typically done with a tool like Airflow and a platform to tie it all together.

For a deeper comparison, see our article on dbt Cloud vs. dbt Core.

dbt Cloud's Scheduler: What It Does and Doesn't Cover

dbt Cloud includes a scheduler that automates dbt runs on a defined cadence. This keeps your transformation models fresh and your data reliable. But it only schedules dbt jobs, nothing else.

Your Extract and Load steps, downstream processes like BI refreshes, ML model retraining, marketing automation, and any cross-system coordination still need an orchestrator. dbt Cloud's scheduler does not replace Airflow. It replaces the need to cron-schedule dbt runs manually, but leaves the rest of the pipeline unmanaged.

Do I need Airflow if I use dbt Cloud?

Yes, unless your entire pipeline is just dbt transformations with nothing before or after them.

dbt Cloud schedules and runs your dbt jobs. That covers transformation, testing, and documentation. But most production pipelines involve more than that: extracting data from source systems, loading it into the warehouse, refreshing BI dashboards, triggering ML model retraining, sending alerts, and coordinating dependencies across all of these steps.

dbt Cloud does not orchestrate any of those. If your pipeline looks like this:

Extract → Load → Transform → Activate

Then dbt Cloud only covers one step. You still need an orchestrator like Airflow to tie the full pipeline together, manage dependencies between stages, handle retries, and provide end-to-end visibility when something breaks.

What dbt cloud's scheduler actually covers

For small teams with a simple pipeline (one source, one warehouse, one dashboard), dbt Cloud's scheduler may be enough to start. But as soon as you add sources, downstream consumers, or cross-system dependencies, you'll need Airflow or a platform that includes it.

Managed dbt Core + Managed Airflow: Removing the Infrastructure Burden

The hidden cost of choosing open-source dbt Core and Airflow is the platform you have to build around them. CI/CD pipelines, secrets management, environment provisioning, Git workflows, Kubernetes tuning, Airflow upgrades -- this is real engineering work that pulls your team away from delivering data products.

Running dbt + Airflow: Build it yourself vs Managed platform

Datacoves eliminates that overhead. It provides managed dbt Core and managed Airflow in a unified environment deployed in your private cloud. Developers get an in-browser VS Code editor with dbt, Python, extensions, and Git pre-configured. Orchestration is handled through Datacoves' managed Airflow, which includes a simplified YAML-based DAG configuration that connects Extract, Load, and Transform steps without writing boilerplate Python.

Best practices, governance guardrails, and accelerators are built in, so teams get a production-grade data platform in minutes instead of months. For a deeper look at the tradeoffs, see our guide on building vs. buying your analytics platform.

To learn more about Datacoves, check out our product page.

dbt vs Airflow: Which One Should You Choose?

This is not an either-or decision. dbt and Airflow solve different problems, and trying to force one tool to do the other's job creates fragile, hard-to-maintain pipelines.

Use dbt for transformation, testing, documentation, and lineage. Use Airflow for scheduling, dependency management, retries, and end-to-end pipeline coordination. The comparison table below shows where each tool fits.

	Apache Airflow	dbt Core
Recommended Usage	Data workflow orchestration	Data transformation, documentation, data quality, data lineage
Primary Strength	Orchestration of data pipelines	Data modeling and transformation
Development Language	Python or YAML using DagFactory or dbt-coves	SQL and Python in some warehouses
Ease of Use	Python required. Simpler with a managed platform, complex if self-hosted.	SQL-based, low learning curve. Infrastructure management grows harder at scale.
Pricing	Open-source. Real cost is deployment and maintenance. Managed options: Astronomer, Datacoves.	Open-source. Real cost is CI/CD, environments, and orchestration. Managed options: dbt Cloud, Datacoves.

Which Tool Should You Adopt First?

If you need to pick one tool to start with, here's the decision framework:

Start with dbt if your immediate priority is transforming raw data into reliable, tested models in the warehouse. dbt gives you version control, testing, documentation, and lineage from day one. For small teams with simple pipelines, a cron job or dbt Cloud's scheduler can handle automation in the short term.

Start with Airflow if you already have multiple systems that need coordination; extraction from SaaS sources, loading into the warehouse, transformation, and downstream activation. Airflow gives you end-to-end visibility and control across the full pipeline.

The reality: most teams outgrow a single-tool approach quickly. Once your pipeline spans more than one step, you need both. The question isn't which tool to pick, it's how soon you'll need the other one. If you're still evaluating transformation tools, check out our dbt alternatives comparison.

How dbt and Airflow Work Together

In a production pipeline, the tools fit together like this:

Airflow triggers extraction and loading, kicking off tools like dlt, Airbyte, Fivetran, or Azure Data Factory to pull data from source systems into the warehouse.
Airflow triggers dbt. Once data lands in the warehouse, Airflow calls dbt to run transformations and tests in the correct sequence.
Airflow handles downstream activation. After transformation, Airflow pushes data to BI platforms, refreshes ML models, triggers marketing automation, or sends alerts.

Who owns what: Airflow vs dbt in the data pipeline

Pipeline Stage	Tool
Extract & Load	Airflow triggers dlt, Airbyte, Fivetran, or ADF
Transform & Test	dbt Core
Orchestrate & Schedule	Airflow
Documentation & Lineage	dbt Core
Downstream Activation	Airflow triggers BI, ML, alerts
Retries & Failure Handling	Airflow

Airflow owns the "when" and "what order." dbt owns the "how" for transformation. Neither tool steps on the other's responsibilities.

The challenge is running them together. Deploying Airflow and dbt Core in a shared, production-grade environment requires Kubernetes, CI/CD, secrets management, environment provisioning, and ongoing maintenance. For teams that don't want to build and maintain that platform layer themselves, a managed solution like Datacoves provides both tools pre-integrated in your private cloud -- so your team focuses on data, not infrastructure.

Conclusion: dbt and Airflow Are Better Together

dbt and Airflow are not competing tools. dbt transforms your data. Airflow orchestrates your pipeline. Trying to use one without the other leaves gaps that show up as fragile pipelines, manual workarounds, and constant firefighting. The teams that struggle most are the ones who pick the right tools but never invest in the platform layer that makes them work together.

The real question is not "dbt vs Airflow," it's how you run them together without building and maintaining a platform from scratch. That infrastructure layer (Kubernetes, CI/CD, secrets, environments, Git workflows, upgrades) is where most teams burn time and budget.

Datacoves gives you managed dbt Core and managed Airflow in a unified environment, deployed in your private cloud, with best practices and governance built in. Your team focuses on delivering data products, not maintaining the platform underneath them.

See how Datacoves works.