Datacoves blog

Learn more about dbt Core, ELT processes, DataOps,
modern data stacks, and team alignment by exploring our blog.

Build vs Buy Analytics Platform: Hosting Open-Source Tools

Not long ago, the data analytics world relied on monolithic infrastructures—tightly coupled systems that were difficult to scale, maintain, and adapt to changing needs. These legacy setups often resulted in operational bottlenecks, delayed insights, and high maintenance costs. To overcome these challenges, the industry shifted toward what was deemed the Modern Data Stack (MDS)—a suite of focused tools optimized for specific stages of the data engineering lifecycle.

This modular approach was revolutionary, allowing organizations to select best-in-class tools like Airflow for Orchestration or a managed version of Airflow from Astronomer or Amazon without the need to build custom solutions. While the MDS improved scalability, reduced complexity, and enhanced flexibility, it also reshaped the build vs. buy decision for analytics platforms. Today, instead of deciding whether to create a component from scratch, data teams face a new question: Should they build the infrastructure to host open-source tools like Apache Airflow and dbt Core, or purchase their managed counterparts? This article focuses on these two components because pipeline orchestration and data transformation lie at the heart of any organization’s data platform.

What does it mean to build vs buy?

Build

When we say build in terms of open-source solutions, we mean building infrastructure to self-host and manage mature open-source tools like Airflow and dbt. These two tools are popular because they have been vetted by thousands of companies! In addition to hosting and managing, engineers must also ensure interoperability of these tools within their stack, handle security, scalability, and reliability. Needless to say, building is a huge undertaking that should not be taken lightly.

Buy

dbt and Airflow both started out as open-source tools, which were freely available to use due to their permissive licensing terms. Over time, cloud-based managed offerings of these tools were launched to simplify the setup and development process. These managed solutions build upon the open-source foundation, incorporating proprietary features like enhanced user interfaces, automation, security integration, and scalability. The goal is to make the tools more convenient and reduce the burden of maintaining infrastructure while lowering overall development costs. In other words, paid versions arose out of the pain points of self-managing the open-source tools.

This begs the important question: Should you self-manage or pay for your open-source analytics tools?

Comparing build vs. buy: Key tradeoffs

As with most things, both options come with trade-offs, and the “right” decision depends on your organization’s needs, resources, and priorities. By understanding the pros and cons of each approach, you can choose the option that aligns with your goals, budget, and long-term vision.

Building In-House

Pros:

Customization: The biggest advantage of building in-house is the flexibility to customize the tool to fit your exact use case. You maintain full control, allowing you to align configurations with your organization’s unique needs. However, with great power comes great responsibility—your team must have a deep understanding of the tools, their options, and best practices.

Control: Owning the entire stack gives your team the ability to integrate deeply with existing systems and workflows, ensuring seamless operation within your ecosystem.

Cost Perception: Without licensing fees, building in-house may initially appear more cost-effective, particularly for smaller-scale deployments.

Cons:

High Upfront Investment: Setting up infrastructure requires a considerable time commitment from developers. Tasks like configuring environments, integrating tools like Git or S3 for Airflow DAG syncing, and debugging can consume weeks of developer hours.

Operational Complexity: Ongoing maintenance—such as managing dependencies, handling upgrades, and ensuring reliability—can be overwhelming, especially as the system grows in complexity.

Skill Gaps: Many teams underestimate the level of expertise needed to manage Kubernetes clusters, Python virtual environments, and secure credential storage systems like AWS Secrets Manager.

Experimentation: Your organization is using the first iteration the team is producing which can lead to unintended consequences, edge cases, and security issues.

Example:

A team building Airflow in-house may spend weeks configuring a Kubernetes-backed deployment, managing Python dependencies, and setting up DAG synchronizing files via S3 or Git. While the outcome can be tailored to their needs, the time and expertise required represent a significant investment.

Building with open-source is not free. Cons Continued

Before moving on to the buy tradeoffs, it is important to set the record straight. You may have noticed that we did not include “the tool is free to use” as one of our pros for building with open-source. You might have guessed by reading the title of this section, but many people incorrectly believe that building their MDS using open-source tools like dbt is free. When in reality there are many factors that contribute to the true dbt pricing and the same is true for Airflow.

How can that be? Well, setting up everything you need and managing infrastructure for Airflow and dbt isn’t necessarily plug and play. There is day-to-day work from managing Python virtual environments, keeping dependencies in check, and tackling scaling challenges which require ongoing expertise and attention. Hiring a team to handle this will be critical particularly as you scale. High salaries and benefits are needed to avoid costly mistakes; this can easily cost anywhere from $5,000 to $26,000+/month depending on the size of your team.

In addition to the cost of salaries, let’s look at other possible hidden costs that come with using open-source tools.

Time and expertise

The time it takes to configure, customize, and maintain a complex open-source solution is often underestimated. It’s not until your team is deep in the weeds—resolving issues, figuring out integrations, and troubleshooting configurations—that the actual costs start to surface. With each passing day your ROI is threatened. You want to start gathering insights from your data as soon as possible. Datacoves helped Johnson and Johnson set up their data stack in weeks and when issues arise, a you will need expertise to accelerate the time to resolution.

And then there’s the learning curve. Not all engineers on your team will be seniors, and turnover is inevitable. New hires will need time to get up to speed before they can contribute effectively. This is the human side of technology: while the tools themselves might move fast, people don’t. That ramp-up period, filled with training and trial-and-error, represents a hidden cost.

Security and compliance

Security and compliance add another layer of complexity. With open-source tools, your team is responsible for implementing best practices—like securely managing sensitive credentials with a solution like AWS Secrets Manager. Unlike managed solutions, these features don’t come prepackaged and need to be integrated with the system.

Compliance is no different. Ensuring your solution meets enterprise governance requirements takes time, research, and careful implementation. It’s a process of iteration and refinement, and every hour spent here is another hidden cost as well as risking security if not done correctly.

Scaling complexities

Scaling open-source tools is where things often get complicated. Beyond everything already mentioned, your team will need to ensure the solution can handle growth. For many organizations, this means deploying on Kubernetes. But with Kubernetes comes steep learning curves and operational challenges. Making sure you always have a knowledgeable engineer available to handle unexpected issues and downtimes can become a challenge. Extended downtime due to this is a hidden cost since business users are impacted as they become reliant on your insights.

Buying a managed solution

A managed solution for Airflow and dbt can solve many of the problems that come with building your own solution from open-source tools such as: hassle-free maintenance, improved UI/UX experience, and integrated functionality. Let’s take a look at the pros.

Pros:

Faster Time to Value: With a managed solution, your team can get up and running quickly without spending weeks—or months—on setup and configuration.

Reduced Operational Overhead: Managed providers handle infrastructure, maintenance, and upgrades, freeing your team to focus on business objectives rather than operational minutiae.

Predictable Costs: Managed solutions typically come with transparent pricing models, which can make budgeting simpler compared to the variable costs of in-house built tooling.

Reliability: Your team is using version 1000+ of a managed solution vs the 1^st version of your self-managed solution. This provides reliability and peace of mind that edge cases have been handled, and security is under wraps.

Cons:

Potentially Less Flexibility: This is the biggest con and reason why many teams choose to build. Managed solutions may not allow for the same level of customization as building in-house, which could limit certain niche use cases. Care must be taken to choose a provider that is built for enterprise level flexibility.

Dependency on a Vendor: Relying on a vendor for your analytics stack introduces some level of risk, such as service disruptions or limited migration paths if you decide to switch providers. Some managed solution providers simply offer the service, but leave it up to your team to “make it work” and troubleshoot. A good provider will have a vested interest in your success, because they can’t afford for you to fail.

Example:

Using a solution like MWAA, teams can leverage managed Airflow eliminating the need for infrastructure worries however additional configuration and development will be needed for teams to leverage it with dbt and to troubleshoot infrastructure issues suck as containers running out of memory.

Datacoves for Airflow and dbt: The buy that feels like a build

For data teams, the allure of a custom-built solution often lies in its promise of complete control and customization. However, building this requires significant time, expertise, and ongoing maintenance. Datacoves bridges the gap between custom-built flexibility and the simplicity of managed services, offering the best of both worlds.

With Datacoves, teams can leverage managed Airflow and pre-configured dbt environments to eliminate the operational burden of infrastructure setup and maintenance. This allows data teams to focus on what truly matters—delivering insights and driving business decisions—without being bogged down by tool management.

Unlike other managed solutions for dbt or Airflow, which often compromise on flexibility for the sake of simplicity, Datacoves retains the adaptability that custom builds are known for. By combining this flexibility with the ease and efficiency of managed services, Datacoves empowers teams to accelerate their analytics workflows while ensuring scalability and control.

Datacoves doesn’t just run the open-source solutions, but through real-world implementations, the platform has been molded to handle enterprise complexity while simplifying project onboarding. With Datacoves, teams don’t have to compromize on features like Datacoves-Mesh (aka dbt-mesh), column level lineage, GenAI, Semantic Layer, etc. Best of all, the company’s goal is to make you successful and remove hosting complexity without introducing vendor lock-in. What Datacove does, you can do yourself if given enough time, experience, and money. Finally, for security concious organizations, Datacoves is the only solution on the market that can be deployed in your private cloud with white-glove enterprise support.

Datacoves isn’t just a platform—it’s a partnership designed to help your data team unlock their potential. With infrastructure taken care of, your team can focus on what they do best: generating actionable insights and maximizing your ROI.

Conclusion

The build vs. buy debate has long been a challenge for data teams, with building offering flexibility at the cost of complexity, and buying sacrificing flexibility for simplicity. As discussed earlier in the article, solutions like dbt and Airflow are powerful, but managing them in-house requires significant time, resources, and expertise. On the other hand, managed offerings like dbt Cloud and MWAA simplify operations but often limit customization and control.

Datacoves bridges this gap, providing a managed platform that delivers the flexibility and control of a custom build without the operational headaches. By eliminating the need to manage infrastructure, scaling, and security. Datacoves enables data teams to focus on what matters most: delivering actionable insights and driving business outcomes.

As highlighted in Fundamentals of Data Engineering, data teams should prioritize extracting value from data rather than managing the tools that support them. Datacoves embodies this principle, making the argument to build obsolete. Why spend weeks—or even months—building when you can have the customization and adaptability of a build with the ease of a buy? Datacoves is not just a solution; it’s a rethinking of how modern data teams operate, helping you achieve your goals faster, with fewer trade-offs.

Dbt & Airflow

dbt Alternatives: 10 Platforms Compared (Transformation Guide)

The top dbt alternatives include Datacoves, SQLMesh, Bruin Data, Dataform, and visual ETL tools such as Alteryx, Matillion, and Informatica. Code-first engines offer stronger rigor, testing, and CI/CD, while GUI platforms emphasize ease of use and rapid prototyping. Teams choose these alternatives when they need more security, governance, or flexibility than dbt Core or dbt Cloud provide.

The top dbt alternatives include Datacoves, SQLMesh, Bruin Data, Dataform, and GUI-based ETL tools such as Alteryx, Matillion, and Informatica.

The top dbt alternatives we will cover are:

dbt Cloud Alternatives
1. Datacoves
2. DIY dbt Core
Code-Based ETL Tools
1. SQLMesh
2. Dataform
3. AWS Glue
4. Bruin Data
Graphical ETL Tools
1. Matillion
2. Informatica
3. Alteryx
4. Azure Data Factory
5. Talend
6. SSIS

Why Teams Look for dbt Alternatives

Teams explore dbt alternatives when they need stronger governance, private deployments, or support for Python and code-first workflows that go beyond SQL. Many also prefer GUI-based ETL tools for faster onboarding. Recent market consolidation, including Fivetran acquiring SQLMesh and merging with dbt Labs, has increased concerns about vendor lock-in, which makes tool neutrality and platform flexibility more important than ever.

Teams look for dbt alternatives when they need stronger orchestration, consistent development environments, Python support, or private cloud deployment options that dbt Cloud does not provide.

Categories of dbt Alternatives

Organizations evaluating dbt alternatives typically compare tools across three categories. Each category reflects a different approach to data transformation, development preferences, and organizational maturity.

Category	Best For	Key Trade-Offs
dbt Cloud Alternatives	Teams that want dbt with stronger security, governance, or private/VPC deployment	Requires aligning the platform with your security, governance, and deployment needs
Code-Based ETL Tools	Engineering-first teams that want CI/CD, testing, Python workflows, and strict modeling guardrails	Have smaller communities and ecosystems compared to mature SQL-based tools like dbt
GUI-Based ETL Tools	Mixed-skill teams that prefer drag-and-drop development and faster onboarding	Less flexible for complex SQL modeling, testing, and version-controlled workflows

dbt Cloud Alternatives

Organizations consider alternatives to dbt Cloud when they need more flexibility, stronger security, or support for development workflows that extend beyond dbt. Teams comparing platform options often begin by evaluating the differences between dbt Cloud vs dbt Core.

Running enterprise-scale ELT pipelines often requires a full orchestration layer, consistent development environments, and private deployment options that dbt Cloud does not provide. Costs can also increase at scale (see our breakdown of dbt pricing considerations), and some organizations prefer to avoid features that are not open source to reduce long-term vendor lock-in.

This category includes platforms that deliver the benefits of dbt Cloud while providing more control, extensibility, and alignment with enterprise data platform requirements.

Datacoves

Datacoves provides a secure, flexible platform that supports dbt, SQLMesh, and Bruin in a unified environment with private cloud or VPC deployment.

Datacoves is an enterprise data platform that serves as a secure, flexible alternative to dbt Cloud. It supports dbt Core, SQLMesh, and Bruin inside a unified development and orchestration environment, and it can be deployed in your private cloud or VPC for full control over data access and governance.

Benefits

Flexibility and Customization:
Datacoves provides a customizable in-browser VS Code IDE, Git workflows, and support for Python libraries and VS Code extensions. Teams can choose the transformation engine that fits their needs without being locked into a single vendor.

Handling Enterprise Complexity:
Datacoves includes managed Airflow for end-to-end orchestration, making it easy to run dbt and Airflow together without maintaining your own infrastructure. It standardizes development environments, manages secrets, and supports multi-team and multi-project workflows without platform drift.

Cost Efficiency:
Datacoves reduces operational overhead by eliminating the need to maintain separate systems for orchestration, environments, CI, logging, and deployment. Its pricing model is predictable and designed for enterprise scalability.

Data Security and Compliance:
Datacoves can be deployed fully inside your VPC or private cloud. This gives organizations complete control over identity, access, logging, network boundaries, and compliance with industry and internal standards.

Reduced Vendor Lock-In:
Datacoves supports dbt, SQLMesh, and Bruin Data, giving teams long-term optionality. This avoids being locked into a single transformation engine or vendor ecosystem.

Capability	Datacoves	dbt Cloud
Supported Transformation Engines	dbt Core, SQLMesh, Bruin Data	dbt only
Deployment Model	SaaS or Private Cloud/VPC deployment	SaaS only
Integrated Orchestration	Built-in Airflow with full DAG control	Built-in dbt scheduler (limited orchestration)
Development Environment	In-browser VS Code with extensions, Python, and dbt	Local VS Code integration and web-based dbt IDE
Environment Consistency	Standardized dev environment across users	Standardized dbt development environment
Security & Compliance	Full control in Private Cloud/VPC; SaaS option available	Depends on dbt Cloud’s SaaS environment
Governance & DevOps	Editable GitHub Actions and full CI/CD control	Standardized CI/CD workflow
Ingestion & Python Workloads	Supports Python development and Airflow-based orchestration for ingestion pipelines	Requires additional ingestion tools or processes and does not support Python development in the IDE

DIY dbt Core

Running dbt Core yourself is a flexible option that gives teams full control over how dbt executes. It is also the most resource-intensive approach. Teams choosing DIY dbt Core must manage orchestration, scheduling, CI, secrets, environment consistency, and long-term platform maintenance on their own.

Benefits

Full Control:
Teams can configure dbt Core exactly as they want and integrate it with internal tools or custom workflows.

Cost Flexibility:
There are no dbt Cloud platform fees, but total cost of ownership often increases as the system grows.

Considerations

High Maintenance Overhead:
Teams must maintain Airflow or another orchestrator, build CI pipelines, manage secrets, and keep development environments consistent across users.

Requires Platform Engineering Skills:
DIY dbt Core works best for teams with strong Kubernetes, CI, Python, and DevOps expertise. Without this expertise, the environment becomes fragile over time.

Slow to Scale:
As more engineers join the team, keeping dbt environments aligned becomes challenging. Onboarding, upgrades, and platform drift create operational friction.

Security and Compliance Responsibility:
Identity, permissions, logging, and network controls must be designed and maintained internally, which can be significant for regulated organizations.

dbt alternatives – Code based ETL tools

Teams that prefer code-first tools often look for dbt alternatives that provide strong SQL modeling, Python support, and seamless integration with CI/CD workflows and automated testing. These are part of a broader set of data transformation tools. Code-based ETL tools give developers greater control over transformations, environments, and orchestration patterns than GUI platforms. Below are four code-first contenders that organizations should evaluate.

Code-first dbt alternatives like SQLMesh, Bruin Data, and Dataform provide stronger CI/CD integration, automated testing, and more control over complex transformation workflows.

SQLMesh

SQLMesh is an open-source framework for SQL and Python-based data transformations. It provides strong visibility into how changes impact downstream models and uses virtual data environments to preview changes before they reach production.

Benefits

Efficient Development Environments:
Virtual environments reduce unnecessary recomputation and speed up iteration.

Considerations

Part of the Fivetran Ecosystem:
SQLMesh was acquired by Fivetran, which may influence its future roadmap and level of independence.

Dataform

Dataform is a SQL-based transformation framework focused specifically for BigQuery. It enables teams to create table definitions, manage dependencies, document models, and configure data quality tests inside the Google Cloud ecosystem. It also provides version control and integrates with GitHub and GitLab.

Benefits

Centralized BigQuery Development:
Dataform keeps all modeling and testing within BigQuery, reducing context switching and making it easier for teams to collaborate using familiar SQL workflows.

Considerations

Focused Only on the GCP Ecosystem:
Because Dataform is geared toward BigQuery, it may not be suitable for organizations that use multiple cloud data warehouses.

AWS Glue

AWS Glue is a serverless data integration service that supports Python-based ETL and transformation workflows. It works well for organizations operating primarily in AWS and provides native integration with services like S3, Lambda, and Athena.

Benefits

Python-First ETL in AWS:
Glue supports Python scripts and PySpark jobs, making it a good fit for engineering teams already invested in the AWS ecosystem.

Considerations

Requires Engineering Expertise:
Glue can be complex to configure and maintain, and its Python-centric approach may not be ideal for SQL-first analytics teams.

Bruin Data

Bruin is a modern SQL-based data modeling framework designed to simplify development, testing, and environment-aware deployments. It offers a familiar SQL developer experience while adding guardrails and automation to help teams manage complex transformation logic.

Benefits

Modern SQL Modeling Experience:
Bruin provides a clean SQL-first workflow with strong dependency management and testing.

Considerations

Growing Ecosystem:
Bruin is newer than dbt and has a smaller community and fewer third-party integrations.

dbt alternatives – Graphical ETL tools

While code-based transformation tools provide the most flexibility and long-term maintainability, some organizations prefer graphical user interface (GUI) tools. These platforms use visual, drag-and-drop components to build data integration and transformation workflows. Many of these platforms fall into the broader category of no-code ETL tools. GUI tools can accelerate onboarding for teams less comfortable with code editors and may simplify development in the short term. Below are several GUI-based options that organizations often consider as dbt alternatives.

GUI-based dbt alternatives such as Matillion, Informatica, and Alteryx use drag-and-drop interfaces that simplify development and accelerate onboarding for mixed-skill teams.

Matillion

Matillion is a cloud-based data integration platform that enables teams to design ETL and transformation workflows through a visual, drag-and-drop interface. It is built for ease of use and supports major cloud data warehouses such as Amazon Redshift, Google BigQuery, and Snowflake.

Benefits

User-Friendly Visual Development:
Matillion simplifies pipeline building with a graphical interface, making it accessible for users who prefer low-code or no-code tooling.

Considerations

Limited Flexibility for Complex SQL Modeling:
Matillion’s visual approach can become restrictive for advanced transformation logic or engineering workflows that require version control and modular SQL development.

Informatica

Informatica is an enterprise data integration platform with extensive ETL capabilities, hundreds of connectors, data quality tooling, metadata-driven workflows, and advanced security features. It is built for large and diverse data environments.

Benefits

Enterprise-Scale Data Management:
Informatica supports complex data integration, governance, and quality requirements, making it suitable for organizations with large data volumes and strict compliance needs.

Considerations

High Complexity and Cost:
Informatica’s power comes with a steep learning curve, and its licensing and operational costs can be significant compared to lighter-weight transformation tools.

Alteryx

Alteryx is a visual analytics and data preparation platform that combines data blending, predictive modeling, and spatial analysis in a single GUI-based environment. It is designed for analysts who want to build workflows without writing code and can be deployed on-premises or in the cloud.

Benefits

Powerful GUI Analytics Capabilities:
Alteryx allows users to prepare data, perform advanced analytics, and generate insights in one tool, enabling teams without strong coding skills to automate complex workflows.

Considerations

High Cost and Limited SQL Modeling Flexibility:
Alteryx is one of the more expensive platforms in this category and is less suited for SQL-first transformation teams who need modular modeling and version control.

Azure Data Factory (ADF)

Azure Data Factory (ADF) is a fully managed, serverless data integration service that provides a visual interface for building ETL and ELT pipelines. It integrates natively with Azure storage, compute, and analytics services, allowing teams to orchestrate and monitor pipelines without writing code.

Benefits

Strong Integration for Microsoft-Centric Teams:
ADF connects seamlessly with other Azure services and supports a pay-as-you-go model, making it ideal for organizations already invested in the Microsoft ecosystem.

Considerations

Limited Transformation Flexibility:
ADF excels at data movement and orchestration but offers limited capabilities for complex SQL modeling, making it less suitable as a primary transformation engine

Talend

Talend provides an end-to-end data management platform with support for batch and real-time data integration, data quality, governance, and metadata management. Talend Data Fabric combines these capabilities into a single low-code environment that can run in cloud, hybrid, or on-premises deployments.

Benefits

Comprehensive Data Quality and Governance:
Talend includes built-in tools for data cleansing, validation, and stewardship, helping organizations improve the reliability of their data assets.

Considerations

Broad Platform, Higher Operational Complexity:
Talend’s wide feature set can introduce complexity, and teams may need dedicated expertise to manage the platform effectively.

SSIS (SQL Server Integration Services)

SQL Server Integration Services is part of the Microsoft SQL Server ecosystem and provides data integration and transformation workflows. It supports extracting, transforming, and loading data from a wide range of sources, and offers graphical tools and wizards for designing ETL pipelines.

Benefits

Strong Fit for SQL Server-Centric Teams:
SSIS integrates deeply with SQL Server and other Microsoft products, making it a natural choice for organizations with a Microsoft-first architecture.

Considerations

Not Designed for Modern Cloud Data Warehouses:
SSIS is optimized for on-premises SQL Server environments and is less suitable for cloud-native architectures or modern ELT workflows.

Why These dbt Alternatives Exist: The Full Context

Recent consolidation, including Fivetran acquiring SQLMesh and merging with dbt Labs, has increased concerns about vendor lock-in and pushed organizations to evaluate more flexible transformation platforms.

Organizations explore dbt alternatives when dbt no longer meets their architectural, security, or workflow needs. As teams scale, they often require stronger orchestration, consistent development environments, mixed SQL and Python workflows, and private deployment options that dbt Cloud does not provide.

Some teams prefer code-first engines for deeper CI/CD integration, automated testing, and strong guardrails across developers. Others choose GUI-based tools for faster onboarding or broader integration capabilities. Recent market consolidation, including Fivetran acquiring SQLMesh and merging with dbt Labs, has also increased concerns about vendor lock-in.

These factors lead many organizations to evaluate tools that better align with their governance requirements, engineering preferences, and long-term strategy.

Should You DIY a dbt Data Platform?

DIY dbt Core offers full control but requires significant engineering work to manage orchestration, CI/CD, security, and long-term platform maintenance.

Running dbt Core yourself can seem attractive because it offers full control and avoids platform subscription costs. However, building a stable, secure, and scalable dbt environment requires significantly more than executing dbt build on a server. It involves managing orchestration, CI/CD, and ensuring development environment consistency along with long-term platform maintenance, all of which require mature DataOps practices.

The true question for most organizations is not whether they can run dbt Core themselves, but whether it is the best use of engineering time. This is essentially a question of whether to build vs buy your data platform. DIY dbt platforms often start simple and gradually accumulate technical debt as teams grow, pipelines expand, and governance requirements increase.

When DIY Makes Sense

The team has strong platform engineering expertise
Pipelines are relatively simple
Security and compliance needs are minimal
The organization prefers to own and operate every part of the stack

When DIY Becomes a Liability

Multiple analytics engineers need consistent development environments
Governance, auditing, or private deployment become required
Pipelines need enterprise-grade orchestration
Upgrades and maintenance begin consuming valuable engineering time

For many organizations, DIY works in the early stages but becomes difficult to sustain as the platform matures.

How to Choose the Right dbt Alternative

The right dbt alternative depends on your team’s skills, governance requirements, pipeline complexity, and long-term data platform strategy.

Selecting the right dbt alternative depends on your team’s skills, security requirements, and long-term data platform strategy. Each category of tools solves different problems, so it is important to evaluate your priorities before committing to a solution.

1. Team Skills and Workflow Preferences

SQL-first teams: Tools like dbt and Dataform work well for analysts and analytics engineers.
Engineering-first teams: SQLMesh, and AWS Glue offer deeper CI integration, testing, and Python support.‍
Mixed-skill teams: GUI tools like Matillion, Informatica, and Alteryx provide visual development.

2. Governance and Security Requirements

Need for private cloud or VPC deployment
Centralized identity and access management
Audit logging and compliance standards
Ability to control data movement and network boundaries

If these are priorities, a platform with secure deployment options or multi-engine support may be a better fit than dbt Cloud.

3. Complexity of Pipelines

Simple pipelines may work with lightweight tools
Complex, multi-team pipelines benefit from strong orchestration, consistent environments, and guardrails
SQL-only tools may fall short when pipelines require Python-based logic or mixed-language workflows

4. Integration and Ecosystem Compatibility

Choose a tool that integrates cleanly with your cloud environment and data warehouse
Engineering-forward teams may prioritize CI/CD and Git workflows
Analytics-focused or traditional Data Engineering teams may value GUI tools

5. Vendor Lock-In and Long-Term Flexibility

Recent consolidation in the ecosystem has raised concerns about vendor dependency. Organizations that want long-term flexibility often look for:

Multi-engine support
Open-source components
Tooling that can be run in their cloud environment

6. Total Cost of Ownership

Consider platform fees, engineering maintenance, onboarding time, and the cost of additional supporting tools such as orchestrators, IDEs, and environment management

Team Profile	Pipeline Complexity	Recommended dbt Alternative Category
Small team with limited platform engineering capacity	Simple pipelines	dbt Cloud, Datacoves SaaS, or GUI tools (Alteryx)
SQL-first analytics team	Simple to moderate transformations	dbt Cloud, Dataform, Bruin Data, or Datacoves SaaS for standardized SQL development
Mixed-skill team with analysts and engineers	Moderate complexity with collaboration needs	GUI ETL tools (Matillion, Data Factory) and a code-based SQL/Python tool for advanced modeling
Highly regulated or security-focused organization	Moderate to high complexity	dbt Cloud Alternatives with Private Cloud/VPC deployment (Datacoves)
Engineering-first data platform team	Complex, multi-step pipelines	Code-based ETL tools with CI/CD (SQLMesh, Bruin Data, or AWS Glue) or Datacoves for integrated orchestration and multi-engine support

Final dbt alternative Recommendation

dbt remains a strong choice for SQL-based transformations, but it is not the only option. As organizations scale, they often need stronger orchestration, consistent development environments, Python support, and private deployment capabilities that dbt Cloud or DIY dbt Core may not provide. Evaluating alternatives helps ensure that your transformation layer aligns with your long-term platform and governance strategy.

Code-first tools like SQLMesh, Bruin Data, and Dataform offer strong engineering workflows, while GUI-based tools such as Matillion, Informatica, and Alteryx support faster onboarding for mixed-skill teams. The right choice depends on the complexity of your pipelines, your team’s technical profile, and the level of security and control your organization requires.

Datacoves provides a flexible, secure alternative that supports dbt, SQLMesh, and Bruin in a unified environment. With private cloud or VPC deployment, managed Airflow, and a standardized development experience, Datacoves helps teams avoid vendor lock-in while gaining an enterprise-ready platform for analytics engineering.

Selecting the right dbt alternative is ultimately about aligning your transformation approach with your data architecture, governance needs, and long-term strategy. Taking the time to assess these factors will help ensure your platform remains scalable, secure, and flexible for your future needs.

‍

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

Dbt & Airflow

Ultimate dbt-utils Cheat Sheet

5 mins read

The dbt-utils package enhances the dbt experience by offering a suite of utility macros. Designed to tackle common SQL modeling patterns, it streamlines complex operations, allowing users to focus on data transformation rather than the intricacies of SQL. dbt-utils is a must-have tool for dbt aficionados!

dbt-utils Cheat sheet

The dbt-utils package is a gem in the world of data transformations. Let this cheat sheet guide you swiftly through its features, ensuring you get the most out of dbt-utils. Enjoy!

SQL Generators

The SQL generators in the dbt-utils package streamline your modeling tasks. By automating common SQL patterns, they minimize manual coding and guarantee consistent, high-quality queries. Think of it as a handy toolkit for every dbt user's SQL endeavors!

SQL Generators
date_spine(datepart, start_date, end_date) (view source)	This macro provides the SQL needed to create a sequence of dates. It starts with the 'start_date' (if it matches the specified date part) but doesn't go up to the 'end_date'.
deduplicate(relation, partition_by, order_by) (view source)	This macro produces the SQL needed to delete duplicate rows from a model, source, or CTE.
haversine_distance(lat1, lon1, lat2, lon2, unit='mi') (view source)	This macro computes the haversine distance between a pair of x/y coordinates. You can also specify a 'unit' for the distance.
group_by(n) (view source)	This macro constructs a "group by" statement for fields ranging from 1 to N.
star(from, except= [], relation_alias= [], prefix=' ', suffix=' ', quote_identifiers=True) (view source)	This macro creates a list of fields separated by commas from the 'from' argument, but it leaves out any fields mentioned in the 'except' argument. Further customizations are available with the optional arguments.
union_relations(relations, exclude, include, column_override, source_column_name='_dbt_source_relation', where) (view source)	This macro merges the items from the relations argument using 'union all'. It can handle when the columns are in different orders or when some columns are missing in certain Relations. Any unique columns will have 'null' values where they don't exist in some relations. Additionally, a new column, '_dbt_source_relation', is added to show where each record originated.
generate_series(upper_bound) (view source)	This macro creates a cross-database way to produce a list of numbers up to a specified maximum. It generates a SQL result set where the numbers start from 1 and go up to the number you've chosen.
generate_surrogate_key(<field list>) (view source)	This macro creates a cross-database way to produce a hashed surrogate key based on the fields you specify. Note: Older versions of this macro is called 'surrogate_key()'. Both nulls and blank strings were treated identically. If you need to revert to this behavior for reasons like maintaining backward compatibility, you should add the surrogate_key_treat_nulls_as_empty_strings variable to your 'dbt_project.yml'.
safe_add(<field list>) (view source)	This macro cross-database way to sum up fields that can have null values, based on the fields you indicate.
safe_divide(numerator, denominator) (view source)	This macro divides values but gives a null result if the bottom number (denominator) is 0.
safe_subtract(<field list>) (view source)	This macro implements a cross-database to calculate the difference between fields that might have null values, based on the fields you select.
pivot(column, values, alias=True, agg= 'sum', cmp='=', prefix= ' ', suffix= ' ', then_value= 1, else_value=0, quote_identifiers=True) (view source)	This macro transforms data by turning row values into column headers.
unpivot(relation, cast_to= 'varchar', exclude=none, remove=none, field_name=none, value_name=none, quote_identifiers=False) (view source)	This macro converts a table from a wide layout to a long layout, similar to the melt function in pandas. It replaces Boolean values with the strings 'true' or 'false'.
width_bucket(expr, min_value, max_value, num_buckets) (view source)	This macro creates equal-width histograms by dividing the range into uniform intervals. It then determines the bucket number for a given value after evaluating an expression. The output is either an integer representing the bucket number or null if any input is null.

Generic tests

^Within the dbt-utils package lies a set of generic tests, designed to validate your data effortlessly. These tests ensure consistency and quality, checking for common issues without the need to craft custom validation rules. It's data integrity made simple for dbt users.

Generic Tests
equal_rowcount(compare_model, group_by_columns) (view source)	This test compares two models or tables and asserts they have the same number of rows.
fewer_rows_than(compare_model, group_by_columns) (view source)	This test asserts that a specific model or table has fewer rows than a specified threshold or compared to another table.
equality(compare_model, compare_columns, group_by_columns) (view source)	This test compares two models or tables and asserts their equality. You can specify a subset of columns to compare.
expression_is_true(expression, where) (view source)	The test ensures that a given SQL expression holds true for every record, which is particularly helpful in maintaining column integrity. For instance, this can be used to confirm results from simple algebraic operations between columns, validate the length of a column, or assess the truth value of a column. Note: Use the where clause to check a subset of records.
recency(datepart, field, interval, group_by_columns) (view source)	This test ensures the timestamp column in the given model has data that's newer than a specific date range.
at_least_one(group_by_columns) (view source)	This test checks a specified column and asserts it has at least one value.
not_constant(group_by_columns) (view source)	This test makes sure a column doesn't have the same value for every row.
not_empty_string(trim_whitespace=true) (view source)	This test asserts that none of the values are equal to an empty string. The optional argument, trim_whitespace will remove whitespace from the column.
cardinality_equality(field, to) (view source)	This test asserts that a specific column has the same number of unique values as another column in a different table.
not_null_proportion(at_least, at_most=1.0, group_by_columns) (view source)	This test checks that a column has non-empty values within a certain range. If you don't specify an upper limit, it assumes up to 100% (or 1.0) of the values can be non-empty.
not_accepted_values(values) (view source)	This test checks that there aren't any rows with the specified values.
relationships_where(to, field, from_condition, to_condition) (view source)	This test checks the connection between two models, similar to the basic relationship checks. It also allows for filtering out specific rows, like test entries or recent entries, which might have temporary inconsistencies because of data processing limits.
mutually_exclusive_ranges(lower_bound_column, upper_bound_column, partition_by=None, gaps='allowed', zero_length_range_allowed=False) (view source)	By setting the ‘lower_bound_column’ and ‘upper_bound_column’ ranges, This test asserts that the ranges don't overlap with those of other rows.
sequential_values(interval=1, datepart=None, group_by_columns) (view source)	This test checks that a column has values in order, either numbers or dates
unique_combination_of_columns(combination_of_columns, quote_columns=false) (view source)	This test ensures that when certain columns are combined, their values are unique. For instance, while a month or a product alone might repeat, the pairing of a month with a product is always distinct.
accepted_range(min_value, max_value, where, Inclusive=true) (view source)	This test ensures a column's values are within a certain range, set by a minimum and maximum value. You can choose if the range includes or excludes the boundary values. If needed, you can also focus on specific records with a filter. As long as the data type can be compared using > or < signs, you can use this test.
Grouping in tests	Some tests offer an extra group_by_columns option to get more detailed checks. Arg for group_by_column must be in list format. Tests that support this include: equal_rowcount, fewer_rows_than, recency, at_least_one, not_constant, sequential_values, non_null_proportion.

Introspective macros

The introspective macros within the dbt-utils package are a window into your data's metadata. They empower you to dynamically interact with and understand the underlying structure of your datasets. It's like having a magnifying glass for the intricacies of your dbt projects!

Introspective Macros
get_column_values(table, column, where=none, order_by='count(*) desc', max_records=none, default=[]) (view source)	This macro produces an array containing the distinct values of a column from a specified relation. Note: Setting order_by='my_column' will sort alphabetically, while order_by='min(created_at)' will sort by when the value was first observed.
get_filtered_columns_in_relation(from, except=[]) (view source)	This macro provides a Jinja list, containing columns from a specific relation. It's meant for actual relations and not for CTEs.
get_relations_by_pattern(schema_pattern, table_pattern, exclude, database=target.database) (view source)	This macro produces a list of Relations that align with a specified schema or table name pattern. It pairs well with the 'union_relations' macro.
get_relations_by_prefix(schema, prefix, exclude, database= target.database) (view source)	DEPRECATED SOON: Use 'get_relations_by_pattern' instead. This macro gives a list of Relations matching a specific prefix and has an option to exclude certain patterns. It works especially well when combined with 'union_relations'.
get_query_results_as_dict(<sql query>) (view source)	This macro provides a dictionary based on a SQL query, eliminating the need to work with the Agate library to process the outcome.
get_single_value(<sql query>) (view source)	This macro fetches a single value from a SQL query, allowing you to bypass the use of the Agate library when handling the result.

‍

Please contact us with any errors or suggestions.

Dbt & Airflow

dbt vs Airflow: What's the Difference and Why You Need Both

5 mins read

dbt transforms data inside your warehouse using SQL. Airflow orchestrates workflows across your entire data pipeline. They are not competing tools -- they solve different problems. dbt handles the "T" in ELT: modeling, data testing, unit testing, documentation, and lineage. Airflow handles the scheduling, sequencing, retries, failure notifications, observability, and coordination of every step in the pipeline, including ingesting data, triggering dbt, and activation steps. Most production data teams use both. The real decision is not which tool to pick, but how to run them together without drowning in infrastructure overhead.

What is Apache Airflow?

Airflow is a workflow scheduler, but it is only one part of a broader data orchestration problem that includes dependencies, retries, visibility, and ownership across the entire data lifecycle.

How Does Airflow Work?

Imagine a scenario where you have a series of tasks: Task A, Task B, and Task C. These tasks need to be executed in sequence every day at a specific time. Airflow enables you to programmatically define the sequence of steps as well as what each step does. With Airflow you can also monitor the execution of each step and get alerts when something fails.

What Sets Airflow Apart?

Airflow's core strength is flexibility. You define the code behind each task, not just the sequence and schedule. Whether it's triggering an extraction job, calling a Python script, or integrating with an external platform, Airflow lets you control exactly how each step executes.

But that flexibility is a double-edged sword. Just because you can code everything inside Airflow doesn't mean you should. Overly complex task logic inside Airflow makes workflows harder to debug and maintain. The best practice: use Airflow as the orchestrator, not the execution engine. Let specialized tools handle the heavy lifting. Use tools like dbt for transformation and tools like Azure Data Factory, dlt, Fivetran or Airbyte for extraction. Let Airflow handle the scheduling, coordination, observability, and failure notification.

Best Use Cases for Airflow

Data Extraction and Loading: Trigger external tools like Azure Data Factory, dlt, Fivetran, or Airbyte for data ingestion. When no off-the-shelf tool exists, Airflow can call custom Python scripts or frameworks like dlt to handle the load.
Data Transformation Coordination: Trigger dbt Core to run transformation steps in the right sequence with appropriate parallelism, directly in the data warehouse.
ML Workflow Coordination: Orchestrate machine learning pipelines, from data preprocessing to model refreshes, ensuring each step runs after upstream transformations complete.
Automated Reporting: Trigger report generation or data exports to BI and reporting tools on a schedule.
System Maintenance and Monitoring: Schedule backups, manage log storage, and send alerts on system anomalies.

Airflow Benefits

Job Management: Define workflows and task dependencies with a built-in scheduler that handles sequencing and synchronization.
Retry Mechanism: Automatically retry failed tasks or entire pipelines with configurable retry logic, ensuring resilience and fault tolerance.
Alerting: Send failure notifications to email, Slack, MS Teams, or other tools so your team knows immediately when something breaks.
Monitoring: Track workflow status, task durations, and historical runs through the Airflow UI.
Scalability: Deploy on Kubernetes to scale up during peak processing and scale down when idle, keeping infrastructure costs aligned with actual usage.
Community: Active open-source project with robust support on GitHub and Slack, plus a deep library of learning resources.

Airflow Challenges

Python Required: Airflow is Python-based, so creating workflows requires Python proficiency.
Learning Curve: Concepts like DAGs, operators, hooks, and task dependencies take time to master, especially for team members without prior orchestration experience.
Production Deployment: Setting up a production-grade Airflow environment requires knowledge of Kubernetes, networking, and infrastructure tuning.
Ongoing Maintenance: Upgrading Airflow and its underlying infrastructure carries real cost and complexity, particularly for teams that don't manage distributed systems regularly.
Debugging: Identifying root causes in a complex Airflow environment requires experience. Knowing where to start is half the battle.
Cost of Ownership: Airflow is open-source and free to use, but the total cost of deployment, tuning, and ongoing support adds up quickly.
Development Experience: Testing DAGs locally requires setting up Docker-based Airflow environments, which adds friction and slows iteration cycles.

What is dbt?

dbt (data build tool) is an open-source framework that uses SQL and Jinja templating to transform data inside the warehouse. Developed by dbt Labs, dbt handles the "T" in ELT: transforming, testing, documenting, and tracking lineage of data models. It brings software engineering practices like version control, CI/CD, and modular code to analytics workflows.

How Does dbt Work?

dbt lets you write transformation logic as SQL select statements, enhanced with Jinja templating for dynamic execution. Each transformation is called a "model." When you run dbt, it compiles those models into raw SQL and executes them against your warehouse, creating or replacing tables and views. dbt resolves dependencies between models automatically, running them sequentially or in parallel as needed.

What Sets dbt Apart?

Unlike traditional ETL tools that abstract SQL behind drag-and-drop interfaces, dbt treats SQL as the primary language for transformation. This makes it immediately accessible to anyone who already knows SQL. But dbt goes further by adding Jinja templating for dynamic scripting, conditional logic, and reusable macros.

A few things set dbt apart from alternatives:

Idempotency: dbt transformations produce the same result every time they run, making pipelines predictable and repeatable.
Testing built in: Define data quality tests alongside your models. No need for a separate testing tool.
Auto-generated documentation: dbt produces a web-based docs site with descriptions, lineage graphs, and metadata -- all generated directly from your project.
Open metadata: dbt outputs metadata files (artifacts) that downstream tools like data catalogs and observability platforms can consume. This format is quickly becoming the standard for sharing information across data platforms.

dbt excels at the "T" in ELT but requires complementary tools for extraction, loading, and orchestration.

Best Use Cases for dbt

Data Transformation for Analytics: Transform and aggregate raw data into models ready for BI platforms, analytics tools, and ML pipelines.
SQL-based Workflow Creation: Build modular transformation workflows using SQL and Jinja, with reusable macros and dynamic script generation.
Data Validation and Testing: Define schema tests and custom tests alongside your models to catch data quality issues before they reach end users.
SQL Unit Testing: Verify that a specific transformation works as expected by providing sample input data and comparing against expected output.
Documentation and Lineage: Auto-generate a documentation site with descriptions, column-level lineage, and dependency graphs to simplify impact analysis and debugging.
Version Control and DataOps: Manage transformations through Git-based workflows with branching, pull requests, and environment-specific deployments.

dbt Benefits

SQL-based: dbt uses SQL and Jinja for transformation, so most data analysts and engineers can contribute without learning a new language.
Modularity: Each transformation is a SQL select statement that builds on other models. This modular approach simplifies debugging, testing, and reuse.
Testing: Built-in testing framework lets you validate data quality at every layer without introducing additional tools.
Documentation: Auto-generated docs site exposes descriptions, lineage, and metadata, making it easier to maintain projects over time and helping data consumers find and understand data before use.
Lineage and Impact Analysis: Visual dependency graphs simplify debugging when issues occur and help you trace which sources feed a specific data product like an executive dashboard.
Open-source: dbt Core is open-source with no vendor lock-in.
Packages and Libraries: A large community maintains reusable dbt packages and Python libraries that extend dbt's functionality out of the box.
Open Metadata: dbt produces artifact files that data catalogs and observability tools can consume, making it a de facto standard for sharing metadata across platforms.
AI-Ready Foundation: dbt's structured metadata, lineage, and documentation make your warehouse AI-ready. When dbt artifacts are available in your warehouse (e.g. Snowflake), AI tools can use that context to understand table relationships, column descriptions, and data quality rules, turning your transformation layer into a knowledge layer that LLMs and copilots can reason over. See Snowflake Cortex CLI running in Datacoves.
Community: 100k+ members on Slack, extensive learning resources, and a growing ecosystem of practitioners solving similar data problems.

dbt Challenges

Transformation only: dbt handles the "T" in ELT. Extraction, loading, orchestration, and downstream activation all require separate tools.
Macros complexity: dbt macros are powerful but can be hard to read, debug, and test, especially for analysts who are primarily SQL users.
Infrastructure management: Running dbt Core in production requires setting up CI/CD, environment management, secrets handling, and Git workflows. This overhead grows as the team and project scale.
No built-in orchestration: dbt Core has no scheduler. You need an external orchestrator like Airflow to automate runs and coordinate dbt with the rest of your pipeline.
Multi-project complexity: As organizations grow into multiple dbt projects, managing cross-project dependencies, shared models, and consistent conventions becomes a significant challenge.

Common Misconception: dbt Core vs. dbt Cloud

A common misunderstanding in the data community is equating "dbt" with "dbt Cloud." When people say dbt, they typically mean dbt Core -- the open-source framework. dbt Cloud is a commercial product from dbt Labs built on top of dbt Core that adds a scheduler, hosted IDE, monitoring, and managed infrastructure.

The key distinction: you can use dbt Core without paying for dbt Cloud, but you won't get a scheduler, hosted environment, or managed CI/CD. That means you'll need to solve orchestration, deployment, and infrastructure yourself. This is typically done with a tool like Airflow and a platform to tie it all together.

For a deeper comparison, see our article on dbt Cloud vs. dbt Core.

dbt Cloud's Scheduler: What It Does and Doesn't Cover

dbt Cloud includes a scheduler that automates dbt runs on a defined cadence. This keeps your transformation models fresh and your data reliable. But it only schedules dbt jobs, nothing else.

Your Extract and Load steps, downstream processes like BI refreshes, ML model retraining, marketing automation, and any cross-system coordination still need an orchestrator. dbt Cloud's scheduler does not replace Airflow. It replaces the need to cron-schedule dbt runs manually, but leaves the rest of the pipeline unmanaged.

Do I need Airflow if I use dbt Cloud?

Yes, unless your entire pipeline is just dbt transformations with nothing before or after them.

dbt Cloud schedules and runs your dbt jobs. That covers transformation, testing, and documentation. But most production pipelines involve more than that: extracting data from source systems, loading it into the warehouse, refreshing BI dashboards, triggering ML model retraining, sending alerts, and coordinating dependencies across all of these steps.

dbt Cloud does not orchestrate any of those. If your pipeline looks like this:

Extract → Load → Transform → Activate

Then dbt Cloud only covers one step. You still need an orchestrator like Airflow to tie the full pipeline together, manage dependencies between stages, handle retries, and provide end-to-end visibility when something breaks.

What dbt cloud's scheduler actually covers

For small teams with a simple pipeline (one source, one warehouse, one dashboard), dbt Cloud's scheduler may be enough to start. But as soon as you add sources, downstream consumers, or cross-system dependencies, you'll need Airflow or a platform that includes it.

Managed dbt Core + Managed Airflow: Removing the Infrastructure Burden

The hidden cost of choosing open-source dbt Core and Airflow is the platform you have to build around them. CI/CD pipelines, secrets management, environment provisioning, Git workflows, Kubernetes tuning, Airflow upgrades -- this is real engineering work that pulls your team away from delivering data products.

Running dbt + Airflow: Build it yourself vs Managed platform

Datacoves eliminates that overhead. It provides managed dbt Core and managed Airflow in a unified environment deployed in your private cloud. Developers get an in-browser VS Code editor with dbt, Python, extensions, and Git pre-configured. Orchestration is handled through Datacoves' managed Airflow, which includes a simplified YAML-based DAG configuration that connects Extract, Load, and Transform steps without writing boilerplate Python.

Best practices, governance guardrails, and accelerators are built in, so teams get a production-grade data platform in minutes instead of months. For a deeper look at the tradeoffs, see our guide on building vs. buying your analytics platform.

To learn more about Datacoves, check out our product page.

dbt vs Airflow: Which One Should You Choose?

This is not an either-or decision. dbt and Airflow solve different problems, and trying to force one tool to do the other's job creates fragile, hard-to-maintain pipelines.

Use dbt for transformation, testing, documentation, and lineage. Use Airflow for scheduling, dependency management, retries, and end-to-end pipeline coordination. The comparison table below shows where each tool fits.

	Apache Airflow	dbt Core
Recommended Usage	Data workflow orchestration	Data transformation, documentation, data quality, data lineage
Primary Strength	Orchestration of data pipelines	Data modeling and transformation
Development Language	Python or YAML using DagFactory or dbt-coves	SQL and Python in some warehouses
Ease of Use	Python required. Simpler with a managed platform, complex if self-hosted.	SQL-based, low learning curve. Infrastructure management grows harder at scale.
Pricing	Open-source. Real cost is deployment and maintenance. Managed options: Astronomer, Datacoves.	Open-source. Real cost is CI/CD, environments, and orchestration. Managed options: dbt Cloud, Datacoves.

Which Tool Should You Adopt First?

If you need to pick one tool to start with, here's the decision framework:

Start with dbt if your immediate priority is transforming raw data into reliable, tested models in the warehouse. dbt gives you version control, testing, documentation, and lineage from day one. For small teams with simple pipelines, a cron job or dbt Cloud's scheduler can handle automation in the short term.

Start with Airflow if you already have multiple systems that need coordination; extraction from SaaS sources, loading into the warehouse, transformation, and downstream activation. Airflow gives you end-to-end visibility and control across the full pipeline.

The reality: most teams outgrow a single-tool approach quickly. Once your pipeline spans more than one step, you need both. The question isn't which tool to pick, it's how soon you'll need the other one. If you're still evaluating transformation tools, check out our dbt alternatives comparison.

How dbt and Airflow Work Together

In a production pipeline, the tools fit together like this:

Airflow triggers extraction and loading, kicking off tools like dlt, Airbyte, Fivetran, or Azure Data Factory to pull data from source systems into the warehouse.
Airflow triggers dbt. Once data lands in the warehouse, Airflow calls dbt to run transformations and tests in the correct sequence.
Airflow handles downstream activation. After transformation, Airflow pushes data to BI platforms, refreshes ML models, triggers marketing automation, or sends alerts.

Who owns what: Airflow vs dbt in the data pipeline

Pipeline Stage	Tool
Extract & Load	Airflow triggers dlt, Airbyte, Fivetran, or ADF
Transform & Test	dbt Core
Orchestrate & Schedule	Airflow
Documentation & Lineage	dbt Core
Downstream Activation	Airflow triggers BI, ML, alerts
Retries & Failure Handling	Airflow

Airflow owns the "when" and "what order." dbt owns the "how" for transformation. Neither tool steps on the other's responsibilities.

The challenge is running them together. Deploying Airflow and dbt Core in a shared, production-grade environment requires Kubernetes, CI/CD, secrets management, environment provisioning, and ongoing maintenance. For teams that don't want to build and maintain that platform layer themselves, a managed solution like Datacoves provides both tools pre-integrated in your private cloud -- so your team focuses on data, not infrastructure.

Conclusion: dbt and Airflow Are Better Together

dbt and Airflow are not competing tools. dbt transforms your data. Airflow orchestrates your pipeline. Trying to use one without the other leaves gaps that show up as fragile pipelines, manual workarounds, and constant firefighting. The teams that struggle most are the ones who pick the right tools but never invest in the platform layer that makes them work together.

The real question is not "dbt vs Airflow," it's how you run them together without building and maintaining a platform from scratch. That infrastructure layer (Kubernetes, CI/CD, secrets, environments, Git workflows, upgrades) is where most teams burn time and budget.

Datacoves gives you managed dbt Core and managed Airflow in a unified environment, deployed in your private cloud, with best practices and governance built in. Your team focuses on delivering data products, not maintaining the platform underneath them.

See how Datacoves works.

Dbt & Airflow

An Overview of Testing Options for dbt (data build tool)

5 mins read

dbt, also known as data build tool, is a data transformation framework that leverages templated SQL to transform and test data. dbt is part of the modern data stack and helps practitioners apply software development best practices on data pipelines. Some of these best practices include code modularity, version control, and continuous testing via its built in data quality framework. In this article we will focus on how data can be tested with dbt via build in functionality and with additional dbt packages and libraries.

Adding tests to workflows does more than ensure code and data integrity; it facilitates a continuous dialogue with your data, enhancing understanding and responsiveness. By integrating testing into your regular workflows, you can:

Identify Specific Issues: Tests can direct your attention to specific records that may require closer inspection or immediate action. This targeted approach helps maintain high data quality and reliability.
Enhance Data Familiarization: Regular interaction with test results promotes a deeper understanding of the data's characteristics and behaviors. This ongoing learning process can inform better decision-making and data handling practices.
Maintain Active Engagement: Incorporating tests as a routine part of your workflow turns data testing from a periodic audit into a consistent part of your data management strategy. This active engagement helps in preemptively identifying potential discrepancies before they escalate into larger issues.

By embedding testing into the development cycle and consuming the results diligently, teams not only safeguard the functionality of their data transformations but also enhance their overall data literacy and operational efficiency. This proactive approach to testing ensures that the insights derived from data are both accurate and actionable.

dbt tests in dbt Core

In dbt, there are two main categories of tests: data tests and unit tests.

Data tests are meant to be executed with every pipeline run to validate the integrity of the data and can be further divided into two types: Generic tests and Singular tests.

Singular Tests: A singular dbt test is written in a SQL file with a query that returns records that fail the test. This type of test is straightforward and focuses on specific conditions or rules that data must meet.
Generic Tests: A generic dbt test is defined in a YAML file and references a macro that contains the SQL logic. This setup allows for greater flexibility and reuse. A dbt test macro typically contains a select statement that returns records that don’t pass the test. The macro takes a model and column_name to be injected with Jinja templates, and extra arguments can be passed when configuring the test. This configurability makes generic tests versatile and adaptable to various scenarios, enhancing the robustness of your data testing framework.

Regardless of the type of data test, the process is the same behind the scenes: dbt will compile the code to a SQL SELECT statement and execute it against your database. If any rows are returned by the query, this indicates a failure to dbt.

Unit tests, on the other hand, are meant to validate your transformation logic. They rely on predefined data for comparison to ensure your logic is returning an expected result. Unlike data tests, which are meant to be run with every pipeline execution, unit tests are typically run during the CI (Continuous Integration) step when new code is introduced. Unit tests were incorporated in dbt Core as of version 1.8.

‍

Categories of dbt tests — dbt core tests

dbt Core Generic tests

These are foundational tests provided by dbt-core, focusing on basic schema validation and source freshness. These tests are ideal for ensuring that your data sources remain valid and up-to-date.

dbt-core provides four built-in generic tests that are essential for data modeling and ensuring data integrity:

unique: is a test to verify that every value in a column (e.g. customer_id) contains unique values. This is useful for finding records that may inadvertently be duplicated in your data.

not_null: is a test to check that the values for a given column are always present. This can help you find cases where data in a column suddenly arrives without being populated.

accepted_values: this test is used to validate whether a set of values within a column is present. For example, in a column called payment_status, there can be values like pending, failed, accepted, rejected, etc. This test is used to verify that each row within the column contains one of the different payment statuses, but no other. This is useful to detect changes in the data like when a value gets changed such as accepted being replaced with approved.

relationships: these tests check referential integrity. This type of test is useful when you have related columns (e.g. the customer identifier) in two different tables. One table serves as the “parent” and the other is the “child” table. This is common when one table has a transaction and only lists a customer_id and the other table has the details for that customer. With this test we can verify that every row in the transaction table has a corresponding record in the dimension/details table. For example, if you have orders for customer_ids 1, 2, 3 we can validate that we have information about each of these customers in the customer details table.

Using a generic test is done by adding it to the model's property (yml) file.

Validate information about each customers in the customer details table. — dbt Core Generic Tests

Generic tests can accept additional test configurations such as a where clause to apply the test on a subset of rows. This can be useful on large tables by limiting the test to recent data or excluding rows based on the value of another column. Since an error will stop a dbt build or dbt test of the project, it is also possible to assign a severity to a test and optionally a threshold where errors will be treated as warning instead of errors. Finally, since dbt will automatically generate a name for the test, it may be useful to override the auto generated test name for simplicity. Here's the same property file from above with the additional configurations defined.

Same property file with the additional configurations defined — dbt tests with where condition, severity, and name defined

dbt Core singular tests

Singular tests allow for the customization of testing parameters to create tailored tests when the default generic ones (or the ones in the packages discussed below) do not meet your needs. These tests are simple SQL queries that express assertions about your data. An example of this type of test can be a more complex assertion such as having sales for one product be within +/- 10% of another product. The SQL simply needs to return the rows that do not meet this condition.

dbt Core custom generic tests

In dbt, it is also possible to define your own custom generic tests. This may be useful when you find yourself creating similar Singular tests. A custom generic test is essentially the same as a dbt macro which has a least a model as a parameter, and optionally column_name, if the test will apply to a column. Once the generic test is defined, it can be applied many times just like the generic tests shipped with dbt Core. It is also possible to pass additional parameters to a custom generic test.

dbt Core unit testing

As our data transformations become more complex, the need for testing becomes increasingly important. The concept of unit testing is already well established in software development, where tests confirm that individual units of code work as intended. Recognizing this, dbt 1.8 introduced unit testing.

Unlike the data tests we have above, which ensure that incoming data meets specific criteria and are run at every data refresh, unit tests are designed to verify that the transformation logic itself produces the expected results. In the context of dbt, unit tests validate transformation logic by comparing the test results against predefined data typically defined using seeds (CSV files) or SQL queries. Unit tests should only be executed when new data transformation code is introduced and implemented since they are designed to help catch potential issues early in the development process. It is recommended to run unit tests only during the CI step. Running them in production would be a redundant use of compute resources because the expected outcomes do not change. Unit testing is only available in 1.8 or higher, but there are community packages (dbt-unit-testing, dbt_datamocktool, dbt-unittest) that have worked to solve this problem and are worth exploring if you are not using dbt 1.8.

dbt Core freshness check

While not technically a dbt test, a freshness check validates the timeliness of source data. The freshness check in dbt Core is designed to monitor the timeliness of the data. It helps ensure that the data loaded into your warehouse is updated regularly and remains relevant for decision-making processes. This is valuable because sometimes data will stop getting refreshed and the data pipelines will continue to run with a silent failure. To assure that you are alerted when a data delivery SLA is not met, simply add a freshness check to your sources.

This comprehensive suite of testing capabilities in dbt Core ensures that data teams can build, maintain, and verify the reliability and accuracy of their data models effectively.

Popular dbt Testing Packages

In addition to the generic tests that can be found within dbt Core, there are a lot more in the dbt ecosystem. These tests are found in dbt packages. Packages are libraries of reusable SQL code created by organizations of the dbt community. We will briefly go over some of the tests that can be found in these packages.

dbt-utils generic dbt tests

The dbt-utils package, created by dbt Labs, contains generic dbt tests, SQL generators, and macros. The dbt_utils package include 16 generic tests including:

not_accepted_values: this test is the opposite of the accepted_values test and is used to check that specific values are NOT present in a particular range of rows.‍

equal_rowcount: this test checks that two different tables have the same number of rows. This is a useful test that can assure that a transformation step does not accidentally introduce additional rows in the target table.‍

fewer_rows_than: this test is used to verify that a target table contains fewer rows than a source table. For example, if you are aggregating a table, you expect that the target table will have fewer rows than the table you are aggregating. This test can help you validate this condition.

There are 17 generic dbt tests available in the dbt-utils package.‍

dbt-expectations generic dbt tests

Another awesome package that can accelerate your data testing is dbt-expectations. This package is a port of the great Python library Great Expectations. For those not familiar, Great Expectations is an open-source Python library that is used for automated testing. dbt-expectations is modeled after this library and was developed by Calogica so dbt practitioners would have access to an additional set of pre-created Generic tests without adding another tool to the data platform. Tests in dbt-expectations are divided into seven categories encompassing a total of 62 generic dbt tests:

Table shape (15 generic dbt tests)‍
Missing values, unique values, and types (6 generic dbt tests)
‍Sets and ranges (5 generic dbt tests)
‍String matching (10 generic dbt tests)
‍Aggregate functions(17 generic dbt tests)‍
Multi-column (6 generic dbt tests)‍
Distributional functions (3 generic dbt tests)

You can find detailed information on all the dbt-expectations generics tests in their documentation.

dbt_constraints

Created by Snowflake, dbt_constraints adds primary and foreign key constraints to dbt models. When incorporated into a dbt project, this package automatically creates unique keys for all existing unique and dbt_utils.unique_combination_of_columns tests, along with foreign keys for existing relationship tests and not null constraints for not_null tests. It provides three flexible tests - primary_key, unique_key, and foreign_key - which can be used inline, out-of-line, and support multiple columns.

elementary dbt-data-reliabilit generic dbt tests

The elementary tool offers 10 generic dbt tests that help in detecting schema changes, validating JSON schemas, and monitoring anomalies in source freshness, among other functionalities.

dbt-fihr generic dbt tests

dbt-fihr focuses on the healthcare sector, providing 20 generic dbt tests for validating HL7® FHIR® (Fast Healthcare Interoperability Resources) data types, a standard for exchanging healthcare information across different systems.

fhir-dbt-analytics generic dbt tests

Maintained by Google, the fhir-dbt-analytics package includes tests that ensure the quality of clinical data. These tests might involve counting the number of FHIR resources to verify expected counts or checking references between FHIR resources.

By leveraging these diverse dbt testing packages, data teams can significantly enhance their data validation processes, ensuring that their data pipelines are robust, accurate, and reliable.

dbt testing during development

While the tests above run against production data and are run even when none of the dbt code has changed, there are some tests that should be applied during development. This will improve a project's long term maintainability, assure project governance, and validate transformation logic in isolation of production data.

dbt-meta-testing

This dbt-meta-testing package contains macros to assert test and documentation coverage leveraging a configuration defined in the dbt_project.yml configuration settings.

dbt-unit-testing (dbt 1.8 has built in unit testing)

While dbt tests are great to test with "real" data, sometimes you may want to test the logic of a transformation with "fake" data. This type of test is called a unit test. The dbt-unit-testing package has all you need to do proper dbt unit testing. (side note, the dbt Core team has announced the unit testing will be part of a future release of dbt although it may not be exactly as done using this package).

dbt_datamocktool

dbt_datamocktool can be used to create mock CSV seeds to stand in for the sources and refs that your models use and test that the model produces the expected output as compared with another CSV seed.

dbt-unittest (dbt 1.8 has built in unit testing)

The dbt-unittest is a dbt package to enhance dbt package development by providing unit testing macros.

CI/CD Testing: Advanced CI

Incorporating automated data validation into CI/CD pipelines helps catch issues early and ensures data accuracy before deployment. By integrating tests into every code change, teams can prevent bad data from reaching production and maintain reliable data pipelines.

dbt-checkpoint

dbt-checkpoint is a library that can be leveraged during the development and release life-cycle to assure a level of governance of the dbt project. Typical validations include assuring that dbt models and/or their columns have descriptions and that all the columns in a dbt model (sql) are present in a property file (yml).

Data Recce

Recce is an open-source data validation toolkit for comprehensive PR review in dbt projects. Recce helps to validate the data impact of code changes during development and PR review by enabling you to compare data structure, profiling statistics, and queries between two dbt environments, such as dev and prod. By performing Recce checks, you are able to identify unexpected data impact, validate expected impact, and prevent bad merges and incorrect data entering production.

Recce checks, can be performed during development, automatically as part of CI, and as part of PR review for root cause analysis. The suite of tools in Recce enable you to perform:

Structural checks such as schema and row count diffs.
Statistical checks such as data profile, data value, top-k, and histogram diffs.
Low level checks through diffing ad-hoc queries.

Record the results of your data validations in the Checklist and share as part of PR review or discussion with stakeholders.

For full coverage, use Recce’s automated ‘preset checks’ that are triggered with each pull request and automatically post an impact summary to your PR comment.

Recce Cloud users can also take advantage of check-syncing and PR merge-blocking until the reviewer or stakeholders have approved the check results.

Reporting results of dbt tests

By default, dbt will not store the results of a dbt test execution. There is a configuration that can be set for the dbt project or at the specific model level which will have dbt store the failures of the test in a table in the data warehouse. While this is a good start, these test results get overridden each time dbt tests are run. To overcome this deficiency, tools have been developed in the community that store results longitudinally and even provide dashboards of test results.

Elementary

Elementary is an open source data observability tool for dbt. It simplifies the capture of dbt test results over time, enables testing without having to manually add tests to all your dbt model columns, and has a user interface for viewing test results as well as dbt lineage.

Elementary also provides advanced configurations for generating Slack alerts for dbt tests, enhancing how teams monitor and respond to data quality issues. You can configure alerts based on test results, test statuses, and test durations. Additionally, you can set up recurring alerts based on a schedule that you define, ensuring continuous oversight without constant manual checking.

Key features include:

Custom Channel: Direct alerts to specific Slack channels or users, making sure the right team members receive updates in real time.
Suppression Interval: Decide how often to send alerts by setting suppression intervals, which prevent alert overload by spacing notifications according to your specified time frame.
Alert Fields: Customize the content of each alert with specific test details, providing immediate insight into the nature and urgency of the issue.
More Options: Tailor messages that are sent in alerts, integrating them seamlessly into your team’s communication flow.

This comprehensive suite of tools not only sends notifications but also allows for significant customization, ensuring that alerts are meaningful and actionable. The integration of these features into your workflow facilitates better data management and quicker response to potential data discrepancies, streamlining your project's efficiency and reliability.

dbt Data Quality package

This dbt Data Quality package is a Snowflake only package that helps users access and report on the outputs from dbt source freshness and dbt test results.

dq-tools

The dbt-tools package makes it simple to store and visualize dbt test results in a BI dashboard.

re_data

re_data is an open-source data reliability framework for modern data stack.

Migration

When migrating data from one system to another validating that tables match is incredibly important. For this we recommend datacompy to get the job done.

Conclusion

Getting started with dbt testing is simple thanks to the predefined generic dbt tests found within dbt Core and the additional generic tests found in dbt-utils and dbt-expectations. In addition to these juggernauts of the dbt community other organizations in the dbt community have contributed a additional generic tests, tools to improve dbt development, libraries that can help with validation and governance before releasing code to production and tools that can improve data quality observability. If you are using dbt cloud or dbt core you may be interested in reading more about dbt alternatives such as Datacoves which falls under the managed dbt core solutions.

Dbt & Airflow

Ultimate dbt Cheat Sheet

5 mins read

You now know what dbt (data build tool) is all about. You are being productive, but you forgot what `dbt build` does or you forgot what the @ dbt graph operator does. This handy dbt cheat sheet has it all in one place.

dbt cheat sheet - Updated for dbt 1.8

With the advent of dbt 1.6, we updated the awesome dbt cheat sheet created originally by Bruno de Lima

We have also moved the dbt jinja sheet sheet to a dedicated post.

This reference summarizes all the dbt commands you may need as you run your dbt jobs or study for your dbt certification.

If you ever wanted to know what the difference between +model and @model is in your dbt run, you will find the answer. Whether you are trying to understand dbt graph operators or what the dbt retry command does, but this cheat sheet has you covered. Check it out below.

Primary dbt commands

These are the principal commands you will use most frequently with dbt. Not all of these will be available on dbt Cloud

dbt development commands
dbt build	This command will load seeds, perform snapshots, run models, and execute tests
dbt compile	Generates executable SQL code of dbt models, analysis, and tests and outputs to the target folder
dbt docs	Generates and serves documentation for the dbt project (dbt docs generate, dbt docs serve)
dbt retry	Re-executes the last dbt command from the node point of failure. It references run_results.json to determine where to start
dbt run	Executes compiled SQL for the models in a dbt project against the target database
dbt run-operation	Is used to invoke a dbt macro from the command line. Typically used to run some arbitrary SQL against a database.
dbt seed	Loads CSV files located in the seeds folder into the target database
dbt show	Executes sql query against the target database and without materializing, displays the results to the terminal
dbt snapshot	Executes "snapshot" jobs defined in the snapshot folder of the dbt project
dbt source	Provides tools for working with source data to validate that sources are "fresh"
dbt test	Executes singular and generic tests defined on models, sources, snapshots, and seeds

dbt Command arguments

The dbt commands above have options that allow you to select and exclude models as well as deferring to another environment like production instead of building dependent models for a given run. This table shows which options are available for each dbt command

dbt command arguments
dbt build	--select / -s, --exclude, --selector, --resource-type, --defer, --empty, --full-refresh
dbt compile	--select / -s, --exclude, --selector, --inline
dbt docs generate	--select / -s, --no-compile, --empty-catalog
dbt docs serve	--port
dbt ls / dbt list	--select / -s, --exclude, --selector, --output, --output-keys, --resource-type
dbt run	--select / -s, --exclude, --selector, --resource-type, --defer, --empty, --full-refresh
dbt seed	--select / -s, --exclude, --selector
dbt show	--select / -s, --inline, --limit
dbt snapshot	--select / -s, --exclude, --selector
dbt source freshness	--select / -s, --exclude, --selector
dbt source	--select / -s, --exclude, --selector, --output
dbt test	--select / -s, --exclude, --selector, --defer

dbt selectors

By combining the arguments above like "-s" with the options below, you can tell dbt which items you want to select or exclude. This can be a specific dbt model, everything in a specific folder, or now with the latest versions of dbt, the specific version of a model you are interested in.

dbt node selectors
tag	Select models that match a specified tag
source	Select models that select from a specified source
path	Select models/sources defined at or under a specific path.
file / fqn	Used to select a model by its filename, including the file extension (.sql).
package	Select models defined within the root project or an installed dbt package.
config	Select models that match a specified node config.
test_type	Select tests based on their type, singular or generic, data, or unit (unit tests are available only in dbt 1.8)
test_name	Select tests based on the name of the generic test that defines it.
state	Select nodes by comparing them against a previous version of the same project, which is represented by a manifest. The file path of the comparison manifest must be specified via the --state flag or DBT_STATE environment variable.
exposure	Select parent resources of a specified exposure.
metric	Select parent resources of a specified metric.
result	The result method is related to the state method described above and can be used to select resources based on their result status from a prior run.
source_status	Select resource based on source freshness
group	Select models defined within a group
access	Selects models based on their access property.
version	Selects versioned models based on their version identifier and latest version.

dbt graph operators

dbt Graph Operator provide a powerful syntax that allow you to hone in on the specific items you want dbt to process.

dbt graph operators
+	If "plus" (+) operator is placed at the front of the model selector, + will select all parents of the selected model. If placed at the end of the string, + will select all children of the selected model.
n+	With the n-plus (n+) operator you can adjust the behavior of the + operator by quantifying the number of edges to step through.
@	The "at" (@) operator is similar to +, but will also include the parents of the children of the selected model.
*	The "star" (*) operator matches all models within a package or directory.

Project level dbt commands

The following commands are used less frequently and perform actions like initializing a dbt project, installing dependencies, or validating that you can connect to your database.

project level dbt commands
dbt clean	By default, this command deletes contents of the dbt_packages and target folders in the dbt project
dbt clone	In databases that support it, can clone nodes (views/tables) to the current dbt target database, otherwise it creates a view pointing to the other environment
dbt debug	Validates dbt project setup and tests connection to the database defined in profiles.yml
dbt deps	Installs dbt package dependencies for the project as defined in packages.yml
dbt init	Initializes a new dbt project and sets up the users's profiles.yml database connection
dbt ls / dbt list	Lists resources defined in a dbt project such as modem, tests, and sources
dbt parse	Parses and validates dbt files. It will fail if there are jinja and yaml errors in the project. It also outputs detailed timing info that may be useful when optimizing large projects
dbt rpc	DEPRECATED after dbt 1.4. Runs an RPC server that compiles dbt models into SQL that can be submitted to a database by external tools

dbt command line (CLI) flags

The flags below immediately follow the dbt command and go before the subcommand e.g. dbt <FLAG> run

Read the official dbt documentation

dbt command line (CLI) flags (general)
-x, --fail-fast / --no-fail-fast	Stop dbt execution as soon as a failure occurs.
-h, --help	Shows command help documentation
--send-anonymous-usage-stats / --no-send-anonymous-usage-stats	Send anonymous dbt usage statistics to dbt Labs.
-V, -v, --version	Returns information about the installed dbt version
--version-check / --no-version-check	Ensures or ignores that the installed dbt version matches the require-dbt-version specified in the dbt_project.yml file.
--warn-error	If dbt would normally warn, instead raise an exception.
--warn-error-options WARN_ERROR_OPTIONS	Allows for granular control over exactly which types of warnings are treated as errors. This argument receives a YAML string like '{"include": "all"}.
--write-json / --no-write-json	Whether or not to write the manifest.json and run_results.json files to the target directory

‍

dbt CLI flags (logging and debugging)
-d, --debug / --no-debug	Display debug logging during dbt execution useful for debugging and making bug reports. Not to be confused with the dbt debug command which tests database connection.
--log-cache-events / --no-log-cache-events	Enable verbose logging for relational cache events to help when debugging.
--log-format [text\|debug\|json\|default]	Specify the format of logging to the console and the log file.
--log-format-file [text\|debug\|json\|default]	Specify the format of logging to the log file by overriding the default format
--log-level [debug\|info\|warn\|error\|none]	Specify the severity of events that are logged to the console and the log file.
--log-level-file [debug\|info\|warn\|error\|none]	Specify the severity of events that are logged to the log file by overriding the default log level
--log-path PATH	Configure the 'log-path'. Overrides 'DBT_LOG_PATH' if it is set.
--print / --no-print	Outputs or hides all {{ print() }} statements within a macro call.
--printer-width INTEGER	Sets the number of characters for terminal output
-q, --quiet / --no-quiet	Suppress all non-error logging to stdout Does not affect {{ print() }} macro calls.
--use-colors / --no-use-colors	Specify whether log output is colorized in the terminal
--use-colors-file / --no-use-colors-file	Specify whether log file output is colorized

‍

dbt CLI flags (parsing and performance)
--cache-selected-only / --no-cache-selected-only	Have dbt cache or not cache metadata about all the objects in all the schemas where it might materialize resources
--partial-parse / --no-partial-parse	Uses or ignores the pickle file in the target folder used to speed up dbt invocations by only reading and parsing modified objects.
--populate-cache / --no-populate-cache	At start of run, use `show` or `information_schema` queries to populate a relational cache to speed up subsequent materializations
-r, --record-timing-info PATH	Saves performance profiling information to a file that can be visualized with snakeviz to understand the performance of a dbt invocation
--static-parser / --no-static-parser	Use or disable the static parser. (e.g. no partial parsing if enabled)
--use-experimental-parser / --no-use-experimental-parser	Enable experimental parsing features.

As a managed dbt Core solution, the Datacoves platform simplifies the dbt Core experience and retains its inherent flexibility. It effectively bridges the gap, capturing many benefits of dbt Cloud while mitigating the challenges tied to a pure dbt Core setup. See if Datacoves dbt pricing is right for your organization or visit our product page.

Please contact us with any errors or suggestions.

Tooling

The Transformative Power of Data Modeling in Data Warehouse

5 mins read

I read an article on Anchor Data Modeling, more specifically, Atomic modeling where the author proposes a different way of Data Modeling. The rationale for this change is that there is a lack of skills to model data well. We are giving powerful tools to novices, and that is bound to lead to problems.

From the article:

"we are in a distressful situation both concerning the art as a whole but also its place in modern architectures"

Is this the case? Do we have a big problem on the horizon that requires us to make this big shift?

I'd say I am open-minded and expose myself to different ways of thinking so I can broaden my views. A few years ago, I learned a bit about COBOL, not because I had any real use for it but because I was curious. I found it very interesting and even saw its similarities with SQL. I approached the topic with no preconceived ideas; this is the first time I heard of Atomic Modeling.

Is there something wrong with atomic and anchor data modeling in general?

The issues I see with ideas like Atomic data modeling are not in their goal. I am 100% aligned with the goal; the problem is the technology, process, and people needed to get there.

What we see in the market is a direct result of a backlash against doing things "perfectly." But why is this the case? I believe it is because we haven't communicated how we will achieve this vision of ideas like atomic data. The author even says a key phrase in the first paragraph:

"practitioners shying away from its complexity"

If doing anchor data modeling is "complex" how are we going to up-skill people? Is this feasible? I am happy if I can get more people to use SQL vs a GUI tool like Alterix 😁

I am by no means an expert. Yet, I am fairly technical, and if I am not convinced, how will we convince decision-makers?

As I read this article, here's what I am thinking:

1. First, I will need to deconstruct the data I get from some source like material master data form SAP. That will be a bunch of tables, and who is going to do all this data modeling? It sounds expensive and time-consuming.

2. I am going to need some tooling for this, and I am either going to build it or use something a few others are using. Will my company want to take a chance on something this early? This sounds risky.

3. After I deconstruct all this data, I need to catalog all these atoms. I now have a lot of metadata, and that's good, but is the metadata business-friendly? We can't get people to add table descriptions how is this going to happen with this explosion of objects? Who will maintain it? How will we expose this? Is there a catalog for it already? Does that catalog have the other features people need? It sounds like I need to figure out a bunch of things, the biggest one being the change management aspect.

4. What sort of database will I use to store this? This is a great use case for a graph database. But graph databases are not widely adopted, and I have seen graph databases choke at scale. We can use a relational database, but these joins are going to be complex. Someone may have figured all this out, but there's more tech and learning needed. It sounds like this will also add time and expense.

5. When I have managed to do all the above, I will need to construct what people can actually use. We need objects that work with tools that are available. I need to make relational tables I can query with SQL and viz tools, which are more user-friendly. This sounds like more work, more time, and more money.

I may have missed some steps and oversimplified what's needed for this type of change. I am also aware that I may not know what exists to solve all the above. However, if I don't know it, then there are a ton of other people who also don't know it and this is where we need to start. We need to understand how we will tactically achieve this "better" world.

How we tactically achieve a better world

What are we fixing?

I've had conversations on metadata-driven automation, and like atomic modeling, I am not clear on who we are helping and how. What are we improving and in what timeframe? In the end, it feels like we have optimized for something only a few companies can do. To do anchor modeling well would be a huge expense, and when things go wrong, there are several points of failure. When we look at business problems, we need to be sure to optimize the end-to-end system. We can't locally optimize one area because we are likely moving the problem somewhere else. This can be in terms of time, money, or usability.

Decision-makers are not interested in data modeling. They are expecting results and a faster time to market. It's hard enough getting people to do things "better." This is why I find it hard to imagine that we can get to this level of maturity any time soon.

What can we do instead?

There are incremental steps we can take to incorporate best practices into the modern data stack. We need to help people mature their data practice faster, and we should not let perfection get in the way of good. Most companies are not large enterprises with millions of dollars to spend on initiatives like atomic modeling. That being said, I have yet to see anchor modeling in practice, so I welcome the opportunity to learn. I remember years ago the debates about how Ruby on Rials was teaching people "bad practices." The other side of that argument is that Rails helped companies like Twitter and Github launch businesses faster. Rails was also better than the alternative at the time, which included messy PHP code. Others advocated for well-crafted "scalable" and expensive Java applications. Rails may not be the powerhouse it once was, but it has had a huge influence on how we build software. I even see its influence in dbt even if it might not have been intentional or direct.

Conclusion

Tools like Snowflake and dbt allow us to build processes that are much better than what most people have. Should we focus on all the "bad" things that may come with the modern data stack? Should we focus on how practitioners are not well educated, and so we need to throw all they are doing out?

I don't think so; I believe that we can help companies mature their data practices faster. Will we have the best data models? Maybe not. Will users do things perfectly? Nope. But can we help them move faster and guide them along their journey to avoid big pitfalls? I think we can. Getting people to use git, automating testing, and creating DataOps processes is a huge step forward for many organizations. Let's start there.

There's a reason Data Mesh and the Modern Data Stack resonate with so many people. There's a desire to do things faster with more autonomy at many companies, not just the ones with multi-million-dollar budgets. Let's focus on what is achievable, do the best we can, and help people mature along the way. We don't need more complexity; we need less.

Dbt & Airflow

dbt and Snowflake with dbt Power User and SQLFluff

5 mins read

Using dbt with Snowflake is one of the most popular and powerful data stacks today. Are you facing one of these situations:

- You just joined a data team using dbt Core with Snowflake, and you want to set up your local dbt development environment

- Your company is already using Snowflake and wants to try out dbt.

- You are using dbt with another data warehouse and you want to try dbt with Snowflake before migrating.

If you are facing any of these, this guide will help you set up your dbt environment for Snowflake using VS Code as your dbt development environment. We will also cover useful python libraries and VS Code extensions that will make you more productive. Want to become a dbt on Snowflake ninja, keep reading.

Pre-requisites for dbt development on Snowflake

To get started, you need to set up your dbt development environment. This includes Python, dbt, and VS Code. You will also need to set up your Snowflake account properly so dbt can work its magic.

While dbt supports versions of Python greater than 3.7, some other tools like Snowpark require Python 3.8, therefore we recommend you stick with that version. You can find the installer for your particular Operating System on this page. Finding the right Windows Installer can be confusing, so look for a link titled Windows installer (64-bit). If you are on a Mac, you can use the Universal Installer macOS 64-bit universal2 installer.

When using dbt, you will also need Git. To setup git, follow this handy guide.

The preferred IDE for dbt is VS Code, you can find it on the official Microsoft site. click the big Download button and install like any other app.

Installing dbt is done using pip. You can find more information on this page.

For dbt Snowflake, simply run:

This will install the latest version of dbt core along with the dbt adapter for Snowflake. If you need to use an older version of dbt, you will need to specify it when you run pip. However, the version of the dbt adapter may not match the version of dbt core.

For example, as of this writing, the last version of dbt core 1.3.x is 1.3.4. However, dbt-snowflake for 1.3.x is 1.3.2. So, if you want to install the latest dbt 1.3.x for snowflake, you would run

This will install dbt-snowflake 1.3.2 along with dbt-core 1.3.4

Configure Snowflake permissions

The final pre-requisite you will need to do is set up a Snowflake user that has been granted a role with the right access to a database where dbt will create views and tables. The role should also have access to a Snowflake Warehouse for compute. Here is a handy guide that gives you the basics. We would recommend a more comprehensive setup for a production deployment, but this will get you going for now.

The key items that are not covered in that guide is that you should create a role ANALYST and a database ANALYTICS_DEV and grant OWNERSHIP of that database to the ANALYST role. The ANALYST role should also be granted USAGE on the TRANSFORMING warehouse. You also need to grant the ANALYST role to your specific user. Don’t run as ACCOUNTADMIN.

This is all needed because when dbt runs it will create a schema for your user in the ANALYTICS_DEV database and you will use the TRANSFORMING warehouse when compute is needed like when dbt creates tables.

If all of this seems confusing or tedious, you should consider a managed dbt solution like dbt Cloud or Datacoves. For more information, checkout our in-depth article where we compare dbt cloud vs dbt core as well as managed dbt core.

Now with all the pre-requisites out of the way, let’s configure dbt to connect with Snowflake.

Configure your dbt Snowflake profile

To initialize your dbt project in Snowflake, dbt has a handy command dbt init. You can configure your dbt Snowflake profile using the dbt init command both for a new and a existing dbt project. First you will need to clone your repo using git. Then, simply run the dbt-init command and go through the prompts.

Once you get your project set up, consider adding a profile_template.yml to your project. As stated on that page, using a profiles template will simplify the dbt init process for users on your team.

To make sure dbt can connect to Snowflake, run dbt debug. If dbt can connect to your Snowflake account, you should see “All checks passed!” If you have problems, then join the dbt Community search the forums or ask a question in the #db-snowflake channel.

dbt debug used to check dbt to Snowflake connection

Even though dbt performed the setup of your profile.yml to connect to Snowflake with your credentials, it only provides the basic setup. This page provides additional parameters. that you can configure for the Snowflake connection in your profiles.yml file.

If you want to configure those parameters, you will need to open and edit the profiles.yml file. The profiles.yml file created by dbt init will be in your user’s home directory in a subdirectory called .dbt.

One handy configuration parameter to change is reuse_connections. Also, if you use SSO authentication with external browser, you should consider setting up connection caching on Snowflake, otherwise you will be prompted to authenticate for every connection dbt opens to the database.

Now that you have set up your dbt connection to Snowflake, there are some other options you can configure when dbt runs against your Snowflake account. This includes overriding the default warehouse for a specific model, adding query tags, copying grants, etc. This handy page has a lot more information on these dbt snowflake advanced configurations.

Improve your dbt Snowflake experience with SQLFluff and other Python libraries

Now that you have dbt connecting to your database, let’s talk about some python libraries you should set up to improve how you work with dbt.

dbt-coves

dbt-coves is an open source library created by Datacoves to complement dbt development by simplifying tedious tasks like generating staging models. It is a must-have tool for any dbt practitioner who wants to improve their efficiency. dbt-coves will automatically create your source.yml and staging models as well as their corresponding yml(property) files. It also has utilities for backing up Airbyte and Fivetran configurations.

SQLFluff

SQLFluff is a Python library for linting SQL code. SQLFluff seamlessly integrates dbt using a templater and it is the only linter compatible with dbt. If you have not heard of code linting it helps you enforce rules on how your SQL code is formatted for example, should everyone use leading or trailing commas, should SQL keywords be upper or lower case. We recommend everyone use a linter as this will improve code readability and long term maintainability.

pre-commit with dbt-checkpoint

dbt-checkpoint is a tool that allows you to make sure your dbt project complies with certain governance rules. For example, you can have a rule that validates whether every dbt model has a description. You can also ensure that every column is documented among many other rules. We also recommend the use of dbt-checkpoint as it will assure developers don’t add technical debt from the start of a project.

In addition to these Python libraries, at Datacoves we set up the development environment with other handy libraries like Snowpark and Streamlit. We believe that flexibility is important especially in enterprise environments. If you want to learn what to consider when selecting a managed dbt core platform, check out our guide.

Improve your dbt Snowflake experience with dbt power user and other VS Code extensions

In addition to Python libraries, you can improve your dbt workflow with Snowflake by installing these VS Code extensions.

Snowflake VS Code Extension

The official Snowflake dbt extension keeps you in the flow by bringing the Snowflake interface to VS Code. With it you can explore your database, run queries, and even upload and download files to Snowflake stages. It is a must-have for any Snowflake user.

dbt power user

dbt power user is a VS Code extension that improves the dbt development experience by adding handy shortcuts to run dbt models, tests, and even let’s you preview the result of a dbt model or a CTE within that model.

SQLFluff VS Code extension

The SQLFluff VS Code extension is the companion to the SQLFluff python library. It improves the development experience by highlighting linting errors right in line with your SQL code. It even has a handy hover which describes the linting error and links to the SQLFluff documentation.

There are many other great VS Code extensions and at Datacoves we are always improving the dbt developer’s experience by pre-installing them. One recent addition demonstrated on the video below is a ChatGPT extension that allows you to improve the dbt experience by writing documentation in addition to other functions.

Conclusion

Getting started with dbt and Snowflake is not hard and knowing how to streamline the development experience when working with dbt and Snowflake will maximize developer happiness.

Some users may run into issues configuring their development environment. If that happens, check out the #sqlfluff, #tools-dbt-libraries, and #tools-vscode channels on the dbt Slack community. There are many helpful people there always ready to help.

As you can see there are a lot of steps and potential gotchas to get a dbt environment properly configured. This gets more complicated as the number of dbt users increases. Upgrading everyone can also pose a challenge. These reasons and more are why we created the most flexible managed dbt-core environment. If you want your developers to be up and running in minutes with no installation required, reach out and we can show you how we can streamline your teams’ dbt experience with best practices from the start.

Dbt cloud vs dbt core vs managed dbt core

Dbt & Airflow

dbt Cloud vs dbt Core for Enterprise Data Management

5 mins read

Since its inception, dbt has served the needs of small to mid-sized companies to transform data into actionable insights. Now as larger enterprises look to adopt dbt, they must consider the unique complexities of large enterprises. These include integrating with existing systems behind a corporate firewall, orchestrating end-to-end data pipelines, and implementing DataOps processes to transform how teams collaborate across the organization.

This article compares three popular dbt development and deployment options: leveraging dbt Cloud, using dbt Core alone, or using a dbt Core managed platform other than dbt Cloud. To learn more about specific dbt Cloud features and how to implement them using dbt Core using other open-source tools checkout dbt Core vs dbt Cloud - Key Differences.

dbt Cloud for your enterprise data management platform

dbt Cloud is a hosted environment to develop and deploy dbt Core projects. dbt Cloud leverages dbt Core, the open-source data transformation framework we all know as dbt. dbt Cloud pricing starts at $100 per developer for teams or if you are a single developer, they offer a free single seat with limited features.

dbt Cloud IDE for dbt-core development

When companies are ramping up with dbt, one of the hardest parts is setting up and managing dbt IDE environments. Analytic Engineers coming to dbt may not be familiar with concepts like version control with git or using the command line. The dbt Cloud IDE simplifies developer onboarding by providing a web-based SQL IDE to team members so they can easily write, test, and refine data transformation code without having to install anything on their computers. Complexities like starting a git branch are tucked behind a big colorful button so users know that is the first step in their new feature journey.

Developers who are accustomed to more versatile local IDEs, such as VS Code, may find the dbt Cloud experience limiting as they cannot leverage extensions such as those from the VS Code Marketplace nor can they extend dbt Core using the vast array of Python libraries. An example of a common library employed in the dbt community is SQLFluff. With SQLFluff, teams can ensure a minimum level of code quality is achieved across developers.

dbt Cloud for DataOps

DataOps is a set of practices that combines data management and software engineering to improve the quality, speed, and collaboration of data analytics teams. In an enterprise environment, implementing DataOps processes early on in a project's lifecycle is crucial.

dbt Cloud can be paired with Continuous Integration (CI) tools like GitHub Actions to validate data transformations before they are added to a production environment.

If CI integration checks include things like SQLFluff and dbt-checkpoint, there is currently no way for users to run these validations in dbt Cloud prior to executing the automated CI checks in GitHub Actions. This may lead to a lot of back and forth between the GitHub Actions run failures and the dbt Cloud IDE to fix and commit the changes. Aspects such as code reviews and approvals will occur in the CI/CD tool of choice such as GitHub and dbt Cloud can report job execution status back to GitHub Actions. This allows teams to know when it is safe to merge changes to their code.

Companies that have tools like Bitbucket, Jira, and Jenkins within their corporate firewall may find it challenging to integrate with dbt Cloud.

dbt Cloud for deployment

dbt Cloud makes deploying a dbt Core project simple. It allows you to define custom environment variables and the specific dbt commands (seed, run, test) that you want to run during production runs. The dbt Cloud scheduler can be configured to trigger at specific intervals using an intuitive UI.

dbt Cloud is primarily focused on running dbt projects. Therefore if a data pipeline has more dependencies, an external orchestration tool may be required. Cases where this is necessary include when you need to connect the Extract and Load step to the Transformation happening in dbt. Without connecting these steps, you have no assurance that the load is complete before the transformation starts. This can be mitigated with freshness checks, but without an explicit connection between the load step and the transformation, you may be delaying transformation unnecessarily.

Fortunately, if you do use an external orchestrator, dbt Cloud offers an API to trigger dbt Cloud jobs from your orchestrator. Another reason an external orchestrator may be needed in an enterprise is when there are additional steps to run after the dbt transformation. These may include sending data to downstream BI tools, marketing campaign tools, MDM systems, or refreshing machine learning models. dbt Cloud recently announced a webhooks API that would allow for triggering steps after transformation. However, if these endpoints reside within a corporate firewall, this may not be an option and an orchestrator like Airflow may be necessary.

dbt Cloud extensibility

The dbt Cloud IDE does not allow for the installation of Python libraries and dbt Cloud cannot create and orchestrate Python scripts that leverage tools like SQLFluff (this is coming to dbt Cloud later this year), dbt-coves, dbt-fal, Snowpark, Permifrost, or Streamlit. Additionally, the dbt Cloud IDE cannot be extended with extensions such as the official Snowflake or Databricks VS Code extensions.

While dbt Cloud can send notifications via email and Slack, it does not support other popular enterprise systems such as MS Teams.

dbt Cloud security

dbt Cloud offers industry-standard SaaS security appropriate for any enterprise. At the enterprise level, it offers Single Sign-On and Role Based Access Controls. Enterprises can also choose to have dbt Cloud provisioned on a dbt Labs managed private cloud account. dbt Cloud is only offered as a managed SaaS solution on AWS and thus cannot be deployed to a company's existing private cloud account.

Who should consider using dbt Cloud

dbt Cloud is especially attractive for organizations with limited in-house technical resources or those seeking to minimize the time and effort spent on initial setup, maintenance, and upgrades. dbt Cloud runs in the browser and thus removes the need for manual installation and setup of dbt components, allowing your team to focus on data transformation tasks. dbt Cloud handles updates and upgrades for you, so it is simple to stay up-to-date and secure. dbt Cloud includes built-in scheduling capabilities, allowing you to automate dbt runs and testing without relying on external schedulers or orchestration tools.

dbt Cloud may introduce additional complexities in large enterprises with more complex orchestration needs or those that need to host the environment within their private cloud. Finally, the dbt Cloud IDE may feel constraining for users who prefer a VS Code environment.

All these features can in turn reduce the time to value in setting up an end-to-end enterprise data management platform. It is possible to start out with a simple setup, but inevitably not accounting for the breadth of the entire platform may create technical debt and cause rework in the future.

Pros and cons of using dbt Cloud

Pros:

- Simplified setup and management of dbt environment

- Enhanced collaboration capabilities for data teams

- Integrated dbt job scheduling and monitoring

- Managed upgrades and updates

- Web-based IDE for streamlined development

- Metadata API

Cons:

- Reliance on a third-party platform

- May be more expensive for larger teams or organizations with complex requirements

- Not able to deploy within the corporate's private cloud

- Unable to leverage VS Code extensions and other Python libraries

- May require additional components or additional SaaS solutions for complex orchestrations

dbt Core for your enterprise data management platform

dbt Core is the heart of dbt and since it is open source, anyone can use it freely. That being said, an enterprise data management platform must account for additional requirements such as the development experience, deployment and scheduling dbt runs, and creating a DataOps process. This will impact the total cost of ownership and the platform's time to market.

IDE for dbt Core development

Leveraging dbt Core allows for ultimate flexibility, but it involves setting up a development environment on each team member's local machine or on a virtual environment such as an AWS workspace. Installation includes installing a popular dbt IDE like VS Code, installing dbt Core, configuring a connection to the data warehouse, and managing additional dependencies such asa compatible version of Python.

While the overall installation process for dbt is similar for both Windows and Mac users, there are some differences and potential gotchas to be aware of. Additionally, most organizations provide users with pre-configured computers and these systems may be delivered with software that conflicts with dbt Core. This complexity can lead to frustration and delays and may require assistance from senior members of the team, pulling them away from value-added work.

Once everything is installed, configuring dbt to connect to a data warehouse can also introduce complexity before a new team member can begin familiarizing themselves with the organization's dbt project structure and coding practices.

An enterprise setup of dbt should also include components that improve developer code quality and productivity. Popular IDEs like VS Code include a vast array of extensions that also improve the development experience. These include dbt Power User, VS Code extension for SQLFluff, and the dbt Snowflake extension.

Configuring a development environment can take anywhere from a few days to as long as a week. Depending on the person’s technical aptitude and experience with dbt. To improve this process, companies spend weeks or months dialing in the installation process and documenting it in knowledge base tools like Confluence or Notion.

A development environment is not evergreen, it must be upgraded as new versions of dbt Core are released for added functionality or to fix bugs. Upgrading can be as error-prone as new installations and some companies opt not to keep their enterprise data platform current. This, however, is ill-advised because this type of technical debt only gets harder to remediate over time.

dbt Core for DataOps

When using dbt Core for your enterprise data platform, you will need to not only define and configure the automation scripts, but you will also need to ensure that all the components, such as a git server, CI server, CI runners, etc. are all working harmoniously together.

Since dbt Core runs within the corporate firewall, it can be integrated with any CI tool and internal components such as Jira, Bitbucket, and Jenkins. This flexibility comes at a price. To do this well, all the project dependencies must be packed into reusable Docker containers. Notifications will also need to be defined across the various components and all of this will take time and money.

dbt Core for deployment

When setting up a deployment environment, companies using dbt Core can leverage any orchestration tool, such as Airflow. They can connect steps prior to or after the dbt transformations and they can trigger any tool that exists within or outside the corporate network.

However, scaling tools like Airflow is not straightforward and may require knowledge of additional technologies like Kubernetes. By default, Airflow uses Python files and therefore, the team working with these orchestration scripts will need to also know Python on top of SQL.

A mature enterprise platform will also require multiple Airflow environments for development, testing, and production, adding complexity and increasing the total cost of ownership.

dbt Core extensibility

Using dbt Core is inevitably the most flexible option, which is why many organizations choose to go this route. Companies can run dbt Core and leverage leading code editors like VS Code. They can install any additional Python dependencies, including running any proprietary code that is only available within the corporate firewall. They can also take advantage of the ever-growing VS Code extensions such as those for Snowflake or Databricks.

Teams using dbt Core can send notifications to Slack, MSTeams, or any other platform as long as they develop and maintain these solutions.

However, this extensibility will only increase the platform's total cost of ownership and the time it takes to deploy the end-to-end platform.

dbt Core security

One reason some enterprises choose to use dbt Core is that they are not able to leverage SaaS solutions due to compliance or data privacy requirements such as GDPR. By leveraging dbt Core, they can achieve ultimate security as dbt Core would run in and comply with corporate network policies.

Running dbt Core within the corporate firewall also eliminates the need for lengthy risk assessments. However, ensuring that security patches are applied is left to the organization. Managing Single Sign-on with tools like Airflow and setting up Role Based Access Controls are also things that will have to be solved when using dbt Core as a dbt enterprise data management platform.

Who should consider using dbt Core

A lot of the power and flexibility with using dbt Core comes from being able to customize the entire dbt experience. However, all of this comes with the expense of having to build a data management platform. It is not uncommon for organizations to take six months or up to a year to set up and test their custom dbt enterprise data management platform. They will also likely need to hire external consultants who have worked with a myriad of technologies and understand how to integrate them.

Building a platform is not for the faint of heart and since it is something custom to the organization, it will also increase the total cost of ownership. Some organizations get the basics in place just to make their teams productive, but eventually, they find that without proper design and implementation, they cannot take full advantage of the value of dbt and the rest of the modern data stack.

Using dbt Core as a dbt enterprise data platform can be a cost-effective option for smaller teams with limited requirements and budgets. It can also be good for those trying out dbt. Large organizations with stringent security requirements of sensitive data and strong technical platform teams may also leverage this option as it allows them to deploy in a way that is compliant with corporate policies.

As platform complexity grows, so does the cost and time needed to implement the end-to-end platform. Larger organizations should consider whether developing something custom is better than using a managed solution or taking shortcuts that can reduce the value of dbt and the modern data stack.

Pros and cons of using dbt Core

Pros:

- Full control over the dbt environment and configurations

- No reliance on third-party services

- No subscription costs

- Able to meet corporate security requirements by installing within the corporate firewall

Cons:

- Requires technical expertise in defining the installation steps and documenting them

- May require hands-on involvement during setup and upgrades

- It can be time-consuming and challenging to maintain consistency across team members

- Upgrades and dependency management may be more complex compared to managed solutions

- Resolving installation issues can consume senior team members' time

- Total cost of ownership of the platform may be higher than using a managed solution

- Additional expertise will be needed to set up all the required technologies

- It will take longer to implement than a managed solution

Managed dbt Core for your enterprise data management platform

We have seen that dbt Cloud is a great dbt Core managed platform, but it has some limitations for large or complex deployments. On the other hand, using dbt Core alone gives organizations ultimate flexibility, but it comes at an added expense in terms of the time and cost that it takes to stand up and maintain a custom, one-of, enterprise data management platform.

What if you could combine the simplicity of dbt Cloud with the flexibility of dbt Core? A managed dbt Core platform, like Datacoves, can combine some of the best of both worlds. However, not all managed platforms are the same, below we discuss what to look for when considering a managed dbt Core enterprise data platform. Checkout our dbt pricing page for more pricing information.

Managed VS Code as a dbt IDE for development

By far, the tool most of the dbt enterprise users will interact with is the IDE. When considering a managed dbt Core platform, consider the developer experience. Will they have VS Code in the browser? Will you get a curated, best in class dbt IDE like VS Code with integrated libraries like SQLFluff and extensions that simplify the developer workflow? When required, will you be able to customize the developer environment with company specific Python libraries and extensions?

In a best-in-class developer setup, new users are onboarded in minutes with simple configuration screens that remove the need to edit text files such as profiles.yml and remove the complexity of creating and managing SSH keys. Version upgrades of dbt or any dependent library should be transparent to users.

Some platforms will also have added flexibility and let you leverage VS Code for non-dbt development tasks, such as opening a Jupyter notebook environment or a Python development environment with libraries such as Python Black.

Spinning up a pristine environment should be a matter of clicks. To test new dbt versions, you should be able to run your project in an isolated environment before switching all developers to the new version of dbt.

Datacoves ide, hosted vs code — Datacoves IDE, hosted VS Code

Managed dbt Core for DataOps

The advantage of a managed dbt Core platform when it comes to DataOps is that you do not have to worry about creating Docker images with all the dependencies you need as you perform your Continuous integration tests. Working with a partner that has experience with enterprise CI/CD platforms like Jenkins will also accelerate your time to value. A knowledgeable dbt Core managed platform partner will know best practices and guide you in setting up a world-class workflow that will scale with your enterprise. Developers should also be able to execute CI validations locally before pushing their changes to GitHub, thus reducing the back and forth to fix issues during the CI checks.

Some managed dbt Core platforms can run within your corporate firewall, allowing you to leverage existing tools like Jira, Bitbucket, Jenkins, or Gitlab. All this simplifies and accelerates the rollout of a dbt data management platform.

Managed dbt Core for deployment

Enterprises are complex, and as such, they must consider a flexible orchestration tool that has wide adoption, documentation and can scale to the needs of the company.

A mature enterprise dbt Core platform will include a full-fledged orchestrator such as Airflow. With Airflow, organizations can orchestrate pipelines that include Extract, Load, Transform, and Activation steps. Other aspects to consider is whether Airflow jobs can scale by leveraging technologies such as Kubernetes and whether it will be simple to send notifications to enterprise tools such as MS Teams.

Some dbt Core platforms that leverage dbt have also adopted mechanisms that simplify Airflow DAG creation by letting users create them using yml files vs creating traditional Python-based DAGs.

A complete enterprise data management platform will allow you to easily create multiple Airflow environments for development, testing, and production. It will also eliminate the burden of having to create Docker images with your dbt dependencies.

Managed dbt Core extensibility

An enterprise dbt Core platform will streamline the process of ramping up with dbt and it will offer the organization the flexibility it requires. This may include triggering AWS Glue jobs as well as Fivetran jobs. It may include triggering internal and external APIs or it may require a pre-configured development environment with Python libraries like Snowpark or Streamsets.

Managed dbt Core security

When it comes to security, different types of industries have different requirements. Regulated industries in Life Sciences or Finance may not want to or and unable to use SaaS solutions, so being able to deploy a dbt enterprise data management platform within the corporate firewall is a must.

Integrating with SSO solutions like Okta or PingID is also a must, and managing user's authorization using existing mechanisms like Active Directory simplifies compliance with corporate policies.

Another area that may be overlooked when using open-source solutions is validating that libraries are up to date and there are no vulnerabilities introduced into the network. A robust dbt Core managed platform will be able to leverage internal registries like Artifactory, which are set up to scan for these issues.

Finally, at enterprise scale, the dbt Core managed platform must have robust Role Based Access Controls to assure users have the appropriate level of access.

Who should consider using a managed dbt Core platform

Just like dbt Cloud, a reliable enterprise dbt Core data management platform will reduce the time to value. Teams will not have to figure out how to host and configure the multitude of tools and libraries and they will have a starting point for implementing best practices at the start of the project.

By eliminating the need to stand up a platform, teams can focus on change management and new ways of working that will have the biggest impact on the organization. The total cost of ownership will be lower for the company since they do not have to build and constantly maintain their one-of platform. In contrast, they will benefit by working with a partner that works with other large organizations and looks for ways to improve and enhance the platform for everyone.

A managed dbt Core enterprise data platform is necessary for organizations that are not in the business of creating and managing complex technologies and those who need the flexibility that comes with complex enterprise tools and processes. When dbt Core must be deployed within the corporate firewall, an enterprise managed dbt data platform can also be a good option for companies with stringent security requirements to protect sensitive data.

Smaller organizations that want to leverage a managed VS Code environment or who want to implement DataOps from the outset can also realize the benefits of a SaaS managed dbt Core data platform.

A managed dbt Core enterprise data platform gives companies the simplicity of dbt Cloud with the peace of mind of dbt Core. Best of all, if the platform leverages open-source technologies, there is no vendor lock-in.

Pros and Cons of using a managed dbt Core platform

Pros:

- Simplified developer onboarding

- Robust IDE such as VS Code

- Ability to configure a dbt environment

- Managed orchestrator such as Airflow

- Able to meet corporate security requirements by installing within the corporate firewall

- Requires a lot less technical expertise to set up and run

- Simplified dependency management across the development environment, CI/CD worker, and Airflow

- Lower total cost of ownership

Cons:

- Unable to have full control over the platform

- Reliance on third-party

- Licensing costs

Conclusion

Selecting between dbt Cloud, dbt Core, or a managed dbt Core enterprise platform comes down to the internal experience within an organization and their desire to focus on data vs developing and maintaining a custom dbt platform. The beauty of the dbt ecosystem is that the components are freely available and with enough time and money, anyone can do what they need. However, enterprises that want to move fast and leverage the full potential of dbt, should consider a managed dbt Core enterprise platform such as Datacoves to reduce the total cost of ownership and accelerate time to market. It is important to be aware of the dbt alternatives to make an informed decision.

News

Datacoves joins the TinySeed Family

5 mins read

Genesis of Datacoves and beyond

A few years ago, we identified a need to simplify how companies implement the Modern Data Stack, both from an infrastructure and process perspective. This vision led to the creation of Datacoves. Our efforts were recognized by Snowflake as we became a top-10 finalist in the 2022 Snowflake Startup Challenge. We have become trusted partners at Fortune 100 companies, and now we are ready for our next chapter.

Datacoves is thrilled to announce that we have been selected to be part of the Spring 2023 cohort at TinySeed. Their philosophy and culture align flawlessly with our values and aspirations. Our goal has always been to help our customers revolutionize the way their teams work with data, and this investment will enable us to continue delivering on that vision.

The TinySeed investment will also strengthen our commitment to supporting open-source projects within the dbt community, such as SQLFluff, dbt-checkpoint, dbt-expectations, and dbt-coves. Our contributions will not only benefit our customers but also the broader dbt community, as we believe open-source innovation enhances the tools and resources available to everyone.

We want to express our deep gratitude to Tristan Handy, CEO & Founder at dbt Labs and the entire dbt Labs team for partnering with us on the 2023 State of Analytics Engineering Community survey and for creating the amazing dbt framework. Without dbt, Datacoves would not be where it is today, and our customers would not enjoy the benefits that come with our solutions.

As we continue to grow and evolve, we look forward to finding new ways to collaborate with dbt Labs and the dbt community to further enhance how people use dbt and provide the best possible experience for everyone. Together, we can empower data teams around the world and revolutionize the way they work.

‍

TinySeed group picture — TinySeed 2023 Spring Cohort after some Rage Room fun

‍

Get our free ebook dbt Cloud vs dbt Core

Get the PDF